How do I get ngrams for all combinations of words in a sentence?

Let's say I have the sentence "I need multiple ngrams". If I create bigrams using a TF-IDF vectorizer, it builds them only from consecutive words, i.e. I get "I need", "need multiple", and "multiple ngrams".

How can I also get the non-consecutive pairs "I multiple", "I ngrams", and "need ngrams"?

Topic ngrams tfidf nlp machine-learning

Category Data Science


You can also do this with a nested list comprehension:

s = "I need multiple ngrams"
tokens = s.split(' ')

# Pair each token with every token that comes after it in the sentence.
res = [(tokens[i], tokens[j]) for i in range(len(tokens) - 1) for j in range(i + 1, len(tokens))]
print(res)

Output:

[('I', 'need'), ('I', 'multiple'), ('I', 'ngrams'), ('need', 'multiple'), ('need', 'ngrams'), ('multiple', 'ngrams')]

You can use itertools.combinations().

For example:

import itertools

s = "I need multiple ngrams"
s_list = s.split(" ")  # Assumes you tokenize on whitespace.

# The second argument (2 here) is the number of words in each combination.
combinations = list(itertools.combinations(s_list, 2))
print(combinations)

You will get the following output:

[('I', 'need'), ('I', 'multiple'), ('I', 'ngrams'), ('need', 'multiple'), ('need', 'ngrams'), ('multiple', 'ngrams')]
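
If you need the combinations as strings (for example, to use them as features), the same idea can be wrapped in a small helper. This is only a sketch: the function name word_combinations and its parameter n are made up for illustration, and it assumes whitespace tokenization as in the examples above.

```python
from itertools import combinations


def word_combinations(text, n=2):
    """Return every n-word combination from text, keeping the original word order.

    Hypothetical helper for illustration; assumes whitespace tokenization.
    """
    tokens = text.split()
    # combinations() yields tuples in sentence order; join each into one string.
    return [" ".join(combo) for combo in combinations(tokens, n)]


pairs = word_combinations("I need multiple ngrams")
print(pairs)
# ['I need', 'I multiple', 'I ngrams', 'need multiple', 'need ngrams', 'multiple ngrams']
```

scikit-learn's TfidfVectorizer accepts a callable for its analyzer parameter, so passing a function like this one (TfidfVectorizer(analyzer=word_combinations)) should give TF-IDF weights over all word pairs rather than only consecutive ones. Note that a callable analyzer receives the raw text, so lowercasing and other preprocessing would have to happen inside the function. Also keep in mind that the number of combinations grows quickly with sentence length.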
