Suggestion for a better way to organize data to generate frequent itemsets?

I have bag-of-words data for a collection of documents. The data has 3 columns: {document number, word number, count of the word in the document}. I am supposed to generate frequent itemsets of a particular size.

I thought I would make a list of all words that appear in each document, build a table from these lists, and then generate frequent itemsets using Mlxtend or Orange. However, this approach does not seem efficient.
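For concreteness, here is a minimal sketch of the approach I had in mind with Mlxtend, assuming the data is in a pandas DataFrame df with columns doc_id and word_id (the names are placeholders, and the count column is not needed for presence/absence itemsets):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# One transaction per document: the list of words it contains
transactions = df.groupby("doc_id")["word_id"].apply(list).tolist()

# One-hot encode the transactions into a boolean document-word table
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets, filtered down to the required size (e.g. 3)
itemsets = apriori(onehot, min_support=0.05, use_colnames=True)
itemsets = itemsets[itemsets["itemsets"].apply(len) == 3]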

Topic: orange3, orange, text-mining, data-mining

Category: Data Science


If the size is reasonable (i.e. not too many documents and not too many words per document), you could try to build a map that associates each possible itemset with the set of documents containing it, for instance like this:

# Assuming data is a list of N documents, each represented as a set of words.
# clusters maps each itemset (a frozenset of words) to the set of
# documents (here, document indices) associated with it.
from collections import defaultdict

clusters = defaultdict(set)
for i in range(N):
    for j in range(i + 1, N):
        group = frozenset(data[i] & data[j])  # words shared by both documents
        clusters[group].add(i)
        clusters[group].add(j)

An alternative version, if the number of distinct words and the size of the sets allow it (the number of subsets grows exponentially with document length) and/or if it's possible to precompute the itemsets of interest:

from itertools import combinations

for i in range(N):
    for size in range(1, len(data[i]) + 1):    # every non-empty subset size
        for S in combinations(data[i], size):  # every subset of the document
            clusters[frozenset(S)].add(i)
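The frequent itemsets of a given size can then be read off the map. A minimal sketch, where k and min_count are hypothetical thresholds; the counts are exact for the subset version (clusters[S] holds every document containing S), while the pairwise version only records documents whose pairwise overlap is exactly S and so may undercount:

k, min_count = 3, 10  # hypothetical: itemset size and minimum document count
frequent = {itemset for itemset, docs in clusters.items()
            if len(itemset) == k and len(docs) >= min_count}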

(adapted from https://datascience.stackexchange.com/a/60609/64377)
