How would you count top repetitions in a stream (grouping and summing similar strings too)

I would look at an arbitrary sliding window (relative to now, so with no fixed start), count the repetitions of each element, then rank them. While continuously scanning, I would also group and sum similar strings.

There are two options:

Keep a limited bag of results

The logical problem I see with a limited bag of ranked strings (say the top 10) is this: if I drop the keywords with no repetitions, or those with fewer repetitions, I lose track of them and have to re-count them from zero.
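That count-loss can be sketched in a few lines of Node. This is a toy top-N map with a made-up lowest-count eviction rule, purely illustrative:

```javascript
// Toy "limited bag": a capped map that evicts the lowest count when full.
// An evicted keyword loses its history and restarts from zero.
function makeBag(capacity) {
  const counts = new Map();
  return {
    add(key) {
      if (!counts.has(key) && counts.size >= capacity) {
        // evict the current minimum to make room
        let minKey = null;
        for (const [k, c] of counts) {
          if (minKey === null || c < counts.get(minKey)) minKey = k;
        }
        counts.delete(minKey); // its count is gone for good
      }
      counts.set(key, (counts.get(key) || 0) + 1);
    },
    count(key) { return counts.get(key) || 0; }
  };
}

const bag = makeBag(2);
bag.add('cats'); bag.add('cats'); bag.add('dogs');
bag.add('fish'); // bag full: 'dogs' (count 1) is evicted
bag.add('dogs'); // 'dogs' re-enters at 1; its earlier hit is lost
console.log(bag.count('dogs')); // 1, not 2
```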

Keep an unlimited bag of results

This seems easier because it is deterministic, but in practice it would blow up memory. (I assume so, at least; I'm still not sure about the actual volume.)
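For comparison, the unlimited bag is just an exact counter over all distinct strings; a minimal sketch (illustrative only), whose `Map` grows without bound on an endless stream:

```javascript
// "Unlimited bag": exact counts via a Map. Correct, but memory grows
// with the number of distinct strings in the stream.
const counts = new Map();
function record(keyword) {
  counts.set(keyword, (counts.get(keyword) || 0) + 1);
}
function topN(n) {
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}

['cats', 'dogs', 'cats', 'fish', 'cats', 'dogs'].forEach(record);
console.log(topN(2)); // [ [ 'cats', 3 ], [ 'dogs', 2 ] ]
```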

Any hints?

Context

As you can imagine, I want to add a section in my website with popular searches.

For instance:

  1. The lookup window would be: One month
  2. The bag of top searches would be of length 10
  3. I could keep a store of, say, 1 million distinct strings to run similarity against, and re-evaluate counts each time (on each new element, or for yet another time window).
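The similarity step in point 3 could be sketched without any library, e.g. with bigram (Dice) similarity; both the measure and the 0.8 threshold here are illustrative choices, not a recommendation:

```javascript
// Group a new search with an existing one when bigram (Dice)
// similarity exceeds a threshold. Purely illustrative.
function bigrams(s) {
  const set = new Set();
  const t = s.toLowerCase();
  for (let i = 0; i < t.length - 1; i++) set.add(t.slice(i, i + 2));
  return set;
}
function similarity(a, b) {
  const A = bigrams(a), B = bigrams(b);
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  return (2 * shared) / (A.size + B.size);
}
// Map an incoming keyword to a canonical one already seen, if close enough.
function canonicalize(keyword, seen, threshold = 0.8) {
  for (const s of seen) if (similarity(keyword, s) >= threshold) return s;
  seen.push(keyword);
  return keyword;
}

const seen = [];
console.log(canonicalize('iphone case', seen));  // 'iphone case' (first sighting)
console.log(canonicalize('iphone cases', seen)); // 'iphone case' (grouped)
```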

Finally, calculating top searches is not as easy as it seems!

Topic data-stream-mining

Category Data Science


I'm answering my own question. This is just one suggestion, of course; I assume the same problem could be approached differently.

First I discovered Bloom filters: a probabilistic, hash-based structure that is very useful over endless streams (it keeps memory bounded) for detecting repetitions.
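For intuition, a toy Bloom filter fits in a few lines of plain Node. The sizes and the hashing below are illustrative, not the bloom-filters package's implementation:

```javascript
// Toy Bloom filter: k hash functions each set one bit; membership
// queries can yield false positives but never false negatives.
class Bloom {
  constructor(bits = 1024, hashes = 3) {
    this.bits = bits;
    this.hashes = hashes;
    this.array = new Uint8Array(bits);
  }
  // simple FNV-1a-style hash, seeded per hash function
  hash(str, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.bits;
  }
  add(str) {
    for (let s = 0; s < this.hashes; s++) this.array[this.hash(str, s)] = 1;
  }
  has(str) {
    for (let s = 0; s < this.hashes; s++)
      if (!this.array[this.hash(str, s)]) return false;
    return true;
  }
}

const seenBefore = new Bloom();
seenBefore.add('cats');
console.log(seenBefore.has('cats')); // true (guaranteed: no false negatives)
// a query for an unseen string is almost certainly false,
// but a false positive is possible
```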

One particular structure from the same probabilistic family, available in the bloom-filters npm package, is TopK.
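Roughly, the idea behind such a TopK structure is to pair a fixed-memory frequency sketch (e.g. a count-min sketch, which may overestimate a count but never underestimate it) with a small bag of the k best candidates. A vanilla sketch of that idea, not the package's actual implementation:

```javascript
// A count-min sketch approximates per-string counts in fixed memory;
// a small candidate map keeps only the k best strings. Parameters
// (width, depth) are illustrative, not tuned.
class CountMin {
  constructor(width = 256, depth = 4) {
    this.width = width;
    this.depth = depth;
    this.table = Array.from({ length: depth }, () => new Uint32Array(width));
  }
  hash(str, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.width;
  }
  add(str) {
    let est = Infinity;
    for (let d = 0; d < this.depth; d++) {
      const cell = ++this.table[d][this.hash(str, d)];
      est = Math.min(est, cell); // min across rows limits overestimation
    }
    return est; // estimated count after this insertion
  }
}

class SimpleTopK {
  constructor(k) { this.k = k; this.sketch = new CountMin(); this.best = new Map(); }
  add(str) {
    const est = this.sketch.add(str);
    this.best.set(str, est);
    if (this.best.size > this.k) {
      // drop the weakest candidate to stay at k entries
      const weakest = [...this.best.entries()].reduce((a, b) => (b[1] < a[1] ? b : a));
      this.best.delete(weakest[0]);
    }
  }
  values() {
    return [...this.best.entries()]
      .sort((a, b) => b[1] - a[1])
      .map(([value, frequency], i) => ({ value, frequency, rank: i + 1 }));
  }
}

const tk = new SimpleTopK(2);
for (const w of ['cats', 'dogs', 'cats', 'fish', 'cats', 'dogs']) tk.add(w);
console.log(tk.values()); // 'cats' first, estimated frequency 3
```

Note that even when a candidate drops out of the small bag, its count survives in the sketch, so it re-enters later with its full estimate; that is exactly what the limited bag in the question couldn't do.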

Practically, as I'm running on Node, I came up with the solution below, with some additional constraints: refresh the learning bag once half the queue has been seen again, and purge elements older than one month on insertion (the purge could be triggered some other way instead).

I also used a circular queue so that only the last month of entries is scanned.

const { TopK } = require('bloom-filters')
const FuzzySet = require('fuzzyset')
const { CircularQueue } = require('@toreda/adt')

// generate random strings (to simulate incoming search keywords)
function makeid(length) {
  let result = ''
  const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
  for (let i = 0; i < length; i++) {
    result += characters.charAt(Math.floor(Math.random() * characters.length))
  }
  return result
}

// arbitrary choice
const maxSize = 3000
const circularBuffer = new CircularQueue({
  maxSize: maxSize,
  overwrite: true
})

let pushCount = 0
// true once more than half of the queue has been refilled since the last refresh
const isHalfSeen = () => ((pushCount++) % maxSize) > maxSize / 2

const MONTH_MS = 30 * 24 * 60 * 60 * 1000
// drop entries older than one month from the front of the queue
const purgeOld = () => {
  let front = circularBuffer.front()
  while (front && front.when < new Date().getTime() - MONTH_MS) {
    circularBuffer.pop()
    front = circularBuffer.front()
  }
}
let fuzzyset = FuzzySet()
// return the canonical form of an already-seen similar keyword,
// or false if the keyword is new (in which case remember it)
const checkSimilarity = (keyword) => {
  const similar = fuzzyset.get(keyword)
  if (!similar || (similar[0][0] < 0.8)) {
    fuzzyset.add(keyword)
    return false
  }
  return similar[0][1]
}

// k = 10, error rate 0.001, accuracy 0.99; declared up front so the
// print loop below has a bag to read on the first refresh
let topk = new TopK(10, 0.001, 0.99)

const refreshTopK = (keyword) => {
  purgeOld()
  const similar = checkSimilarity(keyword)
  if (similar)
    keyword = similar
  // store the timestamp under `when` so purgeOld can read it
  circularBuffer.push({ when: new Date().getTime(), keyword })
  // when half the queue is seen already, renew the topK bag
  if (isHalfSeen()) {
    console.log('isHalfSeen')
    // print the top search keywords so far
    for (const item of topk.values()) {
      console.log(`Item "${item.value}" is in position ${item.rank} with an estimated frequency of ${item.frequency}`)
    }
    pushCount = 0
    topk = new TopK(10, 0.001, 0.99)
    circularBuffer.forEach((elem) => {
      topk.add(elem.keyword)
    })
  }
}

for (let index = 0; index < 10000; index++) {
  refreshTopK(makeid(5))
}
