How would you count top repetitions in a stream (grouping and summing similar strings too)?
I would look at an arbitrary window (a relative one, so with no fixed base), count the repetitions of each element, then rank them. I would keep doing this continuously, grouping similar strings and summing their counts.
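To make the idea concrete, here is a minimal sketch of counting one window, assuming "similar" means a high `difflib` similarity ratio after lower-casing and collapsing whitespace. The function name `top_in_window` and the `cutoff` value of 0.85 are my own assumptions for illustration, not a recommendation.

```python
from collections import Counter
from difflib import get_close_matches


def top_in_window(window, k=10, cutoff=0.85):
    """Count the strings in one window, merging near-duplicates, and rank them."""
    counts = Counter()
    for raw in window:
        # Normalise case and whitespace before comparing (assumed rule).
        query = " ".join(raw.lower().split())
        # Reuse an existing representative if one is similar enough.
        match = get_close_matches(query, list(counts), n=1, cutoff=cutoff)
        counts[match[0] if match else query] += 1
    return counts.most_common(k)


window = ["data science", "Data  Science", "data sciense", "python", "python"]
print(top_in_window(window, k=2))
```

Note that each new string is compared against every representative seen so far, so this only scales to small windows.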
There are two options:
Keep a limited bag of results
The logical problem I see with a limited bag of ranked strings, say the top 10, is that if I remove the keywords with no repetitions, or those with the fewest repetitions, I lose track of them and have to re-count them from zero.
Keep an unlimited bag of results
This seems easier because the counts are exact, but technically it would blow up memory. (I assume so; I'm still not sure about the bandwidth, though.)
Any hints?
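For reference, one middle ground between the two options that I have come across is the Space-Saving algorithm: the bag stays bounded, but an evicted entry's count is inherited by the newcomer instead of restarting from zero, so counts are overestimates with a known error bound. The sketch below is only my understanding of it; the class name `SpaceSaving` and its methods are hypothetical names, and the linear scan for the minimum is a simplification.

```python
class SpaceSaving:
    """Bounded-memory approximate counter (sketch of the Space-Saving idea)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}   # item -> estimated count (never an underestimate)
        self.errors = {}   # item -> maximum possible overestimation

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the entry with the smallest count; the newcomer inherits
            # that count, which is also the bound on its error.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k):
        """Return (item, estimated_count, max_error) tuples, largest first."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return [(item, count, self.errors[item]) for item, count in ranked[:k]]


bag = SpaceSaving(capacity=2)
for query in ["a"] * 5 + ["b"] * 3 + ["c", "d", "a"]:
    bag.add(query)
print(bag.top(1))
```

The trade-off is that rare items can appear with inflated counts, so the error column has to be checked before trusting the lower ranks.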
Context
As you can imagine, I want to add a section in my website with popular searches.
For instance:
- The lookup window would be one month
- The bag of top searches would have length 10
- I could keep a stack of, say, 1 million distinct strings to run similarity on, and re-evaluate the counts each time (on each new element, or for each new time window).
Finally, calculating top searches is not as easy as it seems!
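The setup above could be sketched as one counter per day plus a running total, dropping whole days as they fall out of the window. This assumes searches arrive in day order and that days are plain integers; `WindowedTopK` and its parameters are hypothetical names. It does not bound the number of distinct strings within the window, and it leaves similarity grouping out.

```python
from collections import Counter, deque


class WindowedTopK:
    """Exact counts over a sliding window of whole days (illustrative sketch)."""

    def __init__(self, window_days=30):
        self.window_days = window_days
        self.buckets = deque()   # (day, Counter) pairs, oldest first
        self.total = Counter()   # sum of all buckets still inside the window

    def add(self, day, query):
        """Record one search on integer `day`; days must not go backwards."""
        if not self.buckets or self.buckets[-1][0] != day:
            self.buckets.append((day, Counter()))
        self.buckets[-1][1][query] += 1
        self.total[query] += 1
        self._expire(day)

    def _expire(self, today):
        # Drop every bucket that is window_days or more older than today.
        while self.buckets and self.buckets[0][0] <= today - self.window_days:
            _, old = self.buckets.popleft()
            self.total -= old   # in-place subtraction also removes zero counts

    def top(self, k=10):
        return self.total.most_common(k)


searches = WindowedTopK(window_days=30)
searches.add(1, "python")
searches.add(1, "python")
searches.add(2, "pandas")
searches.add(31, "pandas")   # day 1 has now left the 30-day window
print(searches.top(10))
```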
Topic data-stream-mining
Category Data Science