Open source tools to help mine a stream of leaderboard scores

Consider a stream containing tuples (user, new_score) representing users' scores in an online game. The stream could have 100-1,000 new elements per second. The game has 200K to 300K unique players.

I would like to have some standing queries like:

  1. Which players posted more than x scores in a sliding window of one hour?
  2. Which players increased their score by x% in a sliding window of one hour?
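To make the two queries concrete, here is a minimal in-memory sketch in plain Python. It is not tied to Esper or any other engine, and the class and method names (`SlidingWindowScores`, `posted_more_than`, `gained_percent`) are made up for illustration. It also assumes that "gained x%" means the latest score in the window compared with the earliest score in the same window; a real system might compare against the last score before the window instead.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # one-hour sliding window


class SlidingWindowScores:
    """Keeps each player's (timestamp, score) events from the last hour."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        # user -> deque of (timestamp, score), oldest first
        self.events = defaultdict(deque)

    def add(self, user, score, now):
        """Record a new (user, new_score) tuple arriving at time `now` (seconds)."""
        self.events[user].append((now, score))

    def _expire(self, now):
        """Drop events that have fallen out of the window ending at `now`."""
        cutoff = now - self.window
        for user in list(self.events):
            q = self.events[user]
            while q and q[0][0] <= cutoff:
                q.popleft()
            if not q:
                del self.events[user]

    def posted_more_than(self, x, now):
        """Query 1: players with more than x scores in the window."""
        self._expire(now)
        return {u for u, q in self.events.items() if len(q) > x}

    def gained_percent(self, pct, now):
        """Query 2 (assumed meaning): latest score is at least pct% above
        the earliest score still inside the window."""
        self._expire(now)
        result = set()
        for u, q in self.events.items():
            first, last = q[0][1], q[-1][1]
            if first > 0 and (last - first) / first * 100 >= pct:
                result.add(u)
        return result
```

At the upper rate of 1,000 events per second, a one-hour window holds at most 3.6 million events, so a single-process structure like this is plausible for prototyping before committing to a streaming engine.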

Which open source tools can I use to jump-start this project? I am considering Esper at the moment.

Note: I have just completed reading "Mining Data Streams" (chapter 4 of Mining of Massive Datasets) and I am quite new to mining data streams.

Topic data-stream-mining tools

Category Data Science


I recently read a very good article that suggests using Twitter Storm for a task that looks quite similar to yours.


This isn't a full solution, but you may want to look into OrientDB as part of your stack. Orient is a Graph-Document database server written entirely in Java.

In graph databases, relationships are considered first-class citizens, so traversing them can be done pretty quickly. Orient is also a document database, which would give you the kind of schema-free architecture it sounds like you need. The real reason I suggest Orient, however, is its extensibility. It supports streaming via sockets, and the entire database can be embedded into another application. Finally, it can be scaled efficiently and/or run entirely in memory. So, with some Java expertise, you can actually run your preset queries against the database in memory.

We are doing something similar. In creating an app/site for social science research collaboration, we found ourselves with immensely complex data models. We ended up writing several of the queries in the Gremlin traversal language (a Groovy-based language, and Groovy is, of course, Java at its heart), and then exposing those queries through OrientDB's binary connection server. So, the client opens a TCP socket, sends a short binary message, and the query executes in Java directly against the in-memory database.

OrientDB also supports writing function queries in JavaScript, and you can use Node.js to interact directly with an Orient instance.

For something of this size, I would want to use Orient in conjunction with Hadoop or something similar. You can also use Orient in conjunction with Esper.

Consider: An introduction to OrientDB: http://www.sitepoint.com/a-look-at-orientdb-the-graph-document-nosql/

Complex, real-time queries: http://www.gft-blog.com/business-trends/leveraging-real-time-scoring-through-bigdata-to-detect-insurance-fraud/

A discussion about streaming options with Java and OrientDB: https://github.com/orientechnologies/orientdb/issues/1227
