What would be the right tool for gathering data structure analytics in a data stream?

We are processing a fairly large number of JSON objects (hundreds of thousands daily) and we need to gather insights about them. We are interested in gathering the following statistics for each field of the processed JSON objects:

  • The percentage of objects in which the field is present vs. missing (null/empty can be treated as missing)
  • The possible values and the percentage of objects using each value, for high-frequency values
  • The possible array lengths and the percentage of objects with each length

Because the processed JSON objects come from different sources, we need to be able to filter the statistics by source. We also need to filter them by time (e.g. this week, last 90 days, this year) and by certain predefined values of some fields (e.g. field A == X or field B == Y). The filters are predefined and we don't need many of them.

So ideally the result of a query like "last 30 days from source S1 where A == X" would look like:

  • a - present 90%
  • a.b - present 40%, enum values: A - 30%, B - 25%, C - 10%, D - 5%
  • a.c - present 80%, array, length: 1 - 90%, 2 - 2%, 3 - 1%
  • d - present 40%, enum values: X - 50%, Y - 40%, Z - 5%
  • e - present 5%, int
  • g - present 10%, string
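
To make the required aggregation concrete, here is a rough Python sketch of the kind of per-field statistics we have in mind, computed over an in-memory batch of already-parsed objects (the field names match the example above, and the source/time/field filters are assumed to have been applied beforehand):

    from collections import Counter

    def field_stats(docs, path):
        """Presence, value and array-length stats for one dotted field path."""
        present, values, lengths = 0, Counter(), Counter()
        for doc in docs:
            node = doc
            for key in path.split("."):
                node = node.get(key) if isinstance(node, dict) else None
            if node in (None, "", [], {}):        # null/empty counts as missing
                continue
            present += 1
            if isinstance(node, list):
                lengths[len(node)] += 1
            elif not isinstance(node, dict):      # only tally scalar values
                values[node] += 1
        n = len(docs) or 1
        return {
            "present_pct": 100.0 * present / n,
            "value_pct": {v: 100.0 * c / n for v, c in values.most_common(5)},
            "array_len_pct": {k: 100.0 * c / n for k, c in lengths.items()},
        }

    # e.g. field_stats(docs_from_S1_last_30_days, "a.b")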

The structure of the JSON objects is known and it's the same for all sources, but we don't necessarily know the enum values in advance (even though there are relatively few of them).

Any hints about specialised tools we could use to process, store and query such statistics would be great.


Welcome to datascience.stackexchange,

There are three phases to your project: ingesting, storing and querying. But in terms of technology components, you can look at it as two: a stream processor and a NoSQL DB.

There are many stream-processing options, and in general the choice depends on:

  1. The publisher's (the "source's") endpoint technology (message queue service, cloud storage, etc.)
  2. The stream's volume and velocity (the other big-data "Vs" are well defined here) relative to your system's required SLA
  3. The targeted NoSQL DB (it may require, or remove the need for, a pre-processing stage, different cluster properties to handle the stream's "Vs", etc.)

There are even more NoSQL DB options, and in general the choice depends on:

  1. The NoSQL data family (tabular, document, key-value, time-series, graph, etc.)
  2. The usage stories (query types, frequency of use, etc.)
  3. The scale requirements (read/write throughput, mainly related to the "Vs" but also to the query usage)

=====

The NoSQL options can be narrowed down even at this early stage, since you provided a lot of detail on the nature of your data, so I'd start there:

Your data has two important characteristics: a pure JSON format and strongly time-series-like usage.

This mainly suggests you'll need a NoSQL DB that is:

  1. Document (JSON-like) oriented
  2. Time-series oriented
  3. Scalable

I can think of 2 strong candidates:

Elasticsearch vs MongoDB

Both are:

  • Scalable document DBs from day one, with Elasticsearch being time-series oriented from day one and MongoDB adding time-series support in v5
  • Available as a free open-source version or as a managed paid service you can opt for

Elastic

  • Strong built-in statistics on its fields, which might be useful for your per-attribute stats (see the query sketch after this list)
  • The ability to partition the DB into different indices along the time axis (each representing a time-sliced DB). This gives a serious performance boost when your queries are time-sliced (as in your case), and could be improved further by slicing the indices per source as well as per time slice
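
For example, the presence and top-values stats for a single attribute could be pulled with one filtered aggregation query. A rough sketch, assuming the official elasticsearch Python client, an events-* index pattern and hypothetical source, @timestamp and a.b (keyword) fields:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")               # assumed local cluster

    body = {
        "size": 0,
        "track_total_hits": True,
        "query": {"bool": {"filter": [
            {"term": {"source": "S1"}},                       # assumed source field
            {"term": {"A": "X"}},                             # predefined field filter
            {"range": {"@timestamp": {"gte": "now-30d/d"}}},  # assumed timestamp field
        ]}},
        "aggs": {
            "a_b_missing": {"missing": {"field": "a.b"}},
            "a_b_values": {"terms": {"field": "a.b", "size": 10}},
        },
    }

    resp = es.search(index="events-*", body=body)
    total = resp["hits"]["total"]["value"] or 1
    missing = resp["aggregations"]["a_b_missing"]["doc_count"]
    print(f"a.b present: {100 * (total - missing) / total:.1f}%")
    for b in resp["aggregations"]["a_b_values"]["buckets"]:
        print(f'  {b["key"]}: {100 * b["doc_count"] / total:.1f}%')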

MongoDB

  • Pretty much pioneered the BSON data format, and is well known for handling stored binary JSON efficiently
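
The same kind of per-field percentage can be computed with an aggregation pipeline. A rough sketch, assuming pymongo and a hypothetical events collection whose documents carry source and ts fields:

    from datetime import datetime, timedelta, timezone
    from pymongo import MongoClient

    # Assumed: an "events" collection of raw JSON docs with "source" and "ts" fields
    coll = MongoClient("mongodb://localhost:27017")["analytics"]["events"]

    since = datetime.now(timezone.utc) - timedelta(days=30)
    pipeline = [
        {"$match": {"source": "S1", "A": "X", "ts": {"$gte": since}}},
        {"$group": {
            "_id": None,
            "total": {"$sum": 1},
            # a.b counts as present unless it is missing or null
            "a_b_present": {"$sum": {"$cond": [
                {"$in": [{"$type": "$a.b"}, ["missing", "null"]]}, 0, 1]}},
        }},
        {"$project": {"_id": 0, "a_b_present_pct": {
            "$multiply": [{"$divide": ["$a_b_present", "$total"]}, 100]}}},
    ]
    print(list(coll.aggregate(pipeline)))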

=====

I recommend starting to test one of the above (or another that fits the category, usage, etc.). See whether one of them offers broader common ground for supporting the different sources' consumption technology, the "Vs" and the SLA requirements. For example, Elasticsearch has the Logstash component, which claims to support a wide variety of data-source plugins. This might help you narrow down the stream-processing options further.

There is no single definitive option here, and you should also factor in the knowledge of the personnel currently at your disposal (existing client-side languages, DevOps familiarity with the server-side technology, your DBAs' existing skills, etc.).

BTW, the ability to filter by "source" could come either from adding a source attribute to your JSON, or could be skipped altogether if you go with the index-per-source solution.


One possible option could be Apache Kafka, an open-source distributed event-streaming platform that supports data integration, data pipelines and streaming analytics. It is possible to use Kafka to perform windowed queries on streaming JSON.
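
The windowed aggregation is typically done with Kafka Streams or ksqlDB, but even a plain consumer can tally per-window field presence before the counters are written to whatever store you pick. A rough Python sketch, assuming the kafka-python package, a hypothetical json-events topic and a source field in each message:

    import json
    from collections import Counter, defaultdict
    from kafka import KafkaConsumer              # assumes the kafka-python package

    consumer = KafkaConsumer(
        "json-events",                           # hypothetical topic name
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    # (source, hour-window) -> field presence counters
    windows = defaultdict(Counter)
    for msg in consumer:
        doc = msg.value
        key = (doc.get("source", "unknown"), msg.timestamp // 3_600_000)
        windows[key]["_total"] += 1
        for field, value in doc.items():
            if value not in (None, "", [], {}):  # null/empty counts as missing
                windows[key][field] += 1
        # A real pipeline would flush closed windows to the stats store
        # (e.g. Elasticsearch or MongoDB) instead of keeping them in memory.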
