What would be the right tool for gathering data structure analytics in a data stream?
We process a fairly large number of JSON objects (hundreds of thousands per day) and need to gather insights about them. For each field of the processed JSON objects, we are interested in the following statistics:
- How often the field is present vs. missing (null/empty can be treated as missing)
- The possible values and how often each one occurs, for high-frequency values
- The possible number of elements in an array and how often each length occurs
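To make the per-field statistics concrete, here is a rough Python sketch of the kind of aggregation we mean. The flattening helper and the null/empty handling are just illustrative, not a proposed implementation:

```python
from collections import Counter, defaultdict

def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf field of a JSON object."""
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, f"{path}.")
        else:
            yield path, value

class FieldStats:
    """Per-field counters: presence, value frequencies, array lengths."""
    def __init__(self):
        self.total = 0                       # number of objects seen
        self.present = Counter()             # path -> count of non-null occurrences
        self.values = defaultdict(Counter)   # path -> value -> count
        self.lengths = defaultdict(Counter)  # path -> array length -> count

    def update(self, obj):
        self.total += 1
        for path, value in flatten(obj):
            if value is None or value == "":
                continue  # null/empty is treated as missing
            self.present[path] += 1
            if isinstance(value, list):
                self.lengths[path][len(value)] += 1
            else:
                self.values[path][value] += 1

    def presence_pct(self, path):
        return 100.0 * self.present[path] / self.total if self.total else 0.0
```

Something like `FieldStats().update(obj)` per incoming object would give us the presence percentages and value/length histograms above; the open question is which tool does this kind of thing at scale, with persistence and querying.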
Because the processed JSON objects come from different sources, we need to be able to filter the statistics by source. We also need to filter by time (e.g. this week, last 90 days, this year) and by certain predefined values of some fields (e.g. field A == X or field B == Y). The filters are predefined and we don't need many of them.
So ideally, the result of a query like "last 30 days, from source S1, where A == X" would look like:
- a - present 90%
- a.b - present 40%, enum values: A - 30%, B - 25%, C - 10%, D - 5%
- a.c - present 80%, array, length: 1 - 90%, 2 - 2%, 3 - 1%
- d - present 40%, enum values: X - 50%, Y - 40%, Z - 5%
- e - present 5%, int
- g - present 10%, string
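For the filtering side, one shape we imagine (purely as a sketch, with made-up bucket keys) is keeping one set of counters per (source, day, predicate) bucket and merging buckets at query time:

```python
from collections import Counter
from datetime import date, timedelta

# (source, day, predicate_name) -> Counter of per-path presence counts
buckets = {}

def record(source, day, predicate_name, path_counts):
    """Add one day's per-path counts into the matching bucket."""
    buckets.setdefault((source, day, predicate_name), Counter()).update(path_counts)

def query(source, days, predicate_name, today=None):
    """Merge the daily counters for the last `days` days into one Counter."""
    today = today or date.today()
    merged = Counter()
    for offset in range(days):
        day = today - timedelta(days=offset)
        merged += buckets.get((source, day, predicate_name), Counter())
    return merged
```

Since the filters are predefined, pre-bucketing per predicate like this seems feasible; we would rather use an existing tool than maintain this ourselves.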
The structure of the JSON objects is known and is the same for all sources, but we don't necessarily know the enum values in advance (even though there are relatively few of them).
Any hints about specialised tools we could use to process, store and query such statistics would be great.
Topic data-stream-mining
Category Data Science