What is a good way to start data analysis of an unknown dataset (JSON data)?

I am working with an organization to analyse their data, which resides in MongoDB, and to look for any trends/patterns in it. I am quite new to the professional field of data analysis but have a good background in statistics and data mining (university coursework). I will be doing a proof of concept on the data to understand whether the data the organization is gathering is good for analytics and, if not, what enhancements they should make to their datasets. I do have some predefined questions that I am planning to answer, but strategically, what would be a good way to approach this kind of problem?

I have previously worked with some Kaggle datasets and on university projects, but in those the datasets were fixed and the questions to be answered were given.

Topic json mongodb data-cleaning data-mining machine-learning

Category Data Science


Whatever business is behind this data, the biggest challenge is to extract the fields/keys that carry valuable information.

This can be done by browsing the data, deriving the schema, and looking for deviations. Since MongoDB uses a dynamic schema, you may observe several different document shapes within a single collection - which can be a source of valuable (or invaluable) data.
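To make that concrete, here is a minimal sketch of schema-deviation hunting in Python. It counts how many documents share each "shape" (set of field names and types); rare shapes are your deviations. The sample documents and field names are invented for illustration - in practice you would feed it a pymongo cursor (e.g. `db.events.find()`) instead of the hard-coded list.

```python
from collections import Counter

# Hypothetical sample of documents, standing in for a pymongo cursor;
# field names here are made up for illustration.
docs = [
    {"_id": 1, "user": "a", "ts": "2020-01-01", "amount": 9.5},
    {"_id": 2, "user": "b", "ts": "2020-01-02"},            # amount missing
    {"_id": 3, "user": "c", "amount": "12.0"},              # amount stored as a string
    {"_id": 4, "user": "d", "ts": "2020-01-03", "amount": 3.0},
]

def schema_fingerprint(doc):
    """Sorted tuple of (field, type) pairs -- one 'shape' of a document."""
    return tuple(sorted((k, type(v).__name__) for k, v in doc.items()))

# Count how many documents share each shape; rare shapes are the deviations.
shapes = Counter(schema_fingerprint(d) for d in docs)
for shape, n in shapes.most_common():
    print(n, shape)
```

Running this on the sample prints three distinct shapes, with the "full" shape occurring twice - the other two are the documents worth investigating.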

You can use Robomongo or MongoDB Compass to visualise the data and get a feel for it.

Learning the MongoDB aggregation framework and/or its map-reduce syntax is a must in such a case.
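As a taste of the aggregation framework, here is a typical pipeline in pymongo syntax (collection and field names are illustrative), followed by a plain-Python emulation of the same two stages so you can see what `$match` and `$group` actually do. Against a live server you would run `db.events.aggregate(pipeline)` instead.

```python
from collections import defaultdict

# A typical aggregation pipeline (pymongo syntax): keep documents that have
# an "amount" field, then count and sum per user. Names are illustrative.
pipeline = [
    {"$match": {"amount": {"$exists": True}}},
    {"$group": {"_id": "$user", "n": {"$sum": 1}, "total": {"$sum": "$amount"}}},
]

# Plain-Python equivalent of those two stages, for intuition:
docs = [
    {"user": "a", "amount": 2.0},
    {"user": "a", "amount": 3.5},
    {"user": "b", "amount": 1.0},
    {"user": "b"},                      # dropped by the $match stage
]
groups = defaultdict(lambda: {"n": 0, "total": 0.0})
for d in docs:
    if "amount" in d:                   # $match: field must exist
        g = groups[d["user"]]           # $group: key on user
        g["n"] += 1                     # {"$sum": 1}
        g["total"] += d["amount"]       # {"$sum": "$amount"}
print(dict(groups))
```

The emulation yields `{"a": {"n": 2, "total": 5.5}, "b": {"n": 1, "total": 1.0}}` - the same result the server-side pipeline would return as documents.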

Happy mining!


@Emre makes a great point. You should probably ask the business what they're trying to accomplish because only then will you know which data matters. You don't want to waste time cleaning up data that has minimal importance to the business.

Regardless of the business objective, though, it's worth pointing out that MongoDB is a pretty poor choice for storing analytical data. There is no enforced schema, so each record might be in a completely different and unexpected format: you might have entire fields missing, maybe no fields at all, wrong data types, duplicates, and no easy way to get summary statistics like you would in a relational database. You're pretty much going to have to do a database dump of your MongoDB data and comb through your records semi-manually, looking for commonly occurring schemas and going from there.
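The "comb through semi-manually" step often amounts to coercing a messy field to a common type and then computing the summary statistics a relational database would give you for free. A minimal sketch, assuming you have dumped the collection (e.g. with mongoexport) and the field names are invented:

```python
import statistics

# Records as they might come out of a dump: the same field appears as a
# float, a string, None, or not at all. Field names are illustrative.
records = [
    {"amount": 9.5},
    {"amount": "12.0"},     # wrong type: number stored as a string
    {"amount": None},
    {},                     # field missing entirely
    {"amount": 3.0},
]

def coerce_float(value):
    """Return a float if the value can be interpreted as one, else None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

values = [v for v in (coerce_float(r.get("amount")) for r in records) if v is not None]
missing = len(records) - len(values)
print(f"n={len(values)} missing={missing} mean={statistics.mean(values):.2f}")
```

On the sample this recovers three usable values (including the stringified one) and reports two missing/unusable records - exactly the kind of data-quality summary the answer above says you will have to build by hand.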

Summary: MongoDB is great for operational data when all you want to do is a quick lookup, but it's a poor choice when you want to do any sort of in-depth analysis. Relational databases are better for analytical queries because the data is structured and because you can more easily enforce data quality.
