MongoDB Groupby Rank

I'm working with MongoDB and want to run a query using the aggregate function. The query is: each city has several zip codes. Find the city in each state with the most zip codes, and rank those cities along with their states by city population. The documents are in the following format: { "_id": "10280", "city": "NEW YORK", "state": "NY", "pop": 5574, "loc": [ -74.016323, 40.710537 ] } I was able to count the number of zip codes for each state …
Category: Data Science
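
A minimal sketch of the logic the question above describes, written in plain Python rather than as a MongoDB pipeline, so the three steps (count zip codes per city, keep the city with the most zip codes in each state, rank by population) are easy to follow. The sample documents other than the first one, the function name top_city_per_state, and the tie-breaking rule (the first city encountered wins) are assumptions made for illustration and do not come from the original post.

```python
from collections import defaultdict

# Sample documents in the format quoted in the question.
# Only the first entry comes from the post; the rest are hypothetical.
zips = [
    {"_id": "10280", "city": "NEW YORK", "state": "NY", "pop": 5574},
    {"_id": "10001", "city": "NEW YORK", "state": "NY", "pop": 1000},
    {"_id": "12201", "city": "ALBANY", "state": "NY", "pop": 9000},
    {"_id": "02101", "city": "BOSTON", "state": "MA", "pop": 20000},
    {"_id": "02102", "city": "BOSTON", "state": "MA", "pop": 5000},
    {"_id": "01101", "city": "SPRINGFIELD", "state": "MA", "pop": 3000},
]


def top_city_per_state(docs):
    """Return one row per state: the city with the most zip codes,
    ranked by that city's total population (largest first)."""
    # Step 1: group by (state, city), counting zip codes and summing population.
    stats = defaultdict(lambda: {"zips": 0, "pop": 0})
    for doc in docs:
        entry = stats[(doc["state"], doc["city"])]
        entry["zips"] += 1
        entry["pop"] += doc["pop"]

    # Step 2: within each state, keep the city with the highest zip count.
    # On a tie, the city seen first is kept (an arbitrary choice).
    best = {}
    for (state, city), entry in stats.items():
        current = best.get(state)
        if current is None or entry["zips"] > current["zips"]:
            best[state] = {"state": state, "city": city, **entry}

    # Step 3: rank the surviving cities by population, descending.
    ranked = sorted(best.values(), key=lambda row: row["pop"], reverse=True)
    return [{"rank": i, **row} for i, row in enumerate(ranked, start=1)]


for row in top_city_per_state(zips):
    print(row)
```

In an aggregation pipeline, the same idea would typically be expressed as a grouping stage on state and city, followed by sorting and a second grouping on state; the exact stages depend on the MongoDB version in use.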

Which is faster: PostgreSQL vs MongoDB on large JSON datasets?

I have a large dataset with 9 million JSON objects at ~300 bytes each. They are posts from a link aggregator: basically links (a URL, title and author ID) and comments (text and author ID) plus metadata. They could very well be relational records in a table, except for the fact that they have one array field with IDs pointing to child records. Which implementation looks more solid? JSON objects on a PostgreSQL database (just one large table with one column, …
Category: Data Science

Deal with huge amount of data

I'm writing to get advice about my project. I want to build a recommender system for shops that sell some products. Specifically, I want to recommend that shop A stock item X because shop B sells this item and shops A and B are very similar. The "problem" here is the size of the data: I have around 5 TB of raw data (about 8,000,000,000 lines). So it's very difficult to do something with huge data like …
Category: Data Science

Is Elasticsearch recommended if the attribute being searched is not a huge text document?

We are currently developing a system on the MEAN stack with MongoDB as the backend. We have employees' names and IDs in our system, and our client wants a pretty good (read: Google-like) search for employees' records. He needs our system to suggest employees even if he has misspelled the name, etc. One of the suggestions from our development lead was that we should use Elasticsearch, but from what I have seen, Elasticsearch …
Category: Data Science

Should I use MongoDB instead of storing data in CSV in Python?

I am currently storing data crawled from multiple websites that have similar but not identical structures, so every crawler saves its data in a separate CSV file. I am planning to store the data in MongoDB instead of CSV. Will this help save space? Overall, will this be advantageous, or are there drawbacks apart from having to change the code?
Category: Data Science

Data representation (NoSQL database?) for a medical study

Problem description: I have a data set of about 10,000 patients in a study. For each patient, I have a list of various measurements. Some information is scalar data (e.g. age), some is time series of measurements, and some can even be a bitmap. An individual record can be quite large (10 kB to 10 MB). The data is to be processed practically in two steps: Preprocessing at the level of individual records (patients), i.e. to extract some features in …
Category: Data Science

How to download Wikileaks Cable Leaks documents as text corpus?

I'd like to perform NLP analysis on the Wikileaks US Diplomatic Cable Leaks documents (https://wikileaks.org/plusd/), preferably as a Python NLTK3 corpus or MongoDB documents. I couldn't find any option to download these in a raw text format, so I'm afraid I'll be forced to do some kind of scraping, but I'd be thankful if anyone could give a clue to a simpler solution, if one exists.
Category: Data Science

Database options for JSON storage, queried with Apache Drill

I am planning to set up a JSON storage system. It will store tens of millions of JSON records, all in the same format. I'd like to be able to query the data using Apache Drill. It looks like there is Drill support for MongoDB and Postgres. However, I'm unsure of the pros and cons of each, and how I'd structure the schema if I'd choose Postgres.
Category: Data Science

What is a good way to start Data Analysis of unknown dataset (JSON data)

I am working with an organization to analyse their data residing in MongoDB and to look for any trends/patterns in the data. I am quite new to the professional field of Data Analysis but have a good background in Statistics and Data Mining (University coursework). I will be doing a proof of concept on the data to understand whether the data the organization is gathering is good for Analytics and, if not, what enhancements they should include in their datasets …
Category: Data Science

Storing Sensor Data for Analysis of the Office

I have been tasked with designing an application that tracks several different measurements around the office, e.g. the temperature, light, presence of people, etc. Having never really worked on data analysis before, I would like some guidance on how to store this data (which database design to use). What we're looking at currently is around 50 sensors that only send data when an event of interest occurs: if the temperature changes by 0.5 degrees or if the light turns …
Category: Data Science

Which open-source DBMS for fairly large data?

I have a 7 GB confidential dataset which I want to use for a machine learning application. I tried every package recommended for efficient dataset management in R, such as data.table, ff and sqldf, with no success. From what I read, data.table needs to load all the data into memory, so it obviously will not work since my computer has only 4 GB of RAM. ff leads to a memory error too. So I decided to turn to …
Category: Data Science

Can map-reduce algorithms written for MongoDB be ported to Hadoop later?

In our company, we have a MongoDB database containing a lot of unstructured data, on which we need to run map-reduce algorithms to generate reports and other analyses. We have two approaches to choose from for implementing the required analyses: One approach is to extract the data from MongoDB to a Hadoop cluster and do the analysis completely on the Hadoop platform. However, this requires considerable investment in preparing the platform (software and hardware) and educating the team to work with …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.