"Hadoop" formats for user database: online advertising

I was wondering if someone could point me to suitable database formats for building up a user database:

Basically, I am collecting logs of impression data, and I want to compile a user database.

It would record which sites each user visits, plus country, gender and other categorisations, with the aim of (a) running searches, e.g. "give me all users visiting games sites from France", and (b) machine learning, e.g. clustering users by the sites they visit.

So I am interested in storing information about hundreds of millions of users, presumably with indexes on user, site and geo-location.

The idea is that this data would be continually updated (e.g. a nightly update of the user database with newly visited sites).

What are suitable database systems? Can someone suggest suitable reading material? I was imagining HBase might be suitable...

Topic hbase

Category Data Science


Storing user profiles

If you just want to store all user profiles, just save them in a normal RDBMS. Assuming one user profile takes 10 KB of storage, you need only ~9.5 GB for every million users (so roughly 1 TB per 100 million), which is fairly small and gives you all the advantages of mature relational databases.
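The estimate above can be checked with a quick calculation. This is only a back-of-the-envelope sketch: the 10 KB profile size is the assumption from this answer, the function name is made up, and indexes, replication and row overhead are ignored.

```python
# Back-of-the-envelope storage estimate for user profiles.
# Assumes binary units (1 KB = 1024 bytes) and counts raw profile bytes only.
KIB = 1024
GIB = 1024 ** 3


def profile_storage_gib(num_users, profile_kib=10):
    """Return raw storage in GiB for num_users profiles of profile_kib KiB each."""
    return num_users * profile_kib * KIB / GIB


for users in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{users:>13,} users -> {profile_storage_gib(users):,.1f} GiB")
```

For the hundreds of millions of users mentioned in the question this lands in the low terabytes, which is large for one machine but still within reach of a relational setup.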

It makes sense to use HBase only when you have a very large number of users (say, > 1B) or when the data is very sparse (most columns are empty). But don't expect it to be as convenient as good old SQL databases.
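To illustrate the convenience of the SQL route, here is a minimal sketch of the search from the question ("all users visiting games sites from France"). It uses Python's built-in sqlite3 module as a stand-in for a real RDBMS; the table names, column names and sample rows are all made up for illustration.

```python
import sqlite3

# In-memory database as a stand-in for a production RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    country TEXT,
    gender  TEXT
);
CREATE TABLE visits (
    user_id    INTEGER REFERENCES users(user_id),
    site       TEXT,
    category   TEXT,
    visit_date TEXT
);
-- Hypothetical indexes on the fields the question wants to search by.
CREATE INDEX idx_users_country   ON users(country);
CREATE INDEX idx_visits_category ON visits(category, user_id);
""")

conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "FR", "f"), (2, "FR", "m"), (3, "DE", "f")],
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        (1, "games.example", "games", "2014-11-01"),
        (1, "news.example", "news", "2014-11-02"),
        (2, "news.example", "news", "2014-11-03"),
        (3, "games.example", "games", "2014-11-03"),
    ],
)


def users_by_country_and_category(conn, country, category):
    """Return sorted IDs of users from `country` who visited a `category` site."""
    rows = conn.execute(
        """
        SELECT DISTINCT u.user_id
        FROM users u
        JOIN visits v ON v.user_id = u.user_id
        WHERE u.country = ? AND v.category = ?
        ORDER BY u.user_id
        """,
        (country, category),
    ).fetchall()
    return [row[0] for row in rows]


french_gamers = users_by_country_and_category(conn, "FR", "games")
print(french_gamers)
```

A nightly update would then be a batch of inserts into the visits table, with the indexes maintained by the database.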

In advertising, and especially in real-time bidding, very fast retrieval of user profiles is needed. Aerospike is becoming more and more popular for this task.

Analysing data slices

A common use of business logs is to analyse specific slices of data, e.g. the number of users from France who visited sites in the "game" category on November 1-14, 2014. The standard way to manage such data efficiently is to organize it into data cubes. You won't get individual records (e.g. users), but you'll get aggregated statistics really fast.

Such cubes may have many different dimensions, but in 99% of cases they have a date field that they are partitioned by. This makes sense because almost every query specifies a time period to get data from.
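The idea of a date-partitioned cube can be sketched in a few lines of plain Python. This is a toy illustration, not how Vertica or Impala work internally: the dimension names and sample rows are made up, and keeping a set of user IDs per cell is only practical at toy scale (real systems store pre-aggregated counts or approximate distinct-count sketches).

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw impression log: (day, country, site category, user id).
impressions = [
    (date(2014, 11, 1), "FR", "games", "u1"),
    (date(2014, 11, 1), "FR", "games", "u1"),  # same user, second impression
    (date(2014, 11, 2), "FR", "games", "u2"),
    (date(2014, 11, 2), "DE", "games", "u3"),
    (date(2014, 11, 3), "FR", "news", "u1"),
    (date(2014, 11, 20), "FR", "games", "u4"),  # outside the queried period
]


def build_cube(rows):
    """Aggregate rows into cells keyed by date partition, then (country, category)."""
    cube = defaultdict(lambda: defaultdict(lambda: {"impressions": 0, "users": set()}))
    for day, country, category, user in rows:
        cell = cube[day][(country, category)]
        cell["impressions"] += 1
        cell["users"].add(user)
    return cube


def query(cube, start, end, country, category):
    """Return (impression count, distinct users) for one slice over a date range."""
    total = 0
    users = set()
    for day, cells in cube.items():
        # Only partitions inside the requested period are examined.
        if start <= day <= end and (country, category) in cells:
            cell = cells[(country, category)]
            total += cell["impressions"]
            users |= cell["users"]
    return total, len(users)


cube = build_cube(impressions)
print(query(cube, date(2014, 11, 1), date(2014, 11, 14), "FR", "games"))
```

Because the outer key is the date, a query for November 1-14 never has to look inside the partition for November 20.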

As for software, Vertica is great for such aggregations. A cheaper* solution from the Hadoop world is Impala, which is also great.

(* if you count only the license price)

Machine learning

It really depends on the concrete tasks and the ML toolkit in use. For real-time bidding you would want blazing fast access to user profile vectors and would probably prefer Aerospike. For online learning, Spark Streaming may be used as a data source, with no storage used at all. For offline machine learning there's the excellent MLlib from the same Spark project, which works with a variety of sources.
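As a toy illustration of the offline case, and of the question's "clustering users by the sites they visit", the sketch below represents each user as a set of visited sites and compares users with Jaccard similarity, one common input to clustering algorithms. It is plain Python rather than MLlib, and the user IDs and site names are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical user profiles: user id -> set of visited sites.
visits = {
    "u1": {"games.example", "chess.example"},
    "u2": {"games.example", "chess.example", "news.example"},
    "u3": {"news.example", "finance.example"},
}


def most_similar(user, visits):
    """Return the other user whose visited-site set is closest to `user`'s."""
    others = (u for u in visits if u != user)
    return max(others, key=lambda u: jaccard(visits[user], visits[u]))


print(most_similar("u1", visits))
```

At real scale the pairwise comparison would be replaced by a distributed clustering job, but the profile representation (user mapped to the sites visited) stays the same.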


The kind of data you store and analyze is very much dependent upon the kind of data you can gather. So, without knowing what your 'impression data' looks like, it is very hard to suggest how to normalize and store it.

Furthermore, the way you store data also depends on how you wish to analyze it. For example, if you want to perform basic analytics like page view counts or how many pages a user visits per session (SQL), the data needs to be stored differently than if you want to build recommendations based on traffic patterns (graph database).

Please edit your question to include more detail. Apologies that I cannot simply leave a comment.
