How to combine data from multiple Google Trends queries effectively?

As you might know, Google Trends works by normalising a random sample of the search-term data, and in my experience that sample changes at least once per day. This is not an issue for Western countries, but I am trying to conduct research based on search frequency in developing countries, where the lack of data makes some of the samples very small and the data very sparse (sometimes as few as 0-2 data points over a period of 18 months).
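
My understanding is that each sample is rescaled so that its own peak becomes 100 and everything else is rounded relative to that peak, which is why thin samples collapse into mostly zeros. A minimal sketch of that idea (the exact internals of Google Trends are not public, so treat this purely as an assumption):

```python
import numpy as np
import pandas as pd

def normalise_like_trends(raw_counts: pd.Series) -> pd.Series:
    """Scale a sampled count series so its peak is 100, rounding to integers.

    This mimics (as an assumption) how Google Trends appears to normalise
    each sample: small counts round down to 0, which is why sparse samples
    from low-volume regions end up with only a handful of non-zero weeks.
    """
    peak = raw_counts.max()
    if peak == 0:
        return raw_counts.astype(int)
    return (100 * raw_counts / peak).round().astype(int)

# Toy example: a sparse 18-month weekly sample where one week dominates.
weeks = pd.date_range("2022-01-02", periods=78, freq="W")
counts = pd.Series(np.random.poisson(0.3, size=78), index=weeks)
counts.iloc[40] = 12  # one busy week becomes the peak
print(normalise_like_trends(counts).value_counts().head())
```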

As other ways to acquire Trends data from Google have failed, I have resorted to scraping the data daily to investigate whether the samples can be combined in a way that produces the most representative series possible (the closest to the raw data we don't have). I have done a quick test with the most rudimentary way of combining the data sets: simply picking the maximum value for each week across the available data sets, since we want to avoid 0s and anything is better than a 0. I have identified three categories of search terms, detailed below. In each case the left graph is a plot of all the data sets and the right graph is the result of this rudimentary combination.
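
For reference, the rudimentary combination is just an element-wise maximum across the daily scrapes, roughly like this (a sketch assuming each scrape is saved as a CSV with a `date` column and an `interest` column over the same weekly range; the file layout and names are hypothetical):

```python
import glob
import pandas as pd

def combine_by_max(csv_paths: list[str], value_col: str = "interest") -> pd.Series:
    """Element-wise maximum across daily scraped Trends samples.

    Each CSV is assumed (hypothetically) to have a 'date' column and an
    'interest' column covering the same 18-month weekly range. Taking the
    max per week keeps any non-zero observation and avoids the zeros.
    """
    samples = [
        pd.read_csv(path, parse_dates=["date"]).set_index("date")[value_col]
        for path in csv_paths
    ]
    return pd.concat(samples, axis=1).max(axis=1)

combined = combine_by_max(sorted(glob.glob("scrapes/term_*.csv")))
```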

  1. Popular terms where the data is consistent

Here there is no point in doing this, since each data set is robust and has a value for every day. In fact the combination worsens the quality of the data for the purposes it is needed for. Perhaps it would be interesting to experiment with average/median values and see how they perform (see the next point).

  2. Weak terms with some data

As we can see, despite looking odd with many data points at 50 and 100, the result is quite decent, and some of the combined data sets perform better than a random sample. Would a better approach here be to take the average or the median, and should the samples that produce 0 be included when calculating those metrics? In theory, if a certain data point has appeared as 0 many times, does that mean the data for that week is scarcer than for other weeks where it is 0 less frequently? I can't prove it, but I feel that any non-zero data should be kept in the combined dataframe rather than lost to the median and average metrics (the median of [0,0,0,0,5] would be 0 and the average would be 1...) - see the sketch after this list for a comparison.

  3. Very weak terms that cannot be salvaged

There is simply too little data for this term; even multiple days of scraping return single spikes that usually have a high value (80-100) and tend to overlap. I don't think any of the aforementioned techniques would be useful, since the data points seem to always be large numbers, but I am open to ideas.
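
For the weak-terms case (point 2), here is a rough sketch of how the different aggregations could be compared, including a "mean of non-zero values" variant that keeps rare signal instead of letting the zeros drown it out (the toy numbers just mirror the [0,0,0,0,5] example above; the helper name is mine):

```python
import pandas as pd

def aggregate_samples(samples: pd.DataFrame) -> pd.DataFrame:
    """Compare per-week aggregations across daily scraped samples.

    `samples` has one row per week and one column per daily scrape.
    'nonzero_mean' averages only the non-zero observations for each week,
    falling back to 0 when every sample returned 0 that week.
    """
    nonzero = samples.where(samples > 0)  # zeros become NaN so they are ignored
    return pd.DataFrame({
        "max": samples.max(axis=1),
        "mean": samples.mean(axis=1),
        "median": samples.median(axis=1),
        "nonzero_mean": nonzero.mean(axis=1).fillna(0),
    })

# Toy example: five daily samples for one week, mirroring [0, 0, 0, 0, 5].
week = pd.DataFrame([[0, 0, 0, 0, 5]], index=["2023-01-01"])
print(aggregate_samples(week))
# -> max 5, mean 1.0, median 0.0, nonzero_mean 5.0
```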

Tags: normalization, time-series, google, data-cleaning, data-mining
