What kind of model to use to find drivers when data is aggregated and not on user level?

I have a website and have info from Google Analytics. So I can see the following features:

  1. page url
  2. country
  3. device category (phone, desktop, etc.)
  4. Number of sessions
  5. Number of users: users who have initiated at least one session during the date range
  6. Avg. time on page
  7. Page views
  8. Bounce rate: single-page sessions divided by all sessions, i.e. the percentage of all sessions on the site in which the user viewed only a single page (e.g. following a link directly to a blog post, reading only that post, and leaving without interacting with the rest of the site). A bounced session has a duration of 0 seconds. For more info see https://support.google.com/analytics/answer/1009409?hl=en-GB.

So you would see something like this:

| page url | country | # sessions | # users | avg. time on page | page views | bounce rate |
|---|---|---|---|---|---|---|
| x.com/blog1 | USA | 100 | 40 | 1 minute | 250 | 30% |
| x.com/blog2 | Mexico | 20 | 10 | 5 minutes | 350 | 90% |
| x.com/blog3 | Panama | 3 | 1 | 5 minutes | 10 | 0% |

I am trying to find drivers of bounce rate, so I thought of predicting bounce rate with, say, a random forest and then looking at feature importance. If the data were at session level, this would be a straightforward classification task (is_bounce: did that person initiate a session and leave?), because each session's bounce rate would be either 0% or 100%. Because of the aggregation shown above, though, this cannot be done directly: in the first row of the sample data, the 100 sessions contain both bounced and non-bounced sessions.
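To make the aggregation issue concrete, here is a minimal sketch of what each aggregated row hides. It assumes that the reported bounce rate is exactly bounced sessions divided by sessions, so the bounced count can be recovered by rounding. The helper names `to_counts` and `expand_to_sessions` are hypothetical, and the rows are just the sample data from the table.

```python
# Sample rows from the table: (page url, country, sessions, bounce rate as a fraction)
rows = [
    ("x.com/blog1", "USA", 100, 0.30),
    ("x.com/blog2", "Mexico", 20, 0.90),
    ("x.com/blog3", "Panama", 3, 0.00),
]

def to_counts(rows):
    """Turn each aggregated row into (page, country, bounced, not_bounced)."""
    out = []
    for page, country, sessions, rate in rows:
        # Assumption: rate == bounced / sessions, so rounding recovers the count.
        bounced = round(sessions * rate)
        out.append((page, country, bounced, sessions - bounced))
    return out

def expand_to_sessions(counts):
    """Hypothetical helper: one (page, country, is_bounce) record per session."""
    sessions = []
    for page, country, bounced, not_bounced in counts:
        sessions += [(page, country, 1)] * bounced
        sessions += [(page, country, 0)] * not_bounced
    return sessions

counts = to_counts(rows)
session_rows = expand_to_sessions(counts)

for page, country, bounced, not_bounced in counts:
    print(page, country, "bounced:", bounced, "not bounced:", not_bounced)
print("pseudo session rows:", len(session_rows))
```

Under that assumption, every aggregated row is just a group of binary is_bounce outcomes that share the same page, country, and device, which is what I would need for a session-level classifier. Session-specific information (such as each session's own time on page) is still lost.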

I don't know how much sense it makes to predict bounce rate if it's not a classification problem. It is already clear that the strong predictors would be low values of avg. time on page, page views, page views per session, etc. What I was hoping to see is whether country, device, or other features drive bounce rate, not necessarily time/users/sessions.

Any thoughts on how to find drivers of bounce rate based on the info above? How should I phrase this problem? Or how could I find ways to lower bounce rate?

By the way, the distribution of bounce rate is strongly bimodal, with 85% of the data having a bounce rate of exactly 0% or 100%. So I could keep only that 85% of the data, treat it as classification, and use country, device, etc. as features.
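As a rough sketch of that filtering idea, assuming the export is a list of rows with a `bounce_rate` fraction (the field names and the `split_extreme` helper are made up for illustration, and the rows are the sample data from the table):

```python
rows = [
    {"page": "x.com/blog1", "country": "USA", "sessions": 100, "bounce_rate": 0.30},
    {"page": "x.com/blog2", "country": "Mexico", "sessions": 20, "bounce_rate": 0.90},
    {"page": "x.com/blog3", "country": "Panama", "sessions": 3, "bounce_rate": 0.00},
]

def split_extreme(rows):
    """Separate rows with a bounce rate of exactly 0% or 100% from the rest."""
    extreme = [r for r in rows if r["bounce_rate"] in (0.0, 1.0)]
    mixed = [r for r in rows if r["bounce_rate"] not in (0.0, 1.0)]
    return extreme, mixed

extreme, mixed = split_extreme(rows)

# Share of rows kept versus share of sessions kept: these can differ a lot.
row_share = len(extreme) / len(rows)
session_share = sum(r["sessions"] for r in extreme) / sum(r["sessions"] for r in rows)

# Binary labels for the kept rows: 1 = every session bounced, 0 = none did.
labels = [(r["country"], int(r["bounce_rate"] == 1.0)) for r in extreme]

print(f"rows kept: {row_share:.0%}, sessions kept: {session_share:.1%}")
print(labels)
```

In the sample data the only row that survives the filter is the one with 3 sessions, so the share of rows kept and the share of sessions kept are very different. I would want to check both numbers on the real data before dropping the remaining 15%.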

Topic feature-importances hypothesis-testing decision-trees random-forest

Category Data Science
