Suggestions for studying *Clickstream* data

I've essentially been handed a dataset of website access history and I'm trying to draw some conclusions from it.

The data supplied gives me the web URL, the datetime when it was accessed, and the unique ID of the user accessing it. This means that for a given user ID, I can see a timeline of how they went through the website and what pages they looked at.
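For concreteness, here is a minimal sketch of how such per-user timelines could be built with the standard library only. The sample rows, the column order, and the `build_timelines` helper are assumptions made for illustration, not part of the actual dataset.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (user_id, ISO timestamp, URL).
rows = [
    ("u1", "2021-03-01T10:05:00", "/products/shoes"),
    ("u2", "2021-03-01T10:01:00", "/blog/post-1"),
    ("u1", "2021-03-01T10:00:00", "/home"),
    ("u1", "2021-03-01T10:09:00", "/checkout"),
    ("u2", "2021-03-01T10:04:00", "/blog/post-2"),
]

def build_timelines(rows):
    """Group page views by user and order each user's views by time."""
    by_user = defaultdict(list)
    for user_id, timestamp, url in rows:
        by_user[user_id].append((datetime.fromisoformat(timestamp), url))
    # Sorting the (datetime, url) tuples orders each user's views chronologically.
    return {user: [url for _, url in sorted(views)] for user, views in by_user.items()}

timelines = build_timelines(rows)
```

Each value in `timelines` is then one user's ordered list of URLs, which is the input most clickstream methods expect.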

I'd quite like to try clustering these users into different categories (it's obvious that some users look at a specific portion of the website compared to others) but I really don't know how to do this.

Things I've looked at:

  • Markovclick - This allows me to supply a clickstream of pages and get a Markov probability matrix. I've binned the pages down to around 60 categories, but this library doesn't allow for comparing users who accessed mutually exclusive sets of pages.
  • Predicting website exits with machine learning - I quite like the approach here for calculating various metrics based on a user's history but I haven't found anything particularly interesting.
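One possible way around the exclusive-pages problem is to build every user's transition matrix over the same shared page vocabulary, so that all matrices have the same shape and can be compared directly. The sketch below is plain Python and does not use Markovclick's API; the sample paths, the helper names, and the choice of Frobenius distance are all assumptions for illustration.

```python
import math

def transition_matrix(path, pages):
    """Row-normalised first-order transition probabilities for one user's path."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    counts = [[0.0] * n for _ in range(n)]
    for src, dst in zip(path, path[1:]):
        counts[index[src]][index[dst]] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        # Pages the user never left keep an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def matrix_distance(a, b):
    """Frobenius distance between two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

# Hypothetical binned paths; u2 shares no pages with u1 or u3.
paths = {
    "u1": ["home", "products", "checkout", "home", "products"],
    "u2": ["blog", "docs", "blog", "docs"],
    "u3": ["home", "products", "home", "products"],
}

# One shared vocabulary for all users keeps the matrices aligned.
pages = sorted({page for path in paths.values() for page in path})
matrices = {user: transition_matrix(path, pages) for user, path in paths.items()}
```

With the matrices aligned like this, `matrix_distance(matrices["u1"], matrices["u3"])` comes out smaller than the distance between `u1` and `u2`, and the flattened matrices could be fed to any standard clustering algorithm.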

Are there any suggestions for approaches? I'm quite surprised at how little I've managed to find on this kind of work because I naively assumed this is a very popular topic.

Many thanks

Topic web-scraping markov-hidden-model markov-process

Category Data Science


One solution is to vectorize the data (with node2vec or graph2vec, for example) and then apply dimensionality reduction to make the clusters visible, as this projector does using t-SNE or UMAP: https://projector.tensorflow.org/

In this way, users who visit similar pages should end up close together, giving you clusters of users according to the web pages visited.

https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

https://umap-learn.readthedocs.io/en/latest/
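As a rough sketch of that pipeline, the example below embeds users with scikit-learn's t-SNE and groups them with k-means. Note the substitutions: it uses simple per-user page-visit shares instead of node2vec or graph2vec embeddings, and the page categories, the synthetic counts, and the choice of two clusters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Hypothetical binned page categories and synthetic per-user visit counts.
pages = ["home", "products", "checkout", "blog", "docs"]
rng = np.random.default_rng(0)
shoppers = rng.poisson(lam=[5, 8, 3, 0.2, 0.2], size=(20, 5))
readers = rng.poisson(lam=[1, 0.2, 0.2, 8, 6], size=(20, 5))
counts = np.vstack([shoppers, readers]).astype(float)

# Convert raw counts to visit shares so heavy and light users are comparable.
shares = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)

# 2-D embedding for plotting; perplexity must be smaller than the number of users.
embedding = TSNE(n_components=2, perplexity=10, init="pca", random_state=0).fit_transform(shares)

# Cluster on the share vectors, then colour the embedding by label when plotting.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(shares)
```

Clustering is done on the share vectors rather than on the t-SNE output because t-SNE distorts distances between points; the embedding is best treated as a visual check on the clusters.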
