How to score different clusters of features for predictiveness?

Question

Edvard-D

2021-12-22 18:04

I have a set of true/false data in which each value records whether a given feature was active when a data snapshot was taken. A snapshot is recorded whenever the user takes an action. The goal is to find clusters of features that were true at the same time and that are predictive of the user taking that action.

To provide some more context, I'm working on a program that analyzes data recorded while players play World of Warcraft. The actions I referred to earlier are abilities the player can use in game. The features are varied and represent the state of the game, but each one is either true or false. My goal is to create a system that can take these data snapshots and learn when different abilities should be used. This will then be used in an addon (aka game mod) to suggest to the player which ability to use at any given moment. There will be two phases of data analysis: a first pass over the data snapshots that happens on the players' computers, after which the data is uploaded to the cloud, and a second phase in which this aggregated data is analyzed further in the cloud.

Right now the plan is to use agglomerative hierarchical clustering (AHC) as a first step to group features together into clusters. The reason for this approach is that it is very likely that an ability should be used when cluster A is true on its own, when cluster B is true on its own, and also when A and B are true at the same time. Using AHC allows those sub-clusters to be identified.
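To make the AHC step concrete, here is a minimal sketch of what I have in mind, not a finished implementation. It assumes SciPy is available and that the snapshots fit in a boolean matrix with one row per snapshot and one column per feature. The toy data, the Jaccard distance, average linkage, and the 0.5 cut-off are all placeholder choices, and `cluster_features` is a hypothetical helper name.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_features(snapshots, threshold=0.5):
    """Group boolean feature columns whose activation patterns overlap."""
    # Transpose so that each row handed to pdist is one feature, not one snapshot.
    features = np.asarray(snapshots, dtype=bool).T
    # Jaccard distance: share of snapshots where exactly one of two features
    # is true, among the snapshots where at least one of them is true.
    distances = pdist(features, metric="jaccard")
    # Build the merge tree bottom-up (agglomerative, average linkage).
    tree = linkage(distances, method="average")
    # Cut the tree: features closer than `threshold` end up in one cluster.
    labels = fcluster(tree, t=threshold, criterion="distance")
    clusters = {}
    for feature_index, label in enumerate(labels):
        clusters.setdefault(int(label), []).append(feature_index)
    return sorted(clusters.values())


# Hypothetical toy data: 6 snapshots (rows) of 5 boolean features (columns).
toy_snapshots = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
feature_clusters = cluster_features(toy_snapshots)
print(feature_clusters)
```

On this toy data, features 0 to 2 always fire together and features 3 and 4 always fire together, so cutting the tree at 0.5 keeps them as two separate clusters instead of one merged group. Cutting at different heights would expose the sub-clusters and super-clusters mentioned above.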

The problem is that I'm not sure how to decide which clusters are good predictors of an ability being used. As an example, imagine that cluster A is predictive of an ability being used while cluster B is not. During the AHC process a super cluster AB will eventually be created. This cluster is not predictive of the ability being used because of the extra B elements. At the same time, cluster A may be made up of points 1, 2, and 3. These points may all need to be true at the same time in order to be predictive; one or two of them grouped together should not be considered predictive. The system should be able to identify that cluster A is predictive, while AB and any subset of points 1, 2, and 3 short of the full cluster A are not.

How would you go about scoring how predictive a given cluster of features is? It's important that the scoring is done at the cluster level, not at the individual feature level.
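To illustrate what I mean by cluster-level scoring, here is a rough sketch of one candidate measure; I am not sure it is the right one. It assumes a cluster "fires" only when every feature in it is true, and that each snapshot is labelled with the ability that was used, so snapshots for other abilities act as the negatives. The function name `score_cluster`, the ability names, and the toy data are made up. The score combines support, precision, lift, and leverage, where leverage is P(cluster and ability) - P(cluster) * P(ability).

```python
def score_cluster(snapshots, abilities, cluster, ability):
    """Score the rule 'all features in `cluster` are true -> `ability` is used'."""
    total = len(snapshots)
    # How often the ability is used overall (the base rate).
    ability_count = sum(1 for used in abilities if used == ability)
    # Abilities used in the snapshots where every feature of the cluster is true.
    matched = [
        used
        for snapshot, used in zip(snapshots, abilities)
        if all(snapshot[f] for f in cluster)
    ]
    hits = sum(1 for used in matched if used == ability)
    p_cluster = len(matched) / total
    p_ability = ability_count / total
    precision = hits / len(matched) if matched else 0.0
    lift = precision / p_ability if p_ability else 0.0
    # Leverage rewards precise clusters but shrinks when support is small.
    leverage = hits / total - p_cluster * p_ability
    return {
        "support": len(matched),
        "precision": precision,
        "lift": lift,
        "leverage": leverage,
    }


# Hypothetical toy data: features 0-2 form cluster A, features 3-4 form cluster B.
toy_snapshots = [
    (1, 1, 1, 0, 0),
    (1, 1, 1, 0, 0),
    (1, 1, 1, 1, 1),
    (1, 1, 0, 0, 0),
    (1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1),
    (0, 1, 1, 1, 1),
    (0, 0, 0, 0, 0),
]
toy_abilities = ["fireball"] * 3 + ["frostbolt"] * 5

candidates = {"A": (0, 1, 2), "B": (3, 4), "AB": (0, 1, 2, 3, 4), "part of A": (0, 1)}
scores = {
    name: score_cluster(toy_snapshots, toy_abilities, cluster, "fireball")
    for name, cluster in candidates.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]["leverage"]):
    print(name, score)
```

In this toy example, leverage ranks A first, the partial cluster second, AB third, and B last with a negative value. Note that under the "all features true" assumption AB keeps the same precision as A and only loses support, so precision or lift alone would not separate the two; whether leverage, or a comparison of each cluster against its parent and children in the tree, is the right way to do that is exactly what I am unsure about.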

Topic agglomerative unsupervised-learning clustering

Category Data Science
