What is the formal category of problem described by identifying consecutive occurrences of attributes in records?
Apologies for the garbled title; I'd really need to know the answer to the question before I could phrase it properly.
Let's imagine I've got a data set of football (soccer, if you prefer) match results.
Let's further imagine that each result has the following attributes:
- Date
- Venue
- Team
- Opponent
- Home Team Goals
- Away Team Goals
- Result
Then let's consider a future match, for which we know some attributes but not all (obviously, because it hasn't happened yet):
- Date - W
- Venue - X
- Team - Y
- Opponent - Z
Given the future match and the set of results, I want to produce some interesting pieces of information that are relevant to that match. Deciding what counts as interesting is probably still something of a manual step, so the automated part is really about finding ALL sequences, so that the interesting ones can then be picked out.
For example:
- Team Y have won their last 3 games
- Team Z have lost their last 3 games
- Team Y have won their last 2 games against Team Z
- Team Y have won their last 6 games against Team Z at Venue X
These examples are trivial, but the trick I am looking for is to algorithmically compose the qualification criteria, i.e. "Team Y" or "Team Y against Team Z".
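To make "composing the qualification criteria" concrete, here is a minimal brute-force sketch of what I have in mind. It assumes each result is a dict with the hypothetical keys `team`, `opponent`, `venue` and `result`, recorded from one team's perspective and sorted oldest first. Every subset of the future match's known attributes is used as a filter, and the current run of identical results is measured under each filter. The function names and record layout are only for illustration.

```python
from itertools import combinations

def current_streak(results):
    """Return (length, outcome) of the run of identical results at the end."""
    if not results:
        return 0, None
    last = results[-1]["result"]
    length = 0
    for r in reversed(results):
        if r["result"] != last:
            break
        length += 1
    return length, last

def streaks_for_match(history, team, known):
    """Current streak of `team` under every subset of the known attributes.

    history: list of dicts sorted oldest first (hypothetical layout)
    known:   attributes of the future match, e.g. {"opponent": "Z", "venue": "X"}
    """
    team_games = [r for r in history if r["team"] == team]
    keys = sorted(known)
    found = {}
    for size in range(len(keys) + 1):
        for subset in combinations(keys, size):
            # Keep only games matching every attribute in this subset.
            games = [r for r in team_games
                     if all(r[k] == known[k] for k in subset)]
            found[subset] = current_streak(games)
    return found

history = [
    {"team": "Y", "opponent": "Z", "venue": "X", "result": "W"},
    {"team": "Y", "opponent": "Q", "venue": "X", "result": "L"},
    {"team": "Y", "opponent": "Z", "venue": "X", "result": "W"},
    {"team": "Y", "opponent": "Q", "venue": "P", "result": "W"},
]
streaks = streaks_for_match(history, "Y", {"opponent": "Z", "venue": "X"})
for criteria, (length, outcome) in streaks.items():
    print(criteria, length, outcome)
```

The empty subset corresponds to "Team Y" alone, `("opponent",)` to "Team Y against Team Z", and so on. The number of subsets grows exponentially with the number of attributes, which is part of why I suspect a better-known technique exists.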
I don't think it's relevant to the question, but three heuristics for semi-automating the selection of the 'interesting' sequences from the set of all sequences will be:
- Preferring sequences that have occurred the fewest times previously (so "Team Y has won 3 games in a row for the first time" supersedes "Team Y has won 3 games in a row for the third distinct time")
- Preferring the most general sequence of the same length (so "Team Y has won 3 games in a row" supersedes "Team Y has won 3 games in a row against Team Z")
- Preferring sequences of greater length
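As a rough sketch, these heuristics could be expressed as a sort key. The `Streak` type and its fields are hypothetical, and the precedence between the three rules (length first, then rarity, then generality) is an assumption, since I have not settled on an order.

```python
from typing import NamedTuple

class Streak(NamedTuple):
    criteria: tuple    # attributes used as filters; empty means most general
    length: int        # number of consecutive identical results
    prior_count: int   # how many times a run this long has occurred before

def rank(streaks):
    # Assumed precedence: longer first, then rarer, then more general.
    return sorted(
        streaks,
        key=lambda s: (-s.length, s.prior_count, len(s.criteria)),
    )

candidates = [
    Streak(("opponent",), 3, 0),
    Streak((), 3, 0),
    Streak((), 3, 2),
    Streak(("opponent", "venue"), 6, 1),
]
ranked = rank(candidates)
```

With this key, the 6-game run sorts first, and among the 3-game runs the first-time, unqualified one is preferred.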
I feel absolutely certain this must be a common category of problem with common algorithms and tools, but when I try to google it I'm not getting any useful results, presumably because I am using the wrong terminology. Whenever I look for anything related to sequence detection, I get information about sequence databases, and that's not really what I have. What I've got is rather more akin to a transaction database of itemsets.
Can anyone give me some guidance on:
- Terminology for this type of problem (so that I can use this information to identify...)
- Common algorithms used to tackle it
- Common tools used to tackle it
Topic sequence data-mining
Category Data Science