What is the formal category of problem described by identifying consecutive occurrences of attributes in records?

Apologies for the garbled title; I'd need to know the answer to the question before I could phrase it properly.

Let's imagine I've got a data set of football (soccer, if you prefer) match results.

Let's further imagine that each result has the following attributes:

  • Date
  • Venue
  • Team
  • Opponent
  • Home Team Goals
  • Away Team Goals
  • Result

Then let's consider a future match, for which we know some attributes but not all (obviously, because it hasn't happened yet):

  • Date - W
  • Venue - X
  • Team - Y
  • Opponent - Z

Given the future match and the set of results, I want to produce some interesting pieces of information that are relevant to that match. Deciding what counts as interesting is probably still something of a manual step, so the automated part is really finding ALL sequences so that the interesting ones can then be picked out.

For example:

  • Team Y have won their last 3 games
  • Team Z have lost their last 3 games
  • Team Y have won their last 2 games against Team Z
  • Team Y have won their last 6 games against Team Z at Venue X

These examples are trivial, but the trick I am looking for is to algorithmically compose the qualification criteria, i.e. "Team Y" or "Team Y against Team Z".
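To make "composing the qualification criteria" concrete, here is a minimal sketch, not a definitive implementation. It assumes each result is stored as a dict from one team's point of view with hypothetical field names (`date`, `venue`, `team`, `opponent`, `result`), that dates are ISO strings, and that the team is always part of the criteria while every subset of the other known attributes is tried in turn. The sample data is invented for illustration.

```python
from itertools import combinations

# Hypothetical sample data, one row per match from Team Y's point of view.
results = [
    {"date": "2023-01-01", "venue": "X", "team": "Y", "opponent": "Z", "result": "L"},
    {"date": "2023-01-08", "venue": "Q", "team": "Y", "opponent": "A", "result": "W"},
    {"date": "2023-01-15", "venue": "X", "team": "Y", "opponent": "Z", "result": "W"},
    {"date": "2023-01-22", "venue": "Q", "team": "Y", "opponent": "Z", "result": "W"},
    {"date": "2023-01-29", "venue": "X", "team": "Y", "opponent": "B", "result": "W"},
]

# The known attributes of the future match.
future = {"venue": "X", "team": "Y", "opponent": "Z"}


def current_streak(matches):
    """Return (outcome, length) of the run that ends the date-ordered matches."""
    if not matches:
        return None, 0
    last = matches[-1]["result"]
    length = 0
    for match in reversed(matches):
        if match["result"] != last:
            break
        length += 1
    return last, length


def all_streaks(results, future, optional_keys=("opponent", "venue")):
    """Try every criteria set: 'team' plus each subset of the optional keys."""
    ordered = sorted(results, key=lambda r: r["date"])  # ISO dates sort as text
    found = {}
    for size in range(len(optional_keys) + 1):
        for subset in combinations(optional_keys, size):
            keys = ("team",) + subset
            matching = [r for r in ordered if all(r[k] == future[k] for k in keys)]
            found[keys] = current_streak(matching)
    return found


streaks = all_streaks(results, future)
for keys, (outcome, length) in streaks.items():
    print(keys, outcome, length)
```

With n optional attributes this tries 2^n criteria sets, which is fine for a handful of attributes but would need pruning for many.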

I don't think it's relevant to the question, but three heuristics for semi-automating the process of selecting the "interesting" sequences from the set of all sequences will be:

  • Preferring sequences that have occurred the fewest times previously (so "Team Y has won 3 games in a row for the first time" supersedes "Team Y has won 3 games in a row for the third distinct time")
  • Preferring the most general sequence of the same length (so "Team Y has won 3 games in a row" supersedes "Team Y has won 3 games in a row against Team Z")
  • Preferring sequences of greater length
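As a rough sketch of how these three heuristics might be combined, the snippet below filters and sorts a list of candidate streaks. The candidate structure (`criteria`, `outcome`, `length`, `prior_occurrences`) and the sample values are hypothetical. The relative priority of the heuristics is not fixed above, so ranking by length first and rarity second is purely an assumption.

```python
# Hypothetical candidates; 'prior_occurrences' counts earlier streaks of the
# same kind and at least the same length.
candidates = [
    {"criteria": ("team",), "outcome": "W", "length": 3, "prior_occurrences": 2},
    {"criteria": ("team", "opponent"), "outcome": "W", "length": 3, "prior_occurrences": 0},
    {"criteria": ("team", "opponent", "venue"), "outcome": "W", "length": 6, "prior_occurrences": 0},
]


def is_superseded(candidate, others):
    """True if a strictly more general streak has the same outcome and length."""
    return any(
        set(other["criteria"]) < set(candidate["criteria"])
        and other["outcome"] == candidate["outcome"]
        and other["length"] == candidate["length"]
        for other in others
    )


def rank_streaks(candidates):
    # Most general sequence of the same length wins: drop the specific ones.
    kept = [c for c in candidates if not is_superseded(c, candidates)]
    # Assumed priority: longer streaks first, then the rarer ones.
    return sorted(kept, key=lambda c: (-c["length"], c["prior_occurrences"]))


ranked = rank_streaks(candidates)
for c in ranked:
    print(c["criteria"], c["outcome"], c["length"])
```

Here the "Team Y against Team Z" streak of 3 is dropped because the more general "Team Y" streak has the same length, and the 6-game streak is listed first.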

I feel absolutely certain this must be a common category of problem with common algorithms and tools, but when I try to google it I'm not getting any useful results, presumably because I am using the wrong terminology. Whenever I look for anything related to sequence detection, I get information related to sequence databases, and that's not really what I have. I've got something rather more akin to a transaction database of itemsets.

Can anyone give me some guidance on:

  1. Terminology for this type of problem (so that I can use this information to identify...)
  2. Common algorithms used to tackle it
  3. Common tools used to tackle it

Topic sequence data-mining

Category Data Science


What you describe is feature engineering, or more specifically handcrafting features. Sequence detection is indeed the term for crafting features like "team X lost their last 5 games". If you think about it, you get sequential data if you order transaction data by time.
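To illustrate the last point, here is a minimal sketch that sorts transaction-style rows by date and collapses them into runs of identical outcomes. The row layout `(date, team, result)` and the sample values are assumptions made for the example.

```python
from itertools import groupby

# Hypothetical transaction-style rows, deliberately out of order.
rows = [
    ("2023-02-05", "Y", "L"),
    ("2023-01-01", "Y", "W"),
    ("2023-01-08", "Y", "W"),
    ("2023-01-22", "Y", "L"),
    ("2023-01-15", "Y", "W"),
]


def runs_by_time(rows):
    """Sort rows by date, then run-length encode consecutive equal results."""
    ordered = sorted(rows, key=lambda r: r[0])  # ISO dates sort as text
    return [
        (outcome, len(list(group)))
        for outcome, group in groupby(ordered, key=lambda r: r[2])
    ]


runs = runs_by_time(rows)
print(runs)
```

The last entry of `runs` is the current streak, and the earlier entries give the history needed to say how often a streak of that length has happened before.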

Market basket analysis and association rule mining are classical methods for studying this kind of problem. Modern deep learning would point you to the class of sequential models, e.g. RNNs and LSTMs.
