Given daily sequence of events with only event ID labels (alphanum strings), what algorithms can be used to detect sequences that are outliers?

For example, the data might be something like this:

Sequence 1: [ABC, AAA, ZZ123, RRZZZ45, AABBCC]
Sequence 2: [CBA, AAA, YY123, LMNOP, AABBCC]
Sequence 3: [ABC, AAA, ZZ123, RRZZZ45, AABBCC]
...
Sequence N: [DEF, AAA, ZZ123, YYZZZ45, AABBCC]

Sequence 1 and 3 are the same, but sequence 2 and N are different.

In the data set, there will be thousands of these sequences every day.

Additional questions:

  1. How could I calculate similarity (or difference) measure between sequences with sequences of labels like this? If so, how would I do this in Python? Examples?
  2. Is it possible to use clustering? How would I do that?
  3. Given a partial sequence, how could I predict, the remainder of the event sequence?

I appreciate your input.

Topic labels distance sequence outlier clustering

Category Data Science


After digging a little more, the answer was already there, but I just wasn't aware. It turns out that SequenceMatcher, in the Python Standard library, can handle this.

It can provide a few ratio match scores that seem to work well. You can essentially create a similarity score between sequence lists of strings.

I used this to create a custom similarity/difference function and used this with scipy.spatial.distance pdist and squareform to calculate similarity/distance matrices. This works well as long as each sequence is the same length.

In other datasets where they are ragged (not the same length), I used itertools.combinations to create a pairwise list of sequences to compare, compared each pair with my custom SequenceMatcher-based function, created a similarity list, generated similarity matrix with squareform(), calculated distance matrix, and used that for hierarchical clustering.


Assuming that sequences does not have any semantic meaning, we can adapt string distance measure like hamming distance or Levenshtein distance to compare similar sequences. Every entry in the list of sequnces will be considered like string and return similarity between sequnces. Libarary for these distance should be easily available in python.

For Sequence completion you could use something like Hidden Markov Model or Sequence models like RNN or LSTM.

Clustering may be a little diffcult on this. You could have done Hierarchical clustering on distance matrix but given 1000s of sequences this may be very slow.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.