Given daily sequences of events identified only by event ID labels (alphanumeric strings), what algorithms can be used to detect sequences that are outliers?
For example, the data might be something like this:
Sequence 1: [ABC, AAA, ZZ123, RRZZZ45, AABBCC]
Sequence 2: [CBA, AAA, YY123, LMNOP, AABBCC]
Sequence 3: [ABC, AAA, ZZ123, RRZZZ45, AABBCC]
...
Sequence N: [DEF, AAA, ZZ123, YYZZZ45, AABBCC]
Sequences 1 and 3 are identical, while sequences 2 and N each differ from them.
In the data set, there will be thousands of these sequences every day.
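To make the example concrete, here is a minimal sketch of how the sample data could be held in Python. The dictionary keys and the `mismatches` helper are placeholders made up for illustration, not part of any library, and the position-by-position count is only a naive baseline, not a proposed solution.

```python
from collections import Counter

# Hypothetical in-memory representation of the example sequences;
# the keys ("seq_1", ...) are placeholders.
sequences = {
    "seq_1": ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"],
    "seq_2": ["CBA", "AAA", "YY123", "LMNOP", "AABBCC"],
    "seq_3": ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"],
    "seq_n": ["DEF", "AAA", "ZZ123", "YYZZZ45", "AABBCC"],
}

# Count how often each exact sequence occurs
# (tuples are hashable, lists are not).
counts = Counter(tuple(seq) for seq in sequences.values())

# Names of sequences that occur exactly once in this tiny sample.
unique = [name for name, seq in sequences.items() if counts[tuple(seq)] == 1]


def mismatches(a, b):
    """Count positions at which two sequences differ.

    Assumes equal-length sequences; zip() silently ignores any
    extra trailing items in the longer one.
    """
    return sum(x != y for x, y in zip(a, b))


print(unique)
print(mismatches(sequences["seq_1"], sequences["seq_2"]))
```

Exact-match counting only flags sequences that are not duplicated, and the positional count breaks down as soon as one event is inserted or dropped, which is why I am asking about better measures.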
Additional questions:
- How could I calculate a similarity (or difference) measure between sequences of labels like these? How would I do this in Python? Examples?
- Is it possible to use clustering? How would I do that?
- Given a partial sequence, how could I predict the remainder of the event sequence?
I appreciate your input.
Topic labels distance sequence outlier clustering
Category Data Science