Process mining with ML
I have a somewhat general question. My dataset consists of N sequences of events. One sequence might be [A,B,C,D,X,Y] and another [A,B,Z], where each letter represents a different event type. The sequences are at most 80 steps long.
The idea is to predict the next letter (the next step) from the known previous events. As a very simple example, maybe B always comes after A. The next step would be to record the time of each event, and the ultimate goal is to predict how long it will take until the process reaches a specific event.
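To make the task concrete, next-event prediction can be framed as (known prefix, next event) pairs. This is only a minimal sketch of that framing, using the two example sequences from above; the helper name `make_prefix_pairs` is hypothetical and not taken from any library.

```python
from typing import List, Tuple


def make_prefix_pairs(sequences: List[List[str]]) -> List[Tuple[List[str], str]]:
    """Turn each sequence into (known prefix, next event) training pairs."""
    pairs = []
    for seq in sequences:
        # Every position after the first one yields one training example:
        # the events seen so far are the input, the event at i is the target.
        for i in range(1, len(seq)):
            pairs.append((seq[:i], seq[i]))
    return pairs


# Illustrative data only: the two example sequences from the question.
sequences = [["A", "B", "C", "D", "X", "Y"], ["A", "B", "Z"]]
pairs = make_prefix_pairs(sequences)

for prefix, target in pairs:
    print(prefix, "->", target)
```

Each sequence of length n contributes n - 1 pairs, so many short sequences still give a reasonably large training set.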
I tried an N-gram model, an MLP neural network, and lastly an LSTM network, which reached around 80% accuracy.
That would not be bad if the events were balanced in the dataset, but they are not. To account for that, I used a weighted loss function when training the LSTM, and the overall accuracy then dropped to around 66%. However, the less frequent classes now have much higher accuracy (still not perfect, but higher). How can I create a model that has the best of both, one that learns the less frequent AND the most frequent classes at the same time?
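Since overall accuracy hides how the rare classes behave, it may help to compare the models on per-class recall and on balanced accuracy (the mean of the per-class recalls) as well. The sketch below uses made-up labels, not my real results, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def per_class_recall(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Fraction of correctly predicted examples for each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: correct[cls] / total[cls] for cls in total}


def balanced_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Made-up labels for illustration: "B" is frequent, "Z" is rare.
y_true = ["B"] * 8 + ["Z"] * 2
y_pred = ["B"] * 8 + ["B", "Z"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)

print("accuracy:", accuracy)
print("per-class recall:", recalls)
print("balanced accuracy:", bal_acc)
```

In this toy case the plain accuracy is 0.9 while the balanced accuracy is only 0.75, because half of the rare class is missed. Reporting both numbers for the weighted and unweighted LSTM would show the trade-off more clearly than a single accuracy figure.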
I have also read that tree-based methods perform very well on imbalanced datasets. However, all the examples I found consider one long time series, whereas my data consist of many short sequences. Is it possible to train a RandomForest on such data? How?
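One possibility I can think of, which may or may not be the right approach, is to turn every prefix into a fixed-length row holding the last few events, padded on the left, so that all sequences end up in one ordinary feature table. The sketch below assumes a window of 3 events and simple integer codes; `encode_windows`, `PAD`, and the window size are my own hypothetical choices.

```python
from typing import Dict, List, Tuple

PAD = 0  # integer code reserved for "no event yet" (assumption)


def encode_windows(
    sequences: List[List[str]], window: int = 3
) -> Tuple[List[List[int]], List[str], Dict[str, int]]:
    """Build fixed-length rows of the last `window` events before each target."""
    # Assign integer codes to events in order of first appearance, starting at 1.
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for event in seq:
            vocab.setdefault(event, len(vocab) + 1)

    X, y = [], []
    for seq in sequences:
        codes = [vocab[event] for event in seq]
        for i in range(1, len(codes)):
            recent = codes[max(0, i - window):i]
            # Left-pad short prefixes so every row has exactly `window` columns.
            row = [PAD] * (window - len(recent)) + recent
            X.append(row)
            y.append(seq[i])
    return X, y, vocab


# Illustrative data only: the two example sequences from the question.
sequences = [["A", "B", "C", "D", "X", "Y"], ["A", "B", "Z"]]
X, y, vocab = encode_windows(sequences, window=3)

for row, target in zip(X, y):
    print(row, "->", target)
```

Rows like these could then be passed to a tabular model, for example scikit-learn's `RandomForestClassifier` with `class_weight="balanced"`. Integer codes impose an arbitrary order on the events, so one-hot encoding each column might work better; I have not tested either.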
If you know of a different algorithm or method that could be applied to such data, please post it :)
Thank you.
Topic lstm sequential-pattern-mining machine-learning
Category Data Science