What are the typical things to look for in the data when implementing survival models using machine learning?

Problem Scenario

I am working on an industry-specific problem focused on predicting the failure of a seal/gasket within a given time interval (T) in a high-pressure compression environment. Whenever this seal/gasket breaks, there is a loss of pressure and a leak. This leak is extremely dangerous, and the gas in question is H2, which makes things even scarier. The specific problem is this: predict the likelihood of the seal surviving past a time Ti, given that the event has not happened yet. This Ti typically lies in the future, for example 2 days, 2 weeks, or 2 months.
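
In survival terms, this quantity is the conditional survival probability P(T > t + Ti | T > t) = S(t + Ti) / S(t), where S is the survival function and t is the current age of the seal. As a minimal sketch of how this would be computed (the durations below are made up, and lifelines is just one library choice):

    from lifelines import KaplanMeierFitter

    # Hypothetical seal lifetimes in days; event_observed = 1 means the seal
    # failed, 0 means the seal was still intact when observation ended.
    durations = [120, 200, 340, 90, 410, 150, 275]
    event_observed = [1, 1, 0, 1, 0, 1, 1]

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=event_observed)

    t, horizon = 100, 14  # current seal age and horizon Ti, both in days
    s = kmf.survival_function_at_times([t, t + horizon]).values
    p_cond = s[1] / s[0]  # P(T > t + Ti | T > t) = S(t + Ti) / S(t)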

Dataset

The dataset I have is a time series of sensor measurements collected over the past couple of months. Please note that these readings are taken every 100 ms and each machine has ~60 sensors. There are 7 such machines, and on each the gasket/seal had to be changed a couple of times (1-3 per machine) in the last year. These changes would be the 'events' in my prediction task. You can imagine the dataset looking as follows for each machine, where S_1 to S_34 are the sensor readings (truncated for simplicity).

Timestamp;S_1;S_2;S_3;S_4;S_5;S_6;S_7;S_8;S_9;S_10;S_11;S_12;S_13;S_14;S_15;S_16;S_17;S_18;S_19;S_20;S_21;S_22;S_23;S_24;S_25;S_26;S_27;S_28;S_29;S_30;S_31;S_32;S_33;S_34
0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 -- No Seal Change
0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 -- No Seal Change
0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 -- No Seal Change
0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 -- No Seal Change
0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 -- Seal had to be changed

P.S. I am aware this isn't the typical format in which survival analysis data is structured.

Problems I am facing - All related to Data Scarcity

  1. There is a huge class imbalance, if I can call it that: the number of times the seal has been changed is very small compared to all the data being logged. Any suggestions on how I can tackle this would be appreciated.

  2. Can I get suggestions for any other way to predict the likelihood of this event happening within a given time Ti, if something other than survival analysis is better suited?

  3. I would also like to know of any other methods commonly used in survival analysis for restructuring the data in a way that helps me solve the class imbalance (a sketch of one such restructuring follows this list).
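
One common restructuring, which also softens the imbalance, is the discrete-time (person-period) format: aggregate the raw logs into coarse intervals and emit one row per seal per interval, carrying the seal's current age and an event flag. A minimal sketch, assuming a per-machine pandas DataFrame df with a DatetimeIndex, sensor columns, and a hypothetical seal_change flag column:

    import pandas as pd

    def to_person_period(df, flag_col="seal_change", freq="1D"):
        """Aggregate 100 ms logs into daily rows: mean sensor values, the
        seal's age in days, and event = 1 on the day the seal was changed."""
        daily = df.drop(columns=[flag_col]).resample(freq).mean()
        daily["event"] = df[flag_col].resample(freq).max().fillna(0).astype(int)
        # Each seal change starts a new seal life on the following day.
        daily["life_id"] = daily["event"].shift(1, fill_value=0).cumsum()
        daily["age_days"] = daily.groupby("life_id").cumcount() + 1
        return daily

In this shape, the long stretches between changes become censored observations rather than a "negative class", which is how survival analysis sidesteps the imbalance; the rows can feed a Cox model (lifelines), gradient-boosted survival models (scikit-survival), or even a plain classifier on the label "event within the next Ti days".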


Topic: survival-analysis, deep-learning, predictive-modeling, machine-learning



Answer

I'm not sure I understand all the ins and outs of your problem, but here are a few suggestions that might help:

  • Are all 60 sensors actually informative for this prediction? Probably not. You could first identify the most relevant ones, so that you work with cleaner data and improve prediction results (see the feature-screening sketch after this list).
  • 100 ms is likely too fine-grained for predictive ML models. Besides, you can't replace a seal if your prediction arrives 300 ms before failure; you will want to scale the problem to a humanly and physically realistic horizon. Perhaps aggregate each minute, or each hour, into the average or the maximum of its 100 ms values (see the resampling sketch after this list). Some study of the chain reactions leading to failure may be necessary to pick the right time scale.
  • As in any other industrial problem, sensor sensitivity may differ from one machine to another, and noise may alter the readings. If that is the case, check whether sensor behaviour is similar across machines, set acceptance ranges, and reduce the noise so that values are comparable between machines (per-machine standardization, also shown in the resampling sketch, is a simple first step).
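
A minimal feature-screening sketch for the first point, assuming the daily person-period frame from above (the target column is hypothetical and would be computed from the event timestamps): drop near-constant sensors, then rank the rest by mutual information with the time remaining until the next seal change.

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold, mutual_info_regression

    def screen_sensors(daily, sensor_cols, target_col="remaining_days"):
        """Drop near-constant sensors, then rank the survivors by mutual
        information with the (assumed precomputed) remaining-life target."""
        vt = VarianceThreshold(threshold=1e-6).fit(daily[sensor_cols])
        kept = [c for c, keep in zip(sensor_cols, vt.get_support()) if keep]
        mi = mutual_info_regression(daily[kept], daily[target_col])
        return pd.Series(mi, index=kept).sort_values(ascending=False)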

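And a minimal resampling/normalization sketch for the second and third points, assuming raw per-machine DataFrames with a DatetimeIndex (column names are hypothetical): collapse the 100 ms readings into hourly mean and max per sensor, then z-score each machine against its own history so values are comparable across machines.

    import pandas as pd

    def hourly_features(raw, sensor_cols):
        """Collapse 100 ms readings into hourly mean/max per sensor, then
        z-score against this machine's own history."""
        agg = raw[sensor_cols].resample("1h").agg(["mean", "max"])
        agg.columns = [f"{s}_{stat}" for s, stat in agg.columns]
        return (agg - agg.mean()) / agg.std()

    # Hypothetical usage: machines maps machine_id -> raw DataFrame.
    # features = {m: hourly_features(df, sensor_cols) for m, df in machines.items()}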