Are there any advantages to using survival analysis models such as Cox’s proportional hazards model with uncensored data over simple linear regression or other classic ML models? I have data with recurrent events and I am trying to predict the time of the next event. The data contain about 2000 different subjects and about 60 events per subject. The percentage of censored data (the last event of each subject) is small, and I don't think it plays a big role in the prediction.
I'm quite new to machine learning and statistics. I have a dataset of e-commerce sales history. It has almost 2k instances, and the features include personId (string), productCategory (string/discrete), amountPaid (float/continuous), and purchaseTime (string, formatted DD/MM/YYYY). A person can purchase a product at any time (at irregular intervals, so I guess I can't use time series analysis). I want to know when the same person (identified by personId) will make their next purchase in a category (identified by productCategory). What ML model should I use for …
I'm trying to predict the expected LTV of a subscriber. Since monthly revenue and costs are almost constant, I only need to predict the survival function, where the terminal event would be the subscription cancellation request. I proposed the following formula to estimate LTV: $LTV = (Membership - Cost)*mean\ residual\ life(x)$ where: $mean\ residual\ life(x)=E(X-x|X>x)= \frac{\int_{x}^{\infty}S(t)dt}{S(x)}$ In my case I have data on all subscribers over the last 10 years (more than 3 million data points where 1 million are …
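The mean residual life formula above can be checked numerically. The following is a minimal sketch, assuming a hypothetical exponential survival function, a hypothetical monthly fee and cost, and a finite upper limit standing in for the infinite integral; none of these values come from the question.

```python
import math

def mean_residual_life(survival, x, upper, steps=20_000):
    """Approximate E[X - x | X > x] = (integral from x to upper of S(t) dt) / S(x)
    with the trapezoidal rule. `upper` is a finite stand-in for infinity."""
    s_x = survival(x)
    if s_x <= 0:
        raise ValueError("S(x) must be positive")
    h = (upper - x) / steps
    total = 0.5 * (survival(x) + survival(upper))
    for i in range(1, steps):
        total += survival(x + i * h)
    return total * h / s_x

# Hypothetical exponential survival with a monthly cancellation rate of 0.05.
rate = 0.05
def exp_survival(t):
    return math.exp(-rate * t)

# Subscriber who has already survived 12 months; integrate 600 months ahead.
mrl = mean_residual_life(exp_survival, x=12.0, upper=612.0)

# Hypothetical membership fee (30) and cost (10) per month.
ltv = (30.0 - 10.0) * mrl
```

For an exponential survival function the mean residual life equals 1/rate at every x, which makes it a convenient sanity check before plugging in an estimated S(t).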
I've been looking into the Cox regression method for survival analysis in churn prediction. Cox regression will allow us to determine the probability that a subscriber will unsubscribe after a time $t$, defined by the hazard rate: $$ h(t \lvert X_i ) = h_0(t)\exp\big( \boldsymbol{\beta} ^T\boldsymbol{X}_{i} \big) $$ where $h_0(t)$, the baseline hazard, is the prior probability that any customer churns at time $t$ when all influencing factors are 0. $\boldsymbol{\beta} \in \mathbb{R}^D$: the exponent of each coefficient gives us a Hazard …
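The hazard formula above can be evaluated directly for a single time point. This is a small sketch with hypothetical coefficients and a hypothetical baseline hazard value; it only illustrates how exponentiating a coefficient yields a hazard ratio.

```python
import math

def cox_hazard(baseline_hazard, beta, x):
    """h(t | x) = h0(t) * exp(beta^T x) for one value of h0(t)."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline_hazard * math.exp(linear_predictor)

beta = [0.7, -0.2]   # hypothetical fitted coefficients
h0_t = 0.01          # hypothetical baseline hazard at some time t

reference = cox_hazard(h0_t, beta, [0, 0])   # all covariates at 0
exposed = cox_hazard(h0_t, beta, [1, 0])     # first covariate raised by 1

# The ratio does not depend on h0(t); it equals exp(beta[0]).
hazard_ratio = exposed / reference
```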
Basically what the question above asks. The KM survival function takes censored observations into account until the time they are censored. But how would the estimate at each point in time be affected if we assumed from the start that there is no censoring at all in the data? Thanks in advance!
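One way to explore this is to compute the Kaplan-Meier estimate twice on the same durations, once with the real event flags and once with every observation treated as an event. The sketch below is a plain-Python product-limit estimator on made-up data, not the implementation of any particular library.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where at least one event occurs."""
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations that share the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Events and censored observations both leave the risk set after t.
        n_at_risk -= removed
    return curve

# Made-up data: event flag 1 = observed event, 0 = censored.
durations = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]

with_censoring = kaplan_meier(durations, events)
ignoring_censoring = kaplan_meier(durations, [1] * len(durations))
```

With no censoring the estimate reduces to the empirical survival function (the fraction of subjects with a duration greater than t), so treating censored subjects as events pulls the curve down from their censoring times onward.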
The xgboost package enables survival modeling using the parameter arguments objective = "survival:cox" and eval_metric = "cox-nloglik". The predict method for the resulting model only outputs risk scores (the same as type = "risk" in the survival::coxph function in R). How do I use xgboost to predict entire survival curves?
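One commonly used route from risk scores to survival curves is the Breslow estimator of the cumulative baseline hazard. The sketch below is plain Python and does not call xgboost; it assumes the risk scores are on the hazard-ratio scale, i.e. exp(linear predictor), and the function names are hypothetical.

```python
import math

def breslow_baseline(durations, events, risk_scores):
    """Cumulative baseline hazard H0(t) at each distinct event time.
    Assumes risk_scores are hazard ratios, exp(linear predictor)."""
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    cumulative = 0.0
    baseline = []
    for t in event_times:
        # Number of events at time t.
        d = sum(1 for ti, e in zip(durations, events) if ti == t and e == 1)
        # Sum of risk scores over subjects still at risk at time t.
        at_risk = sum(r for ti, r in zip(durations, risk_scores) if ti >= t)
        cumulative += d / at_risk
        baseline.append((t, cumulative))
    return baseline

def survival_curve(baseline, risk_score):
    """S(t | x) = exp(-H0(t) * risk_score) at each baseline time."""
    return [(t, math.exp(-h0 * risk_score)) for t, h0 in baseline]

# Made-up training data with identical risk scores.
baseline = breslow_baseline([1, 2, 3, 4], [1, 1, 1, 1], [1.0, 1.0, 1.0, 1.0])
curve_for_new_subject = survival_curve(baseline, risk_score=2.0)
```

The baseline would be estimated on the training set with the model's training-set risk scores and then reused for new subjects.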
I am trying to learn how to use the Kaplan-Meier survival estimator model in the lifelines package. The documentation says that the KaplanMeierFitter.fit function returns "a modified self, with new properties like 'survival_function_'." I checked the contents of survival_function_ and it seems to contain the average survival probability for all the players in the dataset at each time interval. For example, in my dataset, there are 66 months and about 250,000 players (i.e., individuals whose death event we …
I have a time series data set with the lifecycle of 9000 different B2B sales leads. What I call the lifecycle is a dataset with one record per day for every sales lead identifier, with 4 predictive variables (DAYS_SINCE_START, LEAD_ID, CUSTOMER_INTEREST, MARKET, TYPE_SERVICE) and one response variable (OUTCOME). The response variable OUTCOME can take 2 different values: Won (1) or Lost (0). A mock example of the data frame would be the following: As can be seen, some …
We are working on a problem related to survival analysis. We have already implemented the Cox proportional hazards model and the accelerated failure time model. Now we want to see how the covariates change over time, so we decided to use AalenAdditiveFitter from the lifelines library. Here is some dummy data. Data shape is (1341799, 4). Gender Disability_level Time_to_event Event 1 Female Mild 50 0 2 Male Moderate 70 1 3 Male Severe . . . 1341799 Female Mild 45 1 Now, …
I have a histogram of values from a test network setup. The values come from iperf 2.1.6. I send a stream of data and get the number of packets in each bin of microseconds (bin width w = 100 us). I sometimes lose some packets. Question: how do I correctly take the lost packets into account when plotting the CCDF? For now I am calculating the Y-axis values with: (lost_packets + cum_sum(x))/total_packets Actual code: delay_data = np.random.uniform(low=5, high=62.4, size=(110,)) count_data = np.random.uniform(low=1, high=800, size=(110,)) df = pd.DataFrame({"count_bin": count_data, …
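One possible convention, sketched below, is to treat a lost packet as one whose delay exceeds every bin, so the CCDF never drops below lost/total. This is an assumption about how losses should be counted, not something stated in the question; the function name and sample counts are hypothetical, and the total is taken to be delivered plus lost packets.

```python
def ccdf_with_losses(bin_counts, lost_packets):
    """CCDF evaluated at the upper edge of each delay bin.
    Lost packets are counted as never arriving (delay beyond every bin)."""
    total = sum(bin_counts) + lost_packets
    ccdf = []
    delivered_so_far = 0
    for count in bin_counts:
        delivered_so_far += count
        # Fraction of all sent packets with delay above this bin's upper edge.
        ccdf.append((total - delivered_so_far) / total)
    return ccdf

# Made-up histogram: packets per 100 us bin, plus 5 lost packets.
counts = [50, 30, 15]
lost = 5
curve = ccdf_with_losses(counts, lost)
```

Under this convention the Y value is equivalent to (lost_packets + packets in later bins) / total_packets, so the running sum is taken over the bins above the current one.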
I would like to ask how to deal with new entries of individuals in survival analysis. I have a study of the time to event for several individuals who suffer from a disease. The study starts on a specified date (let's assume 1/1/2019). There are 50 individuals on this date. The study lasts 6 months. During these 6 months, more individuals must be included who were not present on the starting date. I do not have any left censoring because, …
I have a fixed term of, say, one year. At the end of the term there is an observation of true / false, say a customer either renews or cancels their subscription. This decision is probably based on the occurrence of certain events, say "how many times did they use the service?", and maybe even the specific timing within this term. At the beginning of the term (say, day 1) I don't have any behavioral information, so I can just …
I’m working on a survival analysis to predict 1-year mortality. I’m trying to build a custom score function that maximizes mean time-dependent AUC. Here is a description of the time-dependent AUC metric from the scikit-survival package. This custom score function would be used in GridSearchCV to select hyperparameters. The challenge is that the time-dependent AUC metric requires survival_train as an input. Is it possible to pass survival_train within cross-validation? Here is a layout of the code: # Instantiate …
I have a problem where every observation has a binary outcome that occurs at the end of a fixed period, and the predictor variables describe a few types of event that either happen on some day within that period or do not happen at all. For example:

Outcome   Days Until First Phone Call   Days Until Second Phone Call
TRUE      3                             14
FALSE     25                            63
FALSE     16                            NA

Of course I can convert the predictor columns to binary and use logistic …
The standard survival analysis model - for example the model which forms the basis for the proportional hazards model - assumes the hazard rate is constant. In many applications this would be the exception rather than the rule. What parametric model would be appropriate for data such as this: % retention 70% 80% 85% 90% 90%
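To make the retention figures above concrete, the sketch below converts them into per-period hazards and a cumulative survival curve. It assumes the percentages are period-over-period retention rates (the share of customers active at the start of a period who remain at its end), which the question does not state explicitly.

```python
# Retention per period as given in the question (assumed to be conditional
# on being active at the start of each period).
retention = [0.70, 0.80, 0.85, 0.90, 0.90]

# Discrete hazard: probability of leaving during a period, given active at its start.
hazards = [1.0 - r for r in retention]

# Cumulative survival: product of the retention rates up to each period.
survival = []
s = 1.0
for r in retention:
    s *= r
    survival.append(s)
```

Under that reading the hazard falls from 0.30 to 0.10 over the five periods, which is the non-constant pattern the question is asking a parametric model to capture.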
Problem Scenario: I am working on an industry-specific problem focussed on predicting the failure of a seal/gasket in a given time interval (T) in a high-pressure compression environment. Whenever this seal/gasket breaks, there is a loss of pressure and a leak. This leak is extremely dangerous. The gas in question is H2, which makes things even scarier. The specific problem would be this: "Predict the likelihood of this Seal Surviving past a time Ti provided that the event has not …
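A quantity of this kind is usually written as a conditional survival probability, P(T > t_future | T > t_now) = S(t_future) / S(t_now). The sketch below assumes a hypothetical Weibull survival function with made-up shape and scale values; the question does not specify a distribution.

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape); a hypothetical choice of model."""
    return math.exp(-((t / scale) ** shape))

def conditional_survival(survival, t_now, t_future):
    """P(T > t_future | T > t_now) = S(t_future) / S(t_now)."""
    if t_future < t_now:
        raise ValueError("t_future must not be earlier than t_now")
    return survival(t_future) / survival(t_now)

# Made-up parameters: shape 2 (hazard increasing with age), scale 1000 hours.
def seal_survival(t):
    return weibull_survival(t, shape=2.0, scale=1000.0)

# Probability that a seal intact at 100 hours is still intact at 150 hours.
p = conditional_survival(seal_survival, t_now=100.0, t_future=150.0)
```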
This question stems from an approach proposed by Dr. Silverman, "Predicting Horse Race winners through A Regularized Conditional Logistic Regression with Frailty." In this paper, he proposes a modified Cox Proportional Hazard model including a frailty parameter taken from Muriel Gillick's article, "Guest Editorial: Pinning Down Frailty." The loglikelihood with frailty has the form: Where: $ X^{w}_{rh} $ = characteristics of the horse that won race r $\beta$ are the parameters to be estimated $w^{w}_{rh}$ is the frailty indicator of …
I am working on a problem to estimate task completion time in kanban (project management tool). While doing EDA, I looked at tasks that are either done or cancelled. In this case, I defined the completion time as the time taken from task creation to done/cancelled. I noticed I am running into an issue with that definition. I am disregarding tasks that have not been done yet. If we think of "task = done" as "event = 1", this is …
I need to make predictions on survival data using the random forest method. My question is: should I follow the same approach as in logistic regression and take into account only the status variable, or should I also take into account the time to the event? Are there any specific R functions for survival analysis other than randomForest? Or could I use this function for survival analysis as well? I've seen a function called ranger() that seems to do random forest on …
I am trying to analyze the effect of a particular business rule on customer behavior. Background: I have two call centers operating in my company. One is an in-house call center and the other is a third party. Incoming calls are handled by these two call centers based on some rules. Two months ago we changed some operational rules, after which all calls are routed to call center A and then, if not attended, to call …