What to do when one feature has very large importance/weight?

I am new to Data Science and am currently trying to predict customer churn for a company that offers subscription-based booking management software. Its customers are gyms. I have a small, unbalanced dataset of historical data (False 670, True 230) with 2 numerical predictors, age (days since subscription) and number of active days in the last month (days on which a customer (gym) had bookings), and 1 categorical predictor, logo (boolean, whether a customer uploaded a logo to the software). Predictors have the following …
Category: Data Science

Behavioural data required to predict churn

I am trying to build a predictive churn model that will identify customers who are likely to churn. I am defining a churned user as someone who hasn't transacted within 60 days. 90% of all transactions occur within 60 days of one another, so this feels reasonable. I have very limited behavioural data, however. I have a record of a user's transactions and I have access to Google Analytics (GA). GA does not, however, allow me to track the specific …
Category: Data Science
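A figure like "90% of all transactions occur within 60 days of one another" can be checked directly from a transaction log. The following is a minimal sketch, assuming purchases are available as a mapping from user ID to purchase dates; the function names and the sample data are hypothetical and are not taken from the question.

```python
from datetime import date
from math import ceil


def inter_purchase_gaps(purchases):
    """Collect the gaps (in days) between consecutive purchases of each user.

    purchases: dict mapping a user id to a list of datetime.date objects.
    """
    gaps = []
    for dates in purchases.values():
        ordered = sorted(dates)
        gaps.extend((later - earlier).days
                    for earlier, later in zip(ordered, ordered[1:]))
    return gaps


def gap_quantile(gaps, q):
    """Nearest-rank quantile: the smallest gap g such that at least a
    fraction q of all observed gaps are <= g."""
    ordered = sorted(gaps)
    rank = max(1, ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical toy data, for illustration only.
purchases = {
    "u1": [date(2023, 1, 1), date(2023, 1, 11), date(2023, 1, 31)],
    "u2": [date(2023, 2, 1), date(2023, 3, 3)],
    "u3": [date(2023, 1, 5), date(2023, 5, 5)],
}
gaps = inter_purchase_gaps(purchases)
threshold_days = gap_quantile(gaps, 0.9)
```

Comparing the quantile at several values of q (for example 0.8, 0.9, 0.95) shows how sensitive a candidate churn threshold is to the chosen coverage level.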

Logistic Regression for prediction

I would like to ask about the theoretical approach of using Logistic Regression for customer data and more specifically Churn Prediction (in BigQuery and Python). I have my customer data for an online shop and I would like to predict if the customer will churn based on some characteristics. I have created my dataset and the Churn label (based on the hypothesis that if the customer hasn't bought something in the last year then it is assumed that the customer …
Category: Data Science

Decision tree to get difference in rates in two groups?

I have two sample groups of customers; each customer has hundreds of features. For a single sample, I would use Decision Trees to find sub-groups that have a high churn rate. That's easy. However, my requirement is: between two samples (below), find segment(s) such that in one sample the churn rate is high and in the other it is low. In other words, find a sub-group which has the highest difference in churn rate. What is an appropriate algorithm to …
Category: Data Science

How do you effectively predict the top 20% most likely customers to churn from a dataset?

Suppose I have a dataset with 100,000 existing customers who didn't churn and 20,000 previous customers who churned in the past, and the business objective is to target the 20% of customers most likely to churn. How would that be done? For example, we would have to take this dataset and split it into a training and test set. Let's say the split is an 80/20 ratio for the training …
Category: Data Science
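One common way to frame "target the top 20%" is as a ranking problem: score every customer with a model, sort by predicted churn probability, and take the highest-scoring fifth. The sketch below assumes the scores already exist; the function names, customer IDs, and probabilities are hypothetical.

```python
from math import ceil


def top_fraction(scores, fraction=0.2):
    """Return the customer ids with the highest predicted churn probability.

    scores: dict mapping a customer id to a predicted probability.
    fraction: share of customers to select (0.2 means the top 20%).
    """
    k = ceil(fraction * len(scores))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def precision_at_k(selected, churned):
    """Share of the selected customers who actually churned."""
    hits = sum(1 for customer in selected if customer in churned)
    return hits / len(selected)


# Hypothetical model outputs for ten customers.
scores = {"a": 0.9, "b": 0.1, "c": 0.75, "d": 0.3, "e": 0.2,
          "f": 0.05, "g": 0.6, "h": 0.4, "i": 0.15, "j": 0.5}
selected = top_fraction(scores, 0.2)

# Hypothetical ground truth from a held-out set.
actually_churned = {"a", "g"}
precision = precision_at_k(selected, actually_churned)
```

Evaluating precision within the selected group on a held-out set, rather than overall accuracy, matches the stated business objective more closely.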

Should I perform customer segmentation before performing churn prediction?

Imagine a company with multiple lines of revenue coming from different products, where all customers can access these different products through the same account and the same online platform. My goal is to predict the churn for each customer. Should I perform customer segmentation into clusters and build a churn prediction model for each segment? The straightforward path would be to get all relevant features for all customers and try to predict the churn for all of them. The …
Category: Data Science

Predicting churn - deal with missing dates in time series and improve modelling result

This is a follow-up question to "General approach on time series for customer retention/churn in retail". I have a time series of data in the following form:
| purchase_date | cutomer_id | num_purchases | churned |
| 2018-10-31 | id1 | 39 | 0 |
| 2018-11-31 | id1 | 0 | 0 |
| 2019-01-31 | id1 | 6 | 0 |
| ... | | | |
| 2019-03-31 | id1 | 88 | 1 |
| 2019-03-31 | id2 | 300 | 0 |
| 2018-04-31 | id2 | 2 | 1 |
| 2019-02-31 | id3 | 1 | 1 |
| 2019-07-31 | id4 | 100 | 0 |
| ... | id5 | | |
I grouped the data by month and summed …
Category: Data Science

Data for churning model

I am thinking about how to improve the imbalanced dataset for my churn model; most people recommend techniques such as over- or under-sampling. I am wondering if using past customer churn data would be helpful. Say that I am now collecting data for the past 12 months only to start with, and for this purpose I also collect customer churn data from the past 12-36 months. Any feedback would be appreciated. Thank you
Topic: churn
Category: Data Science
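For reference, random over-sampling (one of the techniques mentioned above) duplicates minority-class rows until the classes are the same size. This is a minimal standard-library sketch, not a recommendation for this dataset; the function name and toy data are hypothetical. Resampling is normally applied to the training split only, so that evaluation data keeps the real class ratio.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra rows with replacement to reach the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 8 non-churners (label 0) and 2 churners (label 1).
rows = list(range(10))
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
```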

Churn Prediction Training Set

I don't understand how to form my dataset from activity (logins etc.) and characteristic (location, age etc.) raw user data. Ultimately, each row of the training set will have N activity features for a certain period, M characteristic features and a binary outcome: churn or not after the end of this period. My problem comes from defining the period and the number of rows per user. The options I see are the following: Define period from start of user lifetime, 1 …
Category: Data Science

What sort of analysis should be done in order to define our target outcome for modelling customer lapse?

I am trying to build a model to predict customer lapse and am required to define the target lapse definition myself. What sort of customer behavioural analysis should I do in order to define my modelling target? In other words, what sort of analysis could help me decide between a lapse definition of 90 days without a purchase compared to 180 days.
Category: Data Science

Who will be churned in the next 4 months?

The task is predicting churn for a given time horizon (for example, 4 months or 6 months in the future). The standard approach predicts only whether somebody will churn or not. Is there any approach that can solve this problem? Is it necessary to organize features on a time basis in order to predict churn in the next period? I have found this short explanation: https://stackoverflow.com/questions/64237069/predicting-customer-churn-over-a-period-of-time but it is not clear: is there a connection between the prediction period and the sliding window? How …
Category: Data Science

How to define churn prediction for period of time in the future (for example 4 months)

The task is to predict churn in the next 4 months for customers who pay a subscription for the service. The customer can pay the subscription on a monthly or yearly basis. If the customer doesn't pay in the subscription period (for a monthly basis the next month, for a yearly basis after 12 months), he receives a warning in the next month, then a second warning (a month after that), and then he is given the status “churned”. Inputs are the data from the data warehouse, one row per month …
Category: Data Science
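One way to make "churn in the next 4 months" concrete is to fix a snapshot month, compute features only from data up to that month, and label a customer 1 if the churn status is assigned within the following 4 months. The sketch below illustrates only the labelling step under that assumption; `month_index` and `horizon_label` are hypothetical helpers, not part of the question.

```python
def month_index(year, month):
    """Map a calendar month to a running integer so months can be subtracted."""
    return year * 12 + (month - 1)


def horizon_label(snapshot, churn, horizon=4):
    """Label a customer relative to a snapshot month.

    snapshot: month index at which features are frozen.
    churn: month index at which the customer got the churned status,
           or None if the customer never churned in the observed data.
    Returns 1 if churn falls within `horizon` months after the snapshot,
    0 if not, and None if the customer had already churned at the
    snapshot (such rows would be excluded from training).
    """
    if churn is not None and churn <= snapshot:
        return None
    if churn is None:
        return 0
    return 1 if churn - snapshot <= horizon else 0


# Hypothetical example: features frozen at 2020-11, horizon of 4 months.
snapshot = month_index(2020, 11)
label_feb = horizon_label(snapshot, month_index(2021, 2))     # inside horizon
label_apr = horizon_label(snapshot, month_index(2021, 4))     # outside horizon
label_active = horizon_label(snapshot, None)                  # never churned
label_already = horizon_label(snapshot, month_index(2020, 10))  # churned before
```

Repeating this for several snapshot months yields multiple training rows per customer, which is one reading of the "sliding window" idea.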

Churn prediction model doesn't predict well on real data

I am currently working on a churn prediction problem. As input I use data from a data warehouse for the period 082016 - 032021 (one row per month for each customer). Based on this data I have created a time window of 18 months, in which I track customer behaviour (feature engineering). Based on these features, I predict churn 4 months into the future, 122020-032021. As a model I use lightGBM with the following parameters: parameters = { 'objective': 'binary', 'metric': 'auc', 'is_unbalance': 'true', …
Topic: lightgbm churn
Category: Data Science

Should I include active services when training a ML model for churn prediction?

I have been trying to build an ML model to predict churn events of our services. The services are subscription based, which means they usually have a fixed term (1-5 years). Because of that, churn usually happens when services are about to expire or have already expired (on a month-to-month basis). While the churned services are straightforward, I am struggling with sampling for not churned services. The ones that were renewed were initially labeled as 0's. However the ratio of 0 and …
Category: Data Science

What is the best way to model survival when the hazard rate decreases over time?

The standard survival analysis model - for example the model which forms the basis for the proportional hazards model - assumes the hazard rate is constant. In many applications this would be the exception rather than the rule. What parametric model would be appropriate for data such as this: % retention: 70%, 80%, 85%, 90%, 90%
Category: Data Science
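One parametric family often considered when the hazard falls over time is the Weibull distribution; it is shown here as a candidate, not as the answer to the question. The symbols k (shape) and λ (scale) are introduced here for illustration. If the retention figures are read as period-over-period retention (an assumption), the implied per-period hazards are 30%, 20%, 15%, 10%, 10%, which is a decreasing pattern.

```latex
% Weibull hazard and survival functions, with shape k > 0 and scale \lambda > 0
h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1},
\qquad
S(t) = \exp\!\left[-\left(\frac{t}{\lambda}\right)^{k}\right]

% k < 1 : hazard decreases with t
% k = 1 : hazard is constant, h(t) = 1/\lambda (the exponential model)
% k > 1 : hazard increases with t
```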

Logistic Regression with Heterogeneous Historical Clusters of Customer Usage

I would like to train a churn model based on daily customer usage of a service, among other features, to predict whether customers are likely to churn. The problem I am facing is that the historical usage data varies from one customer to another based on the contract date: some have been subscribers for months, others only for weeks. This means the available historical data varies for each customer. This dataset makes it difficult to …
Category: Data Science

Expected Lifetime: Churn Formula vs. Experience Data

I am analyzing data for a subscription-based company, i.e. they sell a service in exchange for a monthly payment. I would like to conduct an analysis and come up with an estimate of the average lifetime (in months) of a customer. I have approximately 6 years of data, including date of enrollment and date of cancellation. N is fairly large: 70k, 80k, 100k, 115k, 135k, 161k enrolled clients in the 6 years, respectively. I have seen articles such as this …
Topic: churn
Category: Data Science
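The two approaches named in the title can be compared numerically. Under the assumption of a constant monthly churn rate c, the expected lifetime is 1/c months; with experience data, the expected lifetime can instead be estimated by summing an observed survival curve. The sketch below uses hypothetical function names and synthetic numbers, and assumes survival[t] is the fraction of a cohort still active at the start of month t.

```python
def lifetime_from_churn_rate(monthly_churn):
    """Expected lifetime in months if the monthly churn rate is constant."""
    return 1.0 / monthly_churn


def lifetime_from_survival(survival):
    """Expected lifetime in months from an observed survival curve.

    survival[t] is the fraction of the cohort still active at the start of
    month t, with survival[0] == 1.0. The sum is truncated at the last
    observed month, so it underestimates the lifetime if customers are
    still active at the end of the observation window.
    """
    return sum(survival)


# Synthetic check: with a constant 50% monthly churn both methods agree.
churn = 0.5
curve = [(1 - churn) ** t for t in range(60)]
by_formula = lifetime_from_churn_rate(churn)
by_curve = lifetime_from_survival(curve)
```

The two estimates diverge when churn is not constant, for example when early-tenure customers cancel at a higher rate than long-tenure ones.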

Student Churn prediction

I am working on an ML model for student churn prediction. It is a classification problem: whether a student will churn or not. I have a lot of data, such as the student data and the activities of the student. There are two problems which I would like to ask about: (1) the churn of the student in the first 6 weeks, and (2) the overall churn of the student after the 6 weeks. Would you split your work between 2 models: in 6 …
Category: Data Science

How to predict churn events that may happen within a period of time?

I am trying to build a model that predicts churn events in the future. The business wants to be able to identify which customers are likely to terminate the services within a month. "Within a month" can mean the next day or the 30th day. The problem is some of the features are time-based, for example how many months into the current term, the number of tickets created in the last two weeks, etc. If the event date is floating, …
Category: Data Science

How to use multiple cross-section observations per subject for churn prediction?

Recently I have started to teach myself about machine learning and I have run into a dataset which got me a bit confused. Dataset: The subjects of the dataset are university students (student ID == "Key" feature), and each observation is a summary of their semester (grade averages, ECTS taken and completed, etc.) plus their general programme-related data (enrollment and scholarship status, date of enrollment, programme code, etc.). The data is in Hungarian, but in the context of the issue, …
Category: Data Science
