Change target variable value to better reflect affordability
Context
I am working on a regression problem in which I try to predict affordability. My dataset contains the daily installments customers pay to repay a purchase under a contract: essentially, a minimum daily rate the customer has to pay for their purchase. Using this data, I want to predict the affordability of each customer. My target variable is the daily rate of the last purchase, and the features take into account all payments and similar purchases up to that point in time. In this problem, affordability is the amount of additional daily rate a customer can add to their existing daily rate and keep repaying consistently.
Problem
A hypothetical issue arises when a customer pays a lot of money (sometimes the entire amount) upfront and the daily rate on the contract is very low. This customer would have the same target as a customer who repays the same amount slowly over a long period.
Example
customer_id | total_paid | previous_daily_rate | days_since_first_purchase | last_purchase_daily_rate(target) |
---|---|---|---|---|
123 | 200 | 0.2 | 5 | 0.1 |
321 | 200 | 0.2 | 1100 | 0.1 |
There are many more features, but I only show these for simplicity.
From the sample data above, we can see that customer 123 paid 200 in the first 5 days when they were only expected to pay $1 (days * daily_rate = 5 * 0.2). Customer 321, on the other hand, paid the same amount over the course of 1100 days, when they were actually expected to pay 220 (1100 * 0.2).
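The arithmetic above can be reproduced with a short pandas sketch. It assumes the expected amount is days_since_first_purchase * previous_daily_rate, and that "contract performance" means total paid as a percentage of that expected amount; the column names `expected_paid` and `contract_performance_pct` are illustrative, not part of the original dataset.

```python
import pandas as pd

# The two sample customers from the table above
df = pd.DataFrame({
    "customer_id": [123, 321],
    "total_paid": [200.0, 200.0],
    "previous_daily_rate": [0.2, 0.2],
    "days_since_first_purchase": [5, 1100],
    "last_purchase_daily_rate": [0.1, 0.1],
})

# Amount the contract required so far: days * daily_rate
df["expected_paid"] = df["days_since_first_purchase"] * df["previous_daily_rate"]

# Assumed definition: total paid as a percentage of the expected amount
df["contract_performance_pct"] = 100 * df["total_paid"] / df["expected_paid"]

print(df[["customer_id", "expected_paid", "contract_performance_pct"]])
```

Under this definition, customer 123 is far above 100% while customer 321 is slightly below it, even though both share the same target value.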
- Is this a problem for regression?
- Any leads on how to tackle this issue?
Potential solution
In order to tackle this, I am investigating a way of inflating the target variable for those cases where the probable buying power is obviously higher than what we see in the data. However, I can only come up with arbitrary, hardcoded formulas. The logic of my existing solution is the following:
if total_paid / days_since_purchase (aka *contract performance %*) >= 200
   and last_daily_rate (target) < some_value
then
    if last_daily_rate (target) * inflation >= cap
        inflated_target = cap
    else
        inflated_target = last_daily_rate (target) * inflation
This already gives better results compared to leaving the target variable intact. However, all the thresholds and the logic are arbitrary. I was wondering whether the community has come across any similar problem and, if so, whether there is a more robust approach to solving this type of problem.
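A minimal runnable sketch of the capped inflation rule described above might look like the following. Everything numeric here is a placeholder: `PERFORMANCE_THRESHOLD_PCT`, `LOW_RATE_THRESHOLD`, `INFLATION`, and `CAP` are hypothetical values, and contract performance is assumed to be total paid as a percentage of the amount expected so far. The comparison directions (high performance, low target, inflated value capped from above) are also an assumption about the intended logic.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder values, not the thresholds used in the question
PERFORMANCE_THRESHOLD_PCT = 200.0
LOW_RATE_THRESHOLD = 0.15
INFLATION = 3.0
CAP = 0.25


def inflate_target(df: pd.DataFrame) -> pd.Series:
    """Return the target, inflated (and capped) for strong over-payers."""
    expected_paid = df["days_since_first_purchase"] * df["previous_daily_rate"]
    performance_pct = 100 * df["total_paid"] / expected_paid
    target = df["last_purchase_daily_rate"]

    # Customers who overpaid heavily while holding a low daily rate
    eligible = (performance_pct >= PERFORMANCE_THRESHOLD_PCT) & (
        target < LOW_RATE_THRESHOLD
    )

    # Inflate the target, but never beyond the cap
    inflated = np.minimum(target * INFLATION, CAP)

    # Keep the original target for everyone who is not eligible
    return target.where(~eligible, inflated).rename("inflated_target")


df = pd.DataFrame({
    "customer_id": [123, 321],
    "total_paid": [200.0, 200.0],
    "previous_daily_rate": [0.2, 0.2],
    "days_since_first_purchase": [5, 1100],
    "last_purchase_daily_rate": [0.1, 0.1],
})
df["inflated_target"] = inflate_target(df)
print(df[["customer_id", "last_purchase_daily_rate", "inflated_target"]])
```

With these placeholder values, only customer 123 is adjusted, and the adjustment is limited by the cap. Keeping the constants in one place makes it easier to tune them (for example by cross-validation) instead of hardcoding them inside the rule.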
Edit:
Alternative solutions for customers like customer_id 123:
- Consider them outliers and remove them
- Convert it to a classification problem and manually assign those customers to a higher tier (class) of affordability
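Both alternatives can be sketched in a few lines of pandas. This is only an illustration: the helper names `flag_overpayers` and `assign_tier`, the 1000% cut-off, the bin edges, and the tier labels are all hypothetical choices, and contract performance is again assumed to be total paid as a percentage of the expected amount.

```python
import pandas as pd


def flag_overpayers(df: pd.DataFrame, max_performance_pct: float = 1000.0) -> pd.Series:
    """Boolean mask of customers who paid far more than the contract required."""
    expected_paid = df["days_since_first_purchase"] * df["previous_daily_rate"]
    performance_pct = 100 * df["total_paid"] / expected_paid
    return performance_pct > max_performance_pct


def assign_tier(
    rate: pd.Series,
    overpayer: pd.Series,
    bin_edges=(0.0, 0.15, 0.5, float("inf")),
    labels=("low", "medium", "high"),
) -> pd.Series:
    """Bin the daily rate into tiers, then move flagged customers to the top tier."""
    tier = pd.cut(rate, bins=list(bin_edges), labels=list(labels))
    tier.loc[overpayer] = labels[-1]  # manual override for over-payers
    return tier


df = pd.DataFrame({
    "customer_id": [123, 321],
    "total_paid": [200.0, 200.0],
    "previous_daily_rate": [0.2, 0.2],
    "days_since_first_purchase": [5, 1100],
    "last_purchase_daily_rate": [0.1, 0.1],
})

overpayer = flag_overpayers(df)

# Option 1: treat over-payers as outliers and drop them before fitting
df_no_outliers = df[~overpayer]

# Option 2: turn the target into classes and promote over-payers manually
df["affordability_tier"] = assign_tier(df["last_purchase_daily_rate"], overpayer)
print(df[["customer_id", "affordability_tier"]])
```

Dropping the rows discards information about the customers with the most buying power, while the tier approach keeps them at the cost of a coarser target, so the choice depends on how the predictions will be used.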
Topic finance regression data-cleaning
Category Data Science