Change target variable value to better reflect affordability

Context

I am working on a regression problem that tries to predict affordability. My dataset contains daily installments that repay a purchase under a contract: essentially, a minimum daily rate the customer has to pay for their purchase. Using this data, I want to predict the affordability of each customer. My target variable is the daily rate of the last purchase, and the features take into account all payments and similar purchases up to that point in time. In this problem, affordability is the amount of additional daily rate a customer can add to their existing daily rate and keep repaying consistently.

Problem

A hypothetical issue arises when a customer pays a lot of money (sometimes the entire amount) upfront and the daily rate on the contract is very low. This customer would have the same target as a customer who pays only at the contracted rate, even though their buying power is probably higher.

Example

| customer_id | total_paid | previous_daily_rate | days_since_first_purchase | last_purchase_daily_rate (target) |
|---|---|---|---|---|
| 123 | 200 | 0.2 | 5 | 0.1 |
| 321 | 200 | 0.2 | 1100 | 0.1 |

There are many more features, but I show only these for simplicity.

From the sample data above, we can see that customer 123 paid 200 in the first 5 days when they were only expected to pay $1 (days * daily_rate = 5 * 0.2). Customer 321, on the other hand, paid the same amount over the course of 1100 days but was actually expected to pay 220 (1100 * 0.2).
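The comparison above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names `expected_payment` and `payment_ratio` are hypothetical, and it assumes the amount due is simply `days_since_first_purchase * previous_daily_rate`, as in the example.

```python
# The two example customers from the table above.
customers = [
    {"customer_id": 123, "total_paid": 200, "previous_daily_rate": 0.2,
     "days_since_first_purchase": 5},
    {"customer_id": 321, "total_paid": 200, "previous_daily_rate": 0.2,
     "days_since_first_purchase": 1100},
]


def expected_payment(row):
    """Amount due so far under the contract: days * daily rate."""
    return row["days_since_first_purchase"] * row["previous_daily_rate"]


def payment_ratio(row):
    """Paid amount divided by amount due (1.0 means exactly on schedule)."""
    return row["total_paid"] / expected_payment(row)


for row in customers:
    print(row["customer_id"],
          round(expected_payment(row), 2),
          round(payment_ratio(row), 2))
```

Customer 123 has paid roughly 200 times the amount due, while customer 321 has paid roughly 0.91 of it, yet both rows carry the same target of 0.1.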

  1. Is this a problem for regression?
  2. Any leads on how to tackle this issue?

Potential solution

In order to tackle this, I am investigating a way of inflating the target variable for those cases where the probable buying power is obviously higher than what we see in the data. However, I can only come up with arbitrary, hardcoded formulas. The logic of my existing solution is the following:

if total_paid / days_since_purchase (aka *contract performance %*) >= 200
and last_daily_rate (target) < some_value
then
    if last_daily_rate (target) * inflation >= cap
        inflated_target = cap
    else
        inflated_target = last_daily_rate (target) * inflation

This already gives better results compared to leaving the target variable intact. However, all the thresholds and logic are arbitrary. I was wondering whether the community has come across a similar problem and, if so, whether there is a more robust approach to solving this type of problem.
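For concreteness, the rule described above might look like the following in Python. This is a sketch under assumptions, not the exact code in use: it assumes the rule fires when the performance ratio is at least a threshold and the target is below some value, and that the inflated target is capped. The function name `inflate_target` and every threshold value passed in are hypothetical placeholders.

```python
def inflate_target(total_paid, days_since_purchase, last_daily_rate, *,
                   performance_threshold, low_rate_threshold, inflation, cap):
    """Return the (possibly inflated) target daily rate.

    All thresholds are caller-supplied; none of them come from the data.
    """
    if days_since_purchase <= 0:
        # No elapsed time to measure performance against; leave target as is.
        return last_daily_rate

    performance = total_paid / days_since_purchase
    if performance >= performance_threshold and last_daily_rate < low_rate_threshold:
        # Inflate, but never beyond the cap.
        return min(last_daily_rate * inflation, cap)
    return last_daily_rate


# Illustrative call with made-up numbers: 1000 / 4 = 250 >= 200 and 0.1 < 0.5,
# so the target is inflated to 0.1 * 3 and then capped at 0.25.
adjusted = inflate_target(1000, 4, 0.1,
                          performance_threshold=200, low_rate_threshold=0.5,
                          inflation=3, cap=0.25)
print(adjusted)
```

Writing the thresholds as explicit arguments at least makes the arbitrary choices visible, so they can be tuned or validated instead of being buried in the logic.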

Edit:

Alternative solutions for customers like customer_id 123:

  1. Consider them outliers and remove them.
  2. Convert it to a classification problem and manually assign those customers to a higher tier (class) of affordability.
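The second alternative could be sketched roughly as follows. Everything here is an assumption for illustration: the tier boundaries in `TIER_EDGES`, the `overpay_threshold`, and the function name `affordability_tier` are hypothetical, and `payment_ratio` is taken to mean total paid divided by the amount due under the contract.

```python
from bisect import bisect_right

# Hypothetical daily-rate boundaries: below 0.15 is tier 0,
# 0.15 up to (but not including) 0.5 is tier 1, 0.5 and above is tier 2.
TIER_EDGES = [0.15, 0.5]


def affordability_tier(last_daily_rate, payment_ratio, overpay_threshold=2.0):
    """Bin the daily rate into a tier, bumping heavy overpayers up one tier."""
    tier = bisect_right(TIER_EDGES, last_daily_rate)
    if payment_ratio >= overpay_threshold:
        # Manual override: move up one tier, but not past the top tier.
        tier = min(tier + 1, len(TIER_EDGES))
    return tier


print(affordability_tier(0.1, 200.0))  # heavy overpayer, like customer 123
print(affordability_tier(0.1, 0.91))   # slightly behind schedule, like customer 321
```

With these made-up boundaries, the two example customers end up in different classes even though their daily rates are identical, which is the effect the manual assignment is after.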

Topic finance regression data-cleaning

Category Data Science
