Is it possible to optimize Client Lifetime Value with Reinforcement Learning, using marketing activities as actions?

I have been researching Reinforcement Learning and trying to work out whether it is the right way to optimize my company's marketing actions, given that our goal is to maximize Client Lifetime Value (CLV). So far I have the following formulation:

  • The environment is real life itself
  • The agent is our company
  • The reward can be matched to the CLV itself, meaning that any action can improve or worsen the CLV
  • The actions are the possible marketing and non-marketing actions, e.g. offer product A, offer product B, invite the client to download our app, do nothing, and so on
  • The state could be the current portfolio, previous actions, etc. (a rough sketch of how I picture these pieces follows below)
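
A rough Python sketch of how I picture these pieces (the action names and state features are invented just to illustrate the formulation):

```python
# Rough sketch of the MDP components; names and features are placeholders.
from dataclasses import dataclass, field
from typing import List

# Possible marketing and non-marketing actions for a client
ACTIONS = ["offer_product_A", "offer_product_B", "invite_app_download", "do_nothing"]

@dataclass
class ClientState:
    """State: current portfolio, engagement signals, and recent action history."""
    products_held: List[str]
    app_installed: bool
    months_as_client: int
    recent_actions: List[str] = field(default_factory=list)

def reward(clv_before: float, clv_after: float) -> float:
    """Reward: change in (estimated) CLV after taking an action."""
    return clv_after - clv_before
```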

My main worry is that some actions do not improve the CLV immediately. For instance, getting a client to download our app may eventually improve the CLV, but the effect is not immediate, or even easily traceable; instead, the download might increase the probability that the client accepts our products (we do have acceptance probabilities for the products). Would an RL model be able to help us improve these decisions? (If there is any information I have not specified, please let me know.)



I believe your general problem is a sequential decision problem, and it could be tackled with RL or other optimization methods. As you noted, the simulator (environment) is real life, so you should look at offline RL methods (recent review here). The main objective of offline RL is to make RL applicable in real life without a simulator, by learning from previously collected data. In other words, we want to avoid the risk of deploying an RL algorithm that has never interacted with the real MDP.
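
As a concrete (and deliberately simplified) illustration of learning from logged data rather than live interaction, here is a rough fitted Q-iteration sketch. The data layout and the scikit-learn regressor are assumptions made for illustration, not a prescription; note that this naive version ignores the distributional-shift issues that dedicated offline RL algorithms are designed to handle.

```python
# Minimal fitted Q-iteration on logged (state, action, reward, next_state) data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_iteration(states, actions, rewards, next_states,
                       n_actions, gamma=0.95, n_iters=20):
    """states/next_states: (N, d) arrays; actions: (N,) ints; rewards: (N,) floats."""
    # Represent Q(s, a) by appending a one-hot action encoding to the state features.
    def make_X(s, a):
        onehot = np.eye(n_actions)[a]
        return np.hstack([s, onehot])

    X = make_X(states, actions)
    targets = rewards.copy()
    model = GradientBoostingRegressor().fit(X, targets)

    for _ in range(n_iters):
        # Bootstrap target: immediate reward + discounted value of the best next action,
        # which is how delayed effects (e.g. an app download) get credited back.
        next_q = np.column_stack([
            model.predict(make_X(next_states, np.full(len(next_states), a, dtype=int)))
            for a in range(n_actions)
        ])
        targets = rewards + gamma * next_q.max(axis=1)
        model = GradientBoostingRegressor().fit(X, targets)
    return model

def greedy_action(model, state, n_actions):
    """Pick the action with the highest predicted Q-value for one client state."""
    s = np.tile(state, (n_actions, 1))
    q = model.predict(np.hstack([s, np.eye(n_actions)]))
    return int(np.argmax(q))
```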

Additionally, as a starting point, I would suggest doing some analysis. For example, detect which actions have an immediate impact on the CLV and which do not. Which features are important? You could formulate the problem as a supervised learning problem (features --> improve / don't improve / stay the same), and you could cluster your customers based on those features.
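
A small sketch of what that preliminary analysis might look like; the data here is random placeholder data and the choice of models (random forest, k-means) is just one reasonable option:

```python
# Classify whether an action improved the CLV, and segment customers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# X: (N, d) customer/action features; y: 0 = worsened, 1 = unchanged, 2 = improved CLV
X = np.random.rand(500, 6)          # placeholder data
y = np.random.randint(0, 3, 500)    # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
print("feature importances:", clf.feature_importances_)  # which features matter?

# Segment customers to check whether actions work differently per segment
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```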

You might be able to come up with a custom CLV proxy to optimize and run small-scale experiments without any learning algorithm; these could later serve as baselines. You could then also run small experiments in a bandit setting (dropping the sequential aspect of your problem, since credit assignment is hard!).
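
A toy epsilon-greedy sketch of that bandit setting; the reward function and the numbers in it are made up, and in practice the reward would be the short-term CLV proxy measured after each action:

```python
# Epsilon-greedy bandit over the marketing actions, with an immediate CLV proxy as reward.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4                      # e.g. offer A, offer B, invite app download, do nothing
counts = np.zeros(n_actions)
values = np.zeros(n_actions)       # running mean reward per action
epsilon = 0.1

def observe_reward(action: int) -> float:
    """Placeholder for the short-term CLV change measured after the action."""
    true_means = np.array([0.2, 0.5, 0.1, 0.0])
    return rng.normal(true_means[action], 1.0)

for t in range(10_000):
    if rng.random() < epsilon:
        a = int(rng.integers(n_actions))     # explore
    else:
        a = int(np.argmax(values))           # exploit current estimate
    r = observe_reward(a)
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]  # incremental mean update

print("estimated value per action:", values.round(3))
```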

You can do all of this in parallel with your exploration of offline RL methods. Please note that, in the end, careful feature selection, problem formulation, and data analysis are the keys to a successful implementation.
