What are some good methods to forecast future revenue from categorical and value-based data?

I have three years of monthly snapshots of all the contract data. Each snapshot includes the following information:

  • Contract status [Categorical]: Proposed, tracked, submitted, won, lost, etc.
  • Contract stages [Categorical]: Prospecting, engaged, tracking, submitted, etc.
  • Duration of contract [Date/Time]: Months and years
  • Bid start date [Date/Time]: Date (this changes when a contract is delayed)
  • Contract value [Numerical]: Value of the contract in local currency
  • Future revenue projection [Numerical]: Currency value breakdown of revenue for the next 5 years (this value is available for every contract, regardless of whether it was won or lost)

I also have other information about the contracts like id, name, description, etc.

Answers I am trying to get:

  • Total value of contracts that are changing status from month to month
  • Total value of contracts that are changing stages from month to month
  • Average delay of the start date of the contracts
  • Future revenue projection (5 years) based on change of status and average delay
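
The first three answers can be computed directly from the snapshots before any model is involved. Below is a minimal sketch using only the Python standard library. The row layout, the field order, and the sample values are assumptions made for illustration, and "delay" is taken here to mean the shift between the first and the last recorded bid start date of each contract.

```python
from collections import defaultdict
from datetime import date

# Hypothetical snapshot rows:
# (snapshot_month, contract_id, status, value, bid_start_date)
snapshots = [
    ("2023-01", "C1", "proposed",  100.0, date(2023, 3, 1)),
    ("2023-01", "C2", "tracked",   250.0, date(2023, 4, 1)),
    ("2023-02", "C1", "submitted", 100.0, date(2023, 3, 1)),
    ("2023-02", "C2", "tracked",   250.0, date(2023, 6, 1)),
    ("2023-03", "C1", "won",       120.0, date(2023, 5, 1)),
    ("2023-03", "C2", "tracked",   250.0, date(2023, 6, 1)),
]


def status_change_value(rows):
    """Total contract value per (month, old_status, new_status) transition."""
    history = defaultdict(list)
    for month, cid, status, value, _ in sorted(rows, key=lambda r: (r[0], r[1])):
        history[cid].append((month, status, value))

    totals = defaultdict(float)
    for records in history.values():
        # Compare each snapshot of a contract with the one before it.
        for (_, prev_status, _), (month, status, value) in zip(records, records[1:]):
            if status != prev_status:
                totals[(month, prev_status, status)] += value
    return dict(totals)


def average_delay_days(rows):
    """Mean shift in days between first and last recorded start date."""
    first, last = {}, {}
    for _, cid, _, _, start in sorted(rows, key=lambda r: (r[0], r[1])):
        first.setdefault(cid, start)  # start date in the earliest snapshot
        last[cid] = start             # start date in the latest snapshot
    delays = [(last[cid] - first[cid]).days for cid in first]
    return sum(delays) / len(delays)


changes = status_change_value(snapshots)
delay = average_delay_days(snapshots)
```

The same function applied to a stage column instead of the status column would give the value of contracts changing stage. The value is taken from the later of the two snapshots, which is a choice rather than a requirement.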

Problems I am having with this data:

  • It's not time series data but a set of monthly snapshots. I could turn it into a monthly time series by aggregating, for each status and stage, either the revenue or the number of contracts.

  • Do I aggregate the contract data or leave it as individual contracts? In the latter case, how do I feed it to a model? It would no longer be time series data.

Main problem with finding the right approach:

  • I am not sure which approaches to use to answer such different questions. Some values are categorical and some are numerical. I am not sure whether this is a forecasting problem, a 'change in event' prediction problem, or a mix of both.

  • How do I combine these very different categorical variables with the numerical revenue values in a single model?

Methods I looked into:

I am sorry for the long post. I am trying to make the problem as clear as possible. I have no idea what the best approach to this problem would be or what data to feed to the model. I am also unsure how to structure the data to make the best use of all the variables.

I would be grateful if you could suggest any good methods or reading resources, so I can answer these questions. Thank you for your time!

Topic forecasting time-series feature-extraction categorical-data machine-learning

Category Data Science


For time series forecasting based on both numerical and categorical data, LightGBM has proven its value in Kaggle competitions. The winners of both the M5 competition and the Corporación Favorita Grocery Sales Forecasting competition used LightGBM.
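
A tree-based model of this kind expects a flat table, so the snapshots first have to be arranged as one row per contract per month. Here is a rough sketch of that step in plain Python. The dictionary layout, the field names, and the choice of "next month's status" as the target are all assumptions for illustration, and the code only builds the table without calling LightGBM.

```python
# Hypothetical snapshots: month -> {contract_id: (status, value)}
snapshots = {
    "2023-01": {"C1": ("proposed", 100.0), "C2": ("tracked", 250.0)},
    "2023-02": {"C1": ("submitted", 100.0), "C2": ("tracked", 250.0)},
    "2023-03": {"C1": ("won", 120.0), "C2": ("lost", 250.0)},
}


def build_training_rows(snaps):
    """One row per contract and month, with last month's status as a lag
    feature and next month's status as the prediction target."""
    months = sorted(snaps)
    rows = []
    # Slide a three-month window: previous, current, next.
    for prev_m, cur_m, next_m in zip(months, months[1:], months[2:]):
        for cid, (status, value) in snaps[cur_m].items():
            if cid not in snaps[prev_m] or cid not in snaps[next_m]:
                continue  # skip contracts without a full three-month window
            rows.append({
                "contract_id": cid,
                "month": cur_m,
                "status": status,                      # categorical feature
                "prev_status": snaps[prev_m][cid][0],  # categorical lag feature
                "value": value,                        # numerical feature
                "next_status": snaps[next_m][cid][0],  # target
            })
    return rows


training_rows = build_training_rows(snapshots)
```

In a table like this, categorical and numerical columns sit side by side. LightGBM can treat columns such as `status` and `prev_status` as categorical features directly, so one-hot encoding is not strictly needed for them.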


If you want to use neural networks this post on Kaggle might help: https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/159052

It has a short list of resources for categorical embeddings and LSTM (I think).

If you think your dataset has periodic patterns, and you only need to answer your questions (not deploy a model), I would take a look at FB Prophet: https://facebook.github.io/prophet/docs/quick_start.html#python-api

It extracts the periodic components and fits them with sine and cosine waves (a Fourier series). You can also add additional regressors, e.g., one-hot encoded categorical variables.
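
To make the idea concrete, here is a small standard-library sketch of the kind of inputs involved: sine/cosine pairs for a seasonal period, with one-hot columns appended as extra regressors. This is not Prophet's own code. The function names, the status list, the period of 12 months, and the use of two harmonics are all assumptions for illustration.

```python
import math


def fourier_terms(t, period, order):
    """Sine/cosine pairs for time index t and a given seasonal period."""
    feats = []
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * t / period
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats


def one_hot(value, categories):
    """1.0 in the position of `value`, 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


STATUSES = ["proposed", "submitted", "won", "lost"]  # hypothetical category list


def design_row(month_index, status):
    # Yearly seasonality on monthly data: period 12, two harmonics (assumed).
    return fourier_terms(month_index, period=12, order=2) + one_hot(status, STATUSES)


row = design_row(3, "won")  # 4 seasonal columns followed by 4 one-hot columns
```

A linear fit over rows like this yields one weight per sine or cosine term and one per category. For an aggregated monthly series, the extra regressors would more likely be counts or shares of contracts per status than a single contract's status.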
