How to Approach a Linear Machine-Learning Model When Input Variables Are Inconsistent

Disclaimer: I'm relatively new to the data science and ML world -- still trying to get a firm grasp on the fundamentals.

I'm trying to overcome a regression challenge involving a large, multi-dimensional dataset, but I'm hitting a roadblock with my input data.

This dataset consists of a few key input features: [FLOW, TEMP, PRESSURE, VOLTAGE_A] and a single output variable, VOLTAGE_B (this is what I'm hoping to effectively model and predict). I can handle this data easily enough when I force the input values to be consistent, but my approach starts to break down and lose fidelity when I use the actual experimental data.

For example: in an ideal setting, each subset of input data would have samples at exactly [0.0, 10.0, 20.0] degC. In reality, though, I have data subsets where one may be [0.1, 9.9, 19.8] degC and another [0.2, 10.1, 20.1] degC. I'm not totally sure how to handle this inconsistency in my input data. In my model I'm discretizing the data by coercing each input value to the nearest ideal setpoint, but this is certainly degrading the fidelity of my model.
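For concreteness, here's a minimal sketch of the discretization workaround I described (the setpoint values and helper name are just for illustration):

```python
import numpy as np

# Assumed "ideal" test temperatures that the noisy readings are snapped to
setpoints = np.array([0.0, 10.0, 20.0])

def snap_to_setpoint(temps):
    """Coerce each measured temperature to the closest ideal setpoint."""
    temps = np.asarray(temps, dtype=float)
    # distance from every reading to every setpoint, then pick the nearest
    idx = np.abs(temps[:, None] - setpoints[None, :]).argmin(axis=1)
    return setpoints[idx]

print(snap_to_setpoint([0.1, 9.9, 19.8]))  # → [ 0. 10. 20.]
```

This forces the noisy measurements onto a fixed grid, which is exactly where I suspect I'm throwing away information.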

For context: presently I'm using scikit-learn's LinearRegression, via Python.
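Here's roughly how I'm fitting the model, with synthetic data standing in for my experimental measurements (the feature ranges and coefficients below are made up, just to keep the snippet self-contained):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the experimental data.
# Columns: FLOW, TEMP, PRESSURE, VOLTAGE_A (ranges are assumptions)
rng = np.random.default_rng(0)
X = rng.uniform(low=[0, 0, 90, 3], high=[5, 20, 110, 5], size=(200, 4))
# Fake VOLTAGE_B target: a linear combination of the inputs plus noise
y = (0.5 * X[:, 0] + 0.02 * X[:, 1] + 0.01 * X[:, 2] + 1.2 * X[:, 3]
     + rng.normal(scale=0.05, size=200))

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```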

MAIN QUESTION

  • Is there a good (recommended) method [that doesn't require a $20k/year subscription] for training a linear regression model when the inputs are all unique and non-repeating?

Topic python-3.x linear-regression scikit-learn dimensionality-reduction

Category Data Science
