I cannot work out the benefits of a pipeline over a linear sequence of code instructions

Question

ThomasAJ

May 6, 2022 08:09

I've looked at quite a number of 'how to create a pipeline' instructions, but I have yet to see an explanation of the benefits over what I am showing below.

To keep my example code agnostic I'll use simple pseudo-code.

So what I've been doing in order to, for example, train a model is...

Create functions/methods

function get_data(parm1, parm2...)

function prepare_data(...)

function train_model(...)

function test_model(...)

Run functions/methods - this is what I mean by 'linear sequence of code instructions' in the Title.

get_data(parm1, parm2...)
prepare_data(...)
train_model(...)
test_model(...)

Topic pipelines machine-learning

Category Data Science

karteek menda · Accepted Answer · May 6, 2022 08:09

In general, preprocessing must be done after the train/test split: always fit and transform on the training data, and only transform the test data. Pipelines are used specifically to avoid data leakage, i.e. information from the test data influencing what is learned during training. A pipeline is a sequential application of a list of steps. Its purpose is to assemble several steps that can be cross-validated together while setting different parameters.
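To make the fit-on-train, transform-on-test rule concrete, here is a minimal sketch in plain Python. `MeanStdScaler`, the toy values, and the two-fold split are all made up for illustration and are not taken from any particular library. The point it tries to show is that the scaler is refit inside every fold on that fold's training part only, so the held-out values never influence the learned statistics.

```python
from statistics import mean, pstdev


class MeanStdScaler:
    """Hypothetical scaler: fit() learns mean/std, transform() applies them."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # guard against zero spread
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


data = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]  # made-up values
folds = [data[0::2], data[1::2]]          # two toy folds

fold_means = []
for i, held_out in enumerate(folds):
    # Training part = every fold except the held-out one.
    train = [v for j, fold in enumerate(folds) if j != i for v in fold]

    scaler = MeanStdScaler().fit(train)           # statistics from train only
    train_scaled = scaler.transform(train)
    held_out_scaled = scaler.transform(held_out)  # reuse the fit, never refit
    fold_means.append(scaler.mean_)

# For comparison: what a scaler fitted on ALL the data would learn (leaky).
leaky_mean = MeanStdScaler().fit(data).mean_
```

Fitting on all the data at once (`leaky_mean`) gives a different mean than either fold learns on its own training part; that difference is the information that would leak from the held-out data.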

Erwan · Accepted Answer · April 27, 2022 13:41

As far as I know, in supervised ML the main advantage of a pipeline is to make sure the exact same preprocessing steps are applied to the training set and the test set.

Preprocessing can be complex sometimes: it potentially includes representation, normalization, removing outliers or imputing missing values, possibly calculating new features or removing some, etc. By doing this separately for the training set and the test set, there's the usual risk of code duplication: if something is modified here and not there, this can introduce inconsistencies and bugs. Additionally, a model could easily go completely wrong if the test data is not represented in the same way as the training data, so this is especially critical in supervised learning.
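As a rough illustration of the duplication point, the sketch below bundles two preprocessing steps into one object, so the training set and the test set cannot drift apart. All class names here (`MeanImputer`, `RangeScaler`, `Pipeline`) and the data are hypothetical, written in plain Python for this example rather than taken from a real library.

```python
from statistics import mean


class MeanImputer:
    """Hypothetical step: replaces None with the mean seen during fit()."""

    def fit(self, values):
        self.fill_ = mean([v for v in values if v is not None])
        return self

    def transform(self, values):
        return [self.fill_ if v is None else v for v in values]


class RangeScaler:
    """Hypothetical step: rescales using the min/max seen during fit()."""

    def fit(self, values):
        self.min_ = min(values)
        self.span_ = (max(values) - self.min_) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.min_) / self.span_ for v in values]


class Pipeline:
    """Hypothetical minimal pipeline: an ordered list of fit/transform steps."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        # Each step is fitted on the output of the previous one.
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        # Replays the already-fitted steps in the same order.
        for step in self.steps:
            values = step.transform(values)
        return values


train = [1.0, None, 3.0, 5.0]  # made-up training values
test = [None, 7.0]             # made-up test values

prep = Pipeline([MeanImputer(), RangeScaler()])
train_ready = prep.fit(train).transform(train)
test_ready = prep.transform(test)  # same steps, same learned parameters
```

Adding, removing, or reordering a step now happens in exactly one place (the list passed to `Pipeline`), and the test set automatically goes through whatever the training set went through.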
