I cannot work out the benefits of a pipeline over a linear sequence of code instructions

I've looked at quite a number of 'how to create a pipeline' tutorials, but I have yet to see an explanation of the benefits over what I am showing below.

To keep my example code language-agnostic I'll use simple pseudo-code.

So what I've been doing in order to, for example, train a model is...

Create functions/methods

function get_data(parm1, parm2...)

function prepare_data(...)

function train_model(...)

function test_model(...)

Run functions/methods. This is what I mean by 'linear sequence of code instructions' in the title.

get_data(parm1, parm2...)
prepare_data(...)
train_model(...)
test_model(...)

Topic pipelines machine-learning

Category Data Science


In general, preprocessing must be done after the train/test split: always fit and transform on the training data, and only transform the test data. Fitting a preprocessing step on the full dataset lets information from the test data influence training, which is known as data leakage, and pipelines are used specifically to avoid it. A pipeline is a sequential application of a list of steps. Its purpose is to assemble several steps that can be cross-validated together while setting different parameters.
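To make the fit-on-train, transform-on-test rule concrete, here is a minimal sketch in plain Python. The classes `Standardize`, `ThresholdModel` and `SimplePipeline` are hypothetical names invented for this illustration, and the data is made up; libraries such as scikit-learn offer a `Pipeline` class built around the same fit/predict pattern, but this is not its implementation.

```python
from statistics import mean, pstdev


class Standardize:
    """Hypothetical scaling step: learns mean and std from the data given to fit."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # fall back to 1.0 if the variance is zero
        return self

    def transform(self, values):
        # Uses the stored statistics; never recomputes them.
        return [(v - self.mean_) / self.std_ for v in values]


class ThresholdModel:
    """Toy classifier (assumption for the example): predicts 1 for scaled values above zero."""

    def fit(self, values, labels):
        return self  # nothing to learn in this toy model

    def predict(self, values):
        return [1 if v > 0 else 0 for v in values]


class SimplePipeline:
    """Chains preprocessing steps and a final model behind one fit/predict interface."""

    def __init__(self, steps, model):
        self.steps = steps
        self.model = model

    def fit(self, values, labels):
        # Each step is fitted on the training data only, then applied to it.
        for step in self.steps:
            values = step.fit(values).transform(values)
        self.model.fit(values, labels)
        return self

    def predict(self, values):
        # New data is only transformed with what was learned during fit.
        for step in self.steps:
            values = step.transform(values)
        return self.model.predict(values)


train_x, train_y = [1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]
test_x = [10.0, 0.0]

scaler = Standardize()
pipe = SimplePipeline([scaler], ThresholdModel()).fit(train_x, train_y)
predictions = pipe.predict(test_x)

print(scaler.mean_)   # mean learned from the training data only
print(predictions)
```

Because the test data only ever passes through `predict`, it cannot influence the statistics learned by the scaler. The same property is what makes the whole chain safe to refit inside each cross-validation fold.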


As far as I know, in supervised ML the main advantage of a pipeline is that it ensures the exact same preprocessing steps are applied to the training set and the test set.

Preprocessing can be complex sometimes: it potentially includes representation, normalization, removing outliers or imputing missing values, possibly calculating new features or removing some, etc. By doing this separately for the training set and the test set, there's the usual risk of code duplication: if something is modified here and not there, this can introduce inconsistencies and bugs. Additionally, a model could easily go completely wrong if the test data is not represented in the same way as the training data, so this is especially critical in supervised learning.
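As a rough illustration of the inconsistency described above, the following sketch scales a training set and a test set separately, then scales them with parameters learned once on the training set. The helper names `fit_scaler` and `scale` and the numbers are assumptions made up for this example.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the (mean, std) learned from one dataset."""
    return mean(values), pstdev(values)


def scale(values, params):
    """Standardize values using previously learned (mean, std)."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0]
test = [3.0, 4.0, 5.0]

# Inconsistent: each set is scaled with its own statistics,
# so the raw value 3.0 ends up with two different representations.
train_scaled_separately = scale(train, fit_scaler(train))
test_scaled_separately = scale(test, fit_scaler(test))
print(train_scaled_separately[2], test_scaled_separately[0])

# Consistent: parameters are learned once on the training set and reused.
params = fit_scaler(train)
train_scaled = scale(train, params)
test_scaled = scale(test, params)
print(train_scaled[2], test_scaled[0])
```

In the first variant the raw value 3.0 is mapped to a positive number in the training set and a negative one in the test set, so a model trained on the former would misread the latter. A pipeline object removes the opportunity for this mistake by keeping the learned parameters and the transformation code in one place.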
