I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing, but I am not sure whether I should include data cleaning, data extraction and feature engineering steps that are typically more specific to the dataset I am working on. My general thinking is that the pre-processing phase would include operations on the data that need to be done after …
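One common pattern is to wrap even the dataset-specific cleaning as a transformer so that it travels with the rest of the pipeline and is re-applied consistently at predict time. A minimal sketch, assuming a pandas DataFrame input; the DatasetSpecificCleaner class and its clip_value parameter are hypothetical stand-ins for whatever cleaning the dataset needs:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    class DatasetSpecificCleaner(BaseEstimator, TransformerMixin):
        """Hypothetical dataset-specific cleaning step (here: clipping extreme values)."""
        def __init__(self, clip_value=1e6):
            self.clip_value = clip_value

        def fit(self, X, y=None):
            return self  # stateless cleaning: nothing to learn from the data

        def transform(self, X):
            return X.clip(upper=self.clip_value)  # assumes a pandas DataFrame

    pipe = Pipeline([
        ("clean", DatasetSpecificCleaner()),
        ("scale", StandardScaler()),
        ("model", LogisticRegression()),
    ])

Keeping the cleaning inside the Pipeline means cross-validation and grid search see exactly the same data flow as production scoring.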
I have a univariate time series (one value per time sample; sampling time: 66.66 microseconds, number of samples per sampling time = 151) coming from a scala customer. This time series contains time frames, each of which is 8K (frequencies) * 151 (time samples) per 0.5 sec (overall about 1.2288 million samples per half second). I need to find anomalies based on the different rows (frequencies) and report the rows (frequencies) which are anomalous, using an unsupervised learning method. Do you have an …
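One hedged way to frame this is to treat each frequency row of a 0.5 s frame as a single observation with its 151 time samples as features, and run an unsupervised detector such as IsolationForest over the rows. A minimal sketch on synthetic data; the contamination value is an assumption you would tune:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # One 0.5 s frame: 8192 frequency rows x 151 time samples (synthetic placeholder).
    rng = np.random.default_rng(0)
    frame = rng.normal(size=(8192, 151))

    # Each frequency row becomes one observation with 151 features.
    iso = IsolationForest(contamination=0.01, random_state=0)
    labels = iso.fit_predict(frame)             # -1 = anomalous row, 1 = normal
    anomalous_rows = np.where(labels == -1)[0]  # indices of the flagged frequencies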
According to the sklearn.pipeline.Pipeline documentation, the class whose instance is a pipeline element should implement fit() and transform(). I managed to create a custom class that has these methods and works fine with a single pipeline. Now I want to use that Pipeline object as the estimator argument for GridSearchCV. The latter requires the custom class to have a set_params() method, since I want to search over a range of custom instance parameters, as opposed to using a single instance of my …
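The usual way to get get_params()/set_params() without writing them by hand is to inherit from BaseEstimator and store every constructor argument verbatim on self. A minimal sketch; the class name and its threshold parameter are placeholders:

    from sklearn.base import BaseEstimator, TransformerMixin

    class MyTransformer(BaseEstimator, TransformerMixin):
        """Hypothetical custom step; BaseEstimator supplies get_params()/set_params()."""
        def __init__(self, threshold=0.5):
            # store constructor arguments unchanged so get_params() can discover them
            self.threshold = threshold

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X  # real logic would use self.threshold

GridSearchCV can then address the parameter through the pipeline with the step-prefixed name, e.g. 'mytransformer__threshold' when the step was created by make_pipeline.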
I've looked at quite a number of 'how to create a pipeline' instructions, but I have yet to see an explanation of the benefits over what I am showing below. To keep my example code agnostic I'll use simple pseudo-code. So what I've been doing in order to, for example, train a model is... Create functions/methods:

    function get_data(parm1, parm2, ...)
    function prepare_data(...)
    function train_model(...)
    function test_model(...)

Run the functions/methods <- this is what I mean by 'linear sequence of code instructions' in …
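For comparison, a sketch of the same linear sequence expressed as a single Pipeline estimator; the steps and synthetic data here are placeholders. The practical gain is that the whole sequence becomes one object with fit/predict, so cross-validation and grid search refit the preparation step inside every fold automatically instead of relying on the script to call everything in the right order:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-ins for get_data(...)
    X = np.random.randn(200, 4)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    pipe = Pipeline([
        ("prepare", StandardScaler()),     # stands in for prepare_data(...)
        ("model", LogisticRegression()),   # stands in for train_model(...)
    ])

    # One call covers "prepare -> train -> test" per fold, with no leakage of
    # preparation statistics from the validation folds into training.
    scores = cross_val_score(pipe, X, y, cv=5)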
I am trying to create automatic pipeline builder functionality that takes into account a large set of conditions, such as the existence of missing values, the scale of numerical features, etc., and automatically creates a scikit-learn pipeline instead of having to manually create one every time. I'm aware of the pipeline.steps.append() functionality that allows new pipeline steps to be assigned dynamically. However, it seems that it is not allowed to initialize an empty pipeline to start appending to; doing the following yields …
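A common workaround is to skip the empty-Pipeline idea entirely: collect the steps in a plain Python list based on the conditions, and construct the Pipeline once at the end. A minimal sketch; the builder function, its flags, and the chosen steps are hypothetical:

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    def build_pipeline(has_missing_values, needs_scaling):
        """Hypothetical builder: append conditionally to a list, build the Pipeline once."""
        steps = []
        if has_missing_values:
            steps.append(("impute", SimpleImputer(strategy="median")))
        if needs_scaling:
            steps.append(("scale", StandardScaler()))
        steps.append(("model", LogisticRegression()))
        return Pipeline(steps)

    pipe = build_pipeline(has_missing_values=True, needs_scaling=True)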
I am trying to build a pipeline in order to perform GridSearchCV to find the best parameters. I have already split the data into train and validation sets and have the following code:

    column_transformer = make_pipeline(
        (OneHotEncoder(categories=cols)),
        (OrdinalEncoder(categories=X["grade"])),
        "passthrough")
    imputer = SimpleImputer(strategy='median')
    scaler = StandardScaler()
    model = SGDClassifier(loss='log', random_state=42, n_jobs=-1, warm_start=True)
    pipeline_sgdlogreg = make_pipeline(imputer, column_transformer, scaler, model)

When I perform GridSearchCV I am getting the following error: "cannot use median strategy with non-numeric data (...)". I do not understand why am …
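The error message usually points at the first step: because the SimpleImputer comes before the encoders, it sees every column, including the non-numeric ones, and the median strategy cannot handle those. A hedged sketch of the usual fix, routing numeric and categorical columns through separate sub-pipelines with ColumnTransformer; the selectors and imputation strategies are assumptions to adapt:

    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])

    preprocess = ColumnTransformer([
        ("num", numeric_pipe, make_column_selector(dtype_include="number")),
        ("cat", categorical_pipe, make_column_selector(dtype_exclude="number")),
    ])

    pipeline_sgdlogreg = Pipeline([
        ("preprocess", preprocess),
        ("model", SGDClassifier(loss="log_loss", random_state=42)),  # loss="log" on older sklearn
    ])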
The goal of what I'm trying to accomplish here is to have the output contain all of the use_cols but the model only be built to calculate on categorical_features. The output will then be used to predict and compare the prediction 'REVIEW_ACTION' to the actual 'REVIEWER_ACTION'. Ignoring for the moment why this is A BAD THING TO DO, can we focus on how to achieve this? use_cols = ['FIRST_NAME', 'LAST_NAME', 'PERSON_STATUS', 'DIVISION_NAME', 'PERSON_TYPE', 'JOB_CHANGE', 'JOB_TRANSFER', 'IDENTIFY_DATE', 'SSO', 'USER_ID', 'ASSET_ID', 'ROLE', …
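One hedged way to get that behaviour is to let a ColumnTransformer drop everything except categorical_features before the estimator, and then attach the predictions back onto the full set of use_cols outside the pipeline. A sketch; the column names are taken from the question's use_cols, but which of them belong in categorical_features is an assumption, as are the step names and classifier:

    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical subset of use_cols that the model should actually learn from.
    categorical_features = ['PERSON_STATUS', 'DIVISION_NAME', 'PERSON_TYPE', 'ROLE']

    model = Pipeline([
        # every column outside categorical_features is dropped before the estimator
        ("encode", ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)],
            remainder="drop")),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # After fitting on df[use_cols] against 'REVIEWER_ACTION', the predictions can be
    # attached back so the output still carries every use_cols column:
    # out = df[use_cols].copy()
    # out['REVIEW_ACTION'] = model.predict(df[use_cols])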
I am working on a simple text generation problem with LSTMs. To make the preprocessing more compact and reproducible, I decided to implement everything in sklearn fashion, using custom sklearn transformers, and the KerasClassifier from scikeras to wrap the neural network definition in a sklearn-type estimator. It almost works but I can't figure out how to pass information from within a certain custom transformer on to the KerasClassifier estimator. More precisely, for the method that creates the neural network, I …
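Two hedged options are often used for this: let the model-building function ask scikeras for its meta dictionary of data-derived values, or compute the value on the transformer first and push it into the wrapper from outside via set_params with the model__ prefix. A sketch under those assumptions; the vocab_size parameter, step names and network are hypothetical:

    from scikeras.wrappers import KerasClassifier
    from tensorflow import keras

    def build_model(meta, vocab_size=1000):
        # scikeras fills `meta` with data-derived fields (e.g. "n_features_in_",
        # "n_classes_") when the function signature asks for it; vocab_size arrives
        # through the routed parameter model__vocab_size.
        model = keras.Sequential([
            keras.layers.Input(shape=(meta["n_features_in_"],)),
            keras.layers.Dense(meta["n_classes_"], activation="softmax"),
        ])
        model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
        return model

    clf = KerasClassifier(model=build_model, model__vocab_size=5000, epochs=1, verbose=0)

    # If the value only becomes known after fitting the custom transformer, it can be
    # set on the pipeline before calling fit (names here are hypothetical):
    # pipe = Pipeline([("vectorize", my_transformer), ("net", clf)])
    # pipe.set_params(net__model__vocab_size=my_transformer.vocab_size_)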
I have searched a lot for this issue but unfortunately came up with nothing. Usually in an ML model, during preprocessing, we use Pipelines and ColumnTransformer to group together preprocessing steps and the algorithm. Now the problem with Pipelines is that they perform the specified preprocessing on all the columns. For example, if I specify:

    pipeline = Pipeline(steps=[('scale', StandardScaler()), ('encode', OneHotEncoder())])

the above pipeline will apply StandardScaler to all the columns of the dataset and the encoder …
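The standard way to restrict each transformer to a subset of columns is to put them inside a ColumnTransformer and only then chain that with the model. A minimal sketch; the dtype-based selectors and the final estimator are placeholders:

    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    preprocess = ColumnTransformer([
        ("scale", StandardScaler(), make_column_selector(dtype_include="number")),
        ("encode", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_include=object)),
    ])

    pipeline = Pipeline([
        ("preprocess", preprocess),   # column-specific transformers
        ("model", LogisticRegression()),
    ])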
I have trained and saved a data processing pipeline and an LGBM regressor on 3 months of historical data. Now I know that I can retrain the LGBM regressor on new data every day by passing my trained model as init_model to the .train function. How do I retrain my sklearn pipeline that does the data processing using this new data? One way I can think of is to monitor the feature drift and retrain the pipeline on the latest 3 months of data …
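A hedged sketch of that rolling-window idea on synthetic data: most sklearn transformers have no incremental update, so the preprocessing pipeline is simply refit from scratch on the latest window, and the booster is then continued with init_model. The window length, parameters and variable names are assumptions:

    import lightgbm as lgb
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    recent = pd.DataFrame(rng.normal(size=(500, 5)), columns=list("abcde"))  # latest window
    today = pd.DataFrame(rng.normal(size=(50, 5)), columns=list("abcde"))    # new day's data
    y_recent, y_today = rng.normal(size=500), rng.normal(size=50)

    prep = Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())])
    prep.fit(recent)  # refit the preprocessing on the rolling window

    booster = lgb.train({"objective": "regression"},
                        lgb.Dataset(prep.transform(recent), label=y_recent))
    booster = lgb.train({"objective": "regression"},
                        lgb.Dataset(prep.transform(today), label=y_today),
                        init_model=booster)  # continue training from the saved model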
With a single standard interface (sklearn.pipeline) on top of different regressors, how do I use cross-validation? The example below uses two regressors with different internal cross-validation mechanisms, and I'm trying to figure out the "correct" way to do this without resorting to calling each differently.

    import catboost as cb
    import numpy as np
    from scikeras.wrappers import KerasRegressor
    import sklearn.pipeline
    import sklearn.preprocessing

    # *************************************
    # First, the common code
    # *************************************
    # Build x and y
    # y is a simple …
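One hedged answer is to not use either library's internal cross-validation at all and let sklearn drive it on the pipeline, since the pipeline exposes the same fit/predict interface regardless of the final regressor. A sketch with a placeholder regressor standing in for the CatBoost or KerasRegressor step:

    import numpy as np
    from sklearn.linear_model import Ridge   # placeholder for cb.CatBoostRegressor / KerasRegressor
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 5))
    y = x @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

    pipe = make_pipeline(StandardScaler(), Ridge())

    # The same call works for any final regressor wrapped in the pipeline;
    # the library-internal CV mechanisms are simply not used here.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(pipe, x, y, cv=cv, scoring="neg_root_mean_squared_error")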
I created a custom transformer class called Vectorizer() that inherits from sklearn's BaseEstimator and TransformerMixin classes. The purpose of this class is to provide vectorizer-specific hyperparameters (e.g.: ngram_range, vectorizer type: CountVectorizer or TfidfVectorizer) for the GridSearchCV or RandomizedSearchCV, to avoid having to manually rewrite the pipeline every time we believe a vectorizer of a different type or settings could work better. The custom transformer class looks like this:

    class Vectorizer(BaseEstimator, TransformerMixin):
        def __init__(self, vectorizer: Callable = CountVectorizer(), ngram_range: tuple = (1, 1)) -> None:
            super().__init__()
            self.vectorizer = …
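Assuming the rest of __init__ stores vectorizer and ngram_range on self unchanged, the search space can then be expressed against the pipeline step name, swapping whole vectorizer objects as well as their settings. A sketch; the step names, classifier and scoring are placeholders, and Vectorizer is the class from the question:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([
        ("vect", Vectorizer()),             # the custom transformer from the question
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    param_grid = {
        "vect__vectorizer": [CountVectorizer(), TfidfVectorizer()],
        "vect__ngram_range": [(1, 1), (1, 2)],
    }
    search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")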
Reading the following article: https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html There is an explanation of how to use from imblearn.pipeline import make_pipeline in order to perform cross-validation on an imbalanced dataset while avoiding data leakage. Here I copy the code used in the notebook linked by the article:

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45)
    rf = RandomForestClassifier(n_estimators=100, random_state=13)
    imba_pipeline = make_pipeline(SMOTE(random_state=42),
                                  RandomForestClassifier(n_estimators=100, random_state=13))
    cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)
    new_params = {'randomforestclassifier__' + key: params[key] for key in params}
    grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, …
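For completeness, a hedged usage sketch that continues the quoted code: the point of the imblearn pipeline is that SMOTE is applied only while fitting each training fold, so every validation fold (and the held-out test set) is scored on the original, un-resampled data.

    # Continues the variables defined in the snippet above (grid_imba, X_train, ...).
    grid_imba.fit(X_train, y_train)
    print(grid_imba.best_params_)
    y_pred = grid_imba.best_estimator_.predict(X_test)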
I'm aware of how to use sklearn.pipeline.Pipeline() for simple and slightly more complicated use cases alike. I know how to set up pipelines for homogeneous as well as heterogeneous data, in the latter case making use of sklearn.compose.ColumnTransformer(). Yet, in practical ML one must often experiment not only with a large set of model hyperparameters, but also with a large set of potential preprocessor classes and different estimators/models. My question is twofold: What would be the preferred way …
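One commonly used approach is to treat whole pipeline steps as search dimensions: GridSearchCV accepts complete estimator objects (or "passthrough") as parameter values, so preprocessors and models can be swapped within the same search. A sketch; the candidate steps and grids are placeholders:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    pipe = Pipeline([
        ("prep", StandardScaler()),       # placeholder, overridden by the grid
        ("model", LogisticRegression()),  # placeholder, overridden by the grid
    ])

    param_grid = [
        {"prep": [StandardScaler(), MinMaxScaler(), "passthrough"],
         "model": [LogisticRegression(max_iter=1000)],
         "model__C": [0.1, 1.0, 10.0]},
        {"prep": [StandardScaler(), "passthrough"],
         "model": [RandomForestClassifier()],
         "model__n_estimators": [100, 300]},
    ]
    search = GridSearchCV(pipe, param_grid, cv=5)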
I am using Pipeline and ColumnTransformer to preprocess the data. Basically I am using them to impute null values, scale the numerical data and finally perform OneHotEncoding. When I fit the ColumnTransformer object to my train and test data, the resulting output I get is an array where the column names are 1, 2, 3, 4, 5 and so on. Below is my code:

    cat_cols = [cname for cname in train_data1.columns if train_data1[cname].dtype == 'object']
    num_cols = [cname for cname in …
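To keep the column names, the fitted ColumnTransformer can report the names of the features it produced, which lets you rebuild a labelled DataFrame from the array output; recent sklearn versions can also return DataFrames directly via set_output. A self-contained sketch with placeholder data:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X_train = pd.DataFrame({"city": ["a", "b", "a"], "age": [20, 30, 40]})

    ct = ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(sparse_output=False), ["city"]),  # sparse=False on older sklearn
    ])

    # Option 1: rebuild a labelled DataFrame from the array output.
    arr = ct.fit_transform(X_train)
    out = pd.DataFrame(arr, columns=ct.get_feature_names_out(), index=X_train.index)

    # Option 2 (sklearn >= 1.2): ask the transformer to emit DataFrames directly.
    ct.set_output(transform="pandas")
    out2 = ct.fit_transform(X_train)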
I have an (unbalanced, binary data) pipeline model consisting of two pipelines (preprocessing and the actual model). Now I wanted to include SimpleImputer in my preprocessing pipeline, and because I don't want to apply it to all columns I used ColumnTransformer, but now I see that the performance with ColumnTransformer is a lot worse than with the sklearn pipeline (AUC before around 0.93, and with ColumnTransformer it's around 0.7). I filled the nan values before the pipeline to check if the …
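A hedged guess at the usual culprit: ColumnTransformer defaults to remainder='drop', so every column not listed under a transformer is silently removed and the model trains on far fewer features than before; the column order can also change relative to the original frame. A sketch of the check; cols_with_missing is a hypothetical name for the columns you impute:

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer

    cols_with_missing = ["feature_a", "feature_b"]  # hypothetical column names

    preprocess = ColumnTransformer(
        [("impute", SimpleImputer(strategy="median"), cols_with_missing)],
        remainder="passthrough",   # keep all other columns instead of dropping them
    )

    # Sanity check after fitting: the transformed width should match expectations.
    # preprocess.fit(X_train)
    # print(X_train.shape, preprocess.transform(X_train).shape)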
Is it possible to add a deep neural network model as the estimator/model in an sklearn Pipeline, or is that only possible for classical ML models as the estimator? For example, can I have a transformation pipeline (that consists of some imputers or encoders) followed by an LSTM or CNN model as the final estimator? If so, can someone guide me on how to go about creating something like that (using either resources or examples)?
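Yes, the usual route is to wrap the network in an sklearn-compatible wrapper such as scikeras's KerasClassifier/KerasRegressor and make it the last pipeline step. A hedged sketch, assuming scikeras passes its meta dictionary to the model-building function; the tiny dense network is only a placeholder for a real LSTM/CNN (those would also need the tabular output reshaped into sequences first):

    from scikeras.wrappers import KerasClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from tensorflow import keras

    def build_net(meta):
        model = keras.Sequential([
            keras.layers.Input(shape=(meta["n_features_in_"],)),
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(loss="binary_crossentropy", optimizer="adam")
        return model

    pipe = Pipeline([
        ("impute", SimpleImputer()),
        ("scale", StandardScaler()),
        ("net", KerasClassifier(model=build_net, epochs=5, verbose=0)),
    ])
    # pipe.fit(X_train, y_train) and pipe.predict(X_test) then work like any other estimator.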
I have created 2 classes, the first of which is:

    away_defencePressure_idx = 15

    class IterImputer(TransformerMixin):
        def __init__(self):
            self.imputer = IterativeImputer(max_iter=10)

        def fit(self, X, y=None):
            self.imputer.fit(X)
            return self

        def transform(self, X, y=None):
            imputed = self.imputer.transform(X)
            X['away_defencePressure'] = imputed[:, away_defencePressure_idx]
            return X

and the second one is:

    home_chanceCreationPassing_idx = 3

    class KneighborImputer(TransformerMixin):
        def __init__(self):
            self.imputer = KNNImputer(n_neighbors=1)

        def fit(self, X, y=None):
            self.imputer.fit(X)
            return self

        def transform(self, X, y=None):
            imputed = self.imputer.transform(X)
            X['home_chanceCreationPassing'] = imputed[:, home_chanceCreationPassing_idx]
            return X

When I put IterImputer() in a pipeline and …
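Two hedged observations that often matter for classes like these: inheriting from BaseEstimator as well (class IterImputer(BaseEstimator, TransformerMixin)) supplies get_params()/set_params(), which cloning inside Pipeline utilities and GridSearchCV relies on, and the two imputers can then be chained so each transform() receives the DataFrame returned by the previous step. A usage sketch reusing the question's classes:

    from sklearn.pipeline import Pipeline

    # Chains the question's two custom imputers; the second step sees the DataFrame
    # with 'away_defencePressure' already filled in by the first.
    pipe = Pipeline([
        ("iterative", IterImputer()),
        ("knn", KneighborImputer()),
    ])
    # imputed_df = pipe.fit_transform(df)   # df is the question's DataFrame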