scikit-learn

What are the differences between the below feature selection methods?

Niyaz

June 4, 2022, 20:48

Do the two snippets below do the same thing? If not, what are the differences?

    fs = RFE(estimator=RandomForestClassifier(), n_features_to_select=10)
    fs.fit(X, y)
    print(fs.support_)

    fs = RandomForestClassifier()
    fs.fit(X, y)
    print(fs.feature_importances_[:10])

Topic: scikit-learn feature-selection machine-learning

Category: Data Science
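
As a rough illustration of the difference being asked about, the sketch below runs both approaches on synthetic data. The dataset, `n_estimators` and `random_state` values are assumptions made for the example, not part of the question. RFE returns a boolean mask marking the 10 features it kept, while slicing `feature_importances_` simply returns the scores of the first 10 columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data (an assumption; the question does not show X or y).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# RFE refits the forest repeatedly and drops the weakest feature each round
# until exactly 10 features remain.
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=10,
)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask over all 20 features, 10 of them True

# A single forest fit yields one importance score per feature.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)
print(forest.feature_importances_[:10])  # scores of columns 0..9, not the 10 best
```

To pick the ten highest-scoring columns from a single fit, one would sort the importances (for example with `numpy.argsort`) rather than slice them.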

What is the shape of the vector after it passes through the TfidfVectorizer fit_transform() method?

Allan_Aj5

June 4, 2022, 20:02

I am trying to understand what happens inside the IDF part of the TF-IDF vectorizer. The official scikit-learn page says that the shape is (4, 9) for a corpus of 4 documents with 9 unique features. I understand the term frequency (TF) part: for each of the 4 documents we compute the frequency of each of the 9 unique features, so we get a matrix of shape (4, 9). But what does not make sense to me is the IDF …

Topic: scikit-learn nlp

Category: Data Science
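
A small sketch may help with the shapes mentioned in this question. The four-sentence corpus is an assumption chosen to give 4 documents and 9 unique terms. The IDF part is a vector with one weight per term, and each TF row is scaled by it element-wise, so the output keeps one row per document and one column per term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed toy corpus: 4 documents, 9 unique terms after tokenization.
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(X.shape)                # (number of documents, number of unique terms)
print(vectorizer.idf_.shape)  # one IDF weight per unique term
```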

Hyper-parameter tuning of Naive Bayes classifier

Sameer Zahid

June 4, 2022, 16:33

I'm fairly new to machine learning. I'm aware of the concept of hyper-parameter tuning of classifiers, and I've come across a couple of examples of this technique. However, I'm trying to use sklearn's Naive Bayes classifier for a task, and I'm not sure which parameter values I should try. What I want is something like this, but for the GaussianNB() classifier and not SVM:

    from sklearn import svm
    from sklearn.model_selection import GridSearchCV

    C = [0.05, 0.1, 0.2, 0.3, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
    gamma = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    kernel = ['rbf', 'linear']
    hyper = {'kernel': kernel, 'C': C, 'gamma': gamma}
    gd = GridSearchCV(estimator=svm.SVC(), param_grid=hyper, verbose=True)
    gd.fit(X, Y)
    print(gd.best_score_)
    print(gd.best_estimator_)
    …

Topic: hyperparameter-tuning naive-bayes-classifier hyperparameter scikit-learn machine-learning

Category: Data Science
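
One possible GaussianNB counterpart to the SVM grid in this question is sketched below. `GaussianNB` exposes few tunable parameters; `var_smoothing` is the usual one to search over. The iris dataset and the logarithmic grid from 1e-12 to 1 are assumptions chosen only to make the example runnable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Stand-in data (an assumption; the question's X and Y are not shown).
X, Y = load_iris(return_X_y=True)

# var_smoothing adds a fraction of the largest feature variance to all
# variances for numerical stability; a log-spaced grid is a common choice.
hyper = {"var_smoothing": np.logspace(-12, 0, num=13)}

gd = GridSearchCV(estimator=GaussianNB(), param_grid=hyper, cv=5)
gd.fit(X, Y)
print(gd.best_score_)
print(gd.best_estimator_)
```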

A question about the activation function in my neural network project

First Second

June 4, 2022, 16:04

I want to implement a neural network model using scikit-learn, and I want to know which activation function I should use. I have 10 input variables and one output. All variables are positive floats, and the output is a percentage (0 to 100). My model is not linear with respect to the output variable, so I'll create a regression model with one hidden layer.

Topic: activation-function scikit-learn python

Category: Data Science
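
A minimal sketch of the setup described in this question, under stated assumptions: the data is random, the target formula is hypothetical, and `relu` with 16 hidden units is just one reasonable starting point. In `MLPRegressor` the `activation` parameter applies to the hidden layer only; the output layer uses the identity function, so predictions are clipped here to stay within 0 to 100.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(300, 10))         # 10 positive float inputs
y = 100.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1])))   # hypothetical target in [0, 100]

model = make_pipeline(
    StandardScaler(),  # neural networks usually train better on scaled inputs
    MLPRegressor(
        hidden_layer_sizes=(16,),  # one hidden layer
        activation="relu",         # hidden-layer activation; "tanh" is another option
        max_iter=2000,
        random_state=0,
    ),
)

model.fit(X, y / 100.0)  # train on a 0-1 scale
pred = np.clip(model.predict(X), 0.0, 1.0) * 100.0  # back to a percentage

print(model[-1].out_activation_)  # output-layer activation chosen by scikit-learn
```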

Can a custom Transformer be used to transform X and y?

fho

June 3, 2022, 09:17

I am working with time series in sklearn. My goal is to have a Pipeline step that replaces each row with a window centered on that row (think convolution). My problem is that I need all rows (even unlabeled ones) in order to create the windows, but during fitting I want to drop all unlabeled rows. This requires access to both X and y in the transform process. Can this be done with a custom Transformer? From what I …

Topic: scikit-learn

Category: Data Science

Why is my training accuracy decreasing with higher degrees of polynomial features?

Apoorv Jain

June 3, 2022, 00:10

I am new to machine learning and started solving the Titanic Survivor problem on Kaggle. While solving the problem using logistic regression, I tried various models with polynomial features of degrees $2, 3, 4, 5, 6$. Theoretically, the accuracy on the training set should increase with the degree; however, it started decreasing after degree $2$. The graph is shown below.

Topic: classifier logistic-regression accuracy scikit-learn

Category: Data Science

Does it make sense to scale input data with random forest regressor taking two different arrays as input?

Jérémy Talbot-Pâquet

June 2, 2022, 23:42

I am exploring random forest regressors using sklearn by trying to predict the returns of a stock based on the past hour's data. I have two inputs: the return (% of change) and the volume of the stock for the last 50 minutes. My output is the predicted price for the next 10 minutes. Here is an example of input data:

          Return      Volume
    0   0.000420  119.447233
    1  -0.001093   86.455629
    2   0.000277  117.940777
    3   0.000256   38.084008
    4   0.001275   74.376315
    ...
    45 …

Topic: feature-scaling random-forest scikit-learn

Category: Data Science

train_test_split with stratify integer overflow

tk78

June 2, 2022, 23:07

I'm trying to do a stratified split for a skewed dataset with target variable 'b'. The target variable is a bit value (either 0 or 1). Here's an example:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': 0})
    df.loc[np.random.randint(0, 100000, 1000), 'b'] = 1
    tr, ts = train_test_split(df, test_size=.2, stratify=df['b'])
    print(tr.shape, ts.shape)

This code returns the following:

    (93105, 2) (38, 2)

My problem is that the returned train/test arrays do not meet the set split ratio of 20%. My setup: Python 3.7.0 (32bit) …

Topic: scikit-learn python

Category: Data Science
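
For comparison with the output reported in this question, here is a sketch of the same kind of stratified split written with plain NumPy arrays instead of a DataFrame (a simplification made for this example). It is only a sanity check of what a correctly working split looks like, assuming a 64-bit Python build, and is not a diagnosis of the 32-bit behavior.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
a = rng.random(100_000)
b = np.zeros(100_000, dtype=int)
# Mark exactly 1000 distinct rows as positive (1% of the data).
b[rng.choice(100_000, size=1000, replace=False)] = 1

a_tr, a_ts, b_tr, b_ts = train_test_split(
    a, b, test_size=0.2, stratify=b, random_state=0
)

print(a_tr.shape, a_ts.shape)    # sizes of the train and test parts
print(b_tr.mean(), b_ts.mean())  # share of positives in each part
```

With stratification, both parts should keep roughly the same share of positives as the full dataset.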

How to train LGBMClassifier using optuna

Kyv

June 2, 2022, 12:43

I am trying to use lgbm with optuna for a classification task. Here is my model.

    from optuna.integration import LightGBMPruningCallback
    import optuna.integration.lightgbm as lgbm
    import optuna

    def objective(trial, X_train, y_train, X_test, y_test):
        param_grid = {
            # "device_type": trial.suggest_categorical("device_type", ['gpu']),
            "n_estimators": trial.suggest_categorical("n_estimators", [10000]),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
            "max_depth": trial.suggest_int("max_depth", 3, 12),
            "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 100, 10000, step=1000),
            "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5),
            "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
            "bagging_fraction": trial.suggest_float(
                "bagging_fraction", 0.2, 0.95, step=0.1
            ),
            "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
            …

Topic: multiclass-classification scikit-learn python

Category: Data Science

Improving prediction accuracy with XGBoost

ZJAY

June 2, 2022, 05:00

I have a 32x20 matrix for which I am trying to use XGBoost (Regression). I am looping through rows to produce an out of sample forecast. I'm surprised that XGBoost only returns an out of sample error (MAPE) of 3-4%. When I run the data through other algorithms (glmboost, boosted linear model), I get MAPEs around 1.8-2.5%. I'm surprised XGBoost is so deficient. I suspect I am under-optimizing hyperparameters. I include a gridsearch, which I ran below, but the error …

Topic: xgboost scikit-learn

Category: Data Science

How to assign categorical values back to train and test data after training and testing using inverse_transform?

Nithin Reddy

June 2, 2022, 04:02

How do I assign the categorical values back to the train and test data after training and testing, using inverse_transform? The training and testing data will have encoded numerical values. So how do I restore the original categorical values of those variables in the train and test datasets afterwards? Please help me with this.

Topic: feature-engineering scikit-learn machine-learning

Category: Data Science
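
The round trip asked about in this question can be sketched with `OrdinalEncoder`, assuming the categorical columns were encoded with it (the question does not say which encoder was used, and the color/size data is made up). The same fitted encoder that produced the numbers is used to map them back.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical columns: color and size.
train = np.array(
    [["red", "S"], ["blue", "M"], ["green", "L"], ["blue", "S"]], dtype=object
)
test = np.array([["green", "M"], ["red", "L"]], dtype=object)

encoder = OrdinalEncoder()
train_enc = encoder.fit_transform(train)  # learn the categories on train only
test_enc = encoder.transform(test)        # reuse the same mapping on test

# ... train and evaluate a model on train_enc / test_enc here ...

# Map the numeric codes back to the original category values.
train_back = encoder.inverse_transform(train_enc)
test_back = encoder.inverse_transform(test_enc)
print(test_back)
```

The key point is to keep the fitted encoder object around; a freshly created encoder has no mapping to invert.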

Retrieve image from features represented by histograms of oriented gradients

Cordylus

June 1, 2022, 16:55

I am using histograms of oriented gradients (HOG) for image classification using clustering in scikit-learn. I am using hog from scikit-image to generate HOG features from a 512x512 grayscale image. Here is an example:

    fd, hog_image = hog(image, orientations=8, pixels_per_cell=(16, 16), cells_per_block=(1, 1), visualize=True, channel_axis=-1)

Here fd is used as the features in classification. I wonder if there is a way to retrieve an image from the fitted coefficients in the clustering model, in order to see how features differ between the clusters (i.e., go from fd …

Topic: image-preprocessing hog scikit-learn python clustering

Category: Data Science

Using PCA as features for production

Humpalum Druf

June 1, 2022, 04:04

I struggle with figuring out how to take PCA into production in order to test my models with unknown samples. I'm using both a one-hot encoding and a TF-IDF encoding in order to classify my elements with various models, mainly KNN. I know I can use the pretrained one-hot encoder and the TF-IDF encoder to encode the new elements so that they match the final feature vector. Since these feature vectors become very large, I use a PCA in …

Topic: feature-reduction pca scikit-learn feature-selection

Category: Data Science
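
One common pattern for the situation in this question is to fit PCA once on the training features and then only call `transform` on new samples, most simply by keeping PCA and the model in one `Pipeline`. The sketch below uses random dense features in place of the one-hot and TF-IDF output, and the component and neighbor counts are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_train = rng.random((100, 50))         # stand-in for the encoded training features
y_train = rng.integers(0, 2, size=100)  # stand-in labels
X_new = rng.random((5, 50))             # unseen samples, encoded the same way

pipe = Pipeline([
    ("pca", PCA(n_components=10)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X_train, y_train)  # PCA components are learned here, once

# New samples are projected with the already fitted PCA, never refitted.
reduced_new = pipe.named_steps["pca"].transform(X_new)
print(reduced_new.shape)
print(pipe.predict(X_new))
```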

IterativeImputer Evaluation

candy bird

May 31, 2022, 09:53

I am having a hard time evaluating my imputation model. I used an iterative imputer to fill in the missing values in all four columns. As the estimator for the iterative imputer I am using a random forest model. Here is my code for imputing:

    imp_mean = IterativeImputer(estimator=RandomForestRegressor(), random_state=0)
    imp_mean.fit(my_data)
    my_data_filled = pd.DataFrame(imp_mean.transform(my_data))
    my_data_filled.head()

My problem is how I can evaluate my model. How can I know if the filled values are right? I used a describe function before …

Topic: wikipedia evaluation scikit-learn pandas python

Category: Data Science
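
One way to approach the evaluation asked about here is to hide some values that are actually known, impute them, and compare against the truth. The sketch below does this on synthetic data with four correlated columns; the data, the 10% masking rate, and the reduced `n_estimators` and `max_iter` settings are all assumptions made to keep the example small.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Four correlated columns, fully observed (the ground truth).
complete = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

# Hide roughly 10% of the known entries.
mask = rng.random(complete.shape) < 0.1
with_gaps = complete.copy()
with_gaps[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
filled = imputer.fit_transform(with_gaps)

# Error measured only on the entries that were hidden on purpose.
rmse = np.sqrt(np.mean((filled[mask] - complete[mask]) ** 2))
print(rmse)
```

On real data, the same idea applies to rows that have no missing values: mask some of their entries, impute, and score the result.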

XGBClassifier's predictions are not probabilities with objective='binary:logistic'

João Bravo

May 30, 2022, 18:55

I am using XGBoost's XGBClassifier with a binary 0-1 target, and I am trying to define a custom metric function. According to the XGBoost tutorials, it receives an array of predictions and a DMatrix with the training set. I have used objective='binary:logistic' in order to get probabilities, but the prediction values passed to the custom metric function are not between 0 and 1. They can range, for example, between -3 and 5, and the range of values seems to grow …

Topic: metric probability xgboost scikit-learn classification

Category: Data Science

Label Encoding for Target Classes: Any Integer or Consecutive Integers from Zero?

Hendrik

May 30, 2022, 12:40

I'm handling a very conventional supervised classification task with three (mutually exclusive) target categories (not ordinal ones): class1 class2 class2 class1 class3 And so on. Actually, in the raw dataset the categories are already represented with integers, not strings like in my example, but with randomly assigned ones: 5 99 99 5 27 I'm wondering whether it is required/recommended to re-assign zero-based sequential integers to the classes as labels instead of the ones above, like this: 0 1 1 0 2 …

Topic: supervised-learning scikit-learn classification python machine-learning

Category: Data Science
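
If consecutive labels are wanted for the integers in this question, `LabelEncoder` is one way to get them, sketched below. Note that it numbers the classes in sorted order of the original labels, so the result differs from the order-of-appearance numbering in the question. scikit-learn classifiers generally accept arbitrary integer labels as they are; some other libraries may expect consecutive integers starting at zero.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

raw = np.array([5, 99, 99, 5, 27])  # the arbitrary integer labels from the question

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw)

print(encoder.classes_)                    # original labels in sorted order
print(encoded)                             # zero-based consecutive codes
print(encoder.inverse_transform(encoded))  # back to the original labels
```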

Need an explanation of the for loop in the DBSCAN algorithm demo

soufi-43

May 30, 2022, 10:01

In the following code of the DBSCAN algorithm, as a beginner I need an explanation of what happens to the data in the bottom for loop, and why.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn import metrics
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    # Generate sample data
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
    X = StandardScaler().fit_transform(X)

    # Compute DBSCAN
    db = DBSCAN(eps=0.3, min_samples=10).fit(X)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    …

Topic: matplotlib dbscan scikit-learn python

Category: Data Science

Is it good practice to include data cleaning or feature engineering steps in an sklearn pipeline to create a scalable pipeline?

LazyEval

May 30, 2022, 00:03

I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing but I am not sure if I should include data cleaning, data extraction and feature engineering steps that are typically more specific to the dataset I am working on. My general thinking is that the pre-processing phase would include operations on the data that need to be done after …

Topic: pipelines preprocessing scikit-learn python data-cleaning

Category: Data Science
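
As one illustration of the option discussed in this question, a dataset-specific cleaning rule can be wrapped in `FunctionTransformer` and placed in front of the generic pre-processing steps. The cleaning rule, the function name `invalidate_negatives`, and the random data are all hypothetical; this is a sketch of the mechanism, not a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def invalidate_negatives(X):
    """Hypothetical cleaning rule: treat negative readings as missing."""
    return np.where(X < 0, np.nan, X)


rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, size=(100, 3))  # made-up raw measurements
y = (X[:, 0] > 1.0).astype(int)         # made-up binary target

pipe = Pipeline([
    ("clean", FunctionTransformer(invalidate_negatives)),  # dataset-specific step
    ("impute", SimpleImputer(strategy="median")),          # generic pre-processing
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
pred = pipe.predict(X)
print(pred[:10])
```

Keeping the cleaning step inside the pipeline means it is applied identically at training and prediction time, at the cost of tying the pipeline more closely to one dataset.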

How to combine NLP and numeric data for a linear regression problem

davidm

May 29, 2022, 23:02

I'm very new to data science (this is my hello world project), and I have a data set made up of a combination of review text and numerical data such as number of tables. There is also a column for reviews which is a float (avg of all user reviews for that restaurant). So a row of data could be like: { rating: 3.765, review: `Food was great, staff was friendly`, tables: 30, staff: 15, parking: 20 ... } So …

Topic: tfidf linear-regression scikit-learn nlp

Category: Data Science
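
A possible way to combine the text and numeric columns from this question is a `ColumnTransformer` that applies TF-IDF to the review text and scaling to the numeric columns before a linear regression. In the sketch below only the first row comes from the question; the other rows are made up so the example can run.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "review": [
        "Food was great, staff was friendly",  # from the question
        "Slow service and cold food",          # made-up rows below
        "Great parking, average food",
        "Friendly staff but noisy",
    ],
    "tables": [30, 12, 45, 20],
    "staff": [15, 6, 22, 9],
    "parking": [20, 5, 60, 10],
    "rating": [3.765, 2.1, 3.9, 3.2],
})

features = ColumnTransformer([
    # A single column name (not a list) passes 1-D text to the vectorizer.
    ("text", TfidfVectorizer(), "review"),
    ("num", StandardScaler(), ["tables", "staff", "parking"]),
])

model = Pipeline([("features", features), ("reg", LinearRegression())])
model.fit(df.drop(columns="rating"), df["rating"])
pred = model.predict(df.drop(columns="rating"))
print(pred)
```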

For multi-class classification in SGDClassifier how do I tell if it is using one-vs-rest or one-vs-one by default?

Ryan

May 28, 2022, 22:01

According to the Géron book, for multi-class classification, SGDClassifier in scikit-learn uses one-vs-rest. But how can I tell which one is used? The help file doesn't appear to give this information.

Topic: multiclass-classification scikit-learn

Category: Data Science
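
One empirical way to check the scheme asked about here is to count the binary classifiers that were fitted. With 4 classes, one-vs-rest would produce 4 weight vectors, whereas one-vs-one would produce 6 (one per pair of classes). The synthetic dataset below is an assumption made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic 4-class problem (an assumption for illustration).
X, y = make_classification(
    n_samples=400,
    n_features=8,
    n_informative=6,
    n_redundant=0,
    n_classes=4,
    random_state=0,
)

clf = SGDClassifier(random_state=0).fit(X, y)

print(clf.coef_.shape)                     # (number of binary classifiers, n_features)
print(clf.decision_function(X[:1]).shape)  # one score per binary classifier
```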
