What is the shape of the vector after it passes through the TfidfVectorizer fit_transform() method?

I am trying to understand what happens inside the IDF part of the TF-IDF vectorizer. The official scikit-learn page says that the shape is (4, 9) for a corpus of 4 documents having 9 unique features. I get the Term Frequency (TF) part: it makes sense to me that for every unique feature (9) and each document (4) we calculate each term's frequency, so we get a matrix of shape (4, 9). But what does not make sense to me is the IDF …
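A minimal sketch (reusing the corpus from the scikit-learn documentation) that reproduces the (4, 9) shape; the IDF step only reweights the columns, it never changes the shape:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
print(X.shape)                      # (4, 9): 4 documents x 9 unique terms
print(vec.get_feature_names_out()) # the 9 vocabulary terms (the columns)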
Category: Data Science

Hyper-parameter tuning of NaiveBayes Classifier

I'm fairly new to machine learning. I'm aware of the concept of hyper-parameter tuning of classifiers and have come across a couple of examples of this technique. However, I'm trying to use the NaiveBayes classifier from sklearn for a task, but I'm not sure which parameter values I should try. What I want is something like this, but for the GaussianNB() classifier and not SVM:

from sklearn.model_selection import GridSearchCV

C = [0.05, 0.1, 0.2, 0.3, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
gamma = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
kernel = ['rbf', 'linear']
hyper = {'kernel': kernel, 'C': C, 'gamma': gamma}
gd = GridSearchCV(estimator=svm.SVC(), param_grid=hyper, verbose=True)
gd.fit(X, Y)
print(gd.best_score_)
print(gd.best_estimator_)
…
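For GaussianNB the search space looks quite different from the SVM one: it has essentially a single tunable parameter, var_smoothing. A hedged sketch (the logspace range below is a common choice, not a fixed rule):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# var_smoothing adds a fraction of the largest feature variance for stability.
param_grid = {"var_smoothing": np.logspace(0, -9, num=100)}
gd = GridSearchCV(estimator=GaussianNB(), param_grid=param_grid, verbose=True)
gd.fit(X, Y)  # X, Y as in the snippet above
print(gd.best_score_)
print(gd.best_estimator_)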
Category: Data Science

A question about activation functions for my neural network project

I want to implement a neural network model using scikit-learn, and I want to know which activation function I should use. I have 10 input variables and one output. All input variables are floats (positive), and the output is a percentage (0 to 100). My model is not linear in the output variable, so I'll create a regression model with one hidden layer.
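A minimal sketch of such a model with scikit-learn's MLPRegressor (data shapes assumed: X with 10 columns, y holding the percentages). 'relu' is a common hidden-layer choice; note that MLPRegressor always applies an identity (linear) activation on the output layer:

from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling the positive float inputs helps the solver converge.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32,), activation="relu", max_iter=1000),
)
model.fit(X, y)  # X: (n_samples, 10), y: percentages in [0, 100], both assumed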
Category: Data Science

Can a custom Transformer be used to transform X and y?

I am working with time series in sklearn, my goal is to have a Pipeline step that replaces each row with a window centered on that row (think convolution). My problem here is that I need all rows (even unlabeled ones) in order to create the windows, but during fitting I want to drop all unlabeled rows. This requires access to both X and y in the transform process. Can this be done with a custom Transformer? From what I …
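A plain sklearn transformer cannot alter y inside a Pipeline, but imbalanced-learn's FunctionSampler can resample both X and y at fit time only, which matches the asked-for behaviour. A hedged sketch, assuming unlabeled rows are marked with y == -1:

from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, not sklearn's
from sklearn.ensemble import RandomForestClassifier

# Drop unlabeled rows; samplers run during fit only, so all rows are kept
# (and windows can be built) when the pipeline is used for prediction.
def drop_unlabeled(X, y):
    mask = y != -1  # assumption: -1 marks an unlabeled row
    return X[mask], y[mask]

pipe = Pipeline([
    ("drop_unlabeled", FunctionSampler(func=drop_unlabeled)),
    ("model", RandomForestClassifier()),
])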
Topic: scikit-learn
Category: Data Science

Why is my training accuracy decreasing with higher degrees of polynomial features?

I am new to Machine Learning and started solving the Titanic Survivor problem on Kaggle. While solving the problem using Logistic Regression, I used various models having polynomial features with degrees $2, 3, 4, 5, 6$. Theoretically, the accuracy on the training set should increase with degree; however, it started decreasing past degree $2$. The graph is shown below.
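One likely culprit worth checking (a hedged sketch; X, y stand for the Titanic features and labels): without feature scaling, high-degree polynomial terms can keep the logistic solver from converging, which shows up as lower training accuracy rather than anything theoretical.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

for degree in [2, 3, 4, 5, 6]:
    model = make_pipeline(
        PolynomialFeatures(degree=degree),
        StandardScaler(),                      # scale the blown-up features
        LogisticRegression(max_iter=5000),     # give the solver room to converge
    )
    model.fit(X, y)
    print(degree, model.score(X, y))           # training accuracy per degree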
Category: Data Science

Does it make sense to scale input data for a random forest regressor taking two different arrays as input?

I am exploring Random Forest regressors in sklearn by trying to predict the returns of a stock based on the past hour's data. I have two inputs: the return (% change) and the volume of the stock for the last 50 minutes. My output is the predicted price for the next 10 minutes. Here is an example of input data:

        Return      Volume
0     0.000420  119.447233
1    -0.001093   86.455629
2     0.000277  117.940777
3     0.000256   38.084008
4     0.001275   74.376315
...
45 …
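For what it's worth, tree ensembles split on feature thresholds, so they are insensitive to monotonic rescaling; a small sketch on synthetic two-column data (an assumption, mirroring the return/volume scales) illustrates this:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * [0.001, 100.0]   # return-like and volume-like scales
y = 3 * X[:, 0] + rng.normal(scale=0.0005, size=200)

scaler = StandardScaler().fit(X)
raw = RandomForestRegressor(random_state=0).fit(X, y)
scaled = RandomForestRegressor(random_state=0).fit(scaler.transform(X), y)
print(raw.predict(X[:3]))
print(scaled.predict(scaler.transform(X[:3])))   # effectively identical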
Category: Data Science

train_test_split with stratify integer overflow

I'm trying to do a stratified split for a skewed dataset with target variable 'b'. The target variable is a bit value (either 0 or 1). Here's an example:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': 0})
df.loc[np.random.randint(0, 100000, 1000), 'b'] = 1
tr, ts = train_test_split(df, test_size=.2, stratify=df['b'])
print(tr.shape, ts.shape)

This code returns the following:

(93105, 2) (38, 2)

My problem is that the returned train/test arrays do not meet the set split ratio of 20%. My setup: Python 3.7.0 (32bit) …
Category: Data Science

How to train LGBMClassifier using optuna

I am trying to use LightGBM with Optuna for a classification task. Here is my model:

from optuna.integration import LightGBMPruningCallback
import optuna.integration.lightgbm as lgbm
import optuna

def objective(trial, X_train, y_train, X_test, y_test):
    param_grid = {
        # "device_type": trial.suggest_categorical("device_type", ['gpu']),
        "n_estimators": trial.suggest_categorical("n_estimators", [10000]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 100, 10000, step=1000),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.2, 0.95, step=0.1),
        "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
…
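For reference, a minimal end-to-end sketch of wiring an objective into a study (a reduced search space and a plain lightgbm.LGBMClassifier are assumptions; X_train, y_train, X_test, y_test as in the signature above):

import optuna
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 20, 300),
    }
    model = LGBMClassifier(**params)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)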
Category: Data Science

Improving prediction accuracy with XGBoost

I have a 32x20 matrix for which I am trying to use XGBoost (regression). I am looping through the rows to produce an out-of-sample forecast. I'm surprised that XGBoost only achieves an out-of-sample error (MAPE) of 3-4%. When I run the data through other algorithms (glmboost, boosted linear model), I get MAPEs around 1.8-2.5%. I'm surprised XGBoost lags so far behind. I suspect I am under-optimizing the hyperparameters. I include the grid search I ran below, but the error …
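A hedged sketch of one such grid search (the grid values are illustrative, not a recommendation; X, y stand for the 32x20 matrix and its target). With only 32 rows, shallow trees and strong regularization are usually where the gains are:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "max_depth": [2, 3, 4],
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 1.0],
}
gs = GridSearchCV(XGBRegressor(), param_grid,
                  scoring="neg_mean_absolute_percentage_error", cv=3)
gs.fit(X, y)
print(gs.best_params_, -gs.best_score_)  # best MAPE found on the grid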
Category: Data Science

How to assign categorical values back to train and test data after training and testing using inverse_transform?

After encoding, the train and test data contain numerical values in place of the original categories. How can I assign the original categorical values back to those variables in the train and test datasets after training and testing, using inverse_transform? Please help me with this.
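A minimal sketch of the usual pattern, assuming pandas DataFrames train/test and a hypothetical categorical column 'city': keep a reference to each fitted encoder and call its inverse_transform afterwards.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train["city_encoded"] = le.fit_transform(train["city"])  # 'city' is a hypothetical column
test["city_encoded"] = le.transform(test["city"])

# ... train and evaluate the model on the encoded columns ...

# Map the integer codes back to the original category strings.
train["city"] = le.inverse_transform(train["city_encoded"])
test["city"] = le.inverse_transform(test["city_encoded"])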
Category: Data Science

Retrieve image from features represented by histograms of oriented gradients

I am using histograms of oriented gradients (HOG) for image classification using clustering in scikit-learn. I am using hog from scikit-image to generate the HOG features from a 512x512 grayscale image. Here is an example:

fd, hog_image = hog(image, orientations=8, pixels_per_cell=(16, 16),
                    cells_per_block=(1, 1), visualize=True)

(For a grayscale image, channel_axis should be left at its default of None.) fd is used as the features in classification. I wonder if there is a way to retrieve an image from the fitted coefficients in the clustering model, in order to see how the features differ between the clusters (i.e., go from fd …
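Exactly inverting fd back to an image is not something scikit-image supports, since HOG is lossy. One hedged alternative (fds and hog_images are assumed arrays holding each image's descriptor and its visualize=True rendering): display the hog_image of the real sample whose descriptor lies nearest each cluster centre.

from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# Cluster the HOG descriptors, then pick a representative sample per cluster.
kmeans = KMeans(n_clusters=5, random_state=0).fit(fds)
nearest = pairwise_distances_argmin(kmeans.cluster_centers_, fds)
for k, idx in enumerate(nearest):
    print(f"cluster {k}: representative sample {idx}")  # then plot hog_images[idx]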
Category: Data Science

Using PCA features in production

I struggle with figuring out how to take PCA into production in order to test my models on unknown samples. I'm using both a One-Hot-Encoding and a TF-IDF in order to classify my elements with various models, mainly KNN. I know I can use the pretrained One-Hot-Encoder and the TF-IDF encoder to encode new elements so that they match the final feature vector. Since these feature vectors become very large, I use a PCA in …
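A sketch of the usual pattern: persist the fitted PCA alongside the encoders and call only transform() at prediction time (n_components and the variable names are assumptions; note that for sparse one-hot/TF-IDF matrices, TruncatedSVD is the scikit-learn alternative to PCA).

import joblib
from sklearn.decomposition import PCA

# Fit once on the encoded training features, then persist.
pca = PCA(n_components=100)                    # n_components is an assumption
X_train_reduced = pca.fit_transform(X_train_encoded)
joblib.dump(pca, "pca.joblib")

# ... later, in production ...
pca = joblib.load("pca.joblib")
X_new_reduced = pca.transform(X_new_encoded)   # same 100-dim space as training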
Category: Data Science

IterativeImputer Evaluation

I am having a hard time evaluating my imputation model. I used an IterativeImputer to fill in the missing values in all four columns, with a Random Forest model as its estimator. Here is my code for imputing:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

imp_mean = IterativeImputer(estimator=RandomForestRegressor(), random_state=0)
imp_mean.fit(my_data)
my_data_filled = pd.DataFrame(imp_mean.transform(my_data))
my_data_filled.head()

My problem is how to evaluate this model. How can I know whether the filled values are right? I used a describe function before …
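One common evaluation sketch, under the assumption that complete is a float array holding the rows of my_data with no missing values: hide a fraction of the known entries, impute, and score the imputer against the hidden truth.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
masked = complete.copy()
holes = rng.random(masked.shape) < 0.1   # hide ~10% of the known entries
masked[holes] = np.nan

imp = IterativeImputer(estimator=RandomForestRegressor(), random_state=0)
filled = imp.fit_transform(masked)
rmse = mean_squared_error(complete[holes], filled[holes]) ** 0.5
print(rmse)                              # error on the deliberately hidden entries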
Category: Data Science

XGBClassifier's predictions are not probabilities with objective='binary:logistic'

I am using XGBoost's XGBClassifier with a binary 0-1 target, and I am trying to define a custom metric function. According to the XGBoost tutorials, it receives an array of predictions and a DMatrix with the training set. I have used objective='binary:logistic' in order to get probabilities, but the prediction values passed to the custom metric function are not between 0 and 1. They can be, say, between -3 and 5, and the range of values seems to grow …
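A sketch matching this behaviour on older releases, where feval hands the metric raw margin scores (dtrain is an assumed DMatrix; from XGBoost 1.6 the replacement argument custom_metric already receives transformed probabilities for built-in objectives):

import numpy as np
import xgboost as xgb

def custom_metric(preds, dtrain):
    # With feval, preds are raw margins; apply the logistic function to
    # recover probabilities in (0, 1) before computing the metric.
    labels = dtrain.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))
    return "err", float(np.mean((probs > 0.5) != labels))

# dtrain is assumed to be an existing xgb.DMatrix of the training data.
bst = xgb.train({"objective": "binary:logistic"}, dtrain,
                evals=[(dtrain, "train")], feval=custom_metric)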
Category: Data Science

Label Encoding for Target Classes: Any Integers or Consecutive Integers from Zero?

I'm handling a very conventional supervised classification task with three (mutually exclusive, non-ordinal) target categories:

class1
class2
class2
class1
class3

And so on. Actually, in the raw dataset the categories are already represented with integers, not strings as in my example, but randomly assigned ones:

5
99
99
5
27

I'm wondering whether it is required/recommended to re-assign zero-based sequential integers to the classes as labels instead of the ones above, like this:

0
1
1
0
2
…
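Most scikit-learn estimators accept arbitrary integer labels, but some libraries (XGBoost, for one) expect consecutive classes from zero; LabelEncoder gives a reversible mapping either way. A minimal sketch:

from sklearn.preprocessing import LabelEncoder

y_raw = [5, 99, 99, 5, 27]
le = LabelEncoder()
y = le.fit_transform(y_raw)
print(y)                         # [0 2 2 0 1] -- classes are sorted before encoding
print(le.inverse_transform(y))   # [ 5 99 99  5 27]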
Category: Data Science

Need an explanation of the for loop in the DBSCAN algorithm demo

In the following code of the DBSCAN algorithm demo, as a beginner I need an explanation of what happens to the data in the bottom for loop, and why.

# Generate sample data
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
…
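The truncated part is the demo's plotting loop. A hedged paraphrase of what it does (plus the line that marks core samples, which the mask above is built for): it walks over every cluster id returned by DBSCAN, where -1 means noise, and draws each cluster's core samples with large markers and its border/noise points with small ones.

import matplotlib.pyplot as plt

core_samples_mask[db.core_sample_indices_] = True  # the demo marks core points here

labels = db.labels_
unique_labels = set(labels)
for k in unique_labels:
    # Noise (-1) is drawn in black; real clusters get a colour from the colormap.
    col = "k" if k == -1 else plt.cm.Spectral(float(k) / max(len(unique_labels) - 1, 1))
    class_member_mask = labels == k
    xy = X[class_member_mask & core_samples_mask]    # core samples of cluster k
    plt.plot(xy[:, 0], xy[:, 1], "o", markerfacecolor=col, markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]   # border / noise samples
    plt.plot(xy[:, 0], xy[:, 1], "o", markerfacecolor=col, markersize=6)
plt.show()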
Category: Data Science

Is it good practice to include data cleaning or feature engineering steps in an sklearn pipeline to create a scalable pipeline?

I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing, but I am not sure whether I should include the data cleaning, data extraction, and feature engineering steps that are typically more specific to the dataset I am working on. My general thinking is that the pre-processing phase would include operations on the data that need to be done after …
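As one possible shape for this (clean_frame, the cleaning rule, and the column list are assumptions for illustration): dataset-specific cleaning can ride along as a FunctionTransformer step, so it is re-applied identically at fit and predict time.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def clean_frame(df):
    df = df.copy()
    df["income"] = df["income"].clip(lower=0)  # hypothetical cleaning rule
    return df

pipe = Pipeline([
    ("clean", FunctionTransformer(clean_frame)),
    ("prep", ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]),
         ["income", "age"]),                   # hypothetical column names
    ])),
    ("model", LogisticRegression()),
])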
Category: Data Science

How to combine NLP and numeric data for a linear regression problem

I'm very new to data science (this is my hello-world project), and I have a data set made up of a combination of review text and numerical data such as the number of tables. There is also a rating column, which is a float (the average of all user reviews for that restaurant). So a row of data could look like:

{
  rating: 3.765,
  review: `Food was great, staff was friendly`,
  tables: 30,
  staff: 15,
  parking: 20,
  ...
}

So …
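A sketch of the standard pattern for this, assuming a pandas DataFrame df with the columns from the example row: a ColumnTransformer vectorizes the text column and scales the numeric ones side by side before the regressor.

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline([
    ("features", ColumnTransformer([
        ("text", TfidfVectorizer(), "review"),             # a single column name, not a list
        ("nums", StandardScaler(), ["tables", "staff", "parking"]),
    ])),
    ("reg", LinearRegression()),
])
model.fit(df, df["rating"])  # df assumed to hold rows like the example above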
Category: Data Science
