Do the two snippets below do the same thing? If not, what are the differences? fs = RFE(estimator=RandomForestClassifier(), n_features_to_select=10) fs.fit(X, y) print(fs.support_) fs = RandomForestClassifier() fs.fit(X, y) print(fs.feature_importances_[:10])
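A minimal sketch of the comparison, assuming a synthetic dataset from make_classification and fixed seeds (neither is part of the question). It suggests that the two outputs are different kinds of objects: support_ is a boolean mask marking the 10 features kept after recursive elimination, while slicing feature_importances_ only returns the scores of the first 10 columns, which are not necessarily the 10 most important ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data, assumed here only for illustration.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# RFE refits the forest repeatedly, dropping the weakest feature each round.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=10)
rfe.fit(X, y)
rfe_mask = rfe.support_            # boolean mask, one entry per feature

# A single forest fit gives one importance score per feature.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_
first_ten = importances[:10]       # scores of columns 0..9, in column order
top_ten_idx = np.argsort(importances)[::-1][:10]  # indices of the 10 highest scores

print(rfe_mask)
print(first_ten)
print(sorted(top_ten_idx))
```

Even the top-10 indices from a single fit need not match the RFE mask, because RFE recomputes importances after every elimination step.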
I am trying to understand what happens inside the IDF part of the TF-IDF vectorizer. The official scikit-learn page says that the shape is (4,9) for a corpus of 4 documents having 9 unique features. I get the Term Frequency (TF) part: for each document (4) we calculate the frequency of every unique feature (9), so we get a matrix of shape (4,9). But what does not make sense to me is the IDF …
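A worked sketch in plain Python may help with the shapes. It assumes the four-sentence corpus from the scikit-learn documentation example and the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the scikit-learn default; the tokenizer is a simplified stand-in, and the row normalization that TfidfVectorizer applies by default is left out. The point is that IDF is one number per term (a vector of length 9), and it is multiplied across every row of the (4, 9) TF matrix.

```python
import math
import re

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Simplified tokenizer: lowercase words of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(d) for d in corpus]
vocabulary = sorted({t for doc in docs for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = {t: sum(t in doc for doc in docs) for t in vocabulary}

# Smoothed IDF: one value per term, independent of any single document.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

# TF is per document and per term; TF-IDF scales each TF column by its IDF.
tf = [[doc.count(t) for t in vocabulary] for doc in docs]
tfidf = [[count * idf[t] for count, t in zip(row, vocabulary)] for row in tf]

print(len(vocabulary), len(idf))      # number of terms, number of IDF values
print(len(tfidf), len(tfidf[0]))      # rows and columns of the TF-IDF matrix
```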
I'm fairly new to machine learning and I'm aware of the concept of hyper-parameters tuning of classifiers, and I've come across a couple of examples of this technique. However, I'm trying to use NaiveBayes Classifier of sklearn for a task but I'm not sure about the values of the parameters that I should try. What I want is something like this, but for GaussianNB() classifier and not SVM: from sklearn.model_selection import GridSearchCV C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1] gamma=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0] kernel=['rbf','linear'] hyper={'kernel':kernel,'C':C,'gamma':gamma} gd=GridSearchCV(estimator=svm.SVC(),param_grid=hyper,verbose=True) gd.fit(X,Y) print(gd.best_score_) print(gd.best_estimator_) …
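As a hedged sketch of the same pattern for GaussianNB: the estimator exposes few tunable parameters, and var_smoothing is the usual one to search. The iris dataset, the log-spaced grid, and cv=5 below are assumptions chosen for illustration, not recommended values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Example data, assumed only so the snippet runs on its own.
X, y = load_iris(return_X_y=True)

# var_smoothing adds a fraction of the largest feature variance to all
# variances; a log-spaced grid covers several orders of magnitude.
param_grid = {"var_smoothing": np.logspace(-12, 0, 13)}

search = GridSearchCV(estimator=GaussianNB(), param_grid=param_grid, cv=5)
search.fit(X, y)

print(search.best_score_)
print(search.best_estimator_)
```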
I want to implement a neural network model using scikit-learn, and I want to know which activation function I should use. I have 10 input variables and one output. All variables are positive floats, and the output is a percentage (0 to 100). My model is not linear in the output variable, so I'll create a regression model with one hidden layer.
I am working with time series in sklearn, my goal is to have a Pipeline step that replaces each row with a window centered on that row (think convolution). My problem here is that I need all rows (even unlabeled ones) in order to create the windows, but during fitting I want to drop all unlabeled rows. This requires access to both X and y in the transform process. Can this be done with a custom Transformer? From what I …
I am new to Machine Learning and started solving the Titanic Survivor problem on Kaggle. While solving the problem using Logistic Regression, I tried several models with polynomial features of degree $2, 3, 4, 5, 6$. Theoretically, the accuracy on the training set should increase with the degree; however, it started decreasing after degree $2$. The graph is as per below
I am exploring Random Forests regressors using sklearn by trying to predict the returns of a stock based on the past hour data. I have two inputs: the return (% of change) and the volume of the stock for the last 50 mins. My output is the predicted price for the next 10 minutes. Here is an example of input data: Return Volume 0 0.000420 119.447233 1 -0.001093 86.455629 2 0.000277 117.940777 3 0.000256 38.084008 4 0.001275 74.376315 ... 45 …
I'm trying to do a stratified split for a skewed dataset with target variable 'b'. The target variable is a bit value (either 0 or 1). Here's an example: df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': 0}) df.loc[np.random.randint(0, 100000, 1000), 'b'] = 1 tr, ts = train_test_split(df, test_size=.2, stratify=df['b']) print(tr.shape, ts.shape) This code returns the following: (93105, 2) (38, 2) My problem is that the returned train/test arrays do not meet the set split ratio of 20%. My setup: Python 3.7.0 (32bit) …
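For reference, a sketch of what the split is expected to produce. The seeded generator and the draw without replacement are assumptions added here so that exactly 1000 rows are positive (np.random.randint can repeat indices); everything else follows the question's code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100_000), "b": 0})

# Draw 1000 distinct row labels so the positive class has exactly 1000 rows.
positive_rows = rng.choice(len(df), size=1000, replace=False)
df.loc[positive_rows, "b"] = 1

tr, ts = train_test_split(df, test_size=0.2, stratify=df["b"], random_state=0)

print(tr.shape, ts.shape)
print(tr["b"].mean(), ts["b"].mean())  # class ratio should match in both parts
```

With stratification working as intended, both parts keep the 1% positive rate and the test part holds 20% of the rows.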
I have a 32x20 matrix for which I am trying to use XGBoost (Regression). I am looping through rows to produce an out of sample forecast. I'm surprised that XGBoost only returns an out of sample error (MAPE) of 3-4%. When I run the data through other algorithms (glmboost, boosted linear model), I get MAPEs around 1.8-2.5%. I'm surprised XGBoost is so deficient. I suspect I am under-optimizing hyperparameters. I include a gridsearch, which I ran below, but the error …
How can I map encoded variables back to their categorical values in the train and test data, after training and testing, using inverse_transform? During training and testing the data holds encoded numerical values. So how do I restore the original categorical values for those variables in the train and test datasets afterwards? Please help me with this.
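A minimal sketch of the round trip, assuming the columns were encoded with OrdinalEncoder and that the fitted encoder object is still available. The column name color and the sample values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data with one categorical column.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["blue", "red"]})

# Fit on the training data only, then reuse the same encoder for the test data.
encoder = OrdinalEncoder()
train_enc = encoder.fit_transform(train[["color"]])
test_enc = encoder.transform(test[["color"]])

# inverse_transform maps the numeric codes back to the original categories.
train["color_decoded"] = encoder.inverse_transform(train_enc).ravel()
test["color_decoded"] = encoder.inverse_transform(test_enc).ravel()

print(train)
print(test)
```

The key design point is keeping the fitted encoder around: the mapping lives in the encoder, not in the encoded arrays.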
I am using histogram of oriented gradients for image classification using clustering in scikit learn. I am using hog from scikit-image to generate hog from 512x512 grayscale image. Here is an example: fd, hog_image = hog(image, orientations=8, pixels_per_cell=(16, 16), cells_per_block=(1, 1), visualize=True, channel_axis=-1) Where fd is used as features in classification. I wonder if there is a way to retrieve image from fitted coefficients in clustering model, in order to see how features differ between the clusters.(i.e go from fd …
I am struggling to figure out how to take PCA into production in order to test my models with unknown samples. I'm using both One-Hot-Encoding and TF-IDF in order to classify my elements with various models, mainly KNN. I know I can use the pretrained One-Hot-Encoder and the TF-IDF encoder to encode the new elements so that they match the final feature vector. Since these feature vectors become very large, I use a PCA in …
I am having a hard time evaluating my model of imputation. I used an iterative imputer model to fill in the missing values in all four columns. For the model on the iterative imputer, I am using a Random forest model, here is my code for imputing: imp_mean = IterativeImputer(estimator=RandomForestRegressor(), random_state=0) imp_mean.fit(my_data) my_data_filled= pd.DataFrame(imp_mean.transform(my_data)) my_data_filled.head() My problem is how can I evaluate my model. How can I know if the filled values are right? I used a describe function before …
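One way to approach this, sketched under assumptions: hide some values that are actually known, impute them, and compare the imputed values with the hidden truth. The synthetic correlated data, the 10% masking rate, the reduced n_estimators and max_iter, and the RMSE score are all illustrative choices, not taken from the question.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic stand-in for the fully observed part of the data: 4 correlated columns.
base = rng.normal(size=(200, 1))
complete = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

# Hide roughly 10% of the known entries.
mask = rng.random(complete.shape) < 0.1
with_gaps = complete.copy()
with_gaps[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
filled = imputer.fit_transform(with_gaps)

# Error on the hidden entries only, since those are the ones with known truth.
rmse = np.sqrt(np.mean((filled[mask] - complete[mask]) ** 2))
print(rmse)
```

Comparing this error against a simple baseline such as mean imputation gives a sense of whether the iterative model adds anything.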
I am using XGBoost's XGBClassifier with a binary 0-1 target, and I am trying to define a custom metric function. According to the XGBoost Tutorials, it supposedly receives an array of predictions and a DMatrix with the training set. I have used objective='binary:logistic' in order to get probabilities, but the prediction values passed to the custom metric function are not between 0 and 1. They can range, for example, between -3 and 5, and the range of values seems to grow …
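One possible explanation, which the excerpt does not confirm, is that the custom metric receives untransformed margin scores (log-odds) rather than probabilities. Under that assumption, applying the logistic sigmoid inside the metric maps the values back into (0, 1). The helper name to_probability and the sample margins are hypothetical.

```python
import numpy as np

def to_probability(margin):
    # Logistic sigmoid: converts log-odds into a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-margin))

# Hypothetical raw scores in the range described in the question.
raw_margins = np.array([-3.0, 0.0, 5.0])
probs = to_probability(raw_margins)

print(probs)
```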
I'm handling a very conventional supervised classification task with three mutually exclusive, non-ordinal target categories: class1 class2 class2 class1 class3 And so on. In the raw dataset the categories are already represented with integers, not strings as in my example, but with randomly assigned ones: 5 99 99 5 27 I'm wondering whether it is required or recommended to re-assign zero-based sequential integers to the classes as labels instead of the ones above, like this: 0 1 1 0 2 …
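If a remapping is wanted, a small NumPy sketch is shown below. Note one difference from the example in the question: np.unique assigns codes in sorted order of the original labels, so 27 becomes 1 and 99 becomes 2, rather than following order of first appearance. The variable names are illustrative.

```python
import numpy as np

raw_labels = np.array([5, 99, 99, 5, 27])

# classes holds the sorted distinct labels; encoded holds each label's index in it.
classes, encoded = np.unique(raw_labels, return_inverse=True)
print(classes)   # [ 5 27 99]
print(encoded)   # [0 2 2 0 1]

# Indexing classes with the codes recovers the original labels.
decoded = classes[encoded]
print(decoded)   # [ 5 99 99  5 27]
```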
In the following DBSCAN code, as a beginner I need an explanation of what happens to the data in the bottom for loop, and why. # Generate sample data import numpy as np from sklearn.cluster import DBSCAN from sklearn import metrics from sklearn.datasets import make_blobs from sklearn.preprocessing import StandardScaler centers = [[1, 1], [-1, -1], [1, -1]] X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0) X = StandardScaler().fit_transform(X) # Compute DBSCAN db = DBSCAN(eps=0.3, min_samples=10).fit(X) core_samples_mask = np.zeros_like(db.labels_, dtype=bool) …
I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing but I am not sure if I should include data cleaning, data extraction and feature engineering steps that are typically more specific to the dataset I am working on. My general thinking is that the pre-processing phase would include operations on the data that need to be done after …
I'm very new to data science (this is my hello world project), and I have a data set made up of a combination of review text and numerical data such as number of tables. There is also a column for reviews which is a float (avg of all user reviews for that restaurant). So a row of data could be like: { rating: 3.765, review: `Food was great, staff was friendly`, tables: 30, staff: 15, parking: 20 ... } So …
According to Géron's book, SGDClassifier in scikit-learn uses one-vs-rest for multi-class classification. But how can I tell which strategy is used? The help file doesn't appear to give this information.
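One way to check empirically is to look at the shape of the fitted coefficients, sketched here on the digits dataset (an assumption made so the two strategies give different counts). With 10 classes, one-vs-rest would train 10 binary classifiers, whereas one-vs-one would train 45.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# 10 classes, 64 features.
X, y = load_digits(return_X_y=True)

clf = SGDClassifier(random_state=0).fit(X, y)

# One row of weights per binary classifier.
print(clf.coef_.shape)
# One decision score per class for each sample.
print(clf.decision_function(X[:2]).shape)
```

A coefficient matrix with one row per class is consistent with one binary classifier per class, that is, one-vs-rest.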