Accuracy before oversampling: training 98.54%, testing 98.21%. Accuracy after oversampling: training 77.92%, testing 90.44%. What does this mean, and how can I increase the accuracy? Edit: classes before SMOTE, from dataset['Label'].value_counts(): BENIGN 168051, Brute Force 1507, XSS 652, Sql Injection 21. Classes after SMOTE: BENIGN 117679, Brute Force 117679, XSS 117679, Sql Injection 117679. I used the following model: Random Forest: train score 0.49, test score 0.85 …
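For reference, a minimal sketch of the usual setup for this kind of experiment, assuming `dataset` is the poster's DataFrame with a 'Label' column: SMOTE is applied only to the training split, the test split keeps its natural imbalance, and per-class metrics are inspected rather than overall accuracy (which is dominated by the BENIGN class here).

```python
# Sketch only: resample the training split, never the test split, and look at
# per-class precision/recall instead of plain accuracy.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

X = dataset.drop(columns=['Label'])
y = dataset['Label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

# Oversample the training data only; the test set keeps its natural imbalance.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_res, y_res)

# Per-class metrics are far more informative than overall accuracy here.
print(classification_report(y_test, clf.predict(X_test)))
```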
I'm building a binary text classifier; the ratio between positives and negatives is 1:100 (100 / 10,000). By using back translation as an augmentation, I was able to get 400 more positives. Then I decided to do upsampling to balance the data. Do I include only the original positive data points (100), or should I also include the 400 that I generated? I will definitely try both, but I wanted to know if there is any rule of …
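A rough sketch of the "include both" option, assuming the back-translated positives have already been vectorised the same way as the originals; the arrays and sizes below are placeholders, not from the original post.

```python
# Pool original + augmented positives, then duplicate minority rows until the
# classes are balanced. sampling_strategy can be lowered if 1:1 is too aggressive.
import numpy as np
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(100, 50))    # original positives (placeholder features)
X_aug = rng.normal(size=(400, 50))    # back-translated positives (placeholder)
X_neg = rng.normal(size=(10000, 50))  # negatives (placeholder)

X_train = np.vstack([X_pos, X_aug, X_neg])
y_train = np.concatenate([
    np.ones(len(X_pos) + len(X_aug)),  # positives: original + augmented
    np.zeros(len(X_neg)),              # negatives
])

ros = RandomOverSampler(sampling_strategy=1.0, random_state=0)
X_res, y_res = ros.fit_resample(X_train, y_train)
```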
I had a question related to SMOTE. If you have a data set that is imbalanced, is it correct to use SMOTE when you are using BERT? I believe I read somewhere that you do not need to do this since BERT takes this into account, but I'm unable to find the article where I read that. Either from your own research or experience, would you say that oversampling using SMOTE (or some other algorithm) is useful when classifying using …
Reading the following article: https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html There is an explanation of how to use make_pipeline from imblearn.pipeline in order to perform cross-validation on an imbalanced dataset while avoiding data leakage. Here I copy the code used in the notebook linked by the article: X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45) rf = RandomForestClassifier(n_estimators=100, random_state=13) imba_pipeline = make_pipeline(SMOTE(random_state=42), RandomForestClassifier(n_estimators=100, random_state=13)) cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf) new_params = {'randomforestclassifier__' + key: params[key] for key in params} grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, …
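The excerpt omits the definitions of `kf` and `params`, so below is a self-contained sketch of the same idea with assumed values for those pieces (a StratifiedKFold splitter and an illustrative parameter grid). The point of the pattern is that SMOTE sits inside the pipeline, so it is refit on each training fold and never sees the validation fold.

```python
# Self-contained sketch; the data, CV splitter and grid are assumptions,
# the pipeline pattern mirrors the notebook quoted above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45)

imba_pipeline = make_pipeline(
    SMOTE(random_state=42),
    RandomForestClassifier(n_estimators=100, random_state=13))

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # assumed CV splitter
print(cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf))

# Illustrative grid; the article's actual `params` dict is not shown in the excerpt.
params = {'n_estimators': [50, 100, 200], 'max_depth': [4, 6, None]}
new_params = {'randomforestclassifier__' + key: value for key, value in params.items()}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall')
grid_imba.fit(X_train, y_train)
print(grid_imba.best_params_, grid_imba.best_score_)
```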
For some classification needs, I have multivariate time series data composed from 4 satellite images, in the form (145521 pixels, 4 dates, 2 bands). I made a classification with tempCNN to classify the data into 5 classes. However, there is a big gap between classes 1 and 2, with 500 samples, and classes 4 and 5, with 1452485 samples. I'm wondering if there is a method that would help me oversample the first two classes to make my dataset more adequate for classification.
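One simple option, sketched below under the assumption that the data is a NumPy array of shape (n_pixels, n_dates, n_bands) with one label per pixel: flatten each pixel's time series into a vector, oversample only the rare classes with an off-the-shelf sampler such as SMOTE, then reshape back to the tempCNN input shape. All shapes and target counts are illustrative.

```python
# Sketch: flatten (pixels, dates, bands) -> (pixels, dates*bands) for SMOTE,
# oversample only the rare classes, reshape back for the CNN.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4, 2))                 # (pixels, dates, bands), toy data
y = rng.choice([1, 2, 3, 4, 5], size=5000,
               p=[0.02, 0.02, 0.32, 0.32, 0.32])  # classes 1 and 2 are rare

n_pixels, n_dates, n_bands = X.shape
X_flat = X.reshape(n_pixels, n_dates * n_bands)

# Bring only the two rare classes up to an assumed target of 2000 samples each;
# the majority classes are left untouched.
smote = SMOTE(sampling_strategy={1: 2000, 2: 2000}, random_state=0)
X_res, y_res = smote.fit_resample(X_flat, y)

X_res = X_res.reshape(-1, n_dates, n_bands)       # back to CNN input shape
```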
I have 5 classes, one of them having only one sample. I've been researching oversampling techniques such as SMOTE and bootstrapping, but they do not work for the class with only one sample. I am considering repetition of this class. Are there any other strategies you would recommend? Would repetition followed by SMOTE make sense, or not really, given that SMOTE uses k-nearest neighbors?
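A minimal sketch of the repetition option, using imblearn's RandomOverSampler on illustrative data: it simply duplicates rows, so it works even for a one-sample class, whereas SMOTE interpolates between neighbors, and interpolating a point with copies of itself only reproduces the same point.

```python
# Sketch: RandomOverSampler duplicates minority rows, so a single-sample class
# is handled; SMOTE on identical duplicates would just regenerate that point.
import numpy as np
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 10)),   # classes 0-3: 50 samples each
               rng.normal(size=(1, 10))])    # class 4: a single sample
y = np.concatenate([np.repeat([0, 1, 2, 3], 50), [4]])

ros = RandomOverSampler(random_state=0)      # every class brought up to 50
X_res, y_res = ros.fit_resample(X, y)
print(np.bincount(y_res))                    # [50 50 50 50 50]
```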
I'm using a multiclass dataset (cic-ids-2017), which is very imbalanced. I have already encoded the categorical feature (which is the target) using OneHotEncoder. I tried to use the SMOTE oversampling method to balance the data with a pipeline: X = df.drop(['Label'], axis=1) y = df.Label steps = [('onehot', OneHotEncoder()), ('smt', SMOTE())] pipeline = Pipeline(steps=steps) X, y = pipeline.fit_resample(X, y) When I used pd.get_dummies instead of OneHotEncoder, I could not use the pipeline (because of get_dummies). How can I balance the …
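For reference, a minimal sketch under the assumption that the cic-ids-2017 features are numeric and the only categorical column is the target 'Label': the target can stay as plain string labels (imblearn's SMOTE accepts those, so no one-hot encoding of y is needed for resampling), and the sampler goes inside an imblearn Pipeline together with the classifier. `df` is assumed to be the poster's loaded DataFrame.

```python
# Sketch: leave y as raw class labels, put SMOTE + classifier in one pipeline.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X = df.drop(columns=['Label'])
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# SMOTE inside the pipeline runs only at fit time, so the test data is
# never resampled.
pipe = Pipeline([('smt', SMOTE(random_state=42)),
                 ('clf', RandomForestClassifier(random_state=42))])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```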
I am working with a tiny private dataset (192 samples) with 4 classes. A preprocessing step is necessary in order to do any classification. Among feature selection and extraction techniques, I decided to apply oversampling (SMOTE). Here is what I did: using the entire dataset (the original 192 samples), I create synthetic samples for each class using SMOTE, so I get a total of 500 samples per class (2000 in total). I have a big suspicion about this procedure because when …
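For comparison, a minimal sketch of the alternative procedure on simulated data: split (here via cross-validation) first, and let SMOTE run only on each training fold, so synthetic samples can never leak into the fold used for evaluation. The 192-sample, 4-class dataset and the SVC classifier below are assumptions for illustration only.

```python
# Sketch: SMOTE inside the pipeline is refit per training fold; the held-out
# fold contains only original samples.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

X, y = make_classification(n_samples=192, n_classes=4, n_informative=6,
                           weights=[0.55, 0.25, 0.12, 0.08], random_state=0)

pipe = make_pipeline(SMOTE(k_neighbors=3, random_state=0), SVC())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1_macro')
print(scores.mean())
```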