I'm trying to use grid search to find the best parameters for my model. Given that I have to apply the NearMiss undersampling method while doing cross-validation, should I fit my grid search on my undersampled dataset (whatever undersampling technique is used) or on my entire training data (the whole dataset) before using cross-validation?
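A common pattern (sketched below, assuming imbalanced-learn is installed; the estimator and parameter grid are purely illustrative) is to put NearMiss inside an imblearn Pipeline, so that the undersampling is re-fit on the training portion of each CV fold while the grid search itself is fitted on the full, un-resampled training data.

# Sketch: NearMiss applied only inside each training fold of the grid search.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import NearMiss
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("undersample", NearMiss()),                   # runs only on the training split of each fold
    ("clf", LogisticRegression(max_iter=1000)),    # illustrative estimator
])

param_grid = {
    "undersample__n_neighbors": [3, 5],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: the whole (not undersampled) training set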
I have a multi-label dataset whose label distribution looks something like this, with the label on the x-axis and the number of rows it occurs in on the y-axis.

## imports
import numpy as np
import pandas as pd
%matplotlib inline
from sklearn.datasets import make_multilabel_classification

## creating dummy data
X, y = make_multilabel_classification(n_samples=100_000, n_features=2, n_classes=100, n_labels=10, random_state=42)
X.shape, y.shape
((100000, 2), (100000, 100))

## making it a dataframe
final_df = pd.merge(left=pd.DataFrame(X), right=pd.DataFrame(y), left_index=True, right_index=True).copy()
final_df.rename(columns={'0_x':'input_1', '1_x':'input_2', '0_y':0, '1_y':1}, inplace=True)
final_df.columns = final_df.columns.astype(str)
…
I have a multi-label dataset with an extreme case of class imbalance. I was thinking of clustering the less populated classes into bigger clusters of size at least N. My question: is there a clustering algorithm that allows one to merge only the smaller, similar groups into clusters of size at least N? The idea is that the algorithm should ignore the labels that are already populated enough and focus on clustering together the labels that are still underrepresented. To give …
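As a rough illustration of this idea (a greedy sketch rather than an off-the-shelf algorithm; the function name, the use of cosine similarity between label columns, and the size bookkeeping are all my own assumptions): only labels whose count is below N are considered, and the smallest one is repeatedly merged with its most similar underrepresented neighbour.

import numpy as np

def merge_rare_labels(Y: np.ndarray, N: int) -> dict:
    # Y: (n_samples, n_labels) binary indicator matrix; returns {label index: group id}.
    counts = Y.sum(axis=0).astype(float)
    norms = np.linalg.norm(Y, axis=0) + 1e-12
    sim = (Y.T @ Y) / np.outer(norms, norms)        # cosine similarity between label columns
    np.fill_diagonal(sim, -np.inf)

    group = {j: j for j in range(Y.shape[1])}       # every label starts as its own group
    size = {j: counts[j] for j in range(Y.shape[1])}

    while True:
        small = [g for g in set(group.values()) if size[g] < N]
        if len(small) < 2:                          # well-populated labels are never touched
            break
        g = min(small, key=lambda k: size[k])       # smallest underrepresented group first
        target = max((o for o in small if o != g), key=lambda o: sim[g, o])
        for lbl, grp in list(group.items()):        # merge group g into its nearest small neighbour
            if grp == g:
                group[lbl] = target
        size[target] += size[g]                     # rough size; co-occurring rows are double-counted
    return group

# groups = merge_rare_labels(Y, N=500)   # e.g. require every merged group to cover at least 500 rows

For brevity the similarity matrix is not updated after a merge (the target label's original row keeps being used); a fuller implementation, or ordinary agglomerative clustering restricted to the rare labels, would recompute it.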
I'm using SimpleTransformers to train and evaluate a model. Since the dataset I am using is severely imbalanced, it is recommended that I assign weights to each label. An example of assigning weights for SimpleTransformers is given here. My question, however, is: how exactly do I choose the appropriate weight for each class? Is there a specific methodology, e.g., a formula that uses the ratio of the labels? Follow-up question: are the weights used for the same dataset "universal"? …
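One common heuristic (not specific to SimpleTransformers) is to weight each class inversely to its frequency, i.e. weight_c = n_samples / (n_classes * count_c), which is what scikit-learn's class_weight='balanced' computes; the toy labels below are illustrative. Since the weights come straight from the label counts, recomputing them on a different dataset (or after resampling) will generally give different values.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0]*8 + [1] + [2])                  # toy, severely imbalanced labels
classes = np.unique(y)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))               # rarer labels get larger weights (about 3.33 vs 0.42)

# The same numbers by hand: n_samples / (n_classes * count per class)
manual = len(y) / (len(classes) * np.bincount(y))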
What is the best way to categorize the approaches that have been developed to deal with the class imbalance problem? This article categorizes them into:
Preprocessing: includes oversampling, undersampling and hybrid methods,
Cost-sensitive learning: includes direct methods and meta-learning, the latter of which is further divided into thresholding and sampling,
Ensemble techniques: includes cost-sensitive ensembles and data preprocessing in conjunction with ensemble learning.
The second classification:
Data Pre-processing: includes distribution change and weighting the data space. One-class learning is considered a distribution change. …
I used XGBoost to predict company bankruptcy, which involves an extremely unbalanced dataset. Although I tried the weighting method as well as parameter tuning, the best result I could obtain is as follows:

Best Parameters: {'clf__gamma': 0.1, 'clf__scale_pos_weight': 30.736842105263158, 'clf__min_child_weight': 1, 'clf__max_depth': 9}
Best Score: 0.219278428798
Accuracy: 0.966850828729
AUC: 0.850038850039
F1 Measure: 0.4
Cohen Kappa: 0.383129792673
Precision: 0.444444444444
Recall: 0.363636363636
Confusion Matrix:
[[346   5]
 [  7   4]]

As the confusion matrix shows, my model cannot identify bankrupted companies very …
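For what it's worth, the XGBoost documentation suggests starting scale_pos_weight at the ratio of negative to positive training examples (likely close to the tuned value above) and tuning around it, while judging the model on a metric such as PR-AUC rather than accuracy. The data and names below are stand-ins.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 5))
y_train = rng.binomial(1, 0.03, size=2000)       # stand-in: ~3% bankrupt companies

ratio = float(np.sum(y_train == 0)) / np.sum(y_train == 1)   # negatives / positives

clf = xgb.XGBClassifier(
    scale_pos_weight=ratio,      # heuristic starting point; grid-search around it
    max_depth=9,
    min_child_weight=1,
    gamma=0.1,
    eval_metric="aucpr",         # PR-AUC is more informative than accuracy under heavy imbalance
)
clf.fit(X_train, y_train)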
I am using Orange to predict customer churn and compare different learners based on accuracy, F1, etc. As my problem is unbalanced (10% churn, 90% not churn), I want to oversample. However, in Orange it is not possible to do the oversampling within the cross-validation (the Test & Score widget). Therefore, I first want to generate 10 stratified folds from my input data, so that the 10% churn / 90% not-churn distribution is preserved in each fold. Then, …
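Outside Orange, the procedure described above can be sketched with scikit-learn and imbalanced-learn: build 10 stratified folds on the raw data, then oversample only the training part of each fold so the held-out part keeps the original 10%/90% distribution. The data and names are illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.binomial(1, 0.1, size=1000)              # stand-in: ~10% churn

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # oversample the training part only; the test part keeps the original class ratio
    X_tr, y_tr = RandomOverSampler(random_state=42).fit_resample(X[train_idx], y[train_idx])
    X_te, y_te = X[test_idx], y[test_idx]
    # train a learner on (X_tr, y_tr) and score it on (X_te, y_te)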
I am confronted with a binary classification machine learning task which is both slightly imbalanced and cost-sensitive. I wonder what is the best way (and where in the modeling pipeline, say, in sklearn) to take all these considerations into account. Class proportions: positive: 0.25, negative: 0.75. This could be addressed with sklearn.utils.class_weight.compute_class_weight:

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)

OK, but this only rebalances the class proportions; I should take the misclassification cost into consideration as well. Let's say that this is 10* larger …
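One possible way to combine the two (a sketch, not a definitive recipe; the cost factor, estimator and data are illustrative): start from the 'balanced' class weights and multiply the positive-class weight by the assumed 10x misclassification cost, then pass the result as class_weight to the estimator (or, equivalently, as per-sample weights in fit).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.binomial(1, 0.25, size=1000)             # stand-in: ~25% positives

classes = np.unique(y)
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weights = dict(zip(classes, balanced))
weights[1] *= 10                                  # assumed: misclassifying a positive is 10x as costly

clf = LogisticRegression(class_weight=weights, max_iter=1000).fit(X, y)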
I have a problem that involves identifying clusters of highly correlated items. I initially focused on building a model and features that put similar data items close to each other. The main challenge is that I have a case of imbalanced data, as follows: tens of millions of items are random and not necessarily correlated, while hundreds of clusters of items (composed of 10-1000s of elements) exist* or may emerge. *I do have partial ground truth for the existing ones. …