Incorrect multi-variate anomaly detection - Isolation Forest Python

My data looks like the table below: 333 rows and 2 columns. Clearly the first row is an anomaly.

ndf:
+----+---------+-------------+
|    | ROW_CNT | TOT_SALE    |
+----+---------+-------------+
|  0 |      45 |     1411.27 |
|  1 |   47754 |  1596200.68 |
|  2 |  105894 |  3750304.55 |
|  3 |  372953 | 14368324.86 |
|  4 |  389915 | 14899302.85 |
|  5 |  379473 | 14696309.67 |
|  6 |  388571 | …
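A minimal sketch of one common setup, assuming the data is a pandas DataFrame like ndf above: scale the two columns first (they differ by several orders of magnitude) and let IsolationForest flag outliers. The contamination value is a guess to tune, not a recommendation.

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# First five rows of ndf as stand-in data for the full 333-row frame
ndf = pd.DataFrame({"ROW_CNT": [45, 47754, 105894, 372953, 389915],
                    "TOT_SALE": [1411.27, 1596200.68, 3750304.55,
                                 14368324.86, 14899302.85]})

# Bring both features onto a comparable scale before fitting
X = StandardScaler().fit_transform(ndf[["ROW_CNT", "TOT_SALE"]])

# contamination ~ expected anomaly fraction (0.2 suits 5 demo rows;
# something nearer 0.01 would be more plausible for all 333 rows)
iso = IsolationForest(n_estimators=200, contamination=0.2, random_state=0)
ndf["flag"] = iso.fit_predict(X)   # -1 = anomaly, 1 = normal
print(ndf[ndf["flag"] == -1])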
Category: Data Science

Can numerical encoding really replace one-hot encoding?

I am reading the articles below, which advocate numerical (integer) encoding rather than one-hot encoding for better interpretability of the feature-importance output of ensemble models. This goes against everything I have learnt: won't the model treat nominal features (like cities or car make/model) as ordinal if I encode them as integers?
https://krbnite.github.io/The-Quest-for-Blackbox-Interpretability-Take-1/
https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
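A small sketch of the concern on a toy nominal column: OrdinalEncoder assigns arbitrary integers (alphabetical here), which reads as an order, while OneHotEncoder creates one indicator column per category. Tree ensembles only split on thresholds, which is why the articles argue the fake order does them little harm.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical nominal feature with no natural order
cities = pd.DataFrame({"city": ["Paris", "Tokyo", "Lima", "Tokyo"]})

# Integer encoding imposes Lima(0) < Paris(1) < Tokyo(2)
print(OrdinalEncoder().fit_transform(cities).ravel())   # [1. 2. 0. 2.]

# One-hot encoding: one 0/1 column per city, no implied order
print(OneHotEncoder().fit_transform(cities).toarray())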
Category: Data Science

How to apply Stacking cross validation for time-series data?

Normally, a stacking algorithm uses K-fold cross-validation to produce out-of-fold (OOF) predictions, which then serve as the training data for the level-2 model. For time-series data (say, stock movement prediction), K-fold cross-validation can't be used; time-series validation (the splitter suggested in the sklearn lib) is the suitable way to evaluate model performance. In that scheme, no prediction is made on the first fold and no training is done on the last fold. How do we use the stacking cross-validation technique for time-series data?
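One workable pattern, sketched below on placeholder arrays X and y that are assumed to be in time order: walk forward with TimeSeriesSplit, fill in OOF predictions for each test window, and train the level-2 model only on the rows that actually received one (the first training window gets none, exactly as noted above).

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 5)), rng.normal(size=500)   # time-ordered toy data

base_models = [RandomForestRegressor(n_estimators=50, random_state=0), Ridge()]
oof = np.full((len(y), len(base_models)), np.nan)

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])           # fit on the past only
        oof[test_idx, j] = model.predict(X[test_idx])   # predict the next window

mask = ~np.isnan(oof).any(axis=1)       # rows that got an OOF prediction
meta = Ridge().fit(oof[mask], y[mask])  # level-2 model trained on OOF features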
Category: Data Science

Feature Selection using Stacking Ensemble?

I want to combine several estimators, such as Logistic Regression, Gaussian NB and K-Nearest Neighbors, for feature selection. I tried using the StackingClassifier() estimator to do that, but there is no feature_importances_ attribute on this estimator. Is there any other method to select features that combines those classifiers? Thank you in advance :)
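One option, sketched below on toy data: sklearn's permutation_importance works with any fitted estimator, StackingClassifier included, so features can be ranked without a feature_importances_ attribute. The data and model settings here are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)

# Shuffle each feature and measure the score drop: model-agnostic importance
result = permutation_importance(stack, X, y, n_repeats=10, random_state=0)
print("features ranked by importance:", result.importances_mean.argsort()[::-1])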
Category: Data Science

How to apply ensemble clustering method?

I need to apply an ensemble clustering method to my data set using Python. I have already applied k-means clustering with the scikit-learn library. I have also tried different classification methods, and scikit-learn does offer ensemble classification methods. Now I am confused: does scikit-learn provide any library for ensemble clustering, and if not, how can I apply an ensemble clustering method to my data set?
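As far as I know scikit-learn ships no ensemble-clustering module, so one do-it-yourself approach is consensus clustering: run k-means several times, count how often each pair of points lands in the same cluster (a co-association matrix), then cluster that matrix. A rough sketch on toy blobs:

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Co-association matrix: fraction of runs in which points i and j co-cluster
n_runs = 20
co = np.zeros((len(X), len(X)))
for seed in range(n_runs):
    labels = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)
    co += labels[:, None] == labels[None, :]
co /= n_runs

# Cluster the consensus; 1 - co behaves as a distance matrix
# (on scikit-learn < 1.2 pass affinity="precomputed" instead of metric=)
final = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                linkage="average").fit_predict(1 - co)
print(final[:20])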
Category: Data Science

Ensemble of different reservoirs (echo state networks)

Suppose I want to use reservoir computing to classify an input into the proper category (e.g. recognizing a handwritten letter). Ideally, after training a single reservoir and testing it, there would be an output vector y with one value close to 1 and the others close to 0. However, this is not the case in practice, and I don't want to make the reservoir bigger at the moment. I was therefore thinking of combining the predictions of a number of …
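A minimal sketch of just the combining step, assuming each trained reservoir already yields a vector of per-class scores for a sample; everything here is hypothetical scaffolding, not an echo state network implementation. Averaging the score vectors (soft voting) usually degrades more gracefully than majority voting on the argmaxes (hard voting):

import numpy as np

# Hypothetical per-class score vectors from 3 reservoirs for one input sample
outputs = np.array([[0.7, 0.2, 0.1],
                    [0.4, 0.5, 0.1],
                    [0.8, 0.1, 0.1]])

soft_vote = outputs.mean(axis=0).argmax()                 # average scores, then argmax
hard_vote = np.bincount(outputs.argmax(axis=1)).argmax()  # majority of per-reservoir picks
print(soft_vote, hard_vote)   # 0 0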
Category: Data Science

How to train with cross-validation? And which F1 score to choose?

I got similar results from two models that consist of similar algorithms. Model 1, with cv=10, has an f1_micro of 0.941 (see code below). Model 2, with only a train/test split (no CV), has an f1_micro of 0.953. Now here is my understanding problem. Earlier I ran a grid search to find the best hyperparameters. Now I would like to run just a cross-validation to train on the dataset, like the part marked in red in the picture. In the code there is still the grid search …
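For plain cross-validation with hyperparameters already fixed (no grid search), cross_val_score is the usual tool; a sketch with stand-in data and a stand-in classifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)   # stand-in data
clf = RandomForestClassifier(random_state=0)                # stand-in model

# 10-fold CV, scored the same way as model 1 above
scores = cross_val_score(clf, X, y, cv=10, scoring="f1_micro")
print(scores.mean(), scores.std())   # mean f1_micro and its spread across folds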
Category: Data Science

Are "Gradient Boosting Machines (GBM)" and GBDT exactly the same thing?

In the category of Gradient Boosting, I find some terms confusing. I'm aware that XGBoost includes some optimizations in comparison to conventional Gradient Boosting. But are Gradient Boosting Machines (GBM) and GBDT the same thing? Are they just different names? Apart from GBM/GBDT and XGBoost, are there any other models that fall into the category of Gradient Boosting?
Category: Data Science

In XGBoost, how does a leaf index correspond to a particular leaf node in the actual base learner trees?

I've trained an XGBoost model for regression, where the max depth is 2.

# Create the ensemble
ensemble_size = 200
ensemble = xgb.XGBRegressor(n_estimators=ensemble_size,
                            n_jobs=4,
                            max_depth=2,
                            learning_rate=0.1,
                            objective='reg:squarederror')
ensemble.fit(train_x, train_y)

I've plotted the first tree in the ensemble:

# Plot single tree
plot_tree(ensemble, rankdir='LR')

Now I retrieve the leaf indices of the first training sample in the XGBoost ensemble model:

ensemble.apply(train_x[:1])
# leaf indices in all 200 base learner trees
array([[6, 6, 4, 6, 4, 6, 5, 5, 4, 5, 4, …
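One way to line the numbers up, assuming the ensemble above is already fitted: dump the booster with trees_to_dataframe() and compare the Node column of tree 0 against the first value returned by apply(); the leaf index is the node ID that plot_tree() prints in each leaf box. A sketch reusing ensemble and train_x from the snippet above:

# Tree 0 as a table; leaf rows show Feature == 'Leaf'
trees = ensemble.get_booster().trees_to_dataframe()
print(trees[trees['Tree'] == 0][['Node', 'Feature', 'Gain']])

# Per tree, apply() returns the Node id of the leaf the sample lands in,
# e.g. the 6 above corresponds to node '0-6' of tree 0
print(ensemble.apply(train_x[:1])[0, 0])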
Category: Data Science

Stacking - Appropriate base and meta models

When implementing stacking for model building and prediction (for example using sklearn's StackingRegressor), what is the appropriate choice of models for the base models and the final meta-model? Should weak/linear models be used as the base models and an ensemble model as the final meta-model (for example: Lasso, Ridge and ElasticNet as base models, and XGBoost as the meta-model)? Or should non-linear/ensemble models be used as base models and linear regression as the final meta-model (for …
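The more common recipe is the latter: diverse, stronger learners at level 0 and a simple linear combiner on top. A sketch with sklearn's StackingRegressor, where the specific model choices are illustrative rather than a recommendation:

from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=400, n_features=10, random_state=0)

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=0)),
                ("gb", GradientBoostingRegressor(random_state=0))],
    final_estimator=RidgeCV(),   # simple linear combiner on top
    cv=5,                        # out-of-fold predictions feed the meta-model
)
print(stack.fit(X, y).score(X, y))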
Category: Data Science

What is the difference between ensemble methods and hybrid methods, or is there none?

I have the feeling that these terms are often used as synonyms for one another, since they share the same goal, namely increasing prediction accuracy by combining different algorithms. My question thus is: is there a difference between them? And if so, is there some book/paper that explains the difference?
Category: Data Science

What is the form of data used for prediction with generalized stacking ensemble?

I am very confused as to how the training data is split, and on what data the level-0 predictions are made, when using generalized stacking. This question is similar to mine, but the answer is not sufficiently clear: How predictions of level 1 models become training set of a new model in stacked generalization. My understanding is that the training set is split, base models are trained on one split, and predictions are made on the other. These predictions now become features …
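A sketch of the usual data flow, on toy data, using sklearn's cross_val_predict: every training row receives an out-of-fold level-0 prediction, and those columns (one per base model) form the training matrix of the level-1 model:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, random_state=0)
base_models = [LogisticRegression(max_iter=1000), GaussianNB()]

# One column of out-of-fold P(y=1) per base model: the level-1 feature matrix
level1_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

meta = LogisticRegression().fit(level1_X, y)   # level-1 (meta) model
print(level1_X.shape)   # (300, 2): one row per original training sample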
Category: Data Science

How can I improve my model on a very very small dataset?

I am starting as a PhD student, and we want to find appropriate materials (with certain qualities) based on basic chemical properties like charge, etc. There are a lot of models and datasets in similar works, but since our work is quite novel, we have to make and test each data sample ourselves. This makes data acquisition very slow and very expensive. We estimate 10-15 samples for some time, until we can expand the dataset. Now I …
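With 10-15 samples, about the only honest evaluation is leave-one-out cross-validation around a heavily regularized model; a sketch of that setup, on random placeholder data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(12, 4)), rng.normal(size=12)   # placeholder for ~12 samples

# Strong regularization + LOO: every sample serves once as the test set
model = Ridge(alpha=10.0)
scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print(-scores.mean())   # average held-out error over all 12 leave-one-out fits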
Category: Data Science

ValueError: Graph disconnected: cannot obtain value for tensor Tensor

I'm trying to perform a stacking ensemble of three VGG-16 models, all custom-trained on my personal dataset and having the same input shape. This is the code:

input_shape = (256,256,3)
model_input = Input(shape=input_shape)

def load_all_models(n_models):
    all_models = list()
    model_top1 = load_model('weights/vgg16_1.h5')
    all_models.append(model_top1)
    model_top2 = load_model('weights/vgg16_2.h5')
    all_models.append(model_top2)
    model_top3 = load_model('weights/vgg16_3.h5')
    all_models.append(model_top3)
    return all_models

n_members = 3
members = load_all_models(n_members)
print('Loaded %d models' % len(members))

# perform stacking
def define_stacked_model(members):
    for i in range(len(members)):
        model = members[i]
        for layer in model.layers:
            # make …
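This error usually means the stacked graph is being built from a new Input while the loaded sub-models are still wired to their own original Input tensors. One common fix, sketched below with illustrative names: rename each sub-model's layers to avoid name clashes, freeze them, and call each model on one shared input tensor so Keras reconnects the graph from a single source:

# Sketch of a define_stacked_model that avoids the disconnected-graph error
from tensorflow.keras.layers import Dense, Input, concatenate
from tensorflow.keras.models import Model

def define_stacked_model(members, input_shape=(256, 256, 3)):
    # Rename layers so the three VGG-16 copies don't collide, and freeze them
    for i, model in enumerate(members):
        for layer in model.layers:
            layer.trainable = False
            layer._name = 'ensemble_%d_%s' % (i + 1, layer.name)

    # One shared input tensor; calling each model on it reconnects the graph
    model_input = Input(shape=input_shape)
    ensemble_outputs = [model(model_input) for model in members]

    merge = concatenate(ensemble_outputs)
    hidden = Dense(10, activation='relu')(merge)
    output = Dense(members[0].output_shape[-1], activation='softmax')(hidden)
    return Model(inputs=model_input, outputs=output)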
Category: Data Science
