Multiclass classification OOB error

I'm implementing a random forest for a 6-class classification problem and witnessing a strange phenomenon. I have 10 percent of my set sectioned off as a pseudo validation set. I'm training each tree on 50 percent of the training items, randomly selected (the training items being 90 percent of the whole set). Now my OOB error is almost the mirror image of my validation error. I'm using averaged F1 error (i.e. the average of the F1 error per class). As more trees are …
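
For reference, here is a minimal sketch of the kind of setup I mean, on synthetic data rather than my real set (the class counts, tree count and max_samples=0.5 are placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_classes=6, n_informative=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

# max_samples=0.5 roughly mirrors "50 percent of the training items per tree";
# oob_score requires bootstrap=True.
rf = RandomForestClassifier(n_estimators=200, max_samples=0.5,
                            bootstrap=True, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

# OOB macro-averaged F1 from the out-of-bag class probabilities
oob_pred = np.argmax(rf.oob_decision_function_, axis=1)
print("OOB macro F1:", f1_score(y_train, oob_pred, average="macro"))
print("Validation macro F1:", f1_score(y_val, rf.predict(X_val), average="macro"))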
Category: Data Science

xgboost performance

XGBRegressor is not performing better than AdaBoostRegressor for the same set of parameters, for some reason. Since my dataset is big, I made an example using sklearn's make_regression as follows.

from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=10000, n_features=1, n_informative=1, random_state=0, noise=1, shuffle=False)

regr = LinearRegression()
regr.fit(X, y)
print(regr.score(X, y))

regr = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6), n_estimators=10, random_state=0)
regr.fit(X, y)
print(regr.score(X, y))

regr = XGBRegressor(max_depth=6, n_estimators=10, random_state=0)
regr.fit(X, y)
print …
Category: Data Science

Why can't we sample without replacement for each tree in a random forest if the subsample size is large enough?

Usually if we have $n$ observations, for each tree we form a bootstrapped subsample of size $n$ with replacement. On googling it, one common explanation I've seen is that sampling with replacement is necessary for the independence of individual trees. But why can't we just resample as follows: for tree 1, randomly sample $m$ observations without replacement out of the $n$, where $m$ is still large enough (of course, provided that $n$ is large enough in the first place). Then replenish …
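
For concreteness, a small sketch of the two sampling schemes being contrasted (the values of $n$ and $m$ here are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 630            # m chosen near n * (1 - 1/e), roughly the expected
                            # number of distinct points in a bootstrap sample
data = np.arange(n)

bootstrap = rng.choice(data, size=n, replace=True)    # classic bagging draw
subsample = rng.choice(data, size=m, replace=False)   # the proposed draw

print(len(np.unique(bootstrap)))   # roughly 0.632 * n distinct observations
print(len(np.unique(subsample)))   # exactly m distinct observations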
Category: Data Science

Difference between bagging and pasting?

I found the definition: Bagging is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting. What is "replacement" in this context?
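
To make the question concrete, here is a tiny sketch of the two sampling modes using sklearn's resample helper on an invented toy list:

from sklearn.utils import resample

items = ["a", "b", "c", "d", "e"]

# With replacement (bagging): an item can be drawn again after it is picked,
# so the same item may appear several times in one subset.
print(resample(items, replace=True, n_samples=5, random_state=0))

# Without replacement (pasting): once drawn, an item cannot be drawn again,
# so a subset of size 5 contains each item at most once.
print(resample(items, replace=False, n_samples=5, random_state=0))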
Category: Data Science

Base model in ensemble learning

I've been doing some research on ensemble learning and read that for base models, models with high variance are often recommended (I can't remember exactly which book I read this in). But it seems counter-intuitive, because wouldn't having base models with low variance (doing well on the test set) be better than having multiple bad base models?
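
As a rough experiment of what I mean (synthetic data and arbitrary settings, not a definitive benchmark), one could bag a high-bias stump against a high-variance deep tree and compare cross-validated accuracy:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for depth in (1, None):  # depth=1 -> high bias; depth=None -> high variance
    bag = BaggingClassifier(DecisionTreeClassifier(max_depth=depth),
                            n_estimators=100, random_state=0)
    score = cross_val_score(bag, X, y, cv=5).mean()
    print(f"max_depth={depth}: CV accuracy {score:.3f}")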
Category: Data Science

Bagging vs pasting in ensemble learning

This is a citation from "Hands-on machine learning with Scikit-Learn, Keras and TensorFlow" by Aurelien Geron: "Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but this also means that predictors end up being less correlated so the ensemble’s variance is reduced." I can't understand why bagging, as compared to pasting, results in higher bias and lower variance. Can anyone provide an intuitive …
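
A rough way to poke at the quoted claim (my own sketch on synthetic data, not from the book; it only probes the variability induced by the ensemble's own sampling, not a formal bias-variance decomposition):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
x_query = X[:1]  # a single query point to probe prediction variability

for bootstrap in (True, False):  # True -> bagging, False -> pasting
    preds = []
    for seed in range(20):
        model = BaggingRegressor(DecisionTreeRegressor(),
                                 n_estimators=50, max_samples=0.7,
                                 bootstrap=bootstrap, random_state=seed)
        model.fit(X, y)
        preds.append(model.predict(x_query)[0])
    label = "bagging" if bootstrap else "pasting"
    print(f"{label}: mean {np.mean(preds):.2f}, std {np.std(preds):.2f}")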
Category: Data Science

bagging vs. pasting in ensemble learning

I am a bit confused about two concepts. From my understanding, bagging is when each data point is replaced after each choice; so, for example, for each subset of data you pick one point from the population, replace it, then pick one again, etc., and this is repeated for each subset of data. But for pasting, people say it is sampling without replacement; however, does that mean you can't have the same data point in any subset? I thought it picks one subset w/o replacement but …
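
A tiny sketch of the distinction I am asking about, with an invented population of ten points: without replacement means no repeats inside any one subset, but the same point can still show up in several different subsets:

import numpy as np

rng = np.random.default_rng(0)
population = np.arange(10)

subsets = [rng.choice(population, size=5, replace=False) for _ in range(3)]
for i, s in enumerate(subsets):
    print(f"subset {i}: {sorted(s)}")   # no repeats inside any one subset
# a given observation may still appear in more than one of the three subsets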
Category: Data Science

Bagging Base models

If bagging reduces overfitting, is the general statement that base learners of ensemble models should have high bias and low variance (that is, should be underfitting) wrong?
Topic: bagging
Category: Data Science

Why is the accuracy of my bagging model heavily affected by the random state?

The accuracy of my bagging decision tree model reaches up to 97% when I set the random seed to 5, but it drops to only 92% when I set the random seed to 0. Can someone explain why there is such a huge gap, and should I just report the highest accuracy in my research paper, or take the average with random seed=None?
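
To illustrate what I mean by taking the average (a sketch on synthetic data, not my actual model), one could report the mean and spread of the score over several random states instead of the best single seed:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

scores = []
for seed in range(10):
    bag = BaggingClassifier(DecisionTreeClassifier(random_state=seed),
                            n_estimators=50, random_state=seed)
    scores.append(cross_val_score(bag, X, y, cv=5).mean())

print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f} over 10 seeds")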
Category: Data Science

Counting the number of trainable parameters in a gradient boosted tree

I recently ran the gradient boosted tree regressor using scikit-learn via: GradientBoostingRegressor() This model depends on the following hyperparameters: Estimators ($N_1$), Min Samples Leaf ($N_2$), and Max Depth ($N_3$), which in turn determine the number of trainable parameters in this model. My question is: how can I count the number of parameters (trainable or otherwise randomly assigned) that determined the final model, as a function of the above? My guess is $N_1 \times N_2 \times N_3$, but is this correct?
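
One way to sanity-check any formula would be to count the fitted quantities directly from the trained trees; here is a sketch on synthetic data (the hyperparameter values are arbitrary):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=100, min_samples_leaf=5,
                                max_depth=3, random_state=0)
gbr.fit(X, y)

# Each internal node stores a split feature and a threshold; each leaf stores
# a predicted value, so node counts give a direct tally of fitted quantities.
total_nodes = sum(tree.tree_.node_count for stage in gbr.estimators_ for tree in stage)
total_leaves = sum(tree.tree_.n_leaves for stage in gbr.estimators_ for tree in stage)
print("total nodes:", total_nodes, "of which leaves:", total_leaves)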
Category: Data Science

Can I use bagging as an improvement technique for a decision tree in my research?

Bagging uses a decision tree as the base classifier. In my research, I want to use bagging with a decision tree (C4.5) as the base classifier, as a method that improves the decision tree (C4.5) and solves the overfitting problem. Is that possible? Some lecturers said it is not right, because bagging is a separate classifier rather than a hybrid of the two.
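
A sketch of the setup I have in mind (note that sklearn's tree is CART rather than C4.5; criterion="entropy" is only a rough stand-in for an information-gain-based tree, and the data here is synthetic):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(criterion="entropy"),
                                 n_estimators=100, random_state=0)

print("single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())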
Category: Data Science

Difference between bagging and bootstrap aggregating

The bootstrap is due to Efron; Tibshirani wrote a book about it with Efron. The bootstrap process estimates the standard error of a statistic s(x): B bootstrap samples are generated from the original data, and finally the standard deviation of the values s(x1), s(x2), ..., s(xB) is our estimate of the standard error of s(x). The bootstrap estimate of the standard error is thus the standard deviation of the bootstrap replications. Typical values for B, the number of bootstrap samples, range from 50 to 200 for standard-error estimation. Breiman wrote …
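
A minimal sketch of that procedure on an invented sample, with the median as the statistic s(x) and an arbitrary B:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100)   # invented sample
B = 200                                        # within the typical 50-200 range

# B bootstrap replications of the statistic, each on a resample of x
replications = [np.median(rng.choice(x, size=len(x), replace=True))
                for _ in range(B)]
print("bootstrap estimate of the standard error:", np.std(replications, ddof=1))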
Category: Data Science

Can bagging ensemble consist of heterogeneous base models?

Bagging or bootstrap aggregation seems to make sense for time series forecasting using an ensemble because bagging randomizes subsets of the data with replacement. However, I've only seen bagging used for homogeneous base learners when constructing ensembles. Stacking is another ensemble technique that uses heterogeneous base learners, but stacking employs cross-validation, which I don't view as being appropriate for economic time series forecasting, even if time series split cross-validation that retains the ordering of observations is used. As you can …
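
A hand-rolled sketch of what I mean by bagging heterogeneous base models: each learner type is fitted on its own bootstrap sample and the predictions are averaged (this is just an illustration of the idea on synthetic data, not an established library API):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

X, y = make_regression(n_samples=1000, n_features=8, noise=5.0, random_state=0)

base_models = [DecisionTreeRegressor(max_depth=6),
               Ridge(alpha=1.0),
               KNeighborsRegressor(n_neighbors=10)]

fitted = []
for i, model in enumerate(base_models):
    Xb, yb = resample(X, y, replace=True, random_state=i)  # bootstrap sample
    fitted.append(model.fit(Xb, yb))

# ensemble prediction = average of the heterogeneous members' predictions
ensemble_pred = np.mean([m.predict(X) for m in fitted], axis=0)
print("ensemble prediction shape:", ensemble_pred.shape)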
Category: Data Science

Random Forest Stacking Experiment for Imbalanced Data-set Problem

In order to solve an imbalanced dataset problem, I experimented with Random Forest in the following manner (somewhat inspired by deep learning): I trained a Random Forest that takes in the input data, and the predicted probability of the label from that trained model is used as the input to train another Random Forest. Pseudo code for this:

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf_model = RandomForestClassifier()
rf_model.fit(train_X, train_y)
pred = rf_model.predict(test_X)
print('******************RANDOM FOREST CM*******************************')
print(confusion_matrix(test_y, …
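
For completeness, here is a self-contained sketch of the two-stage idea on synthetic imbalanced data (assumed from my description above, with placeholder settings; ideally the second stage would be trained on out-of-fold probabilities to avoid leakage, which is skipped here for brevity):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Stage 1: ordinary Random Forest on the raw features
rf_1 = RandomForestClassifier(random_state=0).fit(train_X, train_y)

# Stage 2: a second Random Forest trained on stage 1's class probabilities
rf_2 = RandomForestClassifier(random_state=0).fit(rf_1.predict_proba(train_X), train_y)

stacked_pred = rf_2.predict(rf_1.predict_proba(test_X))
print(confusion_matrix(test_y, stacked_pred))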
Category: Data Science

Boosting with highly correlated features

I have a conceptual question. My understanding is that Random Forest can be applied even when features are (highly) correlated. This is because, with bagging, the influence of a few highly correlated features is moderated, since each feature only occurs in some of the trees that are finally used to build the overall model. My question: with boosting, usually even smaller trees (basically "stumps") are used. Is it a problem to have many (highly) correlated features in a boosting approach?
Category: Data Science
