Big difference between Bootstrap Values and Approximately Unbiased p-values

I'm clustering objects over many different descriptors. I chose a hierarchical clustering method (specifically the average-linkage algorithm with Euclidean distances) because I wanted to use bootstrap values to attach statistical significance to my clusters. I used pvclust (in Python; it should be equivalent to the R package pvclust). The package calculates both bootstrap probabilities (BP) and approximately unbiased (AU) p-values. The results are shown in this dendrogram. I don't know how to interpret the fact that the AU values are relatively high while …
Category: Data Science

Evaluate Dendrogram Statistical Significance

I have N=21 objects, each with about 80 non-NaN descriptors. I carried out hierarchical clustering on the objects and obtained this dendrogram. I want some kind of 'confidence' index for the dendrogram, or for each node. I have seen many dendrograms with bootstrap values (as far as I understand, this is the same idea as Monte Carlo cross-validation, but I might be wrong), and I think they could be used in my case as well. …
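One way to get a per-node confidence index is the pvclust-style bootstrap: resample the descriptors (columns) with replacement, recluster the objects, and count how often each original cluster reappears. A minimal sketch with SciPy, using random stand-in data for the 21×80 matrix (the data, k, and B values here are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(21, 80))  # stand-in for 21 objects x 80 descriptors

def clusters_at(X, k):
    """Average-linkage clustering of rows, cut into (at most) k clusters."""
    Z = linkage(X, method="average", metric="euclidean")
    labels = fcluster(Z, t=k, criterion="maxclust")
    return {frozenset(np.flatnonzero(labels == c)) for c in np.unique(labels)}

k = 3
reference = clusters_at(X, k)

# Bootstrap over descriptors (columns): resample the 80 descriptors with
# replacement and recluster the same 21 objects each time.
B = 100
support = {c: 0 for c in reference}
for _ in range(B):
    cols = rng.integers(0, X.shape[1], X.shape[1])
    boot_clusters = clusters_at(X[:, cols], k)
    for c in reference:
        if c in boot_clusters:
            support[c] += 1

# Ordinary bootstrap probability (BP) per original cluster.
bp = {c: n / B for c, n in support.items()}
```

This gives the plain BP values; the AU correction that pvclust adds requires the multiscale bootstrap (resampling at several sample sizes), which is more involved.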
Category: Data Science

Data analysis leads to linear regression model: how to proceed with prognosis?

Data analysis of a large dataset of project-management data, together with working hours, led me to a surprisingly simple linear model over the key milestones of all projects. Now I am a bit at a loss about how to proceed. The stakeholder wants a prediction of working hours spent per milestone and the total working hours needed for one project. 1.) Do I calculate an average linear regression plus a confidence interval and use that to predict other project outcomes? 2.) Do …
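For option 1, a standard route is simple OLS plus a t-based prediction interval for a new project milestone. A sketch with made-up numbers (the milestone/hours data here are hypothetical, not from the question):

```python
import numpy as np
from scipy import stats

# Hypothetical data: milestone index vs cumulative working hours.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([120.0, 250.0, 365.0, 480.0, 610.0, 720.0])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)            # slope, intercept
resid = y - (b0 + b1 * x)
s = np.sqrt(resid @ resid / (n - 2))    # residual standard error
Sxx = ((x - x.mean()) ** 2).sum()

def prediction_interval(x0, alpha=0.05):
    """t-based prediction interval for hours at a new milestone x0."""
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    se = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
    yhat = b0 + b1 * x0
    return yhat - t * se, yhat + t * se

lo, hi = prediction_interval(7.0)
```

Note the distinction the question is circling: a confidence interval covers the mean hours at a milestone, while a prediction interval (as above) covers the hours of one new project, and is wider.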
Category: Data Science

Perform bootstrapping of an ordinary linear regression model, using B=100 bootstrap resamples of my dataset, and get the RMSE

So I'm studying machine learning through R, and I'm working with the Boston dataset from the MASS library. I am practicing bootstrapping. I have already carried out an analysis to determine how many distinct data points, on average, are drawn from the sample to make up a bootstrap resample, using B=100 resamples of the dataset. Next I would like to do two things: perform bootstrapping of an ordinary linear regression model using B=100 resamples of the dataset again, and use …
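The question is about R/MASS, but the bootstrap loop itself is the same in any language. A NumPy-only sketch with synthetic data standing in for Boston (note that evaluating the RMSE on the full dataset, on the out-of-bag rows, or on the resample itself are different design choices; this sketch uses the full dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + predictor
y = X @ np.array([2.0, 3.0]) + rng.normal(scale=1.0, size=n)

B = 100
rmses = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)                  # bootstrap resample of rows
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    resid = y - X @ beta                         # evaluate on the full dataset
    rmses[b] = np.sqrt(np.mean(resid ** 2))

rmse_mean, rmse_se = rmses.mean(), rmses.std(ddof=1)
```

The spread of the B RMSE values then gives a sense of how stable the fitted model's error is under resampling.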
Category: Data Science

Stratified sampling - use of proxy variable

For splitting the data into train/test/validation sets I use stratified sampling. Is it appropriate to define the strata using information extracted from the dataset itself, e.g. using machine learning to model a proxy variable that is then used for the strata definition? My worry is potential data leakage; I wasn't able to find any counter-argument, though.
Category: Data Science

Question on bootstrap sampling

I have a corpus of manually annotated (aka "gold standard") documents and a collection of NLP systems' annotations on the text from the corpus. I want to do bootstrap sampling of the system and gold-standard annotations to approximate a mean and standard error for various measures, so that I can run a series of hypothesis tests, possibly using ANOVA. The issue is how to do the sampling. I have 40 documents in the corpus with ~44K manual annotation …
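With 40 documents and ~44K annotations, the usual choice is to resample whole documents (the independent units), not individual annotations. A minimal sketch with hypothetical per-document scores (the score values are made up; in practice they would be, e.g., per-document F1 of one system against the gold standard):

```python
import numpy as np

rng = np.random.default_rng(2)
n_docs = 40

# Hypothetical per-document scores for one system vs the gold standard.
doc_scores = rng.uniform(0.6, 0.95, size=n_docs)

B = 1000
replicates = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n_docs, n_docs)   # resample whole documents
    replicates[b] = doc_scores[idx].mean()

boot_mean = replicates.mean()
boot_se = replicates.std(ddof=1)            # bootstrap standard error
```

Resampling at the document level keeps annotations from the same document together, which respects their within-document correlation.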
Category: Data Science

Shifting the mean of an array for bootstrap hypothesis testing

I am trying to understand a textbook exercise. I have an array of data force_b = array([0.172, 0.142, 0.037, 0.453, 0.355, 0.022, 0.502, 0.273, 0.72, 0.582, 0.198, 0.198, 0.597, 0.516, 0.815, 0.402, 0.605, 0.711, 0.614, 0.468]) with mean 0.4191000000000001. I have another mean of 0.55, and I have to shift the data of the array above so that I get an array with a mean of 0.55. The solution in the exercise is translated_force_b = …
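The shift is just a constant translation: subtract the sample mean, add the target mean. Every value moves by the same amount, so the spread and shape of the data are unchanged, only the center moves, which is exactly what a bootstrap test of the null "mean = 0.55" needs. Using the array from the question:

```python
import numpy as np

force_b = np.array([0.172, 0.142, 0.037, 0.453, 0.355, 0.022, 0.502,
                    0.273, 0.72, 0.582, 0.198, 0.198, 0.597, 0.516,
                    0.815, 0.402, 0.605, 0.711, 0.614, 0.468])

# Translate every value by the same constant so the mean becomes 0.55
# while the variance of the data is unchanged.
translated_force_b = force_b - force_b.mean() + 0.55
```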
Category: Data Science

Oversampling techniques for a class with 1 sample

I have 5 classes, one of them having only one sample. I've been researching techniques to oversample such as SMOTE and Bootstrapping but they do not work for the class with only one sample. I am considering repetition of this class. Are there any other strategies you would recommend? Would repetition followed by SMOTE make sense or not really? Due to the nature of SMOTE using k-nearest neighbors?
Category: Data Science

List of samples that each tree in a random forest is trained on in Scikit-Learn

In scikit-learn's random forest, you can set bootstrap=True so that each tree selects a subset of samples to train on. Is there a way to see which samples are used in each tree? I went through the documentation about the tree estimators and all the attributes of the trees that scikit-learn makes available, but none of them seems to provide what I'm looking for.
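scikit-learn does not expose the indices directly, but each fitted tree stores the integer seed it used to draw its bootstrap sample, and replaying that seed reproduces the row indices. This mirrors scikit-learn's current internals (a plain `randint(0, n, n)` draw when `max_samples` is left at its default), so treat it as version-dependent rather than a stable API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50, random_state=0)
rf = RandomForestClassifier(n_estimators=5, bootstrap=True, random_state=0)
rf.fit(X, y)

# Replay each tree's stored seed to recover its bootstrap row indices.
# This matches scikit-learn's internal sampling as of current versions;
# it may break if the internals change.
n = X.shape[0]
tree_samples = [
    np.random.RandomState(tree.random_state).randint(0, n, n)
    for tree in rf.estimators_
]
```

Each entry of `tree_samples` is the (with-replacement) list of row indices one tree was trained on; rows absent from it are that tree's out-of-bag samples.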
Category: Data Science

How are the same observation sets treated in Random Forests with Bootstrapping?

Let's assume an extremely small dataset with only 4 observations, and I create a random forest model with quite a large number of trees, say 200. If so, some of the sample sets used in fitting can be identical to each other, right? Is that OK? Even when a dataset is large, identical sample sets can theoretically be drawn. Do bootstrapping and the random forest method not care at all, or is there a step to avoid such duplicates?
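The duplicates are not just possible here but guaranteed, and bagging makes no attempt to avoid them. A bootstrap sample is a multiset of size n drawn from n items, so there are C(2n-1, n) distinct possibilities; a quick check for the numbers in the question:

```python
from math import comb

n = 4        # observations in the dataset
B = 200      # trees in the forest

# Number of distinct bootstrap samples (multisets of size n from n items).
distinct = comb(2 * n - 1, n)   # C(7, 4) = 35
```

With only 35 distinct bootstrap samples and 200 trees, the pigeonhole principle forces many trees to see identical training sets; those trees are simply identical, which wastes computation but does not bias the ensemble.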
Category: Data Science

nnet in caret. Bootstrapping or cross-validation?

I want to train a shallow neural network with one hidden layer using nnet in caret. In trainControl, I used method = "cv" to perform 3-fold cross-validation. A snippet of the code and the results summary are below. myControl <- trainControl(## 3-fold CV method = "cv", number = 3) nnGrid <- expand.grid(size = seq(1, 10, 3), decay = c(0, 0.2, 0.4)) set.seed(1234) nnetFit <- train(choice ~ ., data = db, method = "nnet", maxit = 1000, tuneGrid = nnGrid, trainControl = myControl) I …
Category: Data Science

Resampling train and test data in R

I need to try out a few different machine learning methods (SVM, logistic regression, etc.), predict a value that is either true or false, and write down the AUC and accuracy of these predictions. I have already done that successfully; now I have two matrices, one for AUC and one for accuracy, filled with data from SVM and logistic regression (one row). Now I need to create models for SVM and logistic regression 10 more times (I should use …
Category: Data Science

Difference between Bagging and Bootstrap aggregating

The bootstrap is due to Efron; Tibshirani wrote a book about it together with Efron. The bootstrap process for estimating the standard error of a statistic s(x): B bootstrap samples are generated from the original data, and finally the standard deviation of the values s(x1), s(x2), ..., s(xB) is our estimate of the standard error of s(x). The bootstrap estimate of the standard error is the standard deviation of the bootstrap replications. Typical values of B, the number of bootstrap samples, range from 50 to 200 for standard-error estimation. Breiman wrote …
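The Efron procedure described above translates almost line for line into code. A sketch for an arbitrary statistic (the sample data here are synthetic; the median is just an example statistic):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100)   # hypothetical sample

def bootstrap_se(x, stat, B=200, rng=rng):
    """Standard deviation of B bootstrap replications of stat(x)."""
    reps = np.array([
        stat(x[rng.integers(0, len(x), len(x))])  # one bootstrap resample
        for _ in range(B)
    ])
    return reps.std(ddof=1)

se_median = bootstrap_se(x, np.median)
```

Bagging (bootstrap aggregating) reuses exactly this resampling step but, instead of taking the standard deviation of the replications, averages the B fitted models' predictions.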
Category: Data Science

Estimate class proportions of a feature, central limit theorem

I haven't been feeling smart lately, and this is probably the most trivial question ever, but I really need to know. I'm trying to point-estimate some population parameters. I sampled from 1000 randomly generated bootstrap samples of 130000 observations and saved the frequency and percentile of each class of a categorical feature. If I were to take an average of the percentiles or frequencies, would that be a good way to estimate a proportion for the classes of a feature? I …
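Averaging the per-replicate frequencies is essentially the bootstrap point estimate of the proportion, and by the central limit theorem the replicate means are approximately normal, so their spread doubles as a standard error. A scaled-down sketch (13,000 rows instead of 130,000 and B=200 instead of 1000, purely for speed; the class labels and probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical categorical feature (scaled down from 130000 rows).
classes = rng.choice(["A", "B", "C"], size=13_000, p=[0.5, 0.3, 0.2])

B = 200
props_A = np.empty(B)
for b in range(B):
    sample = classes[rng.integers(0, len(classes), len(classes))]
    props_A[b] = np.mean(sample == "A")

est = props_A.mean()      # averages back to ~ the observed share of "A"
se = props_A.std(ddof=1)  # bootstrap standard error of the proportion
```

The average of the replicates recovers (up to simulation noise) the observed sample proportion, so the real payoff of the bootstrap here is the standard error, not the point estimate itself.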
Category: Data Science

About confidence/prediction intervals: parametric methods VS non-parametric (via bootstrap) methods

Regarding the methodology to find confidence and/or prediction intervals in, say, a regression problem, I know of 2 main options: 1) checking normality of the estimates/predictions distribution, and applying the well-known Gaussian methods to find those intervals if the distribution is Gaussian; 2) applying non-parametric methodologies like bootstrapping, so we do not need to assume/check/care whether our distribution is normal. With this in mind, I would basically always go for the second one because: it is meant to be generic, as …
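The two options can be put side by side for the simplest case, a confidence interval for a mean (the sample here is synthetic; the percentile bootstrap is only one of several bootstrap interval constructions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=60)

# Option 1: parametric, t-based 95% CI for the mean (assumes normality).
t = stats.t.ppf(0.975, df=len(x) - 1)
half = t * x.std(ddof=1) / np.sqrt(len(x))
param_ci = (x.mean() - half, x.mean() + half)

# Option 2: non-parametric percentile bootstrap CI, no normality assumed.
B = 2000
means = np.array([
    x[rng.integers(0, len(x), len(x))].mean() for _ in range(B)
])
boot_ci = (np.percentile(means, 2.5), np.percentile(means, 97.5))
```

When the data really are Gaussian the two intervals nearly coincide; the trade-off is that the bootstrap costs B refits and can misbehave for small samples or statistics like extremes, so "always bootstrap" is not entirely free.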
Category: Data Science
