Big difference between Bootstrap Values and Approximately Unbiased p-values

I'm clustering objects over many different descriptors. I chose a hierarchical clustering method (specifically the average-linkage algorithm with Euclidean distances) because I wanted to use bootstrap values to attach statistical significance to my clusters. I used pvclust (in Python; it should be equivalent to the R package pvclust). The package calculates both bootstrap probabilities (BP) and approximately unbiased (AU) p-values. The results are shown in this dendrogram. I don't know how to interpret the fact that the AU values are relatively high while …
Category: Data Science

Evaluate Dendrogram Statistical Significance

I have N=21 objects, each with about 80 non-NaN descriptors. I carried out hierarchical clustering on the objects and obtained this dendrogram. I want some kind of 'confidence' index for the dendrogram, or for each node. I have seen many dendrograms with bootstrap values (as far as I understand, this is the same idea as Monte Carlo cross-validation, but I might be wrong), and I think they could be used in my case as well. …
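One way to get a per-node confidence index is the pvclust-style bootstrap: resample the descriptors (columns) with replacement, recluster the objects, and count how often each original cluster reappears. A minimal sketch with SciPy, using random stand-in data for the 21×80 matrix (the data, k, and B values here are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(21, 80))  # stand-in for 21 objects x 80 descriptors

def clusters_at(X, k):
    """Average-linkage clustering of rows, cut into (at most) k clusters."""
    Z = linkage(X, method="average", metric="euclidean")
    labels = fcluster(Z, t=k, criterion="maxclust")
    return {frozenset(np.flatnonzero(labels == c)) for c in np.unique(labels)}

k = 3
reference = clusters_at(X, k)

# Bootstrap over descriptors (columns): resample the 80 descriptors with
# replacement and recluster the same 21 objects each time.
B = 100
support = {c: 0 for c in reference}
for _ in range(B):
    cols = rng.integers(0, X.shape[1], X.shape[1])
    boot_clusters = clusters_at(X[:, cols], k)
    for c in reference:
        if c in boot_clusters:
            support[c] += 1

# Ordinary bootstrap probability (BP) per original cluster.
bp = {c: n / B for c, n in support.items()}
```

This gives the plain BP values; the AU correction that pvclust adds requires the multiscale bootstrap (resampling at several sample sizes), which is more involved.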
Category: Data Science

Data analysis leads to linear regression model: how to proceed with prognosis?

Data analysis of a large dataset of project-management data, together with working hours, led me to a surprisingly simple linear model over the key milestones of all projects. Now I am a bit at a loss about how to proceed. The stakeholder wants a prediction of working hours spent per milestone and the total working hours needed for one project. 1.) Do I calculate an average linear regression plus a confidence interval and use that to predict other project outcomes? 2.) Do …
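For option 1, a standard route is simple OLS plus a t-based prediction interval for a new project milestone. A sketch with made-up numbers (the milestone/hours data here are hypothetical, not from the question):

```python
import numpy as np
from scipy import stats

# Hypothetical data: milestone index vs cumulative working hours.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([120.0, 250.0, 365.0, 480.0, 610.0, 720.0])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)            # slope, intercept
resid = y - (b0 + b1 * x)
s = np.sqrt(resid @ resid / (n - 2))    # residual standard error
Sxx = ((x - x.mean()) ** 2).sum()

def prediction_interval(x0, alpha=0.05):
    """t-based prediction interval for hours at a new milestone x0."""
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    se = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
    yhat = b0 + b1 * x0
    return yhat - t * se, yhat + t * se

lo, hi = prediction_interval(7.0)
```

Note the distinction the question is circling: a confidence interval covers the mean hours at a milestone, while a prediction interval (as above) covers the hours of one new project, and is wider.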
Category: Data Science

Perform bootstrapping of an ordinary linear regression model, using B=100 bootstrap resamples of my dataset, and get the RMSE

So I'm studying machine learning through R, and I'm working with the Boston dataset from the MASS library. I am practicing bootstrapping. I have already carried out an analysis to determine how many distinct data points, on average, are drawn from the sample to make up a bootstrap resample, using B=100 resamples of the dataset. Next I would like to do two things: perform bootstrapping of an ordinary linear regression model using B=100 resamples of the dataset again, and use …
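The question is about R/MASS, but the bootstrap loop itself is the same in any language. A NumPy-only sketch with synthetic data standing in for Boston (note that evaluating the RMSE on the full dataset, on the out-of-bag rows, or on the resample itself are different design choices; this sketch uses the full dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + predictor
y = X @ np.array([2.0, 3.0]) + rng.normal(scale=1.0, size=n)

B = 100
rmses = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)                  # bootstrap resample of rows
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    resid = y - X @ beta                         # evaluate on the full dataset
    rmses[b] = np.sqrt(np.mean(resid ** 2))

rmse_mean, rmse_se = rmses.mean(), rmses.std(ddof=1)
```

The spread of the B RMSE values then gives a sense of how stable the fitted model's error is under resampling.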
Category: Data Science

Stratified sampling - use of proxy variable

For splitting the data into train/test/validation sets I use stratified sampling. Is it appropriate to define the strata using information extracted from the dataset itself, e.g. using machine learning to model a proxy variable that is then used for the strata definition? My worry is potential data leakage; I wasn't able to find any counter-argument, though.
Category: Data Science

Question on bootstrap sampling

I have a corpus of manually annotated (aka "gold standard") documents and a collection of NLP systems' annotations on the text from the corpus. I want to do bootstrap sampling of the system and gold-standard annotations to approximate a mean and standard error for various measures, so that I can run a series of hypothesis tests, possibly using ANOVA. The issue is how to do the sampling. I have 40 documents in the corpus with ~44K manual annotation …
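With 40 documents and ~44K annotations, the usual choice is to resample whole documents (the independent units), not individual annotations. A minimal sketch with hypothetical per-document scores (the score values are made up; in practice they would be, e.g., per-document F1 of one system against the gold standard):

```python
import numpy as np

rng = np.random.default_rng(2)
n_docs = 40

# Hypothetical per-document scores for one system vs the gold standard.
doc_scores = rng.uniform(0.6, 0.95, size=n_docs)

B = 1000
replicates = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n_docs, n_docs)   # resample whole documents
    replicates[b] = doc_scores[idx].mean()

boot_mean = replicates.mean()
boot_se = replicates.std(ddof=1)            # bootstrap standard error
```

Resampling at the document level keeps annotations from the same document together, which respects their within-document correlation.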
Category: Data Science

Shifting the mean of an array for bootstrap hypothesis testing

I am trying to understand a textbook exercise. I have an array of data force_b = array([0.172, 0.142, 0.037, 0.453, 0.355, 0.022, 0.502, 0.273, 0.72, 0.582, 0.198, 0.198, 0.597, 0.516, 0.815, 0.402, 0.605, 0.711, 0.614, 0.468]) with mean 0.4191000000000001. I have another mean of 0.55, and I have to shift the data of the array above so that I get an array with a mean of 0.55. The solution in the exercise is translated_force_b = …
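The shift is just a constant translation: subtract the sample mean, add the target mean. Every value moves by the same amount, so the spread and shape of the data are unchanged, only the center moves, which is exactly what a bootstrap test of the null "mean = 0.55" needs. Using the array from the question:

```python
import numpy as np

force_b = np.array([0.172, 0.142, 0.037, 0.453, 0.355, 0.022, 0.502,
                    0.273, 0.72, 0.582, 0.198, 0.198, 0.597, 0.516,
                    0.815, 0.402, 0.605, 0.711, 0.614, 0.468])

# Translate every value by the same constant so the mean becomes 0.55
# while the variance of the data is unchanged.
translated_force_b = force_b - force_b.mean() + 0.55
```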
Category: Data Science

Oversampling techniques for a class with 1 sample

I have 5 classes, one of them having only one sample. I've been researching techniques to oversample such as SMOTE and Bootstrapping but they do not work for the class with only one sample. I am considering repetition of this class. Are there any other strategies you would recommend? Would repetition followed by SMOTE make sense or not really? Due to the nature of SMOTE using k-nearest neighbors?
Category: Data Science

List of samples that each tree in a random forest is trained on in Scikit-Learn

In scikit-learn's random forest, you can set bootstrap=True so that each tree selects a subset of samples to train on. Is there a way to see which samples are used in each tree? I went through the documentation about the tree estimators and all the attributes of the trees that scikit-learn makes available, but none of them seems to provide what I'm looking for.
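scikit-learn does not expose the indices directly, but each fitted tree stores the integer seed it used to draw its bootstrap sample, and replaying that seed reproduces the row indices. This mirrors scikit-learn's current internals (a plain `randint(0, n, n)` draw when `max_samples` is left at its default), so treat it as version-dependent rather than a stable API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50, random_state=0)
rf = RandomForestClassifier(n_estimators=5, bootstrap=True, random_state=0)
rf.fit(X, y)

# Replay each tree's stored seed to recover its bootstrap row indices.
# This matches scikit-learn's internal sampling as of current versions;
# it may break if the internals change.
n = X.shape[0]
tree_samples = [
    np.random.RandomState(tree.random_state).randint(0, n, n)
    for tree in rf.estimators_
]
```

Each entry of `tree_samples` is the (with-replacement) list of row indices one tree was trained on; rows absent from it are that tree's out-of-bag samples.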
Category: Data Science

How are the same observation sets treated in Random Forests with Bootstrapping?

Let's assume an extremely small dataset with only 4 observations, and I create a random forest model with quite a large number of trees, say 200. If so, some of the sample sets used in fitting can be identical to each other, right? Is that OK? Even when a dataset is large, identical sample sets can theoretically be drawn. Do bootstrapping and the random forest method not care at all, or is there a step to avoid such duplicates?
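The duplicates are not just possible here but guaranteed, and bagging makes no attempt to avoid them. A bootstrap sample is a multiset of size n drawn from n items, so there are C(2n-1, n) distinct possibilities; a quick check for the numbers in the question:

```python
from math import comb

n = 4        # observations in the dataset
B = 200      # trees in the forest

# Number of distinct bootstrap samples (multisets of size n from n items).
distinct = comb(2 * n - 1, n)   # C(7, 4) = 35
```

With only 35 distinct bootstrap samples and 200 trees, the pigeonhole principle forces many trees to see identical training sets; those trees are simply identical, which wastes computation but does not bias the ensemble.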
Category: Data Science

nnet in caret. Bootstrapping or cross-validation?

I want to train a shallow neural network with one hidden layer using nnet in caret. In trainControl, I used method = "cv" to perform 3-fold cross-validation. A snippet of the code and the results summary are below. myControl <- trainControl(## 3-fold CV method = "cv", number = 3) nnGrid <- expand.grid(size = seq(1, 10, 3), decay = c(0, 0.2, 0.4)) set.seed(1234) nnetFit <- train(choice ~ ., data = db, method = "nnet", maxit = 1000, tuneGrid = nnGrid, trainControl = myControl) I …
Category: Data Science

Resampling train and test data in R

I need to try out a few different machine learning methods (SVM, logistic regression, etc.), predict a value that is either true or false, and write down the AUC and accuracy of these predictions. I have already done that successfully; now I have two matrices, one for AUC and one for accuracy, filled with data from SVM and logistic regression (one row). Now I need to create models for SVM and logistic regression 10 more times (I should use …
Category: Data Science

Difference between Bagging and Bootstrap aggregating

The bootstrap is due to Efron; Tibshirani wrote a book about it together with Efron. The bootstrap process for estimating the standard error of a statistic s(x): B bootstrap samples are generated from the original data, and finally the standard deviation of the values s(x1), s(x2), ..., s(xB) is our estimate of the standard error of s(x). The bootstrap estimate of the standard error is the standard deviation of the bootstrap replications. Typical values of B, the number of bootstrap samples, range from 50 to 200 for standard-error estimation. Breiman wrote …
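The Efron procedure described above translates almost line for line into code. A sketch for an arbitrary statistic (the sample data here are synthetic; the median is just an example statistic):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=100)   # hypothetical sample

def bootstrap_se(x, stat, B=200, rng=rng):
    """Standard deviation of B bootstrap replications of stat(x)."""
    reps = np.array([
        stat(x[rng.integers(0, len(x), len(x))])  # one bootstrap resample
        for _ in range(B)
    ])
    return reps.std(ddof=1)

se_median = bootstrap_se(x, np.median)
```

Bagging (bootstrap aggregating) reuses exactly this resampling step but, instead of taking the standard deviation of the replications, averages the B fitted models' predictions.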
Category: Data Science

Estimate class proportions of a feature, central limit theorem

I haven't been feeling smart lately, and this is probably the most trivial question ever, but I really need to know. I'm trying to point-estimate some population parameters. I sampled from 1000 randomly generated bootstrap samples of 130000 observations and saved the frequency and percentile of each class of a categorical feature. If I were to take an average of the percentiles or frequencies, would that be a good way to estimate a proportion for the classes of a feature? I …
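Averaging the per-replicate frequencies is essentially the bootstrap point estimate of the proportion, and by the central limit theorem the replicate means are approximately normal, so their spread doubles as a standard error. A scaled-down sketch (13,000 rows instead of 130,000 and B=200 instead of 1000, purely for speed; the class labels and probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical categorical feature (scaled down from 130000 rows).
classes = rng.choice(["A", "B", "C"], size=13_000, p=[0.5, 0.3, 0.2])

B = 200
props_A = np.empty(B)
for b in range(B):
    sample = classes[rng.integers(0, len(classes), len(classes))]
    props_A[b] = np.mean(sample == "A")

est = props_A.mean()      # averages back to ~ the observed share of "A"
se = props_A.std(ddof=1)  # bootstrap standard error of the proportion
```

The average of the replicates recovers (up to simulation noise) the observed sample proportion, so the real payoff of the bootstrap here is the standard error, not the point estimate itself.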
Category: Data Science

About confidence/prediction intervals: parametric methods VS non-parametric (via bootstrap) methods

Regarding the methodology to find confidence and/or prediction intervals in, say, a regression problem, I know of 2 main options: 1) checking normality of the estimates/predictions distribution, and applying the well-known Gaussian methods to find those intervals if the distribution is Gaussian; 2) applying non-parametric methodologies like bootstrapping, so we do not need to assume/check/care whether our distribution is normal. With this in mind, I would basically always go for the second one because: it is meant to be generic, as …
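The two options can be put side by side for the simplest case, a confidence interval for a mean (the sample here is synthetic; the percentile bootstrap is only one of several bootstrap interval constructions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=60)

# Option 1: parametric, t-based 95% CI for the mean (assumes normality).
t = stats.t.ppf(0.975, df=len(x) - 1)
half = t * x.std(ddof=1) / np.sqrt(len(x))
param_ci = (x.mean() - half, x.mean() + half)

# Option 2: non-parametric percentile bootstrap CI, no normality assumed.
B = 2000
means = np.array([
    x[rng.integers(0, len(x), len(x))].mean() for _ in range(B)
])
boot_ci = (np.percentile(means, 2.5), np.percentile(means, 97.5))
```

When the data really are Gaussian the two intervals nearly coincide; the trade-off is that the bootstrap costs B refits and can misbehave for small samples or statistics like extremes, so "always bootstrap" is not entirely free.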
Category: Data Science
