I am training a machine learning model (i.e., a classifier) on a large dataset. I know that I can get the same results using less data (about 30% of it), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain from the dataset. Of course I can create a script which automatically tries different thresholds, but I was wondering if there is any principled way of doing this. It seems strange that …
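A minimal sketch of one principled option, a learning curve, assuming a feature matrix `X`, labels `y`, and a placeholder classifier: the validation score is computed for increasing training-set sizes, and the point where it plateaus suggests how much data is actually needed.

```python
# Sketch: validation score vs. training-set size; X, y and the classifier are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring="accuracy",
)
print(dict(zip(sizes, val_scores.mean(axis=1))))  # look for where the curve flattens
```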
I am working with GPS track files (lists of X and Y coordinates). I have tracks with a high sampling rate and want to downsample the track for easier handling. The obvious way would be to create a new list of points and keep only (for example) every 100th point of the track. The problem is that this could remove important extremes, such as curves. Do you know of algorithms that can downsample the two-dimensional array while keeping …
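A minimal sketch of the Ramer-Douglas-Peucker idea, one common answer to this kind of track simplification: a point is kept only if it deviates from the chord between segment endpoints by more than a tolerance, so curves survive while straight stretches are thinned out. The `epsilon` tolerance and the (N, 2) input layout are assumptions.

```python
# Sketch of Ramer-Douglas-Peucker track simplification (epsilon = max allowed deviation).
import numpy as np

def rdp(points, epsilon):
    """points: (N, 2) array of X/Y coordinates; returns the simplified (M, 2) array."""
    points = np.asarray(points, dtype=float)
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.hypot(chord[0], chord[1])
    if norm == 0:  # degenerate segment: fall back to distance from the start point
        dists = np.hypot(*(points - start).T)
    else:          # perpendicular distance of every point to the chord start -> end
        dists = np.abs(chord[0] * (points[:, 1] - start[1])
                       - chord[1] * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        left = rdp(points[: idx + 1], epsilon)   # recurse on both halves around the
        right = rdp(points[idx:], epsilon)       # most-deviating point, which is kept
        return np.vstack([left[:-1], right])     # drop the duplicated split point
    return np.vstack([start, end])               # nothing deviates enough: keep endpoints only
```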
I have two datasets both of the form from the table below. I am interested in downselecting from dataset A by sampling from the distribution of values from dataset B. However, I want to consider both the Distance and Duration when downselecting such that the distribution of both parameters in my end-product from dataset A matches as best as possible the distribution of these parameters from dataset B. Anyone have suggestions for tools (preferably in python) that would help me …
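One tool-light sketch of how this can be done with pandas/NumPy only, under the assumption that the columns are literally named `Distance` and `Duration` and that the bin count and output size are free parameters: bin both datasets jointly on the two variables, weight each row of A by how over- or under-represented its bin is relative to B, and sample A with those weights.

```python
# Sketch: joint-bin importance resampling so A's (Distance, Duration) distribution tracks B's.
import numpy as np

def downselect(A, B, cols=("Distance", "Duration"), bins=15, n_out=1000, seed=0):
    # shared bin edges spanning both datasets, per column
    edges = [np.histogram_bin_edges(np.r_[A[c], B[c]], bins=bins) for c in cols]
    counts_A = np.histogramdd(A[list(cols)].to_numpy(), bins=edges)[0] + 1e-9
    counts_B = np.histogramdd(B[list(cols)].to_numpy(), bins=edges)[0]
    # locate the joint bin of every row of A
    idx = [np.clip(np.searchsorted(e, A[c].to_numpy(), side="right") - 1, 0, len(e) - 2)
           for c, e in zip(cols, edges)]
    weights = (counts_B / counts_A)[tuple(idx)]   # over-weight the bins that B favours
    # replace=False assumes enough rows of A carry non-zero weight to reach n_out
    return A.sample(n=n_out, weights=weights, replace=False, random_state=seed)
```

Finer bins match the joint distribution more closely but need more data per bin to stay stable.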
Here's the thing: I have imbalanced data and I was thinking about applying a SMOTE transformation. However, when doing that in a sklearn pipeline, I get an error because of missing values. This is my code:

```python
from sklearn.pipeline import Pipeline

# FEATURE SELECTION
categorical_features = ["MARRIED", "RACE"]
continuous_features = ["AGE", "SALARY"]
features = ["MARRIED", "RACE", "AGE", "SALARY"]

# PIPELINE
continuous_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    steps=[
        …
```
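A minimal sketch of one way around the error, reusing the column lists above: since SMOTE cannot handle NaNs, the imputation/encoding ColumnTransformer has to run before SMOTE, and the whole chain needs imbalanced-learn's Pipeline, because sklearn's own Pipeline does not accept samplers. The classifier at the end is only a placeholder.

```python
# Sketch: impute/encode first, then SMOTE, inside an imbalanced-learn Pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer(
    transformers=[
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         continuous_features),
        ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                              OneHotEncoder(handle_unknown="ignore")),
         categorical_features),
    ]
)

model = ImbPipeline(
    steps=[
        ("preprocess", preprocess),        # no NaNs are left after this step
        ("smote", SMOTE(random_state=0)),  # SMOTE now sees only complete numeric data
        ("clf", LogisticRegression(max_iter=1000)),  # placeholder classifier
    ]
)
# model.fit(X_train[features], y_train)
```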
I am trying to analyze a temporal signal sampled by a 2D sensor. Effectively, this means integrating the signal values for each sensor pixel (array row/column coordinate) at the times each pixel is active. Since the start time and duration that each pixel is active are different, I need to slice the signal over different ranges along each row and column.

```python
# Here is the setup for the problem
import numpy as np

def signal(t):
    return np.sin(t / 2) * np.exp(-t / 8)

t = …
```
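A minimal sketch of one way to vectorize this, assuming all pixels share a common time axis `t` and that per-pixel `start` and `duration` arrays exist (both hypothetical here, as is the 4×4 sensor size): broadcast a boolean active-time mask over (row, col, time) and integrate only where it is true.

```python
# Sketch: per-pixel integration of signal(t) over each pixel's own active window.
# signal(t) is the function from the setup above; start/duration are hypothetical arrays.
import numpy as np

t = np.linspace(0, 20, 2001)
rng = np.random.default_rng(0)
start = rng.uniform(0, 5, size=(4, 4))       # per-pixel activation time
duration = rng.uniform(2, 8, size=(4, 4))    # per-pixel active duration

active = (t >= start[..., None]) & (t < (start + duration)[..., None])  # shape (4, 4, T)
dt = t[1] - t[0]
integrated = (signal(t) * active).sum(axis=-1) * dt                     # shape (4, 4)
```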
I am working with a modified generative autoencoder and having issues getting the L2 loss sufficiently low. I think the problem is that, because my data spans a very large range and is standardized to values between zero and one, small discrepancies in the standardized data lead to larger ones in the unstandardized data. Additionally, my other loss terms, despite being averaged over the number of points in the batch, are usually orders of magnitude larger than my L2 loss, which I …
I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled before splitting versus oversampled after the training folds are selected. The oversampling approach I used was random oversampling. I understand that the first approach is wrong, since observations that the model has seen bleed into the test set. I was just curious about how much of a difference this causes. I generated a binary classification dataset with the following:

```python
# Generate binary classification dataset with 5% minority class, …
```
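A minimal sketch of how that comparison is usually set up, with an illustrative 5%-minority dataset and a placeholder classifier: the "wrong" variant oversamples before cross-validation, so duplicated minority rows leak into the validation folds; the "right" variant puts the oversampler inside an imbalanced-learn pipeline, so it is refit on each training fold only.

```python
# Sketch: leaky oversampling (before CV) vs. fold-wise oversampling (inside a pipeline).
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# wrong: copies of validation-fold observations end up in the training folds
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_over, y_over,
                        cv=5, scoring="roc_auc").mean()

# right: the oversampler is refit on each training fold only
pipe = make_pipeline(RandomOverSampler(random_state=0), LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print(leaky, clean)   # the leaky score is typically optimistically inflated
```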
I have doubts about the differences between these three methods and I would like to clarify the following:

- Main differences
- Advantages of one over the other
- Context of use of each method
- etc.

If anyone could help me, I would appreciate it.
I have a log-normal mean and a standard deviation. After I converted them to the underlying normal distribution's parameters mu and sigma, I sampled from the log-normal distribution. However, when I take the mean and standard deviation of this sampled data, I don't get the values I plugged in at first. This only happens when the log-normal mean is much smaller than the log-normal standard deviation; otherwise it works. How do I prevent this from happening and get the input …
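A minimal sketch of the usual moment-matching conversion, with example values standing in for the real mean m and standard deviation s: sigma² = ln(1 + s²/m²) and mu = ln(m) − sigma²/2. When s is much larger than m, sigma is large and the distribution is so heavy-tailed that the sample mean and standard deviation converge very slowly, which is one likely reason the sampled moments don't match the inputs.

```python
# Sketch: desired log-normal mean m and std s -> underlying normal mu, sigma, then a sanity check.
import numpy as np

m, s = 2.0, 10.0                          # example target log-normal mean and std
sigma2 = np.log(1 + (s / m) ** 2)
mu = np.log(m) - sigma2 / 2
samples = np.random.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=1_000_000)
print(samples.mean(), samples.std())      # approaches (m, s) only for very large samples
```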
Let’s say each element in my population has several attributes. Let’s call them A, B, C, D, E, F. Let’s say, for simplicity, each attribute has 10 values (but it could be any number between 2 and 30). Now I want to get a sample such that the distribution is the same across all attributes. So, for example, if about 15% of the whole population has value 1 for attribute A, my sample should have the same share. What should …
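A plain random sample already matches every attribute's distribution in expectation; the sketch below enforces it more tightly with proportionate stratified sampling on the joint combination of attributes. `population` and the attribute column names are hypothetical, and this assumes the joint strata are not too sparse (with many attributes, raking / iterative proportional fitting is a common fallback).

```python
# Sketch: sample the same fraction from every joint stratum of the attributes,
# so every marginal distribution in the sample tracks the population's.
import pandas as pd

def proportionate_sample(population: pd.DataFrame, attrs, frac: float, seed: int = 0):
    return population.groupby(list(attrs)).sample(frac=frac, random_state=seed)

# sample = proportionate_sample(population, ["A", "B", "C", "D", "E", "F"], frac=0.1)
```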
I have a dataset with the classes in my target column distributed as shown below.

```
    counts   percents
6     1507  27.045944
3     1301  23.348887
5      661  11.862886
4      588  10.552764
7      564  10.122039
8      432   7.753051
1      416   7.465901
2       61   1.094760
9       38   0.681981
10       4   0.071788
```

I would like to undersample my data and include at most 588 samples per class, so that classes 6, 3 & 5 only have ~588 samples available after undersampling. Here's …
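A minimal sketch using imbalanced-learn's RandomUnderSampler, assuming a feature matrix `X` and the target column `y` (both hypothetical): passing a dict caps only the listed classes at 588 and leaves the remaining classes untouched.

```python
# Sketch: cap the three largest classes at 588 samples; the other classes are not resampled.
from imblearn.under_sampling import RandomUnderSampler

cap = 588
rus = RandomUnderSampler(sampling_strategy={6: cap, 3: cap, 5: cap}, random_state=42)
X_res, y_res = rus.fit_resample(X, y)
```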
I'm dealing with a classic case of a dataset with a binary imbalanced target (event 3%, non-event 97%). My idea is to apply some sort of sampling (over/under, SMOTE etc.) to address the issue. As I see it, the correct way of doing this is to sample ONLY the train set, in order to have a test performance that is more similar to reality. Moreover, I want to use CV for hyperparameter tuning. So, the tasks in order are: divide the dataset into …
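A minimal sketch of that ordering, assuming a feature matrix `X` and binary target `y` (both hypothetical) plus a placeholder estimator and grid: the sampler lives inside imbalanced-learn's Pipeline, so during GridSearchCV it is refit on each training fold only and the held-out test set is never resampled.

```python
# Sketch: split -> (SMOTE + model) pipeline -> CV tuning on train only -> untouched test set.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),                  # applied to the training folds only
    ("clf", RandomForestClassifier(random_state=0)),   # placeholder model
])
grid = GridSearchCV(pipe, {"clf__max_depth": [3, 5, None]}, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)                             # resampling happens inside each fold
print(grid.best_params_, grid.score(X_test, y_test))   # realistic test performance
```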
I have a dataset containing 42 instances (X) and one target Y on which I want to perform LASSO regression. All values are continuous and numerical. As the sample size is small, I wish to extend it. I am somewhat aware of algorithms like SMOTE used for extending imbalanced datasets. Is there anything available for my case, where there is no imbalance?
I have a csv data file for binary classification. I divided it into 5 files and tried to apply stratification so that the class label has the same proportion in all the files, but I am getting the error ValueError: Found input variables with inconsistent numbers of samples, even though the total number of rows is divisible by 5. I think the splitter takes a pandas data frame as input, and I am asking it to stratify by a specific column. The …
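A minimal sketch of one way to get five stratified chunks, with the file name and label column as placeholders: StratifiedKFold's five test folds partition the rows into five pieces with near-identical class proportions, and the row count does not have to divide evenly.

```python
# Sketch: split a CSV into 5 stratified parts using StratifiedKFold's test folds.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("data.csv")                       # placeholder file name
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (_, part_idx) in enumerate(skf.split(df, df["label"])):   # "label" is a placeholder column
    df.iloc[part_idx].to_csv(f"part_{i}.csv", index=False)
```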
I have a dataset from an operating process with 5 measurements and 1 outcome. All values are normally distributed. When I train a regression model on the dataset, it performs well on the majority of the dataset, i.e. the default operating condition of the process. It performs much worse, though, on non-default operating conditions, with values distant from the mean. If it were a classification problem I would treat this as class imbalance and perform some resampling technique to …
I have in my hands 3 different time series which model 3 different scenarios (base, downside, upside). Each of these time series depends on a set of 11 different attributes, which take values over different time intervals. Most of the input features are highly correlated. There is also a (cdf) probability function which defines how probable every scenario is (every quintile), for every point in time. In my case, I want to create more input data based …
I have a dataset which contains two overlapping distributions/classes of points. I have been trying to sample from just one of these distributions/classes using the scikit-learn KernelDensity class, but I am finding this does not work well in overlapping regions. Is there a way to do this sort of KDE sampling that also accounts for (or avoids) areas where the two distributions overlap? Ideally I would like to sample more often in non-overlapping areas or, when this is not …
I tried to do stratified sampling by way of train_test_split in order to save myself some trouble later. So I wrote the following lines:

```python
from sklearn.model_selection import train_test_split

X = data_df
y = data_df.pop('class')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.125, stratify=y)
```

I got the error: ValueError: Input contains NaN. Any help is welcome!
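A minimal sketch of one way past the error, under the assumption that the NaNs sit in the 'class' column; if they are in the feature columns instead, imputing them (e.g. with SimpleImputer) before or after the split is the analogous fix.

```python
# Sketch: rows without a class label cannot be stratified, so drop them before splitting.
from sklearn.model_selection import train_test_split

data_df = data_df.dropna(subset=["class"])
X = data_df.drop(columns=["class"])
y = data_df["class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.125, stratify=y, random_state=0)
```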
I want to generate a random sample based on this probability distribution: The line is the KDE of the histogram. My random sample will have n values; each value is a number of points. Each of the n values generates an amount of points p that must be distributed among the population, so I must distribute a total of n * p points. The distribution of points must follow the probability distribution above. How should I generate a random sample …
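A minimal sketch of one way to do the drawing itself, assuming `observed` stands in for the data behind the histogram: fit a Gaussian KDE to those values and resample from it, so the n drawn values follow the plotted density.

```python
# Sketch: draw n values from a Gaussian KDE fitted to the observed data.
import numpy as np
from scipy.stats import gaussian_kde

kde = gaussian_kde(observed)                    # 'observed' is the data behind the histogram
n = 50                                          # number of values in the random sample
sample = kde.resample(n, seed=0).ravel()        # one draw per value, following the KDE
sample = np.clip(sample, observed.min(), observed.max())   # optional: stay within the data range
```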
I have the following problem. From a technical model we have a function $f(n,p)$ approximating its runtime. The function $f$ maps $$ f: \mathbb{N} \times \mathbb{P} \to \mathbb{R}_{+}, $$ where $\mathbb{P} = \{1,\ldots,50\} \subset \mathbb{N}$. Here $n$ defines the amount of input and $p$ is a parameter of the process, which has a continuous influence on the runtime. For a given $n$, we are interested in the value of $p$ that optimizes $f(n,p)$. When running the experiment, some $n$, like …