I am training a machine learning model (i.e., a classifier) on a large dataset. I know that I can get the same results using less data (about 30% of it), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain from the dataset. Of course I can create a script which automatically tries different thresholds, but I was wondering if there is any principled way of doing this. It seems strange that …
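A minimal sketch of one principled option, a learning curve, assuming a feature matrix `X`, labels `y`, and a placeholder classifier: the validation score is computed for increasing training-set sizes, and the point where it plateaus suggests how much data is actually needed.

```python
# Sketch: validation score vs. training-set size; X, y and the classifier are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring="accuracy",
)
print(dict(zip(sizes, val_scores.mean(axis=1))))  # look for where the curve flattens
```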
I am working with GPS track files (lists of X and Y coordinates). I have tracks with a high sampling rate and want to downsample the track for easier handling. The obvious way would be to create a new list of points and keep only (for example) every 100th point of the track. The problem is that this could remove important extremes, such as curves. Do you know of algorithms that can downsample the two-dimensional array while keeping …
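A minimal sketch of the Ramer-Douglas-Peucker idea, one common answer to this kind of track simplification: a point is kept only if it deviates from the chord between segment endpoints by more than a tolerance, so curves survive while straight stretches are thinned out. The `epsilon` tolerance and the (N, 2) input layout are assumptions.

```python
# Sketch of Ramer-Douglas-Peucker track simplification (epsilon = max allowed deviation).
import numpy as np

def rdp(points, epsilon):
    """points: (N, 2) array of X/Y coordinates; returns the simplified (M, 2) array."""
    points = np.asarray(points, dtype=float)
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.hypot(chord[0], chord[1])
    if norm == 0:  # degenerate segment: fall back to distance from the start point
        dists = np.hypot(*(points - start).T)
    else:          # perpendicular distance of every point to the chord start -> end
        dists = np.abs(chord[0] * (points[:, 1] - start[1])
                       - chord[1] * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        left = rdp(points[: idx + 1], epsilon)   # recurse on both halves around the
        right = rdp(points[idx:], epsilon)       # most-deviating point, which is kept
        return np.vstack([left[:-1], right])     # drop the duplicated split point
    return np.vstack([start, end])               # nothing deviates enough: keep endpoints only
```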
I have two datasets both of the form from the table below. I am interested in downselecting from dataset A by sampling from the distribution of values from dataset B. However, I want to consider both the Distance and Duration when downselecting such that the distribution of both parameters in my end-product from dataset A matches as best as possible the distribution of these parameters from dataset B. Anyone have suggestions for tools (preferably in python) that would help me …
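One tool-light sketch of how this can be done with pandas/NumPy only, under the assumption that the columns are literally named `Distance` and `Duration` and that the bin count and output size are free parameters: bin both datasets jointly on the two variables, weight each row of A by how over- or under-represented its bin is relative to B, and sample A with those weights.

```python
# Sketch: joint-bin importance resampling so A's (Distance, Duration) distribution tracks B's.
import numpy as np

def downselect(A, B, cols=("Distance", "Duration"), bins=15, n_out=1000, seed=0):
    # shared bin edges spanning both datasets, per column
    edges = [np.histogram_bin_edges(np.r_[A[c], B[c]], bins=bins) for c in cols]
    counts_A = np.histogramdd(A[list(cols)].to_numpy(), bins=edges)[0] + 1e-9
    counts_B = np.histogramdd(B[list(cols)].to_numpy(), bins=edges)[0]
    # locate the joint bin of every row of A
    idx = [np.clip(np.searchsorted(e, A[c].to_numpy(), side="right") - 1, 0, len(e) - 2)
           for c, e in zip(cols, edges)]
    weights = (counts_B / counts_A)[tuple(idx)]   # over-weight the bins that B favours
    # replace=False assumes enough rows of A carry non-zero weight to reach n_out
    return A.sample(n=n_out, weights=weights, replace=False, random_state=seed)
```

Finer bins match the joint distribution more closely but need more data per bin to stay stable.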
Here's the thing: I have imbalanced data and I was thinking about applying a SMOTE transformation. However, when doing that in a sklearn pipeline, I get an error because of missing values. This is my code:

```python
from sklearn.pipeline import Pipeline

# FEATURE SELECTION
categorical_features = ["MARRIED", "RACE"]
continuous_features = ["AGE", "SALARY"]
features = ["MARRIED", "RACE", "AGE", "SALARY"]

# PIPELINE
continuous_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    steps=[
        …
```
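A minimal sketch of one way around the error, reusing the column lists above: since SMOTE cannot handle NaNs, the imputation/encoding ColumnTransformer has to run before SMOTE, and the whole chain needs imbalanced-learn's Pipeline, because sklearn's own Pipeline does not accept samplers. The classifier at the end is only a placeholder.

```python
# Sketch: impute/encode first, then SMOTE, inside an imbalanced-learn Pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer(
    transformers=[
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         continuous_features),
        ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                              OneHotEncoder(handle_unknown="ignore")),
         categorical_features),
    ]
)

model = ImbPipeline(
    steps=[
        ("preprocess", preprocess),        # no NaNs are left after this step
        ("smote", SMOTE(random_state=0)),  # SMOTE now sees only complete numeric data
        ("clf", LogisticRegression(max_iter=1000)),  # placeholder classifier
    ]
)
# model.fit(X_train[features], y_train)
```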
I am trying to analyze a temporal signal sampled by a 2D sensor. Effectively, this means integrating the signal values for each sensor pixel (array row/column coordinate) at the times each pixel is active. Since the start time and duration that each pixel is active are different, I need to slice the signal over different ranges along each row and column.

```python
# Here is the setup for the problem
import numpy as np

def signal(t):
    return np.sin(t / 2) * np.exp(-t / 8)

t = …
```
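A minimal sketch of one way to vectorize this, assuming all pixels share a common time axis `t` and that per-pixel `start` and `duration` arrays exist (both hypothetical here, as is the 4×4 sensor size): broadcast a boolean active-time mask over (row, col, time) and integrate only where it is true.

```python
# Sketch: per-pixel integration of signal(t) over each pixel's own active window.
# signal(t) is the function from the setup above; start/duration are hypothetical arrays.
import numpy as np

t = np.linspace(0, 20, 2001)
rng = np.random.default_rng(0)
start = rng.uniform(0, 5, size=(4, 4))       # per-pixel activation time
duration = rng.uniform(2, 8, size=(4, 4))    # per-pixel active duration

active = (t >= start[..., None]) & (t < (start + duration)[..., None])  # shape (4, 4, T)
dt = t[1] - t[0]
integrated = (signal(t) * active).sum(axis=-1) * dt                     # shape (4, 4)
```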
I am working with a modified generative autoencoder and having issues getting the L2 loss sufficiently low. I think the problem is that, because my data spans a very large range and is standardized to values between zero and one, small discrepancies in the standardized data lead to larger ones in the unstandardized data. Additionally, my other loss terms, despite being averaged over the number of points in the batch, are usually orders of magnitude larger than my L2 loss, which I …
I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled before splitting versus oversampled after the training folds are selected. The oversampling approach I used was random oversampling. I understand that the first approach is wrong, since observations that the model has seen bleed into the test set. I was just curious about how much of a difference this causes. I generated a binary classification dataset with the following:

```python
# Generate binary classification dataset with 5% minority class, …
```
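A minimal sketch of how that comparison is usually set up, with an illustrative 5%-minority dataset and a placeholder classifier: the "wrong" variant oversamples before cross-validation, so duplicated minority rows leak into the validation folds; the "right" variant puts the oversampler inside an imbalanced-learn pipeline, so it is refit on each training fold only.

```python
# Sketch: leaky oversampling (before CV) vs. fold-wise oversampling (inside a pipeline).
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# wrong: copies of validation-fold observations end up in the training folds
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_over, y_over,
                        cv=5, scoring="roc_auc").mean()

# right: the oversampler is refit on each training fold only
pipe = make_pipeline(RandomOverSampler(random_state=0), LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print(leaky, clean)   # the leaky score is typically optimistically inflated
```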
I have doubts about the differences between these three methods and I would like to clarify the following:

- Main differences
- Advantages of one over the other
- Context of use of each method
- etc.

If anyone could help me, I would appreciate it.
I have a log-normal mean and a standard deviation. After I converted them to the underlying normal distribution's parameters mu and sigma, I sampled from the log-normal distribution. However, when I take the mean and standard deviation of this sampled data, I don't get the values I plugged in at first. This only happens when the log-normal mean is much smaller than the log-normal standard deviation; otherwise it works. How do I prevent this from happening and get the input …
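A minimal sketch of the usual moment-matching conversion, with example values standing in for the real mean m and standard deviation s: sigma² = ln(1 + s²/m²) and mu = ln(m) − sigma²/2. When s is much larger than m, sigma is large and the distribution is so heavy-tailed that the sample mean and standard deviation converge very slowly, which is one likely reason the sampled moments don't match the inputs.

```python
# Sketch: desired log-normal mean m and std s -> underlying normal mu, sigma, then a sanity check.
import numpy as np

m, s = 2.0, 10.0                          # example target log-normal mean and std
sigma2 = np.log(1 + (s / m) ** 2)
mu = np.log(m) - sigma2 / 2
samples = np.random.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=1_000_000)
print(samples.mean(), samples.std())      # approaches (m, s) only for very large samples
```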
Let’s say each element in my population has several attributes. Let’s call them A, B, C, D, E, F. Let’s say, for simplicity, each attribute has 10 values (but it could be any number between 2 and 30). Now I want to get a sample such that the distribution is the same across all attributes. So, for example, if about 15% of the whole population has value 1 for attribute A, my sample should have the same share. What should …
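A plain random sample already matches every attribute's distribution in expectation; the sketch below enforces it more tightly with proportionate stratified sampling on the joint combination of attributes. `population` and the attribute column names are hypothetical, and this assumes the joint strata are not too sparse (with many attributes, raking / iterative proportional fitting is a common fallback).

```python
# Sketch: sample the same fraction from every joint stratum of the attributes,
# so every marginal distribution in the sample tracks the population's.
import pandas as pd

def proportionate_sample(population: pd.DataFrame, attrs, frac: float, seed: int = 0):
    return population.groupby(list(attrs)).sample(frac=frac, random_state=seed)

# sample = proportionate_sample(population, ["A", "B", "C", "D", "E", "F"], frac=0.1)
```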
I have a dataset with the classes in my target column distributed as shown below.

```
    counts   percents
6     1507  27.045944
3     1301  23.348887
5      661  11.862886
4      588  10.552764
7      564  10.122039
8      432   7.753051
1      416   7.465901
2       61   1.094760
9       38   0.681981
10       4   0.071788
```

I would like to undersample my data and include at most 588 samples per class, so that classes 6, 3 & 5 only have ~588 samples available after undersampling. Here's …
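A minimal sketch using imbalanced-learn's RandomUnderSampler, assuming a feature matrix `X` and the target column `y` (both hypothetical): passing a dict caps only the listed classes at 588 and leaves the remaining classes untouched.

```python
# Sketch: cap the three largest classes at 588 samples; the other classes are not resampled.
from imblearn.under_sampling import RandomUnderSampler

cap = 588
rus = RandomUnderSampler(sampling_strategy={6: cap, 3: cap, 5: cap}, random_state=42)
X_res, y_res = rus.fit_resample(X, y)
```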
I'm dealing with a classic case of a dataset with a binary imbalanced target (event 3%, non-event 97%). My idea is to apply some sort of sampling (over/under, SMOTE etc.) to address the issue. As I see it, the correct way of doing this is to sample ONLY the train set, in order to have a test performance that is more similar to reality. Moreover, I want to use CV for hyperparameter tuning. So, the tasks in order are: divide the dataset into …
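A minimal sketch of that ordering, assuming a feature matrix `X` and binary target `y` (both hypothetical) plus a placeholder estimator and grid: the sampler lives inside imbalanced-learn's Pipeline, so during GridSearchCV it is refit on each training fold only and the held-out test set is never resampled.

```python
# Sketch: split -> (SMOTE + model) pipeline -> CV tuning on train only -> untouched test set.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),                  # applied to the training folds only
    ("clf", RandomForestClassifier(random_state=0)),   # placeholder model
])
grid = GridSearchCV(pipe, {"clf__max_depth": [3, 5, None]}, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)                             # resampling happens inside each fold
print(grid.best_params_, grid.score(X_test, y_test))   # realistic test performance
```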
I have a dataset containing 42 instances (X) and one target Y on which I want to perform LASSO regression. All values are continuous and numerical. As the sample size is small, I wish to extend it. I am somewhat aware of algorithms like SMOTE used for extending imbalanced datasets. Is there anything available for my case, where there is no imbalance?
I have a csv data file for binary classification. I divided it into 5 files and tried to apply stratification so that the class label has the same proportion in all the files, but I am getting the error ValueError: Found input variables with inconsistent numbers of samples, even though the total number of rows is divisible by 5. I think the splitter takes a pandas data frame as input, and I am asking it to stratify by a specific column. The …
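A minimal sketch of one way to get five stratified chunks, with the file name and label column as placeholders: StratifiedKFold's five test folds partition the rows into five pieces with near-identical class proportions, and the row count does not have to divide evenly.

```python
# Sketch: split a CSV into 5 stratified parts using StratifiedKFold's test folds.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("data.csv")                       # placeholder file name
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (_, part_idx) in enumerate(skf.split(df, df["label"])):   # "label" is a placeholder column
    df.iloc[part_idx].to_csv(f"part_{i}.csv", index=False)
```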
I have a dataset from an operating process with 5 measurements and 1 outcome. All values are normally distributed. When I train a regression model on the dataset, it performs well on the majority of the dataset, i.e. the default operating condition of the process. It performs much worse, though, on non-default operating conditions, with values distant from the mean. If it were a classification problem I would treat this as class imbalance and perform some resampling technique to …
I have in my hands 3 different time series which model 3 different scenarios (base, downside, upside). Each of these time series depends on a set of 11 different attributes, which take values over different time intervals. Most of the input features are highly correlated. There is also a (cdf) probability function which defines how probable every scenario is (every quintile), for every point in time. In my case, I want to create more input data based …
I have a dataset which contains two overlapping distributions/classes of points. I have been trying to sample from just one of these distributions/classes using the scikit-learn KernelDensity class, but I am finding this does not work well in overlapping regions. Is there a way to do this sort of KDE sampling that also accounts for (or avoids) areas where the two distributions overlap? Ideally I would like to sample more often in non-overlapping areas or, when this is not …
I tried to do stratified sampling by way of train_test_split in order to save myself some trouble later. So I wrote the following lines:

```python
from sklearn.model_selection import train_test_split

X = data_df
y = data_df.pop('class')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.125, stratify=y)
```

I got the error: ValueError: Input contains NaN. Any help is welcome!
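A minimal sketch of one way past the error, under the assumption that the NaNs sit in the 'class' column; if they are in the feature columns instead, imputing them (e.g. with SimpleImputer) before or after the split is the analogous fix.

```python
# Sketch: rows without a class label cannot be stratified, so drop them before splitting.
from sklearn.model_selection import train_test_split

data_df = data_df.dropna(subset=["class"])
X = data_df.drop(columns=["class"])
y = data_df["class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.125, stratify=y, random_state=0)
```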
I want to generate a random sample based on this probability distribution: The line is the KDE of the histogram. My random sample will have n values; each value is a number of points. Each of the n values generates an amount of points p that must be distributed among the population, so I must distribute a total of n * p points. The distribution of points must follow the probability distribution above. How should I generate a random sample …
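A minimal sketch of one way to do the drawing itself, assuming `observed` stands in for the data behind the histogram: fit a Gaussian KDE to those values and resample from it, so the n drawn values follow the plotted density.

```python
# Sketch: draw n values from a Gaussian KDE fitted to the observed data.
import numpy as np
from scipy.stats import gaussian_kde

kde = gaussian_kde(observed)                    # 'observed' is the data behind the histogram
n = 50                                          # number of values in the random sample
sample = kde.resample(n, seed=0).ravel()        # one draw per value, following the KDE
sample = np.clip(sample, observed.min(), observed.max())   # optional: stay within the data range
```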
I have the following problem. From a technical model we have a function $f(n,p)$ approximating its runtime. The function $f$ maps $$ f: \mathbb{N} \times \mathbb{P} \to \mathbb{R}_{+}, $$ where $\mathbb{P} = \{1,\ldots,50\} \subset \mathbb{N}$. Here $n$ defines the amount of input and $p$ is a parameter of the process, which has a continuous influence on the runtime. For a given $n$, we are interested in the value of $p$ that optimizes $f(n,p)$. When running the experiment, some $n$, like …