Removing outliers from a multi-dimensional dataset & Data augmentation
Outliers in one-dimensional data can easily be removed by dropping the points that fall outside the IQR-based range. But how should outliers be detected and removed when the dataset is composed of multiple dimensions?
Here's my approach: the dataset consists of seven different dimensions. Illustrated as a dataframe, there are seven columns, and each row describes the properties of a single sample.
I looped through each column and removed every row containing a value outside that column's IQR range. Since the seven columns are grouped per row, I reasoned that filtering column by column would leave a dataframe whose values all lie within the IQR range, even though the overall quantity of data may be reduced.
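As a minimal sketch of this column-by-column IQR filter, assuming the data lives in a pandas DataFrame (the function and column names below are my own, for illustration):

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop any row with a value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in any column."""
    mask = pd.Series(True, index=df.index)
    for col in df.columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        # A row survives only if it is inside the range for *every* column.
        mask &= df[col].between(lower, upper)
    return df[mask]

# Toy example with one obvious outlier row (a = 100).
df = pd.DataFrame({"a": [1, 2, 3, 100], "b": [10, 11, 12, 13]})
clean = remove_iqr_outliers(df)
```

Note that this filters on all columns jointly via a single boolean mask, so a row is dropped if it is an outlier in any one of the seven dimensions.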
Now I would like to ask about data augmentation. While dedicated libraries exist for augmenting image datasets by randomly transforming the images, it is hard to find an established approach for augmenting numerical data, so I devised my own.
After removing the outliers from the dataset, I fitted a polynomial regression function to find the relationship between the target data (the values to be predicted) and each individual feature (the values used to train the model), i.e., how a given target value relates to a given feature value.
Then I randomly generated target values lying within the IQR range and used the previously fitted regression relationship to derive the corresponding feature values. Using this augmented dataset, I trained the model and made predictions.
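The augmentation scheme above could be sketched as follows, assuming NumPy and a per-feature polynomial fit; the synthetic data, polynomial degree, and variable names here are illustrative assumptions, not the original code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the cleaned dataset: target vector y, feature matrix X.
y = rng.uniform(0.0, 10.0, size=50)
X = np.column_stack([
    2.0 * y + rng.normal(0.0, 0.5, 50),      # feature roughly linear in y
    0.5 * y**2 + rng.normal(0.0, 0.5, 50),   # feature roughly quadratic in y
])

# Fit one polynomial per feature: feature_j ≈ p_j(target).
degree = 2
coeffs = [np.polyfit(y, X[:, j], degree) for j in range(X.shape[1])]

# Sample synthetic targets within the IQR of the original targets.
q1, q3 = np.percentile(y, [25, 75])
y_aug = rng.uniform(q1, q3, size=200)

# Derive the corresponding feature values from the fitted polynomials.
X_aug = np.column_stack([np.polyval(c, y_aug) for c in coeffs])
```

One consequence visible in this sketch is the first flaw below: `np.polyval` is unconstrained, so nothing keeps the derived feature values inside their own IQR ranges.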
There are two main flaws in the data augmentation approach I made.
- Although the randomly generated target values lie within the IQR range, the feature values derived from the regression may fall outside it, which defeats the purpose of the outlier elimination. I realized this but decided to continue, since removing the outliers still helps recover a more accurate regression function relating the target to each feature.
- The original dataset contains only a small quantity of training data, so the augmented dataset consists mostly of artificial samples, and training a model centered on artificial data may lead to unwanted results. Indeed, compared with the model trained on the original dataset, the model trained on the augmented dataset showed no significant improvement in performance.
It remains an open question whether my approach was sound.
TL;DR: I have two questions regarding data preprocessing.
- How should outliers be detected and removed in a dataset containing multiple dimensions of data?
- Is there a validated approach for numerical data augmentation, analogous to the ImageDataGenerator used for augmenting image data?
Topic data-augmentation data outlier dataset
Category Data Science