Removing outliers from a multi-dimensional dataset & Data augmentation

Removing outliers from single-dimensional data is straightforward: drop the points that fall outside the IQR range. But how should outliers be detected and removed when the dataset consists of multiple dimensions?

Here's my approach. The dataset consists of seven different dimensions; as a dataframe, it has seven columns, with each row describing the properties of a single sample.

I looped through each column and removed every row containing a value outside that column's IQR range. Since the seven columns are grouped per row, I reasoned that looping over the columns and dropping outliers this way would leave a dataframe whose values all lie within the IQR range, even if its overall size is reduced.
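A minimal sketch of that column-wise filter, assuming the data sits in a pandas DataFrame named `df` and using the usual 1.5 × IQR fences (both assumptions on my part):

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every row that falls outside the 1.5*IQR fences in any column."""
    keep = pd.Series(True, index=df.index)
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[keep]
```

Note that a row only has to be extreme in one column to be dropped, so in higher dimensions this filter can discard a large share of the data.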

Now I would like to ask about data augmentation. Dedicated libraries exist for augmenting image datasets by randomly transforming the images, but it is hard to find an equivalent approach for numerical data, so I devised my own.

After removing the outliers from the dataset, I fitted a polynomial regression between the target (the value to be predicted) and each individual feature (the values used to train the model), to capture how a given target value relates to each feature value.
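For illustration, fitting one polynomial per feature could look like this (a sketch, not my exact code; `df`, the column name `target`, and the degree are assumptions):

```python
import numpy as np

# One polynomial per feature, mapping target value -> feature value.
# df and the column name "target" are hypothetical.
degree = 2
coeffs = {
    col: np.polyfit(df["target"], df[col], deg=degree)
    for col in df.columns.drop("target")
}
```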

Then I randomly generated target values lying within the IQR range and used the previously fitted regression relationships to derive the corresponding feature values. Using the augmented dataset, I trained the model and made predictions.
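Continuing the sketch above, the generation step might look like the following (sampling uniformly between Q1 and Q3 is my reading of "within the IQR range"):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Sample synthetic target values between Q1 and Q3 of the real target.
q1, q3 = df["target"].quantile([0.25, 0.75])
new_targets = rng.uniform(q1, q3, size=500)

# Derive the matching feature values from the fitted polynomials (coeffs
# comes from the previous sketch), then append to the original data.
augmented = pd.DataFrame({"target": new_targets})
for col, c in coeffs.items():
    augmented[col] = np.polyval(c, new_targets)

training_data = pd.concat([df, augmented], ignore_index=True)
```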

There are two main flaws in the data augmentation approach I devised.

  • Although the randomly generated target values lie within the IQR range, the feature values derived from the regression may fall outside it, which makes the earlier outlier-elimination step somewhat pointless. I realized this but decided to continue, since removing the outliers still helps in fitting a more accurate regression function between the target and each feature.
  • The original dataset does not contain much data for training, so the augmented dataset consists mostly of artificial samples, and training a model centered on artificial data may lead to unwanted results. Indeed, compared to the model trained on the original dataset, the model trained on the augmented dataset showed no significant improvement in performance.

It remains an open question whether my approach was sound.

TL;DR: I have two questions regarding data preprocessing.

  • How should outliers be detected and removed in a dataset containing data of multiple dimensions?
  • Is there a validated approach for augmenting numerical data, analogous to ImageDataGenerator for image data?



Removing outliers in a high-dimensional setting can, for example, be done after dimension reduction by principal component analysis (PCA). In the dimension-reduced space, boxplots (1 dimension), bagplots (2 dimensions), or gemplots (3 dimensions) can be applied to detect outliers. For details, see Kruppa, J., & Jung, K. (2017). Automated multigroup outlier identification in molecular high-throughput data using bagplots and gemplots. BMC Bioinformatics, 18(1), 1-10.
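As a rough illustration of the PCA step (not the paper's exact method; `X` is a hypothetical numeric feature matrix, and simple 1.5 × IQR fences stand in for a proper bagplot):

```python
import numpy as np
from sklearn.decomposition import PCA

# Project onto the first two principal components.
scores = PCA(n_components=2).fit_transform(X)

# Flag points outside the 1.5*IQR fences in either component.
q1, q3 = np.percentile(scores, [25, 75], axis=0)
iqr = q3 - q1
inlier = np.all((scores >= q1 - 1.5 * iqr) & (scores <= q3 + 1.5 * iqr), axis=1)
outlier_idx = np.flatnonzero(~inlier)
```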


Why do you want to remove outliers? Do you think they are wrong data? Do you think they have an outsized impact on the model? Are they rows the model should get right or wrong, so you would move them to the validation set? Other reasons? Know why you want to identify outliers, then choose the appropriate method. I think it is better to find what may be outliers and then examine them to decide on the appropriate treatment. I also plot the data to get a visual view of the outliers. Just because a row has large values does not mean it is "wrong".

Here is a thread that discusses this. Just because you can remove outliers does not mean you should; know your reason for removal. If I remove rows, I move them to their own validation set so I can test them against my trained model.

But to answer the question: one way to identify multi-dimensional outliers is with random forests and proximity matrices. Random forests can be used as exploratory data analysis. In this case, fit a random forest, build a proximity matrix, then analyze which records often sit alone in the leaf nodes or with different partners all the time. This shows which records are different enough that a model treats them as different.
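A sketch of that idea with scikit-learn (`X` and `y` are placeholders; the proximity definition, i.e. the fraction of trees in which two rows share a leaf, follows Breiman's original usage):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit a forest and record which leaf each row reaches in each tree.
forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
leaves = forest.apply(X)                      # shape (n_samples, n_trees)

# proximity[i, j] = fraction of trees where rows i and j share a leaf.
# O(n^2 * n_trees) memory/time, so suitable for small datasets only.
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Rows with low average proximity to all other rows are candidate outliers.
n = len(leaves)
avg_proximity = (proximity.sum(axis=1) - 1.0) / (n - 1)  # exclude self-match
candidates = np.argsort(avg_proximity)[:10]
```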

A search on Cross Validated turns up many other outlier detection methods.
