Confusion on Outliers

I cannot decide how to identify outliers: when should I go with the standard deviation and when with the median? My understanding of the standard deviation rule is that a data point more than 2 standard deviations away from the mean is considered an outlier. Similarly, for the median, we say that any data point that does not lie between Q1 and Q3 is an outlier. So I am confused as to which one to choose. …
Category: Data Science
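
A minimal sketch of the two rules this question contrasts, assuming NumPy and a made-up sample (the array `data` and both helper names are illustrative and do not come from the question). The quartile rule is usually applied with Tukey's fences, that is, points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], rather than everything outside the Q1 to Q3 range, which would flag roughly half of any data set.

```python
import numpy as np


def sigma_outliers(x, k=2.0):
    """Flag points more than k standard deviations away from the mean."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) > k * x.std()


def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Hypothetical sample: eight ordinary values and one extreme value.
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 95])

sigma_flags = sigma_outliers(data)
iqr_flags = iqr_outliers(data)

print("2-sigma rule flags:", data[sigma_flags])
print("IQR rule flags:    ", data[iqr_flags])
```

On this sample both rules flag only the value 95. The mean and standard deviation are themselves pulled by extreme values, so the quartile-based rule tends to be the safer choice when the data are skewed or contain several large outliers.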

What is the most effective unsupervised ML algorithm to use when outliers are present in the data set?

I am analyzing a portfolio of about 225 stocks and have gotten data for each of them based on their "Price/Earnings ratio", "Return on Assets", and "Earnings per share growth". I would like to cluster these stocks based on their attributes into 3 or 4 groups. However, there are substantial outliers in the data set. Instead of removing them altogether I would like to keep them in. What ML algorithm would be best suited for this? I have been told …
Category: Data Science

What can help decrease outliers' influence on non-tree models?

I have a feature with all the values between 0 and 1 except for a few outliers larger than 1. I am trying to collect all the methods that can help to decrease outliers' influence on non-tree models: StandardScaler; applying a rank transform to the features; applying an np.log1p(x) transform to the data; MinMaxScaler; winsorization. I wasn't able to think of any others ... I guess that's all?
Category: Data Science
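
A short sketch of three of the transforms listed in this question (rank transform, log1p, winsorization), assuming NumPy only. The sample array and the 10th/90th percentile limits are arbitrary choices made for illustration. StandardScaler and MinMaxScaler are left out because they are linear rescalings, so on their own they do not change how far an outlier sits from the bulk of the data.

```python
import numpy as np

# Hypothetical feature: mostly in [0, 1] with two large outliers.
x = np.array([0.1, 0.4, 0.2, 0.9, 0.5, 0.3, 25.0, 0.7, 0.6, 120.0])

# Rank transform: replace each value by its rank, scaled to [0, 1].
# (argsort of argsort gives ordinal ranks; ties are not averaged.)
ranks = np.argsort(np.argsort(x))
rank_scaled = ranks / (len(x) - 1)

# log1p compresses large values; it is valid here because x >= 0.
logged = np.log1p(x)

# Winsorization: clip values to chosen lower and upper percentiles.
lo, hi = np.percentile(x, [10, 90])
winsorized = np.clip(x, lo, hi)

print("rank-scaled:", np.round(rank_scaled, 2))
print("log1p:      ", np.round(logged, 2))
print("winsorized: ", np.round(winsorized, 2))
```

The rank transform discards distances entirely, log1p keeps the ordering while shrinking the gaps, and winsorization leaves inliers untouched. Which trade-off is acceptable depends on the model and the feature.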

Anomaly Detection

I have a problem where I want to identify vendors with unusually high invoice amounts. What would be the best way to identify such invoices? I am trying to use Isolation Forest but am having trouble grouping the result by vendor. Any help will be appreciated. The data is in the format below.

Vendor ID  Amount
1          456
2          1000
1          489
3          896
2          4576
Category: Data Science
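
One possible way to score invoices per vendor, sketched with pandas. This swaps in a simpler technique than the Isolation Forest mentioned in the question: a robust z-score based on the median and the median absolute deviation (MAD), computed within each vendor. The invoice values, the column names and the 3.5 cutoff are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical invoices: two vendors, each with one unusually high amount.
invoices = pd.DataFrame({
    "vendor_id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "amount": [456, 489, 470, 5200, 462, 1000, 1100, 950, 4576, 1020],
})


def robust_z(amounts):
    """Robust z-score: distance from the median in units of scaled MAD."""
    median = amounts.median()
    mad = (amounts - median).abs().median()
    # 0.6745 makes the score comparable to a z-score for normal data.
    # A MAD of zero (all values identical) would need separate handling.
    return 0.6745 * (amounts - median) / mad


# transform() applies the function per vendor and keeps the row alignment.
invoices["robust_z"] = invoices.groupby("vendor_id")["amount"].transform(robust_z)

# Only high amounts are of interest, so the test is one-sided.
invoices["unusual"] = invoices["robust_z"] > 3.5
flagged = invoices[invoices["unusual"]]
print(flagged)
```

Because the score is computed inside each vendor group, an amount that is normal for one vendor can still be flagged for another. Vendors with very few invoices will not give a meaningful MAD.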

Algorithm suggestion for anomaly detection in multivariate time series data

I have time series data containing user actions at certain time intervals, e.g.:

Date              UserId  Directory  Operation     Result
01/01/2017 99:00  user1   dir1       created_file  success
01/01/2017 99:00  user3   dir10      deleted_file  permission_denied

There are more than 10K unique user IDs, 10 distinct operations and 4 distinct results. I need to perform anomaly detection on user behavior in real time. Any suggestions on which method I should use? The anomaly detection needs to flag whether some user operations are outliers. A very small subset of input data will …
Category: Data Science

Sensitivity analysis in outlier explanation

I am trying to explain outliers using sensitivity analysis. Let's consider that my dataset contains 19 different input values and 1 output value (so there are 20 columns overall, and all values are numerical). I have already built a prediction model, and I treat the values with high prediction errors as outliers/anomalies. I have done the sensitivity analysis for individual input values, but in the dataset the values are correlated with some other input values, e.g. …
Category: Data Science

Isolation forest - grouped by

I'm trying to use the isolation forest algorithm for outlier detection. The data has 2 columns: id and REV. The code below gives me an ungrouped result. Could you please advise how to get the result grouped by the first column (id)?

    import pandas as pd
    from sklearn.ensemble import IsolationForest
    from sklearn.preprocessing import StandardScaler

    data = pd.read_excel(my_path)
    outliers_fraction = 0.1
    # Scale every column, then fit one model on the whole data set
    scaler = StandardScaler()
    np_scaled = scaler.fit_transform(data)
    data = pd.DataFrame(np_scaled)
    model = IsolationForest(contamination=outliers_fraction)
    model.fit(data)
    data['anomaly'] = pd.Series(model.predict(data))
    print(data)

I added a picture of what I expect to see as the final result: Tried to use …
Category: Data Science
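
A hedged sketch of one way to get a grouped result: fit a separate IsolationForest on the REV values of each id instead of one model on the whole table. The synthetic data, the `flag_group` helper and the fixed `random_state` are assumptions for illustration. Scaling is omitted because tree-based splits are unaffected by a linear rescaling of a single feature, and standardizing the id column has no meaningful interpretation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical data: two ids, each with 49 ordinary REV values and 1 extreme one.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "id": np.repeat(["a", "b"], 50),
    "REV": np.concatenate([
        rng.normal(100, 5, 49), [500.0],
        rng.normal(1000, 50, 49), [10.0],
    ]),
})


def flag_group(group, contamination=0.1):
    """Fit an IsolationForest on one id's REV values; -1 marks an anomaly."""
    model = IsolationForest(contamination=contamination, random_state=0)
    out = group.copy()
    out["anomaly"] = model.fit_predict(group[["REV"]])
    return out


# One model per id, then stitch the pieces back together in the original order.
parts = [flag_group(group) for _, group in data.groupby("id")]
result = pd.concat(parts).sort_index()

# Number of flagged rows per id.
summary = result.groupby("id")["anomaly"].apply(lambda s: int((s == -1).sum()))
print(summary)
```

With `contamination=0.1` roughly a tenth of each group is flagged whether or not that many true anomalies exist, so the value should reflect what is expected in the data.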

Explainable anomaly detection

There is plenty of work on explaining predictions in supervised learning (e.g. SHAP values, LIME). What about anomaly detection in unsupervised learning? Is there any model for which there are libraries that can give you justifications, such as "row x is an anomaly because feature 1 is higher than 5.3 and feature 5 is equal to 'No'"?
Category: Data Science

How to use Autoencoders for outlier detection on images

I have a bunch of images taken from a camera showing a pipe and would like to detect whether the pipe is leaking or not. There are very few examples of leaking pipes in the data set. If I treat this as a supervised learning problem, I think it may not give good results because of the imbalanced data. I am thinking of using autoencoders and treating it as an outlier detection problem. I am new to deep learning …
Category: Data Science

Given daily sequence of events with only event ID labels (alphanum strings), what algorithms can be used to detect sequences that are outliers?

For example, the data might be something like this:

Sequence 1: ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
Sequence 2: ["CBA", "AAA", "YY123", "LMNOP", "AABBCC"]
Sequence 3: ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
...
Sequence N: ["DEF", "AAA", "ZZ123", "YYZZZ45", "AABBCC"]

Sequences 1 and 3 are the same, but sequences 2 and N are different. In the data set, there will be thousands of these sequences every day. Additional questions: How could I calculate a similarity (or difference) measure between sequences with sequences of …
Category: Data Science
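
One simple similarity measure for such sequences, sketched with Python's standard-library `difflib.SequenceMatcher`, which treats each event ID as a single token. The idea of ranking sequences by their mean similarity to all others is an assumption added for illustration and is not a method taken from the question. The four sequences are the ones shown in the example.

```python
from difflib import SequenceMatcher

sequences = {
    "Sequence 1": ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"],
    "Sequence 2": ["CBA", "AAA", "YY123", "LMNOP", "AABBCC"],
    "Sequence 3": ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"],
    "Sequence N": ["DEF", "AAA", "ZZ123", "YYZZZ45", "AABBCC"],
}


def similarity(a, b):
    """Ratio in [0, 1]: 2 * matched events / total events in both sequences."""
    return SequenceMatcher(None, a, b).ratio()


def mean_similarity(name):
    """Average similarity of one sequence to every other sequence."""
    others = [seq for other, seq in sequences.items() if other != name]
    return sum(similarity(sequences[name], seq) for seq in others) / len(others)


scores = {name: mean_similarity(name) for name in sequences}
least_typical = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: mean similarity {score:.3f}")
print("Least typical:", least_typical)
```

Comparing every pair is quadratic in the number of sequences, so with thousands of sequences per day it may be more practical to compare each new sequence against a small set of reference sequences.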

Replace Values in Vector on Specific Place in R

I want to make the $5^{th}$, $10^{th}$, $15^{th}$, $20^{th}$ and $25^{th}$ values of each vector an outlier in all xs by using x1[5]+OT1, x1[10]+OT1 and so on. For this purpose I have written this R code:

    n=25
    x1<-runif(n,0,1)
    x2<-runif(n,0,1)
    x3<-runif(n,0,1)
    x4<-runif(n,0,1)
    x<-data.frame(x1,x2,x3,x4)
    OT1<-mean(x1)+100
    OT2<-mean(x2)+100
    OT3<-mean(x3)+100
    OT4<-mean(x4)+100

I have tried the replace() command and also modify(), but neither of them replaces the values all at once, even in one vector. Kindly help me in this matter. Edit: following the comment of @user2974951, I tried this:

    x1[seq(5,25,5)]=x1[seq(5,25,5)]+100
    Nx1<-replace(x1,x1==x1[5],x1 …
Category: Data Science

Many separation lines using RBF kernel in SVM

Below is my code. It takes a range of numbers and creates a new column, label, that contains either -1 or 1. If the number is higher than 14000, we label it -1 (outlier). If the number is lower than 14000, we label it 1 (normal). ## Here I just import all the libraries and import the column with my dataset ## Yes, I am trying to find anomalies using only the data from …
Category: Data Science

Real-Time Outlier/Anomaly Detection?

My data is the usage/playing statistics for players of a specific game. One data point for a user is aggregated statistics for one week. The goal is to be able to detect when the account of the player was stolen/hacked/anything else went wrong. So my idea is for each player to have data points that each represent one week and then check whether the latest week is an outlier in the cluster. If it is - something is wrong with …
Category: Data Science

Should outliers be removed only from the target variable or from any variable where they are found?

What I often do is check boxplots and histograms for the target/dependent variable and, after much caution, treat or remove the outliers. But I do this only for the target variable. I.e., if I decide on removal, I simply drop the entire row where my target value was found to be outlying. Suppose I have outliers in some independent variables as well. What should I do there? Either, should I ignore them? Or, should I take the same approach with …
Category: Data Science

How does the alpha parameter change SGDRegressor behavior for outliers?

I am using SGDRegressor with a constant learning rate and the default loss function. I am curious to know how changing the alpha parameter from 0.0001 to 100 will change the regressor's behavior. Below is the sample code I have:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import SGDRegressor

    out = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]
    alpha = [0.0001, 1, 100]
    N = len(out)
    plt.figure(figsize=(20, 15))
    j = 1
    for i in alpha:
        X = b * np.sin(phi)  # Since for every alpha we want to start with original dataset, I included …
Category: Data Science
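
A small, self-contained sketch of the effect asked about, assuming scikit-learn's default squared-error loss and L2 penalty. It does not reproduce the truncated code from the question: the synthetic line, the three injected outliers and the `eta0=0.001` step size are all assumptions. In SGDRegressor, alpha is the strength of the regularization term, so a larger alpha shrinks the fitted slope toward zero regardless of what the data (outliers included) suggest.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical data: a noisy line with slope 3 plus three low outliers.
rng = np.random.default_rng(0)
X_in = rng.uniform(-1, 1, size=(200, 1))
y_in = 3.0 * X_in[:, 0] + rng.normal(0, 0.1, size=200)
X_out = np.array([[0.9], [0.95], [1.0]])
y_out = np.array([-20.0, -20.0, -20.0])

X_all = np.vstack([X_in, X_out])
y_all = np.concatenate([y_in, y_out])

coefs = {}
for alpha in [0.0001, 1, 100]:
    # A fresh model per alpha, so every fit starts from the same data.
    model = SGDRegressor(
        alpha=alpha,
        learning_rate="constant",
        eta0=0.001,
        max_iter=1000,
        tol=None,
        random_state=0,
    )
    model.fit(X_all, y_all)
    coefs[alpha] = float(model.coef_[0])

for alpha, coef in coefs.items():
    print(f"alpha={alpha}: slope={coef:.3f}")
```

With a tiny alpha the slope is close to the least-squares fit, which the outliers already pull below 3. With alpha=100 the penalty dominates and the slope is close to zero. Regularization therefore limits how large the coefficients can get, but it is not a substitute for a loss function that is robust to outliers.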

How do outliers and missing values impact these classifiers?

I am currently working with a bunch of classification models, especially Logistic Regression, KNN, Naive Bayes, SVM, and Decision Trees, for my machine learning class. I know how to find and remove missing values and outliers. But I would like to know which of the above models would perform really badly if the outliers and missing values are not removed. For example, if I decide to leave the outliers and missing values in the dataset, which model should …
Category: Data Science

How to remove outliers properly?

I was wondering what is the best practice for removing outliers from data. Plotting a boxplot for each feature (column of the dataset) and removing data that fall outside the whiskers seems like a naive and problematic approach. For example, say you have many individuals with a 'gender' label and an 'income' label. Also assume that there are many more men in the dataset than women. Unfortunately, due to income disparity we may see that women receive a lower wage …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.