I am not able to decide between the two common outlier rules: when should I go with the standard deviation, and when with the median? My understanding of the standard-deviation rule is: if a data point is more than 2 standard deviations away from the mean, we consider it an outlier. Similarly, for the median-based rule, we say that any data point falling below Q1 - 1.5 IQR or above Q3 + 1.5 IQR is an outlier. So I am confused as to which one to choose. …
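For illustration, here is a minimal sketch of both rules side by side (the toy data and the 2-standard-deviation cutoff are assumptions, not from the question):

    import numpy as np

    x = np.array([9.0, 10.0, 10.5, 9.5, 11.0, 10.2, 35.0])  # toy data with one extreme value

    # Rule 1: standard-deviation (z-score) rule
    z = (x - x.mean()) / x.std()
    sd_outliers = x[np.abs(z) > 2]

    # Rule 2: median/IQR rule
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

    print(sd_outliers, iqr_outliers)

The IQR rule is generally safer on skewed data, because the mean and standard deviation are themselves distorted by the very outliers being hunted.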
I am analyzing a portfolio of about 225 stocks and have data for each of them on "Price/Earnings ratio", "Return on Assets", and "Earnings per share growth". I would like to cluster these stocks into 3 or 4 groups based on their attributes. However, there are substantial outliers in the data set, and instead of removing them altogether I would like to keep them in. What ML algorithm would be best suited for this? I have been told …
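One hedged sketch of that setup (the file and column names are assumptions): scale with a robust scaler so the outliers do not dominate the distance metric, then cluster, e.g. with k-means:

    import pandas as pd
    from sklearn.preprocessing import RobustScaler
    from sklearn.cluster import KMeans

    # stocks: DataFrame with columns ['pe_ratio', 'roa', 'eps_growth'] (assumed names)
    stocks = pd.read_csv("stocks.csv")

    # RobustScaler centers on the median and scales by the IQR,
    # so extreme values are kept but have less influence on distances
    X = RobustScaler().fit_transform(stocks[["pe_ratio", "roa", "eps_growth"]])

    stocks["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

Density-based methods such as DBSCAN are another option, since they assign outliers to a separate "noise" label rather than forcing them into a cluster.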
In the original publication of the Isolation Forest algorithm, the authors mention a height-limit parameter that controls the granularity of the algorithm. I did not find that explicit parameter in the scikit-learn implementation, and I was wondering whether it is possible to control granularity in some other way?
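For context: in the paper the height limit defaults to ceil(log2(psi)), where psi is the sub-sampling size. As far as I can tell from the scikit-learn source, the tree depth is capped in the same way from max_samples, so granularity is controlled indirectly through that parameter:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 3)

    # Smaller max_samples -> shallower trees -> coarser isolation;
    # larger max_samples -> deeper trees -> finer granularity.
    for psi in (64, 256, 1000):
        model = IsolationForest(max_samples=psi, random_state=0).fit(X)
        print(psi, int(np.ceil(np.log2(psi))))  # implied height limit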
I have a feature whose values all lie between 0 and 1, except for a few outliers larger than 1. I am trying to collect all the methods that can help decrease the outliers' influence on non-tree models:

- StandardScaler
- rank transform of the features
- np.log1p(x) transform of the data
- MinMaxScaler
- winsorization

I can't think of any others ... I guess that's all?
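For reference, a minimal sketch of the three transform-style options from the list (the toy data and the 5% winsorization limit are assumptions):

    import numpy as np
    from scipy.stats import rankdata
    from scipy.stats.mstats import winsorize

    x = np.array([0.1, 0.2, 0.35, 0.5, 0.9, 3.0, 7.5])  # mostly in [0, 1], a few large outliers

    x_log = np.log1p(x)                     # compresses large values, preserves order
    x_rank = rankdata(x) / len(x)           # rank transform: bounded, distribution-free
    x_win = winsorize(x, limits=(0, 0.05))  # clip the top 5% to the 95th percentile

RobustScaler from scikit-learn (median/IQR based) and simple clipping to a fixed interval are two more options in the same spirit.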
How does feature selection impact outlier detection, and conversely, how does removing outliers impact feature selection? This may be a basic question, but I am asking to understand where the boundary between the two steps lies. Thanks in advance. I have gone through the following: Feature selection and outlier order
I have a problem where I want to identify vendors with unusually high invoice amounts. What would be the best way to identify such invoices? I am trying to use Isolation Forest but am having trouble grouping the result by vendor. Any help will be appreciated. The data is in the format below.

    Vendor ID    Amount
    1            456
    2            1000
    1            489
    3            896
    2            4576
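A hedged sketch of one way to get per-vendor results (column names taken from the sample above): fit a separate IsolationForest on each vendor's amounts via groupby, so each vendor is scored against its own history:

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    df = pd.DataFrame({"vendor_id": [1, 2, 1, 3, 2],
                       "amount": [456, 1000, 489, 896, 4576]})

    def flag(amounts):
        # -1 = anomalous invoice for this vendor, 1 = normal
        model = IsolationForest(contamination=0.1, random_state=0)
        return model.fit_predict(amounts.to_frame())

    df["anomaly"] = df.groupby("vendor_id")["amount"].transform(flag)

Note that Isolation Forest needs a reasonable number of rows per vendor to be meaningful; for vendors with only a handful of invoices, a simple robust z-score per vendor may work better.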
I have time-series data containing user actions at certain time intervals, e.g.:

    Date              UserId   Directory   Operation      Result
    01/01/2017 09:00  user1    dir1        created_file   success
    01/01/2017 09:00  user3    dir10       deleted_file   permission_denied

There are more than 10K unique UserIds, 10 distinct operations, and 4 distinct results. I need to perform anomaly detection on user behavior in real time. Any suggestions on which method I should use? The anomaly detector needs to flag whether some user's operations are outliers. A very small subset of input data will …
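One hedged starting point (the file name, window size, and feature choices are all assumptions): aggregate events into per-user, per-window count features, then score each window with an unsupervised detector:

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # events: DataFrame with columns ['date', 'user_id', 'operation', 'result']
    events = pd.read_csv("events.csv", parse_dates=["date"])

    # Hourly per-user counts of each operation/result combination
    features = (events
                .groupby([pd.Grouper(key="date", freq="h"), "user_id", "operation", "result"])
                .size()
                .unstack(["operation", "result"], fill_value=0))

    model = IsolationForest(random_state=0).fit(features)
    scores = model.decision_function(features)  # lower = more anomalous

For true real-time scoring you would fit on historical windows and score each incoming window as it arrives; streaming-oriented libraries (e.g. the half-space trees in the river package) are another option.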
I am trying to explain outliers using sensitivity analysis. Consider that my dataset contains 19 different input values and 1 output value (so overall there are 20 columns, all numerical). I have already built a prediction model, and I am treating values with high prediction errors as outliers/anomalies. I have done the sensitivity analysis for individual input values, but in the dataset the inputs are correlated with some other inputs, e.g. …
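For reference, a minimal one-at-a-time sensitivity sketch (the model, row, and step size are placeholders); as the question hints, this style of analysis is exactly what breaks down when inputs are correlated, because perturbing one feature alone creates combinations that never occur in the real data:

    import numpy as np

    def sensitivity(model, x_row, feature_idx, delta=0.1):
        """Perturb one input by +/- delta and measure the shift in the prediction."""
        x_plus, x_minus = x_row.copy(), x_row.copy()
        x_plus[feature_idx] += delta
        x_minus[feature_idx] -= delta
        return (model.predict([x_plus])[0] - model.predict([x_minus])[0]) / (2 * delta)

Permutation importance or SHAP values handle interactions somewhat better, though correlated features remain a known weak spot for both.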
I'm trying to use the Isolation Forest algorithm for outlier detection. The data has 2 columns: id and REV. The code below gives me an ungrouped result. Could you please advise how to get the result grouped by the first column (id)?

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import IsolationForest

    data = pd.read_excel(my_path)  # columns: id, REV
    outliers_fraction = 0.1

    # Scale the features (note: this also scales the id column)
    scaler = StandardScaler()
    np_scaled = scaler.fit_transform(data)
    data = pd.DataFrame(np_scaled, columns=data.columns)

    model = IsolationForest(contamination=outliers_fraction)
    model.fit(data)
    data['anomaly'] = model.predict(data)
    print(data)

I added a picture of what I expect to see as the final result. Tried to use …
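A hedged sketch of one fix (assuming id is just a key, not a feature): fit on REV only, leave id untouched, then group the labeled frame by id:

    import pandas as pd
    from sklearn.ensemble import IsolationForest
    from sklearn.preprocessing import StandardScaler

    data = pd.read_excel(my_path)  # columns: id, REV

    # Scale only the numeric feature, keeping id out of the model
    rev_scaled = StandardScaler().fit_transform(data[["REV"]])

    model = IsolationForest(contamination=0.1, random_state=0)
    data["anomaly"] = model.fit_predict(rev_scaled)

    # Per-id view, e.g. how many anomalous rows each id has
    print(data.groupby("id")["anomaly"].apply(lambda s: (s == -1).sum()))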
There are plenty of methods for explaining predictions in supervised learning (e.g. SHAP values, LIME). What about anomaly detection in unsupervised learning? Is there any model for which libraries can give you justifications, such as "row x is an anomaly because feature 1 is higher than 5.3 and feature 5 is equal to 'No'"?
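One concrete option, to the best of my knowledge: shap's TreeExplainer accepts scikit-learn's IsolationForest, since it is tree-based, so a flagged row's anomaly score can be decomposed into per-feature contributions (X here is a placeholder feature matrix):

    import shap
    from sklearn.ensemble import IsolationForest

    model = IsolationForest(random_state=0).fit(X)  # X: your feature matrix
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # For a flagged row i, the largest-magnitude entries of shap_values[i]
    # indicate which features pushed its score toward "anomalous"

This gives additive attributions rather than rule-style justifications like the one quoted, but it answers the "why is this row anomalous" question feature by feature.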
I have a bunch of images taken from a camera showing a pipe, and I would like to detect whether the pipe is leaking. There are very few examples of leaking pipes in the data set, so treating this as a supervised learning problem may not give good results due to the class imbalance. I am instead thinking of using autoencoders and treating it as an outlier detection problem. I am new to deep learning …
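A minimal sketch of that idea (the image size, architecture, and the x_normal/x_test arrays are all assumptions): train a convolutional autoencoder on non-leaking images only, then flag images whose reconstruction error is unusually high:

    import numpy as np
    from tensorflow.keras import layers, models

    inp = layers.Input(shape=(128, 128, 1))
    x = layers.Conv2D(16, 3, activation="relu", padding="same", strides=2)(inp)
    x = layers.Conv2D(8, 3, activation="relu", padding="same", strides=2)(x)
    x = layers.Conv2DTranspose(8, 3, activation="relu", padding="same", strides=2)(x)
    x = layers.Conv2DTranspose(16, 3, activation="relu", padding="same", strides=2)(x)
    out = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

    autoencoder = models.Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(x_normal, x_normal, epochs=20, batch_size=32)  # normal images only

    # Reconstruction error as anomaly score; threshold chosen on a validation set
    errors = np.mean((x_test - autoencoder.predict(x_test)) ** 2, axis=(1, 2, 3))

The key assumption is that a model trained only on intact pipes reconstructs leaks poorly, so leaking images end up in the tail of the error distribution.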
For example, the data might be something like this:

    Sequence 1: ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
    Sequence 2: ["CBA", "AAA", "YY123", "LMNOP", "AABBCC"]
    Sequence 3: ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
    ...
    Sequence N: ["DEF", "AAA", "ZZ123", "YYZZZ45", "AABBCC"]

Sequences 1 and 3 are the same, but sequences 2 and N are different. In the data set, there will be thousands of these sequences every day. Additional questions: How could I calculate a similarity (or difference) measure between sequences with sequences of …
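As a starting point, here is a minimal sketch (pure Python, no libraries assumed) of a normalized edit distance that treats each string in the list as a single symbol, so it also handles sequences of different lengths:

    def token_edit_distance(a, b):
        """Levenshtein distance over whole tokens, not characters."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[m][n]

    s1 = ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
    s2 = ["CBA", "AAA", "YY123", "LMNOP", "AABBCC"]
    similarity = 1 - token_edit_distance(s1, s2) / max(len(s1), len(s2))

If token order doesn't matter, a Jaccard similarity over the two token sets is an even simpler alternative.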
I want to make the $5^{th}$, $10^{th}$, $15^{th}$, $20^{th}$ and $25^{th}$ values of each vector an outlier in all xs, using x1[5]+OT1, x1[10]+OT1, and so on. For this purpose I have written this R code:

    n <- 25
    x1 <- runif(n, 0, 1)
    x2 <- runif(n, 0, 1)
    x3 <- runif(n, 0, 1)
    x4 <- runif(n, 0, 1)
    x <- data.frame(x1, x2, x3, x4)

    # Outlier offsets: each column's mean plus 100
    OT1 <- mean(x1) + 100
    OT2 <- mean(x2) + 100
    OT3 <- mean(x3) + 100
    OT4 <- mean(x4) + 100

I have tried replace() and also modify(), but neither of them replaces the values all at once, even in a single vector. Kindly help me with this. Edit: following the comment of @user2974951, I tried this:

    x1[seq(5, 25, 5)] <- x1[seq(5, 25, 5)] + 100
    Nx1 <- replace(x1, x1 == x1[5], x1 …
I have a set of documents and I want to identify and remove the outlier documents. I am just wondering if doc2vec can be used for this task, or whether there are any recently developed, promising algorithms that I could use instead. EDIT: I am currently using a bag-of-words model to identify outliers.
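A hedged sketch of the doc2vec idea (gensim 4.x API; the corpus variable and the cutoff are placeholders): embed each document, then flag documents far from the centroid of the embedding space:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [TaggedDocument(words=text.lower().split(), tags=[i])
            for i, text in enumerate(corpus)]  # corpus: list of raw strings

    model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=40)
    vectors = np.array([model.dv[i] for i in range(len(docs))])

    # Distance from the centroid as a simple outlier score
    centroid = vectors.mean(axis=0)
    scores = np.linalg.norm(vectors - centroid, axis=1)
    outlier_idx = np.argsort(scores)[-10:]  # 10 most distant documents

Running an off-the-shelf detector such as LocalOutlierFactor on the vectors is a natural refinement over the raw centroid distance.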
Below is my code. It takes a numeric column and creates a new label column containing either -1 or 1: if the value is higher than 14000, we label it with -1 (outlier); if the value is lower than 14000, we label it with 1 (normal).

    ## Here I just import all the libraries and import the column with my dataset
    ## Yes, I am trying to find anomalies using only the data from …
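For reference, the labeling rule itself is a one-liner in pandas (the column names here are assumptions):

    import numpy as np
    import pandas as pd

    df["label"] = np.where(df["value"] > 14000, -1, 1)  # -1 = outlier, 1 = normal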
My data consists of usage/playing statistics for players of a specific game. One data point for a user is the aggregated statistics for one week. The goal is to detect when a player's account has been stolen/hacked or anything else has gone wrong. So my idea is, for each player, to have data points that each represent one week, and then check whether the latest week is an outlier relative to that player's cluster. If it is, something is wrong with …
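A minimal sketch of that per-player check (the feature columns and history length are assumptions), using LocalOutlierFactor in novelty mode so the player's past weeks form the reference set:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    def latest_week_is_outlier(history, latest):
        """history: (n_weeks, n_features) array of past weeks; latest: (n_features,) vector."""
        lof = LocalOutlierFactor(n_neighbors=min(5, len(history) - 1), novelty=True)
        lof.fit(history)
        return lof.predict(latest.reshape(1, -1))[0] == -1

This needs enough history per player (roughly 10+ weeks) before the neighborhood estimate is stable.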
What I often do is check boxplots and histograms for the target/dependent variable and, after much caution, treat/remove the outliers. But I do this only for the target variable, i.e., if I decide on removal, I simply drop the entire row where the target value was found to be outlying. Suppose I also have outliers in some independent variables. What should I do there? Should I ignore them, or should I take the same approach with …
I am using SGDRegressor with a constant learning rate and the default loss function. I am curious how changing the alpha parameter from 0.0001 to 100 will change the regressor's behavior. Below is the sample code I have:

    from sklearn.linear_model import SGDRegressor
    import numpy as np
    import matplotlib.pyplot as plt

    out = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]
    alpha = [0.0001, 1, 100]
    N = len(out)
    plt.figure(figsize=(20, 15))
    j = 1
    for i in alpha:
        X = b * np.sin(phi)  # Since for every alpha we want to start with original dataset, I included …
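For context, alpha in SGDRegressor is the constant multiplying the regularization term (L2 by default), so large values shrink the coefficients toward zero. A tiny self-contained demonstration (toy data, not the dataset above):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    X = rng.randn(200, 1)
    y = 3.0 * X[:, 0] + rng.randn(200) * 0.1

    for a in (0.0001, 1, 100):
        reg = SGDRegressor(alpha=a, learning_rate="constant", eta0=0.01,
                           max_iter=1000, random_state=0).fit(X, y)
        print(a, reg.coef_)  # coefficient shrinks as alpha grows

With alpha=100 the penalty dominates the squared-error loss and the fitted slope collapses toward zero.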
I am currently working with several classification models for my machine learning class: logistic regression, KNN, Naive Bayes, SVM, and decision trees. I know how to find and remove missing values and outliers, but I would like to know which of the above models would perform really badly if the outliers and missing values are not removed. In other words, if I decide to leave the outliers and missing values in the dataset, which model should …
I was wondering what the best practice is for removing outliers from data. Plotting a boxplot for each feature (column of the dataset) and removing data that fall outside the whiskers seems like a naive and problematic approach. For example, say you have many individuals with a 'gender' label and an 'income' label, and assume that there are many more men in the dataset than women. Unfortunately, due to income disparity, we may see that women receive a lower wage …
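The concern can be made concrete with a per-group filter (the df variable and column names are assumptions): applying the whisker rule within each subgroup, rather than to the pooled column, avoids flagging one group's typical values as outliers just because the other group dominates:

    import pandas as pd

    def iqr_mask(s):
        # True for values inside the boxplot whiskers of this series
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        return s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Pooled rule vs. the same rule applied within each gender group
    pooled_ok = iqr_mask(df["income"])
    grouped_ok = df.groupby("gender")["income"].transform(iqr_mask)

    df_clean = df[grouped_ok]

Comparing pooled_ok and grouped_ok on skewed group sizes shows exactly the failure mode described: the pooled rule disproportionately discards rows from the smaller group.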