I am not able to decide between the two common outlier rules: when should I go with the standard deviation, and when with the median? My understanding of the standard-deviation rule is: if a data point is more than 2 standard deviations away from the mean, we consider it an outlier. Similarly, for the median-based rule, we say that any data point falling below Q1 - 1.5 IQR or above Q3 + 1.5 IQR is an outlier. So I am confused as to which one to choose. …
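For illustration, here is a minimal sketch of both rules side by side (the toy data and the 2-standard-deviation cutoff are assumptions, not from the question):

    import numpy as np

    x = np.array([9.0, 10.0, 10.5, 9.5, 11.0, 10.2, 35.0])  # toy data with one extreme value

    # Rule 1: standard-deviation (z-score) rule
    z = (x - x.mean()) / x.std()
    sd_outliers = x[np.abs(z) > 2]

    # Rule 2: median/IQR rule
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

    print(sd_outliers, iqr_outliers)

The IQR rule is generally safer on skewed data, because the mean and standard deviation are themselves distorted by the very outliers being hunted.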
I am analyzing a portfolio of about 225 stocks and have data for each of them on "Price/Earnings ratio", "Return on Assets", and "Earnings per share growth". I would like to cluster these stocks into 3 or 4 groups based on their attributes. However, there are substantial outliers in the data set, and instead of removing them altogether I would like to keep them in. What ML algorithm would be best suited for this? I have been told …
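One hedged sketch of that setup (the file and column names are assumptions): scale with a robust scaler so the outliers do not dominate the distance metric, then cluster, e.g. with k-means:

    import pandas as pd
    from sklearn.preprocessing import RobustScaler
    from sklearn.cluster import KMeans

    # stocks: DataFrame with columns ['pe_ratio', 'roa', 'eps_growth'] (assumed names)
    stocks = pd.read_csv("stocks.csv")

    # RobustScaler centers on the median and scales by the IQR,
    # so extreme values are kept but have less influence on distances
    X = RobustScaler().fit_transform(stocks[["pe_ratio", "roa", "eps_growth"]])

    stocks["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

Density-based methods such as DBSCAN are another option, since they assign outliers to a separate "noise" label rather than forcing them into a cluster.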
In the original publication of the Isolation Forest algorithm, the authors mention a height-limit parameter that controls the granularity of the algorithm. I did not find that explicit parameter in the scikit-learn implementation, and I was wondering whether it is possible to control granularity in some other way?
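For context: in the paper the height limit defaults to ceil(log2(psi)), where psi is the sub-sampling size. As far as I can tell from the scikit-learn source, the tree depth is capped in the same way from max_samples, so granularity is controlled indirectly through that parameter:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 3)

    # Smaller max_samples -> shallower trees -> coarser isolation;
    # larger max_samples -> deeper trees -> finer granularity.
    for psi in (64, 256, 1000):
        model = IsolationForest(max_samples=psi, random_state=0).fit(X)
        print(psi, int(np.ceil(np.log2(psi))))  # implied height limit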
I have a feature whose values all lie between 0 and 1, except for a few outliers larger than 1. I am trying to collect all the methods that can help decrease the outliers' influence on non-tree models:

- StandardScaler
- rank transform of the features
- np.log1p(x) transform of the data
- MinMaxScaler
- winsorization

I can't think of any others ... I guess that's all?
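For reference, a minimal sketch of the three transform-style options from the list (the toy data and the 5% winsorization limit are assumptions):

    import numpy as np
    from scipy.stats import rankdata
    from scipy.stats.mstats import winsorize

    x = np.array([0.1, 0.2, 0.35, 0.5, 0.9, 3.0, 7.5])  # mostly in [0, 1], a few large outliers

    x_log = np.log1p(x)                     # compresses large values, preserves order
    x_rank = rankdata(x) / len(x)           # rank transform: bounded, distribution-free
    x_win = winsorize(x, limits=(0, 0.05))  # clip the top 5% to the 95th percentile

RobustScaler from scikit-learn (median/IQR based) and simple clipping to a fixed interval are two more options in the same spirit.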
How does feature selection impact outlier detection, and conversely, how does removing outliers impact feature selection? This may be a basic question, but I am asking to understand where the boundary between the two steps lies. Thanks in advance. I have gone through the following: Feature selection and outlier order
I have a problem where I want to identify vendors with unusually high invoice amounts. What would be the best way to identify such invoices? I am trying to use Isolation Forest but am having trouble grouping the result by vendor. Any help will be appreciated. The data is in the format below.

    Vendor ID    Amount
    1            456
    2            1000
    1            489
    3            896
    2            4576
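A hedged sketch of one way to get per-vendor results (column names taken from the sample above): fit a separate IsolationForest on each vendor's amounts via groupby, so each vendor is scored against its own history:

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    df = pd.DataFrame({"vendor_id": [1, 2, 1, 3, 2],
                       "amount": [456, 1000, 489, 896, 4576]})

    def flag(amounts):
        # -1 = anomalous invoice for this vendor, 1 = normal
        model = IsolationForest(contamination=0.1, random_state=0)
        return model.fit_predict(amounts.to_frame())

    df["anomaly"] = df.groupby("vendor_id")["amount"].transform(flag)

Note that Isolation Forest needs a reasonable number of rows per vendor to be meaningful; for vendors with only a handful of invoices, a simple robust z-score per vendor may work better.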
I have time-series data containing user actions at certain time intervals, e.g.:

    Date              UserId   Directory   Operation      Result
    01/01/2017 09:00  user1    dir1        created_file   success
    01/01/2017 09:00  user3    dir10       deleted_file   permission_denied

There are more than 10K unique UserIds, 10 distinct operations, and 4 distinct results. I need to perform anomaly detection on user behavior in real time. Any suggestions on which method I should use? The anomaly detector needs to flag whether some user's operations are outliers. A very small subset of input data will …
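One hedged starting point (the file name, window size, and feature choices are all assumptions): aggregate events into per-user, per-window count features, then score each window with an unsupervised detector:

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # events: DataFrame with columns ['date', 'user_id', 'operation', 'result']
    events = pd.read_csv("events.csv", parse_dates=["date"])

    # Hourly per-user counts of each operation/result combination
    features = (events
                .groupby([pd.Grouper(key="date", freq="h"), "user_id", "operation", "result"])
                .size()
                .unstack(["operation", "result"], fill_value=0))

    model = IsolationForest(random_state=0).fit(features)
    scores = model.decision_function(features)  # lower = more anomalous

For true real-time scoring you would fit on historical windows and score each incoming window as it arrives; streaming-oriented libraries (e.g. the half-space trees in the river package) are another option.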
I am trying to explain outliers using sensitivity analysis. Consider that my dataset contains 19 different input values and 1 output value (so overall there are 20 columns, all numerical). I have already built a prediction model, and I am treating values with high prediction errors as outliers/anomalies. I have done the sensitivity analysis for individual input values, but in the dataset the inputs are correlated with some other inputs, e.g. …
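For reference, a minimal one-at-a-time sensitivity sketch (the model, row, and step size are placeholders); as the question hints, this style of analysis is exactly what breaks down when inputs are correlated, because perturbing one feature alone creates combinations that never occur in the real data:

    import numpy as np

    def sensitivity(model, x_row, feature_idx, delta=0.1):
        """Perturb one input by +/- delta and measure the shift in the prediction."""
        x_plus, x_minus = x_row.copy(), x_row.copy()
        x_plus[feature_idx] += delta
        x_minus[feature_idx] -= delta
        return (model.predict([x_plus])[0] - model.predict([x_minus])[0]) / (2 * delta)

Permutation importance or SHAP values handle interactions somewhat better, though correlated features remain a known weak spot for both.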
I'm trying to use the Isolation Forest algorithm for outlier detection. The data has 2 columns: id and REV. The code below gives me an ungrouped result. Could you please advise how to get the result grouped by the first column (id)?

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import IsolationForest

    data = pd.read_excel(my_path)  # columns: id, REV
    outliers_fraction = 0.1

    # Scale the features (note: this also scales the id column)
    scaler = StandardScaler()
    np_scaled = scaler.fit_transform(data)
    data = pd.DataFrame(np_scaled, columns=data.columns)

    model = IsolationForest(contamination=outliers_fraction)
    model.fit(data)
    data['anomaly'] = model.predict(data)
    print(data)

I added a picture of what I expect to see as the final result. Tried to use …
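A hedged sketch of one fix (assuming id is just a key, not a feature): fit on REV only, leave id untouched, then group the labeled frame by id:

    import pandas as pd
    from sklearn.ensemble import IsolationForest
    from sklearn.preprocessing import StandardScaler

    data = pd.read_excel(my_path)  # columns: id, REV

    # Scale only the numeric feature, keeping id out of the model
    rev_scaled = StandardScaler().fit_transform(data[["REV"]])

    model = IsolationForest(contamination=0.1, random_state=0)
    data["anomaly"] = model.fit_predict(rev_scaled)

    # Per-id view, e.g. how many anomalous rows each id has
    print(data.groupby("id")["anomaly"].apply(lambda s: (s == -1).sum()))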
There are plenty of methods for explaining predictions in supervised learning (e.g. SHAP values, LIME). What about anomaly detection in unsupervised learning? Is there any model for which libraries can give you justifications, such as "row x is an anomaly because feature 1 is higher than 5.3 and feature 5 is equal to 'No'"?
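One concrete option, to the best of my knowledge: shap's TreeExplainer accepts scikit-learn's IsolationForest, since it is tree-based, so a flagged row's anomaly score can be decomposed into per-feature contributions (X here is a placeholder feature matrix):

    import shap
    from sklearn.ensemble import IsolationForest

    model = IsolationForest(random_state=0).fit(X)  # X: your feature matrix
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # For a flagged row i, the largest-magnitude entries of shap_values[i]
    # indicate which features pushed its score toward "anomalous"

This gives additive attributions rather than rule-style justifications like the one quoted, but it answers the "why is this row anomalous" question feature by feature.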
I have a bunch of images taken from a camera showing a pipe, and I would like to detect whether the pipe is leaking. There are very few examples of leaking pipes in the data set, so treating this as a supervised learning problem may not give good results due to the class imbalance. I am instead thinking of using autoencoders and treating it as an outlier detection problem. I am new to deep learning …
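A minimal sketch of that idea (the image size, architecture, and the x_normal/x_test arrays are all assumptions): train a convolutional autoencoder on non-leaking images only, then flag images whose reconstruction error is unusually high:

    import numpy as np
    from tensorflow.keras import layers, models

    inp = layers.Input(shape=(128, 128, 1))
    x = layers.Conv2D(16, 3, activation="relu", padding="same", strides=2)(inp)
    x = layers.Conv2D(8, 3, activation="relu", padding="same", strides=2)(x)
    x = layers.Conv2DTranspose(8, 3, activation="relu", padding="same", strides=2)(x)
    x = layers.Conv2DTranspose(16, 3, activation="relu", padding="same", strides=2)(x)
    out = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

    autoencoder = models.Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(x_normal, x_normal, epochs=20, batch_size=32)  # normal images only

    # Reconstruction error as anomaly score; threshold chosen on a validation set
    errors = np.mean((x_test - autoencoder.predict(x_test)) ** 2, axis=(1, 2, 3))

The key assumption is that a model trained only on intact pipes reconstructs leaks poorly, so leaking images end up in the tail of the error distribution.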
For example, the data might be something like this:

    Sequence 1: ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
    Sequence 2: ["CBA", "AAA", "YY123", "LMNOP", "AABBCC"]
    Sequence 3: ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
    ...
    Sequence N: ["DEF", "AAA", "ZZ123", "YYZZZ45", "AABBCC"]

Sequences 1 and 3 are the same, but sequences 2 and N are different. In the data set, there will be thousands of these sequences every day. Additional questions: How could I calculate a similarity (or difference) measure between sequences with sequences of …
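As a starting point, here is a minimal sketch (pure Python, no libraries assumed) of a normalized edit distance that treats each string in the list as a single symbol, so it also handles sequences of different lengths:

    def token_edit_distance(a, b):
        """Levenshtein distance over whole tokens, not characters."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[m][n]

    s1 = ["ABC", "AAA", "ZZ123", "RRZZZ45", "AABBCC"]
    s2 = ["CBA", "AAA", "YY123", "LMNOP", "AABBCC"]
    similarity = 1 - token_edit_distance(s1, s2) / max(len(s1), len(s2))

If token order doesn't matter, a Jaccard similarity over the two token sets is an even simpler alternative.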
I want to make the $5^{th}$, $10^{th}$, $15^{th}$, $20^{th}$ and $25^{th}$ values of each vector an outlier in all xs, using x1[5]+OT1, x1[10]+OT1, and so on. For this purpose I have written this R code:

    n <- 25
    x1 <- runif(n, 0, 1)
    x2 <- runif(n, 0, 1)
    x3 <- runif(n, 0, 1)
    x4 <- runif(n, 0, 1)
    x <- data.frame(x1, x2, x3, x4)

    # Outlier offsets: each column's mean plus 100
    OT1 <- mean(x1) + 100
    OT2 <- mean(x2) + 100
    OT3 <- mean(x3) + 100
    OT4 <- mean(x4) + 100

I have tried replace() and also modify(), but neither of them replaces the values all at once, even in a single vector. Kindly help me with this. Edit: following the comment of @user2974951, I tried this:

    x1[seq(5, 25, 5)] <- x1[seq(5, 25, 5)] + 100
    Nx1 <- replace(x1, x1 == x1[5], x1 …
I have a set of documents and I want to identify and remove the outlier documents. I am just wondering if doc2vec can be used for this task, or whether there are any recently developed, promising algorithms that I could use instead. EDIT: I am currently using a bag-of-words model to identify outliers.
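A hedged sketch of the doc2vec idea (gensim 4.x API; the corpus variable and the cutoff are placeholders): embed each document, then flag documents far from the centroid of the embedding space:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [TaggedDocument(words=text.lower().split(), tags=[i])
            for i, text in enumerate(corpus)]  # corpus: list of raw strings

    model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=40)
    vectors = np.array([model.dv[i] for i in range(len(docs))])

    # Distance from the centroid as a simple outlier score
    centroid = vectors.mean(axis=0)
    scores = np.linalg.norm(vectors - centroid, axis=1)
    outlier_idx = np.argsort(scores)[-10:]  # 10 most distant documents

Running an off-the-shelf detector such as LocalOutlierFactor on the vectors is a natural refinement over the raw centroid distance.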
Below is my code. It takes a numeric column and creates a new label column containing either -1 or 1: if the value is higher than 14000, we label it with -1 (outlier); if the value is lower than 14000, we label it with 1 (normal).

    ## Here I just import all the libraries and import the column with my dataset
    ## Yes, I am trying to find anomalies using only the data from …
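For reference, the labeling rule itself is a one-liner in pandas (the column names here are assumptions):

    import numpy as np
    import pandas as pd

    df["label"] = np.where(df["value"] > 14000, -1, 1)  # -1 = outlier, 1 = normal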
My data consists of usage/playing statistics for players of a specific game. One data point for a user is the aggregated statistics for one week. The goal is to detect when a player's account has been stolen/hacked or anything else has gone wrong. So my idea is, for each player, to have data points that each represent one week, and then check whether the latest week is an outlier relative to that player's cluster. If it is, something is wrong with …
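A minimal sketch of that per-player check (the feature columns and history length are assumptions), using LocalOutlierFactor in novelty mode so the player's past weeks form the reference set:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    def latest_week_is_outlier(history, latest):
        """history: (n_weeks, n_features) array of past weeks; latest: (n_features,) vector."""
        lof = LocalOutlierFactor(n_neighbors=min(5, len(history) - 1), novelty=True)
        lof.fit(history)
        return lof.predict(latest.reshape(1, -1))[0] == -1

This needs enough history per player (roughly 10+ weeks) before the neighborhood estimate is stable.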
What I often do is check boxplots and histograms for the target/dependent variable and, after much caution, treat/remove the outliers. But I do this only for the target variable, i.e., if I decide on removal, I simply drop the entire row where the target value was found to be outlying. Suppose I also have outliers in some independent variables. What should I do there? Should I ignore them, or should I take the same approach with …
I am using SGDRegressor with a constant learning rate and the default loss function. I am curious how changing the alpha parameter from 0.0001 to 100 will change the regressor's behavior. Below is the sample code I have:

    from sklearn.linear_model import SGDRegressor
    import numpy as np
    import matplotlib.pyplot as plt

    out = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]
    alpha = [0.0001, 1, 100]
    N = len(out)
    plt.figure(figsize=(20, 15))
    j = 1
    for i in alpha:
        X = b * np.sin(phi)  # Since for every alpha we want to start with original dataset, I included …
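For context, alpha in SGDRegressor is the constant multiplying the regularization term (L2 by default), so large values shrink the coefficients toward zero. A tiny self-contained demonstration (toy data, not the dataset above):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    X = rng.randn(200, 1)
    y = 3.0 * X[:, 0] + rng.randn(200) * 0.1

    for a in (0.0001, 1, 100):
        reg = SGDRegressor(alpha=a, learning_rate="constant", eta0=0.01,
                           max_iter=1000, random_state=0).fit(X, y)
        print(a, reg.coef_)  # coefficient shrinks as alpha grows

With alpha=100 the penalty dominates the squared-error loss and the fitted slope collapses toward zero.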
I am currently working with several classification models for my machine learning class: logistic regression, KNN, Naive Bayes, SVM, and decision trees. I know how to find and remove missing values and outliers, but I would like to know which of the above models would perform really badly if the outliers and missing values are not removed. In other words, if I decide to leave the outliers and missing values in the dataset, which model should …
I was wondering what the best practice is for removing outliers from data. Plotting a boxplot for each feature (column of the dataset) and removing data that fall outside the whiskers seems like a naive and problematic approach. For example, say you have many individuals with a 'gender' label and an 'income' label, and assume that there are many more men in the dataset than women. Unfortunately, due to income disparity, we may see that women receive a lower wage …
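The concern can be made concrete with a per-group filter (the df variable and column names are assumptions): applying the whisker rule within each subgroup, rather than to the pooled column, avoids flagging one group's typical values as outliers just because the other group dominates:

    import pandas as pd

    def iqr_mask(s):
        # True for values inside the boxplot whiskers of this series
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        return s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Pooled rule vs. the same rule applied within each gender group
    pooled_ok = iqr_mask(df["income"])
    grouped_ok = df.groupby("gender")["income"].transform(iqr_mask)

    df_clean = df[grouped_ok]

Comparing pooled_ok and grouped_ok on skewed group sizes shows exactly the failure mode described: the pooled rule disproportionately discards rows from the smaller group.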