Which statistical parameters are more useful for detecting anomalies and outliers: mean, max, min, variance?

This time series contains several time frames, each of which is 8K (frequencies) * 151 (time samples) in 0.5 sec (overall 1.2288 million samples per half second). I need to find anomalies across the different rows (frequencies) and report which rows (frequencies) are anomalous (an unsupervised learning method). Do you have an idea of which statistical parameter is more useful for this: mean, max, min, median, variance, or any other parameter of these 151 samples? Which parameter should I use? (I …
Category: Data Science

How to fit a model on validation_data?

Can you help me understand this better? I need to detect anomalies, so I am trying to fit an LSTM model using validation_data, but the losses do not converge. Do they really need to converge? Should the validation data resemble the train data, the test data, or something in between? Also, which value should be lower, loss or val_loss? Thank you!
Category: Data Science

Cross-validation for anomaly detection on time series data

I want to perform k-fold cross-validation in a setting where I have a training dataset consisting of a sequential time series that is fully benign and a test dataset (also a sequential time series) that contains labeled anomalies. I already took a look at this post, but as my data is sequential, the answer doesn't work out. I am especially stuck on the fact that for k-fold cross-validation, you use (k-1)/k parts of your data for training and 1/k parts …
Category: Data Science

Isolation Forest Score Function Theory

I am currently reading this paper on isolation forests. In the section about the score function, they mention the following. For context, $h(x)$ is defined as the path length of a data point traversing an iTree, and $n$ is the sample size used to grow the iTree. The difficulty in deriving such a score from $h(x)$ is that while the maximum possible height of iTree grows in the order of $n$, the average height grows in the order of $\log(n)$. …
Category: Data Science
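
For readers following the excerpt above, here is a minimal sketch of the normalization the isolation forest paper uses to turn a path length into a score: $s(x, n) = 2^{-E[h(x)] / c(n)}$ with $c(n) = 2H(n-1) - 2(n-1)/n$ and $H(i) \approx \ln(i) + 0.5772156649$. The function names are illustrative and not taken from the question or from any library; the mean path length is assumed to have been averaged over the trees already.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant used in the approximation of H(i)


def harmonic_number(i):
    """Approximate the harmonic number H(i) as ln(i) + gamma."""
    return math.log(i) + EULER_GAMMA


def average_path_length(n):
    """c(n): average path length of an unsuccessful search in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic_number(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values near 1 suggest an anomaly."""
    if n <= 1:
        raise ValueError("n must be greater than 1 so that c(n) is positive")
    return 2.0 ** (-mean_path_length / average_path_length(n))
```

Dividing by $c(n)$ is what makes scores comparable across sample sizes: a point whose mean path length equals $c(n)$ scores exactly 0.5, shorter paths push the score toward 1, and longer paths push it toward 0.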

An unsupervised learning method suitable for large categorical data sets

I want to detect anomalies in a bank data set using an unsupervised learning method. However, in the bank data set, all columns except time and amount are categorical, and about half of them have more than 90 percent missing values. The goal is to detect anomalies in this data set through unsupervised learning. I'm currently using an autoencoder to approach it, but I wonder whether this would work. Also, because the purpose is to detect whether data is abnormal when data comes …
Category: Data Science

Anomaly detection and replacing anomalies with past values in a time series

I am trying to use anomaly detection to find the anomalies in my time series, and if I find any, I will replace them with past values. I'm trying to do this because I want to create an upper and lower bound to replace those anomalies, and using the past values will help me create this bound. Is there any guidance or example where I can learn to do this? Thanks!
Category: Data Science

Incremental learning on Autoencoder for anomaly detection

I want to incrementally train my pre-trained autoencoder model on data being received every minute. Based on this thread, successive calls to model.fit will incrementally train the model. However, the reconstruction error and overall accuracy of my model seem to be getting worse than they initially were. The code looks something like this:

    autoencoder = load_pretrained_model()
    try:
        while True:
            data = collect_new_data()
            autoencoder = train_model(data)  # Invokes autoencoder.fit()
            time.sleep(60)
    except KeyboardInterrupt:
        download_model(autoencoder)
        sys.exit(0)

The mean reconstruction error when my …
Category: Data Science

Decision trees for anomaly detection

Problem: From what I understand, a common method in anomaly detection consists of building a predictive model trained on non-anomalous training data and performing anomaly detection using the model's error when predicting on the observed data. This method requires the user to identify non-anomalous data beforehand. What if it's not possible to label non-anomalous data to train the model? Is there anything in the literature that explains how to overcome this issue? I have an idea, but I was …
Category: Data Science

Is it good to have 100% accuracy on validation?

I'm still new to machine learning. Currently I'm creating an anomaly detector for flight data. It is multivariate time series data that includes the timestamp, latitude, longitude, velocity, and altitude of the aircraft. I'm splitting the data into train and test sets with an 80% ratio. I used a Keras LSTM autoencoder to do the anomaly detection. Here's my code:

    def create_sequence(data, time_step = None):
        Xs = []
        for i in range(len(data) - time_step):
            Xs.append(data[i:(i + time_step)])
        return np.array(Xs)
    # …
Category: Data Science

How to compute threshold?

I would like to detect anomalies in univariate time series data. Most examples on the internet show that, after you predict with the model, you calculate a threshold on the training data and an MAE test loss and compare them to detect anomalies. So I am wondering: is this the correct way of doing it? Shouldn't there be a different threshold value for each dataset? Also, why do all of the examples compute only the MAE loss for anomalies?
Category: Data Science

Anomaly detection - relation between thresholds and anomalies

I'm developing an anomaly detection program in Python. The main idea is to create a new LSTM model every day, train it on the previous 7 days, and predict the next day. Then, using thresholds, find anomalies day by day. I've already implemented that and these thresholds are working well: the upper threshold equals trimmed_mean + (K * interquartile_range) and the lower threshold equals trimmed_mean - (K * interquartile_range), where trimmed_mean and interquartile_range are calculated on prediction error (real curve …
Category: Data Science
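
The thresholds described in the excerpt above can be sketched as follows. This is only an illustration under stated assumptions: the trimming proportion (10% from each tail), the default `k`, and all function names are hypothetical, since the excerpt does not say how the trimmed mean is computed or which value of K is used.

```python
import numpy as np


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the lowest and highest `proportion` of the values (assumed 10%)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    cut = int(proportion * ordered.size)
    return ordered[cut:ordered.size - cut].mean()


def iqr_thresholds(errors, k=3.0, proportion=0.1):
    """Return (lower, upper) = trimmed_mean -/+ k * interquartile_range of the errors."""
    errors = np.asarray(errors, dtype=float)
    q1, q3 = np.percentile(errors, [25, 75])
    center = trimmed_mean(errors, proportion)
    spread = k * (q3 - q1)
    return center - spread, center + spread


def flag_anomalies(errors, k=3.0, proportion=0.1):
    """Boolean mask of prediction errors that fall outside the two thresholds."""
    errors = np.asarray(errors, dtype=float)
    lower, upper = iqr_thresholds(errors, k, proportion)
    return (errors < lower) | (errors > upper)
```

Both the trimmed mean and the interquartile range ignore the extreme tails, so a few large errors do not inflate the thresholds that are meant to catch them. A larger `k` widens the band and flags fewer points.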

An autoencoder setup for anomaly detection

I am doing anomaly detection using machine learning. I have tried different models like isolation forest, SVM and KNN. The maximum accuracy that I can get from each of them is $80\%$ on my dataset, which contains $5$ features and $4000$ data samples, $18\%$ of which are anomalous. When I use an autoencoder and adjust the reconstruction loss threshold properly I can get $92\%$ accuracy, but the hidden layer setup of the autoencoder does not seem right despite the …
Category: Data Science

Is it impossible to predict defects with data that are not labeled?

There is manufacturing data with 10 process variables. The samples are not labeled as normal or bad. It's tabular data. Do you know of a paper that uses only unlabeled data to predict defects or to find the variables that affect them? I thought about using an outlier detection algorithm (Isolation Forest, Autoencoder) to predict defects, but I can't find a way because I don't know the exact defect rate. I can't think of a way to verify it, so I'd …
Category: Data Science

Is there a way to check if I got a "good price" on something?

I'm looking at some data. Actually, the Boston Housing dataset is probably a good proxy for it: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html I'm wondering if there's a way to predict if I got a "good price" given certain conditions. So something like, if I'm given a tuple such as CRIM, ZN, INDUS = (0.006320,18, 2.31), then is a house price of 50 significantly higher or lower than expected? This isn't quite vanilla anomaly detection, because the combination of a particular CRIM, ZN, INDUS may …
Category: Data Science

Detecting abundance of a certain periodic pattern in a time series?

I am really stumped at the moment about how to solve a particular problem. I have many time series like this: This represents the number of hours a person spends on a website each day throughout the year. Any days where they are not seen to be using the website have zero values, rather than missing values. What I really want to do is to calculate a metric telling me to what extent there is a consistent "1 hour per …
Category: Data Science

Unsupervised anomaly detection for univariate high-frequency time series data?

I have a univariate time series (there is a value for each time sample) (sampling time: 66.66 microseconds, number of samples/sampling time = 151) coming from a scala customer. This time series contains several time frames, each of which is 8K (frequencies) * 151 (time samples) in 0.5 sec (overall 1.2288 million samples per half second). I need to find anomalies across the different rows (frequencies) and report which rows (frequencies) are anomalous (an unsupervised learning method). Do you have an …
Category: Data Science

How to detect anomalies?

I have time series data with one value per day for a year (there is one column with temperature data). I am using autoencoders to train a reconstruction model with MSE loss. First, I normalized the data using the following code:

    training_mean = preprocessed_data.mean()
    training_std = preprocessed_data.std()
    df_training_value = (preprocessed_data - training_mean) / training_std

After this I make sequences from the data. I am not sure if it's OK to choose 32 time steps, but otherwise I can't fit the model. …
Category: Data Science
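
The preprocessing described in the excerpt above (z-score normalization followed by cutting the series into fixed-length sequences) might look roughly like this. It is a sketch, not the asker's code: the function names are hypothetical, NumPy is used instead of pandas, and `ddof=1` is chosen to mirror the sample standard deviation that pandas' `.std()` computes by default. The window length of 32 comes from the question.

```python
import numpy as np


def zscore_normalize(values):
    """Standardize a 1-D series using its own mean and sample standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)  # ddof=1 matches the pandas default for .std()
    return (values - mean) / std, mean, std


def make_windows(values, time_steps=32):
    """Cut a 1-D series into overlapping windows of shape (n_windows, time_steps, 1)."""
    values = np.asarray(values, dtype=float)
    n_windows = values.size - time_steps + 1
    if n_windows <= 0:
        raise ValueError("series is shorter than one window")
    windows = np.stack([values[i:i + time_steps] for i in range(n_windows)])
    return windows[..., np.newaxis]  # trailing axis holds the single feature
```

With 365 daily values and 32 time steps this yields 334 overlapping windows. The returned mean and standard deviation should be reused to normalize any later data, so that new observations are scaled the same way as the training series.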

Which machine learning technique can be used for predictive log analysis?

I have log data with 100k records and these parameters. It looks like this. Message types can be helpful for detecting the anomaly type. Out of 15 message types in total, 5 are considered anomalies, e.g. invalid user, connection closed by invalid user. Option 1 - Text classification model: Create a classification model using the message text, which classifies each record based on its message text. But I want to use predictive analytics using date/time parameters so that it can help for …
Category: Data Science
