Various models give 99% accuracy on the KDD Cup 99 dataset for intrusion detection. Is there some sort of data leak I am missing?

Student here who is quite new to all this. I am currently working with the KDD Cup 99 data for intrusion detection using various ML models (and an ANN). My problem is that I am often getting 99% accuracy. At the moment I am focusing mostly on binary classification (normal vs attack). I have identified problems in my data preprocessing methods, and after fixing them I am more confident in the validity of my input data, but I am still getting 99%'s, which I no longer trust (especially since I just got 99% accuracy with an SVM with all default params).

My data should be balanced between the two classes, so I would assume that if the machine was not learning anything it would get around 50% accuracy. I feel like there has got to be a mistake I am making somewhere in here, or am I just underestimating the power of these ML algorithms?

Here are my preprocessing steps (a rough code sketch follows the list):

1-Remove duplicates from the data (about 75% of the dataset is duplicates)

2-Use random undersampling to balance the heavily skewed data by removing normal events (50% normal / 50% attack after this step)

3-Drop 1 feature with 0 variance

4-Shuffle data then split 70/30 train/test

5-One-hot encode input features in the training data that consist of strings (e.g. protocol type = [icmp, tcp, udp]) using CountVectorizer, OR convert this data to simple integer dummy variables.

6-Z-score normalize numerical continuous data columns in training set with StandardScaler

7-Apply these fitted normalization/encoding transformers to the testing set
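Roughly, those steps look like this in code (a simplified sketch; I use scikit-learn's OneHotEncoder here in place of the CountVectorizer/dummy-variable option, and the column names such as 'label', 'protocol_type', 'service' and 'flag' are just the standard KDD Cup 99 ones):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# df: the KDD Cup 99 data with a binary 'label' column (0 = normal, 1 = attack)

# 1. Remove duplicate rows
df = df.drop_duplicates()

# 2. Random undersampling of the majority (normal) class to reach 50/50
normal, attack = df[df["label"] == 0], df[df["label"] == 1]
n = min(len(normal), len(attack))
df = pd.concat([normal.sample(n, random_state=42),
                attack.sample(n, random_state=42)])

# 3. Drop any feature with zero variance
zero_var = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=zero_var)

# 4. Shuffle and split 70/30
train, test = train_test_split(df, test_size=0.3, shuffle=True, random_state=42)
X_train, y_train = train.drop(columns=["label"]), train["label"]
X_test, y_test = test.drop(columns=["label"]), test["label"]

# 5. One-hot encode the string features, fitting on the training data only
cat_cols = ["protocol_type", "service", "flag"]
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_cat = encoder.fit_transform(X_train[cat_cols])

# 6. Z-score the numeric columns, fitting on the training data only
num_cols = [c for c in X_train.columns if c not in cat_cols]
scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train[num_cols])

# 7. Apply the fitted transformers to the test set
X_test_cat = encoder.transform(X_test[cat_cols])
X_test_num = scaler.transform(X_test[num_cols])
```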

My guess is there is some data leakage, or one of the features has a direct/almost 1-to-1 correspondence with the classification.

Last night I was experimenting with some changes to see if anything would bring the accuracy down. Using a KNN, I checked every feature as the only input. Since the dataset is balanced, you would expect about 50% accuracy for each individual feature. Most features, when checked individually, did score around 50%, but two of them stood out: when the only input was the 'flag' feature the model scored 95%, and when the only feature was 'service' it scored 93%. Does this mean these features should be dropped?
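The probe was essentially a loop like this (a sketch; it assumes X_train / X_test are already fully numeric, i.e. the string columns were converted to integer codes as in the dummy-variable option in step 5):

```python
from sklearn.neighbors import KNeighborsClassifier

# Single-feature probe: fit a default KNN on one column at a time and
# score it on the held-out set. On 50/50 balanced data, ~0.5 accuracy
# means the feature alone carries no signal; much higher means it is
# (nearly) predictive of the label on its own.
for col in X_train.columns:
    knn = KNeighborsClassifier()
    knn.fit(X_train[[col]], y_train)
    acc = knn.score(X_test[[col]], y_test)
    print(f"{col}: {acc:.2f}")
```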

Also, I understand that there are flaws with this dataset. Perhaps the 99% is possible because this dataset is not an accurate representation of actual network traffic and is more of a toy dataset?

EDIT: I am wondering if perhaps the model is capable of actually reaching 99% because this data is inherently flawed (see the quote below). I am not sure how to calculate TTL from the given features, but if the model figured that out it could maybe explain the 99% in the binary case. For the multiclass case, maybe this also applies to distinguishing normal traffic from attack attempts, and the remaining features then help the model figure out what type of attack it is from there.

In 2003, Mahoney and Chan built a trivial intrusion detection system and ran it against the DARPA tcpdump data. They found numerous irregularities, including that -- due to the way the data was generated -- all the malicious packets had a TTL of 126 or 253 whereas almost all the benign packets had a TTL of 127 or 254.


There is a serious issue in your approach: the data should not be resampled before splitting into training and test sets. A model should always be evaluated on the "true" distribution of the data.
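For example, a minimal sketch of the correct order (assuming a pandas DataFrame df with a binary label column): split first, keeping the test set at the original class distribution, then undersample only the training portion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Split first; stratification keeps the original class proportions in both parts
train, test = train_test_split(df, test_size=0.3, stratify=df["label"],
                               random_state=42)

# Undersample the majority class on the training part only
normal, attack = train[train["label"] == 0], train[train["label"] == 1]
n = min(len(normal), len(attack))
train_balanced = pd.concat([normal.sample(n, random_state=42),
                            attack.sample(n, random_state=42)])

# Fit the model on train_balanced, but evaluate it on test, which still
# reflects the true (imbalanced) distribution.
```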

So the high performance that you currently obtain is not really meaningful, but we can still try to diagnose it. Note that the high accuracy may actually be correct on the balanced data. But it's also possible that the deduplication doesn't catch all possible cases of duplicates (near-duplicates in particular are hard to catch), causing data leakage into the test set.
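A quick sanity check for exact duplicates across the split (a sketch, assuming pandas DataFrames train and test as above; near-duplicates won't show up here and need a fuzzier comparison):

```python
# Count test rows whose feature values also occur in the training set.
# Any non-zero count means the model has effectively already "seen"
# part of the test set during training.
feature_cols = [c for c in test.columns if c != "label"]
overlap = test.merge(train[feature_cols].drop_duplicates(),
                     on=feature_cols, how="inner")
print(f"{len(overlap)} of {len(test)} test rows also appear in the training data")
```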

The role of some specific features, and whether to include them, depends on the definition of the task:

  • If they can be obtained "in production", whatever that means for the task, there's no reason to throw a good indicator away.
  • If they can't be obtained in the expected production environment, it's clearly a mistake to include them.

In other words, a feature should not be dropped just because it's a good indicator. However, it should be dropped if it doesn't make sense for the model's usage scenario.

Also, it's not a great idea to use resampling at all; there are various questions around DSSE about this, but that is not the topic here.
