Various models give 99% accuracy on the KDDcup 99 dataset for intrusion detection; is there some sort of data leak I am missing?
Student who is quite new to all this here. I am currently working with the KDDcup 99 data for intrusion detection, using various ML models (and an ANN). My problem is that I am often getting 99% accuracy. At the moment I am focusing mostly on binary classification (normal vs attack). I have identified problems in my data preprocessing methods and, after fixing them, I am more confident in the validity of my input data, but I am still getting 99% scores, which I no longer trust (especially since I just got 99% accuracy with an SVM using all default params).
My data should be balanced between the two classes, so I would assume that if the model was not learning anything it would get around 50% accuracy. I feel like there has to be a mistake I am making somewhere in here, or am I just underestimating the power of these ML algorithms?
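For context, this is roughly the sanity check I have in mind: compare against a dummy baseline and look at the confusion matrix rather than accuracy alone. A minimal sketch; X_train, X_test, y_train, y_test are placeholders for whatever comes out of the preprocessing steps below (already encoded/numeric):

```python
# Compare the real model against a chance-level baseline and inspect the
# confusion matrix instead of accuracy alone.
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train, y_train)
# Should land near 0.50 on a balanced binary problem
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))

svm = SVC()  # default params, same as in my experiment
svm.fit(X_train, y_train)
pred = svm.predict(X_test)
print("SVM accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, digits=3))
```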
Here are my preprocessing steps:
1-Remove duplicates from the data (about 75% of the dataset is duplicates)
2-Use random undersampling to balance the heavily imbalanced data by removing normal events (50% normal / 50% attack after this step)
3-Drop 1 feature with 0 variance
4-Shuffle data then split 70/30 train/test
5-One-hot encode the string-valued input features in the training data (e.g. protocol_type = [icmp, tcp, udp]) using CountVectorizer, OR convert them to simple integer dummy variables
6-Z-score normalize numerical continuous data columns in training set with StandardScaler
7-Apply these fitted normalization/encoding transforms to the test set (see the sketch after this list)
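This is roughly how I handle steps 5-7, sketched with scikit-learn's ColumnTransformer so that everything is fitted on the training split only and then reused on the test split. OneHotEncoder stands in here for the CountVectorizer/dummy-variable options, and the column names are just the usual KDD ones, not necessarily how my dataframe is actually labelled:

```python
# Fit encoders/scaler on the training data only, then reuse them on the test
# split, so no statistics from the test set can leak into preprocessing.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["protocol_type", "service", "flag"]  # usual KDD names
numeric_cols = [c for c in X_train.columns if c not in categorical_cols]

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("zscore", StandardScaler(), numeric_cols),
])

X_train_enc = preprocess.fit_transform(X_train)  # fit only on training data
X_test_enc = preprocess.transform(X_test)        # reuse the fitted transforms
```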
My guess is that there is some data leakage, or that one of the features has a direct or nearly one-to-one correspondence with the label.
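One way I could test the "one feature nearly determines the label" idea is to cross-tabulate the suspect categorical features against the label on the raw (pre-encoding) training data. A quick sketch, assuming pandas dataframes and the usual KDD column names:

```python
# How close does a single categorical feature come to determining the label?
# Rows that are almost entirely one class point to a near one-to-one mapping.
import pandas as pd

for col in ["flag", "service"]:
    # fraction of normal vs attack within each category value
    print(pd.crosstab(X_train[col], y_train, normalize="index"), "\n")
```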
Last night I experimented with some changes to see if anything would bring the accuracy down. Using a KNN, I checked every feature as the only input. Since the dataset is balanced, each individual feature should score around 50% if it carries no signal. Most features did land near 50% when checked individually, but two stood out: with only the 'flag' feature the model scored 95%, and with only the 'service' feature it scored 93%. Does this mean these features should be dropped?
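The single-feature screen I ran looked roughly like this; it is a sketch that assumes the integer-encoded version of the split from step 5, so every original feature is still a single numeric column:

```python
# Train a KNN on each feature alone and record its test accuracy.
# Anything far above 0.5 dominates the label on its own and is a leakage suspect.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

scores = {}
for col in X_train.columns:
    knn = KNeighborsClassifier()
    knn.fit(X_train[[col]], y_train)
    scores[col] = accuracy_score(y_test, knn.predict(X_test[[col]]))

for col, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col}: {acc:.3f}")
```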
Also, I understand that there are flaws with this dataset. Perhaps 99% is achievable because the dataset is not an accurate representation of real network traffic and is more of a toy dataset?
EDIT: I am wondering whether the model might actually be capable of reaching 99% because this data is inherently flawed (see the quote below). I am not sure how TTL could be computed from the given features, but if the model figured something like that out, it could explain the 99% in the binary case. For multiclass, perhaps the same flaw lets it distinguish normal from attack traffic, and the remaining features then help it figure out which type of attack it is.
In 2003, Mahoney and Chan built a trivial intrusion detection system and ran it against the DARPA tcpdump data. They found numerous irregularities, including that -- due to the way the data was generated -- all the malicious packets had a TTL of 126 or 253 whereas almost all the benign packets had a TTL of 127 or 254.
Topic anomaly-detection beginner machine-learning
Category Data Science