This data set is from the early years of Knowledge Discovery in Databases (KDD)/Data Mining, and it is representative of security and other problems that deal with rare events.
The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.
The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion
detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection
contest uses a version of this dataset.
Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN.
They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.
The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records.
A connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address and a target IP address under some well-defined protocol. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. Attacks fall into four main categories (a short label-mapping sketch follows the list below):
- DOS: denial-of-service, e.g. syn flood;
- R2L: unauthorized access from a remote machine, e.g. guessing a password;
- U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
- probing: surveillance and other probing, e.g. port scanning.
Read more: http://kdd.ics.uci.edu/databases/kddcup99/task.html.
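As an illustration, the following Python sketch maps a few typical KDD'99 connection labels to these four categories. The label strings shown (smurf, guess_passwd, buffer_overflow, portsweep, etc.) are common examples from the data set, not an exhaustive list, and the trailing-dot handling only reflects how labels often appear in the raw files.

```python
# Hedged sketch: map a few typical KDD'99 connection labels to the four
# main attack categories (plus "normal"). The full data set contains many
# more attack types than listed here.
CATEGORY_BY_LABEL = {
    "normal":          "normal",
    "smurf":           "DOS",      # denial-of-service
    "neptune":         "DOS",
    "guess_passwd":    "R2L",      # unauthorized remote access
    "ftp_write":       "R2L",
    "buffer_overflow": "U2R",      # unauthorized root access
    "rootkit":         "U2R",
    "portsweep":       "probing",  # surveillance / port scanning
    "nmap":            "probing",
}

def attack_category(label: str) -> str:
    """Return the main category for a connection label, or 'unknown'."""
    return CATEGORY_BY_LABEL.get(label.rstrip("."), "unknown")

print(attack_category("smurf"))          # -> DOS
print(attack_category("guess_passwd."))  # -> R2L
```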
The Problem of Rare Events.
For data sets with highly imbalanced class distributions, discriminating between positive cases (the rare events of interest, here the attacks or threats) and negative cases (normal events) already becomes practically unresolvable at prevalences below 1%, unless exceptionally high misclassification costs are accepted. At a prevalence of p = 1%, a classification accuracy of 99% is obtained by simply predicting every case as negative; for p = 0.01% this trivial classifier reaches an amazing 99.99%. To have any chance of also identifying positive cases, the misclassification cost of a false negative must be set up to 10,000 times higher than that of a false positive. This naturally, and intentionally (because the FN costs have been increased), leads to significantly more false positives, unless the model is improbably 100% accurate.
Assume, for example, a specificity of 95%, a sensitivity of 50% (both good, hard-to-obtain values) and a population of 100 million cases at p = 0.01%: the result is nearly 5 million false positive cases (suspicious events), while 5,000 of the 10,000 actual threats are identified. In other words, to filter out 5,000 threats one has to accept (and handle) 5 million cases which are falsely under suspicion, so the problem of discriminating positive from negative events basically remains (now 5,000 out of 5 million). At the same time the predictive power of such a model is very low (PPV = 0.1%, the probability that a case flagged as positive is actually a true positive) and the classification accuracy decreases from 99.99% to 95%. This situation worsens even further for p < 0.01%, and when the effort costs and the consequences of false positive classifications are also taken into account, i.e., when a strict overall cost-benefit analysis is made.
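The arithmetic behind this example can be written out directly. The following minimal Python sketch uses only the figures given above (population 100 million, prevalence 0.01%, sensitivity 50%, specificity 95%) and reproduces the quoted false positive count, PPV and accuracy:

```python
# Worked example: rare-event classification at prevalence 0.01%
population = 100_000_000   # total number of cases
prevalence = 0.0001        # 0.01% positive (threat) cases
sensitivity = 0.50         # true positive rate
specificity = 0.95         # true negative rate

positives = population * prevalence   # 10,000 actual threats
negatives = population - positives    # 99,990,000 normal cases

tp = sensitivity * positives          # 5,000 threats identified
fn = positives - tp                   # 5,000 threats missed
tn = specificity * negatives          # ~95 million correctly cleared
fp = negatives - tn                   # ~5 million false alarms

ppv = tp / (tp + fp)                  # precision, ~0.1%
accuracy = (tp + tn) / population     # ~95%

print(f"false positives: {fp:,.0f}")
print(f"true positives:  {tp:,.0f}")
print(f"PPV:             {ppv:.2%}")
print(f"accuracy:        {accuracy:.2%}")
```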
Data set dimensions:
- Number of inputs: 37
- Number of samples: 50,000 (+10,000 samples for prediction)
- Prevalence: from 75.1% to 0.004%
Example of a model equation self-organized by Insights:
X42 = 93.3403 * exp(-(V42^2)) - 1.13979
V42 = -0.000194724 * X29 - 2.49948
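Written as code, this self-organized model is a closed-form function of the single input X29. The following Python sketch simply evaluates the two equations above; the example input value is arbitrary and only illustrates the call, and any threshold for turning the score X42 into a class decision would have to come from the Insights output itself.

```python
import math

# The model equation above, expressed as a plain Python function.
# X29 is one of the 37 inputs; X42 is the model's output score.
def model_x42(x29: float) -> float:
    v42 = -0.000194724 * x29 - 2.49948  # intermediate variable V42
    return 93.3403 * math.exp(-(v42 ** 2)) - 1.13979

# Arbitrary, made-up input value purely to illustrate the call:
print(model_x42(500.0))
```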