V-detector: a real-valued negative selection algorithm. Zhou Ji, St. Jude Children's Research Hospital
What is negative selection? Biological background: T cells and the thymus. Major steps: 1. Generate candidate detectors randomly. 2. Eliminate those that recognize self samples.
Main steps: generation, detection
What is a matching rule? It defines when a sample and a detector are considered to match. The matching rule plays an important role in a negative selection algorithm, and it largely depends on the data representation.
In a real-valued representation, a detector can be visualized as a hypersphere. Candidate 1: thrown away; candidate 2: made a detector. Match or not match?
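The matching rule and the two generation steps can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: it assumes Euclidean distance as the matching rule, the unit hypercube as the sample space, and hypothetical names (`matches`, `generate_constant_detectors`, `max_tries`).

```python
import math
import random


def matches(detector_center, detector_radius, sample):
    """Assumed matching rule: the sample lies inside the detector's hypersphere."""
    return math.dist(detector_center, sample) < detector_radius


def generate_constant_detectors(self_samples, detector_radius, n_detectors,
                                dim, rng, max_tries=100_000):
    """Generate constant-sized detectors by random generation and elimination."""
    detectors = []
    for _ in range(max_tries):
        if len(detectors) == n_detectors:
            break
        # Step 1: generate a candidate at random in the unit hypercube.
        candidate = [rng.random() for _ in range(dim)]
        # Step 2: throw the candidate away if it recognizes any self sample.
        if any(matches(candidate, detector_radius, s) for s in self_samples):
            continue
        detectors.append(candidate)
    return detectors


def is_anomalous(sample, detectors, detector_radius):
    """Detection stage: a sample matched by any detector is flagged as non-self."""
    return any(matches(d, detector_radius, sample) for d in detectors)
```

In the slide's picture, candidate 1 corresponds to the `continue` branch and candidate 2 to the `append` branch.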
Main idea of V-detector: by allowing detectors to have variable properties, V-detector enhances the negative selection algorithm in several respects. Fewer, larger detectors are needed to cover the non-self region, saving time and space. Small detectors cover holes better. Coverage is estimated while the detector set is generated. The shapes of detectors, or even the types of matching rules, can be made variable too.
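A rough sketch of variable-sized detector generation follows. The slide text does not spell out the procedure, so the details here are assumptions: each detector's radius is the distance to the nearest self sample minus the self radius, and generation stops once `1 / (1 - target_coverage)` consecutive random candidates fall inside existing detectors, which serves as a simple coverage estimate. The function and parameter names are hypothetical.

```python
import math
import random


def generate_v_detectors(self_samples, self_radius, target_coverage,
                         dim, rng, max_tries=100_000):
    """Sketch of variable-sized detector generation (target_coverage < 1)."""
    # Assumed stopping rule: this many consecutive covered candidates.
    needed = math.ceil(1.0 / (1.0 - target_coverage))
    detectors = []      # list of (center, radius) pairs
    covered_run = 0     # consecutive candidates already covered
    for _ in range(max_tries):
        x = [rng.random() for _ in range(dim)]
        if any(math.dist(x, c) < r for c, r in detectors):
            covered_run += 1
            if covered_run >= needed:
                break   # estimated coverage has reached the target
            continue
        # Largest radius that keeps the detector out of every self hypersphere.
        radius = min(math.dist(x, s) for s in self_samples) - self_radius
        if radius > 0:  # candidates inside the self region are discarded
            detectors.append((x, radius))
            covered_run = 0
    return detectors
```

Under these assumptions, candidates far from the self samples become large detectors, while candidates close to the self boundary become small ones that fill the remaining holes.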
Main concept of negative selection and V-detector. Panels: constant-sized detectors; variable-sized detectors.
Outline of the algorithm (generation of variable-sized detector set)
Detector set generation algorithm. Panels: constant-sized detectors; variable-sized detectors.
Screenshots of the software: message view; visualization of data points and detectors.
Experiments and Results
Synthetic data: 2D. Training data are randomly chosen from the normal region.
Fisher's Iris data: one of the three types is considered normal.
Biomedical data: abnormal data are the medical measurements of disease-carrier patients.
Air pollution data: abnormal data are made by artificially altering the normal air measurements.
Ball bearing data: time series measurements with preprocessing (30D and 5D).
Synthetic data - Cross-shaped self space: shape of self region and example detector coverage. (a) Actual self space (b) self radius = 0.05 (c) self radius = 0.1
Synthetic data - Cross-shaped self space: results. Panels: detection rate and false alarm rate; number of detectors.
Error rates
Synthetic data - Ring-shaped self space: shape of self region and example detector coverage. (a) Actual self space (b) self radius = 0.05 (c) self radius = 0.1
Synthetic data - Ring-shaped self space: results. Panels: detection rate and false alarm rate; number of detectors.
Iris data: comparison with other methods (performance)
Columns: detection rate, false alarm rate
Setosa 100%: MILA; NSA (single level) 100, 0; V-detector
Setosa 50%: MILA; NSA (single level); V-detector
Versicolor 100%: MILA; NSA (single level); V-detector
Versicolor 50%: MILA; NSA (single level); V-detector
Virginica 100%: MILA; NSA (single level); V-detector
Virginica 50%: MILA; NSA (single level); V-detector
Iris data: comparison with other methods (number of detectors)
Columns: mean, max, min, SD
Rows: Setosa 100%; Setosa 50%; Versicolor 100%; Versicolor 50%; Virginica 100%; Virginica 50%
Iris data: Virginica as normal, 50% of points used to train. Panels: detection rate and false alarm rate; number of detectors.
Biomedical data: blood measurements for a group of 209 patients. Each patient has four different types of measurement. 75 patients are carriers of a rare genetic disorder; the others are normal.
Biomedical data: results comparison
Columns: Training data | Algorithm | Detection rate (mean, SD) | False alarm rate (mean, SD) | Number of detectors (mean, SD)
100% training: MILA * 0; NSA; r= ; r=
% training: MILA * 0; NSA; r = ; r=
% training: MILA * 0; NSA; r= ; r=
Biomedical data. Panels: detection rate and false alarm rate; number of detectors.
Air pollution data
60 original records in total, each with 16 different measurements concerning air pollution. All the real data are considered normal.
More data are made artificially:
1. Decide the normal range of each of the 16 measurements.
2. Randomly choose a real record.
3. Change three randomly chosen measurements within a larger-than-normal range.
4. If any of the changed measurements is out of the normal range, the record is considered abnormal; otherwise it is considered normal.
In total, 1000 records, including the original 60, are used as test data. The original 60 are used as training data.
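The four steps for making artificial records might look like the sketch below. The slide does not say how the normal range is decided or how much larger the altered range is, so both are assumptions here: the normal range is taken as the minimum and maximum over the original records, and the widened range extends it by a hypothetical `widen` fraction on each side.

```python
import random


def make_synthetic_record(records, rng, widen=0.5, n_changes=3):
    """Return (record, is_abnormal) built from one randomly chosen real record."""
    n_features = len(records[0])
    # Step 1 (assumption): normal range = observed min/max of each measurement.
    lows = [min(r[j] for r in records) for j in range(n_features)]
    highs = [max(r[j] for r in records) for j in range(n_features)]
    # Step 2: randomly choose a real record and copy it.
    record = list(rng.choice(records))
    abnormal = False
    # Step 3: change three randomly chosen measurements within a wider range.
    for j in rng.sample(range(n_features), n_changes):
        span = highs[j] - lows[j]
        record[j] = rng.uniform(lows[j] - widen * span, highs[j] + widen * span)
        # Step 4: any changed value outside the normal range marks the record abnormal.
        if not (lows[j] <= record[j] <= highs[j]):
            abnormal = True
    return record, abnormal
```

Calling this 940 times and adding the 60 original records would give a test set of the size described above.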
Air pollution data. Panels: detection rate and false alarm rate; number of detectors.
Ball bearing data
Raw data: time series of acceleration measurements
Preprocessing (from time domain to representation space for detection):
1. FFT (Fast Fourier Transform) with Hanning windowing: window size
2. Statistical moments: up to 5th order
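Both preprocessing options can be illustrated on a single window of the time series. This is only a sketch under stated assumptions: a naive DFT stands in for an FFT so the example needs no external library, the window length is whatever segment is passed in (at least two samples), and the moments are taken to be the mean followed by central moments of order 2 to 5. The function names are hypothetical.

```python
import cmath
import math


def hann_window(n):
    """Symmetric Hann (Hanning) window of length n, for n >= 2."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(segment, n_coeffs):
    """Magnitudes of the first n_coeffs DFT coefficients of a Hann-windowed segment."""
    n = len(segment)
    windowed = [s * w for s, w in zip(segment, hann_window(n))]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n_coeffs)
    ]


def statistical_moments(segment, max_order=5):
    """Mean, then central moments of order 2..max_order (assumed definition)."""
    n = len(segment)
    mean = sum(segment) / n
    features = [mean]
    for order in range(2, max_order + 1):
        features.append(sum((x - mean) ** order for x in segment) / n)
    return features
```

Each window of the raw signal then becomes one point in the representation space, for example a vector of DFT magnitudes or a five-element vector of moments.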
Example of data (raw data of new bearings) --- first 1000 points
Example of data (FFT of new bearings) --- first 3 coefficients of the first 100 points
Example of data (statistical moments of new bearings) --- moments up to 3rd order of the first 100 points
Ball bearings structure and damage Damaged cage
Ball bearing data: results
Preprocessed with FFT
Ball bearing condition | Total number of data points | Number of detected anomalies | Percentage detected
New bearing (normal) | 2739 | 0 | 0%
Outer race completely broken | | |
Broken cage with one loose element | | |
Damaged cage, four loose elements | | |
No evident damage; badly worn | | |
Preprocessed with statistical moments
Ball bearing condition | Total number of data points | Number of detected anomalies | Percentage detected
New bearing (normal) | 2651 | 0 | 0%
Outer race completely broken | | |
Broken cage with one loose element | | |
Damaged cage, four loose elements | 2892 | 0 | 0%
No evident damage; badly worn | 2892 | 0 | 0%
Ball bearing data: performance summary
New development of this work: a new algorithm to generate variable-sized detectors
Purpose: reduce possible false negatives at the boundary of the self region.
Why the issue exists: some self samples may be very close to the boundary.
Main idea: differentiate between internal self samples and boundary self samples.
Solution: combine the advantages of the algorithms described previously for generating variable-sized and constant-sized detectors.
How much one sample tells
Samples may be on boundary
In terms of detectors
Comparing three methods (self radius = 0.05). Panels: constant-sized detectors; V-detector; new algorithm.
Comparing three methods (self radius = 0.1). Panels: constant-sized detectors; V-detector; new algorithm.
Work ongoing: estimate of coverage using formal statistics. A point estimate is the simplest method. Two types of statistical inference: 1. Confidence interval 2. Hypothesis testing
Point estimate of proportion
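As an illustration of a point estimate of a proportion, coverage can be estimated by sampling points and counting how many are covered. The sketch below also attaches a normal-approximation confidence interval, which is one common choice and not necessarily the one used in this work. The names `estimate_coverage` and `is_covered`, and the default `z = 1.96` (roughly a 95% interval), are assumptions.

```python
import math
import random


def estimate_coverage(is_covered, sample_points, z=1.96):
    """Return (p_hat, (low, high)) for the proportion of covered sample points."""
    n = len(sample_points)
    x = sum(1 for p in sample_points if is_covered(p))
    p_hat = x / n  # point estimate of the proportion
    # Normal-approximation interval, clipped to [0, 1].
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))
```

Here `is_covered` would typically test whether a point falls inside any detector, and `sample_points` would be drawn at random from the region whose coverage is of interest.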
Summary
1. V-detector uses fewer detectors to obtain similar coverage.
2. Smaller detectors are more acceptable if the total number of detectors is largely controlled.
3. Coverage estimate is superior to a fixed number of detectors.
4. V-detector deals better with high-dimensional data, including time series.
5. Self radius and estimated coverage are the two control parameters in V-detector.
6. Variable size, variable shape, variable matching rules, or other variable properties of detectors provide encouraging opportunities to enhance the negative selection mechanism.