Copyright © 2006, Brigham S. Anderson FDA Project: Anomaly and Temporal Pattern Detection Brigham Anderson Robin Sabhnani Adam Goode Alice Zheng Artur Dubrawski
2 OUTLINE Client Data Solutions Anomaly Detector Temporal Pattern Detector
3 The Players eLEXNET: Electronic Laboratory Exchange Network NBIS: National Bio-surveillance Integration System The Department of Homeland Security has “asked” the FDA to submit relevant eLEXNET data to NBIS.
4 Electronic Laboratory Exchange Network National Bio-surveillance Integration System Auton Lab SAIC?
5 Scenarios Scenario #1: Anchovy + Mercury
6 Anchovy/Mercury Summarization Report to FDA analyst…
7 Scenarios Scenario #1: Anchovy + Mercury Scenario #2: OJ + Salmonella
8 OK, so what does the data look like?
9 DATA Samples of food products Sample ID Collection Date Product Code Country Code Zip Code Reason Collected Human Illness? On order of 10,000 different products
10 DATA Each Sample consists of multiple Tests “Analyte” Detection Lab ID Test Method … Estimated 5,000 different analytes
11 Example Sample #223591: 2/18/2005 Coffee/Tea Analyte: Salmonella spp Detect: Negative Analyte: Staphylococcus aureus Detect: Negative Analyte: Bacillus cereus Detect: Negative
12 Data (Show spreadsheet)
13 Data Time span: 1999-present Number of records: 300K to 1 M? Missing data? …Only a few in the sample datasets provided. Different types of tests: Microbials Mycotoxins Pesticides Dyes …
14 Data Stream About 1200 Microbial tests submitted per week Tests are not submitted regularly!
15 Anomaly Detector Temporal Pattern Detector
16 What is an Anomaly? An irregularity that cannot be explained by simple domain models and knowledge Anomaly detection only needs to learn from examples of “normal” system behavior. Classification, on the other hand, would need examples labeled “normal” and “not-normal”
17 Anomaly Detectors in Practice Monitoring computer networks for attacks. Looking for suspicious activity in bank transactions Detecting unusual eBay selling/buying behavior.
18 Simple FDA Anomaly Detection GIVEN: 1 test = 1 record The relevant features of a test are Product Analyte Detect PROBLEM: For each test, compute P(product,analyte,detect) and explain it.
19 Simple Anomaly Detector Suppose we estimate all the probabilities from data: P(Meat,EColi,N) = P(Meat,EColi,Y) = P(Meat,Salmonella,N) = P(Meat,Salmonella,Y) = P(Apple,Vibrosa,N) = P(Apple,Vibrosa,Y) = P(Apple,Listeria,N) = P(Apple,Listeria,Y) = P(Product,Analyte,Detect) =
20 Simple Anomaly Detector How likely is ? Could not be easier! Just look up the entry in the joint probability table (JPT)! Smaller numbers are more anomalous because the model is more surprised to see them.
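The lookup idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the toy records, the function names `build_jpt` and `score`, and the choice to give unseen triples probability 0.0 are all hypothetical and not taken from the project.

```python
from collections import Counter

# Hypothetical toy records: one (product, analyte, detect) triple per test.
tests = [
    ("Meat", "EColi", "N"),
    ("Meat", "EColi", "N"),
    ("Meat", "EColi", "Y"),
    ("Meat", "Salmonella", "N"),
    ("Apple", "Listeria", "N"),
    ("Apple", "Listeria", "N"),
    ("Apple", "Listeria", "N"),
    ("Apple", "Listeria", "Y"),
]

def build_jpt(records):
    """Estimate P(product, analyte, detect) as a relative frequency."""
    counts = Counter(records)
    total = len(records)
    return {triple: n / total for triple, n in counts.items()}

def score(jpt, triple):
    """Look up the probability; unseen triples get 0.0 in this naive version."""
    return jpt.get(triple, 0.0)

jpt = build_jpt(tests)

# Distinct triples ordered from most anomalous (smallest probability) to least.
ranked = sorted(jpt, key=jpt.get)
```

A naive table like this assigns zero probability to anything it has not seen, which is one reason a full table over every triple is hard to estimate from limited data.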
21 Estimating P(product,analyte,detect) There are ~ 10,000 products. There are ~ 5,000 analytes. There are 2 detection outcomes. …so there are ~100M possible triplets. We cannot directly estimate P(product,analyte,detect) from the data…
22 P(product,analyte,detect) P(product,analyte,detect) = P(product) P(analyte|product) P(detect|product,analyte) e.g., P(Anchovy,Mercury,Y) = P(Anchovy) P(Mercury | Anchovy) P(Y | Anchovy, Mercury)
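As a rough illustration of the factorization, the three factors can each be estimated from counts and multiplied together. The records and helper names below are hypothetical, and no smoothing is applied, so an unseen product or pair would cause a division by zero in this sketch.

```python
from collections import Counter

# Hypothetical toy records: (product, analyte, detect).
tests = [
    ("Anchovy", "Mercury", "Y"),
    ("Anchovy", "Mercury", "N"),
    ("Anchovy", "Mercury", "N"),
    ("Anchovy", "Salmonella", "N"),
    ("Apple", "Patulin", "N"),
    ("Apple", "Patulin", "N"),
]

n_total = len(tests)
n_product = Counter(p for p, _, _ in tests)
n_pair = Counter((p, a) for p, a, _ in tests)
n_triple = Counter(tests)

def p_product(p):
    """P(product)"""
    return n_product[p] / n_total

def p_analyte_given_product(a, p):
    """P(analyte | product)"""
    return n_pair[(p, a)] / n_product[p]

def p_detect_given_pair(d, p, a):
    """P(detect | product, analyte)"""
    return n_triple[(p, a, d)] / n_pair[(p, a)]

def joint(p, a, d):
    """Chain rule: P(p, a, d) = P(p) * P(a | p) * P(d | p, a)."""
    return (
        p_product(p)
        * p_analyte_given_product(a, p)
        * p_detect_given_pair(d, p, a)
    )

p_anchovy_mercury_y = joint("Anchovy", "Mercury", "Y")
```

With unsmoothed counts the product of the three factors equals the raw relative frequency of the triple; the factorization pays off once each factor is estimated or smoothed separately.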
23 Product (~10,000 values): 10,000 x 1 vector. Analyte (~5,000 values): 10,000 x 5,000 matrix. Detect (2 values): 10,000 x 5,000 x 2 matrix.
24 Two ways we handle insufficient data: Aggregate Products into “Industries” Dirichlet priors on CPTs
25 Product (~10,000 values) aggregated into Industry (~50 values): 50 x 1 vector. Analyte (~5,000 values): 50 x 5,000 matrix. Detect (2 values): 50 x 5,000 x 2 matrix.
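The effect of aggregation on table size is simple arithmetic, sketched here with the approximate counts quoted in the slides. The function name `table_sizes` is hypothetical.

```python
# Approximate cardinalities quoted in the slides.
n_products, n_industries, n_analytes, n_detect = 10_000, 50, 5_000, 2

def table_sizes(n_parent):
    """Number of cells in each table when the root variable has n_parent values."""
    return {
        "P(parent)": n_parent,
        "P(analyte | parent)": n_parent * n_analytes,
        "P(detect | parent, analyte)": n_parent * n_analytes * n_detect,
    }

by_product = table_sizes(n_products)
by_industry = table_sizes(n_industries)

# How many times smaller the largest table becomes after aggregation.
shrink_factor = (
    by_product["P(detect | parent, analyte)"]
    // by_industry["P(detect | parent, analyte)"]
)
```

With product as the root, the largest table has about 100M cells, far more than the 300K to 1M records mentioned earlier; with industry as the root it has 500,000 cells.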
26 Least Anomalous in 2005 Anomaly Score
27 Most Anomalous in 2005 Anomaly Score
28 Dirichlet priors How we add Dirichlet priors: 1. Before learning the CPTs, assume that we’ve seen every possible combination exactly “once”. 2. Continue learning the network.
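The two steps above amount to add-one smoothing of each conditional probability table. A minimal sketch, assuming a single parent variable and a hypothetical helper `smoothed_cpt`; the industry names and counts are made up for illustration.

```python
from collections import Counter

def smoothed_cpt(pair_counts, parents, children, pseudocount=1.0):
    """P(child | parent) with every (parent, child) cell pre-loaded with a pseudocount.

    pseudocount=1.0 corresponds to pretending each combination was seen once.
    """
    cpt = {}
    for parent in parents:
        denom = (
            sum(pair_counts[(parent, c)] for c in children)
            + pseudocount * len(children)
        )
        for child in children:
            cpt[(parent, child)] = (pair_counts[(parent, child)] + pseudocount) / denom
    return cpt

# Hypothetical observed counts of (industry, detect); "Fruit" has no data at all.
observed = Counter({("Seafood", "Y"): 1, ("Seafood", "N"): 9})
cpt = smoothed_cpt(observed, parents=["Seafood", "Fruit"], children=["Y", "N"])
```

A parent with no data falls back to the uniform prior, and no cell is ever exactly zero, so an unseen combination is scored as unlikely rather than impossible.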
29 Which Abstraction Level? There are about three levels of detail for a given product… E.g., Seafood, Anchovy, Smoked Anchovy. Currently, we use P(Mercury | Seafood) …should we use P(Mercury | Anchovy) instead? …but what if we’ve only seen 4 Anchovy/Mercury tests? Do we use that to estimate P(Mercury | Anchovy)?
30 Which Abstraction Level? There are about three levels of detail for a given product… E.g., Seafood, Anchovy, Smoked Anchovy. IDEA: 1. Build one anomaly detector for each level. 2. Test each sample at all three levels. 3. Choose the most anomalous score. Are you insane? Maybe not… At the lower levels, the anomaly score will tend to be dominated by the prior (and thus produce high probabilities).
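The three-level idea can be sketched as follows. This is a simplified illustration, not the project's detector: it scores only P(detect | level value) for a single analyte, uses add-one smoothing, and relies on hypothetical records and names (`history`, `p_detect`, `most_anomalous`).

```python
from collections import Counter

# Hypothetical history of mercury tests: (industry, product, subproduct, detect).
history = (
    [("Seafood", "Tuna", "Canned Tuna", "N")] * 200
    + [("Seafood", "Tuna", "Canned Tuna", "Y")] * 4
    + [("Seafood", "Anchovy", "Smoked Anchovy", "N")] * 3
)

LEVELS = (0, 1, 2)  # index of industry, product, subproduct in each record

# One table of (level value, detect) counts per abstraction level.
counts = {lvl: Counter((r[lvl], r[3]) for r in history) for lvl in LEVELS}

def p_detect(level, value, detect, pseudocount=1.0):
    """Smoothed P(detect | value) at one abstraction level."""
    c = counts[level]
    total = c[(value, "Y")] + c[(value, "N")] + 2 * pseudocount
    return (c[(value, detect)] + pseudocount) / total

def most_anomalous(sample):
    """Score the sample at every level and return the level with the lowest probability."""
    probs = {lvl: p_detect(lvl, sample[lvl], sample[3]) for lvl in LEVELS}
    best = min(probs, key=probs.get)
    return best, probs

best_level, probs = most_anomalous(("Seafood", "Anchovy", "Smoked Anchovy", "Y"))
```

In this toy data the anchovy levels have only three observations, so the prior pulls their estimates toward 0.5 and the industry level yields the lowest probability, which matches the intuition stated on the slide.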
31 Anomaly Detector Temporal Pattern Detector
32 What is a Temporal Pattern? How do we find the Orange Juice + Salmonella pattern? This is not a daily scan; it is “on-demand”.
33 What is a Temporal Pattern? BASIC PROBLEM #1: Check each product/analyte pair in the last t weeks against the previous t’ weeks for unusual “behavior”. BASIC SOLUTION: Chi-square test for each product/analyte pair:
            Detects   Non-Detects
Recent      O11       O12
Baseline    O21       O22
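For one product/analyte pair, the test on the 2 x 2 table above can be sketched with the standard Pearson chi-square statistic. The function name and the example counts are hypothetical, no continuity correction is applied, and the p-value uses the closed form for one degree of freedom.

```python
import math

def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # degenerate table: treat as no evidence either way
    stat = n * (o11 * o22 - o12 * o21) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: recent weeks 8 detects / 32 non-detects,
# baseline weeks 10 detects / 390 non-detects.
stat, p_value = chi_square_2x2(8, 32, 10, 390)
```

Running this once per product/analyte pair means many tests are performed, so the resulting p-values would need some multiple-testing adjustment before being read as alarms.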
vs Microbials only
35 What is a Temporal Pattern? BASIC PROBLEM #2: Check each product/analyte pair in the last t weeks for any interval of unusual behavior. BASIC SOLUTION: Chi-square test for each product/analyte pair for each interval (Bootstrap to get baseline distribution of best chi-square.)
            Detects                  Non-Detects                  Duration
Inside      #detects_inside, O11     #non-detects_inside, O12     #weeks_inside, O13
Outside     #detects_outside, O21    #non-detects_outside, O22    #weeks_outside, O23
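A rough sketch of the interval scan for a single product/analyte pair follows. It simplifies the slide in two labeled ways: the chi-square uses only the detect and non-detect columns (the duration column is omitted), and the baseline for the best chi-square is obtained by resampling whole weeks with replacement, which is one possible reading of "bootstrap" here. All names and data are hypothetical.

```python
import random

def chi2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 table; 0.0 for degenerate tables."""
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    if denom == 0:
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom

def best_interval(weekly):
    """weekly: list of (detects, non_detects) per week.

    Returns (best_stat, start, end) over all contiguous intervals that
    leave at least one week outside.
    """
    total_d = sum(d for d, _ in weekly)
    total_n = sum(n for _, n in weekly)
    best = (0.0, None, None)
    for start in range(len(weekly)):
        in_d = in_n = 0
        for end in range(start, len(weekly)):
            in_d += weekly[end][0]
            in_n += weekly[end][1]
            if end - start + 1 == len(weekly):
                continue  # nothing left outside the interval
            stat = chi2(in_d, in_n, total_d - in_d, total_n - in_n)
            if stat > best[0]:
                best = (stat, start, end)
    return best

def scan_p_value(weekly, n_resamples=200, seed=0):
    """Compare the observed best chi-square with the best chi-square of resampled series."""
    rng = random.Random(seed)
    observed = best_interval(weekly)
    exceed = 0
    for _ in range(n_resamples):
        resampled = [rng.choice(weekly) for _ in weekly]
        if best_interval(resampled)[0] >= observed[0]:
            exceed += 1
    return observed, (exceed + 1) / (n_resamples + 1)

# Hypothetical series: 23 weeks with a three-week burst of detects in weeks 10 to 12.
weekly = [(1, 99)] * 10 + [(20, 80)] * 3 + [(1, 99)] * 10
observed, p_value = scan_p_value(weekly)
```

Comparing against the distribution of the best statistic, rather than a single chi-square threshold, accounts for the fact that many overlapping intervals were searched.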
36 Patulin mycotoxin tests on Fruits
37 All years? Microbials only