Anomaly Detection for Scientific Data Mark Schwabacher NASA ARC, Code TI (formerly IC, TC) ROSES Code S & T Workshop February 17, 2005.


1 Anomaly Detection for Scientific Data Mark Schwabacher NASA ARC, Code TI (formerly IC, TC) ROSES Code S & T Workshop February 17, 2005

2 What is Anomaly Detection?
Seeks to find parts of the data (“anomalies”) that are different from the rest of the data.
“Supervised” approaches use examples of anomalies; “unsupervised” approaches do not.

3 How can Anomaly Detection be Applied to Scientific Data?
Examples:
  – Data from Earth-observing satellites
  – Data from telescopes
Direct scientists’ attention to anomalies, which could lead to scientific discoveries
Detect errors, so they can be corrected

4 Example Earth Science Application: Vegetation Data
Joint work with Ranga Myneni of Boston University
Used Leaf Area Index (LAI) and Fraction of Absorbed Photosynthetically Active Radiation (FPAR) products from the Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Terra and Aqua satellites

5 Results
Used MODIS data from one time point at 4 km resolution (7.7 million pixels within Earth’s land area)
Used 4 variables: LAI, FPAR, QA, and latitude
Used an unsupervised, distance-based anomaly detection algorithm
The #1 outlier was in northern Russia and the #2 outlier was in southern New Zealand
Both points had unusually high LAI and FPAR values for their latitudes
Investigation revealed a bug in the software that produced the LAI and FPAR products
The error was corrected, and new versions of the data were made available to the scientific community

6 Algorithm Used: Orca (Distance-Based Outliers)
The main idea is to find points in low-density regions of the feature space
[Figure: an example x at the center of a sphere of radius d]
  – V is the total volume within radius d
  – N is the total number of examples
  – k is the number of examples in the sphere
Joint work with Stephen Bay of ISLE
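The distance-based idea on this slide can be illustrated with a minimal brute-force sketch. It scores each example by the distance d to its k-th nearest neighbor, so a large d means few examples nearby, i.e. a low-density region. One common reading of the slide's symbols is the k-nearest-neighbor density estimate, roughly k / (N · V), but the slide's own formula did not survive in the transcript, so treat that as an assumption. The function names and the toy data below are hypothetical and are not taken from Orca.

```python
import math


def kth_neighbor_distance(points, index, k):
    """Distance from points[index] to its k-th nearest neighbor (brute force)."""
    target = points[index]
    distances = sorted(
        math.dist(target, other)
        for j, other in enumerate(points)
        if j != index  # an example is not its own neighbor
    )
    return distances[k - 1]


def outlier_scores(points, k):
    """Score every example; a larger score means a sparser neighborhood."""
    return [kth_neighbor_distance(points, i, k) for i in range(len(points))]


# Made-up toy data: a tight cluster plus one isolated point.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = outlier_scores(toy_points, k=2)
most_anomalous = max(range(len(toy_points)), key=scores.__getitem__)
print(most_anomalous, scores[most_anomalous])
```

This version computes every pairwise distance, so it always costs on the order of N² distance computations; it is only meant to make the scoring rule concrete.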

7 Orca Algorithm
Based on nested loops
  – For each example, find its nearest neighbors with a sequential scan
Modified with a pruning rule
  – While performing the sequential scan:
      keep track of the closest neighbors found so far
      prune an example once the neighbors found so far indicate that it cannot be a top outlier
Worst case: O(N^2) distance computations
In practice, runs in nearly linear time
Can handle millions of data points
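The nested loop and pruning rule described on this slide can be sketched as follows. This is a simplified illustration, not the Orca implementation: it processes one candidate at a time, uses the distance to the k-th nearest neighbor as the outlier score, and scans the data in a shuffled order (the shuffle is an assumption, not stated on the slide). The function name `top_outliers` and its parameters are hypothetical.

```python
import heapq
import math
import random


def top_outliers(points, k, n_outliers, seed=0):
    """Return the n_outliers examples with the largest k-th neighbor distance."""
    scan_order = list(range(len(points)))
    random.Random(seed).shuffle(scan_order)  # assumption: scan in random order

    cutoff = 0.0  # score of the weakest top outlier found so far
    top = []      # min-heap of (score, index); top[0] is the weakest top outlier

    for i in range(len(points)):
        nearest = []  # max-heap (negated distances) of the k closest so far
        pruned = False
        for j in scan_order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # With k neighbors known, -nearest[0] is an upper bound on the
            # final score; below the cutoff, i cannot be a top outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n_outliers:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_outliers:
            cutoff = top[0][0]

    return sorted(top, reverse=True)  # strongest outlier first


# Made-up toy data: a tight cluster plus one isolated point.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(toy_points, k=2, n_outliers=1))
```

Most examples sit in dense regions, so their inner scan stops after a handful of distance computations once the cutoff has risen. Only candidates in sparse regions need the full scan.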

8 Conclusions
Anomaly detection algorithms can find previously unknown anomalies in large scientific data sets
  – Could lead to scientific discoveries or correction of errors
Different algorithms find qualitatively different anomalies, so it is worth running multiple algorithms
I presented one algorithm (Orca) that runs in nearly linear time, so it can be applied to very large data sets

9 Pruning
Outliers are based on distance to the 3rd nearest neighbor (k=3)
[Figure: sequential scan around an example x, with a circle of radius d]
d is the distance to the 3rd nearest neighbor for the weakest top outlier

