Anomaly Detection for Scientific Data Mark Schwabacher NASA ARC, Code TI (formerly IC, TC) ROSES Code S & T Workshop February 17, 2005.

Anomaly Detection for Scientific Data Mark Schwabacher NASA ARC, Code TI (formerly IC, TC) ROSES Code S & T Workshop February 17, 2005

What is Anomaly Detection? Seek to find parts of the data (“anomalies”) that are different from the rest of the data “Supervised” approaches use examples of anomalies; “unsupervised” approaches do not.

How can Anomaly Detection be Applied to Scientific Data? Examples: –Data from Earth-observing satellites –Data from telescopes Direct scientists’ attention to anomalies – could lead to scientific discoveries Detect errors, so they can be corrected

Example Earth Science Application: Vegetation Data Joint work with Ranga Myneni of Boston University Used Leaf Area Index (LAI) & Fraction Absorbed of Photosynthetically Available Radiation (FPAR) from Moderate Resolution Imaging Spectroradiometer (MODIS) instrument aboard the Terra and Aqua satellites

Results Used MODIS data from one time point at 4 km resolution (7.7 million pixels within Earth’s land area) Used 4 variables: LAI, FPAR, QA, and latitude Used an unsupervised, distance-based anomaly detection algorithm The #1 outlier was in northern Russia and the #2 outlier was in southern New Zealand Both points had unusually high LAI and FPAR values for their latitudes Investigation revealed a bug in the software that produced the LAI and FPAR products Error was corrected, and new versions of the data were made available to the scientific community.

Algorithm Used: Orca (Distance-Based Outliers) The main idea is to find points in low density regions of the feature space x d V is the total volume within radius d N is the total number of examples k is the number of examples in sphere Joint work with Stephen Bay of ISLE

Orca Algorithm Based on nested loops –For each example, find it’s nearest neighbors with a sequential scan Modified with a pruning rule –While performing the sequential scan, Keep track of closest neighbors found so far prune examples once the neighbors found so far indicate that the example cannot be a top outlier Worst case O(N 2 ) distance computations In practice, runs in nearly linear time Can handle millions of data points

Conclusions Anomaly detection algorithms can find previously- unknown anomalies in large scientific data sets Could lead to scientific discoveries or correction of errors Different algorithms find qualitatively different anomalies, so it is worth running multiple algorithms I presented one algorithm (Orca) that runs in nearly linear time so it can be applied to very large data sets

Pruning Outliers based on distance to the 3rd nearest neighbor (k=3) x d sequential scan d is distance to 3 rd nearest neighbor for the weakest top outlier

Anomaly Detection for Scientific Data Mark Schwabacher NASA ARC, Code TI (formerly IC, TC) ROSES Code S & T Workshop February 17, 2005.

Similar presentations

Presentation on theme: "Anomaly Detection for Scientific Data Mark Schwabacher NASA ARC, Code TI (formerly IC, TC) ROSES Code S & T Workshop February 17, 2005."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Anomaly Detection for Scientific Data Mark Schwabacher NASA ARC, Code TI (formerly IC, TC) ROSES Code S & T Workshop February 17, 2005.

Similar presentations

Presentation on theme: "Anomaly Detection for Scientific Data Mark Schwabacher NASA ARC, Code TI (formerly IC, TC) ROSES Code S & T Workshop February 17, 2005."— Presentation transcript:

Similar presentations

About project

Feedback