Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University

Outline – Astroinformatics – Example Application: The LSST Project – New Algorithm for Surprise Detection: KNN-DD

Astronomy: Data-Driven Science = Evidence-based Forensic Science

From Data-Driven to Data-Intensive Astronomy has always been a data-driven science. It is now a data-intensive science: welcome to Astroinformatics! – Data-oriented Astronomical Research = “the 4th Paradigm” – Scientific KDD (Knowledge Discovery in Databases)

Astroinformatics Activities Borne (2010): “Astroinformatics: Data-Oriented Astronomy Research and Education”, Journal of Earth Science Informatics, vol. 3, pp Astro data mining papers: – “Scientific Data Mining in Astronomy” arXiv: – “Data Mining and Machine Learning in Astronomy” arXiv: Virtual Observatory Data Mining Interest Group Astroinformatics 2010 conference, Caltech, June NASA/Ames Conference on Intelligent Data Understanding, October 5-7 Astro2010 Decadal Survey Position Papers: – Astroinformatics: A 21st Century Approach to Astronomy – The Revolution in Astronomy Education: Data Science for the Masses – The Astronomical Information Sciences: Keystone for 21st-Century Astronomy – Wide-Field Astronomical Surveys in the Next Decade – Great Surveys of the Universe

From Data-Driven to Data-Intensive Astronomy has always been a data-driven science. It is now a data-intensive science: welcome to Astroinformatics! – Data-oriented Astronomical Research = “the 4th Paradigm” – Scientific KDD (Knowledge Discovery in Databases): Characterize the known (clustering, unsupervised learning); Assign the new (classification, supervised learning); Discover the unknown (outlier detection, semi-supervised learning) … Benefits of very large datasets: best statistical analysis of “typical” events; automated search for “rare” events → Scientific Knowledge!
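The three KDD modes above map directly onto standard machine-learning tasks. The snippet below is a rough, hedged illustration of that mapping (not code from this talk); the placeholder datasets and the specific estimators (KMeans, KNeighborsClassifier, LocalOutlierFactor) are illustrative choices only.

```python
# Rough illustration (assumed, not from the talk) of the three KDD modes:
# characterize the known, assign the new, discover the unknown.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier, LocalOutlierFactor

rng = np.random.default_rng(0)
X_known = rng.random((500, 5))        # archive of measured features (placeholder data)
y_known = rng.integers(0, 3, 500)     # class labels for the archive (placeholder data)
X_new = rng.random((20, 5))           # newly observed objects

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X_known)        # characterize the known
labels = KNeighborsClassifier().fit(X_known, y_known).predict(X_new)   # assign the new
flags = LocalOutlierFactor().fit_predict(X_known)                      # discover the unknown (-1 = outlier)
```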

Outlier Detection as Semi-supervised Learning Graphic from S. G. Djorgovski

Basic Astronomical Knowledge Problem Outlier detection: (unknown unknowns) – Finding the objects and events that are outside the bounds of our expectations (outside known clusters) – These may be real scientific discoveries or garbage – Outlier detection is therefore useful for: Novelty Discovery – is my Nobel prize waiting? Anomaly Detection – is the detector system working? Science Data Quality Assurance – is the data pipeline working? – How does one optimally find outliers in N-dimensional parameter space? or in interesting subspaces (in lower dimensions)? – How do we measure their “interestingness”?

Outlier Detection has many names: Outlier Detection – Novelty Detection – Anomaly Detection – Deviation Detection – Surprise Detection

Outline – Astroinformatics – Example Application: The LSST Project – New Algorithm for Surprise Detection: KNN-DD

LSST = Large Synoptic Survey Telescope 8.4-meter diameter primary mirror = 10 square degrees of sky! Hello! (design, construction, and operations of telescope, observatory, and data system: NSF) (camera: DOE) (mirror funded by private donors)

LSST in time and space: – When? – Where? Cerro Pachon, Chile (model of the LSST Observatory) LSST Key Science Drivers: Mapping the Universe – Solar System Map (moving objects, NEOs, asteroids: census & tracking) – Nature of Dark Energy (distant supernovae, weak lensing, cosmology) – Optical transients (of all kinds, with alert notifications within 60 seconds) – Galactic Structure (proper motions, stellar populations, star streams, dark matter)

Observing Strategy: One pair of images every 40 seconds for each spot on the sky, then continue across the sky continuously every night for 10 years, with time domain sampling in log(time) intervals (to capture the dynamic range of transients). LSST (Large Synoptic Survey Telescope): – Ten-year time series imaging of the night sky – mapping the Universe! – 100,000 events each night – anything that goes bump in the night! – Cosmic Cinematography! Education and Public Outreach have been an integral and key feature of the project since the beginning – the EPO program includes formal Ed, informal Ed, Citizen Science projects, and Science Centers / Planetaria.

LSST Summary Plan (pending Decadal Survey): – 3-Gigapixel camera – One 6-Gigabyte image every 20 seconds – 30 Terabytes every night for 10 years – 100-Petabyte final image data archive anticipated – all data are public!!! – 20-Petabyte final database catalog anticipated – Real-Time Event Mining: 10,000-100,000 events per night, every night, for 10 yrs – Follow-up observations required to classify these – Repeat images of the entire night sky every 3 nights: Celestial Cinematography

The LSST will represent a 10K-100K times increase in the VOEvent network traffic. This poses significant real-time classification demands on the event stream: from data to knowledge! from sensors to sense!

MIPS model for Event Follow-up MIPS = – Measurement – Inference – Prediction – Steering Heterogeneous Telescope Network = Global Network of Sensors: – Similar projects in NASA, Earth Science, DOE, NOAA, Homeland Security, NSF DDDAS (voeventnet.org, skyalert.org) Machine Learning enables the “IP” part of MIPS: – Autonomous (or semi-autonomous) Classification – Intelligent Data Understanding – Rule-based – Model-based – Neural Networks – Temporal Data Mining (Predictive Analytics) – Markov Models – Bayes Inference Engines

Example: The Thinking Telescope Reference:

From Sensors to Sense From Data to Knowledge: from sensors to sense (semantics) Data → Information → Knowledge

Outline – Astroinformatics – Example Application: The LSST Project – New Algorithm for Surprise Detection: KNN-DD (work done in collaboration with Arun Vedachalam)

Challenge: which data points are the outliers?

Inlier or Outlier? Is it in the eye of the beholder?

3 Experiments

Experiment #1-A (L-TN) Simple linear data stream – Test A Is the red point an inlier or an outlier?

Experiment #1-B (L-SO) Simple linear data stream – Test B Is the red point an inlier or an outlier?

Experiment #1-C (L-HO) Simple linear data stream – Test C Is the red point an inlier or an outlier?

Experiment #2-A (V-TN) Inverted V-shaped data stream – Test A Is the red point an inlier or an outlier?

Experiment #2-B (V-SO) Inverted V-shaped data stream – Test B Is the red point an inlier or an outlier?

Experiment #2-C (V-HO) Inverted V-shaped data stream – Test C Is the red point an inlier or an outlier?

Experiment #3-A (C-TN) Circular data topology – Test A Is the red point an inlier or an outlier?

Experiment #3-B (C-SO) Circular data topology – Test B Is the red point an inlier or an outlier?

Experiment #3-C (C-HO) Circular data topology – Test C Is the red point an inlier or an outlier?

KNN-DD = K-Nearest Neighbors Data Distributions f_K(d[x_i, x_j])

KNN-DD = K-Nearest Neighbors Data Distributions f_O(d[x_i, O])

KNN-DD = K-Nearest Neighbors Data Distributions f_O(d[x_i, O]) ≠ f_K(d[x_i, x_j])

The Test: K-S test Tests the Null Hypothesis: the two data distributions are drawn from the same parent population. If the Null Hypothesis is rejected, then it is probable that the two data distributions are different. This is our definition of an outlier: – The Null Hypothesis is rejected. Therefore… – the data point's location in parameter space deviates in an improbable way from the rest of the data distribution.
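Putting the last few slides together: build the two distance distributions around the test point O, then apply the two-sample K-S test. Below is a minimal Python sketch, assuming a Euclidean metric and SciPy's ks_2samp; the function name knn_dd and the defaults K=10 and alpha=0.05 are illustrative choices, not values fixed by this talk.

```python
# Minimal sketch of KNN-DD, assuming a Euclidean metric and SciPy's
# two-sample K-S test. Function and parameter names are illustrative.
import numpy as np
from scipy.spatial.distance import cdist, pdist
from scipy.stats import ks_2samp

def knn_dd(X, o, K=10, alpha=0.05):
    """Test whether point o is an outlier relative to the data matrix X (N x D)."""
    # Distances d[x_i, O] from every data point x_i to the test point O
    d_to_o = cdist(X, np.atleast_2d(o)).ravel()
    # The K nearest neighbors of O within X
    nn_idx = np.argsort(d_to_o)[:K]
    neighbors = X[nn_idx]
    # f_K: inter-point distances d[x_i, x_j] among the K neighbors
    f_K = pdist(neighbors)
    # f_O: distances d[x_i, O] from the K neighbors to O
    f_O = d_to_o[nn_idx]
    # K-S test of the Null Hypothesis that f_K and f_O share the same parent population
    statistic, p_value = ks_2samp(f_K, f_O)
    return {"p_value": p_value,
            "outlier_index": 1.0 - p_value,   # "outlyingness likelihood"
            "is_outlier": p_value < alpha}    # Null Hypothesis rejected
```

In this sketch the Outlier Index reported later in the results is simply 1-p, and the Outlier Flag corresponds to rejecting the Null Hypothesis at the 5% level.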

Advantages and Benefits of KNN-DD The K-S test is non-parametric –It makes no assumption about the shape of the data distribution or about “normal” behavior –It compares the cumulative distribution of the data values (inter-point distances)
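The plots on the next three slides show exactly this comparison of cumulative distributions. As a rough illustration (the helper names empirical_cdf and ks_statistic are hypothetical, not from the talk), the K-S statistic is the maximum vertical separation between the two empirical CDFs of the inter-point distances:

```python
# Illustrative computation of two empirical CDFs and the K-S statistic
# (their maximum vertical separation); helper names are hypothetical.
import numpy as np

def empirical_cdf(sample, grid):
    """Fraction of `sample` values that are <= each point of `grid`."""
    return np.searchsorted(np.sort(sample), grid, side="right") / len(sample)

def ks_statistic(f_K, f_O):
    # Evaluate both empirical CDFs on the pooled set of observed distances
    grid = np.sort(np.concatenate([f_K, f_O]))
    return np.max(np.abs(empirical_cdf(f_K, grid) - empirical_cdf(f_O, grid)))
```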

Cumulative Data Distribution (K-S test) for Experiment 1A (L-TN)

Cumulative Data Distribution (K-S test) for Experiment 2B (V-SO)

Cumulative Data Distribution (K-S test) for Experiment 3C (C-HO)

Advantages and Benefits of KNN-DD The K-S test is non-parametric –It makes no assumption about the shape of the data distribution or about “normal” behavior KNN-DD: –operates on multivariate data (thus solving the curse of dimensionality) –is algorithmically univariate (by estimating a function that is based only on the distance between data points) –is computed only on a small-K local subsample of the full dataset N (K << N) –is easily parallelized when testing multiple data points for outlyingness
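Because each test only ever touches the K-nearest neighborhood of one candidate point, scoring many candidates is embarrassingly parallel. A hedged sketch using the standard-library multiprocessing pool, reusing the hypothetical knn_dd function from the earlier sketch:

```python
# Sketch only: score many candidate points for outlyingness in parallel.
# Assumes knn_dd() from the earlier sketch is defined at module level.
from functools import partial
from multiprocessing import Pool

def score_candidates(X, candidates, K=10):
    with Pool() as pool:
        # Each candidate is tested independently against its own K-nearest neighborhood
        return pool.map(partial(knn_dd, X, K=K), candidates)
```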

Results of KNN-DD experiments (Outlier Index = 1-p = outlyingness likelihood; Outlier Flag: is p < 0.05?):
– L-TN (Fig. 5a): Linear data stream, True Normal test – False
– L-SO (Fig. 5b): Linear data stream, Soft Outlier test – Potential Outlier
– L-HO (Fig. 5c): Linear data stream, Hard Outlier test – TRUE
– V-TN (Fig. 7a): V-shaped stream, True Normal test – False
– V-SO (Fig. 7b): V-shaped stream, Soft Outlier test – Potential Outlier
– V-HO (Fig. 7c): V-shaped stream, Hard Outlier test – TRUE
– C-TN (Fig. 9a): Circular stream, True Normal test – False
– C-SO (Fig. 9b): Circular stream, Soft Outlier test – TRUE
– C-HO (Fig. 9c): Circular stream, Hard Outlier test – TRUE
The K-S test p-value is essentially the likelihood of the Null Hypothesis.

Future Work – Validate our choices of p and K – Measure the KNN-DD algorithm's learning times – Determine the algorithm's complexity – Compare the algorithm against several other outlier detection algorithms – Evaluate the algorithm's effectiveness on much larger datasets – Demonstrate its usability on streaming data