STATISTICAL METHODS AND DATA MANAGEMENT TOOLS FOR OUTLIER DETECTION IN TRI DATA Dr. Nagaraj K. Neerchal and Justin Newcomer Department of Mathematics and.

Presentation transcript:

STATISTICAL METHODS AND DATA MANAGEMENT TOOLS FOR OUTLIER DETECTION IN TRI DATA Dr. Nagaraj K. Neerchal and Justin Newcomer Department of Mathematics and Statistics, UMBC and Barry Nussbaum Office of Environmental Information, US EPA

Background
Challenges with TRI data:
- The data are self-reported
- Each facility must be compared to its "peers"
Objectives:
- Investigate the use of statistical methods in identifying anomalous data (outliers) in the TRI database
- Develop data management tools to help in the outlier detection process

Comparison Within Peer Groups
- Statistical methods, appropriately modified, are useful in identifying potential outliers which are not evident by examining the total release alone
- High releases do not necessarily indicate problems with the reported data
- Need to cluster the data into groups of peers: facilities in the same "peer group" are expected to have similar release values

Statistical Approach
- Analyze facilities within their own peer groups; facilities reporting on the same chemical are considered "peers"
- Fit an ANOVA model for total emission releases
- There are many ways to estimate the total release: ANOVA model, jackknife technique, ...
- Obtain a residual for a facility by comparing its actual emissions to the predictions based on its peers
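The residual idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a one-way layout in which the fitted value for a chemical is simply the mean release over the facilities reporting on it, and the record layout, facility IDs, and release values are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (facility_id, chemical, reported_release)
records = [
    ("F1", "toluene", 120.0),
    ("F2", "toluene", 100.0),
    ("F3", "toluene", 110.0),
    ("F1", "lead", 4.0),
    ("F4", "lead", 6.0),
]

def peer_group_residuals(records):
    """Residual = reported release minus the mean release of the chemical's peer group."""
    by_chemical = defaultdict(list)
    for _, chemical, release in records:
        by_chemical[chemical].append(release)
    # Assumption: the peer-group prediction is the plain group mean
    means = {chem: sum(vals) / len(vals) for chem, vals in by_chemical.items()}
    return {(fac, chem): rel - means[chem] for fac, chem, rel in records}

residuals = peer_group_residuals(records)
```

A positive residual means the facility reported more than its peers did on average for that chemical; it does not by itself indicate an error.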

Jackknife Technique
- Facilities reporting on chemical j:
- The predicted value of the release of chemical j for facility i:
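The slide's formula is not preserved in this transcript. A common leave-one-out form, assumed here purely for illustration, predicts facility i's release of chemical j as the mean release of the other facilities reporting on chemical j. The function name and sample data are hypothetical, and each facility is assumed to report a given chemical at most once.

```python
from collections import defaultdict

def jackknife_predictions(records):
    """Leave-one-out mean of each chemical's releases, excluding the facility itself."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, chemical, release in records:
        totals[chemical] += release
        counts[chemical] += 1
    predictions = {}
    for facility, chemical, release in records:
        n = counts[chemical]
        if n < 2:
            continue  # a facility with no peers for this chemical gets no prediction
        predictions[(facility, chemical)] = (totals[chemical] - release) / (n - 1)
    return predictions

# Hypothetical records: (facility_id, chemical, reported_release)
records = [
    ("F1", "benzene", 30.0),
    ("F2", "benzene", 10.0),
    ("F3", "benzene", 20.0),
    ("F4", "xylene", 5.0),
]
predictions = jackknife_predictions(records)
```

Because the facility's own value is left out, one extreme report cannot pull its own prediction toward itself, which keeps a large discrepancy visible in the residual.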

Defining a Metric
- Studentized residuals give us a unitless measure to compare facilities reporting on different chemicals:
- For the jackknife technique we have:
- Define a metric that allows us to compare facilities that report on a different number of chemicals:
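The formulas on this slide did not survive in the transcript. The block below is one possible reconstruction under stated assumptions, not the authors' definition: releases of a chemical are treated as independent with a common variance, the prediction is a leave-one-out mean, and the metric M_i is a hypothetical choice (the average squared studentized residual) that stays comparable across facilities reporting on different numbers of chemicals.

```latex
% Assumed notation (not from the slides):
%   y_{ij}          release of chemical j reported by facility i
%   n_j             number of facilities reporting on chemical j
%   \bar{y}_{(i)j}  mean release of chemical j with facility i left out
%   s_{(i)j}        standard deviation of chemical j's releases with facility i left out
\[
\operatorname{Var}\!\left(y_{ij} - \bar{y}_{(i)j}\right)
  = \sigma_j^2\left(1 + \frac{1}{n_j - 1}\right)
  = \sigma_j^2\,\frac{n_j}{n_j - 1}
\]
\[
t_{ij} = \frac{y_{ij} - \bar{y}_{(i)j}}{s_{(i)j}\sqrt{n_j/(n_j - 1)}}
\]
% Hypothetical metric for facility i, which reports on the set C_i of k_i chemicals:
\[
M_i = \frac{1}{k_i}\sum_{j \in C_i} t_{ij}^2
\]
```

Dividing by the estimated standard deviation removes the units of each chemical, and averaging over k_i keeps facilities that report on many chemicals from scoring high merely because they contribute more terms.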

Flagging the Outliers
- Further investigate the facilities corresponding to extreme values of a defined metric:
- Outliers are not necessarily wrong, just a place to look

Flagging the Outliers
- Is picking out the top 5 enough?
  - Quick and easy
  - May not sufficiently represent all outliers
- Can we set a cutoff point?
  - Define c such that if the metric for facility i exceeds c, then we consider facility i an outlier
  - Theoretical work can be done to examine properties of these metrics
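The two flagging rules can be compared side by side. This is a sketch under assumptions: the metric values are made up, the function names are hypothetical, and "exceeds c" is read as a strict inequality because the slide's exact condition is missing from the transcript.

```python
def top_n(metric_by_facility, n=5):
    """Facilities with the n largest metric values, largest first."""
    ranked = sorted(metric_by_facility, key=metric_by_facility.get, reverse=True)
    return ranked[:n]

def above_cutoff(metric_by_facility, c):
    """Facilities whose metric strictly exceeds the cutoff c (assumed rule)."""
    return sorted(f for f, m in metric_by_facility.items() if m > c)

# Hypothetical metric values for seven facilities
metrics = {"F1": 0.4, "F2": 9.1, "F3": 1.2, "F4": 0.7, "F5": 6.3, "F6": 0.9, "F7": 5.8}

flagged_top5 = top_n(metrics, 5)
flagged_cutoff = above_cutoff(metrics, 5.0)
```

With these values the top-5 rule flags two facilities with unremarkable metrics simply to fill the list, while the cutoff rule flags only the three clearly extreme ones. In a year with many anomalies the top-5 rule would instead miss some, which is the limitation the slide raises.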

Flagging the Outliers
- Are there other metrics or distances we can use to compare facilities?
- A multivariate analogue to the metric defined previously:
- This distance depends on the number of chemicals a facility reports on (k)
- Conditionally, given k, this distance follows an F-distribution
- What can we say about the marginal distribution?
- From here if we estimate we can use the percentiles of the noncentral F-distribution to find a cutoff point c

Flagging the Outliers
- We can consider,
- Estimate from the observed distribution of the number of chemicals being reported on:
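The expressions on this slide are missing from the transcript. One natural reading, offered only as an assumption, is that the marginal distribution of the distance is obtained by mixing its conditional distribution over the number of chemicals reported, with the mixing weights estimated from the observed counts. The symbols below are introduced for illustration and are not taken from the slides.

```latex
% Assumed notation (not from the slides):
%   D_i   distance for facility i
%   K_i   number of chemicals facility i reports on
%   N_k   number of facilities reporting on exactly k chemicals
%   N     total number of facilities
\[
P(D_i > c) = \sum_{k} P(D_i > c \mid K_i = k)\,P(K_i = k),
\qquad
\hat{P}(K_i = k) = \frac{N_k}{N}
\]
% Cutoff c for a chosen tail probability \alpha (a hypothetical tuning value):
\[
\sum_{k} \hat{P}(K_i = k)\,P(D_i > c \mid K_i = k) = \alpha
\]
```

The first identity is the law of total probability; each conditional tail probability would come from the F-distribution that applies for the given k, so c can be found numerically once the weights are estimated.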

TRI Trend Tool
- The TRI Trend Tool provides an easy way to acquire yearly emissions data on TRI facilities from 1995 through 2003
- Allows the user to group the data by Chemical, State, and SIC Codes
- Provides total emissions data over all facilities or individual data for each facility
- Incorporates metrics that allow the user to compare facilities regardless of type or number of chemicals being reported on

Functionality
- The tool allows the user to obtain subsets of facilities, calculate totals, and identify possible outliers

Grouping Variables
- The user can group records by Chemical, SIC Code, State, or any combination of the three
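Grouping by any combination of the three variables amounts to keying each record on a tuple of the chosen fields. The sketch below is not the TRI Trend Tool's code: the field names, SIC codes, and release values are hypothetical.

```python
from collections import defaultdict

# Hypothetical facility-level records
records = [
    {"facility": "F1", "chemical": "toluene", "sic": "2821", "state": "MD", "release": 120.0},
    {"facility": "F2", "chemical": "toluene", "sic": "2821", "state": "VA", "release": 80.0},
    {"facility": "F3", "chemical": "toluene", "sic": "3312", "state": "MD", "release": 40.0},
    {"facility": "F1", "chemical": "lead", "sic": "2821", "state": "MD", "release": 10.0},
]

def total_by(records, keys):
    """Total release for each combination of the given grouping fields."""
    totals = defaultdict(float)
    for rec in records:
        group = tuple(rec[k] for k in keys)  # e.g. ("MD", "2821")
        totals[group] += rec["release"]
    return dict(totals)

by_chemical = total_by(records, ["chemical"])
by_state_and_sic = total_by(records, ["state", "sic"])
```

Passing one, two, or three field names covers every grouping the slide lists without separate code paths for each combination.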

Create Multiple Subsets
- The tool provides data over multiple years
- Subsets can be requested for multiple levels of a grouping variable

Example 1

Create Multiple Subsets
- Subsets can be created using multiple grouping variables

Example 2

Create Multiple Subsets
- The user can view and save facility-level data

Example 3

Incorporating the Metrics
- The user can identify the top 5 facilities with extreme values of the metrics defined previously
- The outlier detection process can be refined by grouping by State and/or SIC Code

Example 4