Using cluster analysis for Identifying outliers and possibilities offered when calculating Unit Value Indices OECD NOVEMBER 2011 Evangelos Pongas.

1 Objectives of the presentation
Present the outlier detection methods used by Eurostat unit G5 for detailed international trade in goods statistics (ITGS)
Present current investigations into cluster analysis methods and the possibilities they offer for improving unit value indices

2 Three main outlier detection methods used
Outliers in the main characteristics of the distribution of detailed data
Hidiroglou and Berthelot method
K-means clustering

3 Distribution characteristics of monthly detailed data – step 1
For each month, over a period of 12 to 24 months, calculate from the detailed data:
– Mean
– Standard deviation
– Maximum and minimum
– Skewness and kurtosis
– Count of records
Construct seven time series, one per characteristic
Standardise each time series by subtracting its mean and dividing by its standard deviation
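A minimal sketch of step 1, assuming each month's detailed records are available as a plain list of numeric values. The function names (`describe`, `standardise`, `characteristic_series`) and the use of population formulas for standard deviation, skewness and excess kurtosis are assumptions made for illustration. The slides do not say which estimators Eurostat uses.

```python
import statistics as st


def describe(values):
    """Return the seven distribution characteristics of one month's records."""
    n = len(values)
    mean = st.fmean(values)
    sd = st.pstdev(values)
    # Moment-based skewness and excess kurtosis (population formulas, an
    # assumption); both are set to 0.0 when the month has no spread.
    if sd == 0:
        skew = kurt = 0.0
    else:
        skew = sum(((v - mean) / sd) ** 3 for v in values) / n
        kurt = sum(((v - mean) / sd) ** 4 for v in values) / n - 3.0
    return {
        "mean": mean,
        "std": sd,
        "max": max(values),
        "min": min(values),
        "skewness": skew,
        "kurtosis": kurt,
        "count": n,
    }


def standardise(series):
    """Subtract the mean and divide by the standard deviation of a series."""
    m = st.fmean(series)
    s = st.pstdev(series)
    return [0.0 if s == 0 else (x - m) / s for x in series]


def characteristic_series(monthly):
    """Map {month: [values]} to {characteristic: standardised time series}."""
    months = sorted(monthly)
    stats = [describe(monthly[m]) for m in months]
    return {name: standardise([s[name] for s in stats]) for name in stats[0]}
```

Calling `characteristic_series` on 12 to 24 months of data yields the seven standardised series that the next step screens for outliers.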

4 Distribution characteristics of monthly detailed data – step 2
Apply classical (mean, standard deviation) and robust (median, quartile-based robust deviation) methods to detect outliers
Calculate z-scores: how many standard deviations each element of the time series lies from the centre of the distribution (the mean)
For the N(0,1) distribution, about 99.7% of z-scores lie between -3 and 3; elements outside that range are considered outliers
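The two kinds of z-score can be sketched as follows. The classical version follows the slide directly. The robust version centres on the median and scales by the interquartile range divided by 1.349 (the IQR of a standard normal distribution), which is one common choice and an assumption here, since the slide does not define its robust deviation. All function names are hypothetical.

```python
import statistics as st


def classical_z(series):
    """z-scores based on the mean and the standard deviation."""
    m = st.fmean(series)
    s = st.pstdev(series)
    return [0.0 if s == 0 else (x - m) / s for x in series]


def robust_z(series):
    """z-scores based on the median and a quartile-based scale (assumption)."""
    q1, med, q3 = st.quantiles(series, n=4, method="inclusive")
    scale = (q3 - q1) / 1.349  # IQR of N(0,1) is about 1.349
    return [0.0 if scale == 0 else (x - med) / scale for x in series]


def flag_outliers(z_scores, threshold=3.0):
    """Return the positions whose absolute z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores) if abs(z) > threshold]
```

Because an extreme value inflates the mean and standard deviation it is compared against, the classical score of a single spike stays close to the threshold in short series, while the robust score is far less affected.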

5 Distribution characteristics of monthly detailed data – step 3

6 Distribution characteristics of monthly detailed data – conclusions
Fast execution: about 2 hours for all EU Member States
Decision support: publish or do not publish
Detection of procedural errors: missing records, generalised errors, empty records

10 Hidiroglou and Berthelot method
Selection of data blocks covering at least one year of monthly data
– By product, partner, flow
– Possibly by mode of transport
Linear transformation of data
Application of a robust outlier method based on the median and the first and third quartiles
Weight the importance of the specific data
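A minimal sketch of the Hidiroglou and Berthelot procedure as it is usually described in the literature, applied to one data block with strictly positive values for two consecutive periods. The slide only names the steps, so the exact transformation Eurostat applies may differ. The parameter values `U`, `A` and `C` and the function name `hb_outliers` are assumptions chosen for illustration.

```python
import statistics as st


def hb_outliers(previous, current, U=0.4, A=0.05, C=4.0):
    """Return positions flagged by a Hidiroglou-Berthelot style test.

    previous, current: strictly positive values of the same records in two
    consecutive periods. U, A and C are tuning parameters (assumed values).
    """
    ratios = [c / p for p, c in zip(previous, current)]
    r_med = st.median(ratios)
    # Transformation that treats increases and decreases symmetrically.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]
    # Weight each transformed ratio by the size of the record.
    e = [si * max(p, c) ** U for si, p, c in zip(s, previous, current)]
    q1, e_med, q3 = st.quantiles(e, n=4, method="inclusive")
    # Quartile distances, floored by A * |median| to avoid a zero-width band.
    d_q1 = max(e_med - q1, abs(A * e_med))
    d_q3 = max(q3 - e_med, abs(A * e_med))
    lower, upper = e_med - C * d_q1, e_med + C * d_q3
    return [i for i, ei in enumerate(e) if ei < lower or ei > upper]
```

The exponent `U` controls how strongly large records are favoured: with `U = 0` only the size of the change matters, while larger values push changes in large records further from the centre.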

11 Hidiroglou and Berthelot method: conclusions
Univariate method, easy to apply
Errors are ordered according to importance
Weights the importance of the specific outlying data
Often detects outliers erroneously when variance is high
Cannot detect records that violate the correlation structure of the data

12 Detection of outliers with the k-means clustering method: step 1
Selection of data blocks covering at least one year of monthly data
– By product, partner, flow
– Possibly by mode of transport
Normalization of data
Application to raw data and to ratios

13 Detection of outliers with the k-means clustering method: step 2
Application of k-means clustering for 2 to 5 clusters
Selection of the best number of clusters based on R-square: it must exceed 50%, and a higher number of clusters is chosen only when it improves R-square by more than 10%
Detect outlying clusters containing a small number of records
Apply a distance function to confirm outliers
Same approach for inliers; a distance function similar to the one used for outliers still needs to be found
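The selection rule above can be sketched as follows, assuming the data block has already been normalised and is given as a list of equal-length numeric tuples. This is an illustration, not Eurostat's implementation: the deterministic farthest-point initialisation, the reading of "10% improvement" as an absolute gain of 0.10 in R-square, the 10% cluster-share threshold and all function names are assumptions, and the confirming distance function is omitted.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain k-means with farthest-point initialisation (assumed choice)."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points
        ]
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centres.append(centres[j])  # keep an empty cluster's centre
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres


def r_square(points, labels, centres):
    """Share of total variation explained by the clustering."""
    grand = tuple(sum(c) / len(points) for c in zip(*points))
    total = sum(math.dist(p, grand) ** 2 for p in points)
    within = sum(math.dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))
    return 0.0 if total == 0 else 1 - within / total


def choose_k(points, k_min=2, k_max=5, min_r2=0.5, min_gain=0.10):
    """Step to a higher k while R-square gains more than min_gain."""
    fits = {k: kmeans(points, k) for k in range(k_min, k_max + 1)}
    r2 = {k: r_square(points, *fits[k]) for k in fits}
    k = k_min
    while k < k_max and r2[k + 1] - r2[k] > min_gain:
        k += 1
    if r2[k] <= min_r2:
        return None  # no acceptable clustering found
    return k, fits[k][0], r2[k]


def small_cluster_members(labels, max_share=0.1):
    """Positions of records in clusters holding at most max_share of the data."""
    counts = {lab: labels.count(lab) for lab in set(labels)}
    return [
        i for i, lab in enumerate(labels) if counts[lab] / len(labels) <= max_share
    ]
```

Records returned by `small_cluster_members` are only candidates; in the approach described above they would still be confirmed with a distance function before being treated as outliers.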

14 Detection of outliers with the k-means clustering method: in theory

15 Detection of outliers with the k-means clustering method: in practice (no outliers)

16 Detection of outliers with the k-means clustering method: in practice (with outliers)

17 Other possible uses of the k-means clustering method
Detection of sub-products for classification and index purposes
Cleaning data for index purposes
– No need to define parameters as in other robust methods
– Data grouping according to needs
– Possibility to define indices at a very detailed level
Clusters are stable over time (but not geographically)

18 Thank you for your attention!