Exploring Clustering Applications in Outlier Detection for Administrative Data Elizabeth Ayres Sunday, July 29, 2018.

1 Exploring Clustering Applications in Outlier Detection for Administrative Data
Elizabeth Ayres Sunday, July 29, 2018

2 Overview
- Introduction to the project
- Overview of the methods
- Present preliminary results
- Summary and future work

3 Introduction
- Project: Updating the International Merchandise Trade Program
- Data: Import and export trade transactions
  - Commodity classification (Harmonized Commodity Description and Coding System, or HS), Value, Quantity, Freight, Country of Origin, etc.
- Task: Identify methods to improve current outlier detection methodology on the Quantity variable for import data

4 Harmonized Commodity Description & Coding System (HS)
- HS2: 85 Electrical Machinery and Equipment and Parts Thereof; Sound Recorders and …
- HS4: 8511 Ignition or Starting Equipment; …
- HS6:
  - 851130: distributors and ignition coils of a kind used for spark ignition or compression-ignition internal combustion engines
  - 851140: starter motors and dual purpose starter-generators, of a kind used for spark or compression-ignition internal combustion engines
  - 851150: generators n.e.c. in heading no. 8511, of a kind used for spark or compression-ignition internal combustion engines

5 Method 1 – Subject Matter Groups with Hidiroglou-Berthelot Method
- Group data by: HS10, Business Number, Country of Origin
- Apply Hidiroglou-Berthelot (HB) method for outlier detection on Unit Value (UV), where $UV = \frac{\text{Value}}{\text{Quantity}}$
- Assumption: an outlier w.r.t. UV implies an outlier w.r.t. Quantity
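As a rough illustration of the HB step, the sketch below applies an HB-style rule to unit values within a single group. The slide does not state the parameter values or how the size term was defined, so `U`, `A`, `C`, the use of max(Value, Quantity) as the size term, and the function name `hb_outliers` are all assumptions, and the sample transactions are made up.

```python
from statistics import median, quantiles

def hb_outliers(values, quantities, U=0.4, A=0.05, C=4.0):
    """Flag outliers in unit value (UV = Value / Quantity) with an HB-style rule.

    U, A and C are assumed tuning parameters, not values from the slides.
    """
    ratios = [v / q for v, q in zip(values, quantities)]
    r_med = median(ratios)
    # Symmetrising transformation of the ratios around their median.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]
    # Effect: transformed ratio scaled by a size term (assumed: max(Value, Quantity)).
    e = [si * max(v, q) ** U for si, v, q in zip(s, values, quantities)]
    e_q1, e_med, e_q3 = quantiles(e, n=4, method="inclusive")
    # Quartile distances, floored at |A * median| to avoid overly tight bounds.
    d_q1 = max(e_med - e_q1, abs(A * e_med))
    d_q3 = max(e_q3 - e_med, abs(A * e_med))
    lower, upper = e_med - C * d_q1, e_med + C * d_q3
    return [ei < lower or ei > upper for ei in e]

# Made-up group: unit values near 10, except the last record (UV = 100).
quantities = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
values = [90, 190, 300, 420, 550, 552, 686, 816, 972, 10000]
flags = hb_outliers(values, quantities)
print(flags)
```

In Method 1 a rule of this kind would be run once per (HS10, Business Number, Country of Origin) group, so each group gets its own median and quartiles.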

6 Method 2 – Feature Selection and Clustering with HB Method
- Subset data by HS6
- Run feature selection
  - Group LASSO (Yuan & Lin, 2006), extension of LASSO (Tibshirani, 1996)
- Calculate proximity matrix (exclude UV)
  - DGower distance measure (based on Gower and Legendre, 1986)
- Determine clusters based on variables chosen in feature selection
  - Non-Parametric Density Clustering Method
- Apply HB method to determine outliers in UV within each cluster
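The proximity-matrix step can be pictured with a small Gower-style dissimilarity for mixed numeric and categorical variables. This is a minimal sketch that assumes the DGower measure behaves like the standard Gower dissimilarity (range-scaled absolute differences for numeric variables, simple mismatch for categorical ones). The function name, the field names, and the three records are hypothetical.

```python
def gower_distance_matrix(rows, numeric_cols, categorical_cols):
    """Pairwise Gower-style dissimilarities for records stored as dicts."""
    # Range of each numeric variable, used to scale differences into [0, 1].
    ranges = {}
    for c in numeric_cols:
        col = [r[c] for r in rows]
        ranges[c] = max(col) - min(col)
    n_vars = len(numeric_cols) + len(categorical_cols)
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for c in numeric_cols:
                if ranges[c] > 0:  # a constant column contributes nothing
                    total += abs(rows[i][c] - rows[j][c]) / ranges[c]
            for c in categorical_cols:
                total += 0.0 if rows[i][c] == rows[j][c] else 1.0
            dist[i][j] = dist[j][i] = total / n_vars
    return dist

# Hypothetical records; UV is deliberately left out, as in Method 2.
rows = [
    {"freight": 10.0, "mode": "air", "origin": "US"},
    {"freight": 20.0, "mode": "air", "origin": "CN"},
    {"freight": 30.0, "mode": "sea", "origin": "CN"},
]
dist = gower_distance_matrix(rows, ["freight"], ["mode", "origin"])
print(dist)
```

Only the variables retained by feature selection would be passed in, which is what ties the clustering to the Group LASSO step.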

7 Method 3 – Feature Selection and Clustering
- Subset data + feature selection
- Calculate proximity matrix (include UV)
- Apply agglomerative hierarchical clustering
- Calculate $n_c = \max(2, \lceil 0.2n \rceil)$ to determine number of clusters (Loureiro et al., 2004)
- Identify clusters of outliers
[Figure: example with clusters labelled 1, 2, 3; $n_c = 3$]
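The steps of Method 3 can be sketched on toy data as follows. The cluster-count rule is taken as it reads on the slide, n_c = max(2, ceil(0.2 n)). The slide does not say which linkage was used or how a cluster is judged to be a cluster of outliers, so average linkage and the `min_size` threshold are assumptions, as are the function names and the one-dimensional sample points.

```python
def n_clusters_rule(n):
    """n_c = max(2, ceil(0.2 * n)), computed in integers to avoid float rounding."""
    return max(2, (n + 4) // 5)

def agglomerate(dist, n_clusters):
    """Average-linkage agglomerative clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def flag_small_clusters(clusters, min_size=2):
    """Assumed rule: members of clusters smaller than min_size are outliers."""
    return sorted(i for c in clusters if len(c) < min_size for i in c)

# Toy one-dimensional data; the last point sits far from the rest.
points = [1.0, 1.1, 1.2, 0.9, 1.05, 5.0, 5.1, 4.9, 5.2, 50.0]
dist = [[abs(p - q) for q in points] for p in points]
clusters = agglomerate(dist, n_clusters_rule(len(points)))
flagged = flag_small_clusters(clusters)
print(flagged)
```

This pure-Python loop is only meant to show the logic; it scales poorly, and a real run on an HS6 subset would use a library implementation of hierarchical clustering.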

8 Results – Feature Selection
Data: Ignition or Starting Equipment (HS4=8511)

Table columns:
- Method 1
- Methods 2 and 3: All data, HS6=851130, HS6=851140, HS6=851150

Table rows (variables):
- HS10
- Business Number
- Country of Origin
- Customs Office Region
- Value for Duty Code
- Entry Type
- Sales Rate
- Mode of Transport
- Region of Export
- Tariff Code

9 Results – Method 1 vs Method 2
Table 1. Counts of Outliers and Non-Outliers by Method

| | Method 2: Non-Outlier | Method 2: Outlier | Total |
|---|---|---|---|
| Method 1: Non-Outlier | 19150 | 450 | 19600 |
| Method 1: Outlier | 96 | 58 | 154 |
| Total | 19246 | 508 | 19754 |

Table 2. Frequency of Outliers Detected Uniquely by Each Method Ranking in the Top 1%, 5%, and 10% Most Influential Observations

| | Total | Top 1% (198 obs) | Top 5% (988 obs) | Top 10% (1976 obs) |
|---|---|---|---|---|
| Method 1 | 96 (100.00%) | 6 (6.25%) | 9 (9.38%) | 14 (14.58%) |
| Method 2 | 450 (100.00%) | 14 (3.11%) | 39 (8.67%) | 78 (17.33%) |
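The figures in Tables 1 and 2 can be cross-checked with a few lines of arithmetic. In this sketch the top-list sizes are assumed to be 1%, 5%, and 10% of the 19754 observations rounded up, which reproduces the sizes shown. Method 2's top 1% count is inferred from its reported share (3.11% of 450) rather than read directly from the slide. Variable names are illustrative.

```python
# Totals and uniquely flagged counts, as reported in Tables 1 and 2.
n_total = 19754
unique = {"Method 1": 96, "Method 2": 450}

# Sizes of the top 1%, 5%, and 10% lists, rounded up with integer arithmetic.
top_sizes = {p: -(-n_total * p // 100) for p in (1, 5, 10)}

# Method 2's top 1% count, inferred from the reported 3.11% share.
m2_top1 = round(0.0311 * unique["Method 2"])

top_counts = {"Method 1": [6, 9, 14], "Method 2": [m2_top1, 39, 78]}

# Share of each method's uniquely flagged outliers that fall in each top list.
shares = {
    method: [100 * c / unique[method] for c in counts]
    for method, counts in top_counts.items()
}

for method, values in shares.items():
    print(method, [f"{v:.2f}%" for v in values])
print(top_sizes)
```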

10 Results – Method 2 vs Method 3
Table 3. Counts of Outliers and Non-Outliers by Method

| | Method 3: Non-Outlier | Method 3: Outlier | Total |
|---|---|---|---|
| Method 2: Non-Outlier | 18981 | 265 | 19246 |
| Method 2: Outlier | 379 | 129 | 508 |
| Total | 19360 | 394 | 19754 |

Table 4. Frequency of Outliers Detected Uniquely by Each Method Ranking in the Top 1%, 5%, and 10% Most Influential Observations

| | Total | Top 1% (198 obs) | Top 5% (988 obs) | Top 10% (1976 obs) |
|---|---|---|---|---|
| Method 2 | 379 (100.00%) | 14 (3.69%) | 39 (10.29%) | 78 (20.58%) |
| Method 3 | 265 (100.00%) | 12 (4.53%) | 53 (20.00%) | 94 (35.47%) |

11 Summary
- More investigation is required to characterise the subsets of outliers flagged uniquely by each method
- Feature selection, compared to subject matter decision, may be more appropriate for our data
- Group LASSO may not be the optimal choice

12 References
- Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3, 5–48.
- Hidiroglou, M. A., & Berthelot, J.-M. (1986). Statistical editing and imputation for periodic business surveys. Survey Methodology, 12(1),
- Loureiro, A., Torgo, L., & Soares, C. (2004). Outlier detection using clustering methods: a data cleaning application. Proceedings of KDNet Symposium on Knowledge-based Systems for the Public Sector. Bonn, Germany.
- Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, 58(1), Retrieved January 26, 2018, from
- Yuan, M., & Lin, Y. (2006). Model selection and estimates in regression with grouped variables. Journal of the Royal Statistical Society, 68(1), Retrieved January 26, 2018, from

13 THANK YOU! For more information please contact:
#StatCan100

