Mean-shift outlier detection


Mean-shift outlier detection
Jiawei Yang, Susanto Rahardja, Pasi Fränti
18.11.2018

Clustering

Clustering with noisy data

How to deal with outliers in clustering
Approach 1: Cluster
Approach 2: Outlier removal → Cluster
Approach 3: Outlier removal → Cluster
Mean-shift approach: Mean shift → Cluster

Mean-shift

K-nearest neighbors: the k-neighborhood of a point (k = 4)
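A minimal sketch of the k-neighborhood computation in Python, assuming plain NumPy and Euclidean distance; the helper name knn_indices is illustrative, not from the slides:

import numpy as np

def knn_indices(X, i, k):
    # Indices of the k nearest neighbors of X[i], excluding the point itself.
    d = np.linalg.norm(X - X[i], axis=1)   # distances from X[i] to all points
    d[i] = np.inf                          # never pick the point as its own neighbor
    return np.argsort(d)[:k]

With k = 4 as in the slide, knn_indices(X, i, 4) returns the 4-neighborhood of point i.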

Expected result: the point moves to the mean of its neighbors

Result with a noise point: the noise point pulls the mean of the k-neighborhood away from the cluster

Part I: Noise removal
Fränti and Yang, "Medoid-shift noise removal to improve clustering", Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), June 2018.

Mean-shift process: move points to the means of their neighbors
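A sketch of one full mean-shift pass, reusing the knn_indices helper above; every point is shifted using the original positions rather than the already-shifted ones, which is an assumption the slide leaves open:

def mean_shift_pass(X, k):
    # Replace every point by the mean of its k nearest neighbors.
    Y = np.empty_like(X)
    for i in range(len(X)):
        nn = knn_indices(X, i, k)
        Y[i] = X[nn].mean(axis=0)
    return Y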

Mean or medoid? Mean-shift vs. medoid-shift

Medoid-shift algorithm
Fränti and Yang, "Medoid-shift noise removal to improve clustering", Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), June 2018.
REPEAT 3 TIMES:
  Calculate kNN(x)
  Calculate medoid M of the neighbors
  Replace point x by the medoid M
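The same loop with the medoid in place of the mean gives a sketch of the algorithm above; the medoid is computed over the k neighbors only, following the slide text:

def medoid(P):
    # The point of P minimizing the total distance to all points of P.
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)  # pairwise distances
    return P[d.sum(axis=1).argmin()]

def medoid_shift(X, k, iterations=3):
    for _ in range(iterations):          # REPEAT 3 TIMES
        Y = np.empty_like(X)
        for i in range(len(X)):
            nn = knn_indices(X, i, k)    # calculate kNN(x)
            Y[i] = medoid(X[nn])         # medoid M of the neighbors
        X = Y                            # replace points by their medoids
    return X

Unlike the mean, the medoid is always an existing data point, which makes the shift more robust to a far-away noise point inside the neighborhood.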

Iterative processes

Effect on clustering result: noisy original CI=4, after iteration 1 CI=4, after iteration 2 CI=3, after iteration 3 CI=0

Experiments

Datasets
P. Fränti and S. Sieranoja, "K-means properties on six clustering benchmark datasets", Applied Intelligence, 2018.
Three sets with k = 20, 35, 50 clusters; four sets with N = 5000 points and k = 15 clusters.

Noise types: random noise and data-dependent noise, illustrated on the S1 and S3 datasets

Results for noise type 1
Centroid Index (CI) reduces from 5.1 to 1.4. (CI counts incorrectly allocated centroids; CI = 0 means the cluster-level structure is correct.)
KM: K-means with random initialization
RS: P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 5:13, 1-29, 2018.

Results for noise type 2
Centroid Index (CI) reduces from 5.0 to 1.1.
KM: K-means with random initialization
RS: P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 5:13, 1-29, 2018.

Part II: Outlier detection
Yang, Rahardja, Fränti, "Mean-shift outlier detection", Int. Conf. Fuzzy Systems and Data Mining (FSDM), November 2018.

Outlier scores: D = |x - y|, computed with k = 4 (example scores: 203 and 21)

Mean-shift for outlier detection
Algorithm MOD:
  Calculate mean-shift yi for every point xi
  Outlier score Di = |yi - xi|
  SD = standard deviation of all Di
  IF Di > SD THEN xi is an OUTLIER
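Composed from the sketches above, MOD reduces to a few lines; three shift iterations follow the next slide, and the standard-deviation threshold is the one stated in the algorithm:

def mod_outliers(X, k, iterations=3):
    # Mean-shift outlier detection: score = distance each point moved.
    Y = X
    for _ in range(iterations):          # three passes, as on the next slide
        Y = mean_shift_pass(Y, k)
    D = np.linalg.norm(Y - X, axis=1)    # Di = |yi - xi|
    return D > D.std()                   # outlier if Di > SD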

Result after three iterations: x shifts through y1 → y2 → y3; score D = |x - y3|

Outlier scores D = |x - y| for every point: 69, 203, 65, 36, 13, 64, 26, 44, 28, 14, 40, 30, 120, 187, 12, 4, 46, 27, 16, 58, 118, 83, 95; threshold SD = 52
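These numbers can be checked directly; NumPy's default (population) standard deviation reproduces the threshold on the slide:

import numpy as np

D = np.array([69, 203, 65, 36, 13, 64, 26, 44, 28, 14, 40, 30,
              120, 187, 12, 4, 46, 27, 16, 58, 118, 83, 95])
print(round(D.std()))    # 52
print(D[D > D.std()])    # [ 69 203  65  64 120 187  58 118  83  95]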

Distribution of the scores: normal points vs. outliers

Experimental comparisons

Results for noise type 1: with a priori noise level (7%) and with automatically estimated threshold (SD)

Results for noise type 2: with a priori noise level (7%) and with automatically estimated threshold (SD)

Conclusions
Why to use: simple but effective!
How it performs: outperforms existing methods (98-99%)!
Usefulness: no threshold parameter needed!