Dimensionally distributed density estimation. Pasi Fränti and Sami Sieranoja

Presentation transcript:

Dimensionally distributed density estimation
Pasi Fränti and Sami Sieranoja, 24.1.2019
P. Fränti and S. Sieranoja, "Dimensionally distributed density estimation", Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), Zakopane, Poland, 343-353, June 2018.

Density in clustering

Density in outlier detection

Definitions

Definition of density: density = mass / volume. For point data, the mass is the number of points N that fall inside a neighborhood of radius r, and the volume is the volume of that neighborhood.
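As a concrete illustration of this definition (a hypothetical sketch, not from the slides), the density around a query point can be estimated as the number of points inside a radius-r ball divided by the ball's volume; the function names and the uniform test data below are illustrative.

```python
# Hypothetical sketch: density = mass / volume for a point set.
# "Mass" = number of points within radius r of a query point,
# "volume" = volume of the d-dimensional ball of radius r.
import numpy as np
from math import pi, gamma

def ball_volume(r, d):
    """Volume of a d-dimensional ball of radius r."""
    return (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d

def density_at(query, points, r):
    """Density around `query`: point count inside radius r / ball volume."""
    dists = np.linalg.norm(points - query, axis=1)
    mass = np.count_nonzero(dists <= r)
    return mass / ball_volume(r, points.shape[1])

# Example: 1000 uniform points in the unit square, density near the center.
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
print(density_at(np.array([0.5, 0.5]), X, r=0.1))  # ~1000 for uniform data
```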

Two ways to estimate density. Input: a point; output: the density around that point. Distance-based: fix the neighborhood radius R and count the points N inside it. Neighbor-based: fix the number of points N (the k nearest neighbors) and measure the size R of the neighborhood that contains them.

Two ways to estimate density (example figures). Distance-based: input = radius R, output = the number of points within that radius. Neighbor-based: input = the k nearest neighbors, output = the mean distance to them.

Summary. Distance-based: the radius R is the fixed constant and the point count N is the measure. Neighbor-based: the number of neighbors k is the fixed constant and the distance R is the measure.

Choice of the parameters
Distance-based:
  R = 10-100% * average distance to data center [2]
  R = average pairwise distance of all data points [28]
  R = 90% * first peak in the pairwise distance histogram [17]
  R = 0.07 [26]
Neighbor-based:
  k = 10 [18]
  k = 30 [12]
  k = 10-100 [27]
  k = 30-200 [5]
  k = N [19]
  k = min{50, N/(2K)}, where K is the number of clusters

Bottleneck: finding the neighbors takes O(N²) distance calculations. Distance-based: find all points within radius R (d(x,y) ≤ R). Neighbor-based: find the k nearest points.
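A minimal brute-force sketch of the two estimators described above (all names are illustrative); the full pairwise distance matrix it builds is exactly the O(N²) bottleneck the slide refers to.

```python
# Brute-force versions of the two density estimators (illustrative names).
# Both need all pairwise distances, hence the O(N^2) bottleneck.
import numpy as np

def distance_based_density(X, R):
    """For every point: number of other points within radius R."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # O(N^2) matrix
    return (D <= R).sum(axis=1) - 1              # exclude the point itself

def neighbor_based_density(X, k):
    """For every point: mean distance to its k nearest neighbors
    (smaller mean distance = higher density)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D_sorted = np.sort(D, axis=1)                # column 0 is the zero self-distance
    return D_sorted[:, 1:k + 1].mean(axis=1)

rng = np.random.default_rng(1)
X = rng.random((500, 2))
print(distance_based_density(X, R=0.05)[:5])
print(neighbor_based_density(X, k=10)[:5])
```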

Dimensionally distributed density estimation (DDDE)

Density in categorical data: estimate the popularity of individual attribute values (Cao, Liang, Bai, Expert Systems with Applications, 2009). Example records: [Zhang, Farmer, Mandarin], [Malinen, Scientist, Finnish].

Sorting in each dimension: the points are sorted separately by their x-values and by their y-values, and a sliding window is applied to each sorted order.

Independent density estimates: the x-projection and the y-projection each give a sliding-window estimate for a point, and the combined density value is their sum (in the example figure, 0.6 + 0.6 = 1.2).

Density estimates: DDDE compared with 2-NN (figure).

Sliding window technique (figure): the projection values y[i] are processed in sorted order, and the window statistics m- and m+ are updated incrementally as the window slides, subtracting the values that drop out of the window and adding the ones that enter (e.g. -26 +47 and -67 +88 in the example).
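The bookkeeping can be sketched as below, assuming a symmetric window of up to k sorted neighbors on each side (the helper name window_mean_distance, the boundary handling and the exact window definition are my assumptions, not the paper's code); prefix sums over the sorted projection give each per-point window distance in O(1) after the O(N log N) sort, which is equivalent to updating the running sums incrementally.

```python
# Hedged sketch of the per-dimension sliding window on one sorted projection.
import numpy as np

def window_mean_distance(values, k):
    """Mean 1-D distance from each value to (up to) k sorted neighbors
    on each side; result is returned in the original point order."""
    order = np.argsort(values)
    y = values[order]
    n = len(y)
    prefix = np.concatenate(([0.0], np.cumsum(y)))      # prefix[j] = sum of y[:j]
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - k), min(n - 1, i + k)
        left_cnt, right_cnt = i - lo, hi - i
        left_sum = prefix[i] - prefix[lo]                # sum of y[lo:i]
        right_sum = prefix[hi + 1] - prefix[i + 1]       # sum of y[i+1:hi+1]
        dist_sum = (left_cnt * y[i] - left_sum) + (right_sum - right_cnt * y[i])
        out[i] = dist_sum / max(1, left_cnt + right_cnt)
    result = np.empty(n)
    result[order] = out                                  # map back to original order
    return result

# Example with the sorted values shown on the slide:
vals = np.array([17, 21, 26, 29, 44, 47, 67, 75, 77, 88, 95], dtype=float)
print(window_mean_distance(vals, k=2))
```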

DDDE algorithm: O(DN log N + DN) in total; per dimension, the sorting takes O(N log N) and the sliding-window pass O(N).
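Putting the pieces together, a hypothetical end-to-end sketch (reusing the window_mean_distance helper from the previous block) could look like this; the per-point score is the sum of the per-dimension window distances, following the 0.6 + 0.6 = 1.2 example, with smaller sums meaning denser neighborhoods. The paper may scale or invert the score differently.

```python
# Compact sketch of the overall DDDE pipeline under the same assumptions:
# each dimension is sorted (O(N log N)) and scanned with the sliding window
# (O(N)); the D per-dimension estimates are summed per point.
import numpy as np

def ddde(X, k):
    """X: (N, D) data matrix -> (N,) array of density scores."""
    n, d = X.shape
    score = np.zeros(n)
    for dim in range(d):
        # window_mean_distance is the hypothetical helper from the previous sketch
        score += window_mean_distance(X[:, dim], k)
    return score

# Two Gaussian blobs: points in the cluster cores get the smallest scores.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.2, (200, 2)), rng.normal(3.0, 0.2, (200, 2))])
print(ddde(X, k=10)[:5])
```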

Potential false detections

Experiments

Methods compared
Clustering algorithms:
  Density-based sorting + k-means
  Density peaks [26]
  Repeated k-means (as reference point)
Density estimations:
  Full search, O(N²)
  Using subsample (s=2%), O(sN²)
  Using DDDE, O(N log N)

Datasets: S1, S2, S3, S4, Unbalance, Birch1, Birch2, DIM32, A1, A2, A3.

Centroid index (CI) [Fränti, Rezaei, Zhao, Pattern Recognition 2014]: with 15 prototypes (pigeons) and 15 real clusters (pigeon holes), CI counts how many real clusters are left without a prototype; in the example four holes are empty, so CI = 4.
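The pigeonhole idea can be sketched as follows: map each prototype to its nearest ground-truth centroid and count the ground-truth centroids that receive no prototype (the empty holes). To my understanding the published CI takes the maximum over both mapping directions, but treat the sketch below as an illustrative reading of the slide rather than the reference implementation.

```python
# Illustrative sketch of the centroid index (CI) pigeonhole count.
import numpy as np

def ci_one_way(prototypes, ground_truth):
    """Map each prototype to its nearest true centroid and count the
    true centroids ("pigeon holes") left without any prototype."""
    D = np.linalg.norm(prototypes[:, None, :] - ground_truth[None, :, :], axis=2)
    nearest = D.argmin(axis=1)                 # prototype -> nearest true centroid
    occupied = np.zeros(len(ground_truth), dtype=bool)
    occupied[nearest] = True
    return int((~occupied).sum())              # number of empty holes

def centroid_index(A, B):
    """Symmetric variant: maximum of the two one-way counts."""
    return max(ci_one_way(A, B), ci_one_way(B, A))
```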

Quality comparison (centroid index).

Effect for density peaks: full search vs. DDDE (figure).

Speed comparison (seconds).

Speed vs. quality

Time profiling

Conclusions: a rapid O(DN log N) time algorithm; a remarkable 160:1 speed-up; density estimation is no longer the bottleneck.