Data Mining -- Clustering


Data Mining -- Clustering, Dr. Avi Rosenfeld

The general idea: similar things are similar, so we group similar things together. How do we do that?
- Regression and classification (supervised): k-NN
- Clustering (unsupervised): partitioning algorithms (k-means), hierarchical algorithms
Open questions: how do we define "closeness"?
- Euclidean distance
- Manhattan distance (Judea Pearl)
- Many other options

How should we classify the question mark?

K-Nearest Neighbor
- The classification is computed at query time ("model free").
- We must choose the number of neighbors, k.
- Usually the neighbors are weighted by their distance from the query point.
- CBR (Case Based Reasoning) is a similar idea.
- For classification, we take the majority vote of the neighbors (or a distance-weighted vote).
- For regression, the predicted value is the average of the neighbors' values (or a distance-weighted average).

1-Nearest Neighbor

3-Nearest Neighbor

k Nearest Neighbor
- k = 1: belongs to the square class
- k = 3: belongs to the triangle class
- k = 7: belongs to the square class
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.
- Choose an odd value for k to eliminate ties.
(ICDM: Top Ten Data Mining Algorithms, k nearest neighbor classification, December 2006)

Remarks
+ Highly effective inductive inference method for noisy training data and complex target functions.
+ The target function for the whole space may be described as a combination of less complex local approximations.
+ Learning is very simple.
- Classification is time consuming.
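A minimal sketch of the k-NN idea above in Python (illustrative only; the function name and the plain majority-vote rule are my own choices, and NumPy is assumed):

```python
# Minimal k-NN classifier sketch: no training step ("model free"),
# all the work happens at query time.
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    dists = np.linalg.norm(train_X - query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest neighbors
    # classification: majority vote of the neighbors
    # (for regression one would average train_y[nearest] instead, optionally distance-weighted)
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```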

The basic K-MEANS clustering algorithm:
1. From the chosen sample population (the points), pick K random points. These points are the initial cluster centers (seeds).
2. Compute the Euclidean distance of every point from the chosen centers.
3. Assign each point to its nearest center; this yields K disjoint clusters.
4. In each cluster, compute a new center as the mean of all the points in the cluster.
5. If the new centers are identical to the previous ones, the process ends; otherwise repeat from step 2.
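A minimal Python sketch of these steps (illustrative, not from the original deck; it assumes NumPy and does not handle clusters that become empty):

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # step 1: pick K random points as the initial cluster centers (seeds)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # steps 2-3: Euclidean distance to every center, assign each point to the nearest one
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # step 4: the new center of each cluster is the mean of its points
        new_centers = np.array([points[assignments == j].mean(axis=0) for j in range(k)])
        # step 5: stop when the centers no longer change, otherwise repeat
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assignments
```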

Example with 6 points

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0

Example with 6 points

Iteration 1
Points 1 and 3 were chosen at random as the initial centers C1 and C2.
Points 1 and 2 are assigned to center C1; points 3, 4, 5 and 6 are assigned to center C2.
The distance formula: Distance = √((x1 - x2)² + (y1 - y2)²)

Instance   Distance from C1   Distance from C2
1          0.00               1.00
2          3.00               3.16
3          1.00               0.00
4          2.24               2.00
5          2.24               1.41
6          6.02               5.41

Choosing new centers
For C1: X = (1.0 + 1.0)/2 = 1.0, Y = (1.5 + 4.5)/2 = 3.0
For C2: X = (2.0 + 2.0 + 3.0 + 5.0)/4 = 3.0, Y = (1.5 + 3.5 + 2.5 + 6.0)/4 = 3.375

Iteration 2
The new cluster centers: C1(1.0, 3.0), C2(3.0, 3.375)

Instance   Distance from C1   Distance from C2
1          1.50               2.74
2          1.50               2.29
3          1.80               2.125
4          1.12               1.01
5          2.06               0.875
6          5.00               3.30

The final result
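For reference, the two iterations above can be reproduced with a few lines of NumPy (this snippet is not from the slides; it only re-checks the arithmetic using the six instances and seeds 1 and 3):

```python
import numpy as np

points = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                   [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
centers = points[[0, 2]]  # seeds: instances 1 and 3

for iteration in (1, 2):
    # distance of every point from C1 and C2 (rows: instances, columns: centers)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    centers = np.array([points[assignments == j].mean(axis=0) for j in range(2)])
    print(f"iteration {iteration}:\n{dists.round(2)}\nnew centers = {centers}")
# iteration 1 yields C1 = (1.0, 3.0) and C2 = (3.0, 3.375), matching the slides
```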

Problems with k-means
- The user must specify K in advance.
- It assumes that a mean can be computed.
- It is very sensitive to outliers.
- Outliers are points that are far away from all the others; they may simply be errors in the data.
(CS583, Bing Liu, UIC)

An example of an outlier

Euclidean distance
d(i,j) = √((x_i1 - x_j1)² + (x_i2 - x_j2)² + ... + (x_ip - x_jp)²)
Properties of a metric d(i,j):
- d(i,j) ≥ 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)
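As a small illustration (not part of the slides), the metric properties above can be checked directly for the Euclidean distance on three of the instances from the earlier example:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

i, j, k = (1.0, 1.5), (2.0, 1.5), (3.0, 2.5)   # instances 1, 3 and 5 from the example
assert euclidean(i, j) >= 0                                  # non-negativity
assert euclidean(i, i) == 0                                  # identity
assert euclidean(i, j) == euclidean(j, i)                    # symmetry
assert euclidean(i, j) <= euclidean(i, k) + euclidean(k, j)  # triangle inequality
```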

Hierarchical Clustering
Produces a nested sequence of clusters organized as a tree, also called a dendrogram.

Types of hierarchical clustering
Agglomerative (bottom-up) clustering:
- Builds the dendrogram (tree) from the bottom level by merging the most similar (or nearest) pair of clusters at each step.
- Stops when all the data points are merged into a single cluster (i.e., the root cluster).
Divisive (top-down) clustering:
- Starts with all data points in one cluster, the root, and splits the root into a set of child clusters. Each child cluster is recursively divided further.
- Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.

Agglomerative clustering
- It is more popular than divisive methods.
- At the beginning, each data point forms its own cluster (also called a node).
- Merge the nodes/clusters that have the least distance between them.
- Keep merging until, eventually, all nodes belong to one cluster.

Agglomerative clustering algorithm
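The pseudocode box from the original slide is not included in this transcript. As a stand-in, here is a minimal bottom-up clustering sketch using SciPy (my own illustration, not the deck's code), run on the six points from the k-means example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                   [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
# start from singleton clusters and repeatedly merge the closest pair
Z = linkage(points, method='single')
# cut the resulting dendrogram into 2 clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```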

An example: working of the algorithm

Measuring the distance of two clusters
There are a few ways to measure the distance between two clusters; each results in a different variation of the algorithm:
- Single link
- Complete link
- Average link
- Centroids
- ...

Single link method
- The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
- It can find arbitrarily shaped clusters, but it may cause the undesirable "chain effect" through noisy points; in the slide's figure, two natural clusters end up split incorrectly as a result.

Complete link method
- The distance between two clusters is the distance between the two furthest data points in the two clusters.
- It is sensitive to outliers because they are far away.
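A tiny numeric illustration of the difference (hypothetical points, not from the slides): with clusters A = {(0,0), (1,0)} and B = {(3,0), (6,0)}, single link uses the closest cross-cluster pair while complete link uses the furthest one, so a far-away point in B only affects the complete-link distance.

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [6.0, 0.0]])
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # all cross-cluster distances
single_link = pairwise.min()    # 2.0 -- the closest pair, (1,0) and (3,0)
complete_link = pairwise.max()  # 6.0 -- the furthest pair, (0,0) and (6,0), pulled up by the outlying point
```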

EM Algorithm
- Initialize K cluster centers.
- Iterate between two steps:
  - Expectation step: assign points to clusters.
  - Maximization step: estimate model parameters.
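A minimal EM sketch in the same spirit (illustrative; it fixes the cluster variances and weights, so it is essentially "soft" k-means rather than the full mixture-model EM from the lecture):

```python
import numpy as np

def em_cluster(points, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]  # initialize K cluster centers
    for _ in range(n_iter):
        # Expectation step: soft-assign every point to every cluster (responsibilities)
        sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        resp = np.exp(-0.5 * sq_dists)
        resp /= resp.sum(axis=1, keepdims=True)
        # Maximization step: re-estimate each center as the responsibility-weighted mean
        centers = (resp.T @ points) / resp.sum(axis=0)[:, None]
    return centers, resp
```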