Game Trees - Clustering
Prof. Sin-Min Lee
“All human beings desire to know” (Aristotle, Metaphysics, I.1)

Decision Tree A decision tree is a predictive model. Each interior node corresponds to a variable; an arc to a child represents a possible value of that variable. A leaf represents the predicted value of the target variable, given the values of the variables on the path from the root.
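As a minimal illustration (not from the slides), such a tree can be represented as nested dictionaries, with prediction done by walking from the root to a leaf; the tree contents and all names here are hypothetical:

```python
def predict(node, record):
    """Walk from the root to a leaf, following at each interior node
    the arc whose value matches the record's value for that variable."""
    while isinstance(node, dict):
        variable, children = next(iter(node.items()))
        node = children[record[variable]]
    return node  # a leaf: the predicted value of the target variable

# Hypothetical toy tree: interior node tests "outlook"; leaves predict "play".
tree = {"outlook": {"sunny": "no", "rainy": "yes"}}
print(predict(tree, {"outlook": "rainy"}))  # yes
```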

- A decision tree can be learned by splitting the source set into subsets based on an attribute-value test - This process is repeated on each derived subset in a recursive manner - The recursion completes when every element of a derived subset belongs to a single class, which can then be assigned to the whole subset - Decision trees can also be used for calculating conditional probabilities
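The recursive splitting described above can be sketched as follows. This is a simplified learner with a hypothetical `build_tree` helper: it splits on attributes in the order given, whereas a real learner would choose each split by a purity measure such as information gain.

```python
from collections import Counter

def build_tree(rows, target, attributes):
    """Recursively split the source set on attribute values; the recursion
    stops when the subset has a single target class (or no attributes remain,
    in which case the majority class becomes the leaf)."""
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]  # leaf
    attr = attributes[0]  # naive choice of split attribute
    children = {}
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        children[value] = build_tree(subset, target, attributes[1:])
    return {attr: children}

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
]
print(build_tree(rows, "play", ["outlook"]))
```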

Decision tree analysis goes by three other names: 1. Classification tree analysis, used when the predicted outcome is the class to which the data belongs. 2. Regression tree analysis, used when the predicted outcome can be considered a real number. 3. CART (Classification And Regression Tree) analysis, which refers to both of the above procedures.

Advantages of Decision Trees: simple to understand and interpret; require little data preparation; able to handle both numerical and categorical data; perform well on large data sets in a short time; the conditions behind a prediction are easily explained by Boolean logic.

AprioriTid Algorithm The database is not used at all for counting the support of candidate itemsets after the first pass. 1. The candidate itemsets are generated the same way as in the Apriori algorithm. 2. Another set C’ is generated, of which each member holds the TID of a transaction together with the candidate itemsets present in that transaction. This set is used to count the support of each candidate itemset. The advantage is that the number of entries in C’ may be smaller than the number of transactions in the database, especially in the later passes.
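One AprioriTid pass might be sketched as follows, assuming itemsets are stored as sorted tuples and C’ maps each TID to the candidate itemsets present in that transaction (the `apriori_tid_pass` name and data layout are illustrative, not from the original paper). A candidate is detected in a transaction by checking that both of its join generators are in the previous pass's C’ entry, which is valid when candidates come from the prefix join:

```python
from collections import defaultdict

def apriori_tid_pass(ctid_prev, candidates):
    """Count candidate support from the transaction-encoded set C' of the
    previous pass, without touching the database, and build the new C'."""
    support = defaultdict(int)
    ctid_new = {}
    for tid, itemsets in ctid_prev.items():
        present = set()
        for cand in candidates:
            k = len(cand)
            # the two (k-1)-subsets obtained by dropping the last item
            # and the next-to-last item (the join generators)
            sub1 = cand[:k - 1]
            sub2 = cand[:k - 2] + cand[k - 1:]
            if sub1 in itemsets and sub2 in itemsets:
                support[cand] += 1
                present.add(cand)
        if present:  # entries shrink (or disappear) in later passes
            ctid_new[tid] = present
    return support, ctid_new

# The example database from the slides.
db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
c1 = {tid: {(i,) for i in items} for tid, items in db.items()}
c2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
support2, ctid2 = apriori_tid_pass(c1, c2)
support3, ctid3 = apriori_tid_pass(ctid2, [(2, 3, 5)])
print(support2[(2, 5)], support3[(2, 3, 5)])  # 3 2
```

With candidates restricted to pairs of large items, transaction 100 keeps only {1 3}, and C’3 retains only TIDs 200 and 300, matching the example slides.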

Apriori Algorithm Candidate itemsets are generated using only the large itemsets of the previous pass, without considering the transactions in the database. 1. The large itemsets of the previous pass are joined with themselves to generate all itemsets whose size is larger by one. 2. Each generated itemset that has a subset which is not large is deleted. The remaining itemsets are the candidates.
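The join and prune steps can be sketched as follows, assuming itemsets are kept as sorted tuples (`apriori_gen` is a hypothetical helper name):

```python
def apriori_gen(prev_large):
    """Join L_{k-1} with itself on the first k-2 items, then prune any
    candidate that has a (k-1)-subset not in L_{k-1}."""
    prev = sorted(prev_large)
    prev_set = set(prev)
    k = len(prev[0]) + 1
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:k - 2] == b[:k - 2]:          # join step
                c = a + b[-1:]
                # prune step: every (k-1)-subset must be large
                if all(c[:j] + c[j + 1:] in prev_set for j in range(k)):
                    candidates.append(c)
    return candidates

l2 = [(1, 3), (2, 3), (2, 5), (3, 5)]
print(apriori_gen(l2))  # [(2, 3, 5)]
```

Note that {1 3 5} is never generated: although it could be joined from {1 3} and {1 5}, {1 5} is not large, and the prune step would likewise reject it.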

Example (minimum support = 2; * marks large itemsets)

Database:
TID   Items
100   {1 3 4}
200   {2 3 5}
300   {1 2 3 5}
400   {2 5}

L1:
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2:
Itemset   Support
{1 3}*    2
{1 4}     1
{3 4}     1
{2 3}*    2
{2 5}*    3
{3 5}*    2
{1 2}     1
{1 5}     1

C3:
Itemset   Support
{1 3 4}   1
{2 3 5}*  2
{1 3 5}   1

Example

Database:
TID   Items
100   {1 3 4}
200   {2 3 5}
300   {1 2 3 5}
400   {2 5}

L1:
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2:
Itemset   TID
{1 3}     100
{1 4}     100
{3 4}     100
{2 3}     200
{2 5}     200
{3 5}     200
{1 2}     300
{1 3}     300
{1 5}     300
{2 3}     300
{2 5}     300
{3 5}     300
{2 5}     400

C3:
Itemset   TID
{1 3 4}   100
{2 3 5}   200
{1 3 5}   300
{2 3 5}   300

Example

Database:
TID   Items
100   {1 3 4}
200   {2 3 5}
300   {1 2 3 5}
400   {2 5}

L1:
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2:
Itemset   Support
{1 2}     1
{1 3}*    2
{1 5}     1
{2 3}*    2
{2 5}*    3
{3 5}*    2

Candidate 3-itemsets (before pruning): {1 2 3}, {1 3 5}, {2 3 5}

C3:
Itemset   Support
{2 3 5}*  2

Example

Database:
TID   Items
100   {1 3 4}
200   {2 3 5}
300   {1 2 3 5}
400   {2 5}

L1:
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2:
Itemset   Support
{1 2}     1
{1 3}*    2
{1 5}     1
{2 3}*    2
{2 5}*    3
{3 5}*    2

C3:
Itemset   Support
{2 3 5}*  2

C’2:
TID   Itemsets
100   {1 3}
200   {2 3}, {2 5}, {3 5}
300   {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
400   {2 5}

C’3:
TID   Itemsets
200   {2 3 5}
300   {2 3 5}
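Putting the passes together, a compact (unoptimized) Apriori loop reproduces the example's large itemsets, assuming the minimum support of 2 implied by the ‘*’ marks; itemsets are sorted tuples and support is counted by scanning the transactions directly:

```python
from itertools import combinations

def apriori(db, minsup):
    """Full Apriori loop (sketch): generate candidates by prefix join plus
    subset pruning, count support against the transactions, repeat."""
    transactions = list(db.values())

    def support(itemset):
        return sum(set(itemset) <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    large = [(i,) for i in items if support((i,)) >= minsup]
    levels = {}
    k = 1
    while large:
        levels[k] = large
        k += 1
        prev = set(large)
        candidates = [
            a + b[-1:]
            for i, a in enumerate(large)
            for b in large[i + 1:]
            if a[:k - 2] == b[:k - 2]                                 # join
            and all(s in prev for s in combinations(a + b[-1:], k - 1))  # prune
        ]
        large = [c for c in candidates if support(c) >= minsup]
    return levels

db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
levels = apriori(db, 2)
print(levels[2])  # [(1, 3), (2, 3), (2, 5), (3, 5)]
print(levels[3])  # [(2, 3, 5)]
```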

No practicable methodology has been demonstrated for reliable prediction of large earthquakes on time scales of decades or less –Some scientists question whether such predictions will be possible even with much improved observations –The pessimism comes from repeated cycles in which public promises that reliable predictions are just around the corner are followed by equally public failures of specific prediction methodologies. Bad for science!

COMPLEX PLATE BOUNDARY ZONE IN SOUTHEAST ASIA Northward motion of India deforms all of the region Many small plates (microplates) and blocks (Molnar & Tapponnier, 1977)

Mission District — San Francisco Earthquake, 1906 Short-term prediction (forecast) indicators:  Frequency and distribution pattern of foreshocks  Deformation of the ground surface: tilting, elevation changes  Emission of radon gas  Seismic gaps along faults  Abnormal animal activity

The violent earthquake leveled Tangshan in an instant. The photo shows the ruins of the Tangshan urban area after the quake.

Freeway Damage — 1994 CA Earthquake

Sand Boils after Loma Prieta Earthquake

California Earthquake Probabilities Map

Clustering Group data into clusters –Similar to one another within the same cluster –Dissimilar to the objects in other clusters –Unsupervised learning: no predefined classes (Figure: two clusters and some outliers.)

What is Cluster Analysis? Cluster analysis –Grouping a set of data objects into clusters Clustering is unsupervised classification: no predefined classes Typical applications –to get insight into data –as a preprocessing step

What Is A Good Clustering? High intra-class similarity and low inter-class similarity –Depending on the similarity measure The ability to discover some or all of the hidden patterns

General Applications of Clustering Pattern Recognition Spatial Data Analysis –create thematic maps in GIS by clustering feature spaces –detect spatial clusters and explain them in spatial data mining Image Processing Economic Science (especially market research) WWW –Document classification –Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs Land use: Identification of areas of similar land use in an earth observation database Insurance: Identifying groups of motor insurance policy holders with a high average claim cost City planning: Identifying groups of houses according to their house type, value, and geographical location Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

What Is Good Clustering? A good clustering method will produce high quality clusters with –high intra-class similarity –low inter-class similarity The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Data Structures in Clustering Data matrix (two modes): n objects described by p variables. Dissimilarity matrix (one mode): an n × n table of pairwise dissimilarities d(i, j).

Measuring Similarity Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric. There is a separate “quality” function that measures the “goodness” of a cluster. The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables. Weights should be associated with different variables based on applications and data semantics. It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
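For interval-scaled variables, two common weighted distance functions might be sketched as follows (the function names and the optional weight vector are illustrative; the weights correspond to the per-variable weights mentioned on the slide):

```python
from math import sqrt

def euclidean(x, y, w=None):
    """Weighted Euclidean distance between two points."""
    w = w or [1.0] * len(x)
    return sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

def manhattan(x, y, w=None):
    """Weighted Manhattan (city-block) distance between two points."""
    w = w or [1.0] * len(x)
    return sum(wi * abs(a - b) for wi, a, b in zip(w, x, y))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7.0
```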

Notion of a Cluster Can Be Ambiguous How many clusters? The same set of points can reasonably be seen as two clusters, four clusters, or six clusters.

Hierarchical algorithms Agglomerative: start with each object as its own cluster, then merge clusters to form larger ones Divisive: start with all objects in one cluster, then split it into smaller clusters

Types of Clusters: Well-Separated Well-Separated Clusters: –A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters

Types of Clusters: Center-Based Center-based –A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster –The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster 4 center-based clusters
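The centroid/medoid distinction can be sketched directly (hypothetical helper names; Euclidean distance assumed). The centroid need not be an actual data point, and it is pulled toward outliers, while the medoid is always a member of the cluster:

```python
def centroid(points):
    """The average of all the points in the cluster."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def medoid(points):
    """The most 'representative' actual point: the cluster member that
    minimizes the total distance to all other points in the cluster."""
    def total_dist(p):
        return sum(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                   for q in points)
    return min(points, key=total_dist)

pts = [(0, 0), (1, 0), (0, 1), (10, 10)]
print(centroid(pts))  # (2.75, 2.75) -- pulled toward the outlier (10, 10)
print(medoid(pts))    # (1, 0)
```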

Types of Clusters: Contiguity-Based Contiguous Cluster (Nearest neighbor or Transitive) –A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. 8 contiguous clusters

Types of Clusters: Density-Based Density-based –A cluster is a dense region of points, separated from other regions of high density by low-density regions –Used when the clusters are irregular or intertwined, and when noise and outliers are present 6 density-based clusters

Types of Clusters: Conceptual Clusters Shared Property or Conceptual Clusters –Finds clusters that share some common property or represent a particular concept. 2 overlapping circles

Hierarchical Clustering (Figures: traditional hierarchical clustering with a traditional dendrogram; non-traditional hierarchical clustering with a non-traditional dendrogram.)

Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree-like diagram that records the sequences of merges or splits

Starting Situation Start with clusters of individual points (p1, p2, p3, p4, p5) and a proximity matrix of their pairwise distances.

Intermediate Situation After some merging steps, we have clusters C1, C2, C3, C4, C5 and a proximity matrix over these clusters.

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

After Merging The question is “How do we update the proximity matrix?” The row and column for the merged cluster C2 ∪ C5 are initially unknown (marked ‘?’) and must be recomputed against C1, C3, and C4.
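One possible answer, sketched for single-link (MIN) clustering: the merged cluster's proximity to each remaining cluster is the element-wise minimum of the two old rows (for complete link it would be the maximum, and for group average a size-weighted mean). The dict-of-dicts layout and the distance values below are illustrative, not taken from the slides:

```python
def merge_clusters(prox, ci, cj):
    """Merge clusters ci and cj under single link and rebuild the matrix:
    d(ci U cj, ck) = min(d(ci, ck), d(cj, ck)) for every remaining ck."""
    merged = ci + "+" + cj
    new = {ck: min(prox[ci][ck], prox[cj][ck])
           for ck in prox if ck not in (ci, cj)}
    # drop the rows/columns of ci and cj, then add the merged row/column
    out = {ck: {cl: d for cl, d in prox[ck].items() if cl not in (ci, cj)}
           for ck in prox if ck not in (ci, cj)}
    for ck, d in new.items():
        out[ck][merged] = d
    out[merged] = dict(new)
    out[merged][merged] = 0.0
    return out

prox = {
    "C1": {"C1": 0, "C2": 5, "C3": 9, "C4": 8, "C5": 6},
    "C2": {"C1": 5, "C2": 0, "C3": 4, "C4": 7, "C5": 1},
    "C3": {"C1": 9, "C2": 4, "C3": 0, "C4": 2, "C5": 3},
    "C4": {"C1": 8, "C2": 7, "C3": 2, "C4": 0, "C5": 6},
    "C5": {"C1": 6, "C2": 1, "C3": 3, "C4": 6, "C5": 0},
}
out = merge_clusters(prox, "C2", "C5")   # C2 and C5 are the closest pair
print(out["C2+C5"]["C3"])  # 3 = min(d(C2,C3)=4, d(C5,C3)=3)
```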

How to Define Inter-Cluster Similarity? Options: MIN; MAX; Group Average; Distance Between Centroids; other methods driven by an objective function.
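The four measures listed above can be sketched for clusters represented as lists of points (Euclidean distance and the function names are assumptions, not from the slides):

```python
def dist(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def d_min(A, B):       # MIN (single link): closest pair across clusters
    return min(dist(p, q) for p in A for q in B)

def d_max(A, B):       # MAX (complete link): farthest pair across clusters
    return max(dist(p, q) for p in A for q in B)

def d_avg(A, B):       # group average: mean of all pairwise distances
    return sum(dist(p, q) for p in A for q in B) / (len(A) * len(B))

def d_centroid(A, B):  # distance between the cluster centroids
    ca = tuple(sum(x) / len(A) for x in zip(*A))
    cb = tuple(sum(x) / len(B) for x in zip(*B))
    return dist(ca, cb)

A, B = [(0, 0), (0, 2)], [(3, 0), (3, 2)]
print(d_min(A, B), d_centroid(A, B))  # 3.0 3.0
```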

(Figure: MIN and MAX inter-cluster similarity illustrated on the point set p1 to p5 and its proximity matrix.)

(Figure: Group Average and Distance Between Centroids illustrated on the same point set.)

Cluster Similarity: MIN or Single Link Similarity of two clusters is based on the two most similar (closest) points in the different clusters Determined by one pair of points, i.e., by one link in the proximity graph.

Hierarchical Clustering: MIN (Figure: nested clusters and the corresponding dendrogram.)

Cluster Similarity: MAX or Complete Linkage Similarity of two clusters is based on the two least similar (most distant) points in the different clusters –Determined by all pairs of points in the two clusters

Hierarchical Clustering: MAX (Figure: nested clusters and the corresponding dendrogram.)

Cluster Similarity: Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters. Need to use average connectivity for scalability, since total proximity favors large clusters.

Hierarchical Clustering: Group Average (Figure: nested clusters and the corresponding dendrogram.)

Hierarchical Clustering: Time and Space Requirements O(N²) space, since it uses the proximity matrix –N is the number of points O(N³) time in many cases –There are N steps, and at each step the N²-sized proximity matrix must be updated and searched –Complexity can be reduced to O(N² log N) time for some approaches

Hierarchical Clustering: Problems and Limitations Once a decision is made to combine two clusters, it cannot be undone No objective function is directly minimized Different schemes have problems with one or more of the following: –Sensitivity to noise and outliers –Difficulty handling different sized clusters and convex shapes –Breaking large clusters