Christoph F. Eick: Questions and Topics Review, Dec. 6, 2012

Christoph F. Eick
Questions and Topics Review Dec. 6, 2012

1. Compare AGNES/hierarchical clustering with K-means; what are the main differences?
2. Compute the Silhouette of the following clustering that consists of 2 clusters: {(0,0), (0,1), (2,2)} and {(3,2), (3,3)}. Assume Manhattan distance is used (a worked sketch follows this slide).
   Silhouette: for an individual point i
   - Calculate a = average distance of i to the points in its cluster
   - Calculate b = min (average distance of i to points in another cluster)
   - The silhouette coefficient for the point is then: s = (b-a)/max(a,b)
3. APRIORI has been generalized for mining sequential patterns. How is the APRIORI property defined and used in the context of sequence mining?
4. Assume the Apriori-style sequence mining algorithm described in the textbook (pages …) is used and the algorithm generated the 3-sequences listed below (see 2007 Final Exam!):
   Frequent 3-sequences | Candidate Generation | Candidates that survived pruning
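As a quick check for question 2, here is a minimal sketch (not part of the original slides) that computes the silhouette coefficients for exactly these five points under Manhattan distance; the function names are ours.

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def silhouette(clusters):
    scores = {}
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            # a = average distance of p to the other points in its own cluster
            a = sum(manhattan(p, q) for q in cluster if q != p) / (len(cluster) - 1)
            # b = minimum, over the other clusters, of the average distance
            #     of p to the points in that cluster
            b = min(sum(manhattan(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores[p] = (b - a) / max(a, b)
    return scores

clusters = [[(0, 0), (0, 1), (2, 2)], [(3, 2), (3, 3)]]
for point, s in sorted(silhouette(clusters).items()):
    print(point, round(s, 3))
# (2, 2) comes out negative (about -0.571): it is closer on average
# to the other cluster (b = 1.5) than to its own cluster (a = 3.5).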

Christoph F. Eick
Answers

1.
a. AGNES creates a set of clusterings/a dendrogram; K-means creates a single clustering.
b. K-means forms clusters using an iterative procedure that minimizes an objective function; AGNES forms the dendrogram by merging the closest 2 clusters until a single cluster is obtained.
c. …

3. If a k-sequence s is frequent, then all its (k-1)-subsequences are frequent.
How is this used by the algorithm?
a. Frequent k-sequences are computed by combining frequent (k-1)-sequences.
b. For subset pruning: if a k-sequence s has a (k-1)-subsequence which is not frequent, then s is not frequent (a sketch of this check follows).
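A minimal sketch of the pruning check in 3b, using the definition on the later Dec. 1 slide that (k-1)-subsequences are obtained by removing a single item. The sequence encoding is an assumption of ours, not from the slides: a sequence is a tuple of element tuples, so ((1, 2), (4,)) stands for <(1 2)(4)>.

def drop_one_item(seq):
    """Yield every (k-1)-subsequence obtained by removing a single item."""
    for i, element in enumerate(seq):
        for j in range(len(element)):
            shrunk = element[:j] + element[j + 1:]
            yield seq[:i] + ((shrunk,) if shrunk else ()) + seq[i + 1:]

def survives_pruning(candidate, frequent):
    """Apriori property: a candidate k-sequence can only be frequent if
    every (k-1)-subsequence is frequent; otherwise it is pruned."""
    return all(sub in frequent for sub in drop_one_item(candidate))

# <(1 2)(4)> survives if <(2)(4)>, <(1)(4)> and <(1 2)> are all frequent:
frequent_2 = {((2,), (4,)), ((1,), (4,)), ((1, 2),)}
print(survives_pruning(((1, 2), (4,)), frequent_2))  # True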

Christoph F. Eick
Questions and Topics Review Dec. 6, 2012 (continued)

5. The Top 10 Data Mining Algorithms article says about k-means: "The greedy-descent nature of k-means on a non-convex cost also implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations… The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids." Explain why the suggestion in boldface is a potential solution to the local minimum problem. Propose a modification of the k-means algorithm that uses the suggestion!
6. What is the role of slack variables in the linear SVM/nonseparable approach (textbook pages …); what do they measure? What properties of hyperplanes are maximized by the objective function f(w) (on page 268) in the approach? (See the note below.)
7. Give the equation system that PAGERANK would use for the webpage structure given below. Give a sketch of an approach that determines the page rank of the 4 pages from the equation system!
8. What is a data warehouse; how is it different from a traditional database?
9. Example essay-style question: Assume you own an online book store which sells books over the internet. How can your business benefit from data mining? Limit your answer to 7-10 sentences!
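Note for question 6: a reminder of the soft-margin objective in its standard form (the exact notation on page 268 of the textbook may differ slightly, e.g. the penalty term may be raised to a power k):

f(w) = ||w||^2 / 2 + C * (Σi ξi)

Each slack variable ξi measures by how much training example xi violates its margin constraint (ξi = 0 for a point on the correct side of the margin). Minimizing ||w||^2/2 maximizes the margin width 2/||w||, so minimizing f(w) trades a wide margin against the total amount of margin violation.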

Christoph F. Eick
Sample Network Structure
[Figure: a sample web graph over pages P1, P2, P3, P4]

PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
where T1, …, Tn are the pages linking to A and C(Ti) is the number of outgoing links of Ti.

PR(P4) = (1-d) + d*PR(P3)/3
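Sketch for question 7. The figure itself did not survive extraction, so the link structure below is a hypothetical stand-in, chosen only so that P3 has three outlinks with one pointing to P4, consistent with the slide's PR(P4) equation; the approach, fixed-point iteration on the equation system, is the part that matters.

links = {                       # page -> pages it links to (assumed structure)
    "P1": ["P2"],
    "P2": ["P1", "P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P3"],
}
d = 0.85                        # damping factor

# Solve PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages by iterating
# from PR = 1 for every page until the values stop changing.
pr = {p: 1.0 for p in links}
for _ in range(100):
    new = {p: (1 - d) for p in links}
    for src, outs in links.items():
        for dst in outs:
            new[dst] += d * pr[src] / len(outs)
    if max(abs(new[p] - pr[p]) for p in links) < 1e-8:
        pr = new
        break
    pr = new

for p in sorted(pr):
    print(p, round(pr[p], 4))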

Christoph F. Eick
Questions and Topics Review Dec. 1, 2012

1. Give an example of a problem that might benefit from feature creation.
2. Compute the Silhouette of the following clustering that consists of 2 clusters: {(0,0), (0,1), (2,2)} and {(3,2), (3,3)}.
   Silhouette: for an individual point i
   - Calculate a = average distance of i to the points in its cluster
   - Calculate b = min (average distance of i to points in another cluster)
   - The silhouette coefficient for the point is then: s = (b-a)/max(a,b)
3. APRIORI has been generalized for mining sequential patterns. How is the APRIORI property defined and used in the context of sequence mining?
   Property: see textbook. [2]
   Use: Combine sequences that are frequent and which agree in all elements except the first element of the first sequence and the last element of the second sequence. Prune sequences if not all subsequences that can be obtained by removing a single element are frequent. [3] (See the candidate-generation sketch after the next slide.)
4. Assume the Apriori-style sequence mining algorithm described in the textbook (pages …) is used and the algorithm generated the 3-sequences listed below:
   Frequent 3-sequences | Candidate Generation | Candidates that survived pruning

Christoph F. Eick
Questions and Topics Review Dec. 1, 2012 (continued)

4. Assume the Apriori-style sequence mining algorithm described in the textbook (pages …) is used and the algorithm generated the 3-sequences listed below:
   Frequent 3-sequences | Candidate Generation | Candidates that survived pruning

3) Association Rule and Sequence Mining [15]
a) Assume the Apriori-style sequence mining algorithm described in the textbook (pages …) is used and the algorithm generated the 3-sequences listed below:
   Frequent 3-sequences | Candidate Generation | Candidates that survived pruning
   What candidate 4-sequences are generated from this 3-sequence set? Which of the generated 4-sequences survive the pruning step? Use the format of Figure 7.6 in the textbook on page 435 to describe your answer! [7]

   Candidates that survived pruning: …
   Candidate Generation:
   - …: survived
   - …: pruned, (1 3)(4) is infrequent
   - …: pruned, (1)(4 5) is infrequent
   - …: pruned, (1 2)(4) is infrequent
   - …: pruned, (2)(4 5) is infrequent

Grading notes: What if the answers are correct, but this part of the description isn't given? Do I need to take any points off? Give an extra point if the explanation is correct and present; otherwise subtract a point; more than 2 errors: 2 points or less!
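Since the concrete sequences above were elided, here is a hedged sketch of the two steps the question exercises, join then prune, on a made-up frequent 3-sequence set. The encoding and names are ours (a sequence is a tuple of element tuples, so ((1, 2), (4,)) stands for <(1 2)(4)>), and the join/prune rules follow the slide's description rather than any specific library.

def drop_first(seq):
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())

def join(s1, s2):
    """Join step: merge s1 and s2 when s1 without its first item equals
    s2 without its last item (the rule quoted on the previous slide)."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1]
    if len(last) == 1:                          # last item starts a new element
        return s1 + (last,)
    return s1[:-1] + (s1[-1] + (last[-1],),)    # last item extends the final element

def subsequences(seq):
    """All subsequences obtained by removing a single item (used for pruning)."""
    for i, el in enumerate(seq):
        for j in range(len(el)):
            rest = el[:j] + el[j + 1:]
            yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]

def generate_candidates(freq_k):
    """Join every compatible pair, then prune any candidate that has an
    infrequent subsequence (the Apriori property)."""
    out = set()
    for s1 in freq_k:
        for s2 in freq_k:
            c = join(s1, s2)
            if c is not None and all(sub in freq_k for sub in subsequences(c)):
                out.add(c)
    return out

# Made-up frequent 3-sequences: <(1)(2)(3)>, <(2)(3)(4)>, <(1)(3)(4)>, <(1)(2)(4)>
freq3 = {((1,), (2,), (3,)), ((2,), (3,), (4,)),
         ((1,), (3,), (4,)), ((1,), (2,), (4,))}
print(generate_candidates(freq3))   # {((1,), (2,), (3,), (4,))} survives pruning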

Christoph F. Eick
5. The Top 10 Data Mining Algorithms article says about k-means: "The greedy-descent nature of k-means on a non-convex cost also implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations… The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids." Explain why the suggestion in boldface is a potential solution to the local minimum problem. Propose a modification of the k-means algorithm that uses the suggestion!

Running k-means with different seeds will find different local minima of k-means' objective function; therefore, running k-means with initial seeds that lie in the basins of different local minima will produce alternative results. [2]
Modification: run k-means with different seeds multiple times (e.g. 20 times), compute the SSE of each clustering, and return the clustering with the lowest SSE value as the result (see the sketch below). [3]
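A minimal sketch of the proposed modification (restart k-means, keep the lowest-SSE run), written with NumPy; the function names and the toy data are ours, not from the slides.

import numpy as np

def kmeans(X, k, rng, iters=100):
    """One run of Lloyd's algorithm from a random seed."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster goes empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # final assignment and SSE of this run
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    sse = float((dist[np.arange(len(X)), labels] ** 2).sum())
    return centroids, labels, sse

def restart_kmeans(X, k, runs=20, seed=0):
    """Proposed modification: run k-means `runs` times, return the lowest-SSE result."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng) for _ in range(runs)), key=lambda r: r[2])

# Example: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, labels, sse = restart_kmeans(X, k=2)
print("best SSE over 20 runs:", round(sse, 2))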