Christoph F. Eick Questions and Topics Review Dec. 1, 2011


Christoph F. Eick Questions and Topics Review Dec. 1, 2011

1. Give an example of a problem that might benefit from feature creation.
2. Compute the Silhouette of the following clustering, which consists of 2 clusters: {(0,0), (0,1), (2,2)} and {(3,2), (3,3)}. Assume Manhattan distance is used. (A worked sketch in code follows this slide.)
   Silhouette: for an individual point i,
   – calculate a = the average distance of i to the points in its own cluster
   – calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster
   – the silhouette coefficient for the point is then s = (b - a) / max(a, b)
3. APRIORI has been generalized for mining sequential patterns. How is the APRIORI property defined and used in the context of sequence mining?
4. Assume the Apriori-style sequence mining algorithm described in the textbook is used and the algorithm generated the 3-sequences listed below (see 2007 Final Exam!).
   [Table lost in extraction; its columns were: Frequent 3-sequences | Candidate Generation | Candidates that survived pruning]
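Below is a minimal computational sketch of question 2 in Python; the two clusters and the Manhattan distance come from the question, while the function names are illustrative rather than from the slides.

```python
# Silhouette for the clustering in question 2, under Manhattan (L1) distance.

def manhattan(p, q):
    """L1 distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def silhouette(point, own_cluster, other_clusters):
    """s = (b - a) / max(a, b) for a single point."""
    # a: average distance to the other points in the point's own cluster
    others = [q for q in own_cluster if q != point]
    a = sum(manhattan(point, q) for q in others) / len(others)
    # b: minimum over the other clusters of the average distance to that cluster
    b = min(sum(manhattan(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

c1 = [(0, 0), (0, 1), (2, 2)]
c2 = [(3, 2), (3, 3)]
for p in c1:
    print(p, round(silhouette(p, c1, [c2]), 3))
for p in c2:
    print(p, round(silhouette(p, c2, [c1]), 3))
# (2,2) gets a negative silhouette (about -0.571): on average it is
# closer to the other cluster than to its own.
```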

Christoph F. Eick Questions and Topics Review Dec. 1, 2011

1. Give an example of a problem that might benefit from feature creation.
2. Compute the Silhouette of the following clustering, which consists of 2 clusters: {(0,0), (0,1), (2,2)} and {(3,2), (3,3)}.
   Silhouette: for an individual point i,
   – calculate a = the average distance of i to the points in its own cluster
   – calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster
   – the silhouette coefficient for the point is then s = (b - a) / max(a, b)
3. APRIORI has been generalized for mining sequential patterns. How is the APRIORI property defined and used in the context of sequence mining?
   Property: see textbook. [2]
   Use: combine frequent sequences that agree in all items except the first item of the first sequence and the last item of the second sequence. Prune a candidate if not all of the subsequences that can be obtained from it by removing a single item are frequent. [3] (A sketch of this merge-and-prune step follows this slide.)
4. Assume the Apriori-style sequence mining algorithm described in the textbook is used and the algorithm generated the 3-sequences listed below.
   [Table lost in extraction; its columns were: Frequent 3-sequences | Candidate Generation | Candidates that survived pruning]
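To make the merge-and-prune step in question 3 concrete, here is a hedged sketch, not the textbook's code: a k-sequence is represented as a tuple of tuples, each inner tuple being one element (itemset), and the helper names (drop_first, drop_last, merge, survives_pruning) are illustrative.

```python
# Sketch of GSP-style candidate generation and Apriori pruning for sequences.
# A k-sequence is a tuple of elements; each element is a tuple of items,
# so <{1,2}{3}> is written ((1, 2), (3,)). The k=2 base case, which needs
# special handling in GSP, is omitted here.

def drop_first(seq):
    """The sequence with the first item of its first element removed."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    """The sequence with the last item of its last element removed."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())

def merge(s1, s2):
    """Merge two frequent k-sequences into a (k+1)-candidate, or return None.

    s1 and s2 merge iff dropping the first item of s1 yields the same
    sequence as dropping the last item of s2.
    """
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1]
    if len(last) == 1:                 # the new item was an element by itself,
        return s1 + (last,)            # so it becomes a new element
    return s1[:-1] + (s1[-1] + (last[-1],),)  # else it joins s1's last element

def one_item_subsequences(seq):
    """All subsequences obtained by removing a single item."""
    for i, elem in enumerate(seq):
        for j in range(len(elem)):
            reduced = elem[:j] + elem[j + 1:]
            yield seq[:i] + ((reduced,) if reduced else ()) + seq[i + 1:]

def survives_pruning(candidate, frequent):
    """Apriori property: every subsequence of a frequent sequence must itself
    be frequent, so a candidate is pruned if any one-item-removed subsequence
    is missing from the frequent set."""
    return all(sub in frequent for sub in one_item_subsequences(candidate))

# e.g. merging <(1)(2)(3)> with <(2)(3)(4)> yields the candidate <(1)(2)(3)(4)>
print(merge(((1,), (2,), (3,)), ((2,), (3,), (4,))))
```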

Christoph F. Eick Questions and Topics Review Dec. 1, 2011

4. Assume the Apriori-style sequence mining algorithm described in the textbook is used and the algorithm generated the 3-sequences listed below:
   [Table lost in extraction; its columns were: Frequent 3-sequences | Candidate Generation | Candidates that survived pruning]

3) Association Rule and Sequence Mining [15]
a) Assume the Apriori-style sequence mining algorithm described in the textbook is used and the algorithm generated the 3-sequences listed below:
   Candidates that survived pruning: [list lost in extraction]
   Candidate generation (the candidate 4-sequences themselves were lost; only the annotations survive):
   – survived
   – pruned, (1 3)(4) is infrequent
   – pruned, (1)(4 5) is infrequent
   – pruned, (1 2)(4) is infrequent
   – pruned, (2)(4 5) is infrequent

Grading note: What if the answers are correct but this part of the description isn't given? Do I need to take any points off? Give an extra point if the explanation is correct and present; otherwise subtract a point; more than 2 errors: 2 points or less!

   [Same table as above: Frequent 3-sequences | Candidate Generation | Candidates that survived pruning]
   What candidate 4-sequences are generated from this 3-sequence set? Which of the generated 4-sequences survive the pruning step? Use the format of Figure 7.6 in the textbook (page 435) to describe your answer! [7]

Christoph F. Eick Questions and Topics Review Dec. 1, 2011

5. The Top 10 Data Mining Algorithms article says about k-means: "The greedy-descent nature of k-means on a non-convex cost also implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations… The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids." Explain why the suggestion in boldface (running the algorithm multiple times with different initial centroids) is a potential solution to the local minimum problem. Propose a modification of the k-means algorithm that uses the suggestion!

Christoph F. Eick

5. The Top 10 Data Mining Algorithms article says about k-means: "The greedy-descent nature of k-means on a non-convex cost also implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations… The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids." Explain why the suggestion in boldface (running the algorithm multiple times with different initial centroids) is a potential solution to the local minimum problem. Propose a modification of the k-means algorithm that uses the suggestion!

Running k-means with different seeds finds different local minima of the k-means objective function; therefore, running k-means with initial seeds that lie in the basins of different local minima will produce alternative results. [2]
Run k-means multiple times with different seeds (e.g. 20 times), compute the SSE of each resulting clustering, and return the clustering with the lowest SSE as the result. [3] (A sketch of this restart scheme follows.)
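A minimal sketch of the proposed modification, assuming Python with NumPy; kmeans_once stands in for any standard single run of Lloyd's algorithm, and the function and parameter names are illustrative.

```python
import numpy as np

def kmeans_once(X, k, rng, n_iters=100):
    """One run of Lloyd's algorithm from a random seeding; returns (labels, SSE)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assignment step: each point goes to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute centroids (keep the old one if a cluster empties)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # final assignment and SSE under the final centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    return labels, dists[np.arange(len(X)), labels].sum()

def kmeans_restarts(X, k, n_runs=20, seed=0):
    """Run k-means n_runs times with different seeds; keep the lowest-SSE clustering."""
    rng = np.random.default_rng(seed)
    best_labels, best_sse = None, np.inf
    for _ in range(n_runs):
        labels, sse = kmeans_once(X, k, rng)
        if sse < best_sse:
            best_labels, best_sse = labels, sse
    return best_labels, best_sse

# e.g. on the five points from question 2:
X = np.array([(0, 0), (0, 1), (2, 2), (3, 2), (3, 3)], dtype=float)
print(kmeans_restarts(X, k=2))
```

Each restart may converge to a different local minimum of the SSE, so keeping the lowest-SSE run implements the multiple-restart remedy quoted above.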