Chapter 11 Automatic Cluster Detection. 2 Data Mining Techniques So Far… Chapter 5 – Statistics Chapter 6 – Decision Trees Chapter 7 – Neural Networks.

Slides:



Advertisements
Similar presentations
CLUSTERING.
Advertisements

What is Statistics? Chapter One GOALS ONE
0 - 0.
Nearest Neighbour Analysis
1 Week 1 Review of basic concepts in statistics handout available at Trevor Thompson.
The Poisson distribution
MEAN AND VARIANCE OF A DISTRIBUTION
- A Powerful Computing Technology Department of Computer Science Wayne State University 1.
Copyright Jiawei Han, modified by Charles Ling for CS411a
1. 2 GUIDELINES 1. Identify the variable(s) of interest (the focus) and the population of the study. 2. Develop a detailed plan for collecting data. If.
Section 1-2 Data Classification
Year 10 Pathway C Mr. D. Patterson 1. Describe the concepts of distance, displacement, speed and velocity Use the formula v=s/t to calculate the unknown.
Chapter 7 Classification and Regression Trees
Test B, 100 Subtraction Facts
Chapter 11: The t Test for Two Related Samples
Experimental Design and Analysis of Variance
Multiple Regression and Model Building
Information Extraction Lecture 7 – Linear Models (Basic Machine Learning) CIS, LMU München Winter Semester Dr. Alexander Fraser, CIS.
Clustering Basic Concepts and Algorithms
Decision Tree Algorithm
Chapter 2: Pattern Recognition
Data Mining.
Classical Techniques: Statistics, Neighborhoods, and Clustering.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Data Mining: A Closer Look Chapter Data Mining Strategies (p35) Moh!
Database Processing for Business Intelligence Systems
 The Law of Large Numbers – Read the preface to Chapter 7 on page 388 and be prepared to summarize the Law of Large Numbers.
Chapter 5 Data mining : A Closer Look.
Introduction to Data Mining Data mining is a rapidly growing field of business analytics focused on better understanding of characteristics and.
Enterprise systems infrastructure and architecture DT211 4
Data Mining Techniques
Kansas State University Department of Computing and Information Sciences KDD Group Presentation #7 Fall’01 Friday, November 9, 2001 Cecil P. Schmidt Department.
Chapter 13 Genetic Algorithms. 2 Data Mining Techniques So Far… Chapter 5 – Statistics Chapter 6 – Decision Trees Chapter 7 – Neural Networks Chapter.
DATA MINING CLUSTERING K-Means.
Smith/Davis (c) 2005 Prentice Hall Chapter Four Basic Statistical Concepts, Frequency Tables, Graphs, Frequency Distributions, and Measures of Central.
Chapter 7 Neural Networks in Data Mining Automatic Model Building (Machine Learning) Artificial Intelligence.
Introduction, or what is data mining? Introduction, or what is data mining? Data warehouse and query tools Data warehouse and query tools Decision trees.
Some working definitions…. ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably Data mining = –the discovery of interesting,
Chapter 1 Introduction to Statistics. Statistical Methods Were developed to serve a purpose Were developed to serve a purpose The purpose for each statistical.
Business Intelligence Systems Appendix J DAVID M. KROENKE and DAVID J. AUER DATABASE CONCEPTS, 6 th Edition.
Sullivan – Fundamentals of Statistics – 2 nd Edition – Chapter 4 Section 2 – Slide 1 of 20 Chapter 4 Section 2 Least-Squares Regression.
CLUSTERING. Overview Definition of Clustering Existing clustering methods Clustering examples.
1 Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 12 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
An Investigation of Commercial Data Mining Presented by Emily Davis Supervisor: John Ebden.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
By Timofey Shulepov Clustering Algorithms. Clustering - main features  Clustering – a data mining technique  Def.: Classification of objects into sets.
K-Means Algorithm Each cluster is represented by the mean value of the objects in the cluster Input: set of objects (n), no of clusters (k) Output:
DATA MINING WITH CLUSTERING AND CLASSIFICATION Spring 2007, SJSU Benjamin Lam.
Compiled By: Raj Gaurang Tiwari Assistant Professor SRMGPC, Lucknow Unsupervised Learning.
Variability Introduction to Statistics Chapter 4 Jan 22, 2009 Class #4.
Data Mining By Farzana Forhad CS 157B. Agenda Decision Tree and ID3 Rough Set Theory Clustering.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
1 Automatic Cluster Detection Automatic Cluster Detection is useful to find “better behaved” clusters of data within a larger dataset; seeing the forest.
Chapter 1: Section 2-4 Variables and types of Data.
Knowledge Discovery and Data Mining 19 th Meeting Course Name: Business Intelligence Year: 2009.
Monday, February 22,  The term analytics is often used interchangeably with:  Data science  Data mining  Knowledge discovery  Extracting useful.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
Nearest Neighbour and Clustering. Nearest Neighbour and clustering Clustering and nearest neighbour prediction technique was one of the oldest techniques.
Statistics Vocabulary. 1. STATISTICS Definition The study of collecting, organizing, and interpreting data Example Statistics are used to determine car.
CLUSTER ANALYSIS. Cluster Analysis  Cluster analysis is a major technique for classifying a ‘mountain’ of information into manageable meaningful piles.
Unsupervised Learning
Data Mining 101 with Scikit-Learn
Data Mining Techniques So Far…
CSCI N317 Computation for Scientific Applications Unit Weka
Data Mining – Chapter 4 Cluster Analysis Part 2
Topic 5: Cluster Analysis
Clustering The process of grouping samples so that the samples are similar within each group.
CSE572: Data Mining by H. Liu
Unsupervised Learning
Presentation transcript:

Chapter 11 Automatic Cluster Detection

2 Data Mining Techniques So Far… Chapter 5 – Statistics Chapter 6 – Decision Trees Chapter 7 – Neural Networks Chapter 8 – Nearest Neighbor Approaches: Memory- Based Reasoning and Collaborative Filtering Chapter 9 – Market Basket Analysis & Association Rules Chapter 10 – Link Analysis

3 Automatic Cluster Detection DM techniques used to find patterns in data –Not always easy to identify No observable pattern Too many patterns Decomposition (break down into smaller pieces) [example: Olympics] Automatic Cluster Detection is useful to find “better behaved” clusters of data within a larger dataset; seeing the forest without getting lost in the trees

4 Automatic Cluster Detection K-Means clustering algorithm – similar to nearest neighbor techniques (memory-based-reasoning and collaborative filtering) – depends on a geometric interpretation of the data Other automatic cluster detection (ACD) algorithms include: –Gaussian mixture models –Agglomerative clustering –Divisive clustering –Self-organizing maps (SOM) – Ch. 7 – Neural Nets ACD is a tool used primarily for undirected data mining –No preclassified training data set –No distinction between independent and dependent variables When used for directed data mining –Marketing clusters referred to as “segments” –Customer segmentation is a popular application of clustering ACD rarely used in isolation – other methods follow up

5 Clustering Examples “Star Power” ~ 1910 Hertzsprung-Russell Group of Teens 1990’s US Army – women’s uniforms: 100 measurements for each of 3,000 women Using K-means algorithm reduced to a handful

6 K-means Clustering “K” – circa 1967 – this algorithm looks for a fixed number of clusters which are defined in terms of proximity of data points to each other How K-means works (see next slide figures): –Algorithm selects K (3 in figure 11.3) data points randomly –Assigns each of the remaining data points to one of K clusters (via perpendicular bisector) –Calculate the centroids of each cluster (uses averages in each cluster to do this)

7 K-means Clustering

8 Resulting clusters describe underlying structure in the data, however, there is no one right description of that structure (Ex: Figure 11.6 – playing cards K=2, K=4)

9 K-means Clustering Demo Clustering demo: –

10 Similarity & Difference Automatic Cluster Detection is quite simple for a software program to accomplish – data points, clusters mapped in space However, business data points are not about points in space but about purchases, phone calls, airplane trips, car registrations, etc. which have no obvious connection to the dots in a cluster diagram

11 Similarity & Difference Clustering business data requires some notion of natural association – records (data) in a given cluster are more similar to each other than to those in another cluster For DM software, this concept of association must be translated into some sort of numeric measure of the degree of similarity Most common translation is to translate data values (eg., gender, age, product, etc.) into numeric values so can be treated as points in space If two points are close in geometric sense then they represent similar data in the database

12 Similarity & Difference Business variable (fields) types: –Categorical (eg., mint, cherry, chocolate) –Ranks (eg., freshman, soph, etc. or valedictorian, salutatorian) –Intervals (eg., 56 degrees, 72 degrees, etc) –True measures – interval variables that measure from a meaningful zero point Fahrenheit, Celsius not good examples Age, weight, height, length, tenure are good Geometric standpoint the above variable types go from least effective to most effective (top to bottom) Finally, there are dozens/hundreds of published techniques for measuring the similarity of two data records

13 Other Approaches to Cluster Detection Gaussian Mixture Models Agglomerative Clustering Divisive Clustering Self-Organizing Maps (SOM) [Chapter 7]

14 Evaluating Clusters What does it mean to say that a cluster is “good”? –Clusters should have members that have a high degree of similarity –Standard way to measure within-cluster similarity is variance* – clusters with lowest variance is considered best –Cluster size is also important so alternate approach is to use average variance** * The sum of the squared differences of each element from the mean ** The total variance divided by the size of the cluster

15 Evaluating Clusters Finally, if detection identifies good clusters along with weak ones it could be useful to set the good ones aside (for further study) and run the analysis again to see if improved clusters are revealed from only the weaker ones

16 Case Study: Clustering Towns Review using book, pp “Best” based on delivery penetration “2 nd Best” based on delivery penetration Cluster 2 Cluster 1B Cluster 1AB

17 End of Chapter 11