Data Clustering 1 – An introduction


Data Clustering 1 – An introduction Slide 1

The Data Explosion
“If you feel like you are drowning in information, it’s because you are.”
- Advances in IT and the Internet
- Massive increase in the ability to:
  - Record: electronic records and forms
  - Store: data warehouses (as we have seen)
  - Analyse: data mining and visualisation (more later)
- Risk of information overload

The Aims of Data Mining
- Classification: categorising the risk-return of stocks
- Association: identifying products that tend to sell together
- Detection: identifying profiles of customers
- Prediction: forecasting market performance

Database Technology Timeline
- 1960s: Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s to 2000s: Data mining and data warehousing, multimedia databases, and Web databases

From Data to Knowledge
It is common to break down the process of learning from data into three stages: data, information and knowledge.

From Data to Knowledge
- Data: raw numbers
- Information: data with context or meaning
- Knowledge: data structures / patterns (knowledge must be useful)

Data Mining / Intelligent Data Analysis
“Data mining is applying Machine Learning techniques to historical data to improve future decisions” (Tom Mitchell, 1997)

Knowledge Discovery
Knowledge Discovery in Databases (KDD)
The process (from Advances in KDD and Data Mining):
Data -> Target Data -> Pre-processed Data -> Transformed Data -> Patterns -> Knowledge

Data Mining - Tools
Typical tools:
- Statistical analysis
- Summarisation
- Outlier detection
- Correlation
- Regression
- Clustering
- Association rules
- Time series models
- Decision trees (classification)

Data Mining - Applications
Some successful examples of its use:
- Pharmaceutical companies: drug discovery
- Credit card companies: fraud detection
- Transportation companies: routing
- Large consumer packaged goods companies: improving the sales process to retailers
- Hospital organisation: decision analysis

Examples of Data Mining Tools
We will now look at some core techniques commonly used for analysing and mining business warehouses:
- Correlation
- Visualisation
- Clustering
- Regression

Clustering
An example in biology:
- Animals: things that are brown and run away
- Plants: things that are green and don’t run away
Notes: clustering is a basic learning algorithm, and we do clustering everywhere in our lives. This example from biology introduces the three concepts in pattern recognition: patterns, features and classes.

Clustering
An example in biology: Kingdom, Phylum, Class, Order, Family, Genus, Species
Hierarchical clustering (more later)

Clustering
The process:
- Extract features (colour, movement, sensory organs, etc.): more later
- Cluster into categories
- Consolidation

Clustering
- Clustering: to partition a data set into subsets (clusters) so that the data in each subset share some common trait, often similarity or proximity under some defined distance measure.
- Equivalently: the process of organising objects into groups whose members are similar in some way.
- A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
- Unsupervised: no need for ‘teacher’ signals, i.e. the desired output.
[Figure: patterns plotted on axes x1 and x2, forming Cluster 1 and Cluster 2.]
Notes: extend the concept from living beings to the general case; key issues of clustering analysis.

Supervised and Unsupervised Learning
- Unsupervised learning: learning without the desired output (‘teacher’ signals).
- Supervised learning: learning with the desired output.
- Clustering is one of the most widely used unsupervised learning methods.
- Other unsupervised learning:
  - Dimensionality reduction (factor analysis, principal component analysis, independent component analysis, ...)
  - Time series modelling
  - Source separation
- Supervised learning:
  - Classification
  - Regression
Notes: briefly introduce the concepts of supervised and unsupervised learning. A common misunderstanding is that clustering = unsupervised learning; clustering is only one kind of unsupervised learning.

Patterns, Clusters and Features (1)
- Patterns: physical objects
- Clusters: categories of objects (e.g. animals, plants)
- Features: attributes of objects (e.g. colour: brown, green, ...)
Notes: before introducing how to perform clustering, cover the basic concepts: patterns, clusters and features.

Patterns, Clusters and Features (2)
Feature space: creating vehicle clusters
[Figure: scatter plot of vehicles in feature space, with weight [kg] (feature 1, roughly 500 to 3500) against top speed [ml/h] (feature 2, roughly 100 to 300); the clusters are labelled lorries, sports cars and medium market cars.]
Notes: another example, vehicle clustering. Individual cars are the patterns (objects); weight and top speed are the features; lorries, sports cars and medium market cars are the clusters.
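To make the vehicle example concrete, the sketch below represents each vehicle as a two-feature vector (weight, top speed). The vehicle names and numbers are invented for illustration and are not read from the slide's plot. The min-max rescaling step is an assumption added here, not something the slide states: without it, weight (in the thousands) would dominate any distance computed against speed (in the hundreds).

```python
# Hypothetical vehicles: (weight in kg, top speed). Values are illustrative only.
vehicles = {
    "lorry_a": (3200.0, 110.0),
    "lorry_b": (2900.0, 120.0),
    "sports_a": (1400.0, 280.0),
    "sports_b": (1500.0, 260.0),
    "family_a": (1200.0, 170.0),
    "family_b": (1300.0, 180.0),
}


def min_max_scale(vectors):
    """Rescale every feature to [0, 1]; assumes each feature takes at least two values."""
    dims = len(vectors[0])
    lows = [min(v[i] for v in vectors) for i in range(dims)]
    highs = [max(v[i] for v in vectors) for i in range(dims)]
    return [
        tuple((v[i] - lows[i]) / (highs[i] - lows[i]) for i in range(dims))
        for v in vectors
    ]


names = list(vehicles)
scaled = min_max_scale([vehicles[name] for name in names])
for name, vector in zip(names, scaled):
    print(f"{name}: weight={vector[0]:.2f}, speed={vector[1]:.2f}")
```

Each pattern is now a point in a two-dimensional feature space, which is the form a clustering algorithm works on.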

Social networks
- Marketing
- Terror networks
- Allocation of resources in a company / university

Gene networks
- Understanding gene interactions
- Identifying important genes linked to disease

How to do clustering?
What we know:
- Patterns are represented by their feature vectors, e.g. $x = (x_1, x_2)$.
- General case: $x = (x_1, \dots, x_d)$ is in the d-dimensional domain of the feature vectors.
What we need to find out:
- The number of clusters
- The clusters, in a form that is easy to compute with
[Figure: patterns plotted on axes x1 and x2, forming Cluster 1 and Cluster 2.]
Notes: examples of patterns and feature vectors include the animals and the cars.

Pattern Similarity
- A key concept in clustering is similarity: clusters are formed by similar patterns.
- In computer science, we need to define a metric to measure similarity. One commonly adopted similarity metric is distance.
- A general definition of the distance between patterns A and B, where $A_i$ and $B_i$ are their i-th features: $d(A, B) = \left( \sum_{i=1}^{d} |A_i - B_i|^{b} \right)^{1/b}$
  - b = 2: Euclidean distance
  - b = 1: Manhattan distance
- The shorter the distance, the more similar the two patterns.
Notes: similarity is a key concept in clustering and in many other pattern recognition techniques. Similarity is inversely related to distance; this sometimes presents problems.
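The general distance with exponent b is usually called the Minkowski distance. The sketch below assumes that is the formula the slide intends. The function name and the parameter name `exponent` (standing in for the slide's b) are chosen here for illustration.

```python
def minkowski_distance(a, b, exponent):
    """Distance between feature vectors a and b; exponent plays the role of the slide's b."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return sum(abs(x - y) ** exponent for x, y in zip(a, b)) ** (1.0 / exponent)


pattern_a = (0.0, 0.0)
pattern_b = (3.0, 4.0)

euclidean = minkowski_distance(pattern_a, pattern_b, 2)  # straight-line distance
manhattan = minkowski_distance(pattern_a, pattern_b, 1)  # sum of per-feature differences
print(euclidean, manhattan)
```

For the same pair of patterns the Manhattan distance is never smaller than the Euclidean one, so the choice of exponent changes which patterns count as "close".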

Pattern Similarity & Distance Metrics
- Many methods are designed to work on distance metrics, e.g. K-Means.
- They assume that the triangle inequality holds: “the sum of the lengths of any two sides must be greater than the length of the remaining side”, i.e. $d(A, C) \le d(A, B) + d(B, C)$ for any patterns A, B and C.
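The triangle inequality can be checked numerically on a handful of points. This is a minimal sketch with invented points and helper names. It contrasts Euclidean distance, which is a metric, with squared Euclidean distance, which is used here only as an example of a dissimilarity measure that breaks the inequality.

```python
import itertools
import math


def euclidean(a, b):
    return math.dist(a, b)  # math.dist requires Python 3.8+


def squared_euclidean(a, b):
    return math.dist(a, b) ** 2


def satisfies_triangle_inequality(points, distance):
    """Check d(a, c) <= d(a, b) + d(b, c) for every ordered triple of points."""
    tolerance = 1e-12  # absorbs floating-point round-off
    return all(
        distance(a, c) <= distance(a, b) + distance(b, c) + tolerance
        for a, b, c in itertools.permutations(points, 3)
    )


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(satisfies_triangle_inequality(points, euclidean))          # True
print(satisfies_triangle_inequality(points, squared_euclidean))  # False
```

The squared version fails on the three collinear points: going from (0, 0) to (2, 0) costs 4, while the detour through (1, 0) costs only 1 + 1.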

Pattern Similarity & Distance Metrics
- Euclidean
- Correlation
- Minkowski
- Manhattan
- Mahalanobis
- Relationship metrics
How long is a piece of string? The choice is often application dependent.

K-Means Clustering

Algorithm 1: K-Means Clustering
1. Place K points into the feature space. These points represent the initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all patterns have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the assignments do not change.
Notes: this description is rather generic, and many issues are left unspecified: initialisation, assignment and updating. For example, what does the algorithm become if we use (1) Euclidean distance and (2) the average of the patterns?
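The four steps can be sketched in plain Python under the choices the notes suggest: Euclidean distance for assignment and the average of the assigned patterns for the centroid update. Everything else is an assumption made for this sketch, not part of the slide: the random initialisation from K of the patterns, the handling of empty clusters, the function names and the toy data.

```python
import math
import random


def mean(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def k_means(patterns, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Step 1: place K initial centroids (assumption: K patterns chosen at random).
    centroids = rng.sample(patterns, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each pattern to the closest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average of its assigned patterns.
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, assignments


# Toy data: two visibly separate groups of points.
patterns = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(patterns, k=2)
print(centroids)
print(assignments)
```

On this toy data the first three patterns end up in one cluster and the last three in the other. Which cluster gets label 0 depends on the random initialisation.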

K-Means Clustering
Interactive demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Discussions (1)
1. How to determine k, the number of clusters?

Discussions (2)
2. Any alternative ways of choosing the initial cluster centroids?
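One commonly cited alternative to purely random placement is k-means++ style seeding. The first centroid is a random pattern, and each later one is drawn with probability proportional to its squared distance from the nearest centroid already chosen. The sketch below illustrates that idea as one possible answer, not necessarily the one the lecture has in mind. The function name and data are hypothetical.

```python
import math
import random


def k_means_pp_init(patterns, k, seed=0):
    """Pick k initial centroids, favouring patterns far from those already chosen."""
    rng = random.Random(seed)
    centroids = [rng.choice(patterns)]
    while len(centroids) < k:
        # Squared distance from each pattern to its nearest chosen centroid.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in patterns]
        total = sum(weights)
        if total == 0.0:
            raise ValueError("fewer than k distinct patterns")
        # Sample one pattern with probability proportional to its weight.
        threshold = rng.random() * total
        cumulative = 0.0
        chosen = None
        for p, w in zip(patterns, weights):
            cumulative += w
            if w > 0.0 and cumulative > threshold:
                chosen = p
                break
        if chosen is None:  # guard against floating-point round-off
            chosen = patterns[weights.index(max(weights))]
        centroids.append(chosen)
    return centroids


patterns = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
initial_centroids = k_means_pp_init(patterns, k=3)
print(initial_centroids)
```

A pattern that coincides with an existing centroid has weight zero, so it cannot be picked again. This spreads the starting centroids out across the data.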

Discussions (3)
3. Does the algorithm converge to the same results with different selections of initial cluster centroids? If not, what should we do in practice?
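K-Means can settle on different clusterings from different starting centroids. A common practical response is to run it several times and keep the run with the lowest within-cluster sum of squared errors. The sketch below illustrates that practice under the same assumptions as a basic K-Means (Euclidean distance, mean update, random initial patterns). The function names and the four-point data set are invented for the example.

```python
import math
import random


def k_means(patterns, k, rng, max_iterations=100):
    """Basic K-Means: Euclidean assignment, mean update, random initial patterns."""
    centroids = rng.sample(patterns, k)
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


def sum_squared_error(patterns, centroids, assignments):
    """Total squared distance from each pattern to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(patterns, assignments))


def best_of_restarts(patterns, k, restarts=10, seed=0):
    """Run K-Means several times; return the lowest-error run and all errors."""
    rng = random.Random(seed)
    runs = []
    for _ in range(restarts):
        centroids, assignments = k_means(patterns, k, rng)
        error = sum_squared_error(patterns, centroids, assignments)
        runs.append((error, centroids, assignments))
    return min(runs, key=lambda run: run[0]), [run[0] for run in runs]


# Corners of a wide rectangle: splitting left/right is good, top/bottom is poor.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
best, all_errors = best_of_restarts(points, k=2)
print(all_errors)
print(best)
```

On this data a run that starts with both centroids on the same short side gets stuck with a top/bottom split (error 16), while any other start finds the left/right split (error 1). Keeping the best of several runs makes the poor outcome much less likely.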

Reading
- David Hand, “Principles of Data Mining”, MIT Press: Chapter 9, Section 9.3
- Pang-Ning Tan, “Introduction to Data Mining”: Chapter 8
- Anil Jain, “Data Clustering: 50 Years Beyond K-Means”, Pattern Recognition Letters

Lab
In the lab:
- Examine a piece of Java code for K-Means clustering
- Explore the use of K-Means on some toy datasets
- Visualise the clusterings using an Excel macro