Revision (Part II) Ke Chen


Revision (Part II) Ke Chen
These revision slides summarise all you have learnt from Part II. They should help you prepare for your exam in January, along with the non-assessed exercises also available from the teaching page.
COMP24111 Machine Learning

Generative Models and Naïve Bayes
- Probabilistic classifiers: discriminative vs. generative classifiers; Bayes' rule converts a generative model into a discriminative one, with the MAP rule used for decision making
- Naïve Bayes assumption: input attributes are conditionally independent given the class
- Naïve Bayes classification algorithm: discrete vs. continuous features
  - Training phase: estimate the conditional probability of each attribute given a class label, and the prior probability of each class label
  - Test phase: apply the MAP rule for decision making
- Relevant issues: zero conditional probabilities due to a shortage of training examples; applicability to problems that violate the naïve Bayes assumption
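The training and test phases above can be sketched in a few lines of Python for discrete features. This is a minimal illustration, not course code: the function names are invented, and Laplace smoothing (`alpha`) is one standard way to handle the zero-conditional-probability issue mentioned on the slide.

```python
from collections import Counter, defaultdict
import math

def train_nb(X, y, alpha=1.0):
    """Training phase: estimate priors P(c) and conditionals P(x_i = v | c).

    Laplace smoothing (alpha) avoids zero probabilities when a feature
    value never co-occurs with a class in the training set.
    """
    n = len(y)
    classes = Counter(y)                       # class label -> count
    priors = {c: cnt / n for c, cnt in classes.items()}
    counts = defaultdict(Counter)              # (feature, class) -> value counts
    values = defaultdict(set)                  # feature -> set of observed values
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[(i, c)][v] += 1
            values[i].add(v)

    def cond(i, v, c):
        return (counts[(i, c)][v] + alpha) / (classes[c] + alpha * len(values[i]))

    return priors, cond

def predict_nb(priors, cond, xs):
    """Test phase: MAP rule, argmax over c of log P(c) + sum_i log P(x_i | c)."""
    scores = {
        c: math.log(p) + sum(math.log(cond(i, v, c)) for i, v in enumerate(xs))
        for c, p in priors.items()
    }
    return max(scores, key=scores.get)
```

Working in log space, as here, avoids numerical underflow when many attribute probabilities are multiplied together.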

Clustering Analysis Basics
- Clustering analysis task: discover the "natural" number of clusters; group objects into "sensible" clusters
- Data type and representation: continuous vs. discrete (binary, ranking, ...); data matrix and distance matrix
- Distance metrics: Minkowski distance (Manhattan, Euclidean, ...) for continuous data; cosine measure for nonmetric data; distances for binary data via the contingency table, symmetric vs. asymmetric
- Major clustering approaches: partitioning, hierarchical, density-based, spectral, ensemble, ...
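The distance metrics listed above are straightforward to write down. A minimal sketch (illustrative function names, plain Python lists as vectors):

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    """Cosine measure: compares the angle between vectors, ignoring magnitude,
    which is why it suits nonmetric data such as term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

For example, `minkowski([0, 0], [3, 4], p=1)` gives the Manhattan distance 7, while `p=2` gives the Euclidean distance 5.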

K-Means Clustering
- Principle: a typical partitioning clustering approach, with an iterative process that minimises the squared distance within each cluster
- K-means algorithm:
  1) Initialisation: choose K centroids (seed points)
  2) Assign each data object to the cluster whose centroid is nearest
  3) Re-calculate the mean of each cluster to get an updated centroid
  4) Repeat 2) and 3) until no assignment changes
- Application: K-means based image segmentation
- Relevant issues: efficiency is O(tkn), where t, k << n; sensitive to initialisation and converges to a local optimum; other weaknesses and limitations
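The four steps above can be sketched directly in Python. This is a bare illustration under simplifying assumptions (points as tuples, squared Euclidean distance, random points as seeds), not a production implementation:

```python
import random

def kmeans(data, k, max_iter=100, seed=0):
    """Plain K-means on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)            # 1) choose K seed points
    for _ in range(max_iter):
        # 2) assign each object to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(
                range(k),
                key=lambda j: sum((a - c) ** 2 for a, c in zip(x, centroids[j])),
            )
            clusters[nearest].append(x)
        # 3) re-calculate the mean of each cluster as its updated centroid
        new_centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        # 4) stop once the centroids (and hence assignments) no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

The sensitivity to initialisation mentioned on the slide shows up here as the `seed` argument: different seed points can lead the loop to different local optima.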

Hierarchical and Ensemble Clustering
- Hierarchical clustering principle: partition the data set sequentially
- Strategies: divisive (top-down) vs. agglomerative (bottom-up)
- Cluster distances: single-link, complete-link and average-link
- Agglomerative algorithm: convert object attributes to a distance matrix, then repeat until only one cluster remains: merge the two closest clusters and update the distance matrix with the cluster distance
- Key concepts and techniques in hierarchical clustering: dendrogram tree, lifetime of clusters, K-lifetime; inferring the number of clusters with the maximum K-lifetime
- Clustering ensemble based on evidence accumulation: run K-means multiple times with different initialisations, yielding different partitions; accumulate the "evidence" from all partitions into a "collective distance" matrix; apply the agglomerative algorithm to the collective distances and decide K using the maximum K-lifetime
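The agglomerative loop described above can be sketched as follows. This minimal version uses single-link cluster distance and, rather than maintaining an explicitly updated distance matrix, recomputes cluster distances from the original object distance matrix on each merge (an assumption made for brevity):

```python
def agglomerative(dist, k):
    """Single-link agglomerative clustering.

    dist[i][j] is the precomputed distance between objects i and j.
    Starts from singleton clusters and merges the closest pair until
    k clusters remain (k = 1 reproduces the full dendrogram order).
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best, best_d = (0, 1), float("inf")
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link: distance between the closest pair of members
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best, best_d = (a, b), d
        a, b = best
        clusters[a] += clusters.pop(b)   # merge the two closest clusters
    return clusters
```

Replacing `min` with `max` in the inner loop would give complete-link; averaging the member distances would give average-link.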

Cluster Validation
- Cluster validation: evaluate clustering results in a quantitative, objective fashion; used for performance evaluation, clustering comparison and finding the number of clusters
- Two types of cluster validation methods:
  - Internal indexes: no ground truth available (sometimes called "relative indexes"); defined from "common sense" or a priori knowledge; e.g. variance-based validity indexes; application: finding the "proper" number of clusters, ...
  - External indexes: ground truth known or a reference given; e.g. the Rand index; applications: performance evaluation of clustering, clustering comparison, ...
- Weighted clustering ensemble: key idea is to use validity indexes to weight partitions before evidence accumulation, diminishing the effect of trivial partitions
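As a concrete example of an external index, the Rand index can be computed directly from its pairwise definition. A short sketch (illustrative function name; partitions are given as lists of cluster labels):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: the fraction of object pairs on which two partitions agree,
    i.e. both place the pair in the same cluster, or both in different ones."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

Because only pair co-membership matters, the index is insensitive to how clusters are numbered: `[0, 0, 1, 1]` and `[1, 1, 0, 0]` describe the same partition and score 1.0 against each other.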

Examination Information
- Three sections (60 marks in total):
  - Section A (30 marks): 30 multiple-choice questions in total; Q1-15 cover Part I and Q16-30 cover Part II
  - Section B (15 marks): compulsory questions on Part I
  - Section C (15 marks): compulsory questions on Part II
- Length: two hours
- A calculator (without memory) is allowed