CPH Dr. Charnigo Chap. 14 Notes

In supervised learning, we have a vector of features X and a scalar response Y. (A vector response is also permitted but is less common.) We observe training data (x_1, y_1), (x_2, y_2), …, (x_n, y_n) and develop a model for predicting Y from X. This model is called a “learner” and presents an “answer” of ŷ_j for j = 1 to n.
The “supervisor” then “grades” the answer, for example by returning the residual y_j – ŷ_j or its square. The residual may be used by the “learner” to improve upon the model (e.g., going from ordinary least squares to weighted least squares, using the residuals in the weighting) or to assess the model (e.g., using squared residuals to calculate mean square error). In terms of probability theory, supervised learning can be regarded as estimating the conditional distribution of Y given X = x, for all (or many) x in the support of X.
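To make the “grading” and its possible uses concrete, here is a minimal Python sketch (the simulated data, the variable names, and the particular residual-based reweighting are my own illustration, not taken from the text):

import numpy as np

# Hypothetical training data: n = 50 observations of a single feature,
# with noise whose spread depends on x (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=np.abs(x) + 0.5, size=50)

# "Learner": ordinary least squares fit of y on x.
X = np.column_stack([np.ones_like(x), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_ols

# "Supervisor": grade the answers with residuals and mean square error.
residuals = y - y_hat
mse = np.mean(residuals ** 2)

# One crude way the learner could use the grades: weighted least squares,
# down-weighting observations with large squared residuals.
weights = 1.0 / (residuals ** 2 + 1e-6)
W = np.diag(weights)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)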
In contrast, an unsupervised learning problem concerns characterization of the marginal distribution of X. (If X is a vector containing X_1 and X_2, we might also say the “joint” distribution of X_1 and X_2. The essential points, however, are that no Y appears here and that the distribution to be characterized is not conditional.) In today’s session, we’ll focus on a type of unsupervised learning called clustering.
Clustering attempts to separate a collection of objects – we might call them training observations, but “training” is now a misnomer – into rather homogeneous groups. Figure 14.4 on page 502 is illustrative. While this plot may look like graphs you have seen before, the difference is that the orange, green, and blue labels are not indicative of actual, pre-existing categories into which the observations fell. Rather, we ourselves define the categories by clustering.
Possibly we may hope that the categories we define are predictive of a future outcome. For instance, suppose that X_1 and X_2 in Figure 14.4 are biomarkers collected on people who were healthy 20 years ago. If many of the people whose observations we’ve labeled green quickly acquired a particular disease, while some of the people in blue acquired it later and most of the people in orange never acquired it, then this sort of clustering might be useful for risk stratification.
In effect, what we see here is a form of dimension reduction: the present unsupervised learning converts X into a single discrete variable, which may be retained for use as a predictor in future supervised learning. If one had a specific outcome in mind, an alternative data analysis procedure might eschew the unsupervised learning altogether in favor of, say, a discriminant analysis using the categories for the specific outcome.
However, an approach based on unsupervised learning has at least two potential advantages. First, an outcome which is not inherently categorical (e.g., time to event) can be accommodated. Second, and perhaps more importantly, exploratory analyses with a wide variety of outcomes are possible. If one happens upon an outcome which is very well predicted by the categories which one has defined, then one may acquire a better understanding of the underlying biological or physical mechanisms.
Alternatively or additionally, after we have placed people into categories via unsupervised learning, we may seek to identify other variables not in X which predict (or are related to) category membership. An example of this type of analysis appears in components-in-a-mixture-model-for-birthweight%20distribution.pdf.
The K-means algorithm used to prepare Figure 14.4 is a commonly employed tool in unsupervised learning. The details are presented on pages 509 and 510. In brief, K represents the intended number of categories, and we assume that X consists of quantitative features, all of which have been suitably standardized. (By “suitably” I mean that Euclidean distance between two values of X is an appropriate measure of separation.)
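As one small, hedged illustration of what “suitably standardized” might mean in practice (a common choice, though not the only defensible one), each quantitative feature can be centered and scaled to unit standard deviation so that no feature dominates the Euclidean distance merely because of its units:

import numpy as np

def standardize(X):
    # Center each column and scale it to unit standard deviation.
    # Assumes no feature is constant (standard deviation of zero).
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)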
The goal is to assign observations to clusters such that a quantity involving squared distances from each observation to the mean of its cluster is minimized. This is mathematically represented in formula (14.33). The aforementioned minimization takes place by repeating the two steps of the algorithm outlined there. Several iterations may be required, since the reassignment of an observation from one cluster to another will change the means of the affected clusters.
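The following is a rough from-scratch Python sketch of that two-step iteration; the function name, the initialization at randomly chosen observations, and the stopping rule are my own choices, and the precise objective should be read from formula (14.33) in the text rather than from this illustration.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    # Alternate (1) assigning each observation to the nearest cluster mean and
    # (2) recomputing each cluster mean, until the assignments stop changing.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each observation to its closest mean (squared Euclidean distance).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so further iterations change nothing
        labels = new_labels
        # Step 2: recompute each cluster mean from its current members.
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    return labels, means

# Example usage with simulated data (not the data behind Figure 14.4):
# labels, means = kmeans(np.random.default_rng(1).normal(size=(300, 2)), K=3)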
Besides placing observations into similar groups, we may also wish to establish a hierarchy. This is illustrated by the figure on page 522. Note that we now actually have several sets of clusters; there is not a single, fixed K. For example, one set of (two) clusters is defined by separating all observations originating from the top left branch of the dendrogram from all observations originating from the top right.
Another set of (five) clusters breaks up the aforementioned set of (two) clusters by separating “LEUKEMIA” from the rest of the observations originating from the top left and separating two other observations from the rest originating from the top right. The authors note two general classes of methods for hierarchical clustering: agglomerative and divisive.
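As a hedged illustration of how several sets of clusters can be read off a single hierarchy, SciPy’s hierarchical clustering utilities can build one dendrogram and then “cut” it at different levels; the simulated data below stand in for the text’s microarray example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Simulated data (not the data from the text).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))

# Build the full hierarchy once, here with group average (average linkage) clustering.
Z = linkage(X, method="average", metric="euclidean")

# Different cuts of the same dendrogram yield different numbers of clusters;
# there is no single, fixed K.
two_clusters = fcluster(Z, t=2, criterion="maxclust")
five_clusters = fcluster(Z, t=5, criterion="maxclust")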
The results of hierarchical clustering may be very sensitive both to small changes in the data (a weakness shared by the regression/classification tree in supervised learning) and to the choice of method. The latter point is illustrated by the figure on p. 524, which shows the results of three different agglomerative methods. The authors use formula (14.45) to argue that one of these methods (“group average clustering”) has a large-sample probabilistic interpretation which the other two methods lack.
Let me briefly describe group average clustering. We start with each of the n observations as its own cluster. We find the two observations which are closest to each other in Euclidean distance and merge them. Then we have n-1 clusters. We find the two clusters which are closest together in the sense of formula (14.43) on page 523. We merge these two clusters.
The process continues. We have n-2 clusters, then n-3 clusters, and eventually just 1 cluster. A figure in the text provides a microarray example with group average clustering applied to both the persons (here X is 6830-dimensional and n=64) and the genes (here X is 64-dimensional and n=6830). This sort of analysis addresses questions (a) and (b) on page 5 of the text.
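Here is a rough from-scratch Python sketch of the merging process just described, where the closeness of two clusters is taken to be the average of all between-cluster pairwise Euclidean distances (my reading of the group average criterion in formula (14.43)); the function name and stopping rule are my own.

import numpy as np

def group_average_cluster(X, K_stop=1):
    # Agglomerative clustering with the group average dissimilarity: start with
    # each observation as its own cluster and repeatedly merge the pair of
    # clusters whose average between-cluster pairwise distance is smallest.
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    clusters = [[i] for i in range(n)]  # start: n singleton clusters
    merges = []
    while len(clusters) > K_stop:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Group average distance: mean of all pairwise distances between
                # members of cluster a and members of cluster b.
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # record which clusters merged, and at what distance
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges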