CPH Dr. Charnigo Chap. 14 Notes

In supervised learning, we have a vector of features X and a scalar response Y. (A vector response is also permitted but is less common.) We observe training data (x_1, y_1), (x_2, y_2), …, (x_n, y_n) and develop a model for predicting Y from X. This model is called a “learner” and presents an “answer” of ŷ_j for j = 1 to n.
The “supervisor” then “grades” the answer, for example by returning the residual y_j – ŷ_j or its square. The residual may be used by the “learner” to improve upon the model (e.g., going from ordinary least squares to weighted least squares, using the residuals in the weighting) or to assess the model (e.g., using squared residuals to calculate mean square error). In terms of probability theory, supervised learning can be regarded as estimating the conditional distribution of Y given X = x, for all (or many) x in the support of X.
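To make the “grading” and its possible uses concrete, here is a minimal Python sketch (the simulated data, the variable names, and the particular residual-based reweighting are my own illustration, not taken from the text):

import numpy as np

# Hypothetical training data: n = 50 observations of a single feature,
# with noise whose spread depends on x (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=np.abs(x) + 0.5, size=50)

# "Learner": ordinary least squares fit of y on x.
X = np.column_stack([np.ones_like(x), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_ols

# "Supervisor": grade the answers with residuals and mean square error.
residuals = y - y_hat
mse = np.mean(residuals ** 2)

# One crude way the learner could use the grades: weighted least squares,
# down-weighting observations with large squared residuals.
weights = 1.0 / (residuals ** 2 + 1e-6)
W = np.diag(weights)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)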
In contrast, an unsupervised learning problem concerns characterization of the marginal distribution of X. (If X is a vector containing X_1 and X_2, we might also say the “joint” distribution of X_1 and X_2. The essential points, however, are that no Y appears here and that the distribution to be characterized is not conditional.) In today’s session, we’ll focus on a type of unsupervised learning called clustering.
Clustering attempts to separate a collection of objects – we might call them training observations, but “training” is now a misnomer – into rather homogeneous groups. Figure 14.4 on page 502 is illustrative. While this plot may look like graphs you have seen before, the difference is that the orange, green, and blue labels are not indicative of actual, pre-existing categories into which the observations fell. Rather, we ourselves define the categories by clustering.
Possibly we may hope that the categories we define are predictive of a future outcome. For instance, suppose that X_1 and X_2 in Figure 14.4 are biomarkers collected on people who were healthy 20 years ago. If many of the people whose observations we’ve labeled green quickly acquired a particular disease, while some of the people in blue acquired it later and most of the people in orange never acquired it, then this sort of clustering might be useful for risk stratification.
In effect, what we see here is a form of dimension reduction: the present unsupervised learning converts X into a single discrete variable, which may be retained for use as a predictor in future supervised learning. If one had a specific outcome in mind, an alternative data analysis procedure might eschew the unsupervised learning altogether in favor of, say, a discriminant analysis using the categories for the specific outcome.
However, an approach based on unsupervised learning has at least two potential advantages. First, an outcome which is not inherently categorical (e.g., time to event) can be accommodated. Second, and perhaps more importantly, exploratory analyses with a wide variety of outcomes are possible. If one happens upon an outcome which is very well predicted by the categories which one has defined, then one may acquire a better understanding of the underlying biological or physical mechanisms.
Alternatively or additionally, after we have placed people into categories via unsupervised learning, we may seek to identify other variables not in X which predict (or are related to) category membership. An example of this type of analysis appears in components-in-a-mixture-model-for-birthweight%20distribution.pdf.
The K-means algorithm used to prepare Figure 14.4 is a commonly employed tool in unsupervised learning. The details are presented on pages 509 and 510. In brief, K represents the intended number of categories, and we assume that X consists of quantitative features, all of which have been suitably standardized. (By “suitably” I mean that Euclidean distance between two values of X is an appropriate measure of separation.)
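As one small, hedged illustration of what “suitably standardized” might mean in practice (a common choice, though not the only defensible one), each quantitative feature can be centered and scaled to unit standard deviation so that no feature dominates the Euclidean distance merely because of its units:

import numpy as np

def standardize(X):
    # Center each column and scale it to unit standard deviation.
    # Assumes no feature is constant (standard deviation of zero).
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)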
The goal is to assign observations to clusters such that a quantity involving squared distances from each observation to the mean of its cluster is minimized. This is mathematically represented in formula (14.33). The aforementioned minimization takes place by repeating the two steps of the algorithm outlined there. Several iterations may be required, since the reassignment of an observation from one cluster to another will change the means of the affected clusters.
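The following is a rough from-scratch Python sketch of that two-step iteration; the function name, the initialization at randomly chosen observations, and the stopping rule are my own choices, and the precise objective should be read from formula (14.33) in the text rather than from this illustration.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    # Alternate (1) assigning each observation to the nearest cluster mean and
    # (2) recomputing each cluster mean, until the assignments stop changing.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each observation to its closest mean (squared Euclidean distance).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so further iterations change nothing
        labels = new_labels
        # Step 2: recompute each cluster mean from its current members.
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    return labels, means

# Example usage with simulated data (not the data behind Figure 14.4):
# labels, means = kmeans(np.random.default_rng(1).normal(size=(300, 2)), K=3)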
Besides placing observations into similar groups, we may also wish to establish a hierarchy. This is illustrated by the figure on page 522. Note that we now actually have several sets of clusters; there is not a single, fixed K. For example, one set of (two) clusters is defined by separating all observations originating from the top left branch of the dendrogram from all observations originating from the top right.
Another set of (five) clusters breaks up the aforementioned set of (two) clusters by separating “LEUKEMIA” from the rest of the observations originating from the top left and separating two other observations from the rest originating from the top right. The authors note two general classes of methods for hierarchical clustering: agglomerative and divisive.
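As a hedged illustration of how several sets of clusters can be read off a single hierarchy, SciPy’s hierarchical clustering utilities can build one dendrogram and then “cut” it at different levels; the simulated data below stand in for the text’s microarray example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Simulated data (not the data from the text).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))

# Build the full hierarchy once, here with group average (average linkage) clustering.
Z = linkage(X, method="average", metric="euclidean")

# Different cuts of the same dendrogram yield different numbers of clusters;
# there is no single, fixed K.
two_clusters = fcluster(Z, t=2, criterion="maxclust")
five_clusters = fcluster(Z, t=5, criterion="maxclust")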
The results of hierarchical clustering may be very sensitive both to small changes in the data (a weakness shared by the regression/classification tree in supervised learning) and to the choice of method. The latter point is illustrated by the figure on p. 524, which shows the results of three different agglomerative methods. The authors use formula (14.45) to argue that one of these methods (“group average clustering”) has a large-sample probabilistic interpretation which the other two methods lack.
Let me briefly describe group average clustering. We start with each of the n observations as its own cluster. We find the two observations which are closest to each other in Euclidean distance and merge them. Then we have n-1 clusters. We find the two clusters which are closest together in the sense of formula (14.43) on page 523. We merge these two clusters.
The process continues. We have n-2 clusters, then n-3 clusters, and eventually just 1 cluster. A figure in the text provides a microarray example with group average clustering applied to both the persons (here X is 6830-dimensional and n=64) and the genes (here X is 64-dimensional and n=6830). This sort of analysis addresses questions (a) and (b) on page 5 of the text.
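Here is a rough from-scratch Python sketch of the merging process just described, where the closeness of two clusters is taken to be the average of all between-cluster pairwise Euclidean distances (my reading of the group average criterion in formula (14.43)); the function name and stopping rule are my own.

import numpy as np

def group_average_cluster(X, K_stop=1):
    # Agglomerative clustering with the group average dissimilarity: start with
    # each observation as its own cluster and repeatedly merge the pair of
    # clusters whose average between-cluster pairwise distance is smallest.
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    clusters = [[i] for i in range(n)]  # start: n singleton clusters
    merges = []
    while len(clusters) > K_stop:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Group average distance: mean of all pairwise distances between
                # members of cluster a and members of cluster b.
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # record which clusters merged, and at what distance
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges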