1
Example 8.7 Cluster Analysis
2
8.1 | 8.2 | 8.3 | 8.4 | 8.5 | 8.6 | 8.8
CLUSTERS.XLS
• This file contains demographic data on 49 of the largest cities in the United States.
• Some of the data appear in the shaded region of the figure on the next slide.
• For example, Atlanta is 67% Black, 2% Hispanic, and 1% Asian. It has a median age of 31, a 5% unemployment rate, and a per capita income of $22,000.
• We would like to group these 49 cities into four clusters of cities that are demographically similar.
3
[Figure: demographic data for the cities in CLUSTERS.XLS]
4
CLUSTERS.XLS -- continued
• The basic idea is to choose a city to “anchor” or “center” each cluster.
• We then assign each city to the “nearest” cluster center, where “nearest” is defined in terms of the six demographic variables.
• The objective is then to minimize the sum of the squared distances from each city to its cluster anchor.
5
Solution
• The first problem is that if we use raw units, the percentage Black and percentage Hispanic attributes will drive everything, because their values are more spread out than those of the other demographic attributes.
• We can see this by calculating the means and standard deviations of the attributes with the AVERAGE and STDEV functions.
• The figure on the next slide shows these calculations.
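To see why raw units are a problem, the following minimal sketch splits a raw squared distance into per-attribute contributions. Atlanta's values are the ones quoted earlier; `other_city` and all variable names are hypothetical, and per capita income is left out to keep the example small.

```python
# Atlanta's values come from the text; other_city is a hypothetical city.
atlanta = {"pct_black": 67, "pct_hispanic": 2, "pct_asian": 1,
           "median_age": 31, "unemployment": 5}
other_city = {"pct_black": 10, "pct_hispanic": 30, "pct_asian": 3,
              "median_age": 33, "unemployment": 7}

# Contribution of each attribute to the raw (unstandardized) squared distance.
contributions = {name: (atlanta[name] - other_city[name]) ** 2 for name in atlanta}
total = sum(contributions.values())

# Share of the total squared distance explained by each attribute.
share = {name: value / total for name, value in contributions.items()}
dominant = max(share, key=share.get)

for name, value in share.items():
    print(f"{name:>13}: {value:.3f}")
print("dominant attribute:", dominant)
```

With these made-up numbers, the two widely spread percentage attributes account for almost all of the distance, while age and unemployment barely register.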
6
[Figure: means and standard deviations of the demographic attributes]
7
Solution -- continued
• To remedy this problem, we “standardize” each demographic attribute by subtracting the attribute’s mean and dividing the difference by the attribute’s standard deviation.
• For example, the average city is 24.347% Black, with a standard deviation of 18.11%.
• Thus, on a standardized basis, Atlanta is (67-24.347)/18.11 = 2.355 standard deviations above a typical city on the percentage Black attribute.
8
Solution -- continued
• By working with standardized values for each attribute, we ensure that the analysis will be unit-free. To create the standardized values shown in the table, enter the formula =(C15-AVERAGE(C$15:C$63))/STDEV(C$15:C$63) in cell I15 and copy it across to column N and down to row 63.
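The standardization step can be sketched outside the spreadsheet as follows. This is an illustration only: the function names and the `pct_black` sample column are hypothetical, and `statistics.stdev` is used because, like Excel's STDEV, it computes the sample standard deviation.

```python
from statistics import mean, stdev

def z_score(value, attr_mean, attr_sd):
    """Standardize one value given the attribute's mean and standard deviation."""
    return (value - attr_mean) / attr_sd

def standardize(values):
    """Standardize a whole column, mirroring
    =(C15-AVERAGE(C$15:C$63))/STDEV(C$15:C$63) copied down the column."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, as in Excel's STDEV
    return [(v - m) / s for v in values]

# Atlanta's percentage Black, using the mean and standard deviation from the text.
atlanta_z = z_score(67, 24.347, 18.11)
print(round(atlanta_z, 3))

# Hypothetical raw column, for illustration only.
pct_black = [67, 3, 12, 40, 25]
z = standardize(pct_black)
print([round(v, 3) for v in z])
```

After standardization, every column has mean 0 and standard deviation 1, so no single attribute dominates the distance calculation because of its units.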
9
Developing the Model
• Now that we have standardized values for all of the attributes, we can develop the spreadsheet model as follows.
• The model is shown in two parts on the next two slides.
10
[Figure: spreadsheet model, part 1]
11
[Figure: spreadsheet model, part 2]
12
Developing the Model -- continued
• The model can be created by following these steps.
– Lookup table. One key to the model is to have an index (1 to 49) for the cities so that we can refer to them by index and then look up their characteristics with a VLOOKUP function. Therefore, name the range A15:N63 as Ltable.
– Decision variables. The only changing cells appear in the Centers range of the figure. They are the indexes of the four cities chosen as cluster centers. Enter any four integers from 1 to 49 in these cells.
13
Developing the Model -- continued
– Corresponding cities and standardized attributes. We find the names and standardized attributes of the cluster centers with VLOOKUP functions. First, enter the formula =VLOOKUP(B6,Ltable,2) in cell A6 and copy it to the range A6:A9. Then enter the formula =VLOOKUP($B6,Ltable,C$4) in C6 and copy it to the range C6:H9. Note, for example, that the standardized PctBlack is the 9th column of the lookup table. This explains the “column offset” entries in row 4.
– Squared distances to centers. The next step is to see how “far” each city is from each of the cluster centers. Let z_i be standardized attribute i for a typical city, and let c_i be standardized attribute i for a typical cluster center.
14
Developing the Model -- continued
– We measure the distance from this city to this cluster center with the usual “Euclidean” distance formula
$$ d = \sqrt{\sum_{i=1}^{6} (z_i - c_i)^2} $$
where the sum is over all six attributes. We can work just as well with squared distances, which appear in columns P through S of the last figure. For example, the value in cell P15 is the squared distance from Albuquerque to the first cluster center (Los Angeles), the value in Q15 is the squared distance from Albuquerque to the second cluster center, and so on. These squared distances can be calculated in several equivalent ways. Probably the quickest way is to enter the formula =SUMPRODUCT(I15:N15-$C$6:$H$6,I15:N15-$C$6:$H$6) in cell P15 and copy it down column P.
15
Developing the Model -- continued
– This rather novel use of the SUMPRODUCT function sums the products of the differences with themselves; that is, it sums the squares of the differences, which is exactly what we want. Then enter similar formulas in columns Q, R, and S. For example, the formula in column Q refers to row 7 instead of row 6 in the absolute references.
– Assignments to cluster centers. Each city will be assigned to the cluster center that has the smallest squared distance. Therefore, find the minimum squared distances in column T by entering the formula =MIN(P15:S15) in cell T15 and copying it down. Then identify the cluster index (1 through 4) and city name of the cluster center that yields the minimum. We can use the MATCH function to obtain the cluster index. Enter the formula =MATCH(T15,P15:S15,0) in cell U15 and copy it down.
16
Developing the Model -- continued
– For example, the 4.47 minimum squared distance for Albuquerque corresponds to the second squared distance, so Albuquerque is assigned to the second cluster center. Finally, to get the name of the second cluster center, we can use the INDEX function. Enter the formula =INDEX(CenterNames,U15,1) in cell V15 and copy it down.
– Sum of squared distances. The objective is to minimize the sum of squared distances from all cities to the cluster centers to which they are assigned. Calculate this objective in the SumSqDists cell with the formula =SUM(MinSqDists).
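The model steps above (squared distances, assignment to the nearest center, and the objective) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the spreadsheet itself: the function names are hypothetical, and the four cities A through D with two standardized attributes each are made-up data standing in for the 49 cities and six attributes.

```python
def squared_distance(z, c):
    """Sum of squared differences, the quantity SUMPRODUCT(diff, diff) returns."""
    return sum((zi - ci) ** 2 for zi, ci in zip(z, c))

def assign(cities, center_indexes):
    """cities maps a city index to (name, standardized attributes).
    Returns (assignments, center_names, objective), where assignments maps
    each city index to the 1-based position of its nearest center."""
    assignments, center_names, objective = {}, {}, 0.0
    for idx, (_, z) in cities.items():
        # One squared distance per center, like cells P15:S15.
        dists = [squared_distance(z, cities[c][1]) for c in center_indexes]
        nearest = min(dists)                      # =MIN(P15:S15)
        position = dists.index(nearest) + 1       # =MATCH(T15,P15:S15,0)
        assignments[idx] = position
        # Name of the chosen center, like =INDEX(CenterNames,U15,1).
        center_names[idx] = cities[center_indexes[position - 1]][0]
        objective += nearest                      # accumulates =SUM(MinSqDists)
    return assignments, center_names, objective

# Hypothetical data: two standardized attributes instead of six.
cities = {
    1: ("A", (0.0, 0.0)),
    2: ("B", (0.1, -0.2)),
    3: ("C", (2.0, 2.0)),
    4: ("D", (2.2, 1.9)),
}
assignments, center_names, objective = assign(cities, [1, 3])
print(assignments, center_names, round(objective, 2))
```

Like MATCH with a final argument of 0, `list.index` returns the first position that equals the minimum, so ties go to the lower-numbered center.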
17
Using the Evolutionary Solver
• The Solver dialog box should be set up as shown here.
18
Using the Evolutionary Solver -- continued
• Because the changing cells represent indexes of cluster centers, they must be integer-constrained, and suitable lower and upper limits are 1 and 49.
• Make sure you set the Evolutionary Solver options as we described in Example 8.1. This problem is considerably harder to solve, and we want to allow the Solver plenty of time to search through a lot of potential solutions.
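For readers without Solver, the same search problem can be illustrated with a different technique. The sketch below is not the Evolutionary Solver's algorithm; it swaps in a plain exhaustive search over every possible set of centers, which is feasible here because choosing 4 centers from 49 cities gives only 211,876 combinations. The function names and the six sample points are hypothetical, and indexes are 0-based rather than 1 to 49.

```python
from itertools import combinations

def squared_distance(z, c):
    """Squared Euclidean distance between two attribute tuples."""
    return sum((zi - ci) ** 2 for zi, ci in zip(z, c))

def objective(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, points[c]) for c in centers) for p in points)

def best_centers(points, k):
    """Try every set of k center indexes (0-based) and keep the best one."""
    candidates = combinations(range(len(points)), k)
    return min(candidates, key=lambda cs: objective(points, cs))

# Hypothetical standardized data forming two obvious groups.
points = [(0.0, 0.0), (0.1, -0.2), (2.0, 2.0),
          (2.2, 1.9), (0.2, 0.1), (1.9, 2.1)]
centers = best_centers(points, 2)
print(centers, round(objective(points, centers), 2))
```

Exhaustive search guarantees the best set of centers but scales poorly as the number of cities or clusters grows, which is where a heuristic search becomes the practical choice.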
19
Solution
• The solution shown, which uses Los Angeles, Omaha, Memphis, and San Francisco, is the best we found.
• You might find a slightly different solution, depending on your Solver settings and how long you let Solver run, but you should obtain a similar value in the target cell.
• If you look closely at the cities assigned to each cluster center, this solution begins to make intuitive sense, as seen in the figure on the next slide.
20
[Figure: cities assigned to each cluster center]
21
Solution -- continued
• The San Francisco cluster consists of rich, older, highly Asian cities. The Memphis cluster consists of highly Black cities with high unemployment rates. The Omaha cluster consists of average-income cities with few minorities. The Los Angeles cluster consists of highly Hispanic cities with high unemployment rates.
• Why four clusters? We could easily try three clusters or five clusters. Note that when we add a cluster, the sum of squared distances will certainly decrease.
22
Solution -- continued
• In fact, we could obtain an objective value of 0 by using 49 clusters, one for each city, but this would hardly provide much information!
• Therefore, to choose the “optimal” number of clusters, we would stop adding clusters when the sum of squared distances failed to decrease by a substantial amount.
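The stopping rule can be illustrated by computing the best objective for each number of clusters and looking at how much each extra cluster helps. This is a minimal sketch on hypothetical data, using exhaustive search rather than Solver; the function names and sample points are assumptions made for illustration.

```python
from itertools import combinations

def squared_distance(z, c):
    """Squared Euclidean distance between two attribute tuples."""
    return sum((zi - ci) ** 2 for zi, ci in zip(z, c))

def objective(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, points[c]) for c in centers) for p in points)

def best_objective(points, k):
    """Smallest achievable objective with k centers, found by exhaustive search."""
    return min(objective(points, cs) for cs in combinations(range(len(points)), k))

# Hypothetical standardized data forming two obvious groups.
points = [(0.0, 0.0), (0.1, -0.2), (2.0, 2.0),
          (2.2, 1.9), (0.2, 0.1), (1.9, 2.1)]

# Best objective for k = 1, 2, ..., number of points.
curve = [best_objective(points, k) for k in range(1, len(points) + 1)]
# Improvement gained by adding each extra cluster.
drops = [curve[i] - curve[i + 1] for i in range(len(curve) - 1)]

for k, value in enumerate(curve, start=1):
    print(k, round(value, 3))
```

In this made-up data set the objective falls sharply from one cluster to two and only marginally afterward, reaching 0 when every point is its own center, so two clusters would be the natural stopping point.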