Delineating Metropolitan Housing Submarkets with Fuzzy Clustering Methods
Julie Sungsoon Hwang, Department of Geography, University of Washington
Jean-Claude Thill, Department of Geography, State University of New York at Buffalo
November 10, 2005
North American Meetings of Regional Science Association International
Outline
- Research objectives
- Methodology: specification
- Methodology: illustration
- Evaluating the performance of fuzzy clustering
- Conclusions
Research objectives
- Demonstrate the use of the fuzzy c-means (FCM) algorithm for delineating housing submarkets
  – Comparison to K-means
- Discuss the empirical characteristics of FCM in this application, in particular the choice of parameters
  – Cluster validity index
Challenges
Are the boundaries of clusters crisp?
[Figure: two scatter plots on axes X1 and X2, each showing Cluster A, Cluster B, and Cluster C. Panel titles: "Housing market in metropolitan area p" and "Housing market in metropolitan area q".]
Methodology: specification
- Our task is to group census tracts into homogeneous housing submarkets within a metropolitan area
- Using the fuzzy c-means algorithm
- In order to examine whether fuzzy set-based clustering can do a better job
- Implemented in 85 metropolitan areas
- Most of the data sets are public (e.g., Census)
- The whole procedure is automated in GIS
Methodology: flow chart
[Flow chart, repeated for each metropolitan area. Data levels: National, Regional, Local. The census tract layer holds n tracts with candidate variables x1, x2, x3, ..., xm. Stepwise regression keeps the significant variables y1, y2, ..., yk (k ≤ m). Cluster analysis on the significant variables has two branches: K-means yields a hard cluster layer, where each tract has membership 0 or 1 in U1, U2, ..., Uc; fuzzy c-means yields a fuzzy cluster layer, where memberships in U1, U2, ..., Uc are degrees such as 0.12 or 0.50 (c ≤ n).]
k: # selected variables
c: # submarkets
Uj: membership to cluster j
Explanatory variables for house price

Var_Name | Variable Definition | Data | Year | Spatial Unit

Socioeconomic/demographic Characteristics of Residents
pcincome | per capita income | Census | 2000 | Census Tract
college | % college degree | Census | 2000 | Census Tract
managep | % management workers | Census | 2000 | Census Tract
prodp | % production workers | Census | 2000 | Census Tract
famcpchl | % family with children | Census | 2000 | Census Tract
nfmalone | % nonfamily living alone | Census | 2000 | Census Tract
black_p | % black | Census | 2000 | Census Tract
nhwht_p | % non-hispanic white | Census | 2000 | Census Tract
nativebr | % native born | Census | 2000 | Census Tract

Structural Characteristics of Housing Units
medroom | median number of rooms | Census | 2000 | Census Tract
hudetp | % detached housing unit | Census | 2000 | Census Tract
yrhublt | median year structure built | Census | 2000 | Census Tract

Locational Characteristics (Amenities) of Neighborhoods
ptratio | pupil to teacher ratio | NCES* | 2002 | School District
schexp | school expenditure per student | NCES | 2002 | School District
vrlcrime | violent crime rate | FBI** | 2003 | Designated Place
prpcrime | property crime rate | FBI | 2003 | Designated Place
jobacm | job accessibility (Hansen 1959) | CTPP*** | 2000 | Census Tract

*NCES: National Center for Education Statistics; **FBI annual report "Crime in the U.S. 2003"; ***CTPP: Census Transportation Planning Package
Dependent variable: median home value of owner-occupied housing units
Study set: 85 metropolitan areas
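What is fuzzy c-means (FCM)?
Clustering method that minimizes the following objective function:
$$J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, \lVert x_k - v_i \rVert^2$$
where
- $x_k$: vector of data point k, 1 ≤ k ≤ n
- $v_i$: center of cluster i, 1 ≤ i ≤ c
- $u_{ik}$: membership degree of data point k with cluster i, in [0, 1]
- $m$: amount of fuzziness associated with assigning data points to clusters, 1 ≤ m < ∞
Updates cluster means $v_i$ and membership degrees $u_{ik}$ until the algorithm converges:
$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} \, x_k}{\sum_{k=1}^{n} u_{ik}^{m}} \qquad \text{(III-3a)}$$
$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{\lVert x_k - v_i \rVert}{\lVert x_k - v_j \rVert} \right)^{2/(m-1)} \right]^{-1} \qquad \text{(III-3b)}$$
Source: Bezdek 1981
[Figure: scatter plot on axes x1 and x2]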
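To make the role of m concrete, the sketch below evaluates the standard Bezdek membership update (III-3b) for a single data point. The function name `memberships` and the example distances are illustrative assumptions, not part of the original slides, and the sketch requires m > 1.

```python
import numpy as np

def memberships(distances, m):
    """Membership degrees of one data point, given its distances to the
    c cluster centers, following the standard form of (III-3b).
    Hypothetical helper for illustration; requires m > 1 and distances > 0."""
    d = np.asarray(distances, dtype=float)
    # u_i = d_i^(-2/(m-1)) / sum_j d_j^(-2/(m-1)), an equivalent rewriting of (III-3b)
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum()

# A point at distance 1 from center 1 and distance 3 from center 2.
u_fuzzy = memberships([1.0, 3.0], m=2.0)   # approximately [0.9, 0.1]
u_crisp = memberships([1.0, 3.0], m=1.3)   # almost all weight on center 1
print(u_fuzzy, u_crisp)
```

As m approaches 1 the memberships approach a hard 0/1 assignment, which is why K-means can be read as the m = 1 limit; larger m spreads membership more evenly across clusters.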
FCM: missing elements
- Optimal number of clusters c*
- Optimal fuzziness amount m*
[Figure: diagram relating m, c, and FCM]
Extended fuzzy c-means algorithm
Step 1: Initialize the parameters related to fuzzy partitioning: c = 2 (2 ≤ c ≤ cmax), m = 1 (1 ≤ m ≤ mmax), where c is an integer and m is a real number; fix minc, where minc is the incremental value of m (0 < minc ≤ 0.1); fix the cut-off threshold L; choose the validity index v
Step 2: Given c and m, initialize U(0) so that it is a fuzzy partition matrix. Then at step l, l = 0, 1, 2, …:
Step 3: Calculate the c fuzzy cluster centers {vi(l)} with (III-3a) and U(l)
Step 4: Update U(l+1) using (III-3b) and {vi(l)}
Step 5: Compare U(l) to U(l+1) in a convenient matrix norm; if || U(l+1) – U(l) || ≤ L, go to Step 6; otherwise return to Step 3
Step 6: Compute the validity index for the given c and m
Step 7: If c < cmax, then increase c (c ← c + 1) and go to Step 3; otherwise go to Step 8
Step 8: If m < mmax, then increase m (m ← m + minc) and go to Step 3; otherwise go to Step 9
Step 9: Obtain the optimal validity index and, from it, the optimal number of clusters c* and the optimal fuzziness exponent m*; the optimal fuzzy partition U is obtained given c* and m*
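The steps above can be sketched in Python with NumPy as a grid search over c and m. This is a minimal sketch under stated assumptions, not the authors' GIS implementation: the function names, the random initialization, the max-norm convergence test, and the use of the standard Xie-Beni index (lower is better) as the validity index are all assumptions. It also starts at m = 1 + minc rather than m = 1, because the membership update (III-3b) is undefined at m = 1, and it re-initializes U for every (c, m) pair.

```python
import numpy as np

def fcm(X, c, m, tol=1e-5, max_iter=300, seed=0):
    """Steps 2-5 for fixed c and m > 1. Returns U (c x n) and centers V (c x p)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)                 # Step 2: each column sums to 1
    for _ in range(max_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)    # Step 3: centers, (III-3a)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        inv = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)  # Step 4: memberships, (III-3b)
        shift = np.abs(U_new - U).max()               # Step 5: max-norm of the change
        U = U_new
        if shift <= tol:
            break
    W = U ** m
    V = (W @ X) / W.sum(axis=1, keepdims=True)        # centers consistent with final U
    return U, V

def xie_beni(X, U, V):
    """Standard Xie-Beni index: compactness over separation (assumed form)."""
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    compactness = ((U ** 2) * d2).sum()
    gaps = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    separation = gaps[~np.eye(len(V), dtype=bool)].min()
    return compactness / (X.shape[0] * separation) if separation > 0 else np.inf

def extended_fcm(X, c_max=6, m_max=2.0, m_inc=0.1):
    """Steps 1 and 6-9: scan c and m, keep the partition with the best index."""
    best = None
    n_steps = int(round((m_max - 1.0) / m_inc))
    for step in range(1, n_steps + 1):
        m = 1.0 + step * m_inc                        # this sketch skips m = 1
        for c in range(2, c_max + 1):
            U, V = fcm(X, c, m)
            score = xie_beni(X, U, V)                 # Step 6
            if best is None or score < best["score"]:
                best = {"score": score, "c": c, "m": m, "U": U, "V": V}
    return best                                       # Step 9: c*, m*, and U

# Usage on synthetic data: two well-separated groups of points.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
result = extended_fcm(X_demo, c_max=4, m_max=2.0, m_inc=0.2)
print(result["c"], result["m"])
```

Because FCM converges to a local optimum, a more careful version would likely run several random initializations per (c, m) pair and keep the best one.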
Cluster validity indices
- Partition coefficient
- Partition entropy
- Xie-Beni index
- SVi index
where w is set to 2 in this study
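For orientation, three of these indices can be sketched with their commonly cited textbook definitions. These are assumed forms and may differ in detail from the formulas on the original slide; the SVi index and its parameter w are not reproduced here. U is taken as a c x n membership matrix, X as n x p data, and V as c x p cluster centers.

```python
import numpy as np

def partition_coefficient(U):
    """Mean of squared memberships: 1 for a crisp partition, 1/c when fully fuzzy."""
    U = np.asarray(U, dtype=float)
    return (U ** 2).sum() / U.shape[1]

def partition_entropy(U):
    """Mean membership entropy: 0 for a crisp partition, log(c) when fully fuzzy."""
    U = np.asarray(U, dtype=float)
    # Clipping avoids log(0); a zero membership contributes 0 to the sum.
    return -(U * np.log(np.clip(U, 1e-12, None))).sum() / U.shape[1]

def xie_beni(X, U, V):
    """Compactness divided by separation; lower values indicate better partitions."""
    X, U, V = (np.asarray(a, dtype=float) for a in (X, U, V))
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    compactness = ((U ** 2) * d2).sum()
    gaps = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    separation = gaps[~np.eye(len(V), dtype=bool)].min()
    return compactness / (X.shape[0] * separation)

# Four points on a line, two crisp clusters centered at 0.5 and 10.5.
X_demo = [[0.0], [1.0], [10.0], [11.0]]
V_demo = [[0.5], [10.5]]
U_demo = [[1, 1, 0, 0], [0, 0, 1, 1]]
print(partition_coefficient(U_demo), partition_entropy(U_demo), xie_beni(X_demo, U_demo, V_demo))
```

The partition coefficient is maximized and the other two are minimized, so a scan over c and m has to account for the direction of whichever index is chosen.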
Determining c* and m*
- Selected validity indices are calibrated over the study set
- Xie-Beni index is recommended as a validity index
- Average m* is 1.38
Histogram of m* for FCM
Methodology: illustration
Median home value of Buffalo, NY
Dimensionality of Buffalo housing market
Hedonic regression equation of median home value in Buffalo, NY
Table columns: Predictor, Coefficient, Standard Error, t-statistic, p-value
Predictors:
- Constant
- Per capita income
- % college degree
- % family: couple with children
- % detached housing unit
- Housing age (year)
- % non-hispanic white
- % native born status
- Job accessibility
Adjusted R sq = 84.3%
Optimal number of housing submarkets c*, optimal fuzziness amount m*, Buffalo, NY
[Table: cells indexed by c and m, with c* marked. Values in the cells represent the Xie-Beni index given c and m.]
Buffalo housing submarkets
c* = 3; m* = 1.3
[Maps: Membership to Cluster 1; Membership to Cluster 2; Membership to Cluster 3; Defuzzified Clusters]
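The slides do not state how the defuzzified clusters were derived from the three membership layers. A common choice, assumed here, is the maximum-membership rule: each tract goes to the cluster in which its membership is highest. The function name and the membership values below are made up for illustration.

```python
import numpy as np

def defuzzify(U):
    """Maximum-membership rule (an assumed defuzzification method).
    U is a c x n membership matrix; returns 0-based cluster labels and
    the winning membership degree for each of the n tracts."""
    U = np.asarray(U, dtype=float)
    labels = U.argmax(axis=0)
    strength = U.max(axis=0)
    return labels, strength

# Three hypothetical tracts with memberships in c = 3 clusters.
U_demo = np.array([[0.80, 0.10, 0.34],
                   [0.15, 0.70, 0.33],
                   [0.05, 0.20, 0.33]])
labels, strength = defuzzify(U_demo)
print(labels + 1)   # cluster numbers: [1 2 1]
print(strength)     # the third tract wins with only 0.34, so its assignment is weak
```

Keeping the winning membership alongside the label preserves the information a hard map discards: tracts with low values sit near submarket boundaries.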
Evaluating the performance of fuzzy clustering
Compare FCM with K-means (KM)
- Compare the sum of squared error derived from KM (m=1) and FCM (m=m*) given c*
- Fuzzy clustering outperforms crisp clustering
Conclusions
- Fuzzy set theory provides a mechanism for handling the uncertainty involved in classification tasks
- The fuzzy c-means algorithm is of practical use in delineating housing submarkets
- Fuzzy set theory needs further attention in social science fields
- More work on the choice of parameters is needed