Rank-Based Approach to Optimal Score via Dimension Reduction Shao-Hsuan Wang National Taiwan University, Taiwan Nov
Rank-based measures
Kendall's $\tau$
Concordance index
Rank correlation
Widely used in medical statistics, epidemiology, economics, sociology, etc.
Rank-based measures
Regression model
$Y$: a univariate response
$Z = (Z_1, \ldots, Z_p)$: multiple covariates
Rank-based measures
Response: $Y \in \mathbb{R}$
Composite score: $\beta^T Z \in \mathbb{R}$
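Rank-based measures
For a pair of observations $(Y_1, \beta^T Z_1)$ and $(Y_2, \beta^T Z_2)$:
concordant: $Y_1 > Y_2$ and $\beta^T Z_1 > \beta^T Z_2$, or $Y_1 < Y_2$ and $\beta^T Z_1 < \beta^T Z_2$
discordant: $Y_1 > Y_2$ and $\beta^T Z_1 < \beta^T Z_2$, or $Y_1 < Y_2$ and $\beta^T Z_1 > \beta^T Z_2$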
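Rank-based measures
Kendall's $\tau$: $\tau = P\{(Y_1 - Y_2)(\beta^T Z_1 - \beta^T Z_2) > 0\} - P\{(Y_1 - Y_2)(\beta^T Z_1 - \beta^T Z_2) < 0\}$
Rank correlation: $rc = P(Y_1 > Y_2,\ \beta^T Z_1 > \beta^T Z_2)$
Concordance index: $CI = P(\beta^T Z_1 > \beta^T Z_2 \mid Y_1 > Y_2)$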
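The three measures can be estimated from a sample by counting pairs. The following is a minimal sketch, assuming the inequality directions $Y_1 > Y_2$ and $\beta^T Z_1 > \beta^T Z_2$ for the rank correlation and the concordance index, and treating tied pairs as neither concordant nor discordant. The function names and the toy data are illustrative only.

```python
from itertools import permutations

def rank_measures(y, score):
    """Empirical Kendall's tau, rank correlation, and concordance index.

    All ordered pairs (i, j) with i != j are visited; ties count as
    neither concordant nor discordant (an assumption of this sketch).
    """
    n = len(y)
    concordant = discordant = both_greater = y_greater = 0
    for i, j in permutations(range(n), 2):
        dy = y[i] - y[j]
        ds = score[i] - score[j]
        if dy * ds > 0:
            concordant += 1
        elif dy * ds < 0:
            discordant += 1
        if dy > 0:
            y_greater += 1
            if ds > 0:
                both_greater += 1
    n_pairs = n * (n - 1)
    tau = (concordant - discordant) / n_pairs  # P(concordant) - P(discordant)
    rc = both_greater / n_pairs                # P(Y1 > Y2, score1 > score2)
    ci = both_greater / y_greater              # P(score1 > score2 | Y1 > Y2)
    return tau, rc, ci

def linear_score(beta, z):
    """Composite score beta^T z."""
    return sum(b * v for b, v in zip(beta, z))

# Made-up toy data: four observations with two covariates each.
Z = [(1.0, 0.5), (2.0, 1.0), (0.5, 2.0), (3.0, 0.2)]
Y = [1.2, 2.9, 2.4, 2.0]
beta = (1.0, 1.0)
scores = [linear_score(beta, z) for z in Z]
tau, rc, ci = rank_measures(Y, scores)
print(tau, rc, ci)
```

On this toy sample four of the six unordered pairs are concordant and two are discordant, so all three measures can be checked by hand.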
Rank-based measures
A monotonic association between $Y$ and $\beta^T Z$ may not exist!
Motivation
Composite score: $\beta^T Z \to g(Z)$, where $g$ ranges over measurable functions
C-max
Response: $Y \in \mathbb{R}$
Composite score: $g(Z) \in \mathbb{R}$
Concordance-index function: $C(g) = P(g(Z_1) > g(Z_2) \mid Y_1 > Y_2)$
C-max: $C_{\max} = \sup_{g \in \mathcal{F}_c} C(g)$
Optimal score: $m(Z)$ such that $C(m) = \sup_{g \in \mathcal{F}_c} C(g)$
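To see why the choice of $g$ matters, the sketch below evaluates an empirical version of $C(g)$ for two candidate scores on a made-up sample in which the response is a non-monotonic function of a single covariate. The helper name `concordance_index` and the data are assumptions for illustration.

```python
from itertools import permutations

def concordance_index(y, z, g):
    """Empirical C(g): among ordered pairs with y_i > y_j,
    the share that also has g(z_i) > g(z_j)."""
    s = [g(v) for v in z]
    num = den = 0
    for i, j in permutations(range(len(y)), 2):
        if y[i] > y[j]:
            den += 1
            if s[i] > s[j]:
                num += 1
    return num / den

# Made-up example: Y = Z^2, so Y is not monotonic in Z.
Z = [-3, -2, -1, 0, 1, 2, 3]
Y = [v * v for v in Z]

c_linear = concordance_index(Y, Z, lambda v: v)      # linear score
c_square = concordance_index(Y, Z, lambda v: v * v)  # quadratic score
print(c_linear, c_square)
```

Here the linear score is concordant with the response in exactly half of the comparable pairs, while the quadratic score orders every comparable pair correctly.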
Intrinsic model behind Rank-based measures
M1, distributional assumption: Generalized Regression Model (Han 1987)
M2, structural assumption: Dimension Reduction (Li 1991, Cook 1991)
Intrinsic model behind Rank-based measures
M1: $Y = D \circ G(m_{d_0}(Z), \varepsilon)$
$D$: a non-degenerate monotonic function on $\mathbb{R}$
$G$: an unspecified bivariate function, strictly increasing in each component when the other is held fixed
Intrinsic model behind Rank-based measures
M2: $Y = D \circ G(m_{d_0}(Z), \varepsilon)$
$m_{d_0}$: a multivariate polynomial of unknown degree $d_0$
Intrinsic model behind Rank-based measures
M2, dimension reduction: $m_d(Z) = m_{dk}(B^T Z)$
(1) $d_0$ is the smallest degree such that $Y \perp\!\!\!\perp Z \mid m_{d_0}(Z)$
(2) $B_0 = \{\beta_{01}, \ldots, \beta_{0 k_0}\}$ is a basis of the central subspace (CS)
Model Flexibility
Linear regression model: $Y = \beta_0^T Z + \varepsilon$
Binary choice model: $Y = I(\beta_0^T Z + \varepsilon > 0)$
Accelerated failure time model: $\log(Y) = \beta_0^T Z + \varepsilon$
Generalized linear regression model (GLM)
Non-monotonic regression model: $Y = (\beta_0^T Z)^2 + \varepsilon$
Types of covariates
All-discrete as well as continuous covariates
Covariates whose moments may not exist
Theories
Propositions:
(1) Existence: $m_{d_0}(Z) \in \arg\max_g C(g)$
(2) Uniqueness: $f_{d_0}(Z) \in \arg\max_g C(g) \Rightarrow f_{d_0}(Z) = c_1 m_{d_0}(Z) + c_2$, for a polynomial $f_{d_0}(z)$ of degree $d_0$
(3) Optimality: $g(Z) \in \arg\max_g C(g) \Rightarrow g(Z) = T(m_{d_0}(Z))$ for some monotonic function $T$
Summary
$\beta^T Z$ may not be the best composite score
Model flexibility
Various types of covariates
Optimal score: existence, uniqueness, and optimality
How to estimate
$d_0$: structural degree
$k_0$: structural dimension
$S(B_0)$: the central subspace
$m_{d_0 k_0}(B_0^T Z)$: the optimal score
$C_{0\max}$: the C-max
Estimation Procedure
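Step 1: Derive $m_d(Z)$ by maximizing the concordance index function via the generalized single-index form of the polynomial.
Tips:
(1) $m_d(Z) = \sum_{r=0}^{d} \sum_{r_1 + \cdots + r_p = r} c_{r_1 \cdots r_p} \prod_{j=1}^{p} Z_j^{r_j} = \theta^T \tilde Z$, where $\tilde Z$ stacks the monomials and $\theta$ their coefficients
(2) $C_n(m_d(Z)) = C_n(\theta) = \dfrac{\sum_{i=1}^{n} \sum_{j=1}^{n} I(\theta^T \tilde Z_i > \theta^T \tilde Z_j,\ Y_i > Y_j)}{\sum_{i=1}^{n} \sum_{j=1}^{n} I(Y_i > Y_j)}$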
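As a rough illustration of Step 1, the sketch below expands covariates into monomials and evaluates the empirical concordance of a coefficient vector. It assumes the constant term can be dropped because it does not change the ordering of scores. The names `poly_features` and `empirical_concordance`, the monomial ordering, and the toy data are all illustrative choices, not part of the slides.

```python
from itertools import combinations_with_replacement, permutations
from math import prod

def poly_features(z, d):
    """All monomials of total degree 1..d in the entries of z.

    The degree-0 (constant) term is omitted: adding a constant to every
    score leaves all pairwise comparisons unchanged.
    """
    feats = []
    for r in range(1, d + 1):
        for idx in combinations_with_replacement(range(len(z)), r):
            feats.append(prod(z[j] for j in idx))
    return feats

def empirical_concordance(theta, X, y):
    """Share of ordered pairs with y_i > y_j whose scores satisfy
    theta^T x_i > theta^T x_j."""
    s = [sum(t * x for t, x in zip(theta, row)) for row in X]
    num = den = 0
    for i, j in permutations(range(len(y)), 2):
        if y[i] > y[j]:
            den += 1
            if s[i] > s[j]:
                num += 1
    return num / den

# Made-up data with p = 2 covariates and response Y = (Z1 + Z2)^2.
Z = [(-2, 0), (-1, 1), (0, -1), (1, 2), (2, 1)]
Y = [(a + b) ** 2 for a, b in Z]

# Degree-2 features are ordered as: z1, z2, z1^2, z1*z2, z2^2.
X = [poly_features(z, 2) for z in Z]
theta_lin = [1, 1, 0, 0, 0]   # linear score z1 + z2
theta_poly = [0, 0, 1, 2, 1]  # quadratic score (z1 + z2)^2
c_lin = empirical_concordance(theta_lin, X, Y)
c_poly = empirical_concordance(theta_poly, X, Y)
print(c_lin, c_poly)
```

The quadratic coefficient vector reproduces the response ordering exactly, while the linear one misorders the pairs on the decreasing branch.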
Estimation Procedure
Step 2: Apply the outer gradient approach to obtain $B_k$.
Tips:
(1) $m_d(u) = m_{dk}(B_k^T u)$
(2) $\mathrm{col}(S(B_0)) = \mathrm{col}\left( \int_{u \in \mathbb{R}^p} \nabla m_{d_0}(u) \, (\nabla m_{d_0}(u))^T \, dW(u) \right)$
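The idea behind Step 2 can be sketched numerically for the simplest case $k = 1$. The example below assumes a known polynomial $m(u) = (b_0^T u)^2 + b_0^T u$, approximates gradients by central finite differences, replaces the weight $W$ by the empirical distribution of randomly drawn points, and recovers the direction with power iteration. All of these choices, and every name in the code, are assumptions for illustration. For $k > 1$ a full eigendecomposition would be needed instead of power iteration.

```python
import random
from math import sqrt

def gradient(m, u, h=1e-5):
    """Central finite-difference gradient of m at the point u."""
    g = []
    for j in range(len(u)):
        up = list(u)
        dn = list(u)
        up[j] += h
        dn[j] -= h
        g.append((m(up) - m(dn)) / (2 * h))
    return g

def outer_gradient_matrix(m, points):
    """Average of grad m(u) grad m(u)^T over the given points."""
    p = len(points[0])
    M = [[0.0] * p for _ in range(p)]
    for u in points:
        g = gradient(m, u)
        for a in range(p):
            for b in range(p):
                M[a][b] += g[a] * g[b] / len(points)
    return M

def leading_direction(M, iters=200):
    """Unit leading eigenvector of a symmetric PSD matrix by power iteration."""
    p = len(M)
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(M[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical single-index polynomial with unit-length direction b0.
b0 = [0.6, 0.8, 0.0]

def m(u):
    t = sum(b * x for b, x in zip(b0, u))
    return t * t + t

rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
M = outer_gradient_matrix(m, points)
v = leading_direction(M)

# |cosine| between the recovered direction and b0 (b0 has unit length).
cosine = abs(sum(a * b for a, b in zip(v, b0)))
print(v, cosine)
```

Because every gradient of this $m$ is a multiple of $b_0$, the averaged outer-product matrix has rank one and its column space is spanned by $b_0$.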
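Estimation Procedure
Step 3: Derive the estimator of $\theta$.
Tips:
(1) $m_{dk}(B_k^T Z) = \theta^T \tilde Z_{B_k}$
(2) $\hat\theta = \arg\max_{\theta} \dfrac{\sum_{i=1}^{n} \sum_{j=1}^{n} I(\theta^T \tilde Z_{B_k, i} > \theta^T \tilde Z_{B_k, j},\ Y_i > Y_j)}{\sum_{i=1}^{n} \sum_{j=1}^{n} I(Y_i > Y_j)}$
(3) $\hat m_{dk}(\hat B_k^T Z) = \hat\theta^T \tilde Z_{\hat B_k}$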
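The maximization in Step 3 involves a step function of $\theta$, so gradient-based optimizers do not apply directly. The sketch below uses plain random search over unit-norm coefficient vectors, which is a stand-in chosen for brevity and not necessarily the optimizer used by the author. It relies on the fact that the concordance is unchanged when $\theta$ is multiplied by a positive constant. The function names, the number of draws, and the simulated data are assumptions.

```python
import random
from itertools import permutations
from math import sqrt

def empirical_concordance(theta, X, y):
    """Share of ordered pairs with y_i > y_j whose scores agree in order."""
    s = [sum(t * x for t, x in zip(theta, row)) for row in X]
    num = den = 0
    for i, j in permutations(range(len(y)), 2):
        if y[i] > y[j]:
            den += 1
            if s[i] > s[j]:
                num += 1
    return num / den

def random_search(X, y, n_draws=500, seed=0):
    """Keep the best of n_draws random unit-norm coefficient vectors."""
    rng = random.Random(seed)
    q = len(X[0])
    best_theta, best_c = None, -1.0
    for _ in range(n_draws):
        t = [rng.gauss(0.0, 1.0) for _ in range(q)]
        norm = sqrt(sum(v * v for v in t))
        t = [v / norm for v in t]
        c = empirical_concordance(t, X, y)
        if c > best_c:
            best_theta, best_c = t, c
    return best_theta, best_c

# Simulated toy data: one index u, polynomial terms (u, u^2),
# and a response that is ordered by 0.5*u + u^2.
rng = random.Random(1)
u = [rng.gauss(0.0, 1.0) for _ in range(30)]
X = [[v, v * v] for v in u]
y = [0.5 * v + v * v for v in u]

theta_hat, c_hat = random_search(X, y)
print(theta_hat, c_hat)
```

Random search scales poorly with the number of polynomial terms; it is used here only to make the argmax concrete on a two-term example.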
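Estimation Procedure
Step 4: Adopt the concordance-based generalized BIC to estimate $d_0$, $k_0$, $S(B_0)$, $m_{d_0 k_0}(B_0^T Z)$, and $C_{0\max}$.
Tips:
(1) $IC(d, k)$ is computed from the maximized empirical concordance $C_n(\hat m_{dk}(\hat B_k^T Z))$ with a $\log n$ penalty involving $(C^{d+k}_{k} - 1)$, with $IC(0, k) = 1/2$
(2) $(\hat d, \hat k) = \arg\max_{d \ge 0,\ 1 \le k \le p} IC(d, k)$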
Asymptotic results
Consistent model selection: the parsimonious model $(d_0, k_0)$ among the class of correct models
$\sqrt{n}$-consistency of estimators of $S(B_0)$ and $m_{d_0 k_0}(B_0^T Z)$
Asymptotic normality of estimators of $C_{0\max}$
Wine Data
Vinho verde wine: red wine and white wine (from the Minho region of northern Portugal)
Collected from May 2004 to February 2007
Red wine: sample size n = 1599
White wine: n = 4898
Physicochemical and sensory tests
Wine data
Response (Y): preference score from 0 (bad) to 10 (excellent)
11 covariates (Z): fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
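A possible way to read such data into a response vector and a covariate matrix is sketched below. It assumes a semicolon-delimited text file with a quoted header row and a response column named `quality`, which is how the publicly distributed wine quality files are commonly laid out. The slides do not state the file format, and the two data rows in the example are made up for illustration.

```python
import csv
import io

def read_wine(text, response="quality"):
    """Split semicolon-delimited wine data into covariate names, Z, and Y.

    The delimiter and the response column name are assumptions about the
    file layout, not something stated in the slides.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    covariate_names = [c for c in reader.fieldnames if c != response]
    Z, Y = [], []
    for row in reader:
        Y.append(int(row[response]))
        Z.append([float(row[c]) for c in covariate_names])
    return covariate_names, Z, Y

# Header follows the 11 covariates listed above; the rows are made up.
sample = (
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";'
    '"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";'
    '"pH";"sulphates";"alcohol";"quality"\n'
    "7.0;0.5;0.1;2.0;0.08;10;40;0.997;3.3;0.6;10.0;6\n"
    "8.0;0.3;0.4;2.5;0.07;15;50;0.996;3.2;0.7;11.0;7\n"
)
names, Z, Y = read_wine(sample)
print(names, Y)
```

To use a file on disk, the text passed to `read_wine` could come from `open(path).read()`, where `path` is a placeholder for wherever the data are stored.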
Thank You !