Large Two-way Arrays
Douglas M. Hawkins, School of Statistics, University of Minnesota

1 Large Two-way Arrays. Douglas M. Hawkins, School of Statistics, University of Minnesota, doug@stat.umn.edu

2 What are 'large' arrays?
– # of rows at least in the hundreds, and/or
– # of columns at least in the hundreds.

3 Challenges/Opportunities
– Logistics of handling the data are more tedious.
– Standard graphical methods work less well.
– More opportunity for assumptions to fail.
but
– Parameter estimates are more precise.
– Fewer model assumptions may be possible.

4 Settings
– Microarray data
– Proteomics data
– Spectral data (fluorescence, absorption, …)

5 Common problems seen
– Outliers / heavy-tailed distributions
– Missing data
– A large # of variables hurts some methods

6 The ovarian cancer data. The data set as I have it:
– 15,154 variables (M/Z values), with % relative intensity recorded
– 91 controls (clinical normals)
– 162 ovarian cancer patients

7 The normals
– Give us an array of 15,154 rows and 91 columns.
– Qualifies as 'large'.
– The spectrum is very 'busy'.

8 (figure)

9 … not to mention outlier-prone. Subtract off a median for each M/Z and make a normal probability plot of the residuals.
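The check on this slide can be sketched with NumPy/SciPy. The array shape and the heavy-tailed t residuals below are synthetic stand-ins for the real spectra, not the paper's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for the controls array: rows are M/Z values, columns are subjects,
# with deliberately heavy-tailed (t, 3 df) variation around each row's level
X = rng.normal(size=(200, 1)) + rng.standard_t(df=3, size=(200, 91))

# Subtract off a median for each M/Z (each row) ...
residuals = X - np.median(X, axis=1, keepdims=True)

# ... and make a normal probability plot of the pooled residuals.
# probplot returns the plot coordinates plus a fitted reference line;
# heavy tails show up as the ordered values bending away from that line.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals.ravel())
```

In an interactive session, passing `plot=plt` to `probplot` draws the plot directly; the extremes of the t residuals bow away from the reference line, just as the outlier-prone spectra do.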

10 (figure)

11 Comparing cases, controls. A first pass at a rule to distinguish the normal controls from the cancer cases: calculate the two-sample t between groups for each distinct M/Z.
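A vectorized per-M/Z two-sample t is a one-liner in NumPy. This sketch uses synthetic data (the shift of 3 in the first few rows is an illustrative stand-in for genuinely separated M/Z values):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in: rows are M/Z values, columns are subjects
controls = rng.normal(size=(500, 91))   # 91 clinical normals
cancers = rng.normal(size=(500, 162))   # 162 cancer cases
cancers[:5] += 3.0                      # a few genuinely separated M/Z values

def rowwise_two_sample_t(a, b):
    """Pooled-variance two-sample t for each row of a vs. the same row of b."""
    na, nb = a.shape[1], b.shape[1]
    va = a.var(axis=1, ddof=1)
    vb = b.var(axis=1, ddof=1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(sp2 * (1 / na + 1 / nb))

t = rowwise_two_sample_t(cancers, controls)   # one t statistic per M/Z value
```

With these group sizes a mean separation of about 3 within-group standard deviations gives t near 24, consistent with the "t = 24 is around 3 sigma" remark on slide 13.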

12 (figure)

13 Good news / bad news
– Several places in the spectrum show large separation (t = 24 corresponds to around 3 sigma of separation).
– Visually these seem to be isolated spikes.
– This is due to the large # of narrow peaks.

14 (figure)

15 Variability also differs

16 Big differences in mean and variability suggest the conventional statistical tools of
– linear discriminant analysis,
– logistic regression, or
– quadratic or regularized discriminant analysis
using a selected set of features. Off-the-shelf software doesn't like 15K variables, but the methods are very do-able.

17 Return to beginning
– Are there useful tools for extracting information from these arrays?
– A robust singular value decomposition (RSVD) is one that merits consideration (see our two NISS tech reports).

18 Singular value approximation
– Some philosophy from Bradu (1984).
– Write X for the n x p data array.
– First remove structure you don't want to see.
– The k-term SVD approximation is
x_ij = sum over t = 1, …, k of r_it c_jt + e_ij

19
– The r_it are 'row markers'. You could use them as plot positions for the proteins.
– The c_jt are 'column markers'. You could use them as plot positions for the cases. They match their corresponding row markers.
– The e_ij are error terms. They should mainly be small.
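The classical (non-robust) version of this decomposition is a few lines of NumPy. The low-rank-plus-noise array here is synthetic, just to show the markers and error terms concretely:

```python
import numpy as np

rng = np.random.default_rng(2)
# Low-rank structure plus noise, standing in for a spectra array
n, p, k = 300, 40, 2
R = rng.normal(size=(n, k))     # underlying row markers r_it
C = rng.normal(size=(p, k))     # underlying column markers c_jt
X = R @ C.T + 0.1 * rng.normal(size=(n, p))

# Classical k-term SVD approximation: x_ij ~ sum_t r_it c_jt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
row_markers = U[:, :k] * s[:k]  # r_it, with the singular values absorbed
col_markers = Vt[:k].T          # c_jt, unit-length columns
E = X - row_markers @ col_markers.T  # error terms e_ij, mainly small
```

The next slides replace this classical fit with a robust alternating fit, precisely because `np.linalg.svd` (like PCA) is thrown off by outliers and cannot accept missing cells.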

20 Fitting the SVD
Conventionally done by principal component analysis. We avoid this for three reasons:
– PCA is highly sensitive to outliers.
– It requires complete data (an issue in many large data sets, if not this one).
– The standard approach would use a 15K-square covariance matrix.

21 Alternating robust fit algorithm
– Take trial values for the column markers. Fit the corresponding row markers using robust regression on the available data.
– Use the resulting row markers to refine the column markers.
– Iterate to convergence.
– For the robust regression we use least trimmed squares (LTS) regression.
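A minimal rank-1 sketch of the alternating idea, on synthetic data. The one-variable trimmed regression below is an LTS-flavoured stand-in (iteratively refit on the smallest squared residuals), not the authors' full LTS implementation, and the `mask` simply marks which cells are available:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 30
r_true = rng.normal(size=n)
c_true = rng.normal(size=p)
X = np.outer(r_true, c_true) + 0.05 * rng.normal(size=(n, p))
X[rng.integers(0, n, 60), rng.integers(0, p, 60)] += 8.0  # gross outliers
mask = np.ones_like(X, dtype=bool)   # available-data mask (all present here)

def trimmed_slope(y, x, keep=0.75):
    """Fit y ~ beta * x by least squares on the keep-fraction of points
    with the smallest squared residuals (an LTS-flavoured trim)."""
    beta = np.dot(x, y) / np.dot(x, x)
    for _ in range(5):
        resid2 = (y - beta * x) ** 2
        good = resid2 <= np.quantile(resid2, keep)
        beta = np.dot(x[good], y[good]) / np.dot(x[good], x[good])
    return beta

c = rng.normal(size=p)               # trial values for the column markers
for _ in range(20):
    # Fit row markers given column markers, then refine the column markers
    r = np.array([trimmed_slope(X[i, mask[i]], c[mask[i]]) for i in range(n)])
    c = np.array([trimmed_slope(X[mask[:, j], j], r[mask[:, j]]) for j in range(p)])

fit = np.outer(r, c)                 # rank-1 robust approximation of X
```

Because only the product r c' is identified, the individual markers are recovered up to a shared sign and scale; the fitted array itself should match the outlier-free structure closely.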

22 Result for the controls
– For the first run, I just removed a grand median.
– Plots of the first few row markers show fine structure like that of the mean spectrum and of the discriminators.

23 (figure)

24 But the subsequent terms capture the finer structure.

25 (figure)

26 (figure)

27 Uses for the RSVD
– Instead of feature selection, we can use the cases' c scores as variables in discriminant rules. This can be advantageous in reducing measurement variability, and it avoids feature selection bias.
– The scores can also be the basis for methods like cluster analysis.
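The first bullet can be sketched as follows. Everything here is synthetic and simplified: a classical SVD stands in for the RSVD, and a nearest-centroid rule stands in for the fuller discriminant methods of slide 16:

```python
import numpy as np

rng = np.random.default_rng(4)
# Spectra (rows = M/Z, columns = subjects): controls and cases share the
# row structure but the cases are shifted in the first marker direction
n_mz, n_ctrl, n_case, k = 400, 91, 162, 2
R = rng.normal(size=(n_mz, k))
C_ctrl = rng.normal(size=(n_ctrl, k))
C_case = rng.normal(size=(n_case, k)) + np.array([2.0, 0.0])
X = np.hstack([R @ C_ctrl.T, R @ C_case.T])
X += 0.1 * rng.normal(size=(n_mz, n_ctrl + n_case))
labels = np.array([0] * n_ctrl + [1] * n_case)

# Column-marker scores from an SVD of the combined array: one k-vector
# of c scores per subject, instead of 15K raw features
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = Vt[:k].T * s[:k]

# A nearest-centroid discriminant rule on the scores
centroids = np.stack([scores[labels == g].mean(axis=0) for g in (0, 1)])
pred = np.argmin(np.linalg.norm(scores[:, None, :] - centroids, axis=2), axis=1)
accuracy = (pred == labels).mean()
```

A two-sigma class separation in marker space survives the projection, so even this crude rule classifies well above chance; real use would of course assess the rule on held-out cases.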

28 Cluster analysis use. Consider methods based on the Euclidean distance between cases (k-means / Kohonen follow similar lines).

29 Substituting the SVD into the squared distance between columns i and j gives
sum over t of (c_it - c_jt)^2 ||r_t||^2 + ||e_i - e_j||^2 + cross-product terms
– The first term is the sum of squared differences in column markers, weighted by the squared Euclidean norm of the row markers.
– The second term is noise. It adds no information and detracts from performance.
– The third term, the cross-product, is approximately zero because of independence.

30 This leads to…
– The r, c scale is arbitrary. Make the column lengths 1, absorbing the singular value into c.
– Replace the column Euclidean distance with the squared distance between column markers. This removes the random variability.
– Similarly, for k-means/Kohonen, replace each column profile with its SVD approximation.
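The substitution is easy to check numerically on a synthetic low-rank-plus-noise array (a sketch, not the paper's data). Marker distances are a projection of the profile distances, so they can never exceed them, and when the noise is small they recover nearly all of the distance while dropping the noise term:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
n, p, k = 500, 60, 2
R = rng.normal(size=(n, k))
C = rng.normal(size=(p, k))
X = R @ C.T + 0.2 * rng.normal(size=(n, p))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
c = Vt[:k].T * s[:k]     # column markers, singular values absorbed into c

d_raw = pdist(X.T)       # Euclidean distances between full column profiles
d_svd = pdist(c)         # distances between the k-vector column markers
```

Here `d_svd` tracks `d_raw` closely while ignoring the noise directions, which is exactly why clustering on the markers is preferred.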

31 Special case
– If a one-term SVD suffices, we get an ordination of the rows and columns.
– Row ordination doesn't make much sense for spectral data.
– Column ordination orders the subjects 'rationally'.

32 The cancer group
– Carried out an RSVD of just the cancer cases.
– But this time removed the row median first. This corrects for overall abundance at each M/Z.
– The robust singular values are 2800, 1850, 1200, …, suggesting more than one dimension.

33 (figure)

34
– No striking breaks in the sequence.
– We can cluster, but we get more of a partition of a continuum.
– This suggests that severity varies smoothly.

35 Back to the two-group setting
– An interesting question (suggested by the Mahalanobis-Taguchi strategy): are the cancer cases alike?
– We can address this by an RSVD of the cancer cases and clustering on the column markers,
– or use the controls to get a multivariate metric and place the cancers in this metric.

36 Do a new control RSVD
– Subtract row medians.
– Get canonical variates for all cases versus just the controls
– (or, as we have plenty of cancer cases, conventionally, of cancer versus controls).
– Plot the two groups.

37

38
– Supports the earlier comment about the lack of big 'white space' in the cancer group: a continuum, not distinct subpopulations.
– The controls look a lot more homogeneous than the cancer cases.

39 Summary
– Large arrays: a challenge and an opportunity.
– Hard to visualize or use graphs.
– Many data sets show outliers / missing data / very heavy tails.
– A robust-fit singular value decomposition can handle these and provides a large data condensation.

40 Some references

