Large Two-way Arrays
Douglas M. Hawkins, School of Statistics, University of Minnesota

What are ‘large’ arrays? The number of rows is at least in the hundreds, and/or the number of columns is at least in the hundreds.

Challenges/Opportunities: the logistics of handling the data are more tedious, standard graphical methods work less well, and there is more opportunity for assumptions to fail. But parameter estimates are more precise, and it may be possible to make fewer model assumptions.

Settings: microarray data; proteomics data; spectral data (fluorescence, absorption, …).

Common problems seen: outliers and heavy-tailed distributions; missing data; a large number of variables, which hurts some methods.

The ovarian cancer data. The data set as I have it: roughly 15,000 variables (M/Z values), with % relative intensity recorded at each; 91 controls (clinical normals); 162 ovarian cancer patients.

The normals give us an array of roughly 15,000 rows and 91 columns. It qualifies as ‘large’, and the spectrum is very ‘busy’.

…not to mention outlier-prone: subtract off a median for each M/Z and make a normal probability plot of the residuals.
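A minimal sketch of that check, assuming the spectra sit in a NumPy array with one row per M/Z value and one column per control subject; the array and its dimensions here are illustrative placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder spectra: rows = M/Z values, columns = the 91 control subjects
# (illustrative data only; substitute the real intensity matrix)
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(15000, 91))

# Subtract each M/Z value's median across the subjects
residuals = (X - np.median(X, axis=1, keepdims=True)).ravel()

# Normal probability (Q-Q) plot of a subsample of the residuals;
# heavy tails show up as curvature away from the reference line
sample = rng.choice(residuals, size=20000, replace=False)
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Residuals after subtracting each M/Z median")
plt.show()
```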

Comparing cases and controls. A first pass at a rule to distinguish normal controls from cancer cases: calculate the two-sample t between the groups at each distinct M/Z.
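A sketch of that first pass, assuming `controls` and `cancers` are arrays with M/Z values as rows and subjects as columns (placeholder data); SciPy's vectorized two-sample t does the per-row work:

```python
import numpy as np
from scipy import stats

# Placeholder spectra: rows = M/Z values, columns = subjects (illustrative only)
rng = np.random.default_rng(1)
controls = rng.normal(size=(15000, 91))
cancers = rng.normal(size=(15000, 162))

# Two-sample t statistic at every distinct M/Z value, computed across subjects
t_stats, p_vals = stats.ttest_ind(cancers, controls, axis=1)

# The M/Z positions with the largest |t| are the candidate discriminators
top = np.argsort(-np.abs(t_stats))[:20]
print(np.column_stack([top, t_stats[top]]))
```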

Good news / bad news: there are several places in the spectrum with large separation (t = 24 corresponds to around 3 sigma of separation). Visually these seem to be isolated spikes, which is due to the large number of narrow peaks.

Variability also differs

Big differences in mean and variability suggest the conventional statistical tools of linear discriminant analysis, logistic regression, or quadratic/regularized discriminant analysis, using a selected set of features. Off-the-shelf software doesn’t like 15K variables, but the methods are very do-able.
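One way this could look with off-the-shelf tools, using the top-|t| M/Z values as the selected features and a regularized logistic regression; the data, feature count, and penalty here are illustrative assumptions, not the talk's actual analysis:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder spectra: rows = M/Z values, columns = subjects (illustrative only)
rng = np.random.default_rng(2)
controls = rng.normal(size=(15000, 91))
cancers = rng.normal(loc=0.1, size=(15000, 162))

# Select features by the per-M/Z two-sample t, as on the earlier slide
t_stats, _ = stats.ttest_ind(cancers, controls, axis=1)
selected = np.argsort(-np.abs(t_stats))[:50]

# Subjects as rows, the selected M/Z values as columns
X = np.vstack([controls.T, cancers.T])[:, selected]
y = np.r_[np.zeros(91), np.ones(162)]   # 0 = control, 1 = cancer

# Ridge-penalized logistic regression; regularization matters with only 253 cases.
# For an honest error rate the feature selection should be repeated inside each
# cross-validation fold (the feature-selection bias noted on a later slide).
clf = LogisticRegression(C=1.0, max_iter=1000)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```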

Return to the beginning: are there useful tools for extracting information from these arrays? Robust singular value decomposition (RSVD) is one that merits consideration (see our two NISS tech reports).

Singular value approximation. Some philosophy from Bradu (1984): write X for the n x p data array, and first remove any structure you don’t want to see. The k-term SVD approximation is x_ij ≈ r_i1 c_j1 + … + r_ik c_jk + e_ij.

The r_it are ‘row markers’; you could use them as plot positions for the proteins. The c_jt are ‘column markers’; you could use them as plot positions for the cases, and they match their corresponding row markers. The e_ij are error terms; they should mainly be small.

Fitting the SVD. This is conventionally done by principal component analysis. We avoid that for several reasons: PCA is highly sensitive to outliers; it requires complete data (an issue in many large data sets, if not this one); and the standard approach would use a 15K-square covariance matrix.

Alternating robust fit algorithm: take trial values for the column markers and fit the corresponding row markers using robust regression on the available data; use the resulting row markers to refine the column markers; iterate to convergence. For the robust regression we use least trimmed squares (LTS) regression.
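A minimal NumPy sketch of this alternating scheme. Concentration steps stand in for a full LTS solver, and all names and defaults are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def trimmed_lstsq(A, b, frac=0.75, n_steps=5):
    """Rough least-trimmed-squares fit of b ~ A @ beta via concentration steps:
    repeatedly refit ordinary least squares on the fraction of observations
    with the smallest absolute residuals (a stand-in for a full LTS solver)."""
    avail = ~np.isnan(b)                  # use only the available data
    A, b = A[avail], b[avail]
    h = max(int(frac * len(b)), A.shape[1] + 1)
    beta = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(n_steps):
        keep = np.argsort(np.abs(b - A @ beta))[:h]
        beta = np.linalg.lstsq(A[keep], b[keep], rcond=None)[0]
    return beta

def robust_svd(X, k=2, n_iter=20, seed=0):
    """Alternating robust fit of a k-term approximation X ~ R @ C.T,
    where R holds the row markers and C the column markers."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    C = rng.normal(size=(p, k))           # trial values for the column markers
    R = np.zeros((n, k))
    for _ in range(n_iter):
        for i in range(n):                # fit row markers given column markers
            R[i] = trimmed_lstsq(C, X[i])
        for j in range(p):                # refine column markers given row markers
            C[j] = trimmed_lstsq(R, X[:, j])
        C /= np.linalg.norm(C, axis=0)    # fix the scale indeterminacy
    return R, C
```

In practice one would also monitor the change in the fitted approximation between iterations and stop once it becomes negligible.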

Result for the controls: in the first run, I just removed a grand median. Plots of the first few row markers show fine structure like that of the mean spectrum and of the discriminators.

But the subsequent terms capture the finer structure.

Uses for the RSVD: instead of feature selection, we can use the cases’ c scores as variables in discriminant rules. This can be advantageous in reducing measurement variability, and it avoids feature-selection bias. The scores can also be used as the basis for methods like cluster analysis.
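For instance, the column-marker scores can go straight into a discriminant rule. This sketch reuses the hypothetical `robust_svd` function from the earlier slide with placeholder data; it illustrates the idea, not the talk's actual results:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Placeholder array, kept small for speed: rows = M/Z values,
# columns = all 91 + 162 subjects (illustrative only)
rng = np.random.default_rng(3)
X_all = rng.normal(size=(2000, 253))
y = np.r_[np.zeros(91), np.ones(162)]    # 0 = control, 1 = cancer

# Each subject is now described by k column-marker scores
# instead of thousands of M/Z intensities
R, C = robust_svd(X_all, k=5)            # robust_svd as sketched earlier
print("CV accuracy on c scores:",
      cross_val_score(LinearDiscriminantAnalysis(), C, y, cv=5).mean())
```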

Cluster analysis use: consider methods based on the Euclidean distance between cases (k-means / Kohonen follow similar lines).
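In the notation of the SVD slide, the squared distance between cases j and j' decomposes as below (a reconstruction, assuming the row-marker vectors are orthogonal, as they are for an exact SVD); the next slide discusses the three terms:

```latex
d^2(j,j') = \sum_i \bigl(x_{ij}-x_{ij'}\bigr)^2
          = \sum_i \Bigl(\sum_t r_{it}\,(c_{jt}-c_{j't}) + (e_{ij}-e_{ij'})\Bigr)^{2}
          = \underbrace{\sum_t \lVert r_{\cdot t}\rVert^2 (c_{jt}-c_{j't})^2}_{\text{column-marker differences}}
          + \underbrace{\sum_i (e_{ij}-e_{ij'})^2}_{\text{noise}}
          + \underbrace{2\sum_t (c_{jt}-c_{j't})\sum_i r_{it}(e_{ij}-e_{ij'})}_{\text{cross-product}\;\approx\;0}
```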

The first term is the sum of squared differences in the column markers, weighted by the squared Euclidean norm of the row markers. The second term is noise: it adds no information and detracts from performance. The third term, the cross-product, is approximately zero because of independence.

This leads to… The r, c scale is arbitrary, so make the column lengths 1, absorbing the eigenvalue into c. Replace the column Euclidean distance with the squared distance between column markers; this removes the random variability. Similarly, for k-means/Kohonen, replace each column profile with its SVD approximation.
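A small illustration of that substitution, again reusing the hypothetical `robust_svd` sketch: rescaling the column markers by the row markers' lengths and clustering them stands in for clustering the SVD-approximated column profiles (approximately, when the row markers are near-orthogonal):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: rows = M/Z values, columns = subjects (illustrative only)
rng = np.random.default_rng(4)
X_all = rng.normal(size=(2000, 253))
R, C = robust_svd(X_all, k=3)            # robust_svd as sketched earlier

# Absorb the row markers' lengths (the "singular values") into the column
# markers, so distances between the scaled markers approximate distances
# between the SVD-approximated column profiles, with the noise term dropped
C_scaled = C * np.linalg.norm(R, axis=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(C_scaled)
print(np.bincount(labels))
```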

Special case: if a one-term SVD suffices, we get an ordination of the rows and columns. Row ordination doesn’t make much sense for spectral data, but column ordination orders the subjects ‘rationally’.

The cancer group: we carried out an RSVD of just the cancer cases, but this time removed the row median first, which corrects for the overall abundance at each M/Z. The robust singular values are 2800, 1850, 1200, …, suggesting more than one dimension.

There are no striking breaks in the sequence. We can cluster, but we get more of a partition of a continuum, suggesting that severity varies smoothly.

Back to the two-group setting. An interesting question (suggested by the Mahalanobis-Taguchi strategy): are the cancer cases all alike? We can address this by an RSVD of the cancer cases and clustering on the column markers, or by using the controls to get a multivariate metric and placing the cancers in that metric.

Do a new control RSVD: subtract the row medians; get canonical variates for all cases versus just the controls (or, as we have plenty of cancer cases, conventionally, of cancer versus controls); plot the two groups.

This supports the earlier comment about the lack of big ‘white space’ in the cancer group: a continuum, not distinct subpopulations. The controls look a lot more homogeneous than the cancer cases.

Summary: large arrays are both a challenge and an opportunity. They are hard to visualize or graph. Many data sets show outliers, missing data, and very heavy tails. Robust-fit singular value decomposition can handle these and provides a large condensation of the data.

Some references