Innovative Paths to Better Medicines
Design Considerations in Molecular Biomarker Discovery Studies
Doris Damian and Robert McBurney
June 6, 2007

Confidential Information – Do Not Reproduce or Distribute

Outline of Presentation
Introduction:
– Mass Spectrometry Data
– Study objectives and questions
Statistical Processing of MS Data
– Sample normalization
– Removal of peak-specific batch and other temporal trends
– Filtering of noisy peaks
Design Considerations
– Power calculations for univariate biomarkers
– Power calculations for multivariate biomarkers (regression)

Mass Spectrometry Data
Measurements: chemical compounds of different classes (proteins, lipids, polar and non-polar metabolites, amino acids, etc.).
The variables constituting the data sets are peak intensities (peaks) identified by m/z and retention time.
The peak intensities are proportional to the amount of analyte detected by the mass spectrometer.
Note that p >> n: the number of peaks far exceeds the number of samples.
[Figure panels: Total Ion Chromatogram; Selected Ion Chromatogram; MS of Individual Peaks. Legend: biological samples, QC samples. Figure modified from: (source missing)]

Structure of a Molecular Biomarker Discovery Study
[Diagram stages: Objectives, Questions, Design, Experiment, Statistical Processing, Data Analysis]

Study Objectives and Questions
Objectives
– Diagnosis
– Elucidation of Mechanisms of Action (MoA)
Questions
– What is a minimal set of biomarkers?
– What are all the biomarkers?
– What are the molecular pathways?
Biomarker: A characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic response(s) to a therapeutic intervention.

Statistical Processing
Sample normalization
– correction of baseline differences between samples
Removal of peak-specific batch and other temporal trends
– due to instrument and processing limitations, samples are acquired sequentially in batches, and peaks exhibit batch-to-batch variation;
– instrument performance may become unstable over time, and samples may undergo degradation. These are the main causes of the temporal variation observed in peak intensities.
Filtering of noisy peaks
– for each biological sample, replicate measurements are obtained;
– the estimated correlation between these replicates is used as a filter for noisy data.
Presented at IBC’s Biomarkers and Molecular Diagnostic conferences, September 2006

Sample Normalization
Correction of baseline differences between samples, based on Internal Standards (IS).
– Internal Standards are known exogenous compounds, added to the biological samples in fixed amounts at the beginning of the sample preparation stage (same for all samples).
– They are used to account for sample variability (e.g., pipetting errors) during sample preparation and acquisition.

Typical Sample Profiles of IS Peaks – before Normalization

Sample Normalization
Normalization: the statistical procedure of multivariate scaling of samples based on (a subset of) IS peaks.
Y_ij = log(intensity) of IS peak i (i = 1,…,I) in sample j (j = 1,…,J).
The sample-specific factors are estimated in an ANOVA model of Y_ij and removed from all peaks.

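A minimal sketch of the normalization step, under stated assumptions. The slide's exact model and symbols are not recoverable, so the sketch assumes the additive two-way form Y_ij = mu + alpha_i + beta_j + error, with the sample factors beta_j constrained to sum to zero. The function names are hypothetical. For complete IS data, the least-squares estimate of each sample factor under that assumed model is the sample's mean over IS peaks minus the grand mean.

```python
from statistics import mean


def estimate_sample_factors(log_is):
    """Estimate one additive factor per sample from IS peaks.

    log_is[i][j] is the log intensity of IS peak i in sample j (no missing
    values). Under the ASSUMED model Y_ij = mu + alpha_i + beta_j + e_ij with
    sum_j beta_j = 0, the least-squares estimate of beta_j is the mean of
    sample j over the IS peaks minus the grand mean.
    """
    n_samples = len(log_is[0])
    sample_means = [mean(row[j] for row in log_is) for j in range(n_samples)]
    grand_mean = mean(sample_means)
    return [m - grand_mean for m in sample_means]


def normalize(log_peaks, factors):
    """Subtract the sample factors from every peak (rows) on the log scale."""
    return [[y - b for y, b in zip(row, factors)] for row in log_peaks]


if __name__ == "__main__":
    # Two IS peaks measured in three samples; sample 2 reads high, sample 3 low.
    log_is = [[10.0, 10.5, 9.5],
              [8.0, 8.5, 7.5]]
    factors = estimate_sample_factors(log_is)
    print(factors)
    print(normalize(log_is, factors))
```

The same factors would then be subtracted from the log intensities of all non-IS peaks, which is what removes baseline differences common to every peak in a sample.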
Typical Sample Profiles of IS Peaks – after Normalization
Through normalization, temporal trends common to all peaks are removed.

Typical Temporal Profiles of IS Peaks – before Normalization

Typical Temporal Profiles of IS Peaks – after Normalization

Peak-Specific Temporal Trends – after Normalization

The Need for Batch Corrections
The within-batch and between-batch patterns cause visible batch separations.
If one does not account for these intrinsic experimental trends, important biological effects may be obscured.
[Figure: PCA plot of the data set after normalization, colored by batch. Axes: first principal component, second principal component]

Removal of Peak-Specific Temporal Trends
Based on QC samples (ideally)
– QC samples: a pool of material from the biological samples in a study, aliquoted into a set of identical samples that are acquired at specific intervals in each batch of samples.

Removal of Peak-Specific Temporal Trends
Temporal trend within batch b (b = 1,…,B batches): estimated based on the QC samples within batch b.

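The slide does not give the functional form of the within-batch trend, so the following is only an illustrative sketch. It assumes a straight-line trend in acquisition order, fitted by least squares to the QC samples of each batch for a single peak. The fitted trend is subtracted from every sample in that batch, and the overall QC mean is added back to keep the original scale. The record layout (`batch`, `order`, `is_qc`, `y`) and the function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope of ys on xs."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - slope * xbar, slope


def remove_batch_trends(runs):
    """Remove an ASSUMED linear within-batch trend for one peak.

    runs: list of dicts with keys 'batch', 'order' (acquisition order within
    the batch), 'is_qc' (True for QC samples) and 'y' (log intensity).
    Each batch needs at least two QC samples at different positions.
    """
    qc_mean = mean(r["y"] for r in runs if r["is_qc"])
    corrected = []
    for b in sorted({r["batch"] for r in runs}):
        batch = [r for r in runs if r["batch"] == b]
        qc = [r for r in batch if r["is_qc"]]
        intercept, slope = fit_line([r["order"] for r in qc],
                                    [r["y"] for r in qc])
        for r in batch:
            trend = intercept + slope * r["order"]
            # Subtract the batch-specific trend, then restore the overall level.
            corrected.append({**r, "y": r["y"] - trend + qc_mean})
    return corrected
```

Because the QC samples are aliquots of one pool, any systematic change in their intensities within a batch is attributed to the experiment rather than to biology, which is why they are used to estimate the trend.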
Correlations between Biological Replicates
When the same sample is measured several times, we require the measurements to correlate well.
The correlation between replicates can be expressed as a tradeoff between the biological variance and the measurement error variance.
Ideal case: no measurement error.
The estimated correlation can be used to filter noisy peaks.

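The symbols on this slide were lost, so the following sketch rests on assumptions. In the usual variance-components form, the correlation between replicates is the biological variance divided by the sum of the biological and measurement error variances, which equals 1 when there is no measurement error. The sketch assumes two replicates per biological sample and estimates that correlation for each peak with the Pearson correlation between the two replicate vectors. The 0.7 cutoff and the function names are hypothetical.

```python
from math import sqrt


def replicate_correlation(rep1, rep2):
    """Pearson correlation between two replicate measurement vectors.

    rep1[k] and rep2[k] are the two measurements of biological sample k
    for one peak.
    """
    m1 = sum(rep1) / len(rep1)
    m2 = sum(rep2) / len(rep2)
    sxy = sum((a - m1) * (b - m2) for a, b in zip(rep1, rep2))
    sxx = sum((a - m1) ** 2 for a in rep1)
    syy = sum((b - m2) ** 2 for b in rep2)
    return sxy / sqrt(sxx * syy)


def filter_noisy_peaks(peaks, min_corr=0.7):
    """Keep peaks whose replicate correlation reaches min_corr.

    peaks maps a peak identifier to a (rep1, rep2) pair of equal-length lists.
    The default cutoff is an arbitrary illustration, not a recommended value.
    """
    return {pid: reps for pid, reps in peaks.items()
            if replicate_correlation(*reps) >= min_corr}


if __name__ == "__main__":
    peaks = {
        "good": ([1.0, 2.0, 3.0, 4.0], [1.1, 2.0, 3.2, 3.9]),
        "noisy": ([1.0, 2.0, 3.0, 4.0], [3.0, 1.0, 4.0, 2.0]),
    }
    print(sorted(filter_noisy_peaks(peaks)))
```

With more than two replicates per sample, an intraclass correlation from a one-way variance decomposition would be the natural replacement for the pairwise Pearson estimate.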
Examples of Correlations (two extremes)

Power Calculations
Statistical power = probability of detecting biomarkers
The power in biomarker discovery studies is a function of:
– The sample size
– The separation between the groups (e.g., MFC)
– The proportion of biomarkers in the data set
– The false discovery rate (FDR) allowed
– The platform variability
– The within-group variability
– Other factors (e.g., other covariates in the model)

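How these factors combine can be explored by simulation. The sketch below is not the authors' procedure; it is a simplified stand-in under several assumptions. It compares two groups, uses independent normally distributed log intensities with a common standard deviation, treats log(MFC) as the mean shift for true biomarkers, computes an unequal-variance test statistic with a normal approximation for the p-value, and controls the FDR with the Benjamini-Hochberg procedure. All function and parameter names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def two_sided_p(x, y):
    """Two-sample test p-value using a normal approximation (an assumption)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    z = (mean(x) - mean(y)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvals, fdr):
    """Return the set of indices rejected by the BH step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])
    cutoff = 0
    for rank, k in enumerate(order, start=1):
        if pvals[k] <= fdr * rank / m:
            cutoff = rank
    return set(order[:cutoff])


def simulate_power(n_per_group, mfc, prop_biomarkers, fdr,
                   sd=0.5, n_peaks=500, n_sims=20, seed=0):
    """Average proportion of true biomarkers detected at the given FDR."""
    rng = random.Random(seed)
    shift = math.log(mfc)  # fold change expressed on the log scale
    n_true = round(prop_biomarkers * n_peaks)
    found = 0
    for _ in range(n_sims):
        pvals = []
        for k in range(n_peaks):
            delta = shift if k < n_true else 0.0  # first n_true peaks are real
            x = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
            y = [rng.gauss(delta, sd) for _ in range(n_per_group)]
            pvals.append(two_sided_p(x, y))
        hits = benjamini_hochberg(pvals, fdr)
        found += sum(1 for k in hits if k < n_true)
    return found / (n_sims * n_true)


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, simulate_power(n, mfc=2.0, prop_biomarkers=0.1, fdr=0.1))
```

Varying `n_per_group`, `mfc`, `prop_biomarkers`, `fdr` and `sd` one at a time shows how each factor in the list moves the power. Platform variability and within-group variability are lumped into the single `sd` parameter here.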
Illustration I: Power Curves
[Figure legend: one curve each for MFC = 1.7, MFC = 2.0, and MFC = 3.0; solid: FDR 0.1; dashed: FDR 0.2]

Power Curves Not Accounting for the FDR
There is no loss in power (the proportion of biomarkers discovered), but the resulting FDR may be undesirable.
[Figure legend: one curve each for MFC = 1.7, MFC = 2.0, and MFC = 3.0; dotted: estimated FDR. Axis label: FDR]

Power Calculation for Multivariate Biomarkers (Regression)
Classical Setting
– n > p
– Linear regression model
– Parametric (F) test of model significance
– Computationally inexpensive
Biomarker Discovery Setting
– n << p
– Regression with constraints on parameters (elastic net)
– Dimensionality reduction needed (through cross-validation)
– Non-parametric (label permutations) test of model significance
– Computationally very expensive

Illustration: Power for Regression Model
– Multivariate biomarker
– Parameter of interest
– Test: parameter of interest = 0
– Power = proportion of times that this hypothesis is rejected

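The definition of power as a rejection proportion can be illustrated with a small simulation. This sketch deliberately swaps in a much simpler setting than the elastic net with cross-validation: the biomarker components are assumed known in advance and independent, they are combined into a single standardized score, and the null hypothesis of no association with the outcome is tested by permuting the outcome labels. The parameter `rho` is taken here to be the correlation between the score and the outcome, which is an assumption about the slide's lost notation. All names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(score, y, n_perm, rng):
    """Label-permutation p-value for the absolute correlation."""
    observed = abs(pearson(score, y))
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson(score, y_perm)) >= observed:
            exceed += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return (exceed + 1) / (n_perm + 1)


def regression_power(n_samples, rho, n_components=10, alpha=0.05,
                     n_sims=100, n_perm=199, seed=0):
    """Proportion of simulated data sets in which the null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Score: sum of independent N(0, 1) components, scaled to variance 1.
        score = [sum(rng.gauss(0.0, 1.0) for _ in range(n_components))
                 / math.sqrt(n_components) for _ in range(n_samples)]
        # Outcome with population correlation rho to the score.
        y = [rho * s + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
             for s in score]
        if permutation_pvalue(score, y, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, regression_power(n, rho=0.5))
```

Even in this reduced form the cost is the number of simulations times the number of permutations. When the components must first be selected from many analytes, each permutation also requires refitting and cross-validating the model, which is where the computational expense grows.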
Power Calculation – Regression
[Figure axes: rho, Number of Samples, Power]
– Biomarker with 10 Components (known in advance) …10 minutes to calculate
– Biomarker with 10 Components (buried among 90 other analytes) …days to calculate

Thank you!