STAC: A multi-experiment method for analyzing array-based genomic copy number data Sharon J. Diskin, Thomas Eck, Joel P. Greshock, Yael P. Mosse, Tara.


1 STAC: A multi-experiment method for analyzing array-based genomic copy number data
Sharon J. Diskin, Thomas Eck, Joel P. Greshock, Yael P. Mosse, Tara Naylor, Christian J. Stoeckert, Jr., Barbara L. Weber, John M. Maris, Gregory R. Grant
University of Pennsylvania
Children’s Hospital of Philadelphia
MGED 8 Meeting, Bergen, Norway, September 11-13, 2005

2 Background
Gain and loss of chromosomal DNA occur in many cancers.
Regions of recurrent gain or loss contain genes critical to the genesis and/or progression of cancer.
Accurate identification of such regions is essential for prioritizing follow-up efforts.
Array Comparative Genomic Hybridization (aCGH) is a method for detecting genomic copy number variation on a genome-wide scale with high resolution.
Platforms: BAC, cDNA, ROMA, Affymetrix SNP chips, Agilent technology.

3 Selecting significant aberrations across samples
[Figure: aberrations per sample along chromosome 8]
Researchers traditionally rely on a simple frequency threshold to identify “significant” regions of gain/loss.
This is followed by tedious manual review of the regions to define boundaries.
This process is time-consuming at best, lacks statistical control, is subject to investigator bias, and may miss essential regions.

4 Research Goal
Develop a statistical method for assessing the significance of consistent copy number aberrations across multiple samples.
Validate this method using known biology and comparison to traditional methods.

5 Example Data and Terminology
A location is a fixed-width stretch of genomic DNA (e.g., 1 Mb).
Experiments/samples are plotted along the vertical axis, one per row.
A sequence of one or more aberrant locations is called an aberrant interval.
We call a set of intervals for a given sample a profile for that sample.
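
The terminology above can be made concrete with a small sketch. It assumes each sample arrives as a row of 0/1 calls per location (1 = aberrant), and it collapses that row into the sample's profile, a list of aberrant intervals. The name `profile_from_calls` and the inclusive `(start, end)` index convention are illustrative choices, not part of STAC.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) location indices, end inclusive (assumed convention)


def profile_from_calls(calls: List[int]) -> List[Interval]:
    """Collapse one sample's 0/1 calls into maximal runs of 1s (aberrant intervals)."""
    intervals: List[Interval] = []
    start = None
    for i, call in enumerate(calls):
        if call and start is None:
            start = i                      # a new aberrant interval opens here
        elif not call and start is not None:
            intervals.append((start, i - 1))  # the interval closed at the previous location
            start = None
    if start is not None:                  # interval runs to the last location
        intervals.append((start, len(calls) - 1))
    return intervals


calls = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
print(profile_from_calls(calls))  # [(1, 2), (5, 5), (7, 9)]
```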

6 The Problem
Find locations which have more intervals (gains/losses) covering them than would be expected by chance.
True underlying aberration rate is unknown.
Take the observed aberrations as given and test for the significance of consistent aberrations across samples.

7 Statistical Approach
Null Model: observed intervals of aberration are equally likely to occur anywhere in the stretch of the genome being considered.
General Approach:
(1) Choose an appropriate statistic.
(2) Apply a permutation procedure under the null model to estimate a null distribution of the statistic.
(3) Assess the (multiple testing corrected) significance of observed values of the statistic by comparing to the null distribution.
Permutation: random rearrangement of intervals within each profile.
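
A rough sketch of the three steps, using the frequency statistic (number of samples aberrant at a location). Several details are assumptions made only for illustration, since the slides omit them: permuted intervals keep their lengths and may touch but not overlap, and the multiple testing correction compares each observed value to the null distribution of the maximum frequency across locations. All function names are hypothetical.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), end inclusive


def permute_profile(profile: List[Interval], n_locations: int,
                    rng: random.Random) -> List[Interval]:
    """Re-place a profile's intervals at random, keeping lengths, without overlap."""
    lengths = [end - start + 1 for start, end in profile]
    rng.shuffle(lengths)
    k = len(lengths)
    free = n_locations - sum(lengths)  # locations not covered by any interval
    # Stars and bars: pick how many free locations precede each interval.
    picks = sorted(rng.sample(range(free + k), k))
    placed, offset = [], 0
    for j, length in enumerate(lengths):
        start = (picks[j] - j) + offset
        placed.append((start, start + length - 1))
        offset += length
    return placed


def frequency(profiles: List[List[Interval]], n_locations: int) -> List[int]:
    """Frequency statistic: number of profiles with an interval covering each location."""
    counts = [0] * n_locations
    for profile in profiles:
        for start, end in profile:
            for loc in range(start, end + 1):
                counts[loc] += 1
    return counts


def frequency_p_values(profiles, n_locations, n_perm=1000, seed=0):
    """Per-location p-values against the null distribution of the maximum frequency."""
    rng = random.Random(seed)
    observed = frequency(profiles, n_locations)
    null_max = []
    for _ in range(n_perm):
        permuted = [permute_profile(p, n_locations, rng) for p in profiles]
        null_max.append(max(frequency(permuted, n_locations)))
    return [sum(m >= obs for m in null_max) / n_perm for obs in observed]


profiles = [[(4, 5)], [(4, 6)], [(3, 5)], [(4, 4)], [(0, 1)]]
p_values = frequency_p_values(profiles, n_locations=20, n_perm=500)
print(p_values[4], p_values[10])
```

Comparing against the maximum over locations is one common way to control for testing many locations at once; the correction actually used by STAC may differ.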

8 Frequency statistic results
[Figure annotation: freq = 9]
A statistic is needed that is sensitive to tight alignment, even if the aberration is not significantly frequent.

9 The footprint statistic
Stack: a set S of aligned intervals containing at most one interval per profile and with at least one location common to all intervals.
Footprint: F(S) = the number of locations c such that c is contained in some interval of stack S.
In practice, F(S) is normalized: NF(S) = F(S)/E(F(S)).
Null Distributions: find the minimal NF(S) for each (sample) subset size using a heuristic search, then use the distributions to assign (multiple testing corrected) p-values to locations (details omitted).
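
The stack and footprint definitions can be illustrated with a short sketch. It covers only the unnormalized F(S); the expectation E(F(S)) and the heuristic search are omitted in the slides and are not reconstructed here. The helper names are hypothetical, and the caller is assumed to pass at most one interval per profile.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), end inclusive


def is_stack(intervals: List[Interval]) -> bool:
    """True if all intervals share at least one common location."""
    if not intervals:
        return False
    return max(s for s, _ in intervals) <= min(e for _, e in intervals)


def footprint(stack: List[Interval]) -> int:
    """F(S): number of locations covered by at least one interval of the stack."""
    if not is_stack(stack):
        raise ValueError("intervals do not share a common location")
    # Intervals sharing a location form one contiguous union, so the
    # footprint is the span from the leftmost start to the rightmost end.
    return max(e for _, e in stack) - min(s for s, _ in stack) + 1


tight = [(10, 12), (10, 13), (11, 12)]  # tightly aligned intervals
loose = [(2, 12), (10, 25), (11, 40)]   # same number of intervals, poorly aligned
print(footprint(tight), footprint(loose))  # 4 39
```

A small footprint for a given number of intervals indicates tight alignment, which is the property the frequency statistic alone does not capture.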

10 Footprint statistic results
The footprint statistic coupled with the search strategy reveals locations significantly consistent within subsets.
[Figure annotations: p-value = 0.0001; p-value = 0.0050]

11 STAC Algorithm Specification
INPUT: matrix of binary gain/no change (or loss/no change) calls for each location along a chromosome arm.
OUTPUT: for each location along the chromosome arm:
a) the best stack covering that location
b) two p-values for that location (one for each statistic)
Example input:
Sample chr1:1-1000000 chr1:1000001-2000000 ...
1712DZ1T1001
1714DZ1T1001
...

12 Validation Data
UPenn BAC Array (Greshock et al. 2004, Gen. Res.)
~4,200 BAC clones
1. 69% BAC end sequenced
2. 28% STS mapping
3. 3% full BAC sequence
Spacing: ~0.91 Mb (chrs 1-X)
[Figure: aCGH BAC coverage (chr13)]
Publicly available data sets:
42 neuroblastoma cell lines (Mosse et al. 2005, Genes Chr Cancer)
47 primary sporadic breast tumors (Naylor et al. 2005, submitted)

13 Traditional Processing – Many Samples
1. Define regions of aberration for each sample.
2. Determine frequency of aberration at each location.
3. Threshold frequency (e.g., NBL = 25%, Breast = 30%).
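
For comparison with the statistical approach, the traditional procedure can be sketched in a few lines. This assumes step 1 has already produced a 0/1 call per sample and location, and that a location passes when its frequency is at or above the threshold (whether the comparison is inclusive is an assumption). The function name is illustrative.

```python
from typing import List, Tuple


def frequent_regions(calls: List[List[int]], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) runs of locations whose aberration frequency meets the threshold."""
    n_samples = len(calls)
    n_locations = len(calls[0])
    # Step 2: fraction of samples aberrant at each location.
    freq = [sum(row[j] for row in calls) / n_samples for j in range(n_locations)]
    # Step 3: keep maximal runs of locations at or above the threshold.
    regions, start = [], None
    for j, f in enumerate(freq):
        if f >= threshold and start is None:
            start = j
        elif f < threshold and start is not None:
            regions.append((start, j - 1))
            start = None
    if start is not None:
        regions.append((start, n_locations - 1))
    return regions


calls = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
print(frequent_regions(calls, threshold=0.5))  # [(1, 2)]
```

Note that the result depends entirely on the chosen cutoff, which is the lack of statistical control the talk describes.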

14 Traditional Processing – Many Samples
[Figure: example common regions of aberration, labeled 90%, 70%, 90%, 60%]

15 Validation
Breast Cancer
  92% (11/12) gain regions
  85% (11/13) loss regions
  also 86% (47/55) of the gains (suppl. data)
  Avg p-value: gain 0.00549, loss 0.00899
Neuroblastoma
  83% (19/22) gain regions
  100% (12/12) loss regions
  Avg p-value: gain 0.00447, loss 0.00719
Boundaries differ by < 1 Mb on average and in several cases are narrowed by STAC.
STAC identifies prognostically relevant regions in neuroblastoma.
[Figure: MYCN amplification at 2p24; 2p gain]

16 Additional Regions Identified
Neuroblastoma
  94 gains covering 341 Mb
  80 losses covering 305 Mb
Breast Cancer
  149 gains covering 525 Mb
  124 losses covering 384 Mb

17 Regions segregate with known biology
Neuroblastoma Cell Lines
646 Mb of significant locations scored (gain, loss, no change)
Agglomerative hierarchical clustering, Pearson correlation, complete linkage
Evidence for 2 sample clusters
- Cluster 1 characterized by pattern of loss
- Cluster 2 characterized by pattern of gain
* missed by traditional method
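
The clustering setup named on this slide can be sketched in plain Python. The encoding of gain, loss and no change as +1, -1 and 0, the use of 1 minus the Pearson correlation as the distance, and the toy data are all assumptions for illustration; the actual analysis and software used are not described in the slides.

```python
from math import sqrt
from typing import List


def pearson_distance(x: List[float], y: List[float]) -> float:
    """1 - Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def complete_linkage(samples: List[List[float]], n_clusters: int) -> List[List[int]]:
    """Agglomerative clustering: repeatedly merge the two clusters whose
    farthest pair of members is closest (complete linkage)."""
    clusters = [[i] for i in range(len(samples))]
    dist = [[pearson_distance(a, b) for b in samples] for a in samples]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy scores per significant location: -1 = loss, 0 = no change, +1 = gain.
samples = [
    [-1, -1, -1, 0, 0, 0, 0, 0],  # loss pattern
    [-1, -1, 0, 0, 0, 0, 0, 0],   # loss pattern
    [1, 1, 1, 0, 0, 0, 0, 0],     # gain pattern
    [1, 1, 0, 0, 0, 0, 0, 0],     # gain pattern
]
print(complete_linkage(samples, n_clusters=2))  # [[0, 1], [2, 3]]
```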

18 Future Plans
Release stand-alone Java version of STAC
Extend STAC to account for high-level gains and homozygous deletions
Extend STAC to handle stacks with 2 or more intervals per profile (co-occurring aberrations)
http://www.cbil.upenn.edu/STAC

