STAC: A multi-experiment method for analyzing array-based genomic copy number data

Presentation transcript:

STAC: A multi-experiment method for analyzing array-based genomic copy number data
Sharon J. Diskin, Thomas Eck, Joel P. Greshock, Yael P. Mosse, Tara Naylor, Christian J. Stoeckert, Jr., Barbara L. Weber, John M. Maris, Gregory R. Grant
University of Pennsylvania
Children’s Hospital of Philadelphia
MGED 8 Meeting, Bergen, Norway, September 11-13, 2005

Background
Gain and loss of chromosomal DNA occur in many cancers.
Regions of recurrent gain or loss contain genes critical to the genesis and/or progression of cancer.
Accurate identification of such regions is essential for prioritizing follow-up efforts.
Array Comparative Genomic Hybridization (aCGH) is a method for detecting genomic copy number variation on a genome-wide scale with high resolution.
Platforms: BAC, cDNA, ROMA, Affymetrix SNP chips, Agilent technology.

Selecting significant aberrations across samples
(Figure axis labels: Samples; Chromosome 8)
Researchers traditionally rely on a simple frequency threshold to identify “significant” regions of gain/loss.
This is followed by tedious manual review of the regions to define boundaries.
This process is time consuming at best, lacks statistical control, is subject to investigator bias, and may miss essential regions.

Research Goal
Develop a statistical method for assessing the significance of consistent copy number aberrations across multiple samples.
Validate this method using known biology and comparison to traditional methods.

Example Data and Terminology
A location is a fixed-width stretch of genomic DNA (e.g., 1 Mb).
Experiments/samples are plotted along the vertical axis, one per row.
A sequence of one or more consecutive aberrant locations is called an aberrant interval.
The set of intervals for a given sample is called the profile for that sample.
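The terminology above can be illustrated with a minimal sketch. It assumes a sample's calls are stored as a list of 0/1 values, one per location, and that intervals are reported as inclusive (start, end) index pairs; the function name `profile_intervals` and this encoding are illustrative choices, not part of the slides.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) location indices, end inclusive (assumed encoding)


def profile_intervals(calls: List[int]) -> List[Interval]:
    """Return the profile of one sample: its maximal runs of aberrant (1) locations."""
    intervals: List[Interval] = []
    start = None
    for i, call in enumerate(calls):
        if call and start is None:
            start = i  # a new aberrant interval opens here
        elif not call and start is not None:
            intervals.append((start, i - 1))  # the interval closed at the previous location
            start = None
    if start is not None:
        intervals.append((start, len(calls) - 1))  # interval runs to the last location
    return intervals


sample = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
print(profile_intervals(sample))  # [(1, 2), (5, 5), (7, 9)]
```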

The Problem
Find locations which have more intervals (gains/losses) covering them than would be expected by chance.
The true underlying aberration rate is unknown.
Take the observed aberrations as given and test for the significance of consistent aberrations across samples.

Statistical Approach
Null model: observed intervals of aberration are equally likely to occur anywhere in the stretch of the genome being considered.
General approach:
(1) Choose an appropriate statistic.
(2) Apply a permutation procedure under the null model to estimate a null distribution of the statistic.
(3) Assess the (multiple testing corrected) significance of observed values of the statistic by comparing to the null distribution.
Permutation: random rearrangement of intervals within each profile.
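A minimal sketch of the three steps, using the per-location frequency as the statistic. Several details are assumptions made for illustration and are not taken from the slides: profiles are 0/1 lists, a permutation keeps interval lengths but randomizes their order and the gaps between them (intervals may end up adjacent), and multiple testing is handled by comparing each observed frequency with the maximum frequency seen in each permutation.

```python
import random
from typing import List


def permute_profile(calls: List[int], rng: random.Random) -> List[int]:
    """Randomly rearrange the aberrant intervals of one profile, keeping their lengths."""
    lengths, run = [], 0
    for call in calls:
        if call:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    rng.shuffle(lengths)
    free = len(calls) - sum(lengths)  # locations with no aberration
    # Split the free locations into len(lengths) + 1 gaps using random cut points.
    cuts = sorted(rng.randint(0, free) for _ in lengths)
    gaps = [b - a for a, b in zip([0] + cuts, cuts + [free])]
    permuted: List[int] = []
    for gap, length in zip(gaps, lengths):
        permuted += [0] * gap + [1] * length
    permuted += [0] * gaps[-1]
    return permuted


def frequency(profiles: List[List[int]]) -> List[int]:
    """Number of samples with an aberration at each location."""
    return [sum(column) for column in zip(*profiles)]


def permutation_pvalues(profiles: List[List[int]], n_perm: int = 1000, seed: int = 0) -> List[float]:
    """Per-location p-values from the null distribution of the maximum frequency."""
    rng = random.Random(seed)
    observed = frequency(profiles)
    null_max = []
    for _ in range(n_perm):
        permuted = [permute_profile(p, rng) for p in profiles]
        null_max.append(max(frequency(permuted)))
    # Share of permutations whose maximum frequency reaches the observed value.
    return [sum(m >= f for m in null_max) / n_perm for f in observed]


profiles = [[0, 1, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0]]
print(permutation_pvalues(profiles, n_perm=200))
```

The random cut points are a convenience and do not guarantee that every arrangement is equally likely; a faithful implementation of the null model would need to sample placements uniformly.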

Frequency statistic results
(Figure label: freq = 9)
Need a statistic sensitive to tight alignment, even if the aberration is not significantly frequent.

The footprint statistic
Stack: a set S of aligned intervals containing at most one interval per profile and with at least one location common to all intervals.
Footprint: F(S) = the number of locations c such that c is contained in some interval of stack S.
In practice, F(S) is normalized: NF(S) = F(S)/E(F(S)).
Null distributions: find the minimal NF(S) for each (sample) subset size using a heuristic search, then use the distributions to assign (multiple testing corrected) p-values to locations (details omitted).
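The definitions of stack and footprint can be sketched as follows. The encoding is an assumption: a stack is a dictionary from sample name to one inclusive (start, end) interval, which enforces "at most one interval per profile". The slides do not say how E(F(S)) is obtained, so the expected footprint is passed in as a plain number here.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # (start, end) location indices, end inclusive (assumed encoding)


def is_stack(stack: Dict[str, Interval]) -> bool:
    """True if all intervals share at least one common location."""
    if not stack:
        return False
    latest_start = max(start for start, _ in stack.values())
    earliest_end = min(end for _, end in stack.values())
    return latest_start <= earliest_end


def footprint(stack: Dict[str, Interval]) -> int:
    """F(S): number of locations covered by at least one interval of the stack."""
    if not is_stack(stack):
        raise ValueError("intervals do not share a common location")
    # Because every interval contains the common location, their union is contiguous.
    first = min(start for start, _ in stack.values())
    last = max(end for _, end in stack.values())
    return last - first + 1


def normalized_footprint(stack: Dict[str, Interval], expected: float) -> float:
    """NF(S) = F(S) / E(F(S)); the expected footprint is supplied by the caller."""
    return footprint(stack) / expected


tight = {"s1": (10, 12), "s2": (10, 13), "s3": (11, 12)}
loose = {"s1": (2, 12), "s2": (10, 30), "s3": (11, 20)}
print(footprint(tight))  # 4
print(footprint(loose))  # 29
```

Both stacks contain three samples, so a frequency statistic scores them equally; the smaller footprint of the first stack reflects its tighter alignment.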

Footprint statistic results
The footprint statistic coupled with the search strategy reveals locations significantly consistent within subsets.
(Figure labels: p-value = ; p-value = )

STAC Algorithm Specification
INPUT: matrix of binary gain/no change (or loss/no change) calls for each location along a chromosome arm.
OUTPUT: for each location along the chromosome arm:
a) the best stack covering that location
b) two p-values for that location (one for each statistic)
(Example input table residue: column headers Sample, chr1:, chr1:; row labels DZ1T, DZ1T)

Validation Data
UPenn BAC Array (Greshock et al. 2004, Gen. Res.)
~4,200 BAC clones:
1. 69% BAC end sequenced
2. 28% STS mapping
3. 3% full BAC sequence
Spacing: ~0.91 Mb (chrs 1-X)
(Figure: aCGH BAC coverage, chr13)
Publicly available data sets:
42 neuroblastoma cell lines (Mosse et al. 2005, Genes Chr Cancer)
47 primary sporadic breast tumors (Naylor et al. 2005, submitted)

Traditional Processing – Many Samples
1. Define regions of aberration for each sample.
2. Determine frequency of aberration at each location.
3. Threshold frequency (e.g., NBL = 25%, Breast = 30%).
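The three traditional steps can be sketched as follows, assuming each sample's regions of aberration are already encoded as a 0/1 list per location. The function name `frequent_regions` and the choice to treat a frequency equal to the threshold as passing are illustrative assumptions.

```python
from typing import List, Tuple


def frequent_regions(profiles: List[List[int]], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) runs of locations whose aberration frequency meets the threshold."""
    n_samples = len(profiles)
    # Step 2: fraction of samples that are aberrant at each location.
    freq = [sum(column) / n_samples for column in zip(*profiles)]
    # Step 3: keep maximal runs of locations at or above the threshold.
    regions: List[Tuple[int, int]] = []
    start = None
    for i, f in enumerate(freq):
        if f >= threshold and start is None:
            start = i
        elif f < threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(freq) - 1))
    return regions


profiles = [
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0],
]
print(frequent_regions(profiles, threshold=0.5))  # [(2, 3)]
```

The threshold here is a fixed cutoff with no null distribution behind it, which is the lack of statistical control that the permutation approach is meant to address.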

Traditional Processing – Many Samples (continued)
Example common regions of aberration (figure labels: 90%, 70%, 90%, 60%).

Validation
Breast Cancer: 92% (11/12) gain regions; 85% (11/13) loss regions; also 86% (47/55) of the gains (suppl. data). Avg pval gain: ; loss: 
Neuroblastoma: 83% (19/22) gain regions; 100% (12/12) loss regions. Avg pval gain: ; loss: 
Boundaries differ by < 1 Mb on average and in several cases are narrowed by STAC.
STAC identifies prognostically relevant regions in neuroblastoma. Shown: MYCN amplification at 2p24 (figure label: 2p gain).

Additional Regions Identified
Neuroblastoma: 94 gains covering 341 Mb; 80 losses covering 305 Mb.
Breast Cancer: 149 gains covering 525 Mb; 124 losses covering 384 Mb.

Regions segregate with known biology
Neuroblastoma cell lines: 646 Mb of significant locations scored (gain, loss, no change).
Clustering: agglomerative hierarchical, Pearson correlation, complete linkage.
Evidence for 2 sample clusters:
- Cluster 1 characterized by pattern of loss
- Cluster 2 characterized by pattern of gain
* missed by traditional method
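A minimal pure-Python sketch of agglomerative clustering with Pearson correlation distance and complete linkage. The encoding of scores as -1 (loss), 0 (no change), and +1 (gain), the use of 1 - r as the distance, and stopping at a fixed number of clusters are assumptions for illustration; the toy data are invented and are not the neuroblastoma results.

```python
from math import sqrt
from typing import List


def pearson_distance(x: List[float], y: List[float]) -> float:
    """1 - Pearson correlation; constant vectors are treated as uncorrelated (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0
    return 1 - cov / (sx * sy)


def complete_linkage(samples: List[List[float]], k: int) -> List[List[int]]:
    """Merge the closest pair of clusters until k remain; returns lists of sample indices."""
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(
                    pearson_distance(samples[i], samples[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


# Toy scores per significant location: -1 = loss, 0 = no change, +1 = gain.
samples = [
    [-1, -1, 0, 0, 1],
    [-1, -1, 0, 1, 1],
    [1, 1, 0, -1, -1],
    [1, 0, 0, -1, -1],
]
print(sorted(sorted(c) for c in complete_linkage(samples, k=2)))  # [[0, 1], [2, 3]]
```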

Future Plans
Release a stand-alone Java version of STAC.
Extend STAC to account for high-level gains and homozygous deletions.
Extend STAC to handle stacks with 2 or more intervals per profile (co-occurring aberrations).