Correlation Matrix Diagonal Segmentation (CMDS) A Fast Genome-wide Approach for Identifying Recurrent DNA Copy Number Alterations across Cancer Patients.


Slide 1: Correlation Matrix Diagonal Segmentation (CMDS): A Fast Genome-wide Approach for Identifying Recurrent DNA Copy Number Alterations across Cancer Patients. Qunyuan Zhang (1), Li Ding (2), Aldi Kraja (1), Ingrid Borecki (1), Michael A. Province (1). (1) Division of Statistical Genomics, (2) Genome Center, Washington University School of Medicine, USA. IGES, Sept. 2008, St. Louis.

Slide 2: Introduction. DNA copy number alteration (CNA) is one of the hallmarks of genomic abnormality in tumor cells. Identifying recurrent CNA (RCNA) across a cohort of cancer patients may provide important insight into the molecular mechanisms of oncogenesis and produce useful information for the diagnosis and treatment of cancers. Most current methods for RCNA identification adopt a two-step strategy, which requires discretization (binarization, segmentation or discontinuous smoothing) of each individual sample's data before searching for RCNA regions across multiple samples. Although discretization provides a useful CNA pattern or profile for individual samples, it may lose information about the original distribution when converting raw continuous signals into discretized data, and may therefore reduce the overall statistical power of RCNA detection. In addition, individual-sample discretization together with the subsequent multiple-sample analysis may impose a heavy total computational burden, which could impede application, especially in genome-wide studies with high-density signals and large sample sizes.

Slide 3: Purpose. To develop a fast genome-wide approach, Correlation Matrix Diagonal Segmentation (CMDS), for identifying recurrent DNA copy number alterations (RCNAs) in large-scale genome-wide studies at the population level. The approach needs no data discretization for individual samples and directly analyzes the raw data of all samples. Here we present:
• Statistical power (receiver operating characteristic, ROC) of CMDS under a variety of configurations of multiple factors;
• Comparison of statistical power and computational efficiency with a typical existing discretization-based approach;
• Application of CMDS to real data from the Tumor Sequencing Project (TSP).

Slide 4: The CMDS Approach (Rationale). Because the copy number (CN) changes in the same chromosomal region across individuals (slide 6, fig a), an RCNA causes co-variation (correlation) between the chromosomal sites within the recurrent region, and therefore forms a correlation block on the diagonal of the CN correlation matrix of chromosomal sites (slide 6, fig b). As each correlation block corresponds to an RCNA region, RCNAs can be identified by detecting correlation blocks along the diagonal of the correlation matrix.
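The rationale can be illustrated with a small simulation. The sketch below is not the authors' code: the function names, the noise level, the carrier frequency and the log2-ratio shift of 0.58 (roughly a gain from 2 to 3 copies) are all assumptions chosen for illustration. It shows that two sites inside a shared gain region are correlated across samples, while two sites outside it are not.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=100, m=60, region=(20, 40), f=0.3, shift=0.58, noise=0.2, seed=1):
    """Toy log2-ratio matrix (assumed model): a fraction f of the n samples
    carries a gain of `shift` at every site in `region`; the rest is noise."""
    rng = random.Random(seed)
    carriers = set(rng.sample(range(n), round(f * n)))
    X = []
    for s in range(n):
        row = [rng.gauss(0.0, noise) for _ in range(m)]
        if s in carriers:
            for j in range(region[0], region[1]):
                row[j] += shift
        X.append(row)
    return X


def column(X, j):
    """Values of site j across all samples."""
    return [row[j] for row in X]


X = simulate()
r_inside = pearson(column(X, 25), column(X, 35))   # both sites inside the region
r_outside = pearson(column(X, 5), column(X, 50))   # both sites outside the region
print(f"inside: {r_inside:.2f}, outside: {r_outside:.2f}")
```

Under this toy model the within-region correlation is driven by the carrier samples shifting together, which is the block structure the method looks for along the diagonal.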

Slide 5: The CMDS Approach (Procedure)
1. Prepare copy number (log2 ratio) data as an n×m matrix X, where n = number of samples and m = number of chromosomal sites (see slide 6, fig a).
2. Calculate Pearson's correlation coefficient $r_{ij}$ between each pair of chromosomal sites i and j.
3. Normalize $r_{ij}$ through Fisher's transformation, $z_{ij} = \frac{1}{2}\ln\frac{1+r_{ij}}{1-r_{ij}}$, and obtain the normalized correlation matrix Z (see slide 6, fig b).
4. Specify a small square block size b (e.g. b=10) and slide the block along the diagonal of matrix Z. For each block h, calculate the diagonal transformed value (the formula is not preserved in this transcript; see slide 6, fig c).
5. Under the null hypothesis that there is no CNA (i.e. no correlation between chromosomal sites), the block statistic follows a normal distribution with a mean of 0 and a variance stated on the original slide (not preserved in this transcript). Based on this, a p-value for each chromosomal block can be calculated and used to determine the significance of RCNA regions (see slide 6, fig d).
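The procedure can be sketched in a few lines of plain Python. This is a hedged illustration, not the authors' R implementation: the transcript does not preserve the block statistic or its null variance, so the sketch assumes the statistic is the mean of the Fisher-transformed correlations over the b(b-1)/2 site pairs in a block, and assumes a null variance of 1/((n-3)·b(b-1)/2) with a one-sided upper-tail test. The function names and toy data are hypothetical.

```python
import math
import random


def correlation_matrix(X):
    """Pearson correlations between the columns (sites) of X (samples x sites)."""
    n, m = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(m)]
    dev = []
    for c in cols:
        mu = sum(c) / n
        dev.append([v - mu for v in c])
    norm = [math.sqrt(sum(d * d for d in dv)) for dv in dev]
    R = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            r = sum(a * b for a, b in zip(dev[i], dev[j])) / (norm[i] * norm[j])
            R[i][j] = R[j][i] = r
    return R


def block_scan(X, b=10):
    """Return (start, mean_z, p) for every b x b block on the diagonal of Z."""
    n, m = len(X), len(X[0])
    R = correlation_matrix(X)
    k = b * (b - 1) // 2                   # site pairs per block
    sd = math.sqrt(1.0 / ((n - 3) * k))    # ASSUMED null sd of the block mean
    out = []
    for h in range(m - b + 1):
        # Fisher's transformation z = atanh(r); undefined when r is exactly +/-1.
        zs = [math.atanh(R[i][j])
              for i in range(h, h + b) for j in range(i + 1, h + b)]
        mean_z = sum(zs) / k               # ASSUMED block statistic
        p = 0.5 * math.erfc(mean_z / (sd * math.sqrt(2.0)))  # upper-tail normal p
        out.append((h, mean_z, p))
    return out


# Toy data (assumed): 100 samples, 60 sites, gain at sites 20..39 in 30 samples.
rng = random.Random(7)
carriers = set(rng.sample(range(100), 30))
X = [[rng.gauss(0.0, 0.2) + (0.58 if s in carriers and 20 <= j < 40 else 0.0)
      for j in range(60)] for s in range(100)]

results = block_scan(X, b=10)
best_start, best_z, best_p = max(results, key=lambda t: t[1])
print(best_start, round(best_z, 2), best_p)
```

The pure-Python correlation loop is quadratic in the number of sites, so for real arrays a vectorized matrix routine would be the natural replacement; the block scan itself only touches entries near the diagonal.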

Slide 6: Illustration of CMDS
a. Raw copy number data of 100 samples and 500 chromosomal sites (red denotes copy number higher than 2)
b. Correlation matrix of the 500 sites (the white block indicates the high-correlation RCNA region)
c. Diagonal transformed values
d. Negative log10(P) values for the tests of the RCNA region

Slide 7: Factors Affecting the Power of CMDS. The statistical power of CMDS depends on multiple factors, including:
• Block size (b) chosen for the diagonal transformation
• Sample size (n)
• Frequency of the RCNA in the population (f)
• Amplitude (i.e. copy number) of the RCNA region (c)
• Total number of chromosomal sites (m) involved in the analysis
• Number of sites within the RCNA region (t)

Slide 8: Expected and Observed Type I Errors. Results are based on 1000 simulation replicates (b=20, n=50, f=0.1, c=3, m=5000, t=50). Conclusion: the type I error observed with the CMDS p-value calculation is very close to the expected level, which allows a quick test without re-sampling or permutation techniques.
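A check of this kind can be sketched as a null simulation: generate data with no RCNA, compute a block p-value, and count how often it falls below the nominal level. The sketch below is an illustration only. It uses a much smaller configuration than the slide, and it assumes a block statistic (mean Fisher-transformed correlation over all site pairs in the block) and a null variance of 1/((n-3)·k) for k pairs, neither of which is preserved in this transcript.

```python
import math
import random


def block_p_value(X):
    """Upper-tail p-value for one block whose sites are the columns of X."""
    n, b = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(b)]
    dev = []
    for c in cols:
        mu = sum(c) / n
        dev.append([v - mu for v in c])
    norm = [math.sqrt(sum(d * d for d in dv)) for dv in dev]
    zs = []
    for i in range(b):
        for j in range(i + 1, b):
            r = sum(x * y for x, y in zip(dev[i], dev[j])) / (norm[i] * norm[j])
            zs.append(math.atanh(r))             # Fisher's transformation
    mean_z = sum(zs) / len(zs)                   # ASSUMED block statistic
    sd = math.sqrt(1.0 / ((n - 3) * len(zs)))    # ASSUMED null sd
    return 0.5 * math.erfc(mean_z / (sd * math.sqrt(2.0)))


def type_one_error(alpha=0.05, reps=500, n=50, b=8, seed=3):
    """Fraction of null replicates (independent noise only) with p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        X = [[rng.gauss(0.0, 1.0) for _ in range(b)] for _ in range(n)]
        if block_p_value(X) < alpha:
            hits += 1
    return hits / reps


observed = type_one_error()
print(f"nominal 0.05, observed {observed:.3f}")
```

If the assumed null distribution is adequate, the observed fraction should sit near the nominal 0.05, up to the sampling error of 500 replicates.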

Slide 9: ROC Curves of CMDS Under Multiple Configurations. Simulation parameters for each panel (the parameter omitted from a panel's list is presumably the one varied in that panel):
a) n=50, f=0.1, c=3, m=1000, t=10~50 (random)
b) b=20, f=0.1, c=3, m=1000, t=30
c) b=20, n=50, c=3, m=1000, t=30
d) b=20, n=50, f=0.1, m=1000, t=30
e) b=20, n=50, f=0.1, c=3, m=1000
f) b=20, n=50, f=0.1, c=3, t=30
Results are based on 500 simulation replicates. TPR: true positive rate; FPR: false positive rate.
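For readers unfamiliar with how ROC points are obtained from simulated p-values, here is a minimal sketch. The function name, the toy p-values and the thresholds are made up for illustration and are not taken from the slides: a block is called positive when its p-value is at or below a threshold, and TPR and FPR are then computed against the known simulated truth.

```python
def roc_points(p_values, is_rcna, thresholds):
    """Return one (FPR, TPR) pair per threshold.

    p_values  : p-value of each tested block
    is_rcna   : True where the block truly lies in a simulated RCNA region
    thresholds: significance cut-offs; a block is called when p <= cut-off
    """
    pos = sum(is_rcna)
    neg = len(is_rcna) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(p_values, is_rcna) if y and p <= t)
        fp = sum(1 for p, y in zip(p_values, is_rcna) if not y and p <= t)
        points.append((fp / neg, tp / pos))
    return points


# Hypothetical toy input: three true RCNA blocks and three null blocks.
p_values = [0.001, 0.20, 0.03, 0.50, 0.04, 0.90]
is_rcna = [True, True, True, False, False, False]
points = roc_points(p_values, is_rcna, [0.01, 0.05, 1.0])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Sweeping the threshold from 0 to 1 traces the full curve; averaging the points over simulation replicates gives curves of the kind described on the slide.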

Slide 10: Comparison with Another Approach
Power: The figure shows the ROC curves of CMDS and a typical discretization-based approach, STAC (Diskin et al., 2006). Before STAC analysis, GLAD (Hupé et al., 2004) was used to smooth and discretize individual sample data. Results are based on 500 simulation replicates (b=20, n=50, f=0.1, c=3~4, m=300, t=30).
Computer time: GLAD-STAC took 2820 seconds (47 min); CMDS took 15 seconds. The comparison was performed on a DELL OPTIPLEX 755 PC. Both GLAD and CMDS were implemented in R 2.5.1; STAC (permutation number = 10000) was run in Java (under Windows XP 5.1). The same data set was used for both (100 samples; the number of chromosomal sites is not preserved in this transcript). In the GLAD-STAC analysis, most of the time was spent in GLAD.
Conclusion: Compared with the discretization-based approach, CMDS can achieve higher power with a much smaller computational burden.

Slide 11: Application of CMDS. We apply CMDS to a real data set from the NHGRI Tumor Sequencing Project (TSP), which contains DNA copy number data of tumor tissues from 371 lung cancer (adenocarcinoma) patients, measured with the Affymetrix Human Mapping 250K STY SNP array. This data set has previously been analyzed with a discretization-based method (GISTIC) and published elsewhere (Weir et al., 2007). It is now publicly available (the URL is not preserved in this transcript). Our results show that CMDS can identify most of the interesting, important regions that have been reported previously, as well as some novel, unreported regions (see slides 12~15).

Slide 12: CMDS Analysis of TSP Data (1). Reported regions with interesting candidate oncogenes: EGFR, MYC.

Slide 13: CMDS Analysis of TSP Data (2). Reported regions with interesting candidate oncogenes: CCND1, KRAS.

Slide 14: CMDS Analysis of TSP Data (3). Reported regions with interesting candidate oncogenes: CDK4; NKX2-1, MBIP.

Slide 15: CMDS Analysis of TSP Data (4). Unreported novel regions.

Slide 16: Summary
• CMDS directly analyzes raw copy number (log2 ratio) data at the population level;
• CMDS needs no discretization of individual sample data and adopts an easily implemented, fast diagonal transformation technique, which substantially reduces the computational burden;
• CMDS exploits correlation information between chromosomal sites, which increases the statistical power of RCNA identification;
• CMDS is particularly suitable for a quick search for RCNA regions in genome-wide data from large populations;
• The R code for CMDS analysis (test version, unpublished) can be obtained from Qunyuan Zhang.

Slide 17: References
1. Diskin S J et al. (2006) STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Research, 16:1149–
2. Hupé P et al. (2004) Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20:3413–
3. Shah S P et al. (2007) Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics, 23:450–
4. Weir B A et al. (2007) Characterizing the cancer genome in lung adenocarcinoma. Nature, 450: