Some slides adapted from J. Fridlyand BioSys course: DNA Microarray Analysis – Lecture, 2007 Analysis of Array CGH Data by Hanni Willenbrock.

1 Outline
Introduction to comparative genomic hybridization (CGH) and array CGH
Data analysis approaches
-Breakpoint detection
-Loss and gain analysis
-Application of segmentation to testing
Real data example 1: Application to a primary tumor dataset
Real data example 2: Comparative genomic profiling of bacterial strains

2 Comparative Genomic Hybridization
Study types:
-Gain or loss of genetic material
-To find variations in the genetic material
Purposes:
-Study of chromosomal aberrations often found in cancer and developmental abnormalities.
-Study of variations in the baseline sequence in a microbial population (microbial comparative genomics).

3 A Variety of Genetic Alterations Underlie Developmental Abnormalities and Disease
Inappropriate gene activation or inactivation can be caused by:
-Mutation
-Epigenetic gene silencing (e.g. addition of methyl groups)
-Reciprocal translocation (exchange of fragments between two non-homologous chromosomes)
-Gain or loss of genetic material
Any of the above may lead to activation of an oncogene or to inactivation of a tumor suppressor.

4 Existing techniques for detecting structural abnormalities Albertson and Pinkel, Human Molecular Genetics, 2003

5 Some microarray platforms for copy number analysis
-BAC arrays
-Affymetrix SNP chip (500 K)
-Representational oligonucleotide microarray analysis (ROMA)
-Whole genome tiling arrays
-Own design (NimbleGen/NimbleExpress)

6 Array CGH: BAC arrays 12 mm HumArray human BAC clones spotted in triplicates kbp

7 Array CGH Maps DNA Copy Number Alterations to Positions in the Genome [Figure labels: test genomic DNA, reference genomic DNA, Cot-1 DNA; ratio plotted against position on sequence, showing loss and gain of DNA copies in tumor]

8 Example: Detection of DiGeorge region (A) Detection of deletion in the DiGeorge region by FISH. A chromosome 22 subtelomere probe (green) and the TUPLE1 probe for the DiGeorge region (red) were hybridized to metaphase chromosomes from a normal individual and an individual with the deletion. The arrow indicates the missing red FISH signal on the deleted chromosome. (B) Array CGH copy number profile of chromosome 22 showing deletion in the DiGeorge region (arrow). Albertson and Pinkel, Human Molecular Genetics, 2003

9 Structural abnormalities Albertson and Pinkel, Human Molecular Genetics, 2003 (HSR: homogeneously staining region)

10 Tumor Genomes are Stable Copy Number Profiles of a Tumor & Recurrence

11 Analysis of array CGH Goal: To partition the clones into sets with the same copy number and to characterize the genomic segments in terms of copy number. Biological model: genomic rearrangements lead to gains or losses of sizable contiguous parts of the genome, possibly spanning entire chromosomes, or, alternatively, to focal high-level amplifications.

12 Varying genomic complexity Breakpoints

13 Exercise Part I: Plot and view array CGH data (DNA Microarray Analysis Course)

14 Observed clone value and spatial coherence [Figure: two clone-value distributions, N(-.3, .08^2) and N(.6, .1^2), with question marks on some clones] It is useful to exploit the physical dependence of nearby clones, which translates into copy number dependence.

15 Expected log2 ratio as a function of copy number change, normal cell contamination and ploidy [Figure: panels by reference ploidy and percentage of normal cell contamination]
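The relationship named on this slide can be sketched numerically. This is a minimal sketch, assuming a simple mixture model that the slides do not spell out: the test sample contains a fraction `tumor_fraction` of tumor cells carrying `copies` copies of the locus at average ploidy `ploidy`, the remaining cells are normal diploid cells, and centering places the sample-wide average at a log2 ratio of 0. The function and parameter names are illustrative only.

```python
import math

def expected_log2_ratio(copies, tumor_fraction=1.0, ploidy=2.0):
    """Expected log2 ratio under an assumed tumor/normal mixture model."""
    # Average copies of the locus per cell in the mixed sample
    # (normal cells are assumed to carry 2 copies).
    locus = tumor_fraction * copies + (1.0 - tumor_fraction) * 2.0
    # Average ploidy of the mixed sample; centering is assumed to
    # put this baseline at a log2 ratio of 0.
    baseline = tumor_fraction * ploidy + (1.0 - tumor_fraction) * 2.0
    return math.log2(locus / baseline)

# Single-copy gain and loss at increasing normal cell contamination.
for contamination in (0.0, 0.1, 0.5):
    gain = expected_log2_ratio(3, tumor_fraction=1.0 - contamination)
    loss = expected_log2_ratio(1, tumor_fraction=1.0 - contamination)
    print(f"normal cells {contamination:.0%}: gain {gain:+.2f}, loss {loss:+.2f}")
```

Under this assumed model, contamination shrinks both gains and losses toward 0, and a higher ploidy shifts every copy number level downward, which is why fixed thresholds can miss single-copy changes.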

16 Simulation Study Many algorithms to choose from Mainly evaluated only on limited examples Few comparisons between algorithm performance Choice of evaluation criteria: -False breakpoint detection vs. missed breakpoints -Sample type preferences (size of segments, noise, etc)

17 Methods for Segmentation
-HMM: Hidden Markov Model (aCGH package). Fits HMMs in which any state is reachable from any other state (Fridlyand et al., JMVA, 2004).
-CBS: Circular binary segmentation (DNAcopy package). Splits the chromosomes into contiguous regions of equal copy number, where each split can yield up to three pieces, and assesses the significance of the proposed splits using a permutation reference distribution (Olshen et al., Biostatistics, 2004).
-GLAD: Gain and Loss Analysis of DNA (GLAD package). Detects chromosomal breakpoints by estimating a piecewise constant function based on adaptive weights smoothing (Hupé et al., Bioinformatics, 2004).
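To make the idea of split-and-test segmentation concrete, here is a minimal sketch of plain recursive binary segmentation with a permutation test. It is not CBS as implemented in DNAcopy (there is no circular, three-piece split and no speed-up), and the names `best_split` and `segment`, the minimum segment size of two clones, and the defaults `n_perm=200` and `alpha=0.01` are assumptions made for illustration.

```python
import math
import random

def best_split(values):
    """Return (index, score) of the split with the largest two-sample t-like score."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    best_index, best_score = None, 0.0
    left_sum = left_sq = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        left_sq += values[i - 1] ** 2
        if i < 2 or n - i < 2:  # require at least two clones on each side
            continue
        mean_l = left_sum / i
        mean_r = (total - left_sum) / (n - i)
        # Pooled within-segment variance from running sums.
        ss = (left_sq - i * mean_l ** 2) + (total_sq - left_sq - (n - i) * mean_r ** 2)
        se = (max(ss, 0.0) / (n - 2) * (1 / i + 1 / (n - i))) ** 0.5
        diff = abs(mean_l - mean_r)
        score = diff / se if se > 0 else (math.inf if diff > 0 else 0.0)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

def segment(values, start=0, n_perm=200, alpha=0.01, rng=None):
    """Recursively split `values`; return the sorted breakpoint positions."""
    rng = rng or random.Random(0)
    index, score = best_split(values)
    if index is None:
        return []
    # Permutation reference distribution: shuffle the clone order and
    # record how often the best score is at least as large as observed.
    shuffled = list(values)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_split(shuffled)[1] >= score:
            exceed += 1
    if (exceed + 1) / (n_perm + 1) > alpha:
        return []  # split not significant: keep this stretch as one segment
    return (segment(values[:index], start, n_perm, alpha, rng)
            + [start + index]
            + segment(values[index:], start + index, n_perm, alpha, rng))

# Toy profile: 20 clones near 0 followed by 20 clones near 1.
profile = [0.0, 0.1] * 10 + [1.0, 1.1] * 10
print(segment(profile))
```

A breakpoint at position k means that clone k starts a new segment. For real analyses the published packages named above are the appropriate tools; this sketch only illustrates the split-and-test idea.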

18 Comparison Scheme
-Use of simulated data, where the truth is known
-The noise is controlled (see later slide)
[Figure labels: true breakpoint, false predicted breakpoint, one segment]

19 Breakpoint Detection Accuracy

20 Exercise Part II: Segmentation and breakpoint prediction

21 Merging segments Note that all procedures operate on individual chromosomes, therefore resulting in a large number of segments with mean values close to each other. Additional challenge: reduce the number of segments by merging those that are likely to correspond to the same copy number. This will facilitate inference of altered regions.
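The merging step can be illustrated with a deliberately simplified rule. This is a sketch under assumptions, not the merging procedure used in the lecture: segments are pooled across chromosomes, sorted by mean, and neighbouring levels are merged when their means differ by less than a fixed `threshold` (a hypothetical parameter; a statistical test between segments could be used instead).

```python
def merge_levels(segments, threshold=0.1):
    """Merge segment means that lie within `threshold` of each other.

    segments: list of (mean_log2_ratio, n_clones) tuples.
    Returns one merged level per input segment, in the input order.
    """
    order = sorted(range(len(segments)), key=lambda k: segments[k][0])
    groups = []  # each group: [weighted_sum, total_clones, member_indices]
    for k in order:
        mean, size = segments[k]
        # Compare against the running (clone-weighted) mean of the last group.
        if groups and mean - groups[-1][0] / groups[-1][1] < threshold:
            groups[-1][0] += mean * size
            groups[-1][1] += size
            groups[-1][2].append(k)
        else:
            groups.append([mean * size, size, [k]])
    merged = [0.0] * len(segments)
    for weighted_sum, total, members in groups:
        for k in members:
            merged[k] = weighted_sum / total
    return merged

# Five segments from different chromosomes collapse to three levels.
segments = [(0.02, 50), (0.55, 20), (-0.03, 30), (0.60, 10), (-0.45, 15)]
print([round(level, 3) for level in merge_levels(segments)])
```

After merging, each distinct level can be interpreted as one copy number state, which is what makes gain and loss calling across chromosomes comparable.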

22 Merging For estimating actual copy number levels from segmentations

23 Segmentation and Merging

24 ROC Curve: Identification of copy number alterations for varying thresholds
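How such a curve is traced can be sketched as follows, assuming that a clone is called altered when the absolute value of its (segmented) log2 ratio exceeds a threshold, and that the true status is known from simulation. The function name and the toy values are illustrative and are not taken from the lecture data.

```python
def roc_points(scores, is_altered, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    positives = sum(is_altered)
    negatives = len(is_altered) - positives
    points = []
    for t in thresholds:
        # Call a clone altered when its absolute log2 ratio exceeds t.
        called = [abs(s) > t for s in scores]
        tp = sum(c and a for c, a in zip(called, is_altered))
        fp = sum(c and not a for c, a in zip(called, is_altered))
        points.append((fp / negatives, tp / positives))
    return points

scores = [0.05, -0.4, 0.6, 0.1, -0.02, 0.3]
truth = [False, True, True, False, False, True]
for t, (fpr, tpr) in zip([0.0, 0.2, 0.5, 1.0], roc_points(scores, truth, [0.0, 0.2, 0.5, 1.0])):
    print(f"threshold {t}: FPR {fpr:.2f}, TPR {tpr:.2f}")
```

Sweeping the threshold from low to high moves along the curve from the upper right corner (everything called altered) to the lower left corner (nothing called).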

25 Exercise Part III: Estimate copy number gain and losses

26 Using segmentation for testing (phenotype association studies)
Example case: Find clones (or whole segments) that differ significantly in copy number between two cancer subtypes.
Task: Investigate whether incorporating spatial information (segmentation) into testing for differential copy number increases detection power.
Data type: Samples with either of 2 different phenotypes (e.g. 2 different cancer subtypes)
How: Comparison of sensitivity and specificity using:
1. Original test statistic (no use of spatial information)
2. Segmented T-statistic derived from original log2 ratios
3. T-statistic computed from segmented log2 ratios

27 Simulation of Array CGH Data
Real biological variation considered:
-Breast cancer data used as model data: segment lengths and copy numbers are taken from the empirical distribution observed in breast cancer data (DNAcopy segmentation).
-Mixture of cells (sample is not pure): each sample was assigned a value, P_t (proportion of tumor cells), drawn from a uniform distribution between 0.3 and 0.7.
-Experimental noise is Gaussian: standard deviations are drawn from a uniform distribution between 0.1 and 0.2 to imitate real data, where the noise may vary between experiments.
-Cancer subtypes are heterogeneous: certain aberrations characteristic of a cancer subtype may exist in only a percentage of the patients with that subtype. Thus, in each sample, segments with copy number alterations (copy number not 2) were removed at random with probability 30%.
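The simulation steps listed above can be sketched for a single sample. This is a sketch under stated assumptions: the segment template below is hypothetical (the lecture drew lengths and copy numbers from breast cancer data), "removed" is interpreted as resetting the segment to copy number 2, and the expected log2 ratio uses a simple diploid-reference mixture formula that the slide does not state explicitly.

```python
import math
import random

def simulate_sample(template, rng):
    """Simulate one array CGH profile.

    template: list of (n_clones, copy_number) segments for the subtype.
    Returns (true_copy_numbers, observed_log2_ratios).
    """
    tumor_fraction = rng.uniform(0.3, 0.7)  # P_t, proportion of tumor cells
    noise_sd = rng.uniform(0.1, 0.2)        # per-experiment Gaussian noise
    truth, observed = [], []
    for length, copies in template:
        # Heterogeneity: an altered segment is absent from this
        # sample with probability 30%.
        if copies != 2 and rng.random() < 0.3:
            copies = 2
        # Assumed mixture model with a diploid reference.
        expected = math.log2((tumor_fraction * copies + (1 - tumor_fraction) * 2) / 2)
        for _ in range(length):
            truth.append(copies)
            observed.append(rng.gauss(expected, noise_sd))
    return truth, observed

# Hypothetical template: normal, gain, normal, loss.
template = [(30, 2), (15, 3), (40, 2), (10, 1)]
truth, observed = simulate_sample(template, random.Random(42))
print(len(observed), sorted(set(truth)))
```

Calling `simulate_sample` repeatedly with the same template gives a set of samples for one class; a second template with different altered segments gives the other class.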

28 Testing samples (original values) 20 samples from either of 2 classes. Red is the true copy number, black dots are simulated values, and circles mark an example of heterogeneity.

29 Testing samples (original values) Red: clones that truly differ

30 Testing: why is multiple testing correction necessary? Standard p-value cutoff for alpha=0.05 => many false positives

31 Testing: why is multiple testing correction necessary? (maximum deviating value) Significance with random class assignments? By chance, many test statistics are below/above standard significance thresholds

32 The maxT Multiple Testing Correction By repeating random class assignment and testing, e.g. 100 times, the following "permutation reference distribution" of the maximum absolute test statistic is obtained (maxT distribution). We wish to control the family-wise error rate (FWER) at alpha=0.05 (5% chance of at least one false positive). Therefore, the cut-off should be such that in only 5% of the random cases we get at least one false positive (95th percentile): cutoff = 5 [Figure labels: standard significance threshold, maxT multiple testing corrected threshold]
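The permutation procedure described on this slide can be sketched as follows. The test statistic here is a Welch t-statistic per clone, which is an assumption (the slide does not name the statistic), and the data layout `data[sample][clone]` and the function names are illustrative.

```python
import math
import random
import statistics

def t_statistic(a, b):
    """Welch t-statistic between two groups of log2 ratios."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se

def clone_statistics(data, labels):
    """One t-statistic per clone; data[sample][clone], labels[sample] in {0, 1}."""
    stats = []
    for clone in range(len(data[0])):
        a = [row[clone] for row, lab in zip(data, labels) if lab == 0]
        b = [row[clone] for row, lab in zip(data, labels) if lab == 1]
        stats.append(t_statistic(a, b))
    return stats

def maxt_cutoff(data, labels, n_perm=100, alpha=0.05, rng=None):
    """(1 - alpha) quantile of the maximum |t| under random class assignment."""
    rng = rng or random.Random(0)
    shuffled = list(labels)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random class assignment
        maxima.append(max(abs(t) for t in clone_statistics(data, shuffled)))
    maxima.sort()
    index = min(n_perm - 1, math.ceil((1 - alpha) * n_perm) - 1)
    return maxima[index]

# Pure-noise example: 20 samples, 30 clones, no true class difference.
rng = random.Random(1)
data = [[rng.gauss(0.0, 0.15) for _ in range(30)] for _ in range(20)]
labels = [0] * 10 + [1] * 10
print(round(maxt_cutoff(data, labels), 2))
```

A clone is then declared significant only if its observed absolute statistic exceeds this cutoff, which is typically well above the per-test threshold for alpha=0.05.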

33 Testing samples (original values) standard p-value cutoff for alpha=0.05; maxT p-value cutoff for alpha=0.05

34 Testing: Segmenting test statistics Reference

35 Testing segmented samples Segmentation of individual samples...

36 Testing segmented samples 2. T-statistic from segmented individual samples... Reference

37 Detecting regions with differential copy number Willenbrock and Fridlyand. Bioinformatics 2005; 21(22):

38 Variation of Simulation Parameters
Signal-to-noise
-CBS consistently has the best performance
-HMM has the highest FDR
-GLAD is the least sensitive
Alternative empirical distributions of segment lengths
-HMM has the highest sensitivity for segment sizes below 10
-CBS has the highest sensitivity for segment sizes of 10 or larger
-GLAD consistently performs the worst
Outlier detection

39 Real Data Example 1: Primary Tumor Data 75 oral squamous cell carcinomas (SCCs) TP53 mutational status of all samples was determined using sequence information (Snijders et al., 2005) Tasks: -Characterize wild-type and mutant samples with respect to their genomic alterations -Build a classifier to predict TP53 mutational status

40 Frequency of Gain/Loss Comparisons Threshold-based: 5% altered. Merge-based: 33% altered.

41 Why such a difference in alteration frequency? The high threshold-based cut-off is due to the high experimental noise of the paraffin-embedded tumors. [Figure: thresholds at +2.5 x MAD and -2.5 x MAD] Willenbrock and Fridlyand. Bioinformatics 2005; 21(22):
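The threshold-based calling referred to here can be sketched as follows. The slide only shows cut-offs at plus and minus 2.5 x MAD; this sketch assumes that the MAD (median absolute deviation) is computed on the values around their median and that each value is compared with the median plus or minus the cut-off. The function names and the toy values are illustrative.

```python
import statistics

def mad(values):
    """Median absolute deviation around the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def call_alterations(values, factor=2.5):
    """Label each value as gain, loss or normal using a factor x MAD cut-off."""
    center = statistics.median(values)
    cutoff = factor * mad(values)
    calls = []
    for v in values:
        if v - center > cutoff:
            calls.append("gain")
        elif center - v > cutoff:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls

values = [0.0, 0.04, -0.04, 0.02, -0.02, 0.8, -0.9, 0.01, -0.01]
print(call_alterations(values))
```

Because the cut-off scales with the noise level, a noisy sample gets a wide normal band, so low-level alterations fall inside it; this is consistent with the lower alteration frequency reported for the threshold-based approach.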

42 Classification results Willenbrock and Fridlyand. Bioinformatics 2005; 21(22):

43 Real Data Example 2: Comparative genomic profiling of several Escherichia coli strains
The microarray design included probes for:
-7 known E. coli strains
-39 known E. coli bacteriophages
-104 known E. coli virulence genes
Experimentally:
-2 sequenced control strains (W3110 and EDL933), 3 replicates
-2 non-sequenced strains (D1 and 3538), 3 replicates
-Bacteriophage: Φ3538 (Δstx2::cat), 2 replicates

44 Comparative Genomic Profiling: challenges
-Ratio problems: some genes might be present in the query strain but not in the known reference strain.
-Single channel microarrays or dual channel microarrays? In this case, we used a custom-made Affymetrix single channel array (NimbleExpress).
-Partly present genes versus similar but different genes.

45 Homology between the 7 E. coli strains included on the microarray Very high similarity between the two K-12 strains and between the two O157:H7 strains. Percentage of homologues for E. coli genomes in columns found in E. coli genomes in rows. Willenbrock et al. Journal of Bacteriology Nov;188(22):

46 BLAST Atlas Willenbrock et al. Journal of Bacteriology Nov;188(22):

47 Hybridization Atlases Probe hybridizations for experiments (samples) result in a similar pattern as expected from the BLAST atlas. Willenbrock et al. Journal of Bacteriology Nov;188(22):

48 Mapping the phage Φ3538 (Δstx2::cat) Willenbrock et al. Journal of Bacteriology Nov;188(22):

49 Zoom of phage Φ3538 (Δstx2::cat) The hybridization pattern is very similar for the phage, strain 3538 and strain D1. Willenbrock et al. Journal of Bacteriology Nov;188(22):

50 Hierarchical Cluster Analysis D1 is very similar to the K-12 type strains (W MG1655).

51 E. coli virulence genes D1 is probably still a commensal strain (a commensal is an organism participating in a symbiotic relationship from which it benefits while the other is unaffected). Willenbrock et al. Journal of Bacteriology Nov;188(22):

52 Summary Comparative genomic profiling of two E. coli strains -0175:H16 D :H Identification of virulence genes and phage elements Conclusions: D1 is similar to the K-12 type strains Characterization of D1 and 3538 genes: -Identification of a number of genes involved in DNA transfer and recombination

53 Advantages over Conventional Expression Arrays
1. Hybridization of DNA to the microarray (DNA is much more stable)
2. Little normalization is necessary
3. Use of spatial coherence in the analysis
4. Only 1 sample is necessary to draw conclusions (biological replicates are still necessary to draw general conclusions regarding a certain biological subtype)
5. Results may be easier to interpret and to correlate with sample phenotypes (e.g. loss of oncogene repressor -> certain cancer subtype)

54 Summary
-Numerous methods have been introduced for segmentation of DNA copy number data and breakpoint identification. It is important to benchmark them against existing methods (however, this is only feasible if the software is publicly available).
-Currently, CBS (DNAcopy package) has the best overall performance.
-Use of spatial dependency in the analysis improves testing power on a clone-by-clone basis.
-Merging of segmentation results improves copy number phenotype characterization.
Study types:
-Study of copy number in cancer samples
-Study of samples from patients with mental diseases
-Comparison of bacterial strains

55 Questions? Exercise Part IV + Bonus exercise: Real data analysis