CGH Data BIOS 691-804. Chromosome Re-arrangements.

CGH Data, BIOS 691-804

Chromosome Re-arrangements

Normal Human Variation

Array CGH Technology

Chromosome 8 (241 genes) in 10 cell lines and many tumor samples

Pre-processing aCGH Data
QA: same as for expression arrays
Normalization
– Are values comparable across arrays?
– Can noise be reduced?
Segmentation
– Where do copy number aberrations start and stop?
– Better estimates of how many copies are present

Normalization
Most copy numbers are 2
Centering necessary
Dynamic range varies
– Mixtures of tumor with normal
Saturation not usually a problem
– Few instances of 10X copy
Dye bias sometimes strong
– loess procedure unreliable

Centering
Where is the center (log ratio 0)?
Sometimes modal copy number is 3
– Variability in labeling and tissue extraction
– CGH can’t give direct measures of counts
Most researchers set modal copy to a log-ratio of 0
Does it matter?
– Take 3 as equivalent to 2 for comparison?
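One way to make the centering step concrete is to estimate the modal log ratio from a histogram and subtract it. The Python sketch below is illustrative only: the function name `center_at_mode`, the bin count, and the choice of a histogram mode (rather than, say, a kernel density estimate) are assumptions, not something the slides specify.

```python
import numpy as np

def center_at_mode(log_ratios, bins=50):
    """Shift log2 ratios so that the most populated histogram bin sits at 0.

    Hypothetical helper: `bins` is an arbitrary tuning choice.
    Returns the centered values and the estimated modal log ratio.
    """
    x = np.asarray(log_ratios, dtype=float)
    counts, edges = np.histogram(x, bins=bins)
    k = int(np.argmax(counts))               # index of the fullest bin
    mode = 0.5 * (edges[k] + edges[k + 1])   # midpoint of that bin
    return x - mode, mode
```

After centering, a log ratio of 0 corresponds to the modal copy number of the sample, whatever that number actually is.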

Dynamic Range
Signal ratios are often smaller (sometimes much smaller) than the actual copy-number ratios between samples
From Bilke et al., Bioinformatics, 2005

Fractional Copy Numbers
Often samples are mixtures of tumor and normal
Many tumors have two (or more) distinct clones with distinct karyotypes
Observed copy numbers may lie in between values corresponding to whole numbers
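A small worked example shows why mixtures give fractional values. The sketch assumes the signal is a simple linear mixture of tumor and normal DNA and ignores the dynamic-range compression described earlier. The function name and parameters are hypothetical.

```python
import math

def expected_log2_ratio(tumor_copies, tumor_fraction, normal_copies=2):
    """Expected log2 ratio for a tumor/normal mixture (idealized, no compression).

    tumor_fraction is the fraction of cells in the sample that are tumor.
    A mean copy number of 0 (pure homozygous deletion) raises ValueError in math.log2.
    """
    mean_copies = tumor_fraction * tumor_copies + (1 - tumor_fraction) * normal_copies
    return math.log2(mean_copies / normal_copies)

# A single-copy gain (3 copies) in a pure tumor versus a 50% mixture:
pure = expected_log2_ratio(3, 1.0)    # log2(3/2)
mixed = expected_log2_ratio(3, 0.5)   # log2(2.5/2), i.e. an apparent 2.5 copies
```

Under these assumptions, a 50% tumor sample with a one-copy gain looks like 2.5 copies, which is why observed values fall between the whole-number levels.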

Probe Bias
If errors were random, a plot of self-vs-self ratios should show no structure
Actual correlation > 60%: clear bias!
Try to estimate it
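A minimal sketch of one way to estimate probe bias, assuming several self-vs-self hybridizations are available as rows of a matrix. The per-probe median used here is an assumption chosen for robustness, not necessarily the estimator the author had in mind. Both function names are hypothetical.

```python
import numpy as np

def replicate_correlation(a, b):
    """Pearson correlation of log ratios between two self-vs-self arrays.

    With purely random errors this should be close to 0.
    """
    return float(np.corrcoef(a, b)[0, 1])

def estimate_probe_bias(self_self):
    """Per-probe bias estimate: median log ratio across self-vs-self arrays.

    self_self has shape (n_arrays, n_probes).
    """
    m = np.asarray(self_self, dtype=float)
    return np.median(m, axis=0)
```

The estimated bias vector could then be subtracted from the log ratios of test arrays run on the same platform, provided the bias is stable across hybridizations.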

Segmentation
Individual probe values are noisy
Most aberrations are segments
Most segments have many probes
Average neighboring probe values to better estimate the segment value; but how far should the averaging extend?
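Averaging neighboring probes can be sketched as a centered running mean. The window half-width is the "how far?" parameter. The function name is hypothetical, and probes are assumed to be ordered by genomic position within one chromosome.

```python
import numpy as np

def running_mean(values, half_width):
    """Centered running mean; windows are truncated at the chromosome ends."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    csum = np.concatenate(([0.0], np.cumsum(x)))   # csum[k] = sum of x[:k]
    idx = np.arange(n)
    lo = np.maximum(idx - half_width, 0)           # window start (inclusive)
    hi = np.minimum(idx + half_width + 1, n)       # window end (exclusive)
    return (csum[hi] - csum[lo]) / (hi - lo)
```

A wide window reduces noise but blurs the segment boundaries, which is the trade-off that motivates explicit change-point methods.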

Segmentation Issues:
1. How to identify where a segment starts or stops
2. How to find these points efficiently

Noise and Signal

How to Find Segments?
Could be a large copy number change over a short interval or a small change over a large one
Look for jumps in running averages
Distribution of jumps between probes
DNAcopy is a maximum likelihood estimate of change points, using all intervals
StepGram is an efficient computation of (a subset of) t-scores
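Looking for jumps in running averages can be illustrated with a two-sample t-score between the windows immediately left and right of each probe. This is a generic sketch under the assumption of a fixed window width `w`. It is not the DNAcopy or StepGram procedure, and the function name is hypothetical.

```python
import numpy as np

def jump_t_scores(values, w):
    """t-score of (mean of next w probes) minus (mean of previous w probes).

    scores[i] compares x[i-w:i] with x[i:i+w]; positions without two full
    windows are left as NaN.
    """
    if w < 2:
        raise ValueError("w must be at least 2 to estimate a variance")
    x = np.asarray(values, dtype=float)
    n = len(x)
    scores = np.full(n, np.nan)
    for i in range(w, n - w + 1):
        left, right = x[i - w:i], x[i:i + w]
        pooled_var = 0.5 * (left.var(ddof=1) + right.var(ddof=1))
        se = np.sqrt(pooled_var * 2.0 / w)       # standard error of the difference
        if se > 0:
            scores[i] = (right.mean() - left.mean()) / se
    return scores
```

A single fixed window cannot be sensitive to both short, large changes and long, small ones, which is why the methods that follow consider many interval lengths.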

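Theory
Classical change-point test statistic
– Let $X_1, \dots, X_n$ be values; let $S_i = X_1 + \dots + X_i$ be partial sums
– Set $Z_B = \max_{1 \le i < n} |Z_i|$, where $Z_i = \left(\frac{1}{i} + \frac{1}{n-i}\right)^{-1/2} \left(\frac{S_i}{i} - \frac{S_n - S_i}{n-i}\right)$
– The $Z_i$ are the (scaled) differences in mean level before and after $i$
Now for segments ‘in the middle’
– Let $Z_C = \max_{1 \le i < j \le n} |Z_{ij}|$, where $Z_{ij} = \left(\frac{1}{j-i} + \frac{1}{n-j+i}\right)^{-1/2} \left(\frac{S_j - S_i}{j-i} - \frac{S_n - S_j + S_i}{n-j+i}\right)$
This is “Circular Binary Segmentation”
Implemented in DNAcopy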
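The two change-point statistics can be computed directly from partial sums. The sketch below is a brute-force illustration, assuming the noise has unit variance (in practice the values would be divided by an estimate of the noise standard deviation). It is not the DNAcopy code, and the function names are hypothetical.

```python
import math
import numpy as np

def binary_statistic(x):
    """Max over i of |Z_i|, comparing the mean of x[:i] with the mean of x[i:].

    Returns (statistic, i) with 1 <= i < n.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.cumsum(x)
    i = np.arange(1, n)
    left_sum = s[i - 1]                          # S_i = X_1 + ... + X_i
    z = (left_sum / i - (s[-1] - left_sum) / (n - i)) / np.sqrt(1.0 / i + 1.0 / (n - i))
    k = int(np.argmax(np.abs(z)))
    return float(abs(z[k])), int(i[k])

def circular_statistic(x):
    """Max over 1 <= i < j <= n of |Z_ij|, comparing the arc x[i:j] with the rest.

    Returns (statistic, i, j). Brute force, O(n^2).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.concatenate(([0.0], np.cumsum(x)))    # s[k] = S_k
    best = (0.0, 0, n)
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            k = j - i                            # number of probes inside the arc
            inside = s[j] - s[i]
            z = abs(inside / k - (s[n] - inside) / (n - k)) / math.sqrt(1.0 / k + 1.0 / (n - k))
            if z > best[0]:
                best = (z, i, j)
    return best
```

The circular form treats the chromosome as a ring, so an aberration in the interior is tested as one arc against everything outside it instead of as two separate change points.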

DNAcopy
In Bioconductor
Does ML identification of segments recursively
– Apply procedure within identified segments
Double-checks points near the boundary
Does permutation testing to estimate null distribution
– Often data are not Normal
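The recursive procedure with a permutation null can be sketched as follows. This is a simplified illustration, not the DNAcopy implementation: it omits the boundary double-check and any speed-ups, and all function names, the significance level, and the minimum segment length are assumptions.

```python
import math
import numpy as np

def max_arc(x):
    """Return (|Z|, i, j) for the arc x[i:j] that differs most from the rest."""
    n = len(x)
    s = np.concatenate(([0.0], np.cumsum(x)))
    best = (0.0, 0, n)
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            k = j - i
            inside = s[j] - s[i]
            z = abs(inside / k - (s[n] - inside) / (n - k)) / math.sqrt(1.0 / k + 1.0 / (n - k))
            if z > best[0]:
                best = (z, i, j)
    return best

def permutation_p_value(x, n_perm, rng):
    """Fraction of shuffled copies whose max statistic reaches the observed one."""
    observed = max_arc(x)[0]
    exceed = sum(max_arc(rng.permutation(x))[0] >= observed for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

def segment(x, start=0, alpha=0.05, n_perm=200, min_len=4, rng=None):
    """Recursively split x; returns sorted breakpoint indices into the full array."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x, dtype=float)
    if len(x) < min_len:
        return []
    _, i, j = max_arc(x)
    if permutation_p_value(x, n_perm, rng) > alpha:
        return []                                # no significant change in this piece
    cuts = [start + i] + ([start + j] if j < len(x) else [])
    # Apply the same procedure within each identified piece.
    for a, b in ((0, i), (i, j), (j, len(x))):
        cuts += segment(x[a:b], start + a, alpha, n_perm, min_len, rng)
    return sorted(set(cuts))
```

Shuffling the probe values destroys any positional structure while keeping their empirical distribution, so the null does not rely on Normality. The brute-force search makes each permutation O(n^2), which illustrates why this kind of procedure is slow on long chromosomes.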

StepGram
DNAcopy is slow!
Could try to compute only a fraction of possible scores
StepGram tries to find a subset of the most likely scores to compute
Much faster!
Some inaccuracies
Doesn’t handle chromosome ends well

StepGram – Method 1
Key idea: don’t compute all possible t-scores
Compute only those likely to show significant change
Bound the estimated future t-scores based on current t-scores

StepGram – Algorithm 2