CGH Data BIOS
Chromosome Re-arrangements
Normal Human Variation
Array CGH Technology
Chromosome 8 (241 genes) in 10 cell lines and many tumor samples
Pre-processing Array CGH Data
  QA: same as for expression data
  Normalization
    – Are values comparable across arrays?
    – Can noise be reduced?
  Segmentation
    – Where do copy number aberrations start and stop?
    – Better estimates of how many copies
Normalization
  Most copy numbers are 2
  Centering necessary
  Dynamic range varies
    – Mixtures of tumor with normal
  Saturation not usually a problem
    – Few instances of 10X copy
  Dye bias sometimes strong
    – loess procedure unreliable
Centering
  Where is the center (log ratio 0)?
  Sometimes modal copy number is 3
    – Variability in labeling and tissue extraction
    – CGH can’t give direct measures of counts
  Most researchers set modal copy to log-ratio of 0
  Does it matter?
    – Take 3 as equivalent to 2 for comparison?
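One way to set the modal copy number to a log-ratio of 0 is sketched below. It assumes the mode is estimated from a simple histogram of probe log-ratios; the function name `center_on_mode` and the bin width are illustrative choices, not something the slides specify.

```python
from collections import Counter


def center_on_mode(log_ratios, bin_width=0.05):
    """Shift log-ratios so that the most populated histogram bin sits at 0.

    Assumption: the modal copy number corresponds to the fullest bin.
    Returns the centered values and the estimated center.
    """
    # Count probes per histogram bin (bin index = value / bin_width, rounded).
    bins = Counter(round(x / bin_width) for x in log_ratios)
    modal_bin = max(bins, key=bins.get)
    # Use the mean of the probes in the modal bin as the center estimate.
    in_mode = [x for x in log_ratios if round(x / bin_width) == modal_bin]
    center = sum(in_mode) / len(in_mode)
    return [x - center for x in log_ratios], center


# Most probes sit near 0.3, so 0.3 (not 0) is treated as the baseline.
values = [0.30, 0.31, 0.29, 0.30, 0.32, 0.88, 0.90, -0.25]
centered, center = center_on_mode(values)
```

A histogram mode is crude for small probe counts; a kernel density estimate or a median over a trusted region would be alternative choices.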
Dynamic Range
  Ratios of signal are often less (sometimes much less) than actual ratios of copy numbers between samples
  From Bilke et al., Bioinformatics, 2005
Fractional Copy Numbers
  Often samples are mixtures of tumor and normal
  Many tumors have two (or more) distinct clones with distinct karyotypes
  Observed copy numbers may lie in between values corresponding to whole numbers
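To illustrate why mixtures give in-between values, here is a small sketch of the expected log2 ratio for a tumor/normal mixture. It assumes an idealized linear signal response (which, per the dynamic range slide, real arrays do not fully achieve) and a diploid normal component; `expected_log_ratio` and its parameters are hypothetical names.

```python
import math


def expected_log_ratio(tumor_copies, tumor_fraction, normal_copies=2):
    """Expected log2 ratio against a normal reference for a mixed sample.

    Assumption: signal is proportional to the average copy number in the
    sample. Raises ValueError if the average copy number is 0.
    """
    mixed = tumor_fraction * tumor_copies + (1 - tumor_fraction) * normal_copies
    return math.log2(mixed / normal_copies)


# A single-copy gain (3 copies) in a pure tumor: log2(1.5), about 0.585.
pure = expected_log_ratio(3, 1.0)
# The same gain in a 50% tumor sample: log2(1.25), about 0.322.
mixed = expected_log_ratio(3, 0.5)
```

Under this model a gain in a half-normal sample looks like a copy number of 2.5, which is why observed values need not correspond to whole numbers.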
Probe Bias
  If errors are random, then a plot of self vs self ratios should look random
  Actual correlation > 60%
  Clear bias!
  Try to estimate it
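The slides do not say how the bias is estimated. One simple possibility, shown here purely as an assumption, is to take the per-probe median log-ratio across several self-vs-self hybridizations and subtract it from each test array. The function names are illustrative.

```python
from statistics import median


def estimate_probe_bias(self_self_arrays):
    """Per-probe bias: median log-ratio across self-vs-self arrays.

    self_self_arrays is a list of arrays, each a list of log-ratios in the
    same probe order. With no true copy number difference, any systematic
    per-probe offset is attributed to probe bias (an assumption).
    """
    n_probes = len(self_self_arrays[0])
    return [median(arr[i] for arr in self_self_arrays) for i in range(n_probes)]


def remove_probe_bias(log_ratios, bias):
    """Subtract the estimated per-probe bias from one array."""
    return [x - b for x, b in zip(log_ratios, bias)]


normals = [
    [0.10, -0.20, 0.00],
    [0.12, -0.18, 0.02],
    [0.08, -0.22, -0.02],
]
bias = estimate_probe_bias(normals)
corrected = remove_probe_bias([0.70, -0.20, 0.00], bias)
```

The median is used rather than the mean so that one unusual reference array does not dominate the estimate.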
Segmentation
  Individual probe values are noisy
  Most aberrations are segments
  Most segments have many probes
  Average neighboring probe values to better estimate segment value
    – How far?
Segmentation Issues:
  1. How to identify where a segment starts or stops
  2. How to find these points efficiently
Noise and Signal
How to Find Segments?
  Could be a large copy number change over a short interval or a small change over a long one
  Look for jumps in running averages
  Distribution of jumps between probes
  DNACopy is a Maximum Likelihood estimate of change points, using all intervals
  StepGram is an efficient computation of (a subset of) t-scores
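Looking for jumps in running averages can be sketched as a two-sample t-score between the window of probes to the left of each position and the window to the right. This is a generic illustration with a fixed, arbitrarily chosen window size; it is not the DNACopy or StepGram procedure.

```python
import math
from statistics import mean, variance


def jump_t_scores(values, window=4):
    """t-score of (right-window mean - left-window mean) at each position i.

    Position i means a candidate change between probes i-1 and i.
    Uses a pooled variance over two equal-sized windows.
    """
    scores = {}
    for i in range(window, len(values) - window + 1):
        left = values[i - window:i]
        right = values[i:i + window]
        diff = mean(right) - mean(left)
        pooled = (variance(left) + variance(right)) / 2
        se = math.sqrt(2 * pooled / window)
        if se == 0:
            # Degenerate case: no noise inside either window.
            scores[i] = 0.0 if diff == 0 else math.copysign(math.inf, diff)
        else:
            scores[i] = diff / se
    return scores


probes = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1,
          1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.1, 0.9]
scores = jump_t_scores(probes, window=4)
best = max(scores, key=lambda i: abs(scores[i]))  # position of the largest jump
```

A short window detects large, short aberrations; a long window is needed for small changes over long intervals, which is why a single fixed window is a limitation of this sketch.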
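Theory
  Classical change-point test statistic
    – Let $X_1, \dots, X_n$ be values; let $S_i = X_1 + \dots + X_i$ be partial sums
    – Set $Z_B = \max_{1 \le i < n} |Z_i|$, where $Z_i = \left(\frac{1}{i} + \frac{1}{n-i}\right)^{-1/2} \left(\frac{S_i}{i} - \frac{S_n - S_i}{n-i}\right)$
    – $\frac{S_i}{i} - \frac{S_n - S_i}{n-i}$ are the differences in levels before and after $i$
  Now for segments ‘in middle’
    – Let $Z_C = \max_{1 \le i < j \le n} |Z_{ij}|$, where $Z_{ij} = \left(\frac{1}{j-i} + \frac{1}{n-j+i}\right)^{-1/2} \left(\frac{S_j - S_i}{j-i} - \frac{S_n - S_j + S_i}{n-j+i}\right)$
  This is “Circular Binary Segmentation”
  Implemented in DNACopy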
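A brute-force sketch of the two statistics may help make them concrete. It assumes the standard form from the Circular Binary Segmentation literature: a scaled difference between the mean level on one side of a split (or inside an arc) and the mean level on the other side (or outside the arc), computed from partial sums. The values are not divided by a noise standard deviation here, and the function names are illustrative.

```python
import math
from itertools import accumulate


def binary_statistic(x):
    """Return (i, Z_i) maximizing |Z_i| over single change points 1 <= i < n."""
    n = len(x)
    s = [0.0] + list(accumulate(x))  # s[i] = x[0] + ... + x[i-1]
    best_i, best_z = None, 0.0
    for i in range(1, n):
        level_diff = s[i] / i - (s[n] - s[i]) / (n - i)
        z = level_diff / math.sqrt(1 / i + 1 / (n - i))
        if abs(z) > abs(best_z):
            best_i, best_z = i, z
    return best_i, best_z


def circular_statistic(x):
    """Return (i, j, Z_ij) maximizing |Z_ij| over arcs with 1 <= i < j <= n.

    The arc covers elements i+1..j (1-based); O(n^2) pairs are tried.
    """
    n = len(x)
    s = [0.0] + list(accumulate(x))
    best = (None, None, 0.0)
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            k = j - i
            inside = (s[j] - s[i]) / k
            outside = (s[n] - s[j] + s[i]) / (n - k)
            z = (inside - outside) / math.sqrt(1 / k + 1 / (n - k))
            if abs(z) > abs(best[2]):
                best = (i, j, z)
    return best


# A 5-probe gain in the middle of 25 probes is found as the arc (10, 15].
signal = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
i, j, z = circular_statistic(signal)
```

The quadratic number of (i, j) pairs in `circular_statistic` is the cost that motivates faster alternatives.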
DNACopy
  In Bioconductor
  Does ML identification of segments recursively
    – Apply procedure within identified segments
  Double-checks points near the boundary
  Does permutation testing to estimate null distribution
    – Often data are not Normal
StepGram
  DNACopy is slow!
  Could try to compute only a fraction of possible scores
  StepGram tries to find a subset of most likely scores to compute
  Much faster!
  Some inaccuracies
  Doesn’t handle chromosome ends well
StepGram – Method 1
  Key idea: don’t compute all possible t-scores
  Compute only those likely to show significant change
  Bound the t-scores still to be computed using the t-scores computed so far
StepGram – Algorithm 2