A coalescent computational platform to predict strength of association for clinical samples Gabor T. Marth Department of Biology, Boston College Genomic studies and the HapMap March 15-18, 2005 Oxford, United Kingdom
Focal questions about the HapMap: 1. Required marker density 2. How to quantify the strength of allelic association in a genome region 3. How to choose tagging SNPs 4. How general the answers to these questions are among different human populations (figure panels: CEPH European samples; Yoruban samples)
Across samples from a single population? (random 60-chromosome subsets of 120 CEPH chromosomes from 60 independent individuals)
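The subsetting experiment on this slide can be sketched in a few lines. This is a minimal illustration, not the analysis behind the slide: it assumes haplotypes are coded as tuples of 0/1 alleles, uses r² as the association measure (the slide does not say which statistic was plotted), and runs on made-up toy data. The names `r_squared` and `subset_r2` are hypothetical.

```python
import random

def r_squared(haps, i, j):
    """r^2 between biallelic sites i and j; haps is a list of 0/1 tuples."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(h[i] * h[j] for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return float("nan")  # one of the sites is monomorphic in this sample
    d = p_ij - p_i * p_j     # linkage disequilibrium coefficient D
    return d * d / denom

def subset_r2(haps, i, j, subset_size=60, n_subsets=100, seed=0):
    """r^2 recomputed on random chromosome subsets drawn without replacement."""
    rng = random.Random(seed)
    return [r_squared(rng.sample(haps, subset_size), i, j)
            for _ in range(n_subsets)]

# Toy data (an assumption, not HapMap data): 120 chromosomes, 2 sites.
toy_haps = [(0, 0)] * 60 + [(1, 1)] * 60
values = subset_r2(toy_haps, 0, 1)
```

The spread of `values` across subsets is the sample-to-sample variability the slide asks about; on real data it would be inspected for every marker pair of interest.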
Consequence for marker performance: Markers selected based on the allele structure of the HapMap reference samples may not work well in another set of samples, such as those used for a clinical study.
How to assess sample-to-sample variability? 1. Understanding intrinsic properties of a given genome region, e.g. estimating local recombination rate from the HapMap data (McVean et al. Science) 2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly 3. It would be a desirable alternative to generate such additional sets with computational means
Towards a marker selection tool: 1. select markers (tag SNPs) with standard methods 2. generate computational samples for this genome region 3. test the performance of markers across consecutive sets of computational samples
Generating additional computational haplotypes: 1. Generate a pair of haplotype sets with Coalescent genealogies. This “models” that the two sets are “related” to each other by being drawn from a single population. 2. Only accept the pair if the first set reproduces the observed haplotype structure of the HapMap reference samples. This enforces relevance to the observed genotype data in the specific region. Calculate the data likelihood (the probability that the genealogy does produce the observed haplotypes). 3. Use the second haplotype set induced by the same mutations as our computational samples. 4. In subsequent statistics, weight each such set proportional to the data likelihood calculated in step 2.
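The accept-and-weight logic of this slide can be sketched as follows. This is only an illustration of the control flow: `toy_simulate_pair` is a hypothetical stand-in that draws both sets from one shared haplotype frequency and is NOT a coalescent simulator, and the binomial "likelihood" it returns is an assumed toy analogue of the data likelihood named on the slide.

```python
import math
import random
from collections import Counter

def toy_simulate_pair(rng, observed):
    """Hypothetical stand-in for a coalescent run (NOT a real coalescent).

    Both sets share one population frequency p of haplotype "11", which
    plays the role of the shared genealogy and mutations.
    """
    n = len(observed)
    p = rng.random()
    first = tuple("11" if rng.random() < p else "00" for _ in range(n))
    second = tuple("11" if rng.random() < p else "00" for _ in range(n))
    k = Counter(observed)["11"]
    # Toy data likelihood: probability of the observed counts given p.
    likelihood = math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return first, second, likelihood

def computational_samples(observed, n_tries, rng):
    """Keep the second set of every pair whose first set matches the data."""
    target = Counter(observed)
    kept = []
    for _ in range(n_tries):
        first, second, likelihood = toy_simulate_pair(rng, observed)
        if Counter(first) == target:           # accept only matching pairs
            kept.append((second, likelihood))  # second set + its weight
    return kept

def weighted_mean(statistic, kept):
    """Average a statistic over kept sets, weighted by data likelihood."""
    total = sum(w for _, w in kept)
    return sum(w * statistic(s) for s, w in kept) / total

observed = ("00", "00", "11", "11")  # toy "reference" haplotypes
kept = computational_samples(observed, 2000, random.Random(1))
freq_11 = weighted_mean(lambda s: s.count("11") / len(s), kept)
```

With a real coalescent simulator in place of the toy function, most runs would be rejected, which is the efficiency problem raised on the next slide.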
Generating computational samples. Problem: The efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling), efficient generation of such genealogies is an unsolved problem. We are developing a method to generate “approximate” M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K). Motivation comes from composite likelihood approaches to recombination rate estimation by Hudson, Clark, Wall, and others.
Approximating M-site haplotypes as composites of overlapping K-site haplotypes: 1. generate K-site sets 2. build M-site composites
Piecing together neighboring K-site sets: the hope is that the constraint at overlapping markers preserves long-range marker association
Building composite haplotypes: A composite haplotype is built from a complete path through the (M-K+1) K-sites.
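One way to picture a path through the (M-K+1) K-sites is the sketch below. It assumes that neighboring K-site haplotypes must agree on their K-1 shared markers and that a path is chosen uniformly at random among compatible continuations; the talk does not spell out the selection rule, so both are assumptions, and `build_composite` is a hypothetical name.

```python
import random

def build_composite(window_haps, k, rng):
    """Chain overlapping K-site haplotypes into one M-site haplotype.

    window_haps[w] lists the K-site haplotypes (strings) available for
    sites w .. w+k-1, so there are M-K+1 windows for M markers.
    Returns None if the path hits a dead end.
    """
    if k < 2:
        raise ValueError("k must be at least 2 so that windows overlap")
    composite = rng.choice(window_haps[0])
    for haps in window_haps[1:]:
        overlap = composite[-(k - 1):]           # last K-1 alleles so far
        candidates = [h for h in haps if h[:k - 1] == overlap]
        if not candidates:
            return None                          # no compatible continuation
        composite += rng.choice(candidates)[-1]  # extend by one new site
    return composite

# Toy example (made-up alleles): M = 5 markers, K = 3, so 3 windows.
windows = [["010", "110"], ["101", "100"], ["011", "001"]]
composite = build_composite(windows, 3, random.Random(0))
```

Repeating the walk once per chromosome would give one composite haplotype set; the overlap check is what carries association from one window to the next.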
Initial results: 3-site composite haplotypes (figure: a typical 3-site composite; data: 30 CEPH HapMap reference individuals (60 chr); Hinds et al. Science, 2005)
3-site composite vs. data
3-site composites: the “best case” (figure panels: “short-range”, “long-range”; step shown: 1. generate K-site sets)
Variability across sets: The purpose of the composite haplotype sets is to model sample variance across consecutive data sets. But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-sites are used.
4-site composite haplotypes (figure: a 4-site composite)
“Best-case” 4-site composites: composite of exact 4-site sub-haplotypes
Variability across 4-site composites is comparable to the variability across data sets.
Software engineering aspects: efficiency. To do larger-scale testing we must first improve the efficiency of generating composite sets. Currently, we start fresh Coalescent runs at each K-site (several hours per region). The total number of genotyped SNPs is ~ 1 million, so there are ~ 1 million different K-sites to match. Any given Coalescent genealogy is likely to match one or more of these. Computational haplotype sets can be databased efficiently: 4 HapMap populations x 1 million K-sites x 1,000 computational sets x 50 bytes < 200 Gigabytes
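The storage estimate on this slide is a straight product, spelled out here as a quick check. The variable names are illustrative; the four factors are the ones given on the slide.

```python
# Factors taken from the slide.
populations = 4            # HapMap populations
k_sites = 1_000_000        # one K-site per genotyped SNP (~1 million)
sets_per_k_site = 1_000    # computational sets stored per K-site
bytes_per_set = 50         # storage per computational set

total_bytes = populations * k_sites * sets_per_k_site * bytes_per_set
total_gb = total_bytes / 10**9    # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes
```

The product is 2 x 10^11 bytes, which is 200 GB in decimal units and roughly 186 GiB in binary units, in line with the slide's figure.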
Technical/algorithmic improvements: 1. un-phased genotypes, e.g. genotype (AC)(CG)(AT)(CT) vs. haplotypes A G A C / C C T T 2. markers with unknown ancestral state (AC ?) 3. dealing with uninformative markers; also taking into account local recombination rate
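To illustrate why un-phased genotypes complicate matching, the sketch below enumerates every haplotype pair compatible with the slide's example genotype (AC)(CG)(AT)(CT). It is a brute-force illustration only, not the method planned in the talk; `compatible_haplotype_pairs` is a hypothetical name, and each site is assumed to be given as a two-letter string of its alleles.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an un-phased genotype.

    genotype is a list of two-letter strings, one per site, e.g. "AC"
    for a heterozygous A/C site or "AA" for a homozygous site.
    """
    pairs = set()
    for picks in product((0, 1), repeat=len(genotype)):
        # The first haplotype takes the picked allele, the second the other.
        h1 = "".join(site[p] for site, p in zip(genotype, picks))
        h2 = "".join(site[1 - p] for site, p in zip(genotype, picks))
        pairs.add(tuple(sorted((h1, h2))))  # unordered: (h1, h2) == (h2, h1)
    return sorted(pairs)

pairs = compatible_haplotype_pairs(["AC", "CG", "AT", "CT"])
```

With four heterozygous sites there are 2^3 = 8 distinct phasings, one of which is the pair AGAC / CCTT; the count doubles with every additional heterozygous site.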
Acknowledgements: Eric Tsung, Aaron Quinlan, Ike Unsal, Eva Czabarka (Dept. Mathematics, William & Mary)