A coalescent computational platform to predict strength of association for clinical samples Gabor T. Marth Department of Biology, Boston College

Presentation transcript:

A coalescent computational platform to predict strength of association for clinical samples Gabor T. Marth Department of Biology, Boston College MIT Bioinformatics Seminar April 25, 2005 Cambridge, Massachusetts, USA

Sequence variations
Cause inherited diseases.
Allow tracking of ancestral human history.

Automated polymorphism discovery tools (Marth et al., Nature Genetics 1999)

Genome-scale sequence variation resources (Sachidanandam et al., Nature 2001)
[Figure: ~10 million; labels: EST, WGS, BAC, genome reference]

How to use markers to find disease?
Problem: genotyping cost precludes using millions of markers simultaneously for an association study.
[Figure label: genome-wide, dense SNP marker map]
Depends on the patterns of allelic association in the human genome.
Question: how to select, from all available markers, a subset that captures most mapping information (marker selection, marker prioritization).

Allelic association
Allelic association is the non-random assortment between alleles, i.e. it measures how well knowledge of the allele state at one site permits prediction at another site.
[Figure labels: marker site, functional site]
By necessity, the strength of allelic association is measured between markers.
Significant allelic association between a marker and a functional site permits localization (mapping) even without having the functional site in our collection.
There are pair-wise and multi-locus measures of association.

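Pair-wise: linkage disequilibrium (LD)
LD measures the deviation from random assortment of the alleles at a pair of polymorphic sites: $D = f(AB) - f(A)\,f(B)$, where $f(AB)$ is the frequency of the haplotype carrying allele A at the first site and allele B at the second, and $f(A)$, $f(B)$ are the corresponding allele frequencies.
Other measures of LD are derived from $D$, e.g. by normalizing according to allele frequencies ($r^2$).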
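The pair-wise measure can be illustrated with a minimal sketch. It assumes phased two-site haplotypes given as 2-character strings and uses the common normalization r^2 = D^2 / (p(1-p) q(1-q)); the function name and input format are illustrative, not part of the talk.

```python
def pairwise_ld(haplotypes):
    """Return (D, r2) for two biallelic sites.

    haplotypes: list of 2-character strings, one per chromosome
    (hypothetical input format chosen for this sketch).
    """
    n = len(haplotypes)
    # Use the alleles of the first haplotype as the reference alleles A and B.
    a, b = haplotypes[0][0], haplotypes[0][1]
    f_a = sum(h[0] == a for h in haplotypes) / n
    f_b = sum(h[1] == b for h in haplotypes) / n
    f_ab = sum(h == a + b for h in haplotypes) / n
    d = f_ab - f_a * f_b  # deviation from random assortment
    denom = f_a * (1 - f_a) * f_b * (1 - f_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


if __name__ == "__main__":
    print(pairwise_ld(["AC", "AC", "GT", "GT"]))  # complete association
    print(pairwise_ld(["AC", "AT", "GC", "GT"]))  # random assortment
```

With complete association the two sites predict each other perfectly (r^2 = 1); with random assortment D is zero.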

Multi-marker: haplotype diversity
The most useful multi-marker measures of association are related to haplotype diversity.
n markers, random assortment of alleles at different sites: $2^n$ possible haplotypes.
Strong association: most chromosomes carry one of a few common haplotypes (reduced haplotype diversity).
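A small sketch of how haplotype diversity might be summarized for a sample. The summary 1 - sum(p_i^2) is one common choice and is an assumption here, as are the function name and the string encoding of haplotypes.

```python
from collections import Counter


def haplotype_summary(haplotypes):
    """Summarize haplotype diversity for equal-length biallelic haplotype strings."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    freqs = [c / n for c in counts.values()]
    return {
        "observed": len(counts),      # distinct haplotypes seen in the sample
        "possible": 2 ** n_sites,     # upper bound under random assortment
        "diversity": 1.0 - sum(p * p for p in freqs),
    }


if __name__ == "__main__":
    # Strong association: only two of the 16 possible 4-site haplotypes occur.
    sample = ["0000"] * 6 + ["1111"] * 4
    print(haplotype_summary(sample))
```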

The main determinants of allelic association
Recombination: breaks down allelic association by “randomizing” allele combinations.
Demographic history of effective population size: bottlenecks increase allelic association by non-uniform re-sampling of allele combinations (haplotypes).
[Figure label: bottleneck]

Block-like patterns in the human genome (Wall & Pritchard, Nature Rev Gen 2003; Daly et al., Nature Genetics 2001)

The promise for medical genetics
[Figure: example haplotypes CACTACCGA, CACGACTAT, TTGGCGTAT]
Within blocks, a small number of SNPs are sufficient to distinguish the few common haplotypes → significant marker reduction is possible.
If the block structure is a general feature of human variation structure, whole-genome association studies will be possible at a reduced genotyping cost.
This motivated the HapMap project (Gibbs et al., Nature 2003).

The HapMap initiative
Goal: to map out human allele and association structure at the kilobase scale.
Deliverables: a set of physical and informational reagents.

HapMap physical reagents
Reference samples: 4 world populations, ~100 independent chromosomes from each.
Markers: millions of SNPs from the US public variation database (dbSNP).
Genotypes: high-accuracy assays using various platforms; fast public data release.

Informational: HapMap annotations (“HaploView”, Daly lab)

Focal questions about the HapMap
1. Required marker density
2. How to quantify the strength of allelic association in a genome region
3. How to choose tagging SNPs
4. How general the answers to these questions are among different human populations
[Figure labels: CEPH European samples, Yoruban samples]

Across samples from a single population? (random 60-chromosome subsets of 120 CEPH chromosomes from 60 independent individuals)

Consequence for marker performance Markers selected based on the allele structure of the HapMap reference samples… … may not work well in another set of samples such as those used for a clinical study.

How to assess sample-to-sample variability?
1. Understanding intrinsic properties of a given genome region, e.g. estimating local recombination rate from the HapMap data (McVean et al., Science)
2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly
3. It would be a desirable alternative to generate such additional sets with computational means

Towards a marker selection tool
1. Select markers (tag SNPs) with standard methods
2. Generate computational samples for this genome region
3. Test the performance of markers across consecutive sets of computational samples
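Steps 1 and 3 of this workflow could look roughly like the sketch below. It uses a greedy r^2-threshold tagging rule as a stand-in for the "standard methods" mentioned on the slide (the talk does not say which method is used), and all function names, the 0/1 string encoding, and the 0.8 threshold are assumptions.

```python
def columns(haplotypes):
    """Turn a list of 0/1 haplotype strings into per-site allele columns."""
    return [[int(h[s]) for h in haplotypes] for s in range(len(haplotypes[0]))]


def r2(col_i, col_j):
    """Squared correlation between two biallelic sites (0.0 if either is monomorphic)."""
    n = len(col_i)
    p = sum(col_i) / n
    q = sum(col_j) / n
    pq = sum(a * b for a, b in zip(col_i, col_j)) / n
    denom = p * (1 - p) * q * (1 - q)
    return 0.0 if denom == 0 else (pq - p * q) ** 2 / denom


def greedy_tags(haplotypes, threshold=0.8):
    """Step 1: pick tag SNPs greedily until every site is tagged at r2 >= threshold."""
    cols = columns(haplotypes)
    sites = range(len(cols))
    covers = {i: {j for j in sites if i == j or r2(cols[i], cols[j]) >= threshold}
              for i in sites}
    untagged, tags = set(sites), []
    while untagged:
        # Choose the untagged site that covers the most still-untagged sites.
        best = max(sorted(untagged), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


def fraction_tagged(tags, haplotypes, threshold=0.8):
    """Step 3: fraction of sites still tagged when the same tags are used in another sample set."""
    cols = columns(haplotypes)
    hit = sum(any(t == j or r2(cols[t], cols[j]) >= threshold for t in tags)
              for j in range(len(cols)))
    return hit / len(cols)


if __name__ == "__main__":
    reference = ["000", "110", "001", "111"]
    other_set = ["010", "100", "001", "111"]
    tags = greedy_tags(reference)
    print(tags, fraction_tagged(tags, reference), fraction_tagged(tags, other_set))
```

In the toy data the tags chosen on the reference set capture every site there but lose one site in the second set, which is the kind of sample-to-sample effect the tool is meant to measure.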

Modeling variations: the Coalescent
The Coalescent is a simulation technique.
It produces possible genealogies backwards in time, towards the most recent common ancestor (MRCA).
It generates mutations (neutral, non-recurrent).
It is used to describe the statistical properties of DNA samples.
These statistical properties depend on model structure and model parameters.
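For illustration, a minimal sketch of the standard neutral coalescent for a constant-size population without recombination. This is the textbook construction, not the simulator used in the talk; time is in coalescent units, `theta` is the scaled mutation rate, and the function name is hypothetical.

```python
import random


def simulate_coalescent(n, theta, rng):
    """Return n haplotypes (lists of 0/1) under the standard neutral coalescent.

    Assumes constant population size, no recombination, theta > 0, and
    non-recurrent mutations (every mutation creates a new site).
    """
    lineages = [{i} for i in range(n)]  # each lineage: the samples descending from it
    sites = []                          # each site: the samples carrying the derived allele
    while len(lineages) > 1:
        k = len(lineages)
        t = rng.expovariate(k * (k - 1) / 2)  # waiting time to the next coalescence
        for lineage in lineages:
            # Mutations fall on each branch as a Poisson process with rate theta / 2.
            elapsed = rng.expovariate(theta / 2)
            while elapsed < t:
                sites.append(set(lineage))
                elapsed += rng.expovariate(theta / 2)
        i, j = rng.sample(range(k), 2)        # pick two lineages to merge
        merged = lineages[i] | lineages[j]
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return [[int(s in site) for site in sites] for s in range(n)]


if __name__ == "__main__":
    haps = simulate_coalescent(6, 5.0, random.Random(1))
    for h in haps:
        print("".join(map(str, h)))
```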

Two simple statistics
1. Marker density (MD): distribution of the number of SNPs in pairs of sequences
2. Allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples
[Figure labels: “rare”, “common”]

The effects of demographic history
[Figure: MD (simulation) and AFS (direct form) for four histories: stationary, expansion, collapse, bottleneck; time axis from past to present]
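Both statistics are easy to compute from a set of aligned haplotypes. The sketch below assumes biallelic sites encoded as characters in equal-length strings and reports the folded (minor-allele) spectrum; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def marker_density(haplotypes):
    """MD: how many sequence pairs differ at 0, 1, 2, ... sites."""
    return Counter(sum(a != b for a, b in zip(h1, h2))
                   for h1, h2 in combinations(haplotypes, 2))


def allele_frequency_spectrum(haplotypes):
    """AFS (folded): number of polymorphic sites at each minor-allele count."""
    spectrum = Counter()
    for column in zip(*haplotypes):
        allele_counts = Counter(column)
        if len(allele_counts) > 1:  # skip monomorphic sites
            spectrum[min(allele_counts.values())] += 1
    return spectrum


if __name__ == "__main__":
    sample = ["0000", "0001", "0011", "0111"]
    print(marker_density(sample))
    print(allele_frequency_spectrum(sample))
```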

Generating haplotypes with the Coalescent

Global haplotypes vs. local data relevance

Generating data-relevant haplotypes
1. Generate a pair of haplotype sets with Coalescent genealogies. This “models” that the two sets are “related” to each other by being drawn from a single population.
2. Only accept the pair if the first set reproduces the observed haplotype structure of the HapMap reference samples. This enforces relevance to the observed genotype data in the specific region.
3. Use the second haplotype set induced by the same mutations as our computational samples.
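The accept/reject logic of these three steps can be sketched as follows. The simulator here is a toy stand-in (it draws haplotypes from a fixed pool rather than from a Coalescent genealogy), and "reproduces the observed haplotype structure" is interpreted as an exact match of haplotype counts; both are assumptions made only to keep the example self-contained.

```python
import random
from collections import Counter


def data_relevant_sets(observed, simulate_joint, n_wanted, max_tries, rng):
    """Rejection sampling of computational sample sets.

    simulate_joint(size, rng) is a hypothetical callable that returns `size`
    haplotypes drawn jointly (in the real method: from one genealogy).
    """
    target = Counter(observed)
    n = len(observed)
    accepted = []
    for _ in range(max_tries):
        joint = simulate_joint(2 * n, rng)       # step 1: a related pair of sets
        first, second = joint[:n], joint[n:]
        if Counter(first) == target:             # step 2: first set must match the data
            accepted.append(second)              # step 3: keep the second set
            if len(accepted) == n_wanted:
                break
    return accepted


def toy_simulator(size, rng):
    """Stand-in for a coalescent simulator: 2-site haplotypes from a fixed pool."""
    pool = ["00", "00", "11", "11", "01"]
    return [rng.choice(pool) for _ in range(size)]


if __name__ == "__main__":
    sets = data_relevant_sets(["00", "11"], toy_simulator, 5, 1000, random.Random(0))
    print(sets)
```

Even in this toy setting most simulated pairs are rejected, which hints at why the acceptance rate becomes a bottleneck for realistic numbers of samples and markers.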

Generating computational samples
Problem: the efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low, even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling), efficient generation of such genealogies is an unsolved problem.
We are developing a method to generate approximate M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K).
Motivation: composite likelihood approaches to recombination rate estimation by Hudson, Clark, Wall, and others.

Approximating M-site haplotypes as composites of overlapping K-site haplotypes
1. Generate K-site sets
2. Build M-site composites

Piecing together neighboring K-site sets
This should work to the degree to which the constraint at overlapping markers preserves long-range marker association.

Building composite haplotypes A composite haplotype is built from a complete path through the (M-K+1) K-sites.
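One way to picture a path through the (M-K+1) K-sites is the sketch below, which extends a haplotype one marker at a time by requiring agreement on the K-1 overlapping markers. The random choice among compatible K-site haplotypes, the string encoding, and the function name are assumptions; the talk does not specify how paths are chosen.

```python
import random


def build_composite(window_sets, rng):
    """Build one M-site composite haplotype from overlapping K-site sets (K >= 2).

    window_sets[w] is a list of K-site haplotype strings covering markers w..w+K-1,
    so there are M-K+1 windows. Returns None if the path hits a dead end.
    """
    composite = rng.choice(window_sets[0])
    k = len(composite)
    for window in window_sets[1:]:
        overlap = composite[-(k - 1):]  # last K-1 markers placed so far
        candidates = [h for h in window if h[:k - 1] == overlap]
        if not candidates:
            return None
        composite += rng.choice(candidates)[-1]  # extend by one new marker
    return composite


if __name__ == "__main__":
    # 3-site windows cut from the 4-site haplotypes 0101, 0101, 1010.
    windows = [["010", "010", "101"], ["101", "101", "010"]]
    print(build_composite(windows, random.Random(0)))
```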

Initial results: 3-site composite haplotypes
[Figure: a typical 3-site composite]
Data: 30 CEPH HapMap reference individuals (60 chr), Hinds et al., Science 2005

3-site composite vs. data

3-site composites: the “best case”
[Figure labels: “short-range”, “long-range”; 1. generate K-site sets]

Variability across sets
The purpose of the composite haplotype sets is to model sample variance across consecutive data sets. But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-sites are used.

4-site composite haplotypes 4-site composite

“Best-case” 4 site composites Composite of exact 4-site sub-haplotypes

Variability across 4-site composites

… is comparable to the variability across data sets.

Software engineering aspects: efficiency
To do larger-scale testing we must first improve the efficiency of generating composite sets.
Currently, we run fresh Coalescent runs at each K-site (several hours per region).
The total number of genotyped SNPs is ~1 million, so there are ~1 million different K-sites to match. Any given Coalescent genealogy is likely to match one or more of these.
Computational haplotype sets can be databased efficiently: 4 HapMap populations x 1 million K-sites x 1,000 computational sets x 50 bytes = 2 x 10^11 bytes, i.e. about 200 gigabytes (roughly 186 GiB).

Un-phased genotypes
[Figure: genotypes (AC)(CG)(AT)(CT) and one reconstruction into haplotypes A G A C and C C T T]
The primary data represent diploid genotypes.
One has the choice to “reconstruct” the haplotypes with statistical methods as shown (e.g. the PHASE program); this may be inaccurate.
Or one may account for all possible reconstructions when evaluating data-relevance; this is computationally very expensive.
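To see why accounting for all reconstructions is expensive, the sketch below enumerates every haplotype pair consistent with an un-phased genotype. A genotype with h heterozygous sites has 2^(h-1) distinct reconstructions, so the count doubles with each additional heterozygous site. The function name and the tuple encoding of genotypes are illustrative.

```python
from itertools import product


def enumerate_phasings(genotypes):
    """List all distinct haplotype pairs consistent with un-phased genotypes.

    genotypes: list of (allele, allele) tuples, one per site.
    """
    het = [i for i, (a, b) in enumerate(genotypes) if a != b]
    phasings = []
    # Fixing the orientation of the first heterozygous site avoids
    # counting each haplotype pair twice.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        hap1, hap2 = [], []
        for i, (a, b) in enumerate(genotypes):
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        phasings.append(("".join(hap1), "".join(hap2)))
    return phasings


if __name__ == "__main__":
    genotype = [("A", "C"), ("C", "G"), ("A", "T"), ("C", "T")]
    for pair in enumerate_phasings(genotype):
        print(pair)
```

For the four heterozygous sites on the slide this yields 8 candidate haplotype pairs, one of which is the reconstruction shown there (AGAC and CCTT).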

Conclusions
3-site composites are unlikely to work.
4-site composites are very promising.
Both the initial results and the expected payoff justify going ahead.
More thorough statistical analyses, performance evaluations, and algorithmic development work lie ahead.

Acknowledgements
Eric Tsung, Aaron Quinlan, Ike Unsal, Eva Czabarka (Dept. Mathematics, William & Mary)