Lecture 7.01 The informatics of SNPs and haplotypes Gabor T. Marth Department of Biology, Boston College marth@bc.edu CGDN Bioinformatics Workshop June 25, 2007

Lecture 7.02 Why do we care about variations? They underlie phenotypic differences, cause inherited diseases, and allow tracking of ancestral human history.

Lecture 7.03 How do we find sequence variations? Look at multiple sequences from the same genome region; use base quality values to decide if mismatches are true polymorphisms or sequencing errors.

Lecture 7.04 Steps of SNP discovery: sequence clustering, cluster refinement, multiple alignment, SNP detection.

Lecture 7.05 Computational SNP mining – PolyBayes Two innovative ideas: 1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources. 2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors. [Figure labels: sequencing error; true polymorphism]
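To make the role of base quality values concrete, here is a minimal sketch. It is not the PolyBayes algorithm: it is a toy two-read model, assuming Phred-scaled qualities (error probability 10^(-Q/10)), an illustrative prior polymorphism rate of 0.001, and errors spread evenly over the three other bases. The function names are hypothetical.

```python
def phred_to_error_prob(quality):
    """Convert a Phred-scaled base quality Q to an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def mismatch_posterior(q1, q2, prior_snp=0.001):
    """Toy posterior probability that a mismatch between two aligned
    bases reflects a true polymorphism rather than a sequencing error.

    Assumptions (illustrative only): exactly two reads, a mismatch is
    observed, and a miscalled base is equally likely to be any of the
    three other nucleotides.
    """
    e1 = phred_to_error_prob(q1)
    e2 = phred_to_error_prob(q2)
    # Both base calls correct: the difference is real.
    like_snp = (1 - e1) * (1 - e2)
    # Site is monomorphic and exactly one call is wrong, producing the
    # specific mismatching base observed (probability 1/3 given an error).
    like_err = (e1 * (1 - e2) + e2 * (1 - e1)) / 3
    numerator = prior_snp * like_snp
    return numerator / (numerator + (1 - prior_snp) * like_err)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, round(mismatch_posterior(q, q), 4))
```

With two high-quality bases the mismatch is most likely real; with two low-quality bases it is most likely an error, which is the intuition behind weighting mismatches by quality.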

Lecture 7.06 SNP mining steps – PolyBayes Sequence clustering simplifies to database search with the genome reference; paralog filtering by counting mismatches weighted by quality values; multiple alignment by anchoring fragments to the genome reference; SNP detection by differentiating true polymorphism from sequencing error using quality values.

Lecture 7.07 SNP discovery with PolyBayes 1. Fragment recruitment (database search) 2. Anchored alignment 3. Paralog identification 4. SNP detection [Figure label: genome reference sequence]

Lecture 7.08 Polymorphism discovery SW Marth et al. Nature Genetics 1999

Lecture 7.09 Genotyping by sequence SNP discovery usually deals with single-stranded (clonal) sequences. It is often necessary to determine the allele state of individuals at known polymorphic locations. Genotyping usually involves double-stranded DNA, so the possibility of heterozygosity exists: there is no unique underlying nucleotide and no meaningful base quality value, hence statistical methods of SNP discovery do not apply.

Lecture 7.010 Het detection = Diploid base calling Automated detection of heterozygous positions in diploid individual samples. [Figure panels: Homozygous T; Homozygous C; Heterozygous C/T]

Lecture 7.011 Large SNP mining projects Sachidanandam et al. Nature 2001 ~ 8 million EST WGS BAC genome reference

Lecture 7.012 Variation structure is heterogeneous [Figure panels: chromosomal averages; polymorphism density along chromosomes]

Lecture 7.013 What explains nucleotide diversity? Candidate factors: G+C nucleotide content, CpG di-nucleotide content, recombination rate, functional constraints.

Region            Nucleotide diversity
3' UTR            5.00 x 10^-4
5' UTR            4.95 x 10^-4
Exon, overall     4.20 x 10^-4
Exon, coding      3.77 x 10^-4
  synonymous      366 / 653
  non-synonymous  287 / 653

Variance is so high that these quantities are poor predictors of nucleotide diversity in local regions; hence random processes are likely to govern the basic shape of the genome variation landscape: (random) genetic drift.

Lecture 7.014 Where do variations come from? Sequence variations are the result of mutation events. Mutations are propagated down through generations and determine present-day variation patterns. [Figure: genealogy descending from the MRCA, with lineages carrying TAAAAAT or TAACAAT]

Lecture 7.015 Neutrality vs. selection Selective mutations influence the genealogy itself; in the case of neutral mutations the processes of mutation and genealogy are decoupled.

Functional constraints:

Region            Nucleotide diversity
3' UTR            5.00 x 10^-4
5' UTR            4.95 x 10^-4
Exon, overall     4.20 x 10^-4
Exon, coding      3.77 x 10^-4
  synonymous      366 / 653
  non-synonymous  287 / 653

The genome shows signals of selection, but on the genome scale neutral effects dominate.

Lecture 7.016 Mutation rate Higher mutation rate (µ) gives rise to more SNPs. There is evidence for regional differences in observed mutation rates in the genome. [Figure: two genealogies descending from the MRCA, with sample sequences accgttatgtaga / accgctatgtaga and actgttatgtaga / accgctatataga; plot labels: CpG content, SNP density]

Lecture 7.017 Long-term demography Different world populations have varying long-term effective population sizes (e.g. African N is larger than European). [Figure panels: small (effective) population size N; large (effective) population size N]
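How mutation rate and effective population size jointly set the expected number of SNPs can be sketched with the standard neutral, constant-size expectation E[S] = theta * (1 + 1/2 + ... + 1/(n-1)), with theta = 4 * N * µ * L. This is a textbook result rather than something stated on the slides, and the parameter values below (N, µ, region length) are illustrative assumptions only.

```python
def harmonic_number(n_samples):
    """Sum of 1/i for i = 1 .. n_samples - 1."""
    return sum(1 / i for i in range(1, n_samples))


def expected_segregating_sites(n_eff, mu, length, n_samples):
    """Expected number of SNPs in a sample of n_samples chromosomes.

    Assumes a neutral, constant-size diploid population:
    theta = 4 * N_e * mu * L, and E[S] = theta * harmonic_number(n).
    """
    theta = 4 * n_eff * mu * length
    return theta * harmonic_number(n_samples)


if __name__ == "__main__":
    # Illustrative values, not estimates taken from the lecture.
    for n_eff in (5_000, 10_000, 20_000):
        s = expected_segregating_sites(n_eff, mu=1e-8, length=10_000, n_samples=20)
        print(f"N={n_eff}: expected SNPs in 10 kb = {s:.2f}")
```

Doubling either N or µ doubles the expected SNP count under this model, which is why the two are hard to separate from SNP density alone.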

Lecture 7.018 Population subdivision Geographically subdivided populations will have differences between their respective variation structures. [Figure labels: unique; shared]

Lecture 7.019 Recombination [Figure: chromosomes with sequences acggttatgtaga and accgttatgtaga]

Lecture 7.020 Recombination Recombination has a crucial effect on the association between different alleles. [Figure: chromosomes with sequences acggttatgtaga and accgttatgtaga]

Lecture 7.021 Modeling genetic drift: Genealogy In a randomly mating population, the genealogy evolves in a non-deterministic fashion. [Figure label: present generation]

Lecture 7.022 Modeling genetic drift: Mutation Mutations randomly “drift”: they die out, go to higher frequency, or get fixed.
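The drifting of a neutral mutation can be illustrated with a minimal Wright-Fisher forward simulation. This is a sketch under simple assumptions (constant population size, random mating, no selection or new mutation); the function name and parameters are hypothetical.

```python
import random


def simulate_drift(n_chromosomes, initial_count, generations, seed=None):
    """Track the copy number of a neutral allele over time.

    Each generation, every chromosome independently picks a parent at
    random, so the new count is a binomial draw with success
    probability equal to the current allele frequency.
    """
    rng = random.Random(seed)
    count = initial_count
    trajectory = [count]
    for _ in range(generations):
        freq = count / n_chromosomes
        count = sum(1 for _ in range(n_chromosomes) if rng.random() < freq)
        trajectory.append(count)
    return trajectory


if __name__ == "__main__":
    traj = simulate_drift(n_chromosomes=100, initial_count=10, generations=500, seed=7)
    final = traj[-1]
    outcome = "lost" if final == 0 else "fixed" if final == 100 else "segregating"
    print(f"allele is {outcome} after {len(traj) - 1} generations")
```

Running it with different seeds shows the three fates named on the slide: loss, rise in frequency, or fixation.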

Lecture 7.023 Modulators: Natural selection Negative (purifying) selection; positive selection. The genealogy is no longer independent of (and hence cannot be decoupled from) the mutation process.

Lecture 7.024 Modeling ancestral processes Two approaches: “forward simulations” and the “Coalescent” process. By focusing on a small sample, complexity of the relevant part of the ancestral process is greatly reduced. There are, however, limitations.
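The reduction in complexity from following only the sample can be seen in a minimal sketch of the standard coalescent for a constant-size population. It assumes the usual time scaling (units of 2N generations for a diploid population of size N), in which k lineages coalesce at rate k(k-1)/2; the function names are hypothetical.

```python
import random


def coalescent_times(n_samples, rng):
    """Waiting times between successive coalescent events.

    While k lineages remain, the time to the next coalescence is
    exponential with rate k * (k - 1) / 2 (time in units of 2N
    generations, constant population size assumed).
    """
    times = []
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2
        times.append(rng.expovariate(rate))
    return times


def time_to_mrca(n_samples, rng):
    """Total time back to the most recent common ancestor (MRCA)."""
    return sum(coalescent_times(n_samples, rng))


if __name__ == "__main__":
    rng = random.Random(42)
    reps = [time_to_mrca(10, rng) for _ in range(5000)]
    print("mean TMRCA:", sum(reps) / len(reps))
    print("theory    :", 2 * (1 - 1 / 10))
```

Only n - 1 random numbers are needed per genealogy, regardless of population size, whereas a forward simulation must track every individual in every generation.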

Lecture 7.025 Models of demographic history Candidate histories (past to present): stationary, expansion, collapse, bottleneck. Data: MD (simulation); AFS (direct form).

Lecture 7.026 Data: polymorphism distributions

1. marker density (MD): distribution of number of SNPs in pairs of sequences

Clone 1   Clone 2   # SNPs
AL00675   AL00982   8
AS81034   AK43001   0
CB00341   AL43234   2

2. allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples (from “rare” to “common”)

SNP   Minor allele   Allele count
A/G   A              1
C/T   T              9
A/G   G              3
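Both distributions can be tabulated directly from aligned sequences. The sketch below assumes the input is a list of equal-length, already aligned strings and counts only biallelic sites; the function names and the toy data are illustrative, not from the lecture.

```python
from collections import Counter
from itertools import combinations


def allele_frequency_spectrum(sequences):
    """Map minor-allele count -> number of biallelic sites with that count."""
    spectrum = Counter()
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) == 2:  # skip monomorphic and multi-allelic sites
            spectrum[min(counts.values())] += 1
    return spectrum


def marker_density(sequences):
    """Map number of differences -> number of sequence pairs showing it."""
    return Counter(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(sequences, 2)
    )


if __name__ == "__main__":
    sample = ["ACGT", "ACGA", "TCGA", "ACGT"]  # toy alignment
    print("AFS:", dict(allele_frequency_spectrum(sample)))
    print("MD :", dict(marker_density(sample)))
```

The AFS needs several chromosomes sampled at the same sites, while marker density needs only pairs, which is why the two data types come from different kinds of sequencing projects.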

Lecture 7.027 Model: processes that generate SNPs Approaches: computable formulations; simulation procedures. [Figure labels: 3/5, 1/5, 2/5]

Lecture 7.028 Models of demographic history Candidate histories (past to present): stationary, expansion, collapse, bottleneck. Data: MD (simulation); AFS (direct form).

Lecture 7.029 Data fitting: marker density Marth et al. PNAS 2003. Best model is a bottleneck-shaped population size history: N1 = 6,000; T1 = 1,200 gen.; N2 = 5,000; T2 = 400 gen.; N3 = 11,000. Our conclusions from the marker density data are confounded by the unknown ethnicity of the public genome sequence, so we looked at allele frequency data from ethnically defined samples.

Lecture 7.030 Data fitting: allele frequency Model consensus: bottleneck, with N1 = 20,000; T1 = 3,000 gen.; N2 = 2,000; T2 = 400 gen.; N3 = 10,000. Data from other populations?

Lecture 7.031 Population specific demographic history Marth et al. Genetics 2004. European data: bottleneck. African data: modest but uninterrupted expansion.

Lecture 7.032 Model-based prediction A computational model encapsulating what we know about the process: genealogy + mutations; allele structure; arbitrary number of additional replicates.

Lecture 7.033 Prediction – allele frequency and age Contribution of the past to alleles in various frequency classes; average age of polymorphism. [Figure panels: African data; European data]

Lecture 7.034 How to use markers to find disease?

Lecture 7.035 Allelic association Allelic association is the non-random assortment between alleles, i.e. it measures how well knowledge of the allele state at one site permits prediction at another. By necessity, the strength of allelic association is measured between markers. Significant allelic association between a marker and a functional site permits localization (mapping) even without having the functional site in our collection. There are pair-wise and multi-locus measures of association. [Figure labels: marker site; functional site]

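Lecture 7.036 Linkage disequilibrium LD measures the deviation from random assortment of the alleles at a pair of polymorphic sites: D = f(AB) - f(A) x f(B), where f(AB) is the frequency of the haplotype carrying allele A at the first site and allele B at the second, and f(A) and f(B) are the corresponding allele frequencies. Other measures of LD are derived from D, e.g. by normalizing according to allele frequencies (r^2).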
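A minimal sketch of the pair-wise calculation, assuming phased two-site haplotypes are available and using the standard definitions D = f(AB) - f(A) x f(B) and r^2 = D^2 / (f(A)(1 - f(A)) f(B)(1 - f(B))). The allele labels 'A' and 'B' and the function name are assumptions for illustration.

```python
def linkage_disequilibrium(haplotypes, allele1="A", allele2="B"):
    """Return (D, r_squared) for a list of two-character haplotypes.

    haplotypes: strings such as "AB", "Ab", "aB", "ab", where the first
    character is the allele at site 1 and the second the allele at site 2.
    Both sites must be polymorphic in the sample.
    """
    n = len(haplotypes)
    f_a = sum(h[0] == allele1 for h in haplotypes) / n
    f_b = sum(h[1] == allele2 for h in haplotypes) / n
    f_ab = sum(h[0] == allele1 and h[1] == allele2 for h in haplotypes) / n
    d = f_ab - f_a * f_b
    r_squared = d ** 2 / (f_a * (1 - f_a) * f_b * (1 - f_b))
    return d, r_squared


if __name__ == "__main__":
    print(linkage_disequilibrium(["AB", "AB", "ab", "ab"]))  # complete association
    print(linkage_disequilibrium(["AB", "Ab", "aB", "ab"]))  # random assortment
```

Normalizing by the allele frequencies puts r^2 on a 0 to 1 scale, which makes values comparable across SNP pairs with different frequencies.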

Lecture 7.037 Haplotype diversity The most useful multi-marker measures of association are related to haplotype diversity. Random assortment of alleles at different sites: n markers give 2^n possible haplotypes. Strong association: most chromosomes carry one of a few common haplotypes (reduced haplotype diversity).
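Reduced haplotype diversity can be quantified by comparing the haplotypes actually observed with the 2^n that n biallelic markers allow. The sketch below assumes phased haplotypes encoded as equal-length strings; the 5% cutoff for "common" and the function name are illustrative assumptions.

```python
from collections import Counter


def haplotype_diversity(haplotypes, common_freq=0.05):
    """Summarize haplotype diversity in a sample of phased chromosomes."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    n_markers = len(haplotypes[0])
    common = {h: c / total for h, c in counts.items() if c / total >= common_freq}
    return {
        "observed": len(counts),           # distinct haplotypes seen
        "possible": 2 ** n_markers,        # upper bound for biallelic markers
        "common": common,                  # haplotypes at or above the cutoff
        "coverage": sum(common.values()),  # share of chromosomes they carry
    }


if __name__ == "__main__":
    sample = ["000"] * 5 + ["111"] * 4 + ["010"]
    print(haplotype_diversity(sample, common_freq=0.2))
```

A small "observed" count and a high "coverage" by a few common haplotypes is the pattern the slide describes as strong association.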

Lecture 7.038 Haplotype blocks Daly et al. Nature Genetics 2001 experimental evidence for reduced haplotype diversity (mainly in European samples)

Lecture 7.039 The promise for medical genetics Within blocks a small number of SNPs are sufficient to distinguish the few common haplotypes, so significant marker reduction is possible. If the block structure is a general feature of human variation structure, whole-genome association studies will be possible at a reduced genotyping cost. This motivated the HapMap project (Gibbs et al. Nature 2003). [Figure: example haplotypes CACTACCGA, CACGACTAT, TTGGCGTAT]

Lecture 7.040 The HapMap initiative Goal: to map out human allele and association structure at the kilobase scale. Deliverables: a set of physical and informational reagents.

Lecture 7.041 HapMap physical reagents Reference samples: 4 world populations, ~100 independent chromosomes from each. SNPs: computational candidates where both alleles were seen in multiple chromosomes. Genotypes: high-accuracy assays from various platforms; fast public data release.

Lecture 7.042 Informational: haplotypes The problem: the substrate for genotyping is diploid, genomic DNA; phasing of alleles at multiple loci is in general not possible with certainty. Experimental methods of haplotype determination (single-chromosome isolation followed by whole-genome PCR amplification, radiation hybrids, somatic cell hybrids) are expensive and laborious. [Figure: unphased alleles A T C T G C C A]

Lecture 7.043 Haplotype inference Parsimony approach: minimize the number of different haplotypes that explains all diploid genotypes in the sample (Clark Mol Biol Evol 1990). Maximum likelihood approach: estimate haplotype frequencies that are most likely to produce observed diploid genotypes (Excoffier & Slatkin Mol Biol Evol 1995). Bayesian methods: estimate haplotypes based on the observed diploid genotypes and the a priori expectation of haplotype patterns informed by Population Genetics (Stephens et al. AJHG 2001).
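The maximum likelihood idea can be illustrated with a minimal expectation-maximization (EM) sketch for the simplest case of two biallelic sites. It is a simplified illustration in the spirit of the maximum likelihood approach, not the published implementation. Genotypes are assumed to be coded as counts (0, 1, 2) of the alternate allele at each site, random mating is assumed, and the function name is hypothetical.

```python
def em_two_locus(genotypes, n_iter=100):
    """Estimate two-site haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), each g in {0, 1, 2} = number of copies
    of the alternate allele at that site. Only the double heterozygote
    (1, 1) is phase-ambiguous: it is either 00/11 or 01/10.
    """
    freqs = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = dict.fromkeys(freqs, 0.0)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: split the individual between the two phasings
                # in proportion to their probability under current freqs.
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is unambiguous: read the two haplotypes off directly.
                a1 = (0, 1) if g1 == 1 else (g1 // 2, g1 // 2)
                a2 = (0, 1) if g2 == 1 else (g2 // 2, g2 // 2)
                counts[(a1[0], a2[0])] += 1
                counts[(a1[1], a2[1])] += 1
        # M-step: new frequencies are the expected haplotype proportions.
        freqs = {h: c / n_chrom for h, c in counts.items()}
    return freqs


if __name__ == "__main__":
    data = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
    for hap, f in sorted(em_two_locus(data).items()):
        print(hap, round(f, 4))
```

In this toy sample the homozygotes show only haplotypes 00 and 11, so EM resolves the double heterozygotes as 00/11. With more sites the number of possible phasings grows quickly, which is part of what motivates the parsimony and Bayesian alternatives.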

Lecture 7.044 Haplotype inference http://pga.gs.washington.edu/

Lecture 7.045 Haplotype annotations – LD based Pair-wise LD-plots (Wall & Pritchard Nature Rev Gen 2003). LD-based multi-marker block definitions require strong pair-wise LD between all pairs in the block.

Lecture 7.046 Annotations – haplotype blocks Dynamic programming approach (Zhang et al. AJHG 2001): 1. meet block definition based on common haplotype requirements; 2. within each block, determine the number of SNPs that distinguishes common haplotypes (htSNPs); 3. minimize the total number of htSNPs over the complete region including all blocks.

Lecture 7.047 Haplotype tagging SNPs (htSNPs) Find groups of SNPs such that each possible pair is in strong LD (above threshold). Carlson AJHG 2005
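One simple way to turn an LD threshold into a tag set is greedy binning, sketched below. This is a simplification for illustration and not necessarily the published procedure: each bin only requires strong LD between the chosen tag and the other members, not between every pair as stated above. The input is assumed to be a precomputed symmetric matrix of pair-wise r^2 values; the function name and the 0.8 threshold are assumptions.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Pick tag SNPs by greedy LD binning.

    r2: square matrix (list of lists) of pair-wise r^2 values.
    Repeatedly choose the untagged SNP that is in strong LD with the
    most untagged SNPs, then mark it and its partners as tagged.
    Returns the list of tag SNP indices in the order chosen.
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(
            sorted(untagged),
            key=lambda i: sum(1 for j in untagged if r2[i][j] >= threshold),
        )
        tags.append(best)
        untagged -= {j for j in untagged if j == best or r2[best][j] >= threshold}
    return tags


if __name__ == "__main__":
    # Toy matrix: SNPs 0-2 are mutually correlated, SNP 3 stands alone.
    r2 = [
        [1.00, 0.90, 0.85, 0.10],
        [0.90, 1.00, 0.90, 0.20],
        [0.85, 0.90, 1.00, 0.10],
        [0.10, 0.20, 0.10, 1.00],
    ]
    print(greedy_tag_snps(r2))
```

In the toy matrix, two tags stand in for four SNPs, which is the kind of marker reduction that motivates tagging.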

Lecture 7.048 http://bioinformatics.bc.edu/marthlab
