Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads
Justin Kennedy (1)
Joint work with Sanjiv Dinakar (1), Yozen Hernandez (2), Ion Mandoiu (1), and Yufeng Wu (1)
(1) CSE Department, University of Connecticut
(2) Department of Computer Science, Hunter College

Outline
- Introduction
- Single SNP Genotype Calling
- Multilocus Genotyping Problem
- HMM-Posterior Algorithm
- Experimental Results
- Conclusion

Next Generation Sequencing (NGS)
- Illumina / Solexa Genome Analyzer 1G: 1000 Mb/run, 35 bp reads
- Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400 bp reads
- Applied Biosystems SOLiD: 3000 Mb/run, 25-35 bp reads
- New massively parallel sequencing technologies deliver orders of magnitude higher throughput than Sanger sequencing
- More improvements are expected in the quest for the $1,000 genome

NGS Applications
- Besides reducing the cost of de novo genome sequencing, NGS has found many more applications: resequencing, transcriptomics (RNA-Seq), gene regulation (non-coding RNAs, transcription factor binding sites using ChIP-Seq), epigenetics (methylation, nucleosome modifications), metagenomics, paleogenomics, …
- NGS is enabling personal genomics
  - James Watson genome [Wheeler et al 08] sequenced using 454 technology for ~$1 million, compared to ~$100 million for the Sanger-sequenced Venter genome [Levy et al 07]
  - Thousands more individual genomes to be sequenced as part of the 1000 Genomes Project

Challenges in Medical Applications of Sequencing
- Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements)
  - Requires accurate determination of both alleles at variable loci
  - Accuracy is limited by coverage depth due to the random nature of shotgun sequencing
- For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips has shown only 75-80% accuracy for sequencing-based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08]
- [Wendl&Wilson 08] predict that 21x coverage will be required for sequencing of normal tissue samples, based on idealized theory that “neglects any heuristic inputs”
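
To illustrate why coverage depth limits heterozygous calls, here is a minimal sketch under an idealized assumption that is not taken from the slides: the number of reads covering each allele of a heterozygous SNP is an independent Poisson variable with mean equal to half the average coverage. The function name `prob_both_alleles_covered` is hypothetical, and the model ignores sequencing and mapping errors.

```python
import math

def prob_both_alleles_covered(coverage, min_reads=1):
    """P(each allele of a heterozygous SNP is seen in >= min_reads reads).

    Assumption (idealized): reads covering each allele are independent
    Poisson variables with mean coverage / 2.
    """
    lam = coverage / 2.0
    # Probability that one allele is covered by fewer than min_reads reads.
    p_below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                  for k in range(min_reads))
    return (1.0 - p_below) ** 2

# Compare a few average coverages, requiring 1 or 2 reads per allele.
for c in (2, 5.6, 7.5, 21):
    print(c,
          round(prob_both_alleles_covered(c, min_reads=1), 3),
          round(prob_both_alleles_covered(c, min_reads=2), 3))
```

Under this toy model the chance of seeing both alleles drops quickly as coverage falls, which is the effect the coverage-based calling rules have to contend with.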

Do Heuristic Inputs Help?
- We propose methods incorporating two additional sources of information:
  - Quality scores reflecting uncertainty in sequencing data
  - Allele frequency and linkage disequilibrium (LD) information extracted from reference panels such as Hapmap
- Experiments on a subset of the James Watson 454 reads show that our methods yield improved genotyping accuracy
- The improvement depends on the coverage depth (it is higher at lower coverage); e.g., the accuracy achieved by the binomial test of [Wheeler et al. 08] at 5.6-fold mapped read coverage is achieved by our methods using less than 1/4 of the reads

Basic Notations
- Biallelic SNPs: 0 = major allele, 1 = minor allele (reads with non-reference alleles are discarded)
- SNP genotypes: 0/2 = homozygous major/minor, 1 = heterozygous
- Inferred genotypes
- Mapped reads with allele 0
- Mapped reads with allele 1
- Sequencing errors

Prior Methods for Calling SNP Genotypes from Read Data
- Prior methods are all based on allele coverage
- [Levy et al 07] require that each allele be covered by at least 2 reads in order to be called
- [Wheeler et al 08] use hypothesis testing based on the binomial distribution
  - To call a heterozygous genotype, each allele must be covered by at least one read and the binomial probability of the observed numbers of 0 and 1 alleles must be at least 0.01
- [Wendl&Wilson 08] generalize these methods by allowing an arbitrary minimum allele coverage k
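
The coverage-based rules above can be sketched as follows. This is an illustration of the rules as summarized on the slide, not the cited authors' code. In particular, it assumes that "binomial probability" means the point probability of the observed allele counts under Binomial(n, 0.5); the function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def call_het_binomial(n0, n1, threshold=0.01):
    """Heterozygous call from allele counts (n0 reads with allele 0, n1 with allele 1).

    Assumed reading of the rule: both alleles observed at least once, and the
    point probability of the observed split under Binomial(n, 0.5) is at
    least `threshold`.
    """
    if n0 < 1 or n1 < 1:
        return False
    return binomial_pmf(n1, n0 + n1, 0.5) >= threshold

def alleles_called_min_coverage(n0, n1, k=2):
    """Minimum-coverage rule: an allele is called if at least k reads show it."""
    return (n0 >= k, n1 >= k)

print(call_het_binomial(3, 3))               # balanced counts
print(call_het_binomial(1, 11))              # very skewed counts
print(alleles_called_min_coverage(3, 1, 2))  # allele 1 seen only once
```

Both rules look only at allele counts, so they ignore base qualities and any information from neighboring SNPs.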

Incorporating Base Call Uncertainty
- Let r_i denote the set of mapped reads covering SNP locus i, and let c_i = |r_i|
- For a read r in r_i, r(i) denotes the allele observed at locus i
- If q_{r(i)} is the phred quality score of r(i), the probability that r(i) is incorrect is $\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$
- The probability $P(r_i \mid G_i)$ of observing read set r_i conditional on having genotype G_i is then expressed in terms of these per-read error probabilities
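
As a minimal sketch of how quality scores can enter the read-set likelihood: the phred conversion below is the standard definition, while the likelihood itself rests on two assumptions that the transcript does not spell out, namely that a base-call error turns one allele of the biallelic SNP into the other, and that a read from a heterozygous site samples either chromosome with probability 1/2. Function names are hypothetical.

```python
def phred_to_error(q):
    """Standard phred definition: error probability 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)

def read_set_likelihood(reads, genotype):
    """P(r_i | G_i = genotype) for reads given as (allele, phred_quality) pairs.

    allele is 0 or 1; genotype is 0, 1 or 2 (count of minor alleles).
    Reads are treated as independent given the genotype (assumption).
    """
    lik = 1.0
    for allele, q in reads:
        eps = phred_to_error(q)
        if genotype == 1:
            # Half the reads come from each chromosome:
            # (1 - eps) / 2 + eps / 2 = 1 / 2 regardless of the observed allele.
            lik *= 0.5
        else:
            true_allele = genotype // 2  # genotype 0 -> allele 0, genotype 2 -> allele 1
            lik *= (1.0 - eps) if allele == true_allele else eps
    return lik

reads = [(0, 20), (1, 20)]  # one read per allele, both with phred quality 20
for g in (0, 1, 2):
    print(g, read_set_likelihood(reads, g))
```

Under these assumptions the heterozygous likelihood depends only on the number of reads, while the homozygous likelihoods are driven by the quality of the reads that disagree with the genotype.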

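Single SNP Genotype Calling
- Applying Bayes' formula: $P(G_i \mid r_i) = \frac{P(r_i \mid G_i)\, P(G_i)}{\sum_{g \in \{0,1,2\}} P(r_i \mid G_i = g)\, P(G_i = g)}$
- The priors $P(G_i)$ are based on allele frequencies inferred from a representative panel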
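
A small sketch of single-SNP calling by Bayes' formula follows. The prior uses Hardy-Weinberg proportions computed from the panel minor allele frequency, which is an assumption of this example rather than something stated on the slide. The likelihood uses the same simplified error model as in the previous sketch, and all names are hypothetical.

```python
def genotype_posteriors(reads, minor_allele_freq):
    """Posterior P(G_i = g | r_i) for g in 0, 1, 2.

    reads: list of (allele, phred_quality) pairs with allele in {0, 1}.
    Prior: Hardy-Weinberg proportions from the minor allele frequency
    (an assumption of this sketch).
    """
    p = minor_allele_freq
    prior = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}
    lik = {0: 1.0, 1: 1.0, 2: 1.0}
    for allele, q in reads:
        eps = 10.0 ** (-q / 10.0)                       # phred error probability
        lik[0] *= (1 - eps) if allele == 0 else eps     # homozygous major
        lik[1] *= 0.5                                   # heterozygous
        lik[2] *= (1 - eps) if allele == 1 else eps     # homozygous minor
    joint = {g: prior[g] * lik[g] for g in prior}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}

def call_genotype(reads, minor_allele_freq):
    """Genotype with maximum posterior probability."""
    post = genotype_posteriors(reads, minor_allele_freq)
    return max(post, key=post.get)

mixed = [(0, 30)] * 3 + [(1, 30)] * 3
print(genotype_posteriors(mixed, 0.3))
print(call_genotype(mixed, 0.3), call_genotype([(0, 30)] * 6, 0.1))
```

With no reads at a locus the posterior equals the prior, which is why a single-SNP caller gains little from the panel at very low coverage unless LD with neighboring loci is also used.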

Probabilistic Model for Multilocus Case
[Figure: graphical model with, at each locus i = 1..n, founder states F_i and F'_i, haplotype alleles H_i and H'_i, genotype G_i, and reads R_{i,1}, …, R_{i,c_i}]
- HMMs representing LD in the populations of origin for mother/father; similar to models used in [Scheet & Stephens 06, Rastas et al 08, Kennedy et al 08]

Model Training
- Initial founder probabilities P(f_1), P(f'_1), transition probabilities P(f_{i+1} | f_i), P(f'_{i+1} | f'_i), and emission probabilities P(h_i | f_i), P(h'_i | f'_i) are trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
- P(g_i | h_i, h'_i) is set to 1 if h_i + h'_i = g_i and to 0 otherwise
- This implies that the conditional probabilities for sets of reads are given by the formulas derived for the single SNP case

Multilocus Genotyping Problem
- GIVEN:
  - Shotgun read sets r = (r_1, r_2, …, r_n)
  - Quality scores
  - Trained HMM models representing LD in the populations of origin for mother/father
- FIND:
  - Multilocus genotype g* = (g*_1, g*_2, …, g*_n) with maximum posterior probability, i.e., g* = argmax_g P(g | r)
- Remark: max_g P(g | r) is hard to approximate within [...] unless ZPP = NP, and thus the multilocus genotyping problem is NP-hard

HMM-Posterior Decoding Algorithm
1. For each i = 1..n, compute $g^*_i = \arg\max_{g_i} P(g_i \mid r)$
2. Return $g^* = (g^*_1, g^*_2, \ldots, g^*_n)$

Forward-Backward Computation of Posterior Probabilities
[Figure, shown in several animation steps: the graphical model at locus i, with founder states f_i and f'_i, haplotype alleles h_i and h'_i, genotype g_i, and reads r_{1,1}, …, r_{i,c_i}, …, R_{n,c_n}]

Implementation Details
- Forward recurrences: [...]
- Backward recurrences are similar
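
The forward-backward computation of per-locus genotype posteriors can be sketched as below. This is a generic forward-backward pass over pairs of founder states written for illustration, not the recurrences from the slide. Assumptions made here: both parental chains share one set of HMM parameters, read likelihoods P(r_i | g_i) are supplied precomputed, and no numerical rescaling is done (so long inputs would underflow). All names are hypothetical.

```python
def posterior_decode(init, trans, emit, read_lik):
    """Per-locus posterior decoding: g*_i = argmax_g P(g_i = g | r).

    init[k]        = P(f_0 = k) over K founder states
    trans[i][a][k] = P(f_{i+1} = k | f_i = a)
    emit[i][k]     = P(h_i = 1 | f_i = k)
    read_lik[i][g] = P(r_i | g_i = g) for g in 0, 1, 2
    """
    n, K = len(emit), len(init)
    R = range(K)

    def hap(i, k, h):  # P(h_i = h | f_i = k)
        return emit[i][k] if h == 1 else 1.0 - emit[i][k]

    def geno_emit(i, f, fp, g):  # P(g_i = g, r_i | f_i = f, f'_i = fp)
        return read_lik[i][g] * sum(hap(i, f, h) * hap(i, fp, g - h)
                                    for h in (0, 1) if g - h in (0, 1))

    def emission(i, f, fp):  # P(r_i | f_i = f, f'_i = fp)
        return sum(geno_emit(i, f, fp, g) for g in (0, 1, 2))

    def forward_step(mat, T):
        # out[f][fp] = sum_{a,b} mat[a][b] T[a][f] T[b][fp], computed in two
        # passes of O(K^3) work each instead of one O(K^4) pass.
        tmp = [[sum(mat[a][b] * T[b][fp] for b in R) for fp in R] for a in R]
        return [[sum(tmp[a][fp] * T[a][f] for a in R) for fp in R] for f in R]

    def backward_step(mat, T):
        # out[f][fp] = sum_{a,b} T[f][a] T[fp][b] mat[a][b]
        tmp = [[sum(T[fp][b] * mat[a][b] for b in R) for fp in R] for a in R]
        return [[sum(T[f][a] * tmp[a][fp] for a in R) for fp in R] for f in R]

    # pred[i][f][fp] = P(r_0..r_{i-1}, f_i = f, f'_i = fp)
    pred = [[[init[f] * init[fp] for fp in R] for f in R]]
    for i in range(n - 1):
        alpha = [[pred[i][f][fp] * emission(i, f, fp) for fp in R] for f in R]
        pred.append(forward_step(alpha, trans[i]))

    # beta[i][f][fp] = P(r_{i+1}..r_{n-1} | f_i = f, f'_i = fp)
    beta = [None] * n
    beta[n - 1] = [[1.0] * K for _ in R]
    for i in range(n - 2, -1, -1):
        w = [[emission(i + 1, a, b) * beta[i + 1][a][b] for b in R] for a in R]
        beta[i] = backward_step(w, trans[i])

    calls, posteriors = [], []
    for i in range(n):
        # joint[g] = P(g_i = g, r)
        joint = [sum(pred[i][f][fp] * geno_emit(i, f, fp, g) * beta[i][f][fp]
                     for f in R for fp in R) for g in (0, 1, 2)]
        total = sum(joint)
        posteriors.append([x / total for x in joint])
        calls.append(joint.index(max(joint)))
    return calls, posteriors

# Toy example: two loci in strong LD; only the first locus has reads.
T = [[0.99, 0.01], [0.01, 0.99]]
calls, post = posterior_decode(
    init=[0.5, 0.5], trans=[T],
    emit=[[0.01, 0.99], [0.01, 0.99]],
    read_lik=[[1e-6, 0.01, 0.9], [1.0, 1.0, 1.0]])
print(calls, post[1])
```

In the toy example the second locus has no reads, so its call is driven entirely by LD with the first locus. Splitting the sum over founder-state pairs into two passes is one way to reuse common terms across the K^2 target states.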

Runtime
- Direct implementation gives O(m + nK^4) time, where
  - m = number of reads
  - n = number of SNPs
  - K = number of founder haplotypes in HMMs
- Runtime is reduced to O(m + nK^3) by reusing common terms

Read Data
- Subset of James Watson’s 454 reads
  - 74.4 million reads with quality scores (of [...] million reads used in [Wheeler et al 08]) downloaded from ftp://ftp.ncbi.nih.gov/pub/TraceDB/Personal_Genomics/Watson/
  - Average read length ~265 bp

Read Mapping Procedure
- Reads were mapped on human genome build 36.3 using the nucmer tool of the MUMmer package [Kurtz et al 04]
  - Default nucmer parameters (MUM size 20, min cluster size 65, max gap between adjacent matches 90)
  - Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
  - Reads meeting the above conditions at multiple genome positions (likely coming from genomic repeats) were discarded
- Simulated 454 reads generated using ReadSim [Schmid et al 07] were used to estimate mapping error rates:
  - FP rate: 0.37%
  - FN rate: 21.16%
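
The additional filtering step can be sketched as a post-processing pass over alignment records. The record layout used here (dictionaries with `read_id`, `read_len`, `matched_len`, and `errors` keys) is hypothetical and does not correspond to nucmer's output format; only the thresholds come from the slide.

```python
from collections import Counter

def filter_alignments(alignments, min_match_frac=0.90, max_errors=10):
    """Keep alignments of uniquely mapped reads that pass the quality filters.

    alignments: list of dicts with hypothetical keys
      read_id, read_len, matched_len, errors (mismatches plus indels).
    """
    passing = [a for a in alignments
               if a["matched_len"] >= min_match_frac * a["read_len"]
               and a["errors"] <= max_errors]
    # Reads that pass the filters at more than one position are discarded.
    hits = Counter(a["read_id"] for a in passing)
    return [a for a in passing if hits[a["read_id"]] == 1]

example = [
    {"read_id": "r1", "read_len": 260, "matched_len": 250, "errors": 3},
    {"read_id": "r2", "read_len": 260, "matched_len": 200, "errors": 2},   # too short a match
    {"read_id": "r3", "read_len": 270, "matched_len": 268, "errors": 1},
    {"read_id": "r3", "read_len": 270, "matched_len": 265, "errors": 4},   # second hit: repeat
    {"read_id": "r4", "read_len": 250, "matched_len": 249, "errors": 12},  # too many errors
]
print([a["read_id"] for a in filter_alignments(example)])
```

Note that uniqueness is checked only among alignments that already pass the length and error thresholds, matching the order in which the conditions are listed on the slide.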

Read Mapping Results
- Average coverage of Hapmap SNPs by mapped reads was 5.64x
  - Lower than in [Wheeler et al 08] since we start with a subset of the reads and use more stringent mapping constraints

Haplotype and Genotype Data
- CEU genotypes from the latest Hapmap release (23a) were downloaded
  - Genotypes were phased using the ENT algorithm [Gusev et al 08], and the inferred haplotypes were used to train the parent HMM models using the Baum-Welch algorithm
- Duplicate Affymetrix 500k SNP genotypes were downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE10668/GSE10668_family.soft.gz
  - We removed genotypes that were discordant in the two replicates and genotypes for which Hapmap and Affymetrix annotations had more than 5% difference in CEU same-strand allele frequency

Accuracy Comparison (Homozygous Genotypes)

Accuracy Comparison (Heterozygous Genotypes)

Accuracy Comparison (All Genotypes)

Accuracy at Varying Coverages (All Genotypes)

Conclusion
- Exploiting “heuristic inputs” such as quality scores and population allele frequency and LD information yields significant improvements in genotype calling accuracy from low-coverage sequencing data
  - LD information extracted from a reference panel gives the highest benefit
  - The relatively small gain from incorporating quality scores may be due in part to the poor calibration of 454 quality scores [Brockman et al 08, Quinlan et al 08]
- Although our evaluation is on 454 reads, the methods are well-suited for short-read technologies
- Ongoing work includes modeling ambiguities in read mapping and extending the methods to population sequencing data (removing the need for reference panels)

Acknowledgments
- This work was supported in part by NSF (IIS, DBI, and CCF awards) and by the University of Connecticut Research Foundation

Questions?