Ion Mandoiu Computer Science and Engineering Department

1 Linkage Disequilibrium Based SNP Genotype Calling from Short Sequencing Reads
Ion Mandoiu
Computer Science and Engineering Department, University of Connecticut
Joint work with S. Dinakar, J. Duitama, Y. Hernández, J. Kennedy, and Y. Wu

2 Ultra-High Throughput Sequencing
Recent massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to classic Sanger sequencing
- Roche/454 FLX Titanium: 400bp reads, 400Mb/10h run
- ABI SOLiD 2.0: 25-35bp reads, 3-4Gb/6 day run
- Helicos HeliScope: 25-55bp reads, >1Gb/day
- Illumina Genome Analyzer II: 35-50bp reads, 1.5Gb/2.5 day run
Notes
- SBS: Sequencing by Synthesis
- SBL: Sequencing by Ligation
- Challenges in genome assembly: the short read lengths and absence of paired ends make it difficult for assembly software to disambiguate repeat regions, resulting in fragmented assemblies.
- New types of sequencing error: in 454, these include incorrect estimates of homopolymer lengths, ‘transposition-like’ insertions (a base identical to a nearby homopolymer is inserted in a nearby nonadjacent location), and errors caused by multiple templates attached to the same bead.

3 Personal Genomes: The Future is Now!
C. Venter, J. Watson, NA18507

4 Challenges for Genomic Medicine at Single-Base Resolution
- Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements)
- Requires accurate determination of both alleles at variable loci
- Accuracy is limited by coverage depth due to the random nature of shotgun sequencing
- For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips has shown only ~75% accuracy for sequencing-based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08]
- [Wendl&Wilson 08] predict that 21x coverage is required for sequencing of normal tissue samples, based on idealized theory that “neglects any heuristic inputs”
- What heuristic inputs help? How much can we gain from improved data analysis?
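To make the coverage limitation concrete, here is a minimal sketch of how often both alleles of a heterozygous site are sampled at a given average coverage. It rests on idealized assumptions that are not from the slides: read depth at a site is Poisson with mean equal to the average coverage, each read comes from either allele with probability 1/2, and sequencing and mapping errors are ignored. The function names and the `min_reads` threshold are illustrative only.

```python
import math


def poisson_pmf(n, mean):
    """Probability of exactly n events under a Poisson distribution."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


def prob_both_alleles_seen(mean_coverage, min_reads=1):
    """P(each allele of a heterozygous site is covered by >= min_reads reads).

    Assumption: depth is Poisson(mean_coverage) and each read samples one of
    the two alleles with probability 1/2, so the per-allele read counts are
    independent Poisson(mean_coverage / 2) variables (Poisson thinning).
    """
    half = mean_coverage / 2.0
    p_one_allele = 1.0 - sum(poisson_pmf(n, half) for n in range(min_reads))
    return p_one_allele ** 2


if __name__ == "__main__":
    for c in (7.5, 21.0):
        for k in (1, 2, 3):
            print(f"coverage {c:>4}x, >= {k} reads per allele: "
                  f"{prob_both_alleles_seen(c, k):.3f}")
```

Requiring more than one supporting read per allele (to guard against sequencing errors) lowers the probability quickly at low coverage, which is one simple way to see why heterozygous calls suffer first.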

5 Linkage Disequilibrium: Sources & Modeling
HMM model of haplotype frequencies
[Figure: HMM with hidden founder states F1, F2, ..., Fn emitting observed alleles H1, H2, ..., Hn]
- Fi = founder haplotype at locus i, Hi = observed allele at locus i
- P(Fi), P(Fi | Fi-1) and P(Hi | Fi) are estimated from a reference panel such as HapMap
- For a given haplotype h with n SNPs, P(H=h|M) can be computed in O(nK^2) time using the forward algorithm, where K = number of founders
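The forward computation of P(H=h|M) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter layout (`init`, `trans`, `emit`), the 0/1 allele encoding, and the toy numbers are assumptions made for the example.

```python
from itertools import product


def haplotype_probability(h, init, trans, emit):
    """Forward algorithm for P(H = h | M) in O(n * K^2) time.

    Assumed (hypothetical) parameter layout:
      init[k]        = P(F_1 = k)
      trans[i][j][k] = P(F_{i+2} = k | F_{i+1} = j), for i = 0 .. n-2
      emit[i][k][a]  = P(H_{i+1} = a | F_{i+1} = k), alleles a in {0, 1}
    """
    n, K = len(h), len(init)
    # alpha[k] = P(h_1..h_i, F_i = k) for the current locus i
    alpha = [init[k] * emit[0][k][h[0]] for k in range(K)]
    for i in range(1, n):
        alpha = [
            emit[i][k][h[i]]
            * sum(alpha[j] * trans[i - 1][j][k] for j in range(K))
            for k in range(K)
        ]
    return sum(alpha)


# Toy model with K = 2 founders and n = 3 SNPs (made-up numbers).
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]]] * 2
emit = [[[0.95, 0.05], [0.10, 0.90]]] * 3

# Sanity check: the probabilities of all 2^n haplotypes sum to 1.
total = sum(
    haplotype_probability(h, init, trans, emit)
    for h in product((0, 1), repeat=3)
)
print(f"sum over all haplotypes = {total:.6f}")
```

For long haplotypes the plain products underflow, so a practical version would rescale `alpha` at each locus or work in log space.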

6 Pipeline for LD-Based Genotype Calling
[Figure: pipeline diagram with labeled stages: Read sequences, Quality scores, Reference genome sequence, Mapped reads, HapMap genotypes, SNP genotype calls]
NOTE: P(g|r) is NP-Hard…
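As a hedged illustration of how mapped reads and quality scores could feed a genotype call, the sketch below computes posterior probabilities for the three genotypes at a single biallelic SNP. It is a single-locus baseline that ignores LD entirely, so it is not the HMM-based method of this talk. It assumes Phred-scaled quality scores (error probability 10^(-Q/10)), errors spread uniformly over the three other bases, independent reads, and a caller-supplied genotype prior; all names are illustrative.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10.0 ** (-q / 10.0)


def genotype_posteriors(pileup, a, b, priors=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior over genotypes aa, ab, bb at one SNP (no LD information).

    pileup: list of (base, phred_quality) pairs for reads covering the SNP.
    Assumes quality > 0 and strictly positive priors.
    """
    genotypes = [(a, a), (a, b), (b, b)]
    log_post = []
    for (g1, g2), prior in zip(genotypes, priors):
        ll = math.log(prior)
        for base, q in pileup:
            e = phred_to_error(q)
            # P(observed base | true allele): 1 - e if it matches, e / 3 otherwise
            p1 = 1.0 - e if base == g1 else e / 3.0
            p2 = 1.0 - e if base == g2 else e / 3.0
            # Each read is drawn from either chromosome with probability 1/2
            ll += math.log(0.5 * p1 + 0.5 * p2)
        log_post.append(ll)
    # Normalize in log space to avoid underflow at high coverage
    m = max(log_post)
    weights = [math.exp(x - m) for x in log_post]
    z = sum(weights)
    return {g1 + g2: w / z for (g1, g2), w in zip(genotypes, weights)}


if __name__ == "__main__":
    reads = [("A", 30)] * 4 + [("G", 30)] * 4
    print(genotype_posteriors(reads, "A", "G"))
```

In an LD-aware caller, per-SNP likelihoods of this kind would be combined with haplotype frequencies from a reference panel rather than with a fixed prior.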

7 Genotype Calling Accuracy vs. Coverage
[Figure: two panels, "Watson/454 reads" and "NA18507/Illumina reads"]

8 Conclusions & Ongoing Work
- Exploiting LD information yields significant improvements in genotype calling accuracy and/or cost reduction
- The accuracy achieved by the previously proposed binomial test is matched by the HMM-based posterior decoding algorithm using less than 1/4 of the reads
Ongoing work
- Modeling ambiguities in read mapping
- Haplotype inference
- Extension to population sequencing data (removing the need for reference panels)
ACKNOWLEDGEMENTS
This work was supported in part by NSF under awards IIS and DBI to IM and IIS to YW. SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education” funded by NSF under award CCF

