Published by Alice Brooks. Modified over 8 years ago.
1
Maximum Likelihood Estimation of Incomplete Genomic Spectrum from HTS Data
Serghei Mangul, Department of Computer Science, Georgia State University
Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and Alex Zelikovsky
2
Outline
Introduction
ML Model
EM Algorithm
VSEM Algorithm
Experimental Results
—RNA-Seq
—454
Conclusions and future work
WABI 2011, Max-Planck-Institut für Informatik, Saarbrücken, Germany
4
Advances in High-Throughput Sequencing (HTS)
Image source: http://www.economist.com/node/16349358
Roche/454 FLX Titanium: 400-600 million bases/run, 400 bp avg. read length
Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100 bp read length
SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50 bp read length
6
ML Model
Panel: bipartite graph
—LEFT: genomic sequences (strings)
>unknown frequencies
—RIGHT: reads
>observed frequencies
—EDGES: probability that the read is emitted by the string
>weights are calculated from the mapping of the reads to the strings
[Figure: bipartite graph connecting strings S1, S2, S3 to reads R1, R2, R3, R4]
8
ML estimates of string frequencies
The probability that a read is sampled from string j is proportional to its frequency f(j)
The ML estimate of f(j) is n(j)/(n(1) + ... + n(N))
—n(j): number of reads sampled from string j
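As a small illustration of the estimate n(j)/(n(1) + ... + n(N)), the sketch below normalizes per-string read counts. The function name `ml_frequencies` and the counts are hypothetical; in practice n(j) is not observed directly, because a read can map to several strings.

```python
def ml_frequencies(read_counts):
    """Return n(j) / (n(1) + ... + n(N)) for every string j."""
    total = sum(read_counts)
    if total == 0:
        raise ValueError("no reads were assigned to any string")
    return [n / total for n in read_counts]


# Hypothetical counts: 25, 50 and 25 reads sampled from S1, S2 and S3.
freqs = ml_frequencies([25, 50, 25])
print(freqs)  # [0.25, 0.5, 0.25]
```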
9
EM algorithm
Initialization
E-step: Compute the expected number n(j) of reads that come from string j, assuming that the current string frequencies f(j) are correct
M-step: For each string j, set the new value of f(j) to the fraction of all reads in the sample that originate from string j
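The E-step and M-step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the uniform initialization, the convergence tolerance, the function name `em_frequencies`, and the example weights are all assumptions. Here `weights[j][i]` stands for the probability that string j emits read i, and reads that no string can emit are skipped.

```python
def em_frequencies(weights, read_counts, max_iterations=1000, tol=1e-10):
    """Estimate string frequencies f(j) by expectation-maximization."""
    n_strings = len(weights)
    f = [1.0 / n_strings] * n_strings  # assumed initialization: uniform
    for _ in range(max_iterations):
        # E-step: expected number n(j) of reads coming from each string.
        expected = [0.0] * n_strings
        for i, count in enumerate(read_counts):
            denom = sum(f[j] * weights[j][i] for j in range(n_strings))
            if denom == 0.0:
                continue  # read cannot be emitted by any string in the panel
            for j in range(n_strings):
                expected[j] += count * f[j] * weights[j][i] / denom
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: f(j) becomes the fraction of reads attributed to string j.
        new_f = [n / total for n in expected]
        change = max(abs(a - b) for a, b in zip(f, new_f))
        f = new_f
        if change < tol:
            break
    return f


# Hypothetical panel: S1 emits R1 and R2, S2 emits R2 and R3.
weights = [[0.5, 0.5, 0.0],
           [0.0, 0.5, 0.5]]
f = em_frequencies(weights, read_counts=[20, 50, 30])
print(f)  # approaches [0.4, 0.6]
```

In this toy panel the shared read R2 carries no information about the split, so the estimate is driven by the uniquely mapped reads (20 versus 30).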
11
ML Model Quality
How well does the maximum likelihood model explain the reads?
Measured by the deviation between expected and observed read frequencies
—expected read frequency: [formula not preserved in this transcript]
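The formula for the expected read frequency did not survive in this transcript, so the sketch below rests on two assumptions: that the expected frequency of a read is the sum, over strings, of the string frequency times the normalized emission weight of that read, and that the deviation is the mean absolute difference between observed and expected frequencies. The function names are hypothetical.

```python
def expected_read_frequencies(weights, f):
    """Assumed form: e(i) = sum_j f(j) * w(j, i) / sum_l w(j, l)."""
    n_reads = len(weights[0])
    expected = [0.0] * n_reads
    for j, row in enumerate(weights):
        row_sum = sum(row)
        if row_sum == 0:
            continue  # string emits no reads
        for i, w in enumerate(row):
            expected[i] += f[j] * w / row_sum
    return expected


def deviation(observed, expected):
    """Assumed form: mean absolute difference over all reads."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


# Complete panel: the model reproduces the observed frequencies.
complete = expected_read_frequencies([[1, 1, 0], [0, 1, 1]], [0.4, 0.6])
d_complete = deviation([0.2, 0.5, 0.3], complete)

# Incomplete panel: read 3 is emitted by no known string.
incomplete = expected_read_frequencies([[1, 1, 0]], [1.0])
d_incomplete = deviation([0.25, 0.25, 0.5], incomplete)
print(d_complete, d_incomplete)
```

A deviation close to zero suggests the panel explains the reads; a large deviation hints that some strings are missing.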
12
VSEM: Virtual String EM
[Flowchart]
Start: (incomplete) panel + virtual string with 0-weights in the virtual string
EM: ML estimates of string frequencies
Compute expected read frequencies
Update weights of reads in virtual string
EM: ML estimates of string frequencies
Virtual string frequency change > ε?
—YES: repeat
—NO: output string frequencies and weights
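The loop in the flowchart can be sketched as below. The slides do not state how the virtual string weights are updated, so the rule used here (add to each read's virtual weight the amount by which its observed frequency exceeds its expected frequency) is an assumption made for illustration, as are the function names, the uniform EM initialization, and the row normalization of weights.

```python
def normalize(row):
    total = sum(row)
    return [w / total for w in row] if total > 0 else list(row)


def em(weights, observed, max_iterations=1000, tol=1e-10):
    """Plain EM over read frequencies; weights[j][i] = P(read i | string j)."""
    k = len(weights)
    f = [1.0 / k] * k
    for _ in range(max_iterations):
        n = [0.0] * k
        for i, o in enumerate(observed):
            denom = sum(f[j] * weights[j][i] for j in range(k))
            if denom > 0:
                for j in range(k):
                    n[j] += o * f[j] * weights[j][i] / denom
        total = sum(n)
        if total == 0:
            break
        new_f = [x / total for x in n]
        change = max(abs(a - b) for a, b in zip(f, new_f))
        f = new_f
        if change < tol:
            break
    return f


def vsem(panel_weights, observed, eps=1e-6, max_rounds=100):
    """Return (frequencies with the virtual string last, virtual read weights)."""
    panel = [normalize(row) for row in panel_weights]
    virtual = [0.0] * len(observed)  # virtual string starts with 0-weights
    previous = None
    f = []
    for _ in range(max_rounds):
        weights = panel + [normalize(virtual)]
        f = em(weights, observed)
        if previous is not None and abs(f[-1] - previous) <= eps:
            break  # virtual string frequency changed by at most eps
        previous = f[-1]
        expected = [sum(f[j] * weights[j][i] for j in range(len(weights)))
                    for i in range(len(observed))]
        # Assumed update rule: grow the weight of under-explained reads.
        virtual = [v + max(0.0, o - e)
                   for v, o, e in zip(virtual, observed, expected)]
    return f, virtual


# Hypothetical incomplete panel: S1 emits R1 and R2, nothing emits R3.
f, virtual = vsem([[1, 1, 0]], observed=[0.25, 0.25, 0.5])
print(f, virtual)
```

In this toy case the virtual string absorbs the unexplained read R3, and its final frequency is an estimate of the total frequency of the missing strings.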
13
Example: 1st iteration
[Figure: full panel (strings S1, S2, S3) and incomplete panel (strings S1, S2), each connected to reads R1, R2, R3, R4; observed read frequencies O = 0.25]
14
Example: 1st iteration
[Figure: same full and incomplete panels, with a virtual string VS added to the incomplete panel; observed read frequencies O = 0.25]
15
Example: 1st iteration
[Figure: full panel (S1, S2, S3) and incomplete panel (S1, S2, VS) with reads R1, R2, R3, R4; observed (O) and expected (E) read frequencies]
Full panel: O = .25; ML = .25, .5, .25
Incomplete panel: O = .25; E = .32, .32, .16, .16; ML = .34, .66
16
Example: 1st iteration
[Figure: full panel (S1, S2, S3) and incomplete panel (S1, S2, VS) with reads R1, R2, R3, R4; observed (O) and expected (E) read frequencies]
Full panel: O = .25; ML = .25, .5, .25, 0
Incomplete panel: O = .25; E = .3, .3, .15, .15; ML = .32, .65, .02
17
Example: last iteration
[Figure: full panel (S1, S2, S3) and incomplete panel (S1, S2, VS) with reads R1, R2, R3, R4; observed (O) and expected (E) read frequencies]
Full panel: ML = .25, .5, .25, 0
Incomplete panel: ML = .20, .6, .2
18
VSEM: Virtual String EM
Decide if the panel is likely to be incomplete
Estimate total frequency of missing strings
Identify read spectrum emitted by missing strings
19
VSEM: Applications
RNA-Seq
—inferring isoform expressions from RNA-Seq
Viral quasispecies sequencing by 454 pyrosequencing
—inferring viral quasispecies spectrum from pyrosequencing shotgun reads
21
RNA-Seq
[Figure: RNA-Seq workflow: make cDNA & shatter into fragments -> sequence fragment ends -> map reads; segments labeled A, B, C, D, E with variants ABC, AC, DE]
Analyses: Gene Expression (GE), Isoform Discovery (ID), Isoform Expression (IE)
22
Previous Approach
IsoEM [Nicolae et al. 2011]: novel expectation-maximization algorithm for inference of alternative splicing isoform frequencies from RNA-Seq data
—Single and/or paired reads
—Fragment length distribution
—Strand information
—Base quality scores
—Insert sizes (library preparation)
23
Simulation Setup
Human genome UCSC/CCDS known isoforms
—UCSC: 66803 isoforms, 19372 genes
—CCDS: 20829 isoforms, 17373 genes
GNFAtlas2 gene expression levels
—geometric expression of gene isoforms
Normally distributed fragment lengths
—Mean 250, std. dev. 25
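Two ingredients of this setup, geometric isoform expression within a gene and normally distributed fragment lengths, can be sketched as below. The function names, the default p = 0.5, the renormalization over a finite number of isoforms, and the rounding and clamping of lengths are assumptions made for illustration; the slides give only the distribution families and the mean and standard deviation.

```python
import random


def geometric_isoform_frequencies(n_isoforms, p=0.5):
    """Relative isoform frequencies proportional to p * (1 - p)**k."""
    raw = [p * (1 - p) ** k for k in range(n_isoforms)]
    total = sum(raw)
    return [x / total for x in raw]  # renormalized to sum to 1


def sample_fragment_length(rng, mean=250.0, std=25.0):
    """Draw one fragment length from a normal distribution."""
    return max(1, round(rng.gauss(mean, std)))  # keep lengths positive


rng = random.Random(0)  # fixed seed so the sketch is repeatable
isoform_freqs = geometric_isoform_frequencies(3)
lengths = [sample_fragment_length(rng) for _ in range(1000)]
print(isoform_freqs)
print(sum(lengths) / len(lengths))  # close to 250
```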
24
EXP1: Reduced transcriptome data
Comparison between IsoEM and IsoVSEM on reduced transcriptome data
—in every gene, 25% of the isoforms are missing
—isoform frequencies within a gene follow a geometric distribution (p = 0.5)
—only genes with at most 3 isoforms are selected
—isoforms with frequency 0.25 are removed
25
EXP2: CCDS panel
The UCSC database represents the full panel
CCDS represents the incomplete panel
—reads were generated from the UCSC library of isoforms
—only frequencies of known isoforms (CCDS) were estimated
26
Error Fraction Curves
[Figure: error fraction curves for EXP1, 30M reads of length 25]
28
454 Pyrosequencing
Pyrosequencing = sequencing by synthesis
GS FLX Titanium:
—Divides the source genetic material into reads (300-800 bp)
29
Previous Approach
ViSpA [Astrovskaya et al. 2011]: viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyrosequencing shotgun reads
—align reads
—build a read graph:
>V: reads
>E: overlaps between reads
>each path: a candidate sequence
—filter based on ML frequencies
30
ViSpA-VSEM
[Flowchart: reads -> ViSpA weighted assembler -> assembled Qsps -> ViSpA ML estimator (removing duplicated & rare qsps) -> Qsps library -> VSEM (Virtual String EM) -> stopping condition]
—NO: reads and weights are passed back to the weighted assembler
—YES: output viral spectrum + statistics
31
Simulation Setup
Real quasispecies sequence data from [von Hahn et al. 2006]
—44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus
—population sizes: 10, 20, 30, and 40 sequences
—population distributions: geometric, skewed normal, uniform
32
Experimental Validation of VSEM
Detection of panel incompleteness
—VSEM can detect >1% of missing strings
Improving quasispecies frequency estimates
Detection of reads emitted by missing strings
—Correlation between predicted reads and reads emitted by missing strings is >65%
33
VSEM improving frequency estimates
                          % of missing strings
                          <10%         10%-20%      20%-30%      30%-40%      40%-50%      >50%
Method       r.l./n.r.    r     err    r     err    r     err    r     err    r     err    r     err
ViSpA-EM     100/20K      90.2  4.5    91.0  6.8    75.4  5.1    68.6  1.6    40.8  2.3    39.8  10.4
ViSpA-VSEM   100/20K      91.6  2.3    92.8  4.4    76.5  4.1    70.5  1.4    54.2  2.0    50.8  7.4
ViSpA-EM     300/20K      95.7  3.8    93.2  10.2   89.8  1.0    66.7  1.5    62.1  2.1    46.8  9.7
ViSpA-VSEM   300/20K      95.4  1.7    95.8  1.1    96.9  0.6    85.7  0.9    88.0  0.9    60.4  2.6
ViSpA-EM     100/100K     95.2  4.5    93.9  9.1    84.8  1.4    74.2  1.8    74.5  2.3    73.4  9.9
ViSpA-VSEM   100/100K     97.8  2.6    95.6  3.0    86.3  1.3    79.8  1.7    79.0  2.1    74.2  8.8
ViSpA-EM     300/100K     96.2  3.9    88.6  12.4   88.9  1.0    85.1  1.4    75.1  2.3    49.5  10.5
ViSpA-VSEM   300/100K     96.2  2.0    92.8  0.9    93.7  0.7    90.2  1.2    84.4  1.7    67.1  4.8
r.l./n.r.: read length / number of reads; r: correlation between real and predicted frequencies; err: average prediction error
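The two reported metrics can be computed as sketched below. The slides define r only as a correlation and err only as an average prediction error, so the use of the Pearson correlation coefficient and of the mean absolute difference are assumptions, as are the function names and the example frequencies.

```python
import math


def pearson_r(real, predicted):
    """Pearson correlation between real and predicted frequencies."""
    n = len(real)
    mean_x = sum(real) / n
    mean_y = sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(real, predicted))
    var_x = sum((x - mean_x) ** 2 for x in real)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_error(real, predicted):
    """Assumed definition: mean absolute difference of frequencies."""
    return sum(abs(x - y) for x, y in zip(real, predicted)) / len(real)


# Hypothetical real and predicted quasispecies frequencies.
real = [0.5, 0.3, 0.2]
predicted = [0.45, 0.35, 0.2]
r = pearson_r(real, predicted)
err = average_error(real, predicted)
print(r, err)
```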
34
ViSpA vs ViSpA-VSEM
                ViSpA                                ViSpA-VSEM
Distribution    PPV    Sensitivity  r      err      PPV    Sensitivity  r      err    Gain
Geometric       0.767  0.5          0.954  7.36     0.591  0.73         0.909  2.91   2.3
Skewed          0.733  0.4          0.673  13.01    0.701  0.77         0.967  2.5    4
Uniform         0.733  0.4          0.716  12.76    0.645  0.73         0.976  2.34   3.7
100K reads from 10 quasispecies (QSPS), average read length 300
r: correlation between real and predicted frequencies; err: average prediction error
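PPV and sensitivity for an assembled quasispecies set follow the standard definitions: PPV is the fraction of predicted sequences that are correct, and sensitivity is the fraction of true sequences that are recovered. In the sketch below a prediction counts as correct only on an exact sequence match, which is an assumption; the function name and the toy sequences are hypothetical.

```python
def ppv_and_sensitivity(true_sequences, predicted_sequences):
    """Return (PPV, sensitivity) using exact sequence matches."""
    truth = set(true_sequences)
    predicted = set(predicted_sequences)
    true_positives = len(truth & predicted)
    ppv = true_positives / len(predicted) if predicted else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    return ppv, sensitivity


# Toy example: 4 true sequences, 3 predictions, 2 of them correct.
truth = ["ACGT", "ACGA", "ACGC", "ACGG"]
predictions = ["ACGT", "ACGA", "TTTT"]
ppv, sensitivity = ppv_and_sensitivity(truth, predictions)
print(ppv, sensitivity)
```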
36
Conclusions
We propose VSEM, a novel modification of the EM algorithm
—improves the ML frequency estimates of multiple genomic sequences
—identifies reads that belong to unassembled (missing) sequences
We applied VSEM to improve two tools:
—IsoEM
—ViSpA
37
Future work
Assemble strings from the set of reads emitted by missing strings
Improve other metagenomics tools
38
Acknowledgments
NSF awards IIS-0546457, IIS-0916948, and DBI-0543365
NSF award IIS-0916401
Agriculture and Food Research Initiative Competitive Grant no. 2011-67016-30331 from the USDA National Institute of Food and Agriculture
39
Thanks