
1 Viral Quasispecies Reconstruction from Amplicon 454 Pyrosequencing Reads
Nicholas Mancuso, Department of Computer Science, Georgia State University
Joint work with Bassam Tork (GSU), Pavel Skums (CDC), Ion Măndoiu (UConn), Alex Zelikovsky (GSU)
CAME 2011, Atlanta, Georgia

2 Viral Quasispecies and NGS
RNA viruses
- HIV, HCV, SARS, influenza
- higher mutation rates than DNA viruses
- result: quasispecies, a set of closely related variants rather than a single species
Knowing the quasispecies can help
- effectiveness of interferon HCV therapy (Skums et al. 2011)
NGS makes it possible to find individual quasispecies sequences
- 454 Life Sciences: Mb with reads bp long
Sequencing is challenging
- multiple quasispecies
- quasispecies (qsps) sequences are very similar
- different qsps may be indistinguishable for > 1 kb (longer than the reads)

3 Outline
- Shotgun vs Amplicon Sequencing
- Viral Quasispecies Reconstruction Problem
- Challenges and Approaches
- Data Structure for Reads: Read Graph
- Novel Methods for Solving QSR Problem
- Observed vs True Read Frequencies
- True Frequency Reconstruction
- Simulations and Results

4 Shotgun versus Amplicon Sequencing
Shotgun reads
- starting positions are distributed uniformly
Amplicon reads
- each read has a predefined start and end, covering fixed overlapping windows
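The contrast between the two sequencing modes can be sketched in a few lines. This is only an illustration: the function names, the window length, the shift, and the rule that the last window is anchored at the end of the fragment are all assumptions, not details taken from the slides.

```python
import random


def shotgun_starts(genome_len, read_len, n_reads, seed=0):
    """Shotgun: read start positions drawn uniformly along the fragment."""
    rng = random.Random(seed)
    return [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]


def amplicon_windows(genome_len, amplicon_len, shift):
    """Amplicon: fixed half-open windows (start, end).

    Consecutive windows start `shift` positions apart, so they overlap by
    amplicon_len - shift. The last window is anchored at the fragment end
    (an assumption made here so that the whole fragment is covered).
    """
    if not 0 < shift < amplicon_len <= genome_len:
        raise ValueError("need 0 < shift < amplicon_len <= genome_len")
    windows = []
    start = 0
    while start + amplicon_len < genome_len:
        windows.append((start, start + amplicon_len))
        start += shift
    windows.append((genome_len - amplicon_len, genome_len))
    return windows


# Hypothetical sizes, chosen only for illustration.
windows = amplicon_windows(genome_len=1000, amplicon_len=300, shift=250)
starts = shotgun_starts(genome_len=1000, read_len=300, n_reads=5)
```

Every amplicon read in a window shares the same start and end, which is why the uniform-start assumption of shotgun methods does not carry over.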

5 Viral Quasispecies Spectrum Reconstruction Problem
Given
- a collection of amplicon reads from a quasispecies population with unknown variants and distribution
Find
- the viral quasispecies sequences and their frequencies

6 Amplicon Sequencing Challenges
Collapse of quasispecies in an amplicon
- distinct quasispecies may be indistinguishable within a window
Collapse of quasispecies in an overlap
- reads from consecutive windows that come from the same qsp must be matched
First approach: Prosperi et al. (2011)
- guide distribution
- choose a column
- go right/left, matching the closest in-order neighbor

7 Approaches to QSP Reconstruction
Shotgun approaches
- estimate the probability that consecutive reads come from the same qsp (ViSpA, Astrovskaya et al. 2011)
- parsimony: minimum number of distinct sequences covering all reads (ShoRAH, Zagordi et al. 2010)
Why not use shotgun approaches for amplicons?
- probability estimation in ViSpA relies on a uniform distribution of reads
- amplicon reads have fixed beginnings and ends
Optimization approach
- most parsimonious solution
  - minimize the number of distinct sequences covering all reads
  - too coarse: many different optimal solutions
- minimum information entropy (Shannon, 1948)
  - also takes frequencies into account
  - fractional relaxation of pure parsimony

8 Min Entropy vs Parsimony
Parsimony and min entropy both select AC and BD if a = c and b = d
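A small numeric check of this case, as a sketch: take a fork with left reads A and B (counts a, b) and right reads C and D (counts c, d). The counts below are made up, and the "alternative" solution is just one other assignment that is consistent with the same read counts.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of the frequency distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical fork with a = c = 3 and b = d = 1.
# Both solutions use A three times, B once, C three times, D once.
parsimonious = {"AC": 3, "BD": 1}          # 2 distinct sequences
alternative = {"AC": 2, "AD": 1, "BC": 1}  # 3 distinct sequences

h_parsimonious = entropy(parsimonious.values())
h_alternative = entropy(alternative.values())
```

Here the solution {AC, BD} has both fewer sequences and lower entropy, so the two criteria agree.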

9 Data Structure for Reads: Read Graph
K amplicons → K-staged read graph
- vertices → distinct reads
- edges → reads with consistent overlap
- vertices and edges have a count function
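A minimal sketch of such a staged graph, under stated assumptions: reads are plain strings grouped by amplicon, consecutive amplicons overlap by a fixed number of positions, and "consistent overlap" is taken to mean that the suffix of the left read equals the prefix of the right read. Only vertex counts are modeled; the function name and input layout are hypothetical.

```python
from collections import Counter


def build_read_graph(reads_per_amplicon, overlap):
    """Build a K-staged read graph from K lists of reads.

    Returns (stages, edges) where stages[k] maps each distinct read of
    amplicon k to its count, and edges holds (k, left_read, right_read)
    for reads of stages k and k + 1 that agree on the overlap.
    """
    stages = [Counter(reads) for reads in reads_per_amplicon]
    edges = set()
    for k in range(len(stages) - 1):
        for left in stages[k]:
            for right in stages[k + 1]:
                if left[-overlap:] == right[:overlap]:
                    edges.add((k, left, right))
    return stages, edges


# Toy input: two amplicons overlapping by 2 positions.
stages, edges = build_read_graph(
    [["ACGT", "ACGT", "ACGA"], ["GTTT", "GATT"]], overlap=2
)
```

Reads that share the same overlap string form one group of mutually connected vertices, which is what the fork representation captures.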

10 Read Graph
The graph may be transformed into a 'forked' graph
- each overlap is represented by a fork vertex

11 Fork Resolving Problem
Minimum entropy is NP-hard
- can be solved optimally for each small fork separately (future work)
Greedy heuristic
- at most a+b-1 qsps are sufficient when resolving a fork with a distinct reads on the left and b on the right
- this can be done by greedily matching the largest reads
- this does not guarantee the minimum number of distinct qsps
Better way: globally match the most frequent reads (max bandwidth)
- find an s-t path maximizing the minimum read count
- subtract the minimum count from each read in the path
- this exhausts at least one read in the path

12-20 Greedy Method (step-by-step figure sequence)
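The greedy fork-resolving heuristic can be sketched as follows. This is one reading of "greedily matching largest", not necessarily the authors' exact procedure: the fork is assumed to be balanced (left and right counts sum to the same total), and at each step the left read and the right read with the largest remaining counts are paired. The function name is hypothetical.

```python
def resolve_fork_greedy(left, right):
    """Pair left and right reads of one balanced fork.

    left, right: dicts mapping read id -> positive count.
    Returns a list of (left_read, right_read, count) matches.
    Every step exhausts at least one read and the last step exhausts two,
    so at most a + b - 1 matches are produced.
    """
    left, right = dict(left), dict(right)
    if sum(left.values()) != sum(right.values()):
        raise ValueError("fork must be balanced")
    matches = []
    while left:
        l = max(left, key=left.get)     # largest remaining left read
        r = max(right, key=right.get)   # largest remaining right read
        m = min(left[l], right[r])
        matches.append((l, r, m))
        for side, key in ((left, l), (right, r)):
            side[key] -= m
            if side[key] == 0:
                del side[key]
    return matches


matches = resolve_fork_greedy({"A": 5, "B": 3}, {"C": 4, "D": 4})
```

With a = 2 reads on the left and b = 2 on the right, the example yields three matches, which meets the a+b-1 bound.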

21-37 Maximum Bandwidth Method (step-by-step figure sequence)
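The maximum bandwidth idea (repeatedly take the path whose smallest read count is largest, then subtract that count along the path) can be sketched with a stage-by-stage dynamic program. The data layout below (one dict of counts per stage, edges as (stage, left, right) triples) and the function names are assumptions made for this sketch.

```python
def widest_path(stages, edges):
    """Return (bottleneck, path) maximizing the minimum remaining count."""
    # best[k][v] = (largest bottleneck of a path ending at v, predecessor)
    best = [{v: (c, None) for v, c in stages[0].items() if c > 0}]
    for k in range(1, len(stages)):
        layer = {}
        for v, c in stages[k].items():
            if c <= 0:
                continue
            for u, (width, _) in best[k - 1].items():
                if (k - 1, u, v) in edges:
                    w = min(width, c)
                    if v not in layer or w > layer[v][0]:
                        layer[v] = (w, u)
        best.append(layer)
    if not best[-1]:
        return 0, None
    end = max(best[-1], key=lambda v: best[-1][v][0])
    path = [end]
    for k in range(len(stages) - 1, 0, -1):
        path.append(best[k][path[-1]][1])
    path.reverse()
    return best[-1][end][0], path


def max_bandwidth_assembly(stages, edges):
    """Peel off widest paths until no path with positive counts remains."""
    stages = [dict(s) for s in stages]  # work on a copy
    result = []
    while True:
        width, path = widest_path(stages, edges)
        if path is None:
            return result
        result.append((tuple(path), width))
        for k, v in enumerate(path):
            stages[k][v] -= width  # exhausts at least one read on the path


toy_stages = [{"A": 5, "B": 3}, {"C": 8}, {"D": 6, "E": 2}]
toy_edges = {(0, "A", "C"), (0, "B", "C"), (1, "C", "D"), (1, "C", "E")}
assembled = max_bandwidth_assembly(toy_stages, toy_edges)
```

If the observed counts are not balanced across stages, some reads are left over when the loop stops.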

38 Observed vs Ideal Read Frequencies
Ideal frequency
- consistent across forks
Observed frequency (count)
- inconsistent across forks
All methods perform better under ideal frequencies

39 Fork Balancing Problem
Given
- a set of reads and their respective frequencies
Find
- minimal frequency offsets balancing all forks
The simplest approach is to scale frequencies from left to right
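One possible reading of left-to-right scaling, as a sketch: visit the forks in order and rescale the reads on the right side of each fork so that their total matches the (already scaled) total on the left side. The function name and input layout are hypothetical.

```python
def balance_left_to_right(counts, forks):
    """Scale read counts so that every fork is balanced.

    counts: dict read id -> observed count.
    forks:  list of (left_ids, right_ids), ordered from left to right.
    Returns a new dict of (possibly fractional) balanced counts.
    """
    balanced = {read: float(c) for read, c in counts.items()}
    for left_ids, right_ids in forks:
        left_total = sum(balanced[i] for i in left_ids)
        right_total = sum(balanced[i] for i in right_ids)
        if right_total == 0:
            continue  # nothing to scale on the right side
        factor = left_total / right_total
        for i in right_ids:
            balanced[i] *= factor
    return balanced


observed = {"A": 6, "B": 4, "C": 3, "D": 2, "E": 10}
forks = [(["A", "B"], ["C", "D"]), (["C", "D"], ["E"])]
balanced = balance_left_to_right(observed, forks)
```

Scaling keeps the proportions within each fork side but lets errors in the leftmost counts propagate to the right, which is one motivation for an offset-based formulation.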

40 Least Squares Approach
Quadratic program for read offsets
- q: fork
- o_i: observed frequency
- x_i: frequency offset
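The formula itself is not part of the transcript. One plausible form of such a quadratic program, given purely as an assumption, minimizes the squared offsets subject to every fork being balanced. The sets L(q) and R(q), denoting the reads on the left and right side of fork q, are notation introduced here.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{i} x_i^{2} \\
\text{s.t.} \quad & \sum_{i \in L(q)} (o_i + x_i) = \sum_{j \in R(q)} (o_j + x_j) \quad \text{for every fork } q, \\
& o_i + x_i \ge 0 \quad \text{for every read } i.
\end{aligned}
```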

41 Flowchart

42 Data Sets and Metrics
Simulated error-free HCV data (fragment of length 1734)
- quasispecies drawn from uniform, geometric, and skewed distributions
- shift → delta of starting position
Sensitivity: percentage of true quasispecies that are correctly assembled
PPV: percentage of true quasispecies among all assembled sequences
Jensen-Shannon divergence
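The three metrics can be sketched as below. Quasispecies are represented as dicts mapping a sequence to its frequency, which is an assumption of this sketch, and the Jensen-Shannon divergence is computed over the union of true and assembled sequences with base-2 logarithms. How the authors align the two distributions is not stated in the slides.

```python
import math


def sensitivity(true_qsps, assembled):
    """Fraction of true quasispecies that were assembled exactly."""
    return len(set(true_qsps) & set(assembled)) / len(set(true_qsps))


def ppv(true_qsps, assembled):
    """Fraction of assembled sequences that are true quasispecies."""
    return len(set(true_qsps) & set(assembled)) / len(set(assembled))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (bits) between two frequency dicts."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy example: one of the two true sequences was recovered.
true_qsps = {"s1": 0.5, "s2": 0.5}
assembled = {"s1": 0.5, "s3": 0.5}
sens = sensitivity(true_qsps, assembled)
prec = ppv(true_qsps, assembled)
jsd = jensen_shannon(true_qsps, assembled)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support).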

43 Sensitivity Results

44 PPV Results

45 Divergence Results

46 ViSpA Comparison

47 Conclusion
Two novel methods for solving the QSR problem
- outperform Prosperi et al. on average
- outperform the ViSpA approach on average
The Maximum Bandwidth approach worked best
Future work: exact local solution for minimum entropy

48 Thanks