Science of Information: Case Studies in DNA and RNA assembly

Science of Information: Case Studies in DNA and RNA assembly. David Tse, Stanford University. MIIS Workshop, December 18, 2016. Research supported by the NSF Center of Science of Information.

History. In 1948, Claude Shannon invented a mathematical theory of communication. 70 years later, all communication systems are designed based on the principles of information theory.

Information and computation. Shannon looked at information limits without computational considerations. Yet 70 years of intense research ultimately revealed computationally efficient coding schemes that achieve these limits.

Information before computation. An exchange between C.E. Shannon and A.M. Turing: “What is the information limit for communication?” “But optimal decoding of general codes is NP-hard!” “I only care about problem instances that matter.”

Beyond communication. Can information be used as a guiding design principle for other problems?

High throughput sequencing revolution. Faster than Moore's Law. Implication for the IT community: Sequencing = Biochemistry + Computation.

DNA Assembly Problem. A shotgun sequencer produces N randomly located reads of length L from a genome of length G, and an assembler reconstructs the sequence from the reads. Typical scales: G = 10^6 to 10^10, N = 10^7 to 10^9, L = 10^2 to 10^4. In this talk, I want to contrast two aspects of this question: the computational aspect and the informational aspect. [Figure: example reads drawn from the sequence and passed to the assembler.]
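The sampling model on this slide can be mimicked in a few lines. This is only a sketch under stated assumptions: the genome is treated as a linear string, reads are error-free, start positions are uniform, and the function names `shotgun_reads` and `coverage_depth` are hypothetical, not taken from the talk.

```python
import random


def shotgun_reads(genome: str, n_reads: int, read_len: int, seed: int = 0) -> list[str]:
    """Sample n_reads error-free reads of length read_len at uniform random
    start positions (assumption: linear genome, no wrap-around)."""
    last_start = len(genome) - read_len
    if last_start < 0:
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # seeded so the example is repeatable
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


def coverage_depth(genome_len: int, n_reads: int, read_len: int) -> float:
    """Average number of reads covering a base: N * L / G."""
    return n_reads * read_len / genome_len


reads = shotgun_reads("ACGCATTCGCGATT", n_reads=20, read_len=5)
print(reads[:3], coverage_depth(14, 20, 5))
```

With the scales quoted above (for example G = 10^9, N = 10^8, L = 10^2), `coverage_depth` gives an average depth of 10 reads per base.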

Theory for assembly. Computation: formulate assembly as a combinatorial optimization problem, e.g., Shortest Common Superstring. Such problems are typically NP-hard, so heuristics are used. Information: how much data is required for unambiguous reconstruction? Can answering this question help “avoid” NP-hardness? If you want to talk about computational complexity, you first need a mathematical formulation of the problem. It also makes sense to develop algorithms that aim for perfect assembly. I won’t be arguing that one approach is better than the other, but that they can provide different perspectives when developing algorithms.

Key challenge: repeats. [Figure: a harder jigsaw puzzle next to an easier jigsaw puzzle.] How exactly do the information limits depend on repeats?

Data-driven information limit (Bresler, Bresler & T., BMC Bioinformatics 13). Start with an individual sequence, extract sufficient statistics, get curves. [Figure: number of repeats versus repeat length, with the information lower bound and the Lander-Waterman coverage curve, for Human Chr 19 Build 37.]

Lower bound: interleaved repeats. Necessary condition: all interleaved repeats are bridged.

Lower bound: triple repeats. Necessary condition: all triple repeats are bridged.

Approaching the limit: read-overlap graph (Shomorony, Courtade & T., Bioinformatics, 2016). The sequence is a path that visits every node: a (Generalized) Hamiltonian Path (GHP). Finding the optimal GHP is NP-hard. [Figure: read-overlap graph on the reads CGCAT, CATTC, TCGCG, ACGCA and ATTCG from the sequence ACGCATTCGCG.]
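To make the read-overlap graph concrete, here is a minimal sketch that builds it for the five reads on this slide. It assumes error-free reads and exact suffix-prefix matching; the helper names `overlap` and `overlap_graph` are hypothetical, and a real assembler would use indexing rather than this all-pairs comparison.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest proper suffix of `a` that is a prefix of `b`
    (0 if shorter than min_len). Exact matching only (assumption)."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads: list[str], min_len: int = 1) -> dict[tuple[str, str], int]:
    """Directed graph as {(read_a, read_b): overlap length} for all ordered
    pairs of distinct reads with a nonzero overlap."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k > 0:
                    graph[(a, b)] = k
    return graph


reads = ["CGCAT", "CATTC", "TCGCG", "ACGCA", "ATTCG"]
for (a, b), k in sorted(overlap_graph(reads).items(), key=lambda e: -e[1]):
    print(f"{a} -> {b}: overlap {k}")
```

The four heaviest edges chain the reads in the order ACGCA, CGCAT, CATTC, ATTCG, TCGCG, which spells ACGCATTCGCG.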

Approaching the limit: read-overlap graph. The sequence is a path that visits every node: a (Generalized) Hamiltonian Path (GHP). Finding the shortest GHP is NP-hard.

How well does a greedy algorithm do? For each node, pick the edge with the best overlap. What if there are long repeats? The greedy approach fails. We will do this sparsification/pruning based on insights from the information limits. [Figure: read-overlap graph with overlap-weighted edges around repeat(s), with the greedy choice marked.]
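One common way to realize the greedy idea is to accept edges in decreasing order of overlap, as long as each read keeps at most one successor and one predecessor and no cycle is closed. The sketch below is that variant, not necessarily the exact procedure meant on the slide; it assumes distinct, error-free reads, and `greedy_contigs` is a hypothetical name.

```python
def overlap(a: str, b: str) -> int:
    """Longest proper suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_contigs(reads: list[str]) -> list[str]:
    """Greedy path construction on the read-overlap graph.
    Assumes all reads are distinct (they are used as dictionary keys)."""
    edges = [(overlap(a, b), a, b) for a in reads for b in reads if a != b]
    edges.sort(key=lambda e: -e[0])  # best overlap first, ties by input order
    succ, pred = {}, {}
    for k, a, b in edges:
        if k == 0:
            break
        if a in succ or b in pred:
            continue  # a already extended, or b already reached
        tail = b
        while tail in succ:  # walk to the end of the chain starting at b
            tail = succ[tail]
        if tail == a:
            continue  # accepting a -> b would close a cycle
        succ[a], pred[b] = b, a
    contigs = []
    for head in reads:
        if head in pred:
            continue  # not the start of a chain
        seq, node = head, head
        while node in succ:
            nxt = succ[node]
            seq += nxt[overlap(node, nxt):]  # append the non-overlapping part
            node = nxt
        contigs.append(seq)
    return contigs


print(greedy_contigs(["CGCAT", "CATTC", "TCGCG", "ACGCA", "ATTCG"]))
```

On the repeat-free example from the read-overlap graph slide this recovers the sequence. With long repeats, the best overlap at a node can point to the wrong copy, which is the failure the slide refers to.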

Insights from information limit. The true path may need to visit a node more than once. [Figure: read-overlap graph with overlap-weighted edges.]

Insights from information limit. Can the true path visit a node more than 2 times? [Figure: read-overlap graph with overlap-weighted edges.]

Insights from information limit. Can the true path visit a node more than 2 times? The path visits each node at most 2 times. Otherwise, two paths are indistinguishable! [Figure: read-overlap graph with overlap-weighted edges.]

Not-so-greedy algorithm. Keep only the 2 best extensions at each node. Further pruning removes spurious edges. This results in a sparse read-overlap graph.
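The first step, keeping the 2 best extensions at each node, might be sketched as follows. The edge representation (a dictionary from read pairs to overlap lengths) and the name `keep_best_extensions` are assumptions for illustration; the further pruning of spurious edges is not shown because the slide does not spell it out.

```python
from collections import defaultdict


def keep_best_extensions(
    edges: dict[tuple[str, str], int], keep: int = 2
) -> dict[tuple[str, str], int]:
    """For every node, keep only its `keep` outgoing edges with the largest
    overlap. Ties are broken by the order in which edges were given."""
    outgoing = defaultdict(list)
    for (a, b), k in edges.items():
        outgoing[a].append((k, b))
    sparse = {}
    for a, extensions in outgoing.items():
        extensions.sort(key=lambda e: -e[0])  # largest overlap first
        for k, b in extensions[:keep]:
            sparse[(a, b)] = k
    return sparse


# Toy edge set with hypothetical node names and overlap weights.
edges = {("A", "B"): 4, ("A", "C"): 2, ("A", "D"): 1, ("B", "C"): 3}
print(keep_best_extensions(edges))
```

Setting `keep=1` reduces this to the plain greedy choice of a single best extension per node.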

Not-so-greedy: performance guarantee. Theorem 1: If all triple repeats are bridged, then the sparse graph has no spurious and no missing edges, i.e., the genome is an Eulerian path in the graph. Theorem 2: If, furthermore, all interleaved repeats are bridged, then the Eulerian path is unique.

Not-so-greedy: near optimality for Chr 19. Multibridging is the algorithm we propose, which is nearly optimal, at least for chromosome 19. Did we get lucky? [Figure: Not-so-greedy, lower bound and Lander-Waterman coverage curves for Human Chr 19 Build 37.]

GAGE Benchmark Datasets (http://gage.cbcb.umd.edu/). Rhodobacter sphaeroides: G = 4,603,060. Staphylococcus aureus: G = 2,903,081. Human Chromosome 14: G = 88,289,540. What about the lower bound? [Figure: Not-so-greedy and lower bound curves for each of the three datasets.]

From NP-hard to linear time. Read-overlap graph: Hamiltonian. Sparse read-overlap graph: Eulerian.
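The linear-time claim rests on the fact that an Eulerian path in a directed graph can be found in time proportional to the number of edges. Below is a minimal sketch using Hierholzer's algorithm, a standard method that the talk does not name explicitly; the function name `eulerian_path` and the toy node labels are assumptions.

```python
from collections import defaultdict
from typing import Optional


def eulerian_path(edges: list[tuple[str, str]]) -> Optional[list[str]]:
    """Return the node sequence of a path using every directed edge exactly
    once, or None if no such path exists. Each edge is pushed and popped
    once, so the running time is linear in the number of edges."""
    if not edges:
        return []
    adj = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        adj[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    starts = [n for n in nodes if out_deg[n] - in_deg[n] == 1]
    ends = [n for n in nodes if in_deg[n] - out_deg[n] == 1]
    if any(abs(out_deg[n] - in_deg[n]) > 1 for n in nodes):
        return None
    if len(starts) > 1 or len(starts) != len(ends):
        return None
    stack, path = [starts[0] if starts else edges[0][0]], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    path.reverse()
    # If some edges were never reached, the graph was not connected.
    return path if len(path) == len(edges) + 1 else None


# Node B is visited twice, as a repeat would be in the sparse graph.
print(eulerian_path([("A", "B"), ("B", "C"), ("C", "B"), ("B", "D")]))
```

In this toy graph the Eulerian path is unique. In general uniqueness is not guaranteed, which is why the second theorem needs the extra condition on interleaved repeats.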

Long-read assembler: HINGE (Kamath et al 2016, Genome Research, under review; github.com/fxia22/HINGE). Evaluation: Pacific Biosciences data on bacterial genomes (NCTC dataset). Finished assemblies out of 688 total: HINGE 583, HGAP (Chin et al, 2013) 517, Miniasm (Li, 2015) 513.

From DNA to RNA: alternatively spliced isoforms. [Figure: DNA with exons (GGCTTACC, TCGAGTTC, TATCATTTT, AAGTAAA) and introns; RNA Transcript 1: GGCTTACC TATCATTTT AAGTAAA; RNA Transcript 2: TCGAGTTC AAGTAAA; 1000’s to 10,000’s of symbols long.]

RNA-Seq assembly. The assembler reconstructs the transcriptome from reads. [Figure: reads of length L = 5 (CGAGT, TCAAG, AGTAA) sampled from the transcripts GGCTTACC TATCATTTT AAGTAAA and TCGAGTTC AAGTAAA.]

RNA assembler: Shannon (Kannan, Pachter & T., Nature Biotech, under review; http://sreeramkannan.github.io/Shannon). Evaluations: 135M Illumina L = 50 reads from human embryonic stem cells (Au et al, PNAS 2013); 110 million Illumina L = 101 paired-end reads from the Lymphoblastoid cells in the GM12878 cell line (Tilgner et al, PNAS 2014).

Human embryonic stem cells dataset

Lymphoblastoid dataset

Conclusion. Information theory is about fundamental limits. It is a constructive theory. It overcomes computationally intractable problems by focusing on tractable instances.