High Throughput Sequencing: Microscope in the Big Data Era

Presentation transcript:

High Throughput Sequencing: Microscope in the Big Data Era Sreeram Kannan and David Tse Tutorial ISIT 2014 Research supported by NSF Center for Science of Information.

DNA sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

High throughput sequencing revolution tech. driver for communications Faster than Moore's Law Implication to the IT community

Shotgun sequencing read

Technologies
Sequencer: Sanger 3730xl | 454 GS | Ion Torrent | SOLiDv4 | Illumina HiSeq 2000 | Pac Bio
Mechanism: Dideoxy chain termination | Pyrosequencing | Detection of hydrogen ion | Ligation and two-base coding | Reversible nucleotides | Single molecule real time
Read length: 400-900 bp | 700 bp | ~400 bp | 50 + 50 bp | 100 bp PE | 1000~10000 bp
Error rate: 0.001% 0.1% 2% 10-15%
Output data (per run): 100 KB 1 GB 100 GB 1 TB 10 GB

High throughput sequencing: Microscope in the big data era Genomic variations, 3-D structures, transcription, translation, protein interaction, etc. The quantities measured can be dynamic and vary spatially. Example: RNA expression is different in different tissues and at different times. HTS as the 21st century microscope Pachter diagram

Computational problems for high throughput data
Diagram labels: measure, data, manage, utilize; Assembly (de Novo); Variant calling (reference-based assembly); Compression; Privacy; Genome wide association studies; Phylogenetic tree reconstruction; Pathogen detection.
Engineering challenges to manage massive data sets: Compression for storage, communication and retrieval. Distributed processing and inference. Sampling data. Enable data-sharing across multiple organizations (hospital, insurances, pharma,…). Preserve privacy.
Information processing challenges to enable optimal decision making: Joint assembly and quantification across many high-throughput experiments. Dynamic model building and inference from high-dimensional data. Extract actionable information: alerts, waste reduction, monitoring effect of intervention… Modeling complex healthcare delivery systems (hospitals) as networks to enable data-driven scheduling, demand forecast, staffing, ….
Scope of this tutorial

Assembly: three points of view Software engineering Computational complexity theoretic Information theoretic

Assembly as a software engineering problem A single sequencing experiment can generate 100’s of millions of reads, 10’s to 100’s gigabytes of data. Primary concerns are to minimize time and memory requirements. No guarantee on optimality of assembly quality and in fact no optimality criterion at all. Include a paper with many authors

Computational complexity view Formulate the assembly problem as a combinatorial optimization problem: Shortest common superstring (Kececioglu-Myers 95) Maximum likelihood (Medvedev-Brudno 09) Hamiltonian path on overlap graph (Nagarajan-Pop 09) Typically NP-hard and even hard to approximate. Does not address the question of when the solution reconstructs the ground truth.

Information theoretic view Basic question: What is the quality and quantity of read data needed to reliably reconstruct?

Tutorial outline De Novo DNA assembly. Reference-based DNA assembly. De Novo RNA assembly

Themes Interplay between information and computational complexity. Role of empirical data in driving theory and algorithm development.

Part I: De Novo DNA Assembly

Shotgun sequencing model Transition: have people thought about these basic questions before? Basic model : uniformly sampled reads. Assembly problem: reconstruct the genome given the reads.
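The basic model on this slide can be simulated in a few lines. The sketch below is illustrative only: it assumes a linear genome (no wrap-around), fixed read length, and noiseless reads, and the function name `sample_reads` and the `seed` parameter are hypothetical.

```python
import random

def sample_reads(genome, num_reads, read_len, seed=0):
    """Draw reads whose start positions are uniform over the genome."""
    rng = random.Random(seed)
    last_start = len(genome) - read_len  # last position a full read can start at
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [genome[s:s + read_len] for s in starts]

genome = "ATAGACCCTAGACGAT"
reads = sample_reads(genome, num_reads=8, read_len=5)
```

The assembly problem is then to recover `genome` from `reads` alone, without the start positions.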

A Gigantic Jigsaw Puzzle

Challenges: long repeats and read errors. Figures: Human Chr 22 repeat length histogram; Illumina read error profile.

Two-step approach First, we assume the reads are noiseless Derive fundamental limits and near-optimal assembly algorithms. Then, we add noise and see how things change.

Repeat statistics harder jigsaw puzzle easier jigsaw puzzle How exactly do the fundamental limits depend on repeat statistics?

Lower bound: coverage Introduced by Lander-Waterman in 1988. What is the number of reads N_LW needed to cover the entire DNA sequence with probability 1-ε? N_LW only provides a lower bound on the number of reads needed for reconstruction. N_LW does not depend on the DNA repeat statistics!
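One common way to put a number on the coverage requirement is a Poisson approximation, which is an assumption of this sketch rather than something stated on the slide: with N reads of length L on a genome of length G, the expected number of uncovered gaps is roughly N·exp(-NL/G). Setting that equal to ε gives the fixed-point equation N = (G/L)·ln(N/ε), which the hypothetical helper below solves by simple iteration.

```python
import math

def lander_waterman_reads(G, L, eps, iters=100):
    """Solve N = (G/L) * ln(N/eps) by fixed-point iteration (Poisson approximation)."""
    n = G / L  # start from the bare minimum: one read per L bases
    for _ in range(iters):
        n = (G / L) * math.log(n / eps)
    return n

G, L, eps = 1_000_000, 100, 0.01
n_lw = lander_waterman_reads(G, L, eps)
# Sanity check: the expected number of gaps at the solution should be about eps.
expected_gaps = n_lw * math.exp(-n_lw * L / G)
```

Note that only G, L and ε enter the calculation, which is the point the slide makes: the coverage bound is blind to repeat structure.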

Simple model: I.I.D. DNA, G → ∞ (Motahari, Bresler & Tse 12). Figure: normalized # of reads versus read length L. Labels: no coverage; coverage (normalized value 1); many repeats of length L; no repeats of length L; reconstructable; reconstructable by greedy algorithm. What about for finite real DNA?

I.I.D. DNA vs real DNA (Bresler, Bresler & Tse 12) Example: human chromosome 22 (build GRCh37, G = 35M). Plot legend: data, i.i.d. fit. Can we derive performance bounds on an individual sequence basis?

Individual sequence performance bounds (Bresler, Bresler, Tse BMC Bioinformatics 13) Given a genome s: start with the individual sequence, extract sufficient statistics, get curves. Plot for Human Chr 19 Build 37. Labels: repeat length, L_critical, Lander-Waterman coverage. Curves: greedy, deBruijn, simpleBridging, multiBridging, ML lower bound.

GAGE Benchmark Datasets http://gage.cbcb.umd.edu/
Rhodobacter sphaeroides: G = 4,603,060
Staphylococcus aureus: G = 2,903,081
Human Chromosome14: G = 88,289,540
Plot curves for each dataset: multiBridging, lower bound. What about the lower bound?

Lower bound: Interleaved repeats Necessary condition: all interleaved repeats are bridged. In particular: L > longest interleaved repeat length (Ukkonen)

Lower bound: Triple repeats Necessary condition: all triple repeats are bridged. In particular: L > longest triple repeat length (Ukkonen)

Individual sequence performance bounds (Bresler, Bresler, Tse BMC Bioinformatics 13) Multibridging is the algorithm we propose, which is nearly optimal, at least for chromosome 19. Did we get lucky? Plot for Human Chr 19 Build 37: lower bound and Lander-Waterman coverage versus length.

Greedy algorithm (TIGR Assembler, phrap, CAP3...)
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads
2. Find two contigs with largest overlap and merge them into a new contig
3. Repeat step 2 until only one contig remains
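The three steps above can be sketched directly. This is a minimal quadratic-time illustration under assumptions that the slide does not state: reads are noiseless, duplicates are dropped up front, and ties between equal overlaps are broken by list order. It is not how TIGR Assembler, phrap or CAP3 are implemented; `overlap` and `greedy_assemble` are hypothetical names.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    contigs = list(dict.fromkeys(reads))  # step 1: contigs = distinct reads
    while len(contigs) > 1:
        best = (-1, None, None)
        # Step 2: find the ordered pair of contigs with the largest overlap.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)  # step 3: repeat until one contig remains
    return contigs[0]

# Dense noiseless reads of length 4 from the repeat-free sequence ACGGTCAT.
reads = ["ACGG", "CGGT", "GGTC", "GTCA", "TCAT"]
assembled = greedy_assemble(reads)
```

On a repeat-free input like this one every merge is forced; the slides that follow look at what happens when a repeat is not bridged by any read.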

Greedy algorithm: first error at overlap. Figure labels: repeat, contigs, bridging read, already merged. A sufficient condition for reconstruction: all repeats are bridged.

Back to chromosome 19: GRCh37 Chr 19 (G = 55M). Non-interleaved repeats are resolvable! Plot labels: lower bound; greedy algorithm; longest interleaved repeats at length 2248; longest repeat at

Dense Read Model As the number of reads N increases, one can recover exactly the L-spectrum of the genome. If there is at least one non-repeating L-mer on the genome, this is equivalent information to having a read at every starting position on the genome. Key question: What is the minimum read length L for which the genome is uniquely reconstructable from its L-spectrum? Mention weight in L-spectrum
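For concreteness, the L-spectrum of a known sequence is just the multiset of its length-L substrings. The sketch below computes it with multiplicities, using the 16-base example sequence from the de Bruijn graph slides; the function name is hypothetical.

```python
from collections import Counter

def l_spectrum(genome, L):
    """Multiset of all length-L substrings (L-mers) of the genome."""
    return Counter(genome[i:i + L] for i in range(len(genome) - L + 1))

spec = l_spectrum("ATAGACCCTAGACGAT", 5)
```

Here `spec["TAGAC"]` is 2 because that 5-mer is a repeat, while every other 5-mer occurs once.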

de Bruijn graph (L = 5) ATAGACCCTAGACGAT Node labels in the figure: AGCC AGCG GCCC GCGA CCCT CCTA CTAG ATAG CGAT AGAC. 1. Add a node for each (L-1)-mer on the genome. 2. Add k edges between two (L-1)-mers if their overlap has length L-2 and the corresponding L-mer appears k times in the genome.
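The two construction rules translate into a short routine. In this sketch the graph is stored as a mapping from (prefix, suffix) node pairs to edge multiplicity k; that representation and the name `de_bruijn_edges` are choices made for illustration.

```python
from collections import Counter

def de_bruijn_edges(genome, L):
    """Map (prefix (L-1)-mer, suffix (L-1)-mer) -> number of times the L-mer occurs."""
    edges = Counter()
    for i in range(len(genome) - L + 1):
        lmer = genome[i:i + L]
        edges[(lmer[:-1], lmer[1:])] += 1  # one edge per occurrence of the L-mer
    return edges

edges = de_bruijn_edges("ATAGACCCTAGACGAT", 5)
nodes = {v for edge in edges for v in edge}
```

For this sequence the edge from TAGA to AGAC has multiplicity 2, reflecting the repeated 5-mer TAGAC.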

Eulerian path (L = 5) ATAGACCCTAGACGAT Node labels in the figure: AGCC AGCG GCCC GCGA CCCT CCTA CTAG ATAG CGAT AGAC. Theorem (Pevzner 95): If L > max(l_interleaved, l_triple), then the de Bruijn graph has a unique Eulerian path which is the original genome.
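A small sketch of reconstruction from the L-spectrum: build the de Bruijn multigraph from the L-mers and walk an Eulerian path with Hierholzer's algorithm. Hierholzer's algorithm is a standard choice assumed here, not one named on the slide, and the code assumes a linear genome whose graph actually has an Eulerian path. When the path is unique, as the theorem guarantees for L above the critical length, the walk spells the original sequence.

```python
from collections import defaultdict

def eulerian_reconstruct(lmers):
    """Rebuild a sequence from a multiset of L-mers via an Eulerian path."""
    adj = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        adj[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # A linear genome starts at the node with one extra outgoing edge.
    start = next((n for n, b in balance.items() if b == 1), lmers[0][:-1])
    stack, path = [start], []
    while stack:  # Hierholzer's algorithm, iterative form
        v = stack[-1]
        if adj[v]:
            stack.append(adj[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

genome = "ATAGACCCTAGACGAT"
L = 5
lmers = [genome[i:i + L] for i in range(len(genome) - L + 1)]
rebuilt = eulerian_reconstruct(lmers)
```

In this example the only repeat (TAGAC) is neither interleaved nor triple, so the Eulerian path is unique and `rebuilt` equals `genome`.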

Resolving non-interleaved repeats Condensed sequence graph non-interleaved repeat Unique Eulerian path.

From dense reads to shotgun reads [Idury-Waterman 95] [Pevzner et al 01] Idea: mimic the dense read scenario by looking at K-mers of the length-L reads. Construct the K-mer graph and find an Eulerian path. Success if we have K-coverage of the genome and K > L_critical. Implies higher coverage than LW.
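To illustrate the idea of mimicking dense reads, the sketch below extracts K-mers from a set of longer reads and, in a simulation where the true genome is known, checks whether every genome K-mer was observed. That check is used here as a simple stand-in for the K-coverage condition and is an assumption of this example; the function names are hypothetical.

```python
def kmers_from_reads(reads, K):
    """Set of all K-mers that occur in any read."""
    return {r[i:i + K] for r in reads for i in range(len(r) - K + 1)}

def has_k_coverage(genome, reads, K):
    """Simulation-only check: is every K-mer of the genome seen in some read?"""
    genome_kmers = {genome[i:i + K] for i in range(len(genome) - K + 1)}
    return genome_kmers <= kmers_from_reads(reads, K)

genome = "ATAGACCCTAGACGAT"
reads = ["ATAGACCC", "GACCCTAG", "CCTAGACG", "AGACGAT"]
covered = has_k_coverage(genome, reads, K=5)
# Dropping the second read loses the K-mer ACCCT, so coverage fails.
covered_without_second = has_k_coverage(genome, reads[:1] + reads[2:], K=5)
```

Consecutive reads must overlap by enough bases for every K-mer to be seen, which is why this approach needs more reads than plain Lander-Waterman coverage.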

De Bruijn algorithm: performance Loss of info. from the reads! Plot for Human Chr 19 Build 37: greedy, deBruijn, lower bound and Lander-Waterman coverage versus length.

Resolving bridged interleaved repeats bridging read interleaved repeat Bridging read resolves one repeat and the unique Eulerian path resolves the other.

Simple bridging: performance Plot for Human Chr 19 Build 37: greedy, deBruijn, simpleBridging, lower bound and Lander-Waterman coverage versus length.

Resolving triple repeats all copies bridged neighborhood of triple repeat triple repeat all copies bridged resolve repeat locally

Triple Repeats: subtleties

Multibridging De Bruijn Theorem (Bresler, Bresler, Tse 13): Original sequence is reconstructable if: 1. triple repeats are all-bridged 2. interleaved repeats are (single) bridged 3. coverage. Necessary conditions for ANY algorithm: 1. triple repeats are (single) bridged 2. interleaved repeats are (single) bridged 3. coverage.

Multibridging: near optimality for Chr 19 Plot for Human Chr 19 Build 37: greedy, deBruijn, simpleBridging, multiBridging, lower bound and Lander-Waterman coverage versus length.

GAGE Benchmark Datasets http://gage.cbcb.umd.edu/
Rhodobacter sphaeroides: G = 4,603,060
Staphylococcus aureus: G = 2,903,081
Human Chromosome14: G = 88,289,540
L_critical = length of the longest triple or interleaved repeat. Plot curves for each dataset: multiBridging, lower bound, with L_critical marked. What about the lower bound?

Gap Sulfolobus islandicus. G = 2,655,198 triple repeat lower bound MULTIBRIDGING algorithm interleaved repeat lower bound

Complexity: Computational vs Informational
Complexity of MULTIBRIDGING: for a G length genome, O(G^2).
Alternate formulations of assembly:
Shortest Common Superstring: NP-hard. Greedy is O(G), but only a 4-approximation to SCS in the worst case.
Maximum Likelihood: NP-hard.
Key differences: We are concerned only with instances when reads are informationally sufficient to reconstruct the genome. Individual sequence formulation lets us focus on issues arising only in real genomes.

Confidence When the algorithm obtains an answer, can it be sure? Under the dense read model, we can guarantee that when there is a unique Eulerian cycle, the reconstructed answer is correct. This happens whenever L > max(l_interleaved, l_triple). Conversely, when L < max(l_interleaved, l_triple), there are multiple reconstructions that are consistent with the observed data. Under the shotgun read model, there is ambiguity in some scenarios.

Read Errors ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGT (erroneous bases marked in the figure: T A T A C T T A) Error rate and nature depend on sequencing technology. Examples: Illumina: 0.1 – 2% substitution errors; PacBio: 10 – 15% indel errors. We will focus on a simple substitution noise model with noise parameter p.
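A simple substitution noise model can be simulated as follows. The sketch assumes, as one natural reading of the slide, that each base is independently replaced with probability p by one of the other three bases chosen uniformly; the function name and `seed` parameter are hypothetical.

```python
import random

def add_substitutions(read, p, seed=0):
    """Independently replace each base with probability p by a different base."""
    rng = random.Random(seed)
    noisy = []
    for base in read:
        if rng.random() < p:
            # Assumption: the replacement is uniform over the three other bases.
            noisy.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)

clean = "ACGTCCTATGCGTATG"
noisy = add_substitutions(clean, p=0.015)
```

Under this reading, p = 3/4 makes every output base uniform on {A, C, G, T} regardless of the input, so the reads carry no information about the genome at that value.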

Consistency Basic question: What is the impact of noise on L_critical? This question is equivalent to whether the L-spectrum is exactly recoverable as the number of noisy reads N → ∞. Theorem (C.C. Wang 13): Yes, for all p except p = ¾.

What about coverage depth? Theorem (Motahari, Ramchandran, Tse, Ma 13): Assume i.i.d. genome model. If read error rate p is less than a threshold, then Lander-Waterman coverage is sufficient for L > L_critical. For uniform distr. on {A,G,C,T}, threshold is 19%. A separation architecture is optimal: error correction, then assembly.

Why? noise averaging Coverage means most positions are covered by many reads. Multiple aligning overlapping noisy reads is possible if Assembly using noiseless reads is possible if M

From theory to practice Two issues: 1) Multiple alignment is performed by testing joint typicality of M sequences, which is computationally too expensive. Solution: use the technique of fingerprinting. 2) Real genomes are not i.i.d. Solution: replace greedy by multibridging.

X-phased multibridging (Lam, Khalak, Tse Recomb-Seq 14) Prochlorococcus marinus L_critical Substitution errors of rate 1.5 %

More results Prochlorococcus marinus Helicobacter pylori Lcritical Lcritical Methanococcus maripaludis Mycoplasma agalactiae Lcritical Lcritical

A more careful look Mycoplasma agalactiae Lcritical-approx Lcritical

Approximate repeat example: Yersinia pestis. Exact triple repeat, length 1662; approximate triple repeat, length 5608.

Application: finishing tool for PacBio reads
PacBio Assembler HGAP: raw_reads.fasta → contigs.fasta
Our finishingTool: raw_reads.fasta, contigs.fasta → improved_contigs.fasta
https://github.com/kakitone/finishingTool

Experimental results Before After Escherichia coli Meiothermus ruber Pedobacter heparinus

More detail of the result
Species | Before [Ncontigs] | After [Ncontigs] | % Match with reference | Time | Size
Escherichia coli (MG 1655) | 21 | 7 [finisherSC] | 99.60 | < 3 mins (laptop) | ~ 4.6M
Meiothermus ruber (DSM 1279) | 3 | 1 | 99.99 | < 1 min (laptop) | ~ 3.0M
Pedobacter heparinus (DSM 2366) | 18 | 5 | 99.89 | < 3 mins | ~ 5.1M
S. cerevisiae (fungus) | 252 | 78 [finisherSC] | 95.46 | < 3 hours (laptop) | ~ 12.4M
S. cerevisiae (fungus) | | 55 [Greedy] | 53.91 | |

Part II: Reference-Based DNA Assembly (Mohajer, Kannan, Tse ‘14)

Many genomes to sequence… 100 million species (e.g. phylogeny); 7 billion individuals (SNP, personal genomics); 10^13 cells in a human (e.g. somatic mutations such as HIV, cancer) … but not all independent. Courtesy: Batzoglou

Reference Based Assembly: Formulation ACGTCCCATGCGTATGCATAATGCCACATATGGCTATGCGTAATGAGTACC Target ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTACC Side Information Assembler

Types of Variations Substitutions (Single Nucleotide Polymorphisms: SNP) Reference ACGTCCCATGCGTATGCATAATGCCACATATGGCTATGCGTAATGAGTACC Target ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTACC

Types of Variations Small Indels (Insertions and Deletions) Reference ACGTCCATGCGTATGCTAATGCCACATATTGAGCTATGCGTAATGCTGTACC ACGTCC___ATGCGTATGC_TAATGCCACATATTGAGCTATGCGTAATGCTGTACC Target ACGTCCTAGATGCGTATGCGTAATGCCACATATGCTATGCGTAATGGTACC ACGTCCTAGATGCGTATGCGTAATGCCACATAT___GCTATGCGTAATG__GTACC

Types of Variations Structural Variation Reference Inversion Duplication Duplication (dispersed) Copy Number Variation

Mathematical Formulation Focus on SNP version Define SNP rate Noiseless reads What is Lcritical for this problem? Want exact reconstruction Algorithm r (Reference DNA) SNP Rate Reads from target t Estimate of Target DNA Dense

Mathematical Formulation For any given reference DNA and SNP rate, what is the read length required for reconstruction? In the worst case among target DNA sequences Lcritical is a function of r, SNP rate Dense Reads from target t r (Reference DNA) Algorithm Estimate of Target DNA SNP Rate

Necessary Conditions Let the reference DNA r have a repeat of size l_rep > 2L. Consider two possible target DNA sequences t1 and t2. Since L < l_rep / 2, the two targets t1 and t2 are indistinguishable from reads. Sanity check: interleaved repeat of length l_rep / 2 in t1 and t2.

Necessary Conditions Let the reference DNA r have an approximate repeat of size l_rep,app > 2L. Can create r’ close to r but having an exact repeat of size l_rep,app. If L < l_rep,app / 2: the two possible targets t1 and t2 are indistinguishable. Tolerance for approximate repeat depends on SNP rate

Algorithm Let L > l_rep,app / 2. 1. Map reads from t to r. 2. Keep only uniquely mapped reads. 3. Estimate t (output estimate ť).
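The map, filter, estimate steps can be sketched with a brute-force mapper. Everything specific here is an assumption made for illustration: reads are noiseless and dense, variation is substitutions only, a read "maps" to a reference position if it is within `max_mismatches` Hamming distance, and positions not covered by any uniquely mapped read fall back to the reference base. Production aligners work very differently.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def call_target(reference, reads, max_mismatches=1):
    estimate = list(reference)  # default: copy the reference base
    for read in reads:
        n = len(read)
        # Step 1: map the read to every reference position within tolerance.
        hits = [i for i in range(len(reference) - n + 1)
                if hamming(read, reference[i:i + n]) <= max_mismatches]
        # Step 2: keep only uniquely mapped reads.
        if len(hits) == 1:
            start = hits[0]
            # Step 3: noiseless reads give the target bases directly.
            estimate[start:start + n] = list(read)
    return "".join(estimate)

reference = "ACGTCCCATGCGTATG"
target = "ACGTCCTATGCGTATG"  # one SNP at index 6 (C -> T)
L = 8
reads = [target[i:i + L] for i in range(len(target) - L + 1)]  # dense reads
estimate = call_target(reference, reads)
```

In this toy case the reference has no approximate repeat anywhere near length 2L, so every read maps uniquely and the SNP is recovered.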

Condition for Success Loci covered by uniquely mapped reads are correctly called. Algorithm fails at a particular locus => none of the (L-1) possible reads is uniquely mapped (figure: Case 1 and Case 2, each spanning 2L on r). Second case more typical in real genome => 2L length approximate repeat in r. L > l_rep,app / 2 => the algorithm succeeds.

Assembly Vs. Alignment: I Necessary condition: L ≥ l_rep,app(r) / 2. Sufficient condition: L > l_rep,app(r) / 2 (subject to the assumption) => Alignment near optimal and L_ref = l_rep,app(r) / 2. De Novo algorithm achieves L_crit(t) = max {l_interleaved(t), l_triple(t)}. In terms of r, for worst case t: L_de-novo = max {l_interleaved,app(r), l_triple,app(r)}.

Assembly Vs. Alignment: II Clearly L_de-novo ≥ L_ref since L_ref is necessary. L_de-novo = max {l_interleaved,app(r), l_triple,app(r)} ≤ l_rep,app(r) = 2 L_ref. Thus gain from reference is at most a factor of 2 in the read length. The maximal gain happens when l_interleaved,app(r) = l_rep,app(r), i.e., when the largest approximate repeat is an interleaved repeat. This happens, for example, when the DNA is an i.i.d. sequence.
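Written as a single chain, the comparison on this slide reads as follows. The symbols are the slide's own; only the typesetting is new.

```latex
\[
L_{\mathrm{ref}}
  \;=\; \tfrac{1}{2}\,\ell_{\mathrm{rep,app}}(r)
  \;\le\; L_{\text{de-novo}}
  \;=\; \max\bigl\{\ell_{\mathrm{interleaved,app}}(r),\ \ell_{\mathrm{triple,app}}(r)\bigr\}
  \;\le\; \ell_{\mathrm{rep,app}}(r)
  \;=\; 2\,L_{\mathrm{ref}}
\]
```

The left inequality holds because L_ref is a necessary read length for any method, the right one because an interleaved or triple approximate repeat is in particular an approximate repeat. The ratio L_de-novo / L_ref therefore lies between 1 and 2.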

Reference based Assembly: Reprise Complexity of alignment: very fast aligners using fingerprinting are available when the SNP rate is small. Better than alignment? Theory shows alignment is near optimal. But alignment is what everyone uses anyway. Nothing better is possible? The limitations of the worst case formulation! If we adopt an individual sequence analysis for both reference and target, a better solution is possible.

Part III: RNA (Transcriptome) Assembly Kannan, Pachter, Tse Genome Informatics ‘13

RNA: The RAM in Cells DNA → (transcription) → RNA → (translation) → Protein. The instructions from DNA are copied to mRNA transcripts by transcription. RNA transcripts capture the dynamics of the cell. RNA Sequencing: Importance. Clinical purposes. Research: discovery of novel functions; understanding gene regulation. Most popular *-Seq.

Alternative splicing DNA (1000’s to 10,000’s symbols long): exons ATAC GAAT CAAT TCAG separated by introns (AC TGAA AGC in the figure). RNA Transcript 1: ATAC CAAT TCAG. RNA Transcript 2: GAAT TCAG. Alternative splicing yields different isoforms.

RNA-Seq (Mortazavi et al, Nature Methods 08) Figure: reads (e.g. TCA, ATT, GAA) sampled from the transcripts ATAC CAAT TCAG and GAAT TCAG; the assembler reconstructs the transcripts from the reads. Existing assemblers: Genome guided: Cufflinks, Scripture, IsoLasso,.. De novo: Trinity, Oases, Trans-ABySS,…

RNA Sequencing: Bottleneck Popular assemblers diverge significantly when fed the same input. Venn diagram of IsoLasso, Scripture and Cufflinks outputs, region counts: 24243 7553 9741 6457 448216 59647 5588. Is the bottleneck informational or computational or neither? Source: Wei Li et al, JCB 2011, Data from ENCODE project

Informational Limits L_critical for transcriptome assembly. Read length L below L_critical: no algo. can reconstruct. Above L_critical: proposed algo. can reconstruct in linear time. On many examples, these two bounds match, establishing L_critical! Mouse transcriptome: L_critical = 4077, revealing complex transcriptome structure. What can we do at practical values of L?

Near-Optimality at Practical L Plot: fraction of transcripts reconstructable versus read length.

Near-Optimality at Practical L Plot: fraction of transcripts reconstructable versus read length. Curves: upper bound without abundance; upper bound on any algorithm.

Near-Optimality at Practical L Plot: fraction of transcripts reconstructable versus read length. Curve: proposed algorithm.

Necessity of Abundance Information Plot: fraction of transcripts reconstructable versus read length. Curves: upper bound without abundance; upper bound without abundance diversity.

Transcriptome Assembly: Formulation M transcripts s1,..,sM with relative abundances α1,..,αM which are generic (rationally independent). Dense read model (look at L_crit): get all substrings of length L along with their relative weights. Figure: transcripts s1, s2, ..., sM with weights α1, α2, ..., αM; a segment shared by s1 and s2 has weight α1+α2.
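Under the dense read model, the observation is a weighted L-spectrum: each L-mer's weight is the sum of the abundances of the transcripts containing it, counted per occurrence. The sketch below uses the two toy isoforms from the alternative splicing slide; the abundances 0.25 and 0.75 and the function name are made up for illustration.

```python
from collections import defaultdict

def weighted_spectrum(transcripts, abundances, L):
    """Weight of each L-mer = sum of abundances over all its occurrences."""
    weights = defaultdict(float)
    for seq, alpha in zip(transcripts, abundances):
        for i in range(len(seq) - L + 1):
            weights[seq[i:i + L]] += alpha
    return dict(weights)

transcripts = ["ATACCAATTCAG", "GAATTCAG"]  # ATAC CAAT TCAG and GAAT TCAG
abundances = [0.25, 0.75]                   # hypothetical relative abundances
spec = weighted_spectrum(transcripts, abundances, L=4)
```

The shared exon TCAG gets weight α1 + α2, while ATAC and GAAT keep the weight of the single transcript they belong to.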

What is Lcritical for transcriptome? Lcritical is lower bounded by the length of the longest interleaved repeat in any transcript It can potentially be much larger due to inter-transcript repeats of exons across isoforms. ATAC CAAT TCAG GAAT TCAG

The Information Bottleneck s1 s3 s4 s2 s5 s1 s3 s4 s2 s5

The Information Bottleneck s4 s4 s1 s3 s1 s3 s5 s5 s2 s3 s2 s3

The Information Bottleneck s5 s1 s3 s2 s4 s5 s1 s3 s4 s2 s3 Unless L > s3 these two transcriptomes are confused

The Information Bottleneck s1 s3 s4 s5 s1 s3 s4 s2 s2 s3 s5 Sparsity can help rule out this four transcript alternative But first two possibilities still confusable unless L > s3

How to Distinguish the Two

Abundance diversity: lymphoblastoid cell line, Geuvadis dataset

Abundance Diversity s4 s1 s3 s5 s1 s3

Abundance Diversity s5 s4 s1 s3 s1 s3 s5 s4 s1 s3 s1 s3 This transcriptome is not a viable alternative (non-uniform coverage) Even if L < s3 these transcriptomes are distinguishable.

Fooling Set under Abundance Diversity Figure: transcripts built from s1 s2 s3, s1 s2 s4 and s4 s5 s2 s3 with abundances a, b, c versus a+c, b-c. These two transcriptomes are still confusable if L < s2

Achievability: Algorithm From the reads we construct a transcript graph 0.1 Reads ATCCA ATCCA GATTC GATTC ATTCG ATTCG 0.3 0.3 TCCAT TCCAT 0.3 0.3 CCATT CATTC CATTC Weight edges based on relative frequencies

Achievability: Algorithm From the reads, we construct a transcript graph 0.1 Reads ATCCA GATTC ATTCG 0.3 0.3 TCCAT 0.3 0.3 CCATT CATTC Weight edges based on relative frequencies

Achievability: Algorithm From the reads, we construct a transcript graph 0.1 Reads ATC GAT TCG 0.3 0.3 CAT Weight edges based on relative frequencies

Transcripts from Graph Paths correspond to transcripts Naïve Algorithm: Output all paths from the graph GAT TCG GAT 0.1 ATC TCG 0.3 0.3 CAT ATC CAT TCG
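The naïve algorithm is a plain enumeration of source-to-sink paths. The graph below is one reading of the flattened figure (ATC → CAT → TCG and GAT → TCG) and should be treated as an assumption, as should the function name.

```python
def all_paths(graph):
    """Enumerate every source-to-sink path in a directed acyclic graph."""
    has_incoming = {v for targets in graph.values() for v in targets}
    sources = [v for v in graph if v not in has_incoming]
    paths = []

    def extend(path):
        successors = graph.get(path[-1], [])
        if not successors:  # reached a sink: the path is complete
            paths.append(path)
        for nxt in successors:
            extend(path + [nxt])

    for s in sources:
        extend([s])
    return paths

# Hypothetical transcript graph read off the slide's figure.
graph = {"ATC": ["CAT"], "CAT": ["TCG"], "GAT": ["TCG"], "TCG": []}
candidates = all_paths(graph)
```

On larger splice graphs the number of paths can grow exponentially and most of them are not real transcripts, which is what motivates using abundance information on the following slides.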

Utility of Abundance Consider the following splice-graph Not all paths are transcripts Node frequencies give abundance information First idea: Use continuity of copy counts 0.12 s1 s3 s4 s1 0.12 s4 0.12 s3 0.88 s2 0.88 0.88 s5 s2 s3 s5

Utility of Abundance: Beyond Continuity s0 s3 s4 5 More complex splice graphs: s0 s3 s5 s0 s1 s3 s4 s5 12 9 5 7 s2 6 s6 15 7 9 s1 s3 s6 6 s2 s3 s6 In general, we are given values on nodes /edges. Need to find sparsest flow (on fewest paths).

General Splice graphs Principle for general splice graphs: Find the smallest set of paths that corresponds to the node / edge copy counts. Related settings: network routing, snooping, societal networks. How to split a flow? Edge-flow: flow value on each edge (satisfying conservation). Path-flow: flow value on each path. Given an edge-flow, find the sparsest path-flow. Figure: Start → s1 → s3 → s4 → End with flow 0.12 and Start → s2 → s3 → s5 → End with flow 0.88.
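To make the edge-flow versus path-flow distinction concrete, here is a simple greedy decomposition that repeatedly follows the heaviest remaining edge from source to sink and peels off the bottleneck. This is a generic heuristic swapped in for illustration, not the algorithm proposed in the tutorial, and it is not guaranteed to find the sparsest decomposition. It assumes a DAG with flow conservation, and uses integer copy counts (12 and 88, standing in for the figure's 0.12 and 0.88) to avoid floating-point residue.

```python
def greedy_decompose(flow, source, sink):
    """Split an edge-flow on a DAG into weighted paths, heaviest edge first."""
    residual = {u: dict(vs) for u, vs in flow.items()}
    paths = []
    while any(w > 0 for w in residual.get(source, {}).values()):
        path, node = [source], source
        while node != sink:
            # Follow the outgoing edge with the most remaining flow.
            nxt = max(residual[node], key=residual[node].get)
            path.append(nxt)
            node = nxt
        bottleneck = min(residual[u][v] for u, v in zip(path, path[1:]))
        for u, v in zip(path, path[1:]):
            residual[u][v] -= bottleneck
        paths.append((path, bottleneck))
    return paths

# Edge-flow matching the slide's example, scaled to integers.
flow = {
    "Start": {"s1": 12, "s2": 88},
    "s1": {"s3": 12},
    "s2": {"s3": 88},
    "s3": {"s4": 12, "s5": 88},
    "s4": {"End": 12},
    "s5": {"End": 88},
}
paths = greedy_decompose(flow, "Start", "End")
```

Here the differing abundances pair s2 with s5 and s1 with s4, giving two paths instead of the four that plain path enumeration would output.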

Sparsest Flow Decomposition Problem is NP-Hard. [Vatinlen et al’ 08, Hartman et al ’12] Closer look at hard instances: most paths have same flow Equivalent to: Most transcripts have same abundance (!) This is not characteristic of the biological problem Our Result: Assume that abundances are generic Propose a provably correct algorithm that reconstructs when: L > Lsuff Algorithm is linear time under this condition Approximately satisfied by biological data !

Iterative Algorithm The algorithm locally resolves paths using abundance diversity Error propagation? Decompose a node only when sure If unsure, decompose other nodes before coming back to this node The algorithm solves paths like a sudoku puzzle Solving one node can help uniquely resolve other nodes! Can analyze conditions for correct recovery L > Lsuff

Algorithm: Example Run 4 6 7 a+b a b 3 5 1 2 b+c c 4 6 7 a+b a b 3 5 1 2 b+c c 46 47 a b 3 5 1 2 b+c c 46 47 a b 3 5 1 2 b+c c 1346 235 2347 a b c 346 35 a c 1 347 b 2

Practical Implementation Multibridging to construct transcript graph Condensation and intra-transcript repeat resolution Identify and discard sequencing errors Aggregate abundance estimation Node-wise copy count estimates Smoothing CC estimates using min-cost network flow Transcripts as paths Sparsest decomposition of edge-flow into paths Deals with inter-transcript repeats

Practical Performance Simulated reads from human chromosome 15, Gencode transcriptome Hard test case 1700 transcripts chosen randomly from Chr 15 Abundance generated from log-uniform distribution Read length=100, 1 Million reads 1% error rates Single-end reads / stranded protocol

Practical Performance Fraction of Transcripts Missed False Positives Coverage Depth of Transcripts

Complexity Sparsest flow problem known to be NP-Hard Can show using similar reduction that RNA-Seq problem under dense reads is also NP-Hard, assuming arbitrary abundances Reasons why our formulation leads to poly-time algorithm: Our assumption that abundances are generic Only worry about instances where there is enough information Individual sequence formulation lets us focus on issues arising only in real genomes.

Confidence Can we be sure when the produced solution is correct? Assume dense read model We are finding the sparsest set of transcripts that satisfy the given L spectrum Under the assumption of genericity Theorem: If the sparsest solution is unique, then it is the only generic solution satisfying the L-spectrum (!) s1 s2 s3 s4 s5 0.12 0.88

Summary An approach to assembly design based on principles of information theory. Driven by and tested on genomics and transcriptomics data. Ultimate goal is to build robust, scalable software with performance guarantees.

Problem Landscape
Diagram labels: measure, manage, utilize, data; Assembly (de Novo); Noisy reads; RNA: Finite N; Variant calling (reference-based assembly); Indels; Large variants; Metagenomic assembly; Genome wide association studies; Information bounds; Phylogenetic tree reconstruction; Pathogen detection; Compression; Compress memory?; Privacy; Information theoretic methods?
Engineering challenges to manage massive data sets: Compression for storage, communication and retrieval. Distributed processing and inference. Sampling data. Enable data-sharing across multiple organizations (hospital, insurances, pharma,…). Preserve privacy.
Information processing challenges to enable optimal decision making: Joint assembly and quantification across many high-throughput experiments. Dynamic model building and inference from high-dimensional data. Extract actionable information: alerts, waste reduction, monitoring effect of intervention… Modeling complex healthcare delivery systems (hospitals) as networks to enable data-driven scheduling, demand forecast, staffing, ….

Acknowledgements DNA Assembly RNA Assembly Abolfazl Motahari Sharif Soheil Mohajer Guy Bresler MIT Lior Pachter Berkeley Ma’ayan Bresler Berkeley Joseph Hui Berkeley Kayvon Mazooji Berkeley Eren Sasoglu Ka Kit Lam Berkeley Asif Khalak Pacific Biosciences