The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,

Slides:



Advertisements
Similar presentations
Reconstruction of DNA sequencing by hybridization Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang Institute of Applied Mathematics,
Advertisements

Graph Theory Aiding DNA Fragment Assembly Jonathan Kaptcianos advisor: Professor Jo Ellis-Monaghan Work.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Pamela Ferretti Laboratory of Computational Metagenomics Centre for Integrative Biology University of Trento Italy Microbial Genome Assembly 1.
Next Generation Sequencing, Assembly, and Alignment Methods
Alignment Problem (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one. Sub-optimal.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
CISC667, F05, Lec4, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Whole genome sequencing Mapping & Assembly.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
Association Mapping of Complex Diseases with Ancestral Recombination Graphs: Models and Efficient Algorithms Yufeng Wu UC Davis RECOMB 2007.
1 Nicholas Mancuso Department of Computer Science Georgia State University Joint work with Bassam Tork, GSU Pavel Skums, CDC Ion M ӑ ndoiu, UConn Alex.
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA.
DNA Sequencing. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT.
CS273a Lecture 1, Autumn 10, Batzoglou DNA Sequencing.
DNA Sequencing. CS273a Lecture 3, Autumn 08, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
High Throughput Sequencing: Microscope in the Big Data Era
RNA-Seq Assembly: Fundamental Limits, Algorithms and Software TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse Stanford University Symposium on Turbo.
High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse EASIT Chinese University of Hong Kong.
Genome sequencing and assembling
Gaussian Interference Channel Capacity to Within One Bit David Tse Wireless Foundations U.C. Berkeley MIT LIDS May 4, 2007 Joint work with Raul Etkin (HP)
Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly.
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.
Assembling Genomes BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
De-novo Assembly Day 4.
CS 394C March 19, 2012 Tandy Warnow.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
Linear Reduction for Haplotype Inference Alex Zelikovsky joint work with Jingwu He WABI 2004.
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Metagenomics Assembly Hubert DENISE
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Problems of Genome Assembly James Yorke and Aleksey Zimin University of Maryland, College Park 1.
Cancer Genome Assemblies and Variations between Normal and Tumour Human Cells Zemin Ning The Wellcome Trust Sanger Institute.
Science & Technology Centers Program Center for Science of Information Bryn Mawr Howard MIT Princeton Purdue Stanford Texas A&M UC Berkeley UC San Diego.
1 HKU CS Bioinformatics Research Siu Ming Yiu Department of Computer Science The University of Hong Kong Other faculty members: Prof. Francis Chin Prof.
CompostBin : A DNA composition based metagenomic binning algorithm Sourav Chatterji *, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen UC Davis
billion-piece genome puzzle
GigAssembler. Genome Assembly: A big picture
DNA Sequencing.
Computer Science Background for Biologists CSC 487/687 Computing for Bioinformatics Fall 2005.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Chapter 5 Sequence Assembly: Assembling the Human Genome.
CISC667, S07, Lec4, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Whole genome sequencing Mapping & Assembly.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
When the next-generation sequencing becomes the now- generation Lisa Zhang November 6th, 2012.
The Haplotype Blocks Problems Wu Ling-Yun
UNIT I. Entropy and Uncertainty Entropy is the irreducible complexity below which a signal cannot be compressed. Entropy is the irreducible complexity.
Information Theory of High-throughput Shotgun Sequencing David Tse Dept. of EECS U.C. Berkeley Tel Aviv University June 4, 2012 Research supported by NSF.
The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley UBC September 14, 2012 Research supported by NSF.
Lesson: Sequence processing
Assembly algorithms for next-generation sequencing data
DNA Sequencing Project
Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.
CSCI2950-C Genomes, Networks, and Cancer
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Genome sequence assembly
Science of Information: Case Studies in DNA and RNA assembly
Research in Computational Molecular Biology , Vol (2008)
How to Solve NP-hard Problems in Linear Time
Introduction to Genome Assembly
CS 598AGB Genome Assembly Tandy Warnow.
2nd (Next) Generation Sequencing
Assembling Genomes BCH339N Systems Biology / Bioinformatics – Spring 2016 Edward Marcotte, Univ of Texas at Austin.
Presentation transcript:

The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14, 2012 Research supported by NSF Center for Science of Information.

Communication: the beginning Prehistoric: smoke signals, drums. 1837: telegraph 1876: telephone 1897: radio 1927: television Communication design tied to the specific source and specific physical medium.

Grand Unification Shannon 48 Theorem: Model all sources and channels statistically. A unified way of looking at all communication problems in terms of information flow. source reconstructed source

60 Years Later All communication systems are designed based on the principles of information theory. A benchmark for comparing different schemes and different channels. Suggests totally new ways of communication (eg. MIMO, opportunistic communication).

Secrets of Success Information, then computation. It took 60 years, but we got there. Simple models, then complex. The discrete memoryless channel ………… is like the Holy Roman Empire.

Looking Forward Can the success of this way of thinking be broadened to other fields?

Information Theory of DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA

DNA sequencing A basic workhorse of modern biology and medicine. Problem: to obtain the sequence of nucleotides. …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT… courtesy: Batzoglou

Impetus: Human Genome Project 1990: Start 2001 : Draft 2003: Finished 3 billion nucleotides courtesy: Batzoglou 3 billion $$$$

Sequencing gets cheaper and faster Cost of one human genome HGP:$ 3 billion 2004: $30,000, : $100, : $10, : $4, : $1,000 ???: $300 Time to sequence one genome: years  days Massive parallelization. courtesy: Batzoglou

But many genomes to sequence 100 million species (e.g. phylogeny) 7 billion individuals (SNP, personal genomics) cells in a human (e.g. somatic mutations such as HIV, cancer) courtesy: Batzoglou

Whole Genome Shotgun Sequencing Reads are assembled to reconstruct the original DNA sequence. Number of reads read length L ¼ N ¼ 10 8 genome length G ¼ 10 9

A Gigantic Jigsaw Puzzle

Many Sequencing Technologies HGP era: single technology (Sanger) Current: multiple “next generation” technologies (eg. Illumina, SoLiD, Pac Bio, Ion Torrent, etc.) Each technology has different read length, noise statistics, etc Eg.: Illumina: L = 50 to 200, error ~ 1 % substitution Pac Bio: L = 2000 to 4000, error ~ 10-15% indels

Many assembly algorithms Source: Wikipedia

And many more……. A grand total of 42!

Computational View “Since it is well known that the assembly problem is NP- hard, …………” algorithm design based largely on heuristics no optimality or performance guarantees But NP-hardness does not mean it is hopeless to be close to optimal. Can we first define optimality without regard to computational complexity?

Information theoretic view Given a statistical model, what is the read length L and number of reads N needed to reconstruct with probability 1-ε ? Are there computationally efficient assembly algorithms that perform close to the fundamental limits? Open questions!

Reads are uniformly sampled from the DNA sequence. Read process is noiseless. Impact of noise: later. A basic read model

Coverage Analysis Pioneered by Lander-Waterman in What is the number of reads needed to cover the entire DNA sequence with probability 1-²? N cov only provides a lower bound on the number of reads needed for reconstruction. N cov does not depend on the DNA statistics!

Repeat statistics do matter! easier jigsaw puzzle harder jigsaw puzzle How exactly do the fundamental limits depend on repeat statistics?

reconstructable by greedy algorithm Simple model: I.I.D. DNA, G ! 1 (Motahari, Bresler & T. 12) read length L 1 many repeats of length L no repeats of length L normalized # of reads coverage no coverage What about for finite real DNA?

i.i.d. fit data I.I.D. DNA vs real DNA Example: human chromosome 22 (build GRCh37, G = 35M) (Bresler, Bresler & T. 12) Can we derive performance bounds directly in terms of empirical repeat statistics?

Lower bound: Interleaved repeats Necessary condition: all interleaved repeats are bridged. L m m n n In particular: L > longest interleaved repeat length (Ukkonen)

Lower bound: Triple repeats Necessary condition: all triple repeats are bridged In particular: L > longest triple repeat length (Ukkonen) L

Chromosome 22 (Lower Bound) GRCh37 Chr 22 (G = 35M) triple repeat interleaved repeat coverage what is achievable?

Greedy algorithm (TIGR Assembler, phrap, CAP3...) Input: the set of N reads of length L 1.Set the initial set of contigs as the reads 2.Find two contigs with largest overlap and merge them into a new contig 3.Repeat step 2 until only one contig remains

Greedy algorithm: first error at overlap A sufficient condition for reconstruction: repeat bridging read already merged contigs all repeats are bridged L

Chromosome 22 GRCh37 Chr 22 (G = 35M) greedy algorithm lower bound

longest interleaved repeats at length 2248 lower bound longest repeat at Chromosome 19 GRCh37 Chr 19 (G = 55M) greedy algorithm non-interleaved repeats are resolvable!

de Bruijn graph ATAGCCCTAGCGAT [Idury-Waterman 95] [Pevzner et al 01] (K = 4) TAGC AGCC AGCG GCCC GCGA CCCT CCTA CTAG ATAG CGAT 1. Add a node for each K-mer in a read 2. Add edges for adjacent K-mers

Resolving non-interleaved repeats non-interleaved repeat Unique Eulerian path.

Resolving bridged interleaved repeats interleaved repeat bridging read Bridging read resolves one repeat and the unique Eulerian path resolves the other.

Resolving triple repeats triple repeat all copies bridged neighborhood of triple repeat all copies bridged resolve repeat locally

Multibridging De-Brujin Theorem: Original sequence is reconstructable if: 2. interleaved repeats are (single) bridged 3. coverage 1. triple repeats are all-bridged Necessary conditions for ANY algorithm: 1.triple repeats are (single) bridged 1.interleaved repeats are (single) bridged. 2.coverage. (Bresler, Bresler & T. 12)

longest interleaved repeats at length 2248 lower bound longest repeat at Chromosome 19 GRCh37 Chr 19 (G = 55M) De-brujin algorithm close to optimal triple repeat

GAGE Benchmark Datasets Staphylococcus aureus i.i.d. fit data Rhodobacter sphaeroides G = 4,603,060 G = 2,903,081 G =88,289,540 Human Chromosome14

Gap Select a good example that shows the worst case gap and transition window size, and give the expressions. Plot only interleaved lower bound, triple lower bound (dashed) and best upper bd. Sulfolobus islandicus. G = 2,655,198 triple repeat lower bound interleaved repeat lower bound De-Brujin algorithm

Read Noise ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGT T A T A CTT A Illumina noise profile Each symbol corrupted by a noisy channel.

Erasures on i.i.d. uniform DNA Theorem: If the erasure probability is less than 1/3, then noiseless performance can be achieved. A separation architecture is optimal: (Ma, Motahari, Ramchandran & T. 12) error correction assembly

Why? Coverage means most positions are covered by many reads. Aligning noisy reads locally is easier than assembling noiseless reads globally for p erasure < 1/3. noise averaging

Conclusions A systematic approach to assembly design based on information. More powerful than just computational complexity considerations. Simple models are useful for initial insights but a data-driven approach yields a more complete picture.

TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Collaborators Acknowledgments Abolfazl Motahari Guy Bresler Ma’ayan Bresler Nan Ma Kannan Ramchandran Yun Song Lior Pachter Serafim Batzoglou