The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley UBC September 14, 2012 Research supported by NSF.

Slides:

Advertisements

Similar presentations

Design of Experiments Lecture I

Advertisements

Feature Selection as Relevant Information Encoding Naftali Tishby School of Computer Science and Engineering The Hebrew University, Jerusalem, Israel NIPS.

Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.

Information Theory EE322 Al-Sanie.

Paul Cuff THE SOURCE CODING SIDE OF SECRECY TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA.

How to do control theory and discrete maths at the same time?

Amplicon-Based Quasipecies Assembly Using Next Generation Sequencing Nick Mancuso Bassam Tork Computer Science Department Georgia State University.

Chapter 6 Information Theory

On Systems with Limited Communication PhD Thesis Defense Jian Zou May 6, 2004.

TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAA A.

Department of Computer Science, University of Maryland, College Park, USA TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.:

The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.

DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.

Coloring random graphs online without creating monochromatic subgraphs Torsten Mütze, ETH Zürich Joint work with Thomas Rast (ETH Zürich) and Reto Spöhel.

Lattices for Distributed Source Coding - Reconstruction of a Linear function of Jointly Gaussian Sources -D. Krithivasan and S. Sandeep Pradhan - University.

DNA Sequencing. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT.

CS273a Lecture 1, Autumn 10, Batzoglou DNA Sequencing.

CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)

DNA Sequencing. CS273a Lecture 3, Autumn 08, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.

Computational Genomics Lecture 1, Tuesday April 1, 2003.

High Throughput Sequencing: Microscope in the Big Data Era

RNA-Seq Assembly: Fundamental Limits, Algorithms and Software TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse Stanford University Symposium on Turbo.

High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse EASIT Chinese University of Hong Kong.

DNA Sequencing. Next few topics DNA Sequencing  Sequencing strategies Hierarchical Online (Walking) Whole Genome Shotgun  Sequencing Assembly Gene Recognition.

Gaussian Interference Channel Capacity to Within One Bit David Tse Wireless Foundations U.C. Berkeley MIT LIDS May 4, 2007 Joint work with Raul Etkin (HP)

Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly.

Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.

Approximation Algorithms: Bristol Summer School 2008 Seffi Naor Computer Science Dept. Technion Haifa, Israel TexPoint fonts used in EMF. Read the TexPoint.

Channel Polarization and Polar Codes

Coding Schemes for Multiple-Relay Channels 1 Ph.D. Defense Department of Electrical and Computer Engineering University of Waterloo Xiugang Wu December.

The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,

De-novo Assembly Day 4.

Distributed Constraint Optimization Michal Jakob Agent Technology Center, Dept. of Computer Science and Engineering, FEE, Czech Technical University A4M33MAS.

CS 394C March 19, 2012 Tandy Warnow.

Haplotype Blocks An Overview A. Polanski Department of Statistics Rice University.

When rate of interferer’s codebook small Does not place burden for destination to decode interference When rate of interferer’s codebook large Treating.

Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.

Information Theory for Mobile Ad-Hoc Networks (ITMANET): The FLoWS Project Thrust 2 Layerless Dynamic Networks Lizhong Zheng, Todd Coleman.

Channel Capacity

Next generation sequence data and de novo assembly For human genetics By Jaap van der Heijden.

Ran El-Yaniv and Dmitry Pechyony Technion – Israel Institute of Technology, Haifa, Israel Transductive Rademacher Complexity and its Applications.

Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.

Richard W. Hamming Learning to Learn The Art of Doing Science and Engineering Session 13: Information Theory ` Learning to Learn The Art of Doing Science.

CS CM124/224 & HG CM124/224 DISCUSSION SECTION (JUN 6, 2013) TA: Farhad Hormozdiari.

Channel Capacity.

SIZE SELECT SHEAR Shotgun DNA Sequencing (Technology) DNA target sample LIGATE & CLONE Vector End Reads (Mates) SEQUENCE Primer.

Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.

PODC Distributed Computation of the Mode Fabian Kuhn Thomas Locher ETH Zurich, Switzerland Stefan Schmid TU Munich, Germany TexPoint fonts used in.

Problems of Genome Assembly James Yorke and Aleksey Zimin University of Maryland, College Park 1.

Science & Technology Centers Program Center for Science of Information Bryn Mawr Howard MIT Princeton Purdue Stanford Texas A&M UC Berkeley UC San Diego.

Coding Theory Efficient and Reliable Transfer of Information

billion-piece genome puzzle

Fundamental Limitations of Networked Decision Systems Munther A. Dahleh Laboratory for Information and Decision Systems MIT AFOSR-MURI Kick-off meeting,

DNA Sequencing.

Turbo Codes. 2 A Need for Better Codes Designing a channel code is always a tradeoff between energy efficiency and bandwidth efficiency. Lower rate Codes.

A Low-Complexity Universal Architecture for Distributed Rate-Constrained Nonparametric Statistical Learning in Sensor Networks Avon Loy Fernandes, Maxim.

Using Feedback in MANETs: a Control Perspective Todd P. Coleman University of Illinois DARPA ITMANET TexPoint fonts used.

Chapter 5 Sequence Assembly: Assembling the Human Genome.

Plasmodium falciparum (3D7) - published in Draft coverage. No sequence updates for a year. No new annotation since? Leishmania major Friedlin - version.

UNIT I. Entropy and Uncertainty Entropy is the irreducible complexity below which a signal cannot be compressed. Entropy is the irreducible complexity.

UNIT –V INFORMATION THEORY EC6402 : Communication TheoryIV Semester - ECE Prepared by: S.P.SIVAGNANA SUBRAMANIAN, Assistant Professor, Dept. of ECE, Sri.

Information Theory of High-throughput Shotgun Sequencing David Tse Dept. of EECS U.C. Berkeley Tel Aviv University June 4, 2012 Research supported by NSF.

On the path-avoidance vertex-coloring game Torsten Mütze, ETH Zürich Joint work with Reto Spöhel (MPI Saarbrücken) TexPoint fonts used in EMF. Read the.

Lesson: Sequence processing

Nearly optimal classification for semimetrics

CSCI2950-C Genomes, Networks, and Cancer

Reasoning about Performance in Competition and Cooperation

Science of Information: Case Studies in DNA and RNA assembly

How to Solve NP-hard Problems in Linear Time

General Strong Polarization

Presentation transcript:

The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley UBC September 14, 2012 Research supported by NSF Center for Science of Information. Joint work with A. Motahari, G. Bresler, M. Bresler. Acknowledgements: S. Batzolgou, L. Pachter, Y. Song TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA

Communication: the beginning Prehistoric: smoke signals, drums. 1837: telegraph 1876: telephone 1897: radio 1927: television Communication design tied to the specific source and specific physical medium.

Grand Unification Shannon 48 Theorem: Model all sources and channels statistically. A unified way of looking at all communication problems in terms of information flow. source reconstructed source

60 Years Later All communication systems are designed based on the principles of information theory. A benchmark for comparing different schemes and different channels. Suggests totally new ways of communication.

Secrets of Success Information, then computation. It took 60 years, but we got there. Simple models, then complex. The discrete memoryless channel ………… is like the Holy Roman Empire. Infinity, and then back. Allows us to think in terms of typical behavior.

TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Tom Cover, 1990 “Asymptotic limit is the first term in the Taylor series expansion at infinity. And theory is the first term in the Taylor series of practice.”

Looking Forward Can the success of this way of thinking be broadened to other fields?

Information Theory of DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA

DNA sequencing DNA: the blueprint of life Problem: to obtain the sequence of nucleotides. …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT… courtesy: Batzoglou

Impetus: Human Genome Project 1990: Start 2001 : Draft 2003: Finished 3 billion nucleotides courtesy: Batzoglou 3 billion $$$$

Sequencing gets cheaper and faster Cost of one human genome HGP:$ 3 billion 2004: $30,000, : $100, : $10, : $4, : $1,000 ???: $300 Time to sequence one genome: years  days Massive parallelization. courtesy: Batzoglou

But many genomes to sequence 100 million species (e.g. phylogeny) 7 billion individuals (SNP, personal genomics) cells in a human (e.g. somatic mutations such as HIV, cancer) courtesy: Batzoglou

Whole Genome Shotgun Sequencing Reads are assembled to reconstruct the original DNA sequence.

A Gigantic Jigsaw Puzzle

Many Sequencing Technologies HGP era: single technology (Sanger) Current: multiple “next generation” technologies (eg. Illumina, SoLiD, Pac Bio, Ion Torrent, etc.) Each technology has different read length, noise statistics, etc

Many assembly algorithms Source: Wikipedia

And many more……. A grand total of 42! Why is there no single optimal algorithm?

An information theoretic question What is the minimum number of reads required for reliable reconstruction? A benchmark for comparing different algorithms and different technologies. An open question!

Coverage Analysis Pioneered by Lander-Waterman What is the minimum number of reads to ensure there is no gap between the reads with a desired prob.? Only provides a lower bound on the minimum number of reads to reconstruct. Clearly not tight.

Communication and Sequencing: An Analogy Communication: Sequencing: Question: what is the max. sequencing rate such that reliable reconstruction is asymptotically possible? source sequence

A Simple Model DNA sequence: i.i.d. with marginal distribution p=(p A, p G, p T, p C ). Starting positions of reads: uniform on the DNA sequence. Read process: noiseless. Will build on this to look at statistics from genomic data.

The read channel Capacity depends on –read length: L –DNA length: G Normalized read length: Eg. L = 100, G = 3 x 10 9 : read channel AGCTTATAGGTCCGCATTACC AGGTCC L G

Result: Sequencing Capacity Renyi entropy of order 2 The higher the entropy, the easier the problem! coverage constraint

Complexity is in the eye of the beholder Low entropy High entropy harder to compress easier jigsaw puzzle harder jigsaw puzzle easier to compress

A necessary condition for reconstruction interleaved repeats of length None of the copies is straddled by a read (unbridged). Reconstruction is impossible! Special cases: Under i.i.d. model, greedy is optimal for mixed reads.

A sufficient condition: greedy algorithm Input: the set of N reads of length L 1.Set the initial set of contigs as the reads 2.Find two contigs with largest overlap and merge them into a new contig 3.Repeat step 2 until only one contig remains Algorithm progresses in stages: at stage contig merge reads at overlap TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAA overlap

Greedy algorithm: stage merge mistake => there must be a -repeat and each copy is not bridged by a read. A sufficient condition for reconstruction: There are no unbridged - repeats for any. repeat bridging read already merged contigs

Summary Necessary condition for reconstruction: No unbridged interleaved -repeats for any. Sufficient condition (via greedy algorithm) No unbridged -repeats for any. For the i.i.d. DNA model: 1) no unbridged interleaved repeats => w.h.p. no unbridged repeats. 2) Errors occur typically when either

Capacity Result Explained interleaved (L-1)- repeats L-1 coverage constraint

Comparison of assembly algorithms greedy sequential De Brujin graph

Achievable rates of algorithms sequential algorithm De Brujin graph algorithm

Critical phenomenon coverage-limited regime repeat-limited regime Question: Is this clean state of affairs tied to the i.i.d. DNA model?

i.i.d. fit data I.I.D. DNA vs real DNA Example: human chromosome 22 (build GRCh37, G = 35M)

Greedy algorithm: general DNA statistics Probability of failure at stage depends on: –# of - repeats –probability of not bridging a - repeat Necessary condition translates similarly to an upper bound on capacity: ? ?

upper bound longest repeat at upper bound Sequencing capacity on Chr 22 statistics GRCh37 Chr 22 (G = 35M) longest interleaved repeats at 2325 critical phenomenon still occurs

longest interleaved repeats at length 2248 upper bound longest repeat at Chromosome 19 GRCh37 Chr 19 (G = 55M) A hybrid greedy/De Brujin graph algorithm closes the gap and is near optimal on all 22 chromosomes.

Ongoing work Noisy reads. Reference-based assembly. Partial reconstruction. ……...

Conclusion Information theory has made a huge impact on communication. Its success stems from focusing on something fundamental: information. This philosophy may be useful for other important engineering problems. DNA sequencing is a good example.