High Throughput Sequencing: Microscope in the Big Data Era. David Tse. EASIT, Chinese University of Hong Kong, July 7, 2014. Research supported by NSF Center for Science of Information.

DNA sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

High throughput sequencing revolution tech. driver for communications

Shotgun sequencing read

Technologies
Sequencer: Sanger 3730xl; 454 GS; Ion Torrent; SOLiDv4; Illumina HiSeq 2000; PacBio
Mechanism: dideoxy chain termination; pyrosequencing; detection of hydrogen ion; ligation and two-base coding; reversible nucleotides; single molecule real time
Read length: 400-900 bp; 700 bp; ~400 bp; 50 + 50 bp; 100 bp PE; 1000~10000 bp
Error rate: 0.001%; 0.1%; 2%; 0.1%; 2%; 10-15%
Output data (per run): 100 KB; 1 GB; 100 GB; 1 TB; 10 GB [one of the six values is missing in the transcript]

High throughput sequencing: Microscope in the big data era Genomic variations, 3-D structures, transcription, translation, protein interaction, etc. The quantities measured can be dynamic and vary spatially. Example: RNA expression is different in different tissues and at different times.

Computational problems for high throughput data
– Measure data: assembly (de novo), variant calling (reference-based assembly)
– Manage data: compression, privacy
– Utilize data: genome wide association studies, phylogenetic tree reconstruction, pathogen detection
Scope of this tutorial: de novo assembly.

Assembly: three points of view
– Software engineering
– Computational complexity theoretic
– Information theoretic

Assembly as a software engineering problem. A single sequencing experiment can generate hundreds of millions of reads, amounting to tens to hundreds of gigabytes of data. Primary concerns are to minimize time and memory requirements. There is no guarantee on the optimality of assembly quality, and in fact no optimality criterion at all.

Computational complexity view
Formulate the assembly problem as a combinatorial optimization problem:
– Shortest common superstring (Kececioglu-Myers 95)
– Maximum likelihood (Medvedev-Brudno 09)
– Hamiltonian path on overlap graph (Nagarajan-Pop 09)
Typically NP-hard and even hard to approximate. Does not address the question of when the solution reconstructs the ground truth.

Information theoretic view Basic question: What is the quality and quantity of read data needed to reliably reconstruct?

Tutorial outline: I. De novo DNA assembly. II. De novo RNA assembly.

Themes Interplay between information and computational complexity. Role of empirical data in driving theory and algorithm development.

Part I: De Novo DNA Assembly

Shotgun sequencing model Basic model : uniformly sampled reads. Assembly problem: reconstruct the genome given the reads.
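As a rough illustration of the basic model, the sketch below draws reads whose start positions are uniform over a linear genome. The function name `sample_reads`, the toy genome string, and the choice to ignore wrap-around at the ends are assumptions made for this example, not details from the slides.

```python
import random

def sample_reads(genome, num_reads, read_len, seed=0):
    """Draw reads whose start positions are uniform over a linear genome."""
    last_start = len(genome) - read_len
    if last_start < 0:
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # seeded so the example is repeatable
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [genome[s:s + read_len] for s in starts]

# Toy genome (hypothetical); real genomes are millions of bases long.
genome = "ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGT"
reads = sample_reads(genome, num_reads=10, read_len=8)
print(reads)
```

The assembly problem is then to recover `genome` from `reads` alone, without the start positions.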

A Gigantic Jigsaw Puzzle

Challenges
– Long repeats [Figure: human Chr 22 repeat length histogram]
– Read errors [Figure: Illumina read error profile]

Two-step approach. First, we assume the reads are noiseless and derive fundamental limits and near-optimal assembly algorithms. Then, we add noise and see how things change.

Repeat statistics
[Figure: two repeat-length profiles, labelled "easier jigsaw puzzle" and "harder jigsaw puzzle".]
How exactly do the fundamental limits depend on repeat statistics?

Lower bound: coverage
Introduced by Lander-Waterman in 1988. What is the number of reads N_LW needed to cover the entire DNA sequence with probability 1-ε?
N_LW only provides a lower bound on the number of reads needed for reconstruction.
N_LW does not depend on the DNA repeat statistics!
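The slide does not give a formula for N_LW. A commonly quoted approximation takes N_LW to be the solution of N = (G/L) ln(N/ε), where G is the genome length and L the read length. The sketch below solves that equation by fixed-point iteration; the equation itself, the function name, and the example numbers are assumptions for illustration.

```python
import math

def lander_waterman_reads(G, L, eps, iterations=100):
    """Solve N = (G / L) * ln(N / eps) for N by fixed-point iteration."""
    n = G / L  # starting guess: just enough reads to tile the genome once
    for _ in range(iterations):
        n = (G / L) * math.log(n / eps)
    return n

# Illustrative numbers only: G = 55M, 100 bp reads, eps = 5%.
G, L, eps = 55_000_000, 100, 0.05
n_lw = lander_waterman_reads(G, L, eps)
print(f"N_LW ~ {n_lw:.3g} reads, coverage depth ~ {n_lw * L / G:.1f}")
```

Note that only G, L and ε enter the computation, which is the point made on the slide: the coverage bound is blind to repeat structure.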

Simple model: i.i.d. DNA, G → ∞ (Motahari, Bresler & Tse 12)
[Figure: normalized # of reads versus read length L. Regions: many repeats of length L / no repeats of length L; coverage / no coverage; reconstructable by greedy algorithm.]
What about finite real DNA?

I.I.D. DNA vs real DNA
Example: human chromosome 22 (build GRCh37, G = 35M) (Bresler, Bresler & Tse 12)
[Figure: data compared with the i.i.d. fit.]
Can we derive performance bounds on an individual sequence basis?

Individual sequence performance bounds (Bresler, Bresler, Tse, BMC Bioinformatics 13)
[Figure: Human Chr 19 Build 37; repeat length and L critical marked; curves for GREEDY, DE BRUIJN, SIMPLE BRIDGING and MULTIBRIDGING against the Lander-Waterman coverage bound and the lower bound.]
Given a genome s

GAGE Benchmark Datasets (http://gage.cbcb.umd.edu/)
– Rhodobacter sphaeroides, G = 4,603,060
– Staphylococcus aureus, G = 2,903,081
– Human Chromosome 14, G = 88,289,540
[Figure: for each dataset, MULTIBRIDGING performance against the lower bound.]

Lower bound: Interleaved repeats
Necessary condition: all interleaved repeats are bridged. In particular: L > longest interleaved repeat length (Ukkonen).
[Figure: reads of length L over two interleaved repeat pairs, with copies labelled m, m and n, n.]

Lower bound: Triple repeats
Necessary condition: all triple repeats are bridged. In particular: L > longest triple repeat length (Ukkonen).
[Figure: reads of length L over the three copies of a triple repeat.]
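To make the repeat lengths concrete, here is a brute-force sketch that finds the length of the longest substring occurring at least a given number of times. It treats a "triple repeat" simply as a substring with three or more (possibly overlapping) occurrences, which is an assumed simplification, and it does not detect interleaved repeats. The function names are hypothetical.

```python
from collections import Counter

def has_repeat(s, length, copies):
    """True if some substring of the given length occurs at least `copies` times."""
    counts = Counter(s[i:i + length] for i in range(len(s) - length + 1))
    return any(c >= copies for c in counts.values())

def longest_repeat(s, copies):
    """Length of the longest substring of s occurring at least `copies` times."""
    # Binary search is valid because a repeated substring of length l
    # implies a repeated substring of every shorter length.
    lo, hi = 0, len(s)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(s, mid, copies):
            lo = mid
        else:
            hi = mid - 1
    return lo

genome = "ATAGACCCTAGACGAT"
print(longest_repeat(genome, 2))  # longest repeat (two copies: TAGAC)
print(longest_repeat(genome, 3))  # longest triple repeat (three copies: GA)
```

On real genomes a suffix array or suffix tree would be the usual tool; the k-mer counting here is only meant for short strings.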

Individual sequence performance bounds (Bresler, Bresler, Tse, BMC Bioinformatics 13)
[Figure: Human Chr 19 Build 37; axis: length; Lander-Waterman coverage bound and lower bound.]

Greedy algorithm (TIGR Assembler, phrap, CAP3...)
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads.
2. Find two contigs with largest overlap and merge them into a new contig.
3. Repeat step 2 until only one contig remains.
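The three steps can be sketched directly in Python. This is a toy version written for clarity rather than speed, not the implementation used by TIGR Assembler, phrap or CAP3. Removing duplicate reads and dropping contigs already contained in a merged contig are safeguards added here as assumptions.

```python
import random

def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # step 1, with duplicate reads removed
    while len(contigs) > 1:               # step 3
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(contigs):   # step 2: search all ordered pairs
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = contigs[best_i] + contigs[best_j][best_k:]
        # Keep the untouched contigs, dropping any that the merge now contains.
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j) and c not in merged]
        contigs.append(merged)
    return contigs[0]

# Toy example: one read at every position, so every repeat is bridged.
genome = "ATAGACCCTAGACGAT"
L = 7
reads = [genome[i:i + L] for i in range(len(genome) - L + 1)]
random.Random(0).shuffle(reads)
assembled = greedy_assemble(reads)
print(assembled == genome)
```

The toy genome has a longest repeat of length 5, so reads of length 7 bridge it and every largest-overlap merge joins truly adjacent contigs.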

Greedy algorithm: first error at overlap
A sufficient condition for reconstruction: all repeats are bridged.
[Figure: a repeat, a bridging read of length L, and already merged contigs.]

Back to chromosome 19
GRCh37 Chr 19 (G = 55M)
[Figure labels: lower bound, longest interleaved repeats at length 2248; greedy algorithm, longest repeat at (value not preserved in the transcript).]
Non-interleaved repeats are resolvable!

Dense Read Model As the number of reads N increases, one can recover exactly the L-spectrum of the genome. If there is at least one non-repeating L-mer on the genome, this is equivalent information to having a read at every starting position on the genome. Key question: What is the minimum read length L for which the genome is uniquely reconstructible from its L-spectrum?
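As a small illustration, the L-spectrum of a known string can be computed by sliding a window of length L. The sketch treats the spectrum as a multiset (L-mers with their counts) over a linear genome; both choices, and the function name, are assumptions for this example.

```python
from collections import Counter

def l_spectrum(genome, L):
    """Multiset of all length-L substrings of a linear genome."""
    return Counter(genome[i:i + L] for i in range(len(genome) - L + 1))

genome = "ATAGACCCTAGACGAT"
spectrum = l_spectrum(genome, 5)
for lmer, count in sorted(spectrum.items()):
    print(lmer, count)
```

In the dense read model, the reads themselves play the role of the window: one read per starting position yields exactly these L-mers.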

de Bruijn graph (example: ATAGACCCTAGACGAT, L = 5)
1. Add a node for each (L-1)-mer on the genome.
2. Add k edges between two (L-1)-mers if their overlap has length L-2 and the corresponding L-mer appears k times in the genome.
[Figure: graph with node labels TAGA, AGCC, AGCG, GCCC, GCGA, CCCT, CCTA, CTAG, ATAG, CGAT, AGAC.]
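The two construction rules translate into a few lines of Python. In this sketch the graph is stored as a counter from (source node, target node) pairs to edge multiplicity; that representation and the function name are assumptions chosen for brevity.

```python
from collections import Counter

def de_bruijn_edges(genome, L):
    """Map each (prefix, suffix) pair of (L-1)-mers to its edge multiplicity."""
    edges = Counter()
    for i in range(len(genome) - L + 1):
        lmer = genome[i:i + L]
        # An L-mer joins its first L-1 letters to its last L-1 letters;
        # the two nodes overlap in L-2 letters.
        edges[(lmer[:-1], lmer[1:])] += 1
    return edges

genome = "ATAGACCCTAGACGAT"
edges = de_bruijn_edges(genome, 5)
nodes = {node for pair in edges for node in pair}
for (u, v), k in sorted(edges.items()):
    print(f"{u} -> {v} x{k}")
```

An L-mer that occurs k times in the genome shows up as an edge with multiplicity k, matching rule 2.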

Eulerian path (example: ATAGACCCTAGACGAT, L = 5)
Theorem (Pevzner 95): If L > max(l_interleaved, l_triple), then the de Bruijn graph has a unique Eulerian path which is the original genome.
[Figure: the de Bruijn graph of the example with its Eulerian path.]
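One standard way to find an Eulerian path is Hierholzer's algorithm; the slides do not say which method is used, so this choice is an assumption. The sketch builds the (L-1)-mer graph from a list of L-mers, walks an Eulerian path, and spells out the sequence. When the path is unique, as the theorem guarantees under its condition, the output is the genome.

```python
from collections import defaultdict

def eulerian_sequence(lmers):
    """Spell the sequence along an Eulerian path of the (L-1)-mer graph."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # A linear genome starts at the node with one more outgoing edge;
    # fall back to any node with edges if the graph is balanced.
    start = next((n for n, b in balance.items() if b == 1),
                 next(iter(out_edges)))
    stack, path = [start], []
    while stack:  # Hierholzer: walk until stuck, then back up
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

genome = "ATAGACCCTAGACGAT"
L = 5
lmers = [genome[i:i + L] for i in range(len(genome) - L + 1)]
rebuilt = eulerian_sequence(lmers)
print(rebuilt == genome)
```

If the graph has several Eulerian paths, this function returns just one of them, which need not be the true genome.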

Resolving non-interleaved repeats
[Figure: a non-interleaved repeat in the condensed sequence graph; the Eulerian path is unique.]

From dense reads to shotgun reads [Idury-Waterman 95] [Pevzner et al 01]
Idea: mimic the dense read scenario by looking at K-mers of the length-L reads. Construct the K-mer graph and find an Eulerian path.
Success if we have K-coverage of the genome and K > L critical.
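A minimal sketch of the K-mer idea: chop every read into its K-mers and check whether they include every K-mer of the genome. Comparing against a known genome is only possible in a simulation, and treating "K-coverage" as "every genome K-mer appears in some read" is a simplifying assumption. The function names and the three toy reads are hypothetical.

```python
def kmers_from_reads(reads, K):
    """Set of all K-mers that occur in at least one read."""
    return {r[i:i + K] for r in reads for i in range(len(r) - K + 1)}

def has_k_coverage(genome, reads, K):
    """True if every K-mer of the genome occurs in some read (simplified check)."""
    genome_kmers = {genome[i:i + K] for i in range(len(genome) - K + 1)}
    return genome_kmers <= kmers_from_reads(reads, K)

genome = "ATAGACCCTAGACGAT"
# Three reads of length 8; successive reads overlap by 4 bases.
reads = ["ATAGACCC", "ACCCTAGA", "TAGACGAT"]
print(has_k_coverage(genome, reads, 5))
print(has_k_coverage(genome, reads, 6))
```

Raising K helps resolve longer repeats but demands larger overlaps between successive reads, which is the trade-off behind the success condition on the slide.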

De Bruijn algorithm: performance
[Figure: Human Chr 19 Build 37; axis: length; curves for GREEDY and DE BRUIJN against the Lander-Waterman coverage bound and the lower bound.]
Loss of info. from the reads!

Resolving bridged interleaved repeats
A bridging read resolves one repeat and the unique Eulerian path resolves the other.
[Figure: an interleaved repeat with a bridging read.]

Simple bridging: performance
[Figure: Human Chr 19 Build 37; axis: length; curves for GREEDY, DE BRUIJN and SIMPLE BRIDGING against the Lander-Waterman coverage bound and the lower bound.]

Resolving triple repeats
When all copies of a triple repeat are bridged, the repeat can be resolved locally.
[Figure: a triple repeat and its neighborhood, with all copies bridged.]

Triple Repeats: subtleties

Multibridging de Bruijn
Theorem (Bresler, Bresler, Tse 13): The original sequence is reconstructible if:
1. triple repeats are all-bridged
2. interleaved repeats are (single) bridged
3. coverage
Necessary conditions for ANY algorithm:
1. triple repeats are (single) bridged
2. interleaved repeats are (single) bridged
3. coverage

Multibridging: near optimality for Chr 19
[Figure: Human Chr 19 Build 37; axis: length; curves for GREEDY, DE BRUIJN, SIMPLE BRIDGING and MULTIBRIDGING against the Lander-Waterman coverage bound and the lower bound.]

GAGE Benchmark Datasets (http://gage.cbcb.umd.edu/)
– Rhodobacter sphaeroides, G = 4,603,060
– Staphylococcus aureus, G = 2,903,081
– Human Chromosome 14, G = 88,289,540
[Figure: for each dataset, MULTIBRIDGING performance against the lower bound, with L critical marked.]
L critical = length of the longest triple or interleaved repeat.

Gap
Sulfolobus islandicus, G = 2,655,198
[Figure: triple repeat lower bound, interleaved repeat lower bound, and the MULTIBRIDGING algorithm.]

Complexity: Computational vs Informational
Complexity of MULTIBRIDGING
– For a length-G genome, O(G^2)
Alternate formulations of assembly
– Shortest Common Superstring: NP-hard
– Greedy is O(G), but only a 4-approximation to SCS in the worst case
– Maximum Likelihood: NP-hard
Key differences
– We are concerned only with instances when reads are informationally sufficient to reconstruct the genome.
– Individual sequence formulation lets us focus on issues arising only in real genomes.

Read Errors
Error rate and nature depend on sequencing technology. Examples:
– Illumina: 0.1 – 2% substitution errors
– PacBio: 10 – 15% indel errors
We will focus on a simple substitution noise model with noise parameter p.
[Figure: the sequence ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGT with substituted bases marked.]
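The slide does not spell out the substitution model. One natural reading, assumed here, is that each base is independently replaced with probability p by one of the other three bases chosen uniformly. The sketch below applies that channel to a read; the function name is hypothetical.

```python
import random

BASES = "ACGT"

def add_substitution_noise(read, p, rng):
    """Replace each base, independently with probability p, by a different base."""
    noisy = []
    for base in read:
        if rng.random() < p:
            # Assumed model: the replacement is uniform over the other three bases.
            noisy.append(rng.choice([b for b in BASES if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)

rng = random.Random(0)
read = "ACGTCCTATGCGTATGCGTAATGCC"
noisy_read = add_substitution_noise(read, p=0.02, rng=rng)
print(read)
print(noisy_read)
```

Under this model, p = 3/4 makes every output base uniform regardless of the input, so the noisy reads would carry no information about the genome.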

Consistency
Basic question: What is the impact of noise on L critical? This question is equivalent to whether the L-spectrum is exactly recoverable as the number of noisy reads N → ∞.
Theorem (C.C. Wang 13): Yes, for all p except p = ¾.

What about coverage depth?
Theorem (Motahari, Ramchandran, Tse, Ma 13): Assume an i.i.d. genome model. If the read error rate p is less than a threshold, then Lander-Waterman coverage is sufficient for L > L critical. For the uniform distribution on {A,G,C,T}, the threshold is 19%.
A separation architecture is optimal: error correction, then assembly.

Why?
Coverage means most positions are covered by many reads.
Multiple alignment of overlapping noisy reads is possible if ...
Assembly using noiseless reads is possible if ...
[Figure labels: noise averaging, M.]

From theory to practice
Two issues:
1) Multiple alignment is performed by testing joint typicality of M sequences, which is computationally too expensive. Solution: use the technique of fingerprinting.
2) Real genomes are not i.i.d. Solution: replace greedy by multibridging.

X-phased multibridging (Lam, Khalak, Tse, Recomb-Seq 14)
[Figure: Prochlorococcus marinus, substitution errors of rate 1.5%; L critical marked.]

More results
[Figure panels: Helicobacter pylori, Methanococcus maripaludis, Mycoplasma agalactiae, Prochlorococcus marinus; L critical marked.]

A more careful look
[Figure: Mycoplasma agalactiae; L critical and L critical-approx marked.]

Approximate repeat example: Yersinia pestis. Exact triple repeat, length 1662; approximate triple repeat, length 5608.

Acknowledgements (DNA Assembly, RNA Assembly)
– Guy Bresler (MIT)
– Ma’ayan Bresler (Berkeley)
– Ka Kit Lam (Berkeley)
– Asif Khalak (Pacific Biosciences)
– Lior Pachter (Berkeley)
– Joseph Hui (Berkeley)
– Kayvon Mazooji (Berkeley)
– Abolfazl Motahari (Sharif)
– Soheil Mohajer (U. of Minnesota)
– Eren Sasoglu (Berkeley)
– Sreeram Kannan (Berkeley)
