Download presentation
Presentation is loading. Please wait.
Published byEthan Williamson Modified over 9 years ago
1
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow
2
2
3
DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp 10,000bp Shotgun DNA Sequencing Not all sequencing technologies produce mate-pairs. Different error models Different read lengths
4
Basic Challenge Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome Issues: –Coverage –Errors in reads –Reads vary from very short (35bp) to quite long (800bp), and are double-stranded –Non-uniqueness of solution –Running time and memory
5
DNA Sequences A DNA sequence is a string over the alphabet {A,C,G,T}. The length of the sequence is the number of letters in the string. These are all DNA sequences, and they are all substrings of a larger DNA sequence. What is that larger DNA sequence? TAATACTTAGG TAGGCCA GCCAGGAAT GAATAAGCCAA GCCAATTT AATTTGGAAT GGAATTAAGGCAC AGGCACGTTTA CACGTTAGGACCATT GGACCATTTAATACGGAT
6
Simplest scenario Reads have no error Read are long enough that each appears exactly once in the genome Each read given in the same orientation (all 5’ to 3’, for example)
7
De novo vs. comparative assembly De novo assembly means you do everything from scratch Comparative assembly means you have a “reference” genome. For example, you want to sequence your own genome, and you have Craig Venter’s genome already sequenced. Or you want to sequence a chimp genome and you have a human already sequenced.
8
De Novo assembly Much easier to do with long reads Need very good coverage Generally produces fragmented assemblies Necessary when you don’t have a closely related (and correctly assembled) reference genome
9
9 De Novo Assembly paradigms Overlap Graph: overlap-layout-consensus methods –greedy (TIGR Assembler, phrap, CAP3...) –graph-based (Celera Assembler, Arachne) k-mer graph (especially useful for assembly from short reads)
10
Example Reads: TAATACTTAGG TAGGCCA GCCAGGAAT GAATAAGCCAA GCCAATTT AATTTGGAAT GGAATTAAGGCAC AGGCACGTTTA CACGTTAGGACCATT GGACCATTTAATACGGAT If minimum overlap is 3, what graph do we get? If minimum overlap is 4, what graph do we get? If minimum overlap is 5, what graph do we get?
11
k-mer graph (a.k.a. de Bruijn graph) The reads are k-mers and each corresponds to one edge in the graph The vertices are (k-1)-mers that appear in some read, and edges defined by overlap of k-2 nucleotides. Small values of k produce small graphs. Note this creates a directed graph (“digraph”)!
12
Eulerian Paths in Digraphs An Eulerian path is one that goes through every edge exactly once. (The endpoints of the Eulerian path are distinct!) For directed graphs, the cycle will need to follow the direction of the edges (also called “arcs”). Indegree(v) = # edges coming into v Outdegree(v) = # edges leaving v A digraph has an Eulerian path if and only if indegree(v)=outdegree(v) for all but 2 nodes (x and y), where indegree(x)=outdegree(x)+1, and indegree(y)=outdegree(y)-1.
13
de Bruijn Graphs are Eulerian If the k-mer set comes from a sequence and every k-mer appears exactly once in the sequence, then the de Bruijn graph has an Eulerian path!
14
de Bruijn Graph Create the de Bruijn graph for the following string, using k=5 –ACATAGGATTCAC Find the Eulerian path Is the Eulerian path unique? Reconstruct the sequence from this path
15
Constructing the de Bruijn Graph Create the de Bruijn graph for the following string, using k=5 –ACATAGGATTCAC De Bruijn graph for k=5: –vertices are 4-mers –edges are 5-mers
16
Constructing the de Bruijn Graph Create the de Bruijn graph for the following string, using k=5 –ACATAGGATTCAC Edges (5-mers): ACATA, CATAG, ATAGG, TAGGA, AGGAT, GGATT, GATTC, ATTCA, TTCAC Vertices (4-mers, two for each 5-mer): ACAT, CATA, ATAG, TAGG, AGGA, GGAT, GATT, ATTC, TCAC
17
de Bruijn Graph Create the de Bruijn graph for the following string, using k=5 –ACATAGGATTCAC Find the Eulerian path Is the Eulerian path unique? Reconstruct the sequence from this path
18
Constructing the de Bruijn Graph Create the de Bruijn graph for the following string, using k=5 –ACATAGGATTCAC Edges (5-mers): ACATA, CATAG, ATAGG, TAGGA, AGGAT, GGATT, GATTC, ATTCA, TTCAC Vertices (4-mers, two for each 5-mer): ACAT, CATA, ATAG, TAGG, AGGA, GGAT, GATT, ATTC, TCAC
19
Using de Bruijn Graphs Given: set of k-mers from a set of reads for a sequence Algorithm: Construct the de Bruijn graph Try to find an Eulerian path in the graph It’s not as simple as it looks in practice, but finding Eulerian paths is still the basis of most genome assembly methods!
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.