CAP5510 – Bioinformatics Sequence Assembly


1 CAP5510 – Bioinformatics Sequence Assembly
Tamer Kahveci CISE Department University of Florida

2 What is Sequence Assembly?
We can only sequence short fragments (100–500 bases). How can we sequence long sequences (e.g., a single chromosome can have hundreds of millions of bases)?
  Chop the long sequence into many small fragments
  Sequence all fragments
  Put them together to reconstruct the long sequence
Problem: Consider a long sequence S. Given a collection of subsequences (aka fragments or reads) of S, denoted R = {r1, r2, …, rn}, construct S from R.
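The "chop into fragments" step can be simulated to produce test input for an assembler. This is only a sketch: the function name sample_reads, its parameters, and the uniform random choice of start and length are assumptions for illustration, not part of the lecture.

```python
import random

def sample_reads(sequence, num_reads, min_len=100, max_len=500, seed=0):
    """Draw num_reads random substrings (reads) of sequence.

    Read lengths follow the 100-500 base range from the slide;
    everything else here is an illustrative assumption.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    reads = []
    for _ in range(num_reads):
        # Never ask for a read longer than the sequence itself.
        length = min(rng.randint(min_len, max_len), len(sequence))
        start = rng.randint(0, len(sequence) - length)
        reads.append(sequence[start:start + length])
    return reads
```

The returned list plays the role of R; the original order and positions of the reads are deliberately discarded, which is what makes assembly hard.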

3 Sequence Assembly
Coverage: the average number of reads in R that contain a given base of S.
Issues:
  Errors in R
  Repeats in S
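Under the definition of coverage given here, the average can be computed as the total number of read bases divided by the length of S. A minimal sketch follows. The function names and the (start, length) placement format are assumptions for illustration, and in a real assembly the placements are not known in advance.

```python
def average_coverage(genome_length, read_lengths):
    """Average number of reads containing a base: total read bases / |S|."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length

def per_base_coverage(genome_length, placements):
    """Count, for every position of S, how many reads contain it.

    placements: list of (start, length) pairs, 0-based
    (a hypothetical input format chosen for this sketch).
    """
    depth = [0] * genome_length
    for start, length in placements:
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth
```

For example, 50 reads of 100 bases each over a sequence of 1000 bases give an average coverage of 5.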

4 Assemblers
De novo: No prior knowledge about S. Slow.
  Phusion (Mullikin & Ning 2003)
  Arachne (Batzoglou et al. 2002)
  CAP (Huang & Madan, 1992)
Mapping: A sequence similar to S is known. Needs prior knowledge of S.
  SHRiMP (Rumble et al. 2009)

5 Phusion (Mullikin & Ning 2003)
Clipping: Remove low-quality reads, clip ends.
Clustering: Group similar reads together.
  Create a histogram of k-mers (k = 17)
  Remove repetitive ones (13 or more occurrences)

6 Phusion (Mullikin & Ning 2003)
Clipping: Remove low-quality reads, clip ends.
Clustering: Group similar reads together.
  Create a histogram of k-mers (k = 17)
  Remove repetitive ones (13 or more occurrences)
  Keep a list for each k-mer showing the reads that contain it
  Find all pairs of reads sharing at least one k-mer
  Keep the number of common k-mers for each such pair
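The clustering steps listed on this slide can be sketched as follows. This is an illustration of the idea, not Phusion's actual implementation: the function names are made up, reverse complements and sequencing errors are ignored, and the counts are of distinct shared k-mers.

```python
from collections import Counter, defaultdict
from itertools import combinations

def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def shared_kmer_counts(reads, k=17, max_occurrences=12):
    """Return {(i, j): number of distinct non-repetitive k-mers shared by reads i and j}."""
    # Histogram of k-mer occurrences over all reads.
    histogram = Counter(km for read in reads for km in kmers(read, k))

    # For each k-mer that is kept, the set of reads containing it.
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for km in kmers(read, k):
            if histogram[km] <= max_occurrences:  # drop k-mers seen 13+ times
                index[km].add(read_id)

    # Count common k-mers for every pair of reads sharing at least one.
    pair_counts = Counter()
    for read_ids in index.values():
        for i, j in combinations(sorted(read_ids), 2):
            pair_counts[(i, j)] += 1
    return dict(pair_counts)
```

Pairs with a high count are likely to overlap in S, so they are natural candidates to place in the same cluster.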

7 Phusion (Mullikin & Ning 2003)
Clipping: Remove low-quality reads, clip ends.
Clustering: Group similar reads together.
Assemble each cluster into a contig
  Given a pair of reads, extend their matching k-mers
Join overlapping contigs
  If two contigs share a read, try to put them together into a longer contig by splicing them first.
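Extending a matching k-mer between two reads can be sketched as below. This assumes exact matching with no sequencing errors, which is a simplification rather than what Phusion does; the function names and the seed positions pos_a and pos_b are hypothetical.

```python
def extend_match(a, b, pos_a, pos_b, k):
    """Extend an exact k-mer match a[pos_a:pos_a+k] == b[pos_b:pos_b+k]
    to the left and right while the bases agree.

    Returns (start_a, start_b, length) of the extended match.
    """
    if a[pos_a:pos_a + k] != b[pos_b:pos_b + k]:
        raise ValueError("seed k-mers do not match")
    left = 0
    while (pos_a - left > 0 and pos_b - left > 0
           and a[pos_a - left - 1] == b[pos_b - left - 1]):
        left += 1
    right = 0
    while (pos_a + k + right < len(a) and pos_b + k + right < len(b)
           and a[pos_a + k + right] == b[pos_b + k + right]):
        right += 1
    return pos_a - left, pos_b - left, left + k + right

def merge_if_dovetail(a, b, pos_a, pos_b, k):
    """Merge a and b when the extended match is a suffix of a and a prefix of b.

    Returns the merged string, or None if the match is not such an overlap.
    """
    start_a, start_b, length = extend_match(a, b, pos_a, pos_b, k)
    if start_b == 0 and start_a + length == len(a):
        return a + b[length:]
    return None
```

With a = "TTACGTAC" and b = "CGTACGGA", the shared 3-mer "GTA" (position 4 in a, position 1 in b) extends to "CGTAC", and the two reads merge into "TTACGTACGGA".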

8 Euler (Pevzner et al. 2001)
Clipping: Remove low-quality reads, clip ends.
Clustering: Group similar reads together.
Assemble each cluster into a contig
  Create a de Bruijn graph
    Each node is a k-mer
    A directed edge indicates a dovetail overlap of k-1 positions
  Find the Eulerian path on this graph (visit each edge once): polynomial time
  Not the Hamiltonian path (visit each vertex once): NP-complete
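The de Bruijn graph construction and the Eulerian path search can be sketched as follows. This is an illustration under strong assumptions (error-free reads, an Eulerian path exists, one edge per distinct (k+1)-mer) and is not the Euler assembler's actual code. The path is found with Hierholzer's algorithm, a standard linear-time method that the slide does not name; all function names are made up.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Nodes are k-mers; each distinct (k+1)-mer in the reads gives one edge
    from its first k bases to its last k bases (an overlap of k-1)."""
    seen = set()
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k):
            window = read[i:i + k + 1]
            if window not in seen:
                seen.add(window)
                graph[window[:-1]].append(window[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes an Eulerian path exists."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next((u for u, vs in graph.items() if len(vs) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}  # unused edges
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]

def spell(path):
    """Spell the sequence for a path of overlapping k-mers."""
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])
```

For the reads "ACGTC", "GTCA", "TCAG" with k = 3, the graph is a simple chain and the path spells "ACGTCAG". When S contains repeats, the graph can have several Eulerian paths, so the reconstruction is no longer unique.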

