CAP5510 – Bioinformatics Sequence Assembly


CAP5510 – Bioinformatics Sequence Assembly Tamer Kahveci CISE Department University of Florida

What is Sequence Assembly?
We can only sequence short fragments (100 – 500 bases) at a time.
How can we sequence long sequences (e.g., a single chromosome can have hundreds of millions of bases)?
  1. Chop the long sequence into many small fragments.
  2. Sequence all fragments.
  3. Put them together to reconstruct the long sequence.
Problem: Consider a long sequence S. Given a collection R = {r1, r2, …, rn} of contiguous subsequences of S (aka fragments or reads), construct S from R.

Sequence Assembly
Coverage: the average number of reads in R that contain a given base of S.
Issues:
  Errors in R
  Repeats in S
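The coverage definition above can be made concrete with a small sketch. It assumes each read is described by a hypothetical (start, length) pair giving its true position in S, which a real assembler would not know in advance; the function names are illustrative only.

```python
def average_coverage(read_lengths, genome_length):
    """Average number of reads covering a base: total read bases / |S|."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


def per_base_coverage(read_positions, genome_length):
    """Count, for every position of S, how many reads contain it.

    read_positions is a list of (start, length) pairs, 0-based (an assumption
    made for illustration; true read positions are unknown before assembly).
    """
    depth = [0] * genome_length
    for start, length in read_positions:
        for i in range(start, min(start + length, genome_length)):
            depth[i] += 1
    return depth
```

For example, two reads of length 4 starting at positions 0 and 2 of a 6-base sequence cover the two middle bases twice and every other base once.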

Assemblers
De novo: no prior knowledge of S is available. Slow.
  Phusion (Mullikin & Ning 2003)
  Arachne (Batzoglou et al. 2002)
  CAP (Huang & Madan, 1992)
Mapping: a sequence similar to S is known. Needs prior knowledge of S.
  SHRiMP (Rumble et al. 2009)

Phusion (Mullikin & Ning 2003)
Clipping: remove low-quality reads and clip read ends.
Clustering: group similar reads together.
  Create a histogram of k-mers (k = 17).
  Remove repetitive k-mers (13 or more occurrences).
  Keep a list for each k-mer of the reads that contain it.
  Find all pairs of reads sharing at least one k-mer.
  Keep the number of common k-mers for each such pair.
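The clustering steps above could be sketched roughly as follows. This is a minimal illustration, not Phusion's actual implementation: the function names are hypothetical, reads are plain Python strings, and the defaults k = 17 and max_occurrences = 12 (i.e. drop k-mers seen 13 or more times) simply mirror the numbers on the slide.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of distinct k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def shared_kmer_counts(reads, k=17, max_occurrences=12):
    """Map each pair of read ids to the number of non-repetitive k-mers they share."""
    # Histogram: total occurrences of each k-mer across all reads.
    histogram = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    # k-mer -> ids of reads containing it, skipping repetitive k-mers.
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            if histogram[kmer] <= max_occurrences:
                index[kmer].add(read_id)
    # Count shared k-mers for each pair of reads with at least one in common.
    pair_counts = Counter()
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            pair_counts[(a, b)] += 1
    return pair_counts
```

Pairs with a high shared count are natural candidates for the same cluster; how the counts are thresholded is a design choice this sketch leaves open.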

Phusion (Mullikin & Ning 2003)
Clipping: remove low-quality reads and clip read ends.
Clustering: group similar reads together.
Assemble each cluster into a contig.
  Given a pair of reads, extend their matching k-mers.
Join overlapping contigs.
  If two contigs share a read, try to put them together into a longer contig by splicing them first.
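As a simplified stand-in for the extension and joining steps above, the sketch below merges two error-free sequences through their longest exact suffix-prefix overlap. It is an assumption-laden illustration, not Phusion's procedure: real assemblers must tolerate sequencing errors, and the names overlap_length, merge, and min_overlap are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right, min_overlap=3):
    """Join two sequences on their overlap, or return None if it is too short."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]
```

Calling merge("ACGATTAC", "TTACAATA") joins the two reads on their shared "TTAC", while sequences with no sufficiently long overlap are left unmerged.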

Euler (Pevzner et al. 2001)
Clipping: remove low-quality reads and clip read ends.
Clustering: group similar reads together.
Assemble each cluster into a contig.
  Create a de Bruijn graph.
    Each node is a k-mer.
    A directed edge indicates a dovetail overlap of k-1 positions.
  Find an Eulerian path on this graph (visit each edge exactly once): solvable in polynomial time.
  Not a Hamiltonian path (visit each vertex exactly once): NP-complete.
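The graph construction and path search above can be sketched as follows, using the slide's convention that nodes are k-mers joined by a k-1 overlap. This is a minimal illustration, not the Euler assembler itself: it assumes error-free reads, adds one edge per distinct (k+1)-mer seen in the reads, assumes an Eulerian path exists, and uses Hierholzer's algorithm as one standard way to find it. All function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are k-mers; each distinct (k+1)-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k):
            window = read[i:i + k + 1]
            if window not in seen:
                seen.add(window)
                graph[window[:-1]].append(window[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at a node with one more outgoing than incoming edge, if any.
    start = next(iter(graph))
    for node, targets in graph.items():
        if len(targets) - in_deg[node] == 1:
            start = node
            break
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node
    return path[::-1]


def spell(path):
    """Reconstruct the sequence spelled by a path of overlapping k-mers."""
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])
```

Each edge is traversed once, so the search runs in time linear in the number of edges, which is the polynomial bound the slide contrasts with the NP-complete Hamiltonian formulation.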