DNA sequencing. There are two techniques. Hybridization: provides information about the l-tuples present in the DNA. Shotgun: DNA sequences are broken into 100Kb-500Kb random fragments.

Presentation transcript:

DNA sequencing
There are two techniques:
Hybridization: provides information about the l-tuples (l-mers) present in the DNA.
Shotgun: DNA sequences are broken into 100Kb-500Kb random fragments.

Hybridization
Let xxxxxxxxxxxxx be the sequence we want to know. The hybridization technique gives us the set of 3-mers that belong to it:
AAC, GAT, TGC, ACG, CGG, GCC, TTG, GGA, ATT
How can the sequence be reconstructed?

Hybridization
Given the 3-mers of the sequence:
AAC, GAT, TGC, ACG, CGG, GCC, TTG, GGA, ATT
As AAC and ACG belong to the sequence, we infer that AACG belongs to the sequence, because the longest proper suffix of AAC (AC) matches the longest proper prefix of ACG (AC). This relation can be represented with a directed graph:
AAC -> ACG

Hybridization
Construction of the complete suffix-prefix graph over the 3-mers AAC, GAT, TGC, ACG, CGG, GCC, TTG, GGA, ATT:
AAC -> ACG -> CGG -> GGA -> GAT -> ATT -> TTG -> TGC -> GCC
that gives us the unknown sequence: AACGGATTGCC
But is this a realistic case?
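The construction above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every 3-mer occurs once, and the graph is a single unambiguous path. The function names `overlap_graph` and `spell_unique_path` are hypothetical and do not come from the slides.

```python
def overlap_graph(kmers):
    """Map each k-mer to the k-mers whose (k-1)-prefix equals its (k-1)-suffix."""
    return {a: [b for b in kmers if a != b and a[1:] == b[:-1]] for a in kmers}


def spell_unique_path(kmers):
    """Spell the sequence when the suffix-prefix graph is one simple path.

    Assumption (not from the slides): each node has at most one successor
    and exactly one node has no predecessor.
    """
    graph = overlap_graph(kmers)
    has_predecessor = {b for successors in graph.values() for b in successors}
    starts = [k for k in kmers if k not in has_predecessor]
    if len(starts) != 1 or any(len(s) > 1 for s in graph.values()):
        raise ValueError("graph is not a single unambiguous path")
    node = starts[0]
    sequence = node
    visited = {node}
    while graph[node]:
        node = graph[node][0]
        if node in visited:
            raise ValueError("graph contains a cycle")
        visited.add(node)
        sequence += node[-1]  # each step adds one new base
    if len(visited) != len(kmers):
        raise ValueError("path does not use every k-mer")
    return sequence


three_mers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(spell_unique_path(three_mers))  # AACGGATTGCC
```

Building the graph this way compares every pair of l-mers, which is the quadratic step discussed in the cost analysis.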

Hybridization
Let us introduce a more realistic case, with the 3-mers:
AAC, CAA, GAT, TGC, ACG, CGG, GCC, TTG, GGC, GGA, CCG, ATT
The sequence is given by the Hamiltonian path, that is, the path that traverses all nodes exactly once, and finding it is an NP-complete problem!
What is the cost of the hybridization method?

Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Hamiltonian path: NP-complete.

Excursus: cost
Assume that an input of size m takes t = 1 ms.
Linear cost, $O(m)$: 10m takes 10t = 10 ms; 1000m takes 1000t = 1 s.
Quadratic cost, $O(m^2)$: 10m takes 100t = 100 ms; 1000m takes $10^6 t$, about 16 min.
Exponential cost, $O(2^m)$: 10m takes $2^{10} t$, about 1 s; 1000m takes $2^{1000} t$, an astronomical number of years.

Hybridization: cost
The search for the Hamiltonian path is NP-complete. How can the NP-completeness be avoided?

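Hybridization
Search for the Hamiltonian path (NP-complete) in the graph whose nodes are the 3-mers:
AAC, GAT, TGC, ACG, CGG, GCC, TTG, GGC, GGA, CCG, ATT
or search for the Eulerian path (linear) in the graph whose nodes are the 2-mers and whose edges are the 3-mers:
AA, AC, GG, CG, GA, CC, GC, TG, TT, AT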
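As a rough sketch of the second graph, each 3-mer can be turned into an edge from its 2-mer prefix to its 2-mer suffix. The helper name `two_mer_graph` is an assumption made for illustration only. Note that the sketch only builds the graph; whether an Eulerian path exists also depends on how many times each 3-mer occurs in the sequence, which the set of 3-mers alone does not tell us.

```python
from collections import defaultdict


def two_mer_graph(kmers):
    """Return {prefix: [suffix, ...]} with one edge per k-mer."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


three_mers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC",
              "TTG", "GGC", "GGA", "CCG", "ATT"]
graph = two_mer_graph(three_mers)

# Every 2-mer that appears as a prefix or a suffix is a node.
nodes = set(graph) | {v for targets in graph.values() for v in targets}
print(sorted(nodes))
print(sum(len(targets) for targets in graph.values()), "edges")
```

The graph has one edge per l-mer, so it is built in time linear in the number of l-mers, with no pairwise comparison.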

Hybridization: Eulerian path
Search for the Eulerian path of the graph:
Unbalanced nodes: indegree ≠ outdegree (starting or ending nodes).
Balanced nodes: indegree = outdegree (traversed nodes).

Hybridization: Eulerian path
Algorithm:
1. Construct a random path between the starting and ending nodes.
2. Add cycles from balanced nodes while possible.
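A minimal sketch of the Eulerian path search follows. It uses the iterative, stack-based form of Hierholzer's algorithm, which reaches the same result as building a first path and then splicing in cycles, but it is not necessarily the exact procedure intended in the slides. The function name `eulerian_sequence` is hypothetical, and the sketch assumes exactly one starting node and one ending node.

```python
from collections import defaultdict


def eulerian_sequence(kmers):
    """Spell a sequence from an Eulerian path over the (k-1)-mer graph."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # outdegree minus indegree
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    if len(starts) != 1 or len(ends) != 1 or any(abs(b) > 1 for b in balance.values()):
        raise ValueError("expected exactly one starting and one ending node")

    # Walk until stuck, then backtrack; nodes are emitted in reverse order.
    stack, path = [starts[0]], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:
        raise ValueError("some edges are unreachable from the starting node")
    return path[0] + "".join(node[-1] for node in path[1:])


three_mers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(eulerian_sequence(three_mers))  # AACGGATTGCC
```

Each edge is pushed and popped once, so the search is linear in the number of l-mers.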

Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Eulerian path: linear cost.
Now, what is the limiting factor?

Hybridization: limiting factor
Repeated l-mers.
Given the graph over the 3-mers AAC, CAA, GAT, TGC, ACG, CGG, GCC, TTG, GGA, ATT and GAC, how many sequences can be assembled?
CAACGGATTGCC
CAACGGACGGATTGCC
What is the probability of a repeat?

Hybridization: statistical model
Model: random sequence of length N with identically distributed bases (probability 1/4 each).
How can the probability of a repeat be computed?
Given 2 l-mers, the probability that they match is $4^{-L}$.
Given 3 l-mers, the expected number of matching pairs is $\binom{3}{2} 4^{-L}$.
Given m l-mers, the expected number of matching pairs is $\binom{m}{2} 4^{-L}$.
If $\binom{m}{2} 4^{-L} < 1$ then, using $\binom{m}{2} \approx m^2/2$, we get $m < \sqrt{2 \cdot 4^L}$; for L = 8 this gives $m < \sqrt{2 \cdot 4^8} \approx 362$.
Conclusion: this technique can be applied only to short sequences.
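The bound can be checked numerically. The sketch below assumes the same model (each pair of l-mers matches with probability $4^{-L}$) and uses hypothetical function names. It finds the largest m for which the expected number of matching pairs stays below 1, and compares it with the closed-form approximation $\sqrt{2 \cdot 4^L}$.

```python
import math


def expected_matching_pairs(m, L):
    """Expected number of matching pairs among m random l-mers of length L."""
    return math.comb(m, 2) * 4.0 ** (-L)


def max_lmers_without_expected_repeat(L):
    """Largest m with fewer than one expected matching pair."""
    m = 2
    while expected_matching_pairs(m + 1, L) < 1:
        m += 1
    return m


L = 8
print(max_lmers_without_expected_repeat(L))  # 362
print(math.sqrt(2 * 4 ** L))                 # about 362.04
```

For L = 8 the exact count and the approximation agree at about 362 l-mers, which corresponds to a sequence of only a few hundred bases.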

DNA sequencing
There are two techniques:
Hybridization: provides information about the l-mers present in the DNA.
Shotgun: DNA sequences are broken into 100Kb-500Kb random fragments.

Shotgun
With the unknown sequence
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
it is possible:
to make some copies
to break it into random and unsorted short segments
What can we do?

Shotgun: algorithm
Assume
xxxxx|xxxxxxx|xxxxxxx|xxxx
xxxxxxxx|xxxxxx|xxxxxx|xxx
xxxx|xxxxxx|xxxxxx|xxxxxxx
The algorithm is:
1st. Compare all pairs, searching for approximate suffix-prefix matches.
2nd. Construct the suffix-prefix graph.
3rd. Find the path.

Shotgun
Given the three copies of
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
the shotgun breaks them into the following segments:
accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt

Shotgun
The pairwise comparison that searches for approximate suffix-prefix matches can be done with:
Dynamic programming (quadratic cost).
Two steps:
Find the pairs suspected to overlap (linear cost with a hash algorithm).
Assemble them with dynamic programming.
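The two-step idea can be illustrated roughly as follows. This is a simplified sketch, not the method of the slides: a hash table of k-mers proposes candidate pairs, and the second step uses exact suffix-prefix comparison in place of the approximate dynamic-programming alignment. The function names, the seed length `k = 3` and the minimum overlap `min_len = 3` are all assumptions. The segments are the ones listed in the shotgun example.

```python
from collections import defaultdict


def candidate_pairs(segments, k=3):
    """Pairs (a, b) where the first k bases of b occur somewhere in a."""
    index = defaultdict(set)
    for seg in segments:
        for i in range(len(seg) - k + 1):
            index[seg[i:i + k]].add(seg)  # hash every k-mer of every segment
    pairs = set()
    for b in segments:
        for a in index.get(b[:k], ()):
            if a != b:
                pairs.add((a, b))
    return pairs


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b (exact match)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


segments = ["accgt", "aggt", "acgatac", "accttta", "tttaac", "gataca",
            "accgtacc", "ggt", "acaggt", "taacgat", "accg", "tacctt"]

overlaps = {}
for a, b in candidate_pairs(segments):
    length = suffix_prefix_overlap(a, b)
    if length:
        overlaps[(a, b)] = length

print(overlaps[("accgtacc", "tacctt")])  # 4, the shared "tacc"
```

Only the pairs proposed by the hash table reach the more expensive comparison, which is the point of the two-step scheme.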

Shotgun
Given the graph over the segments
accgtacc, accttta, tacctt, tttaac, taacga, acgatac, accg, accgt, tacaggt, gataca
the assembled sequence is
accgtacctttaacgatacaggt
but the search for the Hamiltonian path has exponential cost!

Shotgun: new problems arise
[Figure: segments laid out against the sequence, including accg and accgt]
Consecutive repeats
Lack of coverage
…

Shotgun: properties of the coverage
Given the coverage, some questions arise:
What is the mean length of the contigs?
How many contigs should we expect?
What is the percentage of coverage?

Shotgun: percentage of coverage
Given the model: a sequence of length L covered by N segments, each of length d.
Degree of coverage: $N d / L$
We assume that the segments are randomly distributed.
The probability that a base is covered by k segments is given by the binomial distribution $B(N, d/L)$:
$\mathrm{Prob}\{X = k\} = \binom{N}{k} (d/L)^k (1 - d/L)^{N-k}$

Shotgun: percentage of coverage
What is the limit of the binomial distribution when $n \to \infty$ and $p \to 0$ with $np = \lambda$?
The Poisson distribution $P(\lambda)$:
$\mathrm{Prob}\{X = k\} = e^{-\lambda} \lambda^k / k!$
Then the probability that at least one segment covers a base is
$\mathrm{Prob}\{X > 0\} = 1 - \mathrm{Prob}\{X = 0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$
Then, with $N d / L = 4.6$ we obtain 99% coverage, and with $N d / L = 6.9$ we obtain 99.9% coverage.
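The two figures can be reproduced with a short calculation. The sketch below assumes the Poisson approximation stated above. The function names and the example values of N, d and L are hypothetical and were chosen only so that N d / L equals 4.6.

```python
import math


def covered_fraction(n_segments, seg_len, seq_len):
    """Poisson estimate of the fraction of bases covered by at least one segment."""
    return 1.0 - math.exp(-n_segments * seg_len / seq_len)


def exact_covered_fraction(n_segments, seg_len, seq_len):
    """Same quantity from the binomial model, without the Poisson limit."""
    return 1.0 - (1.0 - seg_len / seq_len) ** n_segments


def required_degree_of_coverage(target_fraction):
    """Value of N d / L needed to cover the given fraction of the sequence."""
    return -math.log(1.0 - target_fraction)


# Hypothetical numbers: 46 000 segments of 100 bases over 1 000 000 bases.
print(covered_fraction(46_000, 100, 1_000_000))
print(exact_covered_fraction(46_000, 100, 1_000_000))
print(required_degree_of_coverage(0.99))   # about 4.6
print(required_degree_of_coverage(0.999))  # about 6.9
```

Because the required degree of coverage is $-\ln(1 - f)$ for a target fraction f, each additional "nine" of coverage costs about 2.3 more units of N d / L.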