Published by Martina Robbins. Modified over 9 years ago.
1
Master Course, MSc Bioinformatics for Health Sciences. H15: Algorithms on strings and sequences. Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen), Dep. de Llenguatges i Sistemes Informàtics, CEPBA-IBM Research Institute, Universitat Politècnica de Catalunya
2
Master Course Fourth lecture: Sequence assembly
3
Sequence assembly is applied to the following topics: EST assembly and DNA sequencing.
4
DNA sequencing
There are two techniques:
Hybridization: provides information about the l-tuples present in the DNA.
Shotgun: DNA sequences are broken into random fragments of 100Kb-500Kb.
5
DNA sequencing
There are two techniques:
Hybridization: provides information about the l-mers present in the DNA.
Shotgun: DNA sequences are broken into random fragments of 100Kb-500Kb.
6
Hybridization
Let xxxxxxxxxxxxx be the sequence we want to know. The hybridization technique gives us the set of 3-mers that belong to it:
AAC, GAT, TGC, ACG, CGG, GCC, TTG, GGA, ATT
How can the sequence be reconstructed?
7
Hybridization
Given the 3-mers of the sequence: AAC, GAT, TGC, ACG, CGG, GCC, TTG, GGA, ATT
As AAC and ACG belong to the sequence, AACG belongs to the sequence, because the longest proper suffix of AAC (AC) matches the longest proper prefix of ACG (AC).
This relation can be represented with a directed graph: AAC → ACG
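The suffix-prefix relation can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `suffix_prefix_graph` is hypothetical, and the input is the set of nine 3-mers from the example.

```python
def suffix_prefix_graph(kmers):
    """Map each k-mer to the k-mers whose (k-1)-prefix equals its (k-1)-suffix."""
    graph = {}
    for a in kmers:
        # a -> b when the longest proper suffix of a is the longest proper prefix of b
        graph[a] = [b for b in kmers if a != b and a[1:] == b[:-1]]
    return graph


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
graph = suffix_prefix_graph(kmers)
print(graph["AAC"])  # ['ACG']
```

Comparing every pair of k-mers, as done here, is the quadratic step counted in the cost analysis.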
8
Hybridization
Construction of the complete suffix-prefix graph over the 3-mers AAC, GAT, TGC, ACG, CGG, GCC, TTG, GGA, ATT gives us the unknown sequence: AACGGATTGCC
But is this a realistic case?
9
Hybridization
Let us introduce a more realistic case, with the 3-mers:
AAC, CAA, GAT, TGC, ACG, CGG, GCC, TTG, GGC, GGA, CCG, ATT
The sequence is given by the Hamiltonian path, that is, the path that traverses all nodes exactly once, and finding it is an NP-complete problem!
What is the cost of the hybridization method?
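A Hamiltonian path search can be sketched with plain backtracking. The code below is an illustrative sketch, not the lecture's implementation: the names `hamiltonian_path` and `spell` are hypothetical, and the input is the set of nine 3-mers of the first example, where the path is unique.

```python
def hamiltonian_path(kmers):
    """Return an ordering of all k-mers in which each overlaps the next, or None.

    Plain backtracking: the worst-case time grows exponentially with len(kmers).
    """
    def extend(path, remaining):
        if not remaining:
            return path
        for nxt in sorted(remaining):
            if path[-1][1:] == nxt[:-1]:  # suffix of last node == prefix of next
                found = extend(path + [nxt], remaining - {nxt})
                if found:
                    return found
        return None

    nodes = set(kmers)
    for start in sorted(nodes):
        found = extend([start], nodes - {start})
        if found:
            return found
    return None


def spell(path):
    """Glue a path of overlapping k-mers into one sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(spell(hamiltonian_path(kmers)))  # AACGGATTGCC
```

On this small chain the search ends immediately; the exponential behaviour only shows up when many nodes have several successors.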
10
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Hamiltonian path: NP-complete.
11
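Excursus: cost (taking t = 1 ms as the time for an input of size m)
Linear cost, $O(m)$: m takes t = 1 ms; 10m takes 10t = 10 ms; 1000m takes 1000t = 1 s.
Quadratic cost, $O(m^2)$: m takes t = 1 ms; 10m takes 100t = 100 ms; 1000m takes 1000000t, about 16 min.
Exponential cost, $O(2^m)$: m takes t = 1 ms; 10m takes about $2^{10} t$, about 1 s; 100m already takes about $2^{100} t \approx 10^{30} t$, on the order of $10^{19}$ years, and 1000m takes about $2^{1000} t \approx 10^{301} t$.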
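The arithmetic of this excursus can be checked with a short script. It is a sketch under the slide's own assumption that an input of size m takes 1 ms; the names `T_MS` and `scaled_time_ms` are hypothetical. For the exponential case it only counts the decimal digits of $2^{1000}$, which is already enough to show how far the value lies beyond any practical time scale.

```python
T_MS = 1.0  # assumed time, in milliseconds, for an input of size m


def scaled_time_ms(growth, factor):
    """Time for an input `factor` times larger, for linear or quadratic growth."""
    exponent = {"linear": 1, "quadratic": 2}[growth]
    return T_MS * factor ** exponent


print(scaled_time_ms("linear", 1000) / 1000)       # in seconds: 1.0
print(scaled_time_ms("quadratic", 1000) / 60_000)  # in minutes: about 16.7
print(len(str(2 ** 1000)))                         # 302 decimal digits
```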
12
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Hamiltonian path: NP-complete.
How can the NP-completeness be avoided?
13
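Hybridization
Either search for the Hamiltonian path (NP-complete) in the graph whose nodes are the 3-mers AAC, GAT, TGC, ACG, CGG, GCC, TTG, GGC, GGA, CCG, ATT,
or search for the Eulerian path (linear cost) in the graph whose nodes are the 2-mers AA, AC, GG, CG, GA, CC, GC, TG, TT, AT and whose edges are the 3-mers.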
14
Hybridization: Eulerian path
Search for the Eulerian path of the graph:
Balanced nodes: indegree = outdegree (traversed nodes).
Unbalanced nodes: indegree ≠ outdegree (starting or ending nodes).
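Finding the unbalanced nodes takes one pass over the l-mers. The sketch below assumes the graph whose nodes are (l-1)-mers and whose edges are the l-mers; `degree_imbalance` is a hypothetical helper name, and the input is the nine 3-mers of the first example.

```python
from collections import Counter


def degree_imbalance(kmers):
    """Return {node: outdegree - indegree} for the graph whose edges are the k-mers."""
    balance = Counter()
    for kmer in kmers:
        balance[kmer[:-1]] += 1  # the edge leaves its (k-1)-prefix
        balance[kmer[1:]] -= 1   # the edge enters its (k-1)-suffix
    return dict(balance)


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
balance = degree_imbalance(kmers)
start = [node for node, b in balance.items() if b == 1]
end = [node for node, b in balance.items() if b == -1]
print(start, end)  # ['AA'] ['CC']
```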
15
Hybridization: Eulerian path
Algorithm:
1. Construct an arbitrary path from the starting node to the ending node.
2. While unused edges remain, build a cycle of unused edges from a node already on the path and splice it into the path.
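One common way to implement this idea is the stack-based form of Hierholzer's algorithm, which has the same effect as building a path and splicing cycles into it. The sketch below is illustrative: `eulerian_path` is a hypothetical name, the edges are the l-mers and the nodes the (l-1)-mers, and the code assumes that an Eulerian path exists.

```python
from collections import defaultdict


def eulerian_path(kmers):
    """Spell a sequence that uses every k-mer (edge) exactly once.

    Assumes such a path exists (at most one start and one end node).
    """
    out_edges = defaultdict(list)
    balance = defaultdict(int)
    for kmer in kmers:
        out_edges[kmer[:-1]].append(kmer[1:])
        balance[kmer[:-1]] += 1
        balance[kmer[1:]] -= 1

    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), kmers[0][:-1])

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # no edges left: node is final
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(eulerian_path(kmers))  # AACGGATTGCC
```

Each edge is pushed and popped once, which is where the linear cost comes from.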
17
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Eulerian path: linear cost.
Now, what is the limiting factor?
18
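Hybridization: limiting factor
Given the graph over the 3-mers AAC, CAA, GAT, TGC, ACG, CGG, GCC, TTG, GGA, ATT and GAC: how many sequences can be assembled?
CAACGGATTGCC uses every 3-mer except GAC; CAACGGACGGATTGCC uses all of them, but repeats ACG, CGG and GGA.
Repeated l-mers are the limiting factor: what is the probability of a repeat?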
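The ambiguity caused by repeats can be shown directly: hybridization reports only which l-mers are present, so sequences that differ in the number of copies of a repeat give the same answer. In the sketch below, `spectrum` is a hypothetical helper and `two_repeats` is a made-up sequence that contains the unit ACGG one more time.

```python
def spectrum(sequence, k=3):
    """Set of k-mers present in a sequence (presence only, no counts)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


one_repeat = "CAACGGACGGATTGCC"
two_repeats = "CAACGGACGGACGGATTGCC"  # hypothetical: one more copy of ACGG

print(spectrum(one_repeat) == spectrum(two_repeats))               # True
print(spectrum("CAACGGATTGCC") == spectrum(one_repeat) - {"GAC"})  # True
```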
19
Hybridization: statistical model
Model: a random sequence of length N with identically distributed bases (probability 1/4 each). How can the probability of a repeat be computed?
Given 2 l-mers, the probability that they match is $4^{-L}$.
Given 3 l-mers, the expected number of matching pairs is $\binom{3}{2} 4^{-L}$.
Given m l-mers, the expected number of matching pairs is $\binom{m}{2} 4^{-L}$.
If $\binom{m}{2} 4^{-L} < 1$, then approximately $m < \sqrt{2 \cdot 4^{L}}$; for L = 8 this gives m < 362.
Conclusion: this technique can be applied only to short sequences.
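The bound can be evaluated numerically. This is a small sketch of the model as stated (independent, uniformly random l-mers); the function names are hypothetical. For L = 8 the condition $\binom{m}{2} 4^{-L} < 1$ holds up to m = 362, in agreement with $\sqrt{2 \cdot 4^{8}} \approx 362$.

```python
import math


def expected_pair_matches(m, L):
    """Expected number of matching pairs among m random l-mers of length L."""
    return math.comb(m, 2) * 4.0 ** (-L)


def max_lmers_without_expected_repeat(L):
    """Largest m for which the expected number of matching pairs stays below 1."""
    m = 2
    while expected_pair_matches(m + 1, L) < 1:
        m += 1
    return m


print(max_lmers_without_expected_repeat(8))  # 362
print(round(math.sqrt(2 * 4 ** 8)))          # 362, the closed-form approximation
```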
20
Hybridization
Are genome sequences close to random sequences?
Connect to http://alggen.lsi.upc.edu and follow the links RESEARCH, then SEARCH, then MREPATT.
22
DNA sequencing
There are two techniques:
Hybridization: provides information about the l-mers present in the DNA.
Shotgun: DNA sequences are broken into random fragments of 100Kb-500Kb.
23
Shotgun
With the unknown sequence xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx it is possible to make some copies and to break them into random, unsorted short segments.
What can we do?
24
Shotgun: algorithm
Assume three copies of the sequence, each broken at different points:
xxxxx|xxxxxxx|xxxxxxx|xxxx
xxxxxxxx|xxxxxx|xxxxxx|xxx
xxxx|xxxxxx|xxxxxx|xxxxxxx
The algorithm is:
1st. Compare all pairs, searching for approximate suffix-prefix matches.
2nd. Construct the suffix-prefix graph.
3rd. Find the path.
25
Shotgun
Given the three copies of xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, the shotgun breaks them into the following segments:
accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt
26
Shotgun
The pairwise comparison that searches for approximate suffix-prefix matches can be done with:
Dynamic programming (quadratic cost).
Two steps: first find the pairs that are candidates to be assembled (linear cost with a hash algorithm), then assemble them with dynamic programming.
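The two-step idea can be sketched as follows. This is a simplified illustration, not the lecture's method: it uses exact matching instead of approximate matching with dynamic programming, and the names `exact_overlap` and `candidate_pairs` as well as the seed length `k=3` are assumptions. A dictionary keyed by the first k letters of each segment plays the role of the hash step.

```python
from collections import defaultdict


def exact_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (exact matches only)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def candidate_pairs(segments, k=3):
    """Pairs (a, b) such that the first k letters of b occur somewhere inside a."""
    by_prefix = defaultdict(list)
    for b in segments:
        by_prefix[b[:k]].append(b)
    pairs = set()
    for a in segments:
        for i in range(len(a) - k + 1):
            for b in by_prefix.get(a[i:i + k], []):
                if a != b:
                    pairs.add((a, b))
    return pairs


segments = ["accgtacc", "tacctt", "accttta", "tttaac"]
overlaps = {(a, b): exact_overlap(a, b) for a, b in candidate_pairs(segments)}
print(overlaps[("accgtacc", "tacctt")])  # 4
```

Only the candidate pairs reach the more expensive comparison; pairs that share no seed, such as ("tttaac", "accgtacc"), are never compared.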
27
Shotgun
Given the graph over the segments accgtacc, accttta, tacctt, tttaac, taacga, acgatac, accg, accgt, tacaggt, gataca,
the Hamiltonian path gives the assembled sequence accgtacctttaacgatacaggt,
but finding the Hamiltonian path has exponential cost!
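Because the Hamiltonian path search is too expensive, assemblers often fall back on heuristics. The sketch below uses a greedy merge, which is a different technique from the Hamiltonian search on the slide and is not guaranteed to find the best assembly; it also uses exact overlaps only. The names `overlap` and `greedy_assemble` are hypothetical. On the ten segments of this example it happens to recover the same sequence.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for size in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def greedy_assemble(segments):
    """Repeatedly merge the pair of segments with the longest exact overlap."""
    # Drop segments contained in another segment; they add no information here.
    contigs = [s for s in set(segments)
               if not any(s != t and s in t for t in segments)]
    while len(contigs) > 1:
        size, a, b = max((overlap(a, b), a, b)
                         for a in contigs for b in contigs if a != b)
        if size == 0:
            break  # nothing overlaps: several contigs remain
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[size:])
    return sorted(contigs)


segments = ["accgtacc", "accttta", "tacctt", "tttaac", "taacga",
            "acgatac", "accg", "accgt", "tacaggt", "gataca"]
print(greedy_assemble(segments))  # ['accgtacctttaacgatacaggt']
```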
28
Shotgun: new problems arise
[Figure: overlapping segments, among them accg and accgt]
Consecutive repeats
Lack of coverage
…
29
Shotgun: properties of the coverage
Given the coverage, some questions arise:
What is the mean length of the contigs?
How many contigs should we expect?
What is the percentage of coverage?
30
Shotgun: percentage of coverage
Given the model: a sequence of length L and N segments, each of length d.
Degree of coverage: N d / L
We assume that segments are randomly distributed. The probability that a base is covered by k segments is given by the binomial distribution B(N, d/L):
$\mathrm{Prob}\{X = k\} = \binom{N}{k} (d/L)^{k} (1 - d/L)^{N-k}$
31
Shotgun: percentage of coverage
What is the limit of the binomial distribution when $n \to \infty$ and $p \to 0$ with $np = \lambda$ (here n = N and p = d/L)? The Poisson distribution $P(\lambda)$:
$\mathrm{Prob}\{X = k\} = e^{-\lambda} \lambda^{k} / k!$
Then the probability that at least one segment covers a base is
$\mathrm{Prob}\{X > 0\} = 1 - \mathrm{Prob}\{X = 0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$
Then, with N d / L = 4.6 we obtain 99% coverage, and with N d / L = 6.9 we obtain 99.9% coverage.
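The two coverage figures follow from inverting $1 - e^{-N d / L}$. The sketch below only evaluates the Poisson approximation; the function names `covered_fraction` and `depth_for` are hypothetical.

```python
import math


def covered_fraction(N, d, L):
    """Expected fraction of bases covered by at least one segment (Poisson model)."""
    return 1.0 - math.exp(-N * d / L)


def depth_for(target):
    """Degree of coverage N*d/L needed to cover a fraction `target` of the bases."""
    return -math.log(1.0 - target)


print(round(depth_for(0.99), 1))   # 4.6
print(round(depth_for(0.999), 1))  # 6.9
```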
32
Assembly of ESTs
It is the same procedure as shotgun sequencing…
…but with one great advantage: there are many graphs with a small number of nodes!
Connect to http://alggen.lsi.upc.es and follow the links RESEARCH, then ESSEM.