Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences. Xavier Messeguer Peypoch, Dep. de Llenguatges i Sistemes Informàtics, CEPBA-IBM Research Institute, Universitat Politècnica de Catalunya
Contents
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Extended string matching and regular expressions
4. Approximate string matching (Dynamic programming)
5. Pairwise and multiple alignment
6. Suffix trees
Master Course Second lecture: First part: Extended string matching
Extended string matching
1. Classes of characters in the text: some characters in the text represent sets of symbols.
2. Classes of characters in the pattern: some characters in the pattern represent sets of symbols.
In both cases a class of characters is represented by one symbol. For instance, the IUPAC code for the DNA alphabet is:
R = {G,A}  Y = {T,C}  K = {G,T}  M = {A,C}  S = {G,C}  W = {A,T}
B = {G,T,C}  D = {G,A,T}  H = {A,C,T}  V = {G,C,A}  N = {A,G,C,T} (any)
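As a minimal sketch of what matching with classes means, the IUPAC table above can be stored as a dictionary of sets; two symbols are then compatible when their sets share at least one base. The names `IUPAC`, `compatible` and `naive_search` are illustrative and not part of the course material, and the search is a plain quadratic scan, not one of the efficient algorithms of the lecture.

```python
# IUPAC DNA codes as sets of bases (plain bases stand for themselves).
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"G", "A"}, "Y": {"T", "C"}, "K": {"G", "T"}, "M": {"A", "C"},
    "S": {"G", "C"}, "W": {"A", "T"},
    "B": {"G", "T", "C"}, "D": {"G", "A", "T"},
    "H": {"A", "C", "T"}, "V": {"G", "C", "A"},
    "N": {"A", "G", "C", "T"},
}


def compatible(x, y):
    """True when the two symbols can denote the same base."""
    return bool(IUPAC[x] & IUPAC[y])


def naive_search(pattern, text):
    """Start positions (0-based) where every pattern symbol is
    compatible with the text symbol below it."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if all(compatible(p, t) for p, t in zip(pattern, text[i:i + m]))]


print(naive_search("ATGTA", "GTARTRNAAGGA"))  # classes in the text: [3]
print(naive_search("ARG", "TAAGC"))           # class in the pattern: [1]
```

Because compatibility is symmetric, the same function covers classes in the text, in the pattern, or in both.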
Classes in the text. Most efficient algorithms (Navarro & Raffinot). [Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) by alphabet size |Σ| and pattern length, with a mark at w.]
Classes in the text: Horspool example
Given the pattern ATGTA, the shift table for the plain symbols is:
A 4   C 5   G 2   T 1
What are the shifts for the class symbols R, …, N? A class may stand for any of its members, so the only safe shift is the minimum over them: R = {G,A} gives min(2,4) = 2 and N = {A,G,C,T} gives 1.
A 4   C 5   G 2   T 1   R 2   …   N 1
Given the text and the first window:
G T A R T R N A A G G A …
A T G T A
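The example can be sketched in code. The block below assumes a pattern made of plain bases and a text that may contain IUPAC classes; a class symbol receives the minimum shift of its members, which is the only shift that cannot skip an occurrence. The function names are illustrative, and the window is checked left to right for brevity.

```python
# IUPAC codes written as strings of their member bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def horspool_shifts(pattern):
    """Horspool shift table extended to class symbols."""
    m = len(pattern)
    base = {c: m for c in "ACGT"}          # symbols absent from the pattern
    for i, c in enumerate(pattern[:-1]):   # last position is excluded
        base[c] = m - 1 - i
    # A class may be any of its members: take the smallest (safe) shift.
    return {code: min(base[c] for c in members)
            for code, members in IUPAC.items()}


def horspool_search(pattern, text):
    """Start positions where the pattern matches a text with classes."""
    m = len(pattern)
    shifts = horspool_shifts(pattern)
    hits, pos = [], 0
    while pos + m <= len(text):
        window = text[pos:pos + m]
        if all(p in IUPAC[t] for p, t in zip(pattern, window)):
            hits.append(pos)
        pos += shifts[text[pos + m - 1]]   # shift by last window symbol
    return hits


table = horspool_shifts("ATGTA")
print({s: table[s] for s in "ACGTRN"})
print(horspool_search("ATGTA", "GTARTRNAAGGA"))
```

The more class symbols the text contains, the smaller the shifts become, so the algorithm degrades towards a position-by-position scan.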
Classes in the text. Most efficient algorithms (Navarro & Raffinot). [Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) by alphabet size |Σ| and pattern length, with a mark at w.] BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.
Exact search algorithms for one pattern (on-line text). Most efficient algorithms (Navarro & Raffinot). [Figure: the same map.]
Classes in the text: BOM. How is the next window position determined? How is the comparison done? [Figure labels: Text, Pattern, Automaton: Factor Oracle.] The automaton checks whether the suffix of the window is a factor of the pattern. But first let us analyse how the comparison is done…
Classes in the text: BOM example
The automaton of the reversed pattern is built. Suppose the pattern is ATGTATG and the search runs over the text:
G T A R T R N A A T G …
A T G T A T G
How is the comparison done? [Figure: factor oracle of the reversed pattern.]
No improvement is possible!
Exact search algorithms for many patterns. [Figure: maps of the most efficient algorithm by alphabet size |Σ| and minimum pattern length: Wu-Manber and SBOM for 5 words; Wu-Manber, SBOM and Ad AC for 10, 100 and 1000 words.]
Classes in the text: Set Horspool
Search for the patterns ATGTATG, TATG, ATAAT, ATGTG [Figure: trie of the reversed patterns] in the text: ARTGNCTATGTGACA…
No improvement is possible!
Classes in the text. [Figure: maps of the most efficient algorithm for many patterns by alphabet size |Σ| and minimum pattern length: Wu-Manber and SBOM for 5 words; Wu-Manber, SBOM and Ad AC for 10, 100 and 1000 words.]
Classes in the pattern. Most efficient algorithms (Navarro & Raffinot). [Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) by alphabet size |Σ| and pattern length, with a mark at w.]
Master Course Second lecture: Second part: Regular expressions matching
Regular expressions
A regular expression ℛ is a string over Σ ∪ { ε, |, ·, *, (, ) } defined recursively as follows:
ε is a regular expression
A character of Σ is a regular expression
( ℛ ) is a regular expression
ℛ1 · ℛ2 is a regular expression
ℛ* is a regular expression
ℛ1 | ℛ2 is a regular expression
Regular language
The language represented by a regular expression is the set of words that can be built from the regular expression. The problem of searching for a regular expression in a text is that of finding all the factors (substrings) of the text that belong to the corresponding regular language.
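A brute-force sketch of this problem statement, using Python's standard `re` module: every non-empty factor of the text is tested for membership in the language. Python writes concatenation by juxtaposition instead of the `·` operator used above. The function name `all_factors` is illustrative, and the method is far from the efficient automaton-based searches; it only makes the definition concrete.

```python
import re


def all_factors(regex, text):
    """Every non-empty factor text[i:j] that belongs to the language
    of the regular expression, as (i, j, factor) triples."""
    compiled = re.compile(regex)
    return [(i, j, text[i:j])
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if compiled.fullmatch(text, i, j)]


# The language of A(C|G)*T contains ACGT but not ACGTT.
print(all_factors("A(C|G)*T", "GACGTT"))
# The language of AT* contains A, AT and ATT.
print(all_factors("AT*", "ATT"))
```

Note that `re.finditer` alone would not be enough here, since it reports only non-overlapping leftmost matches rather than all factors.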
Master Course Second lecture: Third part: Approximate string matching
Approximate string matching. For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA allowing one error… but what is the meaning of “one error”?
Edit distance
We accept three types of errors:
1. Mismatch: ACCGTGAT → ACCGAGAT
2. Insertion: ACCGTGAT → ACCGATGAT
3. Deletion: ACCGTGAT → ACCGGAT
Insertions and deletions together are called indels.
The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one:
d(ACT,ACT) = 0   d(ACT,AC) = 1   d(ACT,C) = 2   d(ACT,) = 3 (the second string is empty)   d(AC,ATC) = 1   d(ACTTG,ATCTG) = 2
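The values on this slide can be checked with a direct transcription of the definition. This is a sketch, not the course's implementation: the function name `edit_distance` is illustrative, and memoisation (`functools.lru_cache`) is used only to keep the recursion from repeating work.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    that transform string a into string b."""

    @lru_cache(maxsize=None)
    def d(i, j):
        # Distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j                      # j insertions
        if j == 0:
            return i                      # i deletions
        substitution = d(i - 1, j - 1) + (a[i - 1] != b[j - 1])
        deletion = d(i - 1, j) + 1
        insertion = d(i, j - 1) + 1
        return min(substitution, deletion, insertion)

    return d(len(a), len(b))


for a, b in [("ACT", "ACT"), ("ACT", "AC"), ("ACT", "C"),
             ("ACT", ""), ("AC", "ATC"), ("ACTTG", "ATCTG")]:
    print(f"d({a},{b}) = {edit_distance(a, b)}")
```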
Edit distance and alignment of strings
The edit distance is related to the best alignment of the strings. Given d(ACT,ACT) = 0, d(ACT,AC) = 1 and d(ACTTG,ATCTG) = 2, which is the best alignment in each case?
ACT and ACT:
ACT
ACT
ACT and AC:
ACT
AC-
ACTTG and ATCTG (two alignments of cost 2):
ACTTG      ACT-TG
ATCTG      A-TCTG
Edit distance and alignment of strings. But what is the distance between the strings ACGCTATGCTATACG and ACGGTAGTGACGC, and what is the best alignment between them? This problem was first discussed in 1966, and the algorithm was proposed in 1968, 1970, … using the technique called “dynamic programming”.
Edit distance and alignment of strings
[Matrix: the string C T A C T A C T A C G T labels the columns and the string A C T G A labels the rows.]
The highlighted cell contains the distance between AC and CTACT.
Edit distance and alignment of strings
[Matrix: the string C T A C T A C T A C G T labels the columns and the string A C T G A labels the rows.]
How is the matrix filled? Start with the borders.
The top-left cell compares two empty strings: 0.
First row (the empty string against the prefixes C, CT, …, CTACTA, …): 0 1 …, because the only alignment pairs every character with a gap:
-    --    ------
C    CT    CTACTA
First column (the prefixes A, AC, ACT, … against the empty string): A 1, C 2, T 3, …, for the same reason:
ACT
---
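Edit distance and alignment of strings
[Matrix: columns C T A C T A C T A C G T …; rows A C T G A; first column A 1, C 2, T 3.]
Let BA(x,y) be the best alignment of x and y. BA(AC,CTAC) is the best of three candidates:
BA(AC,CTA) followed by the column -/C, with cost d(AC,CTA) + 1
BA(A,CTA) followed by the column C/C, with cost d(A,CTA), because C matches C
BA(A,CTAC) followed by the column C/-, with cost d(A,CTAC) + 1
Therefore d(AC,CTAC) = min{ d(AC,CTA) + 1, d(A,CTA), d(A,CTAC) + 1 }.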
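Edit distance and alignment of strings
[Matrix: columns C T A C T A C T A C G T …; rows A C T G A; first column A 1, C 2, T 3.]
d(AC,CTAC) = minimum of:
d(A,CTAC) + 1 (indel)
d(A,CTA) + 0 if the last characters match, + 1 otherwise (here C = C, so + 0)
d(AC,CTA) + 1 (indel)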
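The recurrence can be sketched as a table-filling routine, with a traceback that recovers one best alignment. The function names are illustrative; when several best alignments exist, the traceback below simply prefers the diagonal move, which is an arbitrary choice.

```python
def edit_matrix(a, b):
    """D[i][j] = edit distance between a[:i] and b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i                       # first column: i deletions
    for j in range(cols):
        D[0][j] = j                       # first row: j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # indel
                          D[i - 1][j - 1] + cost,  # match or mismatch
                          D[i][j - 1] + 1)         # indel
    return D


def best_alignment(a, b):
    """One optimal alignment, as two gapped strings of equal length."""
    D = edit_matrix(a, b)
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


print(edit_matrix("AC", "CTAC")[2][4])   # the cell d(AC,CTAC)
print(*best_alignment("ACT", "AC"), sep="\n")
```

Filling the table takes time proportional to the product of the two lengths; the traceback adds only a walk from the bottom-right cell back to the top-left one.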
Edit distance and alignment of strings Connect to and use the global method.
Edit distance and alignment of strings. How can this algorithm be applied to approximate search, that is, to K-approximate string searching?
K-approximate string searching
[Matrix: the text C T A C T A C T A C G T A C T G G T G A A … labels the columns and the pattern A C T G A labels the rows.]
The highlighted cell gives the distance between (ACTGA, CT…GTA)… but we are only interested in the last characters of the text.
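The slides do not spell out how the matrix is adapted. One standard adaptation, assumed here, is to initialise the whole first row to 0, so that an occurrence may start at any text position, and to report every column whose last cell is at most k. The sketch below keeps only one column at a time; the function name and the 1-based end positions are illustrative choices.

```python
def k_approx_search(pattern, text, k):
    """(end, distance) pairs: the pattern matches some substring of
    text ending at 1-based position `end` with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))      # column for the empty text prefix
    hits = []
    for end, t in enumerate(text, start=1):
        new = [0] * (m + 1)       # new[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == t else 1
            new[i] = min(col[i - 1] + cost,   # match or mismatch
                         col[i] + 1,          # extra text character
                         new[i - 1] + 1)      # missing pattern character
        col = new
        if col[m] <= k:
            hits.append((end, col[m]))
    return hits


TEXT = "CTACTACTACGTCTATACTGATCGTAGCTACTACATGC"
print(k_approx_search("ACTGA", TEXT, 0))
print(k_approx_search("ACTGA", TEXT, 1))
```

With k = 0 only the exact occurrence is reported; with k = 1 neighbouring end positions appear as well, because dropping or adding one character around an exact occurrence costs a single error.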
Master Course Second lecture: Fourth part: Pairwise and multiple alignment
Bioinformatics Pairwise and multiple alignment
Pairwise alignment
Edit distance: match = 0, mismatch = 1, indel = 1
d(AC,CTAC) = minimum of:
d(A,CTAC) + 1
d(A,CTA) + 0 on a match, + 1 on a mismatch
d(AC,CTA) + 1
Similarity: match = 1, mismatch = -1, indel = -2
s(AC,CTAC) = maximum of:
s(A,CTAC) - 2
s(A,CTA) + 1 on a match, - 1 on a mismatch
s(AC,CTA) - 2
Pairwise alignment
Similarity: match = 1, mismatch = -1, indel = -2
[Matrix: columns C T A C T A C T A C G T …; rows A C T; first column A -2, C -4, T -6.]
s(AC,CTAC) = maximum of:
s(A,CTAC) - 2
s(A,CTA) + 1 on a match, - 1 on a mismatch
s(AC,CTA) - 2
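The similarity version differs from the edit-distance table only in the scores and in taking a maximum instead of a minimum. A minimal sketch with the scores of this slide as default parameters (the function name `similarity_matrix` is illustrative):

```python
def similarity_matrix(a, b, match=1, mismatch=-1, indel=-2):
    """S[i][j] = best global alignment score of a[:i] and b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = i * indel               # first column: -2, -4, -6, ...
    for j in range(1, cols):
        S[0][j] = j * indel               # first row: -2, -4, -6, ...
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + indel,
                          S[i - 1][j - 1] + pair,
                          S[i][j - 1] + indel)
    return S


S = similarity_matrix("ACT", "CTACTACTACGT")
print([row[0] for row in S])   # first column of the matrix
print(S[2][4])                 # the cell s(AC,CTAC)
```

With these scores an indel costs twice as much as a mismatch, so the best-scoring alignment is not always the one with the fewest edit operations.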
Pairwise alignment Connect to Links to TEACHING EMBER LePA
Pairwise to multiple alignment
What happens with three strings S1, S2 and S3? Let n be their length; then the cost becomes O(n^3) · O(2^3) · O(3^2).
And with k strings? O(n^k 2^k k^2).
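To see how quickly this bound grows, the expression can simply be evaluated. One common reading of the three factors, stated here as an assumption because the slide does not explain them, is: n^k cells in the k-dimensional table, about 2^k neighbouring cells examined per cell, and about k^2 pairs of characters scored per alignment column. The function name `dp_cost` is illustrative.

```python
def dp_cost(n, k):
    """Value of n^k * 2^k * k^2 for k strings of length n
    (an order-of-magnitude estimate, not an exact operation count)."""
    return n ** k * 2 ** k * k ** 2


for k in range(2, 7):
    print(f"k = {k}: about {dp_cost(100, k):.1e} steps for n = 100")
```

Each additional string of length 100 multiplies the estimate by more than 200, which is why the exact method is abandoned in favour of heuristics for more than a handful of sequences.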
Multiple alignment
Multiple alignment programs use different heuristics:
- Clustal (progressive alignment)
- TCoffee (progressive alignment + databases)
- HMM (Hidden Markov Models)
Multiple alignment Connect to and follow the links TEACHING EMBER.