Bioinformatic PhD. course Bioinformatics Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) LSI Dep. de Llenguatges i Sistemes Informàtics BSC Barcelona.


Bioinformatics PhD course
Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
LSI, Dep. de Llenguatges i Sistemes Informàtics
BSC, Barcelona Supercomputing Center
Universitat Politècnica de Catalunya

Contents
1. Biological introduction
2. Comparison of short sequences (up to … bps): dot matrix, pairwise alignment, multiple alignment, hash algorithms
3. Comparison of large sequences (more than … bps): data structures (suffix trees, MUMs)
4. String matching: exact, extended, approximate
5. Sequence assembly
6. Projects: PROMO, MREPATT, …

2. Comparison of short sequences (< … bps)
Summary (more or less)
2.1 Dot matrix
2.2 Pairwise alignment
2.3 Hash algorithms
2.4 Multiple alignment

2.2 Pairwise alignment
Given two DNA sequences $A = a_1 a_2 \dots a_n$ and $B = b_1 b_2 \dots b_m$ over the alphabet {a,c,t,g}, we say that $A^*$ and $B^*$ over {a,c,t,g,-} are aligned iff
i) $A^*$ and $B^*$ become $A$ and $B$ when the gaps (-) are removed;
ii) $|A^*| = |B^*|$;
iii) there is no position $i$ with $a^*_i = b^*_i = -$.
Which is the best alignment? How many alignments of two sequences exist?
MALIG (an example)
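The three conditions of the definition can be checked mechanically. The following is a minimal sketch, not part of the slides; the function name `is_alignment` and its parameters are hypothetical.

```python
def is_alignment(a_star, b_star, a, b, gap="-"):
    """Check conditions (i)-(iii) of the definition for two gapped strings."""
    # (i) removing the gaps must give back the original sequences
    if a_star.replace(gap, "") != a or b_star.replace(gap, "") != b:
        return False
    # (ii) both gapped strings must have the same length
    if len(a_star) != len(b_star):
        return False
    # (iii) no column may hold a gap in both strings
    return all(not (x == gap and y == gap) for x, y in zip(a_star, b_star))
```

For example, `is_alignment("ac-t", "a-gt", "act", "agt")` holds, while a pair with a column of two gaps is rejected.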

2.2 Number of alignments
Given two DNA sequences $A = a_1 a_2 \dots a_n$ and $B = b_1 b_2 \dots b_m$, the number of alignments satisfies
$$\#(a_1 \dots a_n,\, b_1 \dots b_m) = \#(a_1 \dots a_{n-1},\, b_1 \dots b_m) + \#(a_1 \dots a_n,\, b_1 \dots b_{m-1}) + \#(a_1 \dots a_{n-1},\, b_1 \dots b_{m-1})$$
where the three terms count the alignments that end with $(a_n, -)$, with $(-, b_m)$ and with $(a_n, b_m)$, respectively.
[Table: rows a1 a2 a3, columns b1 b2 b3, one cell per prefix pair, starting with #(a1,b1)]
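The recurrence can be evaluated by filling a table, one cell per pair of prefixes. This is a small sketch under one assumption that the slide does not state: an empty prefix has exactly one alignment with any other prefix (all gaps), which gives the base cases. The name `count_alignments` is hypothetical.

```python
def count_alignments(n, m):
    """Number of alignments of a sequence of length n with one of length m."""
    # table[i][j] = number of alignments of a_1..a_i with b_1..b_j
    # Assumed base case: table[0][j] = table[i][0] = 1 (only gaps are possible).
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (
                table[i - 1][j]        # alignments ending with (a_i, -)
                + table[i][j - 1]      # alignments ending with (-, b_j)
                + table[i - 1][j - 1]  # alignments ending with (a_i, b_j)
            )
    return table[n][m]
```

With this base case, two sequences of length 3 already have 63 alignments.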

2.2 Number of alignments
But what is the asymptotic value?

2.2 Asymptotic value
As
$$\#(a_1 \dots a_n,\, b_1 \dots b_n) > \sum_{k=0}^{n} \binom{n}{k}\binom{n}{k} = \binom{2n}{n}$$
and $n! \sim n^n e^{-n}$ (Stirling approximation, to leading exponential order), the number of alignments grows at least as fast as $2^{2n}$.
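The step from the binomial coefficient to the power of two is not spelled out on the slide. A short derivation, using the same simplified Stirling formula, could read:

```latex
\binom{2n}{n} = \frac{(2n)!}{(n!)^2}
\sim \frac{(2n)^{2n} e^{-2n}}{\left(n^{n} e^{-n}\right)^{2}}
= \frac{2^{2n}\, n^{2n}\, e^{-2n}}{n^{2n}\, e^{-2n}}
= 2^{2n}
```

With the complete Stirling formula $n! \sim \sqrt{2\pi n}\, n^n e^{-n}$ the same computation gives $\binom{2n}{n} \sim 2^{2n} / \sqrt{\pi n}$, so the bound is exponential in either case.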

2.2 Best alignment
How can an alignment be scored?
catcactactgacgactatcgtagcgcggctatacatctacgccaa- ctac-t-gtgtagatcgccgg
c- tgactgc--acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg----
* * *** * * ** * ******* * * **** **** ******* * **** ** * ***
Each column of an alignment is one of three cases:
Match: favorable
Mismatch: unfavorable
Gap: worst case
We then assign a score to each case, for example 1 for a match, -1 for a mismatch and -2 for a gap.
How can the best alignment be found?
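Scoring a given alignment with these example values amounts to summing one value per column. A minimal sketch follows; the function name `score_alignment` is hypothetical and the default scores are the example values 1, -1 and -2.

```python
def score_alignment(a_star, b_star, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two gapped strings of equal length."""
    if len(a_star) != len(b_star):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(a_star, b_star):
        if x == "-" or y == "-":
            total += gap        # one of the two symbols is a gap
        elif x == y:
            total += match      # identical symbols
        else:
            total += mismatch   # different symbols
    return total
```

For instance, aligning `act` with `agt` without gaps scores 1 - 1 + 1 = 1, whereas `ac-t` over `a-gt` scores 1 - 2 - 2 + 1 = -2.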

2.2 Edit distance and alignment of strings
The best alignment of two strings is related to the edit distance, first discussed in … The most efficient algorithm was proposed in 1968 and in 1970, using the technique called “dynamic programming”.

2.2 Best alignment
[Figure: empty dynamic programming matrix with rows A C T G A and columns C T A C T A C T A C G T]

2.2 Best alignment
[Figure: the matrix with rows A C T G A and columns C T A C T A C T A C G T, one cell highlighted]
The cell contains the score of the best alignment of AC and CTACT.

2.2 Best alignment
[Figure: the first row of the matrix is filled in from left to right]
The corner cell holds 0, the alignment of two empty sequences. The next cell holds -2, the score of aligning C with one gap. The following cells align CT with two gaps, and so on up to CTACTA with six gaps.

2.2 Best alignment
[Figure: the first column of the matrix is filled in from top to bottom]
Rows A, C and T hold -2, -4 and -6; the last value is the score of aligning ACT with three gaps.

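2.2 Best alignment
[Figure: the matrix with rows A C T G A and columns C T A C T A C T A C G T, with the first row and the first column filled in (A -2, C -4, T -6)]
The best alignment BA(AC,CTAC) is the best of three candidates:
BA(AC,CTA) followed by the column (-,C)
BA(A,CTA) followed by the column (C,C)
BA(A,CTAC) followed by the column (C,-)
In terms of scores:
$$s(AC,CTAC) = \max\{\, s(AC,CTA) - 2,\; s(A,CTA) + 1,\; s(A,CTAC) - 2 \,\}$$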
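Applying the same maximum to every cell fills the whole matrix. The sketch below assumes the example scores from the earlier slide (match 1, mismatch -1, gap -2); the function name `global_score_matrix` is hypothetical.

```python
def global_score_matrix(a, b, match=1, mismatch=-1, gap=-2):
    """Return the matrix s where s[i][j] scores the best alignment of a[:i] and b[:j]."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: a prefix aligned only with gaps.
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(
                s[i][j - 1] + gap,       # column (-, b_j)
                s[i - 1][j - 1] + pair,  # column (a_i, b_j)
                s[i - 1][j] + gap,       # column (a_i, -)
            )
    return s
```

With rows AC and columns CTAC, the cell `s[2][4]` is s(AC,CTAC); under these scores it evaluates to -2, reached through the diagonal term s(A,CTA) + 1.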

2.2 Best alignment
[Figure: matrix for the sequences accaccacaccacaacgagcata … acctgagcgatat and acc..t]
Given the maximum score, how can the best alignment be found?
Quadratic cost in space and time.
Sequences of up to 10,000 bps in length.
Download the alggen tool.
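One common way to recover an alignment from the maximum score is to walk back from the last cell, at each step choosing a neighbour whose score explains the current one. The sketch below is an illustration under the example scores (1, -1, -2), not necessarily the procedure used by the alggen tool; `global_align` is a hypothetical name.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, gapped a, gapped b) for one best global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + pair, s[i - 1][j] + gap, s[i][j - 1] + gap)

    # Traceback: start at the bottom-right cell and undo one column at a time.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

The matrix needs (n+1)(m+1) cells and each cell takes constant time, which is the quadratic cost in space and time mentioned on the slide. When several alignments reach the same score, this sketch returns only one of them.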

2.2 Some slides revisited
We have developed the theory according to the following principles:
1) Both sequences have a similar length (global alignment).
2) The gap model is linear: k consecutive gaps are penalized with k(-2).

2.2 Semiglobal pairwise alignment
Assume that we have two sequences of different lengths, S1 and S2.
It is meaningless to introduce gaps until both sequences have a similar length.
[Figure: the most probable alignment, with the short sequence placed inside the long one, preceded by initial gaps and followed by final gaps]
How can these alignments be found?

2.2 Semiglobal pairwise alignment
[Figure: matrix with rows A C T and columns C T A C T A C T A C G T, with the initial gaps and the final gaps marked]

2.2 Semiglobal pairwise alignment
[Figure: matrix with rows A C T and columns C T A C T A C T A C G T, one cell highlighted]
Given a cell, it contains the score of the best alignment of CTA with the empty sequence.

2.2 Semiglobal pairwise alignment
The contribution of the initial gaps is disregarded.
[Figure: the matrix with rows A C T and columns C T A C T A C T A C G T, shown twice; the second version shows the values 1, 2 and 3 in rows A, C and T]
But what happens with the final gaps?

2.2 Semiglobal pairwise alignment
[Figure: the matrix with rows A C T and columns C T A C T A C T A C G T, with the values 1, 2 and 3 in rows A, C and T]
How does the algorithm search for the best alignment? By checking the last row for the best score.
Practice with the alggen tool.
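Both ideas, free initial gaps and taking the best score of the last row, fit in a small variant of the global algorithm. This is a sketch under the example scores (1, -1, -2), assuming the short sequence labels the rows as in the figures; `semiglobal_score` and its parameter names are hypothetical.

```python
def semiglobal_score(short_seq, long_seq, match=1, mismatch=-1, gap=-2):
    """Best score of short_seq against any stretch of long_seq.

    Returns (score, end), where end is the column of the last row that
    holds the best score.
    """
    n, m = len(short_seq), len(long_seq)
    # The first row stays at 0: initial gaps are not penalized.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if short_seq[i - 1] == long_seq[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + pair, s[i - 1][j] + gap, s[i][j - 1] + gap)
    # Final gaps are free as well: take the best cell of the last row.
    end = max(range(m + 1), key=lambda j: s[n][j])
    return s[n][end], end
```

With rows ACT and columns CTACTACTACGT the best score is 3, first reached at column 5, where ACT matches the stretch that ends there.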

2.2 Affine-gap model score
Given the following alignments, which all have the same score:

agtaccccgtag
agt-cc--gta-

agtaccccgtag
agt-c-c-gta-

agtaccccgtag
agt-c--cgta-

agtaccccgtag
agt--cc-gta-

agtaccccgtag
agt--c-cgta-

agtaccccgtag
agt---ccgta-

Which is the most reliable case from a biological point of view?

2.2 Affine-gap model score
Then, how can we distinguish between consecutive gaps and separated gaps?

agtaccccgtag
agt--c-cgta-

agtaccccgtag
agt---ccgta-

By penalizing the opening of a gap (OG) more heavily than its extension (EG), for instance -10 and …
Then the penalty of k consecutive gaps becomes OG + (k-1) EG, which is an affine-gap function.
How is the best alignment found?
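The effect of the affine penalty can be seen by scoring the two alignments above. In this sketch the opening penalty is the -10 of the slide; the extension penalty of -1 is an assumption made only for illustration, and `affine_alignment_score` is a hypothetical name.

```python
def affine_alignment_score(a_star, b_star, match=1, mismatch=-1,
                           gap_open=-10, gap_extend=-1):
    """Score two gapped strings, charging OG + (k-1) EG for a run of k gaps."""
    if len(a_star) != len(b_star):
        raise ValueError("aligned strings must have the same length")
    total = 0
    previous = None  # which string held a gap in the previous column, if any
    for x, y in zip(a_star, b_star):
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            # A gap continuing a run in the same string is an extension.
            total += gap_extend if current == previous else gap_open
            previous = current
        else:
            total += match if x == y else mismatch
            previous = None
    return total
```

Under these assumed values the alignment with separated gaps scores 8 - 11 - 10 - 10 = -23, while the one with three consecutive gaps scores 8 - 12 - 10 = -14, so the affine model prefers the latter even though both score the same under the linear model.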

2.2 Affine-gap model score
[Figure: matrix with rows A C T G A and columns C T A C T A C T A C G T, with arrows of two sizes pointing to a cell]
Smallest arrows: refer to the introduction of an opening gap.
Largest arrows: refer to the introduction of an extension gap.
But from which cell do the largest arrows originate?

2.2 Affine-gap model score
[Figure: matrix with rows A C T G A and columns C T A C T A C T A C G T]
In both cases we know which cell contributes the minimum penalty score.
Access to ClustalW:
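A standard way to keep track of where a gap was opened is Gotoh's three-matrix formulation, sketched below. It is a swapped-in textbook technique and may differ from the bookkeeping shown in the figures. The extension penalty of -1 is again an assumption, and the name `affine_global_score` is hypothetical.

```python
NEG_INF = float("-inf")

def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-10, gap_extend=-1):
    """Best global alignment score under an affine gap penalty."""
    n, m = len(a), len(b)
    # M: last column pairs a_i with b_j
    # X: last column pairs a_i with a gap
    # Y: last column pairs a gap with b_j
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + pair
            # Extending a run costs gap_extend; any other predecessor opens a gap.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Setting `gap_open` and `gap_extend` to the same value reduces this to the linear gap model, which gives a simple consistency check against the global algorithm.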

2.2 Local alignment
Given two sequences, we can consider the alignments of all their substrings. How can the best of them be found?
Two questions arise:
- How can the alignments be compared?
- How can the best one be selected?

2.2 Local alignment
Given a path, imagine the graph of the scores along it: can the best subalignments be detected?
[Figure: a path through the matrix for the sequences accaccacaccacaacgagcata … acctgagcgatat and acc..t]
It suffices to compare the value of each cell with zero!
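Comparing each cell with zero is the core of the Smith-Waterman local alignment algorithm. The sketch below assumes the example scores used earlier (1, -1, -2) and a linear gap model; `local_align` is a hypothetical name.

```python
def local_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, gapped piece of a, gapped piece of b) for one best local alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below zero: a bad prefix is simply discarded.
            s[i][j] = max(0, s[i - 1][j - 1] + pair, s[i - 1][j] + gap, s[i][j - 1] + gap)
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)

    # Traceback from the best cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and s[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if s[i][j] == s[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

For `ttacgtt` and `ccacgcc` the shared stretch `acg` is found with score 3, and the surrounding mismatches are left out of the result.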