Bioinformatics.

Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU

Bioinformatics

Techniques for analysis and comparison of biological data sequences
-- Approximate Pattern Matching
-- Multiple Sequence Alignment
-- Applications in Molecular Biology Problems

Basic definitions (α) Edit Distance: the edit distance between two strings is the minimum number of operations needed to transform the first string into the second. The basic operations are insertion, deletion and replacement of a symbol. Example: S1: vintner and S2: writers, edit-distance(S1->S2)=5

Basic definitions (β) Edit Transcript: the sequence of operations that transforms the first string into the second, written as a string over the following symbols:
-- Insertion: I
-- Deletion: D
-- Replacement: R
-- Matching: M
Example: S1: vintner and S2: writers, edit-transcript(S1->S2)= RIMDMDMMI (one replacement, two insertions, two deletions and four matches, which gives the edit distance 5).

Example of Substitution Matrix

Sequence Alignment Sequence Alignment: we write one sequence below the other so that common characters are placed in the same columns.

Sequence alignment allowing gaps We align two sequences by introducing 7 spaces grouped in 4 positions (that is, 4 gaps), which correspond to insertions or deletions in the DNA sequence at the respective positions.

Computing the score-function when aligning sequences

The method of Dynamic Programming Consider two sequences S1 and S2. We define D(i,j) as the edit distance between the prefixes S1[1..i] and S2[1..j], that is, the minimum number of operations needed to transform the first i characters of S1 into the first j characters of S2. The method uses 3 basic techniques: recurrence relation, tabular computation, traceback.

Example of Dynamic Programming Table

Recurrence relationship D(i,j)=min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)], where t(i,j)=0 if S1[i]=S2[j] and t(i,j)=1 otherwise, with base conditions D(i,0)=i and D(0,j)=j.
-- D(i,j-1)+1: we must insert S2[j]
-- D(i-1,j)+1: we must delete S1[i]
-- D(i-1,j-1)+1: in order to map S1[i] to S2[j] we must replace S1[i] with S2[j]
-- D(i-1,j-1): matching

Example: recurrence relationship

Traceback relationship
-- From (i,j) to (i,j-1) if D(i,j)= D(i,j-1)+1 (character insertion)
-- From (i,j) to (i-1,j) if D(i,j)= D(i-1,j)+1 (character deletion)
-- From (i,j) to (i-1,j-1) if D(i,j)= D(i-1,j-1)+t(i,j) (character replacement or matching)
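
The recurrence and the traceback rules above can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the function names `edit_distance_table` and `edit_transcript` are hypothetical, and when several pointers are valid the traceback prefers the diagonal, so the transcript it returns may differ from RIMDMDMMI while having the same cost.

```python
def edit_distance_table(s1, s2):
    """Fill the table D(i,j) with unit costs (tabular computation)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete the first i characters of s1
    for j in range(1, m + 1):
        D[0][j] = j                      # insert the first j characters of s2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + t)   # replacement or match
    return D


def edit_transcript(s1, s2):
    """Return (edit distance, one optimal transcript over I, D, R, M)."""
    D = edit_distance_table(s1, s2)
    i, j = len(s1), len(s2)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:   # diagonal pointer
                ops.append("M" if t == 0 else "R")
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:  # vertical pointer
            ops.append("D")
            i -= 1
        else:                                     # horizontal pointer
            ops.append("I")
            j -= 1
    return D[len(s1)][len(s2)], "".join(reversed(ops))


if __name__ == "__main__":
    print(edit_transcript("vintner", "writers"))
```

The number of symbols other than M in the returned transcript always equals the edit distance.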

Adding traceback pointers

Interpreting traceback pointers

Complexity of Dynamic Programming Methods
-- Initialization: O(n) + O(m)
-- Recurrence relationship: O(n*m)
-- Traceback pointers: O(n+m)
-- Complexity: O(n*m), that is, O(n^2) when both strings have length n
Equivalent to a shortest-path problem in graph theory, on a graph in which every node is labelled (i,j).

Basic definitions (γ) Weighted Edit Distance: the minimum total weight of the operations that transform the first string into the second, when every operation has a specific weight (cost). Assume that the basic operations have the following weights:
-- Insertion or deletion: d
-- Replacement: r
-- Matching: m
Example: S1: vintner and S2: writers, weighted edit-distance(S1->S2)= r+4d+4m.

Recurrence relationship with weights D(i,j)=min[D(i-1,j)+d, D(i,j-1)+d, D(i-1,j-1)+t(i,j)], where: t(i,j)=m, if S1(i)=S2(j), t(i,j)=r, if S1(i)≠S2(j), and D(i,0)=i*d and D(0,j)=j*d.
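
A minimal sketch of the weighted recurrence, assuming the weights d, r and m are passed as plain numbers (the function name and the default values are illustrative assumptions, not part of the slides):

```python
def weighted_edit_distance(s1, s2, d=1, r=1, m=0):
    """Weighted edit distance: d per insertion/deletion, r per replacement, m per match."""
    n, k = len(s1), len(s2)
    D = [[0] * (k + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i * d                  # base condition D(i,0) = i*d
    for j in range(k + 1):
        D[0][j] = j * d                  # base condition D(0,j) = j*d
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            t = m if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i - 1][j] + d,
                          D[i][j - 1] + d,
                          D[i - 1][j - 1] + t)
    return D[n][k]


if __name__ == "__main__":
    print(weighted_edit_distance("vintner", "writers"))            # unit costs
    print(weighted_edit_distance("vintner", "writers", d=1, r=2))  # costlier replacement
```

With d=1, r=1, m=0 the function reduces to the plain edit distance.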

Dynamic programming & sequence similarity based on alphabet Recurrence relationship for sequence similarity: V(i,j)= max[V(i-1,j-1)+s(S1(i), S2(j)), V(i-1,j)+ s(S1(i),_), V(i,j-1)+ s(_,S2(j))], where: s(x,y) is the alignment value of characters x and y, V(0,j)= Σ s(_,S2(k)) for 1≤k≤j, and V(i,0)= Σ s(S1(k),_) for 1≤k≤i.
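
The similarity recurrence can be sketched as below. The scoring function `simple_score` (match +2, mismatch -1, space -1) is an assumption chosen for illustration; in practice s(x,y) would come from a substitution matrix.

```python
def simple_score(x, y):
    """Hypothetical scoring scheme: '_' denotes a space."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1


def global_similarity(s1, s2, s=simple_score):
    """Value V(n,m) of the optimal global alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(s1[i - 1], "_")   # prefix of s1 against spaces
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + s("_", s2[j - 1])   # prefix of s2 against spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                          V[i - 1][j] + s(s1[i - 1], "_"),
                          V[i][j - 1] + s("_", s2[j - 1]))
    return V[n][m]


if __name__ == "__main__":
    print(global_similarity("AC", "AGC"))
```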

Extensions Weighted Edit Distance. In Molecular Biology applications the character substitution weights are stored in substitution matrices (Substitution Matrix: PAM and BLOSUM). It is better to treat the problem as an alignment problem and embed the scores: every match has a positive score, everything else has a score that is 0 or negative.

Extensions
-- Longest Common Subsequence (match weight 1, otherwise 0).
-- End-space free variant, which encourages one string to align in the interior of the other, or the suffix of one to align with the prefix of the other (initial conditions 0; the optimal value is taken over all cells of the last row and the last column). It is used in shotgun sequence assembly.
-- Approximate occurrence of P in T: P occurs approximately in T if the optimal alignment of P to some substring of T has value at least δ (initial conditions V(0,j)=0).
   -- locate a cell (n,j) with value at least δ.
   -- traverse the backpointers from (n,j) to a cell (0,k).
   -- the occurrence is T[k+1..j].
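
The approximate-occurrence variant can be sketched as follows. The sketch only reports the end positions j of cells (n,j) whose value is at least δ and does not follow the backpointers; the scoring scheme (match +2, mismatch -1, space -1) and the function names are assumptions for illustration.

```python
def simple_score(x, y):
    """Hypothetical scoring scheme: '_' denotes a space."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1


def approximate_occurrences(p, t, delta, s=simple_score):
    """1-based end positions j in t where V(n,j) >= delta, with V(0,j) = 0."""
    n, m = len(p), len(t)
    prev = [0] * (m + 1)                 # row 0: an occurrence may start anywhere in t
    for i in range(1, n + 1):
        cur = [prev[0] + s(p[i - 1], "_")] + [0] * m
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + s(p[i - 1], t[j - 1]),
                         prev[j] + s(p[i - 1], "_"),
                         cur[j - 1] + s("_", t[j - 1]))
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] >= delta]


if __name__ == "__main__":
    print(approximate_occurrences("abc", "xxabcxx", 5))
```

Only two rows of the table are kept, which is enough when the occurrences themselves are not reconstructed.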

Local Suffix Alignment Problem Local suffix alignment problem: for two sequences S1 and S2, locate a suffix α of S1[1..i] (possibly empty) and a suffix β of S2[1..j] (possibly empty) so that V(α,β) has the maximum value over all pairs of suffixes of S1[1..i] and S2[1..j]. We denote by v(i,j) the value of the optimal local suffix alignment for the indices i and j (i≤n and j≤m). Initial conditions: v(i,0)=0 and v(0,j)=0, as we can select an empty suffix. v(i,j)=max[0, v(i-1,j-1)+s(S1(i), S2(j)), v(i-1,j)+ s(S1(i),_), v(i,j-1)+ s(_,S2(j))]
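
A minimal sketch of the local suffix alignment recurrence, returning the largest v(i,j) and the cell where it occurs. The scoring scheme (match +2, mismatch -1, space -1) and the function names are assumptions for illustration.

```python
def simple_score(x, y):
    """Hypothetical scoring scheme: '_' denotes a space."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -1


def local_alignment(s1, s2, s=simple_score):
    """Return (best value, (i, j)) over all cells of the table v(i,j)."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]   # v(i,0) = v(0,j) = 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(0,                                   # empty suffixes
                          v[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                          v[i - 1][j] + s(s1[i - 1], "_"),
                          v[i][j - 1] + s("_", s2[j - 1]))
            if v[i][j] > best:
                best, best_cell = v[i][j], (i, j)
    return best, best_cell


if __name__ == "__main__":
    print(local_alignment("xxabcxx", "yyabczz"))
```

Tracing back from the returned cell until a cell with value 0 is reached would recover the pair of locally similar substrings.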

Locally Similar Strings

Remarks
-- Global alignment is called Needleman-Wunsch alignment.
-- Local alignment is called Smith-Waterman alignment.
-- Smith-Waterman can find regions with high similarity: a traceback from any cell (i,j) locates a pair of substrings with similarity v(i,j), and starting from the cell with the largest value gives the best local alignment.

Sequence alignment with gaps Gap: a maximal run of contiguous spaces; we want to control the distribution of gaps. Introducing gaps: to account for the gaps in the value of an alignment of 2 sequences, a simple approach is to let every gap contribute a constant weight Wg, independent of its length. Value of an alignment with k gaps: the sum of the scores of the aligned characters minus k*Wg. A better approach is to use a weight w that is a function of the gap length. Then we fill a matrix: V(i,j)=max[E(i,j), F(i,j), G(i,j)]

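Alignment with arbitrary gap weights
-- G(i,j)= V(i-1,j-1)+cost(i->j), where cost(i->j)=s(S1(i),S2(j)) (S1(i) is aligned with S2(j))
-- E(i,j)= max{V(i,k) - w(j-k)}, 0<=k<=j-1 (the alignment ends with a gap of length j-k opposite S2[k+1..j])
-- F(i,j)= max{V(l,j) - w(i-l)}, 0<=l<=i-1 (the alignment ends with a gap of length i-l opposite S1[l+1..i])

V(i,j)=max{E(i,j), F(i,j), G(i,j)}
-- V(i,0)=-w(i)
-- V(0,j)=-w(j)
-- E(i,0)=-w(i)
-- F(0,j)=-w(j)
-- G(0,0)=0
Assuming that |S1|=n and |S2|=m, the recurrences can be evaluated in O(n*m^2 + n^2*m) time.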
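
The recurrences for arbitrary gap weights can be sketched as below. The inner maximizations over k and l are what give the O(n*m^2 + n^2*m) running time. The gap weight used in the example, w(q) = 2 + q, and the scoring of aligned characters (match +2, mismatch -1) are assumptions for illustration.

```python
def align_arbitrary_gaps(s1, s2, score, w):
    """Value V(n,m) of the optimal alignment when a gap of length q costs w(q)."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)                  # one gap covering S1[1..i]
    for j in range(1, m + 1):
        V[0][j] = -w(j)                  # one gap covering S2[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            E = max(V[i][k] - w(j - k) for k in range(j))   # gap ends the row
            F = max(V[l][j] - w(i - l) for l in range(i))   # gap ends the column
            V[i][j] = max(E, F, G)
    return V[n][m]


def match_score(x, y):
    """Hypothetical character scores."""
    return 2 if x == y else -1


def affine_gap(q):
    """Hypothetical gap weight: opening cost 2 plus 1 per space."""
    return 2 + q


if __name__ == "__main__":
    print(align_arbitrary_gaps("ACGTACGT", "ACGT", match_score, affine_gap))
```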

Applications in Molecular Biology problems Multiple sequence alignment problem: a global multiple alignment of k>2 strings S={S1, S2, …, Sk} is a natural generalization of the alignment of two strings.

Why we are interested in multiple sequence alignment Multiple sequence alignment is used:
-- to identify and represent protein families and super-families,
-- to represent characteristics that are transferred to DNA sequences or protein families,
-- to represent the evolutionary history (phylogenetic trees) of DNA or protein sequences.

Multiple sequence alignment Kinds of alignment methods:
-- Extension of the DP approach (too costly)
-- Use of pairwise alignments (center star algorithm)
Heuristic algorithms used for database searching: FASTA, BLAST.

Sequence Database Searching Steps for characterizing a protein sequence:
-- Compare the new sequence with PROSITE and BLOCKS in order to locate well-characterized sequence motifs.
-- Search DNA and protein sequence databases (Genbank, Swiss-Prot, etc.) for sequences that are locally similar (using a local similarity criterion), with FASTA and BLAST.
-- If these searches provide interesting results, then use the technique of dynamic programming.
-- When amino acid substitution matrices are needed, we usually use a variant of the Dayhoff PAM matrix or of the BLOSUM matrix.

BLAST Algorithm BLAST: Basic Local Alignment Search Tool, Altschul et al. 1990. Basic idea: locate common sub-sequences of the same length (segment pairs) that appear in the input query sequence and in the set of database sequences. These are then extended in order to locate maximal segment pairs.
Algorithm: query type, database sequence type
-- BLASTP: protein, protein
-- BLASTN: nucleotide, nucleotide
-- BLASTX: nucleotide (translated), protein
-- TBLASTN: protein, nucleotide (translated)
-- TBLASTX: nucleotide (translated), nucleotide (translated)

The FASTA Algorithm FASTA: Fast-All, Lipman et al. 1985. Central idea: use of small words (words or k-tuples) that appear in both sequences. In the case of protein sequences the word length is 1-2 residues, while for DNA sequences the word length can reach 6 bases.

Algorithm FASTA
-- 1st step: we search for words of length ktup in the dynamic programming table: 'hot spots' (pairs (i,j)).
-- 2nd step: we locate the ten best diagonal runs of hot spots in the table (a hot spot (i,j) defines the (i-j)-diagonal; the score of a run is the sum of the scores of its hot spots, with a penalty that grows as the distance between consecutive hot spots increases).
-- 3rd step: we combine the «good sub-alignments».
-- 4th step: we produce the best path.
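
The first two steps above can be sketched as follows. This is a simplified illustration, not the FASTA implementation: the function names are hypothetical, and diagonals are ranked simply by their number of hot spots, without the distance-dependent penalty between consecutive hot spots.

```python
from collections import defaultdict


def hot_spots(query, target, ktup):
    """Pairs (i, j), 0-based, where the ktup-words starting at query[i] and target[j] are equal."""
    index = defaultdict(list)
    for j in range(len(target) - ktup + 1):
        index[target[j:j + ktup]].append(j)
    spots = []
    for i in range(len(query) - ktup + 1):
        for j in index.get(query[i:i + ktup], []):
            spots.append((i, j))
    return spots


def best_diagonals(spots, top=10):
    """The `top` diagonals (i - j) with the most hot spots, as (diagonal, count) pairs."""
    counts = defaultdict(int)
    for i, j in spots:
        counts[i - j] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


if __name__ == "__main__":
    spots = hot_spots("ACGTACGT", "TTACGTAA", ktup=4)
    print(spots)
    print(best_diagonals(spots))
```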

Other Topics Amino Acid Substitution Matrices
-- Dayhoff PAM (Point Accepted Mutation)
-- BLOSUM (derived from the BLOCKS database)
BLOCKS is a database of protein motifs that attempts to represent the most highly conserved substrings in the amino acid sequences of related proteins.

Indexing Approaches
-- k-gram indexing
-- direct indexing
-- vector space indexing

k-gram Indexing
-- Hash tables (FASTP, FASTA, BLAST, BL2SEQ, PSI-BLAST, MegaBLAST, BLASTZ, WU-BLAST, BLAT, SSAHA, SENSEI)
-- ed-tree, with parameters: the gram size k, the skip interval Δ, and the segment length vector H=[h1, …, ht], where Σi hi = k.
   -- For each sequence, the algorithm generates all k-grams with Δ skips.
   -- Each gram is partitioned according to H, and the partitions are inserted into the ed-tree.
The search algorithm partitions the query string in the same way and starts a search procedure from the root. Some important properties: (1) the k-grams are usually longer than those of BLAST, (2) inexact k-gram matches are allowed, (3) only one out of every Δ k-grams is indexed. The ed-tree is used by CHAOS, LAGAN, DIALIGN.
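
A hash-table k-gram index over a small database can be sketched as below. It illustrates only the hash-table approach with a skip interval (one k-gram indexed every `skip` positions) and is not an ed-tree; the function names and the exact-match lookup are assumptions for illustration.

```python
from collections import defaultdict


def build_kgram_index(database, k, skip=1):
    """Map each indexed k-gram to the list of (sequence id, position) where it starts."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(0, len(seq) - k + 1, skip):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def search(index, query, k):
    """Exact k-gram hits as (query position, sequence id, sequence position)."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for seq_id, pos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, seq_id, pos))
    return hits


if __name__ == "__main__":
    idx = build_kgram_index(["ACGTACGT", "TTACGG"], k=3, skip=2)
    print(search(idx, "CACGT", k=3))
```

Every k-gram of the query is looked up, so a database k-gram that was skipped during indexing can still be reached through a neighbouring one.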

Direct Indexing
-- Suffix trees: MUMmer, AVID, REPuter, MGA, QUASAR
-- VP-trees (Vantage Point trees)

MUMmer
-- Detecting MUMs (Maximal Unique Matches), that is, pairs of subsequences (x’,y’) that match exactly and such that no other matching pair contains x’ and y’ simultaneously (just use the generalized suffix tree GST(x,y)).
-- Finding the backbone of the alignment: all the pairs (x’,y’) are sorted in increasing order of the position of x’. Next, the longest sequence of MUMs whose subsequences from x and y are both in sorted order is found. These MUMs form the backbone.
-- Closing gaps: the gaps between consecutive MUMs are aligned with the help of Smith-Waterman.
Similar tools are AVID, REPuter, MGA.
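
The backbone step can be sketched as a longest increasing subsequence computation. The representation of a MUM as a tuple (position in x, position in y, length) is an assumption for illustration, and this simple quadratic version ignores overlaps between consecutive MUMs.

```python
def mum_backbone(mums):
    """Longest chain of MUMs that is increasing in both x and y positions."""
    mums = sorted(mums)                  # increasing position of x'
    n = len(mums)
    if n == 0:
        return []
    best = [1] * n                       # best[b]: longest chain ending at MUM b
    prev = [-1] * n                      # predecessor of b in that chain
    for b in range(n):
        for a in range(b):
            if mums[a][1] < mums[b][1] and best[a] + 1 > best[b]:
                best[b] = best[a] + 1
                prev[b] = a
    end = max(range(n), key=lambda b: best[b])
    chain = []
    while end != -1:
        chain.append(mums[end])
        end = prev[end]
    return chain[::-1]


if __name__ == "__main__":
    print(mum_backbone([(0, 5, 3), (4, 0, 3), (8, 4, 3), (12, 9, 3)]))
```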

VP-trees (Vantage Point tree) The VP-tree has been adapted to sequence databases where the distance is the edit distance or the block edit distance. It can be applied to other distance functions as long as they are almost metric. The algorithm takes a database D={s1,…,sn} and chooses a sequence s as the root; the median M of the distances from s to the other sequences is then computed. The database is split into two sets: (i) the sequences that are closer to s than the median (left child), (ii) the rest of the sequences (right child). Given a query q, the sequences within distance r of q are found as follows. First, q is compared to the vantage sequence s at the root node. Let M be the median distance.
-- 1. If d(q,s)≤r, s is inserted into the result set.
-- 2. If d(q,s)≤r+M, then the left child is searched recursively.
-- 3. If d(q,s)≥M-r, then the right child is searched recursively.
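
The construction and the three search rules above can be sketched as follows, with the unit-cost edit distance as the metric. The dictionary-based node layout, the choice of the first sequence as vantage point, and the function names are assumptions for illustration.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_vp_tree(seqs, dist=edit_distance):
    """Split the sequences around the median distance to the vantage sequence."""
    if not seqs:
        return None
    vantage, rest = seqs[0], seqs[1:]
    if not rest:
        return {"vantage": vantage, "median": 0, "left": None, "right": None}
    pairs = [(dist(vantage, s), s) for s in rest]
    median = sorted(d for d, _ in pairs)[len(pairs) // 2]
    left = [s for d, s in pairs if d < median]     # closer than the median
    right = [s for d, s in pairs if d >= median]   # the rest
    return {"vantage": vantage, "median": median,
            "left": build_vp_tree(left, dist),
            "right": build_vp_tree(right, dist)}


def range_search(node, q, r, dist=edit_distance, out=None):
    """All sequences within distance r of q, using the three rules of the slide."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node["vantage"])
    if d <= r:
        out.append(node["vantage"])
    if d <= r + node["median"]:
        range_search(node["left"], q, r, dist, out)
    if d >= node["median"] - r:
        range_search(node["right"], q, r, dist, out)
    return out


if __name__ == "__main__":
    db = ["vintner", "writers", "winter", "writer", "painter"]
    print(sorted(range_search(build_vp_tree(db), "writer", 1)))
```

The pruning is safe only because the edit distance satisfies the triangle inequality.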

Vector Space Indexing These index structures map sequences or subsequences to vectors in a vector space. There exist two important index structures:
-- SST (Sequence Search Tree)
-- MRS (Multi Resolution String) index

SST Vector Space Mapping (window size w, shift amount Δ, tuple size k). The parameter k determines the size of the computed vector (for alphabet size σ this vector has size σ^k). The produced vectors are stored in a so-called centroid structure. A query is performed as follows: the query sequence is first divided into subsequences of window size w, using a shift amount of Δ=w/2. Each of the produced query vectors is then searched on the index structure, starting from the root node.
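
The mapping of windows to vectors of size σ^k can be sketched as below. The sketch assumes that each vector component counts the occurrences of one k-tuple in the window, which is one common choice and is not stated on the slide; the function names and the DNA alphabet are also assumptions for illustration.

```python
from itertools import product


def ktuple_vector(window, k, alphabet="ACGT"):
    """Vector of size len(alphabet)**k counting each k-tuple in the window."""
    tuples = ["".join(p) for p in product(alphabet, repeat=k)]
    position = {t: i for i, t in enumerate(tuples)}
    vec = [0] * len(tuples)
    for i in range(len(window) - k + 1):
        idx = position.get(window[i:i + k])
        if idx is not None:              # skip tuples with characters outside the alphabet
            vec[idx] += 1
    return vec


def window_vectors(seq, w, shift, k, alphabet="ACGT"):
    """One vector per window of size w, taken every `shift` positions."""
    return [ktuple_vector(seq[start:start + w], k, alphabet)
            for start in range(0, len(seq) - w + 1, shift)]


if __name__ == "__main__":
    for vec in window_vectors("ACGTACGT", w=4, shift=2, k=2):
        print(vec)
```

For a query, the same function would be called with shift = w // 2, following the description above.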