2016-3-19 Sequence comparison and database search.

Presentation transcript:

Sequence comparison and database search

Strings: string
- String (sequence): an ordered succession of characters (symbols).
- Alphabet: (1) the DNA alphabet {A, C, G, T} of nucleotides; (2) the 20-character alphabet of amino acids.
- Length: |s| is the length of s, and s[i] is its i-th character for i = 1…|s|. Example: s = AATGCA, s[3] = T.
- Empty string: ε, with |ε| = 0.

Strings: subsequence and supersequence
- Subsequence: a subsequence of s is a sequence that can be obtained from s by removing some characters, keeping the remaining characters in their original order.
- Supersequence: when sequence t is a subsequence of sequence s, we say that s is a supersequence of t.

Strings: substring and superstring
- Substring: a substring of s is a string formed by consecutive characters of s, in the same order as they appear in s.
- Superstring: when sequence t is a substring of sequence s, we say that s is a superstring of t.
- Interval: an interval of a string s is a set of consecutive indices...
- s[i..j] denotes the empty string when i = j+1.
- Therefore, for any substring t of s there is at least one interval [i..j] of s with t = s[i..j].
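The difference between a subsequence and a substring can be checked with a short Python sketch. The function names `is_subsequence` and `is_substring` are illustrative and do not come from the slides.

```python
def is_subsequence(t: str, s: str) -> bool:
    """True if t can be obtained from s by deleting characters of s."""
    remaining = iter(s)
    # `ch in remaining` consumes the iterator, so order is respected.
    return all(ch in remaining for ch in t)


def is_substring(t: str, s: str) -> bool:
    """True if t occurs in s as a run of consecutive characters."""
    return t in s


s = "AATGCA"
print(is_subsequence("ATC", s))  # True: A, T, C appear in this order
print(is_substring("ATC", s))    # False: they are not consecutive
print(is_substring("TGC", s))    # True: s[3..5] in 1-based indexing
```

Every substring is a subsequence, but not the other way round.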

Strings: concatenation, prefix, suffix
- Concatenation: the concatenation of two strings s and t is written st (t placed after s).
- Concatenating a string with itself is written as a power: s^3 = sss.
- Prefix: a prefix of s is any substring of s of the form s[1..j] for 0 ≤ j ≤ |s|. Equivalently, t is a prefix of s if and only if s = tu for some string u.
- Prefix(s, k): the prefix of s with exactly k characters, for 0 ≤ k ≤ |s|.
- Suffix: any substring of the form s[i..|s|] for 1 ≤ i ≤ |s|+1.
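The 1-based notation above maps onto Python's 0-based slices as in the sketch below. The helper names `prefix` and `suffix` are illustrative.

```python
def prefix(s: str, k: int) -> str:
    """Prefix(s, k): the first k characters of s, i.e. s[1..k] in 1-based notation."""
    if not 0 <= k <= len(s):
        raise ValueError("k must satisfy 0 <= k <= |s|")
    return s[:k]


def suffix(s: str, i: int) -> str:
    """The suffix s[i..|s|] in 1-based notation; i = |s|+1 gives the empty string."""
    if not 1 <= i <= len(s) + 1:
        raise ValueError("i must satisfy 1 <= i <= |s|+1")
    return s[i - 1:]


s = "AATGCA"
print(prefix(s, 3))            # AAT
print(suffix(s, 4))            # GCA
print(suffix(s, len(s) + 1))   # empty string
print(s * 3 == s + s + s)      # True: s^3 = sss
```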

Sequence comparison and database search: outline
- What is sequence comparison?
- How to compare sequences:
  1. Comparing two sequences: (a) comparing the entire sequences; (b) comparing parts of the sequences.
  2. Comparing more than two sequences.

Biological background
[Figure: an alignment of two DNA sequences shown in three blocks; vertical bars mark matching positions.]
TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
AAGGATC TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT

3.1 Biological background: two notions
- Similarity: a measure of how similar two sequences are.
- Alignment: a basic operation for comparing two sequences; a way of placing one sequence above the other in order to make clear the correspondence between similar characters or substrings of the sequences.

Biological background
- Alignment: a correspondence between the elements of two sequences that preserves their order (topology).
- Pairwise alignment: two sequences aligned.
- Multiple alignment: an alignment of three or more sequences.

    FSEYTTHRGHR        FESYTTHRGHR
    :  ::::: ::        :::::::: ::
    FESYTTHRPHR        FESYTTHRPHR

Biological background: four cases, with task and application
(1) Two sequences of tens of thousands (10,000) of characters that differ only by isolated insertions, deletions, and substitutions, occurring as rarely as once per hundred (100) characters. Task: find the differences. Application: comparing two different sequencings.
(2) Two sequences: decide whether a prefix of one is similar to a suffix of the other.
(3) Two sequences: decide whether there are two similar substrings, one from each sequence. Application: analyzing conserved sequences.

Comparing two sequences
Alignments involve:
- global comparisons: the entire sequences;
- local comparisons: just substrings of the sequences.
(Liao L, Noble WS: Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. J Comput Biol 2003, 10(6).)

Global comparison: example
Aligning GACGGATTAG and GATCGGAATAG:

    GA-CGGATTAG
    GATCGGAATAG

The second sequence has an extra T, and there is one change from A to T. The space is written as a dash.

Global comparison: the basic algorithm (definitions)
- Alignment: insert spaces into the sequences so that both have the same size, then place one over the other to create a correspondence between their characters. A column with two spaces is not allowed. Spaces can be inserted at the beginning or the end.
- Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps):
  a match (identical characters): +1
  a mismatch (distinct characters): -1

Global comparison: the basic algorithm (scoring)
  a space: -2
- Scoring system: rewards matches and penalizes mismatches and spaces.

    GA-CGGATTAG
    GATCGGAATAG

  What is the score of this alignment?
- Similarity sim(s, t): the maximum alignment score over all alignments of s and t. Several alignments may reach this score.
- Best (optimal) alignment: an alignment whose score equals the similarity.
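The scoring system above (+1 match, -1 mismatch, -2 space) can be applied column by column. The following is a minimal sketch; the function name `alignment_score` is illustrative and the space is written as an ASCII hyphen.

```python
def alignment_score(row_s: str, row_t: str,
                    match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Sum the column scores of a pairwise alignment given as two equal-length rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column with two spaces is not allowed")
        if a == "-" or b == "-":
            total += space        # one space in the column
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # distinct characters
    return total


print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```

The alignment has 9 matches, 1 mismatch, and 1 space, so the score is 9 - 1 - 2 = 6.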

Global comparison: the basic algorithm (Needleman-Wunsch)
Basic DP algorithm for comparing two sequences:
- The number of alignments between two sequences is exponential.
- An efficient algorithm uses dynamic programming (DP): it solves the problem for prefixes, from shorter to longer.
- Example: s = AAAC, t = AGC.

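Idea
- Use an (m+1)×(n+1) array a: entry (i, j) holds the similarity between s[1..i] and t[1..j].
- p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j].
- Each entry is computed from three neighbouring entries, with g the space penalty:
  a[i, j] = max(a[i, j-1] + g, a[i-1, j-1] + p(i, j), a[i-1, j] + g)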

Global comparison: the basic algorithm
[Table: the dynamic-programming array for s = AAAC (rows) and t = AGC (columns).]

Global comparison: the basic algorithm
2. A good computing order:
- row by row, left to right on each row;
- column by column, top to bottom on each column;
- any other order that makes sure a[i, j-1], a[i-1, j], and a[i-1, j-1] are available when a[i, j] must be computed.
3. Notes:
- Parameter g specifies the space penalty (usually g < 0); here g = -2.
- The scoring function p for pairs of characters: p(a, b) = 1 if a = b, and p(a, b) = -1 if a ≠ b.

Global comparison: the basic algorithm

Algorithm Similarity
  input: sequences s and t
  output: similarity between s and t
  m ← |s|
  n ← |t|
  for i ← 0 to m do
    a[i, 0] ← i × g
  for j ← 0 to n do
    a[0, j] ← j × g
  for i ← 1 to m do
    for j ← 1 to n do
      a[i, j] ← max(a[i, j-1] + g, a[i-1, j-1] + p(i, j), a[i-1, j] + g)
  return a[m, n]
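Algorithm Similarity translates almost line for line into Python. The sketch below assumes the slide's scoring (match +1, mismatch -1, g = -2). The function name `similarity_matrix` is illustrative, and it returns the whole array rather than only a[m, n] so the entries can be inspected.

```python
def similarity_matrix(s: str, t: str, g: int = -2,
                      match: int = 1, mismatch: int = -1) -> list[list[int]]:
    """Fill the (m+1) x (n+1) array a, where a[i][j] = sim(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g              # s[1..i] aligned against i spaces
    for j in range(n + 1):
        a[0][j] = j * g              # t[1..j] aligned against j spaces
    for i in range(1, m + 1):        # row by row, left to right
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g,        # space in s
                          a[i - 1][j - 1] + p,    # two symbols
                          a[i - 1][j] + g)        # space in t
    return a


a = similarity_matrix("AAAC", "AGC")
for row in a:
    print(row)
print("sim =", a[-1][-1])  # sim = -1
```

For s = AAAC and t = AGC the bottom-right entry is -1, which is the score of both AAAC/AG-C and AAAC/-AGC.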

Optimal alignments
How do we construct an optimal alignment between two sequences, given their similarity?
Idea of Algorithm Align:
- All we need to do is start at entry (m, n) and follow the arrows until we get to (0, 0).
- An optimal alignment can easily be constructed from right to left if we have the array a computed by the basic algorithm.
- The variables align-s and align-t are treated as globals in the code.
- The call Align(m, n, len) constructs an optimal alignment.
- Note: max(|s|, |t|) ≤ len ≤ m+n.

Recursive algorithm for optimal alignment

Algorithm Align
  input: indices i, j, array a given by algorithm Similarity
  output: alignment in align-s, align-t, and length in len
  if i = 0 and j = 0 then
    len ← 0
  else if i > 0 and a[i, j] = a[i-1, j] + g then
    Align(i-1, j, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← -
  else if i > 0 and j > 0 and a[i, j] = a[i-1, j-1] + p(i, j) then
    Align(i-1, j-1, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← t[j]
  else // has to be j > 0 and a[i, j] = a[i, j-1] + g
    Align(i, j-1, len)
    len ← len + 1
    align-s[len] ← -
    align-t[len] ← t[j]
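A minimal Python sketch of the same traceback follows. It assumes the scoring match +1, mismatch -1, g = -2, and it replaces the recursion with an iterative loop that collects columns from right to left and reverses them at the end. The branch order is the one used in Algorithm Align; the function name `align` is illustrative.

```python
def align(s: str, t: str, g: int = -2,
          match: int = 1, mismatch: int = -1) -> tuple[str, str]:
    """Return one optimal global alignment of s and t as two equal-length rows."""
    def p(i: int, j: int) -> int:          # 1-based, as in the pseudocode
        return match if s[i - 1] == t[j - 1] else mismatch

    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g
    for j in range(n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + p(i, j), a[i - 1][j] + g)

    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:                  # walk from (m, n) back to (0, 0)
        if i > 0 and a[i][j] == a[i - 1][j] + g:
            row_s.append(s[i - 1]); row_t.append("-")       # space in t
            i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(i, j):
            row_s.append(s[i - 1]); row_t.append(t[j - 1])  # two symbols
            i, j = i - 1, j - 1
        else:                              # must be j > 0: space in s
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return "".join(reversed(row_s)), "".join(reversed(row_t))


print(align("AAAC", "AGC"))  # ('AAAC', 'AG-C')
```

Because the branch for a space in t is tested first, the sketch returns AAAC/AG-C rather than the equally good AAAC/-AGC.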

Optimal alignments: arrow preference
- When there is a choice, a column with a space in t (maximum preference) has precedence over a column with two symbols, which in turn has precedence over a column with a space in s (minimum preference).
- For s = AAAC and t = AGC the algorithm therefore returns

    AAAC                 AAAC
    AG-C   rather than   -AGC

Optimal alignments: complexity of the algorithms (time and space)
- Basic dynamic programming (computing the similarity of two sequences): O(mn), or O(n²) when both sequences have length about n.
- Recursive algorithm for constructing an optimal alignment: O(len) = O(m+n).

Scoring matrix
Scoring function: p(a, a) = 1; p(a, b) = -1 for a ≠ b; p(a, -) = p(-, b) = -2.

Nucleic acid matrix (1): BLAST matrix, the default matrix for BLAST

         A    T    C    G
    A    5   -4
    T         5
    C              5
    G                   5

Nucleic acid matrix (1): transfer matrix
Columns: A, T, C, G. Entries as shown on the slide, row by row:
    A:  1, -5
    T: -5,  1, -5
    C:  1, -5
    G: -5,  1

Protein substitution matrix
- PAM: a point accepted mutation (PAM) is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection.
- BLOSUM: a BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins.

Blocks Substitution Matrix (BLOSUM)
- Scores are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.
- The matrix name indicates evolutionary distance: BLOSUM62 was created using sequences sharing no more than 62% identity.

Example: align VDSCY and VESLCY using BLOSUM62 with gap penalty d = -11.

Global comparison (1)

           Gap     V       D       S    C    Y
    Gap    0       1·gap   2·gap   …
    V      1·gap
    E      2·gap
    S      …
    L
    C
    Y

Global comparison (2)

           Gap     V     D     S     C     Y
    Gap      0   -11   -22   -33   -44   -55
    V      -11   S_ij
    E      -22
    S      -33
    L      -44
    C      -55
    Y      -66

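Global comparison (3)
Rows are indexed by i and columns by j.

           Gap     V     D     S     C     Y
    Gap      0   -11   -22   -33   -44   -55
    V      -11   S_ij
    E      -22
    S      -33
    L      -44
    C      -55
    Y      -66

Needleman-Wunsch:
$S_{i,j} = \max \{\, S_{i-1,j-1} + \sigma(x_i, y_j),\; S_{i-1,j} + d \text{ (up to down)},\; S_{i,j-1} + d \text{ (left to right)} \,\}$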

BLOSUM62

Global comparison (4)

           Gap     V     D     S     C     Y
    Gap      0   -11   -22   -33   -44   -55
    V      -11     4
    E      -22
    S      -33
    L      -44
    C      -55
    Y      -66

Needleman-Wunsch:
$S_{i,j} = \max \{\, S_{i-1,j-1} + \sigma(x_i, y_j),\; S_{i-1,j} + d \text{ (up to down)},\; S_{i,j-1} + d \text{ (left to right)} \,\}$

BLOSUM62 substitution matrix

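Global comparison (5)

           Gap     V     D     S     C     Y
    Gap      0   -11   -22   -33   -44   -55
    V      -11     4
    E      -22
    S      -33
    L      -44
    C      -55
    Y      -66

V → D: σ(V, D) = -3, so the entry in row V, column D is max(-11 + (-3), -22 + (-11), 4 + (-11)) = -7.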

BLOSUM62 substitution matrix

Result:

    V D S - C Y
    V E S L C Y

[Table: the completed array with columns Gap, V, D, S, C, Y and rows V, E, S, L, C, Y.]
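The worked example can be reproduced with a short Needleman-Wunsch sketch. Only the BLOSUM62 entries for the seven residues involved are included. These values are recalled from the standard table and should be treated as an assumption to verify against a reference copy. The names `BLOSUM62_SUBSET`, `sigma`, and `needleman_wunsch` are illustrative.

```python
# Subset of BLOSUM62 for residues V, D, S, C, Y, E, L only.
# Assumption: values recalled from the standard table; verify before reuse.
BLOSUM62_SUBSET = {
    "VV": 4, "DD": 6, "SS": 4, "CC": 9, "YY": 7, "EE": 5, "LL": 4,
    "VD": -3, "VS": -2, "VC": -1, "VY": -1, "VE": -2, "VL": 1,
    "DS": 0, "DC": -3, "DY": -3, "DE": 2, "DL": -4,
    "SC": -1, "SY": -2, "SE": 0, "SL": -2,
    "CY": -2, "CE": -4, "CL": -1,
    "YE": -2, "YL": -1,
    "EL": -3,
}


def sigma(a: str, b: str) -> int:
    """Symmetric lookup of the substitution score for residues a and b."""
    key = a + b if a + b in BLOSUM62_SUBSET else b + a
    return BLOSUM62_SUBSET[key]


def needleman_wunsch(x: str, y: str, d: int = -11) -> tuple[int, str, str]:
    """Global alignment with a linear gap penalty d; returns (score, row_x, row_y)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * d
    for j in range(1, n + 1):
        S[0][j] = j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]),
                          S[i - 1][j] + d,
                          S[i][j - 1] + d)
    # Trace back from (m, n) to (0, 0), collecting columns right to left.
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + d:
            row_x.append(x[i - 1]); row_y.append("-")
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


print(needleman_wunsch("VDSCY", "VESLCY"))  # (15, 'VDS-CY', 'VESLCY')
```

Under these assumed scores the total is 4 + 2 + 4 - 11 + 9 + 7 = 15, with the single space placed opposite L, as in the result above.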

Local comparison
Problem:
- A local alignment between s and t is an alignment between a substring of s and a substring of t.
- Algorithm: how do we find the highest-scoring local alignment between two sequences?
Example: which one is better?

    AATC        AATC
    AAT-        AACT

Local comparison
Idea:
- Data structure: an (m+1)×(n+1) array; each entry holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j].
- Initialization: the first row and column are initialized with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.

Local comparison: an example
Sequences: LDSCH and GESLCK.
- Task: find the optimal local alignment.

Local comparison (1)
Gap penalty: d = -11.

           Gap   L     D   S   C   H
    Gap    0     0     0   0   0   0
    G      0     S_ij
    E      0
    S      0
    L      0
    C      0
    K      0

Smith-Waterman:
$S_{i,j} = \max \{\, S_{i-1,j-1} + p(x_i, y_j),\; S_{i-1,j} + d \text{ (up to down)},\; S_{i,j-1} + d \text{ (left to right)},\; 0 \,\}$

BLOSUM62

Local comparison (2)

           Gap   L   D   S   C   H
    G      0     0
    E      0
    S      0
    L      0
    C      0
    K

Local comparison (3)
Gap penalty: -11.

           Gap   L   D   S   C   H
    G      0     0   0
    E      0
    S      0
    L      0
    C      0
    K      0

Result:
[Table: the completed array with columns Gap, L, D, S, C, H and rows G, E, S, L, C, K.]

    L D S - C H
    G E S L C K

Does it make sense?

Local comparison score
Smith-Waterman score = 9

    L D S - C H
    G E S L C K

Revisiting this example
If we change the gap penalty from -11 to -4, what will we get?
Sequences: LDSCH and GESLCK.
- Task: find the optimal local alignment.

Local comparison (1)
Gap penalty: d = -4.

           Gap   L     D   S   C   H
    Gap    0     0     0   0   0   0
    G      0     S_ij
    E      0
    S      0
    L      0
    C      0
    K      0

Smith-Waterman:
$S_{i,j} = \max \{\, S_{i-1,j-1} + p(x_i, y_j),\; S_{i-1,j} + d \text{ (up to down)},\; S_{i,j-1} + d \text{ (left to right)},\; 0 \,\}$

BLOSUM62

Local comparison (2)
Gap penalty: -4.

           Gap   L   D   S   C   H
    G      0     0
    E      0
    S      0
    L      0
    C      0
    K      0

Local comparison (3)
Gap penalty: -4.

           Gap   L   D   S   C   H
    G      0     0   0
    E      0
    S      0
    L      0
    C      0
    K      0

Result:
[Table: the completed array with columns Gap, L, D, S, C, H and rows G, E, S, L, C, K.]

    L D S - C H
    G E S L C K

Local comparison score
Smith-Waterman score = 11

    L D S - C H
    G E S L C K
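Both runs of the local example (gap -11 and gap -4) can be checked with a small Smith-Waterman sketch. The table below holds only the BLOSUM62 entries needed for LDSCH against GESLCK. These values are recalled from the standard matrix and are an assumption to verify against a reference copy. The names `SCORES` and `smith_waterman` are illustrative.

```python
# BLOSUM62 entries needed for LDSCH (outer keys) against GESLCK (inner keys).
# Assumption: values recalled from the standard table; verify before reuse.
SCORES = {
    "L": {"G": -4, "E": -3, "S": -2, "L": 4, "C": -1, "K": -2},
    "D": {"G": -1, "E": 2, "S": 0, "L": -4, "C": -3, "K": -1},
    "S": {"G": 0, "E": 0, "S": 4, "L": -2, "C": -1, "K": 0},
    "C": {"G": -3, "E": -4, "S": -1, "L": -1, "C": 9, "K": -3},
    "H": {"G": -2, "E": 0, "S": -1, "L": -3, "C": -3, "K": -1},
}


def smith_waterman(s: str, t: str, gap: int) -> tuple[int, str, str]:
    """Best local alignment score and the aligned substrings of s and t."""
    m, n = len(s), len(t)
    h = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            h[i][j] = max(0,
                          h[i - 1][j - 1] + SCORES[s[i - 1]][t[j - 1]],
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero entry is reached.
    out_s, out_t = [], []
    i, j = best_pos
    while h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + SCORES[s[i - 1]][t[j - 1]]:
            out_s.append(s[i - 1]); out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-")
            i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))


print(smith_waterman("LDSCH", "GESLCK", gap=-11))  # (9, 'C', 'C')
print(smith_waterman("LDSCH", "GESLCK", gap=-4))   # (11, 'DS-C', 'ESLC')
```

Under these assumed scores, a gap penalty of -11 makes the lone C/C pair (score 9) the best local alignment, because opening a gap would cost more than the neighbouring matches gain. With -4, the region DS-C / ESLC becomes optimal: 2 + 4 - 4 + 9 = 11.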

comparing multiple sequences Multiple sequence alignments are used for many reasons, including: (1) to detect regions of variability or conservation in a family of proteins, (2) to provide stronger evidence than pairwise similarity for structural and functional inferences.

Comparing multiple sequences: motivation
- Multiple alignment (MA) of s_1, …, s_k: shows which parts of the sequences are similar and which parts are different.
- Multiple alignment is a generalization of pairwise alignment and uses a similar operation; no column may be made exclusively of spaces.

Comparing multiple sequences
- Multiple alignments are more common with proteins (amino acid sequences).
- How do we evaluate different MAs of the same set of sequences?

Comparing multiple sequences
Scoring schemes:
(1) SP measure: scoring an alignment based on pairwise alignments.
(2) Star alignment.

The SP (sum-of-pairs) measure: scoring an MA
- Only additive functions are considered here.
- "Reasonable" properties:
  (1) The function is independent of the order of the sequences, e.g. SP(I, -, I, V) = SP(V, I, I, -).
  (2) It rewards the presence of many equal or strongly related residues and penalizes unrelated residues and spaces.

The SP measure
- The sum-of-pairs (SP) function is a function that meets the two properties.
- Example, with match = 1, mismatch = -1, and gap = -2:
  SP(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)
                 = (-2) + 1 + (-1) + (-2) + (-2) + (-1)
                 = -7
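The column score above can be computed by summing over all unordered pairs in the column. This is a minimal sketch with match = 1, mismatch = -1, gap = -2, and p(-, -) = 0. The function names `pair_score`, `sp_column`, and `sp_alignment` are illustrative.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score of one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0                 # two spaces contribute nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column: str) -> int:
    """Sum-of-pairs score of a single column of a multiple alignment."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows: list[str]) -> int:
    """SP score of a multiple alignment: the sum of its column scores."""
    return sum(sp_column("".join(col)) for col in zip(*rows))


print(sp_column("I-IV"))  # -7
print(sp_column("VII-"))  # -7, the order of the sequences does not matter
```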

The SP measure
Although there is never an entire column of gaps, if we look at any two sequences in the alignment there may be columns where both have gaps. For such pairs, p(-, -) = 0.

The SP measure: induced pairwise alignment (projection of a multiple alignment)
In an MA, select two of the sequences, forget all the rest, and remove the columns with two spaces to derive a true PA (the induced pairwise alignment). For example,

    PEAALYGRFT---IKSDVM
    PEALNYGRY---SSESDVW

becomes

    PEAALYGRFT-IKSDVM
    PEALNYGRY-SSESDVW

α_ij: the PA induced by α on s_i and s_j.
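The projection step amounts to dropping every column in which both selected rows hold a space. A minimal sketch follows; the function name `induced_pairwise` is illustrative.

```python
def induced_pairwise(row_i: str, row_j: str) -> tuple[str, str]:
    """Project two rows of a multiple alignment onto a true pairwise alignment."""
    if len(row_i) != len(row_j):
        raise ValueError("rows of a multiple alignment must have equal length")
    # Keep every column except those where both rows contain a space.
    kept = [(a, b) for a, b in zip(row_i, row_j) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


top, bottom = induced_pairwise("PEAALYGRFT---IKSDVM", "PEALNYGRY---SSESDVW")
print(top)     # PEAALYGRFT-IKSDVM
print(bottom)  # PEALNYGRY-SSESDVW
```

Only the two columns where both rows have a space are removed. The remaining spaces face a residue and therefore stay.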

3.4.1 the SP measure summary Way 1: compute scores of each column, and then add all column scores Way 2: compute scores for induced PA, and then add these scores

Star alignments
A heuristic method for multiple sequence alignment:
- Select a sequence s_c as the center of the star.
- For each sequence s_i among s_1, …, s_k with i ≠ c, perform a global alignment between s_i and s_c.
- Aggregate the alignments with the principle "once a gap, always a gap."

Star alignments
For example, say your sequences are:

    S1  ATTGCCATT
    S2  ATGGCCATT
    S3  ATCCAATTTT
    S4  ATCTTCTT
    S5  ACTGACC

(1) Find the center sequence.
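One common way to carry out step (1) is to pick the sequence whose summed pairwise similarity to all the others is largest. The slides do not state the criterion or the scoring used, so both are assumptions in this sketch: match +1, mismatch -1, space -2, and the center is the sequence with the maximum total. The names `similarity` and `choose_center` are illustrative.

```python
def similarity(s: str, t: str, g: int = -2, match: int = 1, mismatch: int = -1) -> int:
    """Global similarity sim(s, t), keeping only two rows of the DP array."""
    prev = [j * g for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * g]
        for j in range(1, len(t) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(cur[j - 1] + g, prev[j - 1] + p, prev[j] + g))
        prev = cur
    return prev[-1]


def choose_center(seqs: list[str]) -> tuple[int, list[int]]:
    """Index of the assumed center sequence and the per-sequence similarity totals."""
    totals = [sum(similarity(s, t) for k, t in enumerate(seqs) if k != c)
              for c, s in enumerate(seqs)]
    center = max(range(len(seqs)), key=totals.__getitem__)
    return center, totals


SEQS = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
center, totals = choose_center(SEQS)
print("totals:", totals)
print("center: S%d = %s" % (center + 1, SEQS[center]))
```

With a different scoring scheme the totals, and possibly the chosen center, may change.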

Star alignments
(2) Do the pairwise alignments.

Star alignments
(3) Build the multiple alignment.