Pairwise Sequence Alignment and Database Searching


Objectives
- What is the function of this gene?
- Do other genes have this functional motif?
- Can I predict the higher-order structure of this protein?
- Is this gene a member of a known gene family?
- Do other organisms have this gene?


Intuition
“Similar” sequences should have (long) regions of similar or identical residues. Why?
- Evolution: descent from a common ancestral sequence
- Functional/structural convergence

A sequence similarity search is an application of pairwise sequence alignment!

Dotplots
[Figure: dotplot; axis residues G A G C T T A G G G C C T T T G G G A A A]

General Alignment/Search Issues
- Search using the amino acid sequence if possible. Why? Protein evolution is slower than DNA sequence evolution.
- Ask the program to translate your query sequence in all 6 possible reading frames.
- Statistical theory is based on unrealistic assumptions; consider searches as exploratory analyses.

Alignment Jargon
Evolutionarily related sequences differ from one another because of several processes:
- Substitutions
- Insertions
- Deletions
[Figure: an ancestor sequence giving rise to the observed sequences A and B]

Alignment Jargon
Substitution (ancestor ACG, A-G change):
    GCG
     ||
    ACG
1 mismatch, 2 matches

Alignment Jargon
Insertion (ancestor ACG):
    ATCG
    | ||
    A-CG
0 mismatches, 3 matches, 1 gap

Alignment Jargon
Deletion (ancestor ATCG):
    ATCG
    | ||
    A-CG
0 mismatches, 3 matches, 1 gap

Alignment Jargon
Results of insertion and deletion events can be indistinguishable.
Indel: INsertion or DELetion

Sequence Alignment
Sequence alignment is simply the “optimal” assignment of substitution and indel events to a pair of sequences.
- Global alignment: align entire sequences
- Local alignment: find best matching regions of sequences

Dotplots
[Figure: dotplot; axis residues G A G C T T A G G G C C T T T G G G A A A]

Measuring Alignment Quality
Good alignments should have …
- “many” exact matches
- “few” mismatches
- “many” of the mismatches should be similar residues
- “few” gaps

Measuring Alignment Quality
Begin with... Longest Exact Match
    QTRPQNVLNPP
        |||
     STRQNVINPWAAQ
S = 3a
S = alignment score, a = match score

Measuring Alignment Quality
… allow some mismatches
    QTRPQNVLNPP
        ||| ||
     STRQNVINPWAAQ
S = 5a - 1b
S = alignment score, a = match score, b = mismatch penalty

Measuring Alignment Quality
… and finally, introduce some gaps
    QTRPQNVLNPP
     || ||| ||
    STR-QNVINPWAAQ
S = 7a - 1b - 1c
S = alignment score, a = match score, b = mismatch penalty, c = gap penalty
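The score S = 7a - 1b - 1c can be checked by counting columns in the aligned region. The following is a minimal sketch, not part of the original slides; the function name `score_alignment` and the numeric values chosen for a, b and c are illustrative assumptions.

```python
def score_alignment(top, bottom, a=1, b=1, c=1):
    """Count matches, mismatches and gap positions in two equal-length
    aligned strings ('-' marks a gap).

    Returns (matches, mismatches, gaps, S) with
    S = matches*a - mismatches*b - gaps*c.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps, matches * a - mismatches * b - gaps * c


# Aligned region of the gapped example (leading Q/S and trailing residues excluded).
# The values a=2, b=1, c=3 are hypothetical.
print(score_alignment("TRPQNVLNP", "TR-QNVINP", a=2, b=1, c=3))
```

Only the aligned region is passed in, because the example is a local alignment: residues outside the matched region do not contribute to S.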

Scoring Issues
- Relative costs of matches, mismatches, and gaps should depend on their probabilities (rare events receive higher penalties).
- In practice, the appropriate costs are rarely known.
- A variety of scoring matrices are available.

Scoring Matrices
- A scoring matrix specifies a score, $s_{ij}$, for aligning aa i with aa j.
- Choice of matrix depends (ideally) on the divergence level of desired/expected hits.
- Examples: PAM, BLOSUM
- Both can be modified for different divergence levels (e.g., BLOSUM40, BLOSUM62).
- Advice: try several matrices when possible.

Dayhoff Family of Matrices
- Based on an empirically derived, probabilistic model of protein evolution.
- Using closely related sequences, count the frequencies of different types of amino acid substitutions, and use these frequencies to construct the scoring matrix.
- Extrapolate to higher levels of divergence.

Dayhoff Family of Matrices
- The Dayhoff model measures sequence evolution in units of “PAMs”.
- 2 sequences separated by k PAMs are expected to have experienced k substitutions per 100 amino acid sites since they diverged from a common ancestral sequence.
- The mutability of an aa is its relative rate of change (amino acids with high mutabilities are more likely to change).
- The mutability of alanine was defined to be 100.

Dayhoff Family of Matrices
Problems with the original Dayhoff scheme:
- It does not consider the genetic code. Not all amino acid substitutions can occur by a single nucleotide substitution event.
- Parameters were estimated from a small sample of closely related proteins.
- Evolution at the “average site” of the “average protein” is being modeled.

BLOSUM Scoring Matrices
- Construct a database of “blocks” (ungapped, aligned conserved regions of proteins).
- Cluster sequences within a block that are more similar than a chosen threshold (e.g., 62% for BLOSUM62).
- Represent each cluster of sequences using their “average” sequence.

BLOSUM Scoring Matrices
- After the averaging procedure, each modified block consists of average sequence(s) and sequences that did not cluster with the others.
- Examine all pairs of sequences in the modified blocks.
- Tabulate $p_{ij}$, the probability of observing aa i in one sequence and aa j in a second sequence.

BLOSUM Scoring Matrices
- Similarly, tabulate $\pi_i$, the probability that a particular position in a block will be aa i.
- The ratio of the probability of observing the residue pair ij in two related sequences to the same probability for unrelated sequences is
$$\frac{p_{ij}}{e_{ij}}, \quad \text{where} \quad e_{ij} = \begin{cases} \pi_i^2 & i = j \\ 2\pi_i\pi_j & i \neq j \end{cases}$$

BLOSUM Scoring Matrices
- The logarithm of this ratio serves as a score for pairing aa i with aa j.
- Why 62%? Henikoff and Henikoff found that use of this threshold in conjunction with BLAST was more effective than other values for correctly identifying which sequences belonged in which groups.
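A minimal sketch of the log-odds calculation, not taken from the slides. It assumes the expected frequency for unrelated sequences is the product of the marginal residue frequencies (squared for an identical pair, doubled for an unordered mixed pair), uses log base 2 without the rounding or scaling applied in published matrices, and runs on made-up pair counts for a two-letter alphabet. The name `log_odds_scores` is hypothetical.

```python
import math
from collections import Counter


def log_odds_scores(pair_counts):
    """pair_counts maps a residue pair (i, j) to the number of times it was
    seen in aligned columns. Returns {(i, j): log2(p_ij / e_ij)} with each
    unordered pair stored once in sorted order."""
    # Merge (i, j) and (j, i) so every unordered pair is counted once.
    counts = Counter()
    for (i, j), n in pair_counts.items():
        counts[tuple(sorted((i, j)))] += n
    total = sum(counts.values())
    p = {pair: n / total for pair, n in counts.items()}

    # Marginal frequency of each residue: a mixed pair contributes half to each.
    marginal = Counter()
    for (i, j), pij in p.items():
        if i == j:
            marginal[i] += pij
        else:
            marginal[i] += pij / 2
            marginal[j] += pij / 2

    scores = {}
    for (i, j), pij in p.items():
        # Expected frequency of the pair if the two residues were independent.
        expected = marginal[i] ** 2 if i == j else 2 * marginal[i] * marginal[j]
        scores[(i, j)] = math.log2(pij / expected)
    return scores


# Hypothetical counts, for illustration only.
toy_counts = {("A", "A"): 40, ("A", "S"): 20, ("S", "S"): 40}
for pair, s in sorted(log_odds_scores(toy_counts).items()):
    print(pair, round(s, 3))
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative scores, which is the behaviour expected of a substitution matrix.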

Scoring Indels
    GCCTATTG        GCCTATTG
     |   |||          |  |||
    AC--TTTG        A-C-TTTG
Using a scoring scheme of match (a), mismatch (-b), gap (-c), each of these alignments would score: 4a - 2b - 2c

A more biologically realistic scoring scheme is to have a relatively large penalty for opening, or initiating, a gap, and a smaller penalty for each position the gap is extended.
    GCCTATTG        GCCTATTG
      |  |||         |   |||
    A-C-TTTG        AC--TTTG
Using a scoring scheme of match (a), mismatch (-b), gap open (-i), gap extend (-e), the alignment scores become: 4a - 2b - 2i and 4a - 2b - i - e
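The two affine-gap scores can be reproduced with a short column-by-column scorer. This is a sketch under the convention implied by the scores above: the first position of a gap costs i and each further position of the same gap costs e. The function name `affine_score` and the numeric penalties are illustrative assumptions.

```python
def affine_score(top, bottom, a, b, i, e):
    """Score two equal-length aligned strings ('-' marks a gap) with
    match +a, mismatch -b, gap open -i (first position of each gap) and
    gap extend -e (every further position of the same gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    prev_gap = None  # which sequence carried a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            side = "top" if x == "-" else "bottom"
            # Continuing a gap in the same sequence is an extension.
            score -= e if side == prev_gap else i
            prev_gap = side
        else:
            score += a if x == y else -b
            prev_gap = None
    return score


# Hypothetical values: a=5, b=4, i=10, e=1.
print(affine_score("GCCTATTG", "A-C-TTTG", 5, 4, 10, 1))  # 4a - 2b - 2i
print(affine_score("GCCTATTG", "AC--TTTG", 5, 4, 10, 1))  # 4a - 2b - i - e
```

With any e smaller than i, the alignment with one two-position gap scores higher than the alignment with two separate gaps.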

BLAST (www.ncbi.nlm.nih.gov/BLAST)
Basic Local Alignment Search Tool
- Main idea: Good alignments are likely to have two or more short (3+ residues) high-scoring words.
- Observation: Similarities of interest are usually longer than a single word, so look for multiple hits on the same diagonal, separated by a short distance.

BLAST: basic strategy
1. Find pairs of high-scoring words (>T), separated by no more than A positions, on the same diagonal (i.e., no gaps).
2. Ungapped extension (fast)
3. If extended HSP scores >Sg, then
4. Perform gapped extension (slow)
5. Report if “E-val” is lower than c
T, A, c, and Sg are adjustable constants.
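Steps 1 and 2 can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: it seeds on exact word matches instead of words scoring above T under a substitution matrix, scores the extension with a flat match/mismatch scheme, and omits the gapped extension and E-value steps. All function names and default values are hypothetical.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Return {diagonal: sorted query positions} for exact w-letter word
    matches, where diagonal = subject position - query position."""
    index = defaultdict(list)
    for q in range(len(query) - w + 1):
        index[query[q:q + w]].append(q)
    diagonals = defaultdict(list)
    for s in range(len(subject) - w + 1):
        for q in index.get(subject[s:s + w], ()):
            diagonals[s - q].append(q)
    return {d: sorted(qs) for d, qs in diagonals.items()}


def two_hit_seeds(query, subject, w=3, A=10):
    """List (diagonal, q1, q2) for neighbouring, non-overlapping word hits
    on the same diagonal whose start positions are at most A apart."""
    seeds = []
    for d, qs in word_hits(query, subject, w).items():
        for first, second in zip(qs, qs[1:]):
            if w <= second - first <= A:
                seeds.append((d, first, second))
    return seeds


def ungapped_extend(query, subject, diagonal, q_from, q_to, match=1, mismatch=-1):
    """Score query[q_from:q_to] against the subject along one diagonal, then
    scan right and left, keeping the end points that give the best score.
    Returns (score, q_from, q_to)."""
    def pair(q):
        return match if query[q] == subject[q + diagonal] else mismatch

    best = sum(pair(q) for q in range(q_from, q_to))
    run, q = best, q_to
    while q < len(query) and q + diagonal < len(subject):
        run += pair(q)
        q += 1
        if run > best:
            best, q_to = run, q
    run, q = best, q_from - 1
    while q >= 0 and q + diagonal >= 0:
        run += pair(q)
        if run > best:
            best, q_from = run, q
        q -= 1
    return best, q_from, q_to


w = 2  # short words, because the example sequences are short
query, subject = "QTRPQNVLNPP", "STRQNVINPWAAQ"
for d, first, second in two_hit_seeds(query, subject, w=w, A=10):
    print(ungapped_extend(query, subject, d, first, second + w))  # (4, 4, 10)
```

On these two sequences the single seed extends to QNVLNP against QNVINP, that is, 5 matches and 1 mismatch on one diagonal.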

Why do BLOSUM matrices seem to outperform Dayhoff matrices?
Good guess:
- Dayhoff matrices are based on changes seen among closely related sequences.
- Searches tend to target ancient homologies.
- Different sites in proteins evolve with different rates and patterns.
- The remaining sequence similarities after long periods of time are likely to be in functionally constrained regions.

Why do BLOSUM matrices seem to outperform Dayhoff matrices? Changes counted by Dayhoff methods may tend to be in unconstrained, quickly evolving regions

Statistical Significance: E-values
Let $P_i$ be the frequency of aa i. For unrelated sequences, an alignment of i with j has probability $P_i P_j$. Given $P_i P_j$ and $s_{ij}$, we can calculate normalized scores (“bit scores”) from the raw score, S:
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
K and $\lambda$ are constants determined by the scoring matrix and the residue frequencies; K acts as a scale for the search space size and $\lambda$ as a scale for the scoring system.

Statistical Significance: E-values
When 2 random sequences of length m and n are aligned, the expected number of HSPs with normalized scores of at least $S'$ is approximately
$$E = mn\,2^{-S'}$$

Statistical Significance: E-values

Statistical Significance: E-values The preceding theory is strictly correct for ungapped alignments only. Empirical observations suggest that it may apply approximately to gapped alignments as well.
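A small numeric sketch of the bit score and E-value, assuming the standard Karlin-Altschul forms S' = (λS - ln K) / ln 2 and E = mn·2^(-S'). The values of λ, K, S, m and n below are made up for illustration; real values depend on the scoring matrix and residue frequencies.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expected_hits(bits, m, n):
    """Expected number of HSPs scoring at least `bits` when sequences of
    length m and n are compared: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Hypothetical parameters, for illustration only.
lam, K = 0.3, 0.1
S = 50
S_prime = bit_score(S, lam, K)
print(S_prime, expected_hits(S_prime, m=250, n=1_000_000))
```

Substituting S' back into E gives E = K·m·n·e^(-λS), so the E-value falls exponentially as the raw score rises and grows linearly with the size of the search space.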

Ramble about deriving distribution of BLAST scores if time allows.

How many alignments?
Consider aligning 2 sequences of length n and m.
[Formula not recoverable from the transcript]
[Perspective: approximately $10^{80}$ particles in the universe]
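One way to get a feel for the size of the search space is to count alignments directly. The sketch below assumes an alignment is any sequence of columns of three kinds (residue over residue, residue over gap, gap over residue), so alignments that differ only in the order of adjacent gaps are counted separately. This counting convention is an assumption and may differ from the formula on the original slide; the name `count_alignments` is hypothetical.

```python
def count_alignments(n, m):
    """Number of ways to align sequences of length n and m when every column
    is a residue pair, a gap in the second sequence, or a gap in the first."""
    # table[i][j] = number of alignments of the first i and first j residues.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j] + table[i][j - 1]
    return table[n][m]


for size in (1, 2, 3, 10, 100):
    total = count_alignments(size, size)
    print(size, len(str(total)), "digits")
```

The count grows exponentially with sequence length, which is why enumeration is hopeless and dynamic programming is used instead.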

Dynamic Programming Alignment of 2 Sequences
Consider the global alignment of sequences a and b.
Define d(x,y) = score for aligning an x with a y.
- d(A,A) = 5 (match score)
- d(A,T) = -2 (mismatch penalty)
- d(A,–) = -3 (gap penalty)

We can think of alignment as a series of decisions. At each step, we decide among three possible alternatives:
- Add the next residues from both sequences (match or mismatch)
- Add the next residue from a (gap in b)
- Add the next residue from b (gap in a)

The “best” alignment is the series of decisions that maximizes the total score:
    A–CC
    ATCC

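Let $D_{ij}$ be the score of aligning the first i residues of a with the first j residues of b. Then,
$$D_{ij} = \max \begin{cases} D_{i-1,j-1} + d(a_i, b_j) \\ D_{i-1,j} + d(a_i, \text{–}) \\ D_{i,j-1} + d(\text{–}, b_j) \end{cases}$$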

Match: +5   Mismatch: -2   Gap: -3
Dynamic programming matrix for A = AGCTTA (columns) and B = GCTGA (rows):

              A    G    C    T    T    A
         0   -3   -6   -9  -12  -15  -18
    G   -3   -2    2   -1   -4   -7  -10
    C   -6   -5   -1    7    4    1   -2
    T   -9   -8   -4    4   12    9    6
    G  -12  -11   -3    1    9   10    7
    A  -15   -7   -6   -2    6    7   15

The optimal global alignment is the following:
    A: AGCTTA
        ||| |
    B: -GCTGA
It has a score of 15: 4 matches, 1 mismatch, and 1 gap (20 - 2 - 3 = 15).
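The worked example can be reproduced with a compact global alignment routine. This is a minimal sketch of Needleman-Wunsch style dynamic programming with a linear gap penalty, using the scores from the slides (match +5, mismatch -2, gap -3). The function name and the tie-breaking order in the traceback are choices made here, not taken from the slides.

```python
def needleman_wunsch(a, b, match=5, mismatch=-2, gap=-3):
    """Global alignment of a and b with a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def d(x, y):
        return match if x == y else mismatch

    # D[i][j] = best score for the first i residues of a and first j of b.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + d(a[i - 1], b[j - 1]),  # pair residues
                          D[i - 1][j] + gap,                        # gap in b
                          D[i][j - 1] + gap)                        # gap in a

    # Trace back from the bottom-right corner, preferring the diagonal move.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = needleman_wunsch("AGCTTA", "GCTGA")
print(score)   # 15
print(row_a)   # AGCTTA
print(row_b)   # -GCTGA
```

When several alignments share the best score, the traceback returns only one of them; for this example the optimal alignment is unique.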

Align a = ACTCG with b = ACCTG. Use match score +4, mismatch penalty -1, and gap penalty -2.
[Empty dynamic programming matrix to fill in: b = ACCTG across the top, a = ACTCG down the side, 0 in the top-left cell]