. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.

Slides:

Advertisements

Similar presentations

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.

Advertisements

Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.

Sources Page & Holmes Vladimir Likic presentation: 20show.pdf

Measuring the degree of similarity: PAM and blosum Matrix

DNA sequences alignment measurement

Lecture 8 Alignment of pairs of sequence Local and global alignment

C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.

Sequence Similarity Searching Class 4 March 2010.

Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.

Heuristic alignment algorithms and cost matrices

Introduction to Bioinformatics Algorithms Sequence Alignment.

Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.

Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.

Introduction to bioinformatics

Sequence Analysis Tools

Sequence similarity.

Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.

Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.

Similar Sequence Similar Function Charles Yan Spring 2006.

Sequence Alignment III CIS 667 February 10, 2004.

Introduction to Bioinformatics Algorithms Sequence Alignment.

1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

Sequence Alignments Revisited

Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.

Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.

1 Lesson 3 Aligning sequences and searching databases.

Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Basics of Sequence Alignment and Weight Matrices and DOT Plot

TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,

Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.

Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)

Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.

Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.

An Introduction to Bioinformatics

Pairwise Sequence Alignment

CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)

Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)

BLAST Workshop Maya Schushan June 2009.

Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.

Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.

Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.

Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.

Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.

Comp. Genomics Recitation 3 The statistics of database searching.

Construction of Substitution Matrices

Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.

Sequence Alignment Csc 487/687 Computing for bioinformatics.

Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.

Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

COT 6930 HPC and Bioinformatics Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.

Sequence Alignment.

Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.

Construction of Substitution matrices

DNA, RNA and protein are an alien language

Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University

Step 3: Tools Database Searching

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.

9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.

Pairwise Sequence Alignment and Database Searching

Sequence similarity, BLAST alignments & multiple sequence alignments

Presentation transcript:

. Sequence Alignment and Database Searching

2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary history.  Evolutionary history can tell us a lot about properties of a given gene  Homology can be inferred from similarity between the genes u Searching for Proteins with same or similar functions

3 Orthologous and Paralogous u Orthologous sequences differ because they are found in different species (a speciation event) u Paralogous sequences differ due to a gene duplication event u Sequences may be both orthologous and paralogous

4 Protein Evolution “For many protein sequences, evolutionary history can be traced back 1-2 billion years” -William Pearson u When we align sequences, we assume that they share a common ancestor  They are then homologous u Protein fold is much more conserved than protein sequence u DNA sequences tend to be less informative than protein sequences

5 Sequence Alignment Input: two sequences S 1, S 2 over the same alphabet Output: two sequences S’ 1, S’ 2 of equal length ( S’ 1, S’ 2 are S 1, S 2 with possibly additional gaps) Example:  S 1 = GCGCATGGATTGAGCGA  S 2 = TGCGCCATTGATGACC u A possible alignment: S’ 1 = -GCGC-ATGGATTGAGCGA S’ 2 = TGCGCCATTGAT-GACC-- Goal: How similar are two sequences S 1 and S 2 Global Alignment:

6 Input: two sequences S 1, S 2 over the same alphabet Output: two sequences S’ 1, S’ 2 of equal length ( S’ 1, S’ 2 are substrings of S 1, S 2 with possibly additional gaps) Example:  S 1 = GCGCATGGATTGAGCGA  S 2 = TGCGCCATTGATGACC u A possible alignment: S’ 1 = ATTGA-G S’ 2 = ATTGATG Goal: Find the pair of substrings in two input sequences which have the highest similarity Local Alignment: Sequence Alignment

7 -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Three elements:  Perfect matches  Mismatches  Insertions & deletions (indel) u Score each position independently u Score of an alignment is sum of position scores

8 Global vs. Local Alignments u Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached. u Local alignment algorithms finds the region (or regions) of highest similarity between two sequences and build the alignment outward from there.

9

10 Global Alignment (Needleman -Wunsch) u The the Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle) u Global algorithms are often not effective for highly diverged sequences - do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.  Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared. u Global methods are useful when you want to force two sequences to align over their entire length

11 Local Alignment (Smith-Waterman) u Local alignment  Identify the most similar sub-region shared between two sequences  Smith-Waterman

12 Properties of Sequence Alignment and Search u DNA  Should use evolution sensitive measure of similarity  Should allow for alignment on exons => searching for local alignment as opposed to global alignment u Proteins  Should allow for mutations => evolution sensitive measure of similarity  Many proteins do not display global patterns of similarity, but instead appear to be built from functional modules => searching for local alignment as opposed to global alignment

13 Parameters of Sequence Alignment Scoring Systems: Each symbol pairing is assigned a numerical value, based on a symbol comparison table. Gap Penalties: Opening: The cost to introduce a gap Extension: The cost to elongate a gap

14 Pairwise Alignment u The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.  There are lots of possible alignments. u Two sequences can always be aligned. u Sequence alignments have to be scored. u Often there is more than one solution with the same score.

15 Methods of Alignment u By hand - slide sequences on two lines of a word processor u Dot plot  with windows u Rigorous mathematical approach  Dynamic programming (slow, optimal) u Heuristic methods (fast, approximate)  BLAST and FASTA  Word matching and hash tables0

16 Align by Hand GATCGCCTA_TTACGTCCTGGAC <-- --> AGGCATACGTA_GCCCTTTCGC You still need some kind of scoring system to find the best alignment

17 Percent Sequence Identity u The extent to which two nucleotide or amino acid sequences are invariant A C C T G A G – A G A C G T G – G C A G 70% identical mismatch indel

18 Considerations The window/stringency method is more sensitive than the wordsize method (ambiguities are permitted). The smaller the window, the larger the weight of statistical (unspecific) matches. With large windows the sensitivity for short sequences is reduced. Insertions/deletions are not treated explicitly.

19 Alignment methods u Rigorous algorithms = Dynamic Programming  Needleman-Wunsch (global)  Smith-Waterman (local) u Heuristic algorithms (faster but approximate)  BLAST  FASTA

20 Dynamic Programming  Dynamic Programming is a very general programming technique.  It is applicable when a large search space can be structured into a succession of stages, such that:  the initial stage contains trivial solutions to sub- problems  each partial solution in a later stage can be calculated by recurring a fixed number of partial solutions in an earlier stage  the final stage contains the overall solution

21 If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j) Three possibilities: x i and y j are aligned, F(i,j) = F(i-1,j-1) + s(x i,y j ) x i is aligned to a gap, F(i,j) = F(i-1,j) - d y j is aligned to a gap, F(i,j) = F(i,j-1) - d The best score up to (i,j) will be the largest of the three options Creation of an alignment path matrix

22 Creation of an alignment path matrix Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences Construct matrix F indexed by i and j (one index for each sequence) F(i,j) is the score of the best alignment between the initial segment x 1...i of x up to x i and the initial segment y 1...j of y up to y j Build F(i,j) recursively beginning with F(0,0) = 0

23

24 H E A G A W G H E E P A W H E A E Backtracking A-A EEEE HHHH G-G- WWWW AAAA G-G- APAP E-E- H-H Optimal global alignment: EEEE

25 DNA Scoring Systems -very simple actaccagttcatttgatacttctcaaa taccattaccgtgttaactgaaaggacttaaagact Sequence 1 Sequence 2 AGCTA1000G0100C0010T0001AGCTA1000G0100C0010T0001 Match: 1 Mismatch: 0 Score = 5

26 Protein Scoring Systems PTHPLASKTQILPEDLASEDLTI PTHPLAGERAIGLARLAEEDFGM Sequence 1 Sequence 2 Scoring matrix T:G= -2 T:T = 5 Score= 48 CSTPAGND.. C 9 S-1 4 T P A G N D CSTPAGND.. C 9 S-1 4 T P A G N D

27 Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution. C P G G A V I L M F Y WH K R E Q D N S T C SH S+S positive charged polar aliphatic aromatic small tiny hydrophobic Protein Scoring Systems

28 Scoring matrices reflect: – # of mutations to convert one to another – chemical similarity – observed mutation frequencies – the probability of occurrence of each amino acid Widely used scoring matrices: PAM BLOSUM Protein Scoring Systems

29 PAM matrices u Family of matrices PAM 80, PAM 120, PAM 250 u The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based u Greater numbers denote greater distances

30. PAM (Percent Accepted Mutations) matrices The numbers of replacements were used to compute a so-called PAM-1 matrix. The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix. PAM250 = 250 mutations per 100 residues. Greater numbers mean bigger evolutionary distance

31 PAM (Percent Accepted Mutations) matrices Derived from global alignments of protein families. Family members share at least 85% identity ( Dayhoff et al., 1978 ). Construction of phylogenetic tree and ancestral sequences of each protein family Computation of number of replacements for each pair of amino acids

32 PAM - limitations u Based on only one original dataset u Examines proteins with few differences (85% identity) u Based mainly on small globular proteins so the matrix is biased

33 Derived from alignments of domains of distantly related proteins ( Henikoff & Henikoff,1992 ). Occurrences of each amino acid pair in each column of each block alignment is counted. The numbers derived from all blocks were used to compute the BLOSUM matrices. AACECAACEC A - C = 4 A - E = 2 C - E = 2 A - A = 1 C - C = 1 BLOSUM (Blocks Substitution Matrix) AACECAACEC

34 BLOSUM matrices u Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments) u BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity u BLOSUM62 represents closer sequences than BLOSUM45

35 BLOSUM (Blocks Substitution Matrix) Sequences within blocks are clustered according to their level of identity. Clusters are counted as a single sequence. Different BLOSUM matrices differ in the percentage of sequence identity used in clustering. The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix. Greater numbers mean smaller evolutionary distance.

36 The Blosum50 Scoring Matrix

37 PAM Vs. BLOSUM PAM100 = BLOSUM90 PAM100 = BLOSUM90 PAM120 = BLOSUM80 PAM120 = BLOSUM80 PAM160 = BLOSUM60 PAM160 = BLOSUM60 PAM200 = BLOSUM52 PAM200 = BLOSUM52 PAM250 = BLOSUM45 PAM250 = BLOSUM45 More distant sequences BLOSUM62 for general use BLOSUM62 for general use BLOSUM80 for close relations BLOSUM80 for close relations BLOSUM45 for distant relations BLOSUM45 for distant relations PAM120 for general use PAM120 for general use PAM60 for close relations PAM60 for close relations PAM250 for distant relations PAM250 for distant relations

38 PAM Matricies u PAM 40 - prepared by multiplying PAM 1 by itself a total of 40 times best for short alignments with high similarity u PAM prepared by multiplying PAM 1 by itself a total of 120 times best for general alignment u PAM prepared by multiplying PAM 1 by itself a total of 250 times best for detecting distant sequence similarity

39 BLOSUM Matricies u BLOSUM 90 - prepared from BLOCKS sequences with >90% sequence ID best for short alignments with high similarity u BLOSUM 62 - prepared from BLOCKS sequences with >62% sequence ID best for general alignment (default) u BLOSUM 30 - prepared from BLOCKS sequences with >30% sequence ID best for detecting weak local alignments

40 TIPS on choosing a scoring matrix Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches ( Henikoff & Henikoff, 1993 ). When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices. For database searching the commonly used matrix is BLOSUM62.

41 T A T G T G G A A T G A Scoring Insertions and Deletions A T G T - - A A T G C A A T G T A A T G C A T A T G T G G A A T G A The creation of a gap is penalized with a negative score value. insertion / deletion

42 1 GTGATAGACACAGACCGGTGGCATTGTGG 29 ||| | | ||| | || || | 1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29 Why Gap Penalties? Gaps allowed but not penalized Score: 88 Gaps not permitted Score: 0 1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29 ||| || | | | ||| || | | || || | 1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29 Match = 5 Mismatch = -4

43 The optimal alignment of two similar sequences is usually that which maximizes the number of matches and minimizes the number of gaps. There is a tradeoff between these two - adding gaps reduces mismatches Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences. Penalizing gaps forces alignments to have relatively few gaps. Why Gap Penalties?

44 Gap Penalties How to balance gaps with mismatches? Gaps must get a steep penalty, or else you’ll end up with nonsense alignments. In real sequences, muti-base (or amino acid) gaps are quit common genetic insertion/deletion events “Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.

45 BLOSUM vs PAM u BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix. BLOSUM 45 BLOSUM 62 BLOSUM 90 PAM 250 PAM 160 PAM 100 More Divergent Less Divergent

46 What do the Score and the e-value really mean? u The quality of the alignment is represented by the Score (S). u The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM) whereas gap scores are assigned empirically. u The significance of each alignment is computed as an E value (E). u Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

47 Notes on E-values u Low E-values suggest that sequences are homologous ๏ Can’t show non-homology u Statistical significance depends on both the size of the alignments and the size of the sequence database ‣ Important consideration for comparing results across different searches ‣ E-value increases as database gets bigger ‣ E-value decreases as alignments get longer

48 Homology: Some Guidelines u Similarity can be indicative of homology u Generally, if two sequences are significantly similar over entire length they are likely homologous u Low complexity regions can be highly similar without being homologous u Homologous sequences not always highly similar

49 Suggested BLAST Cutoffs u For nucleotide based searches, one should look for hits with E-values of or less and sequence identity of 70% or more u For protein based searches, one should look for hits with E-values of 10-3 or less and sequence identity of 25% or more Take Home Message: Always look at your alignments