Center for Biological Sequence Analysis

Slides:



Advertisements
Similar presentations
Multiple Alignment Anders Gorm Pedersen Molecular Evolution Group
Advertisements

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Measuring the degree of similarity: PAM and blosum Matrix
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Structural bioinformatics
Sequence Similarity Searching Class 4 March 2010.
Heuristic alignment algorithms and cost matrices
Sequences are related Darwin: all organisms are related through descent with modification Related molecules have similar functions in different organisms.
Sequence alignment SEQ1: VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKK VADALTNAVAHVDDPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHA SLDKFLASVSTVLTSKYR.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Anders Gorm Pedersen Center for Biological Sequence Analysis
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Sequence similarity.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Pairwise Alignment. Sequences are related.. Phylogenetic tree of globin-type proteins found in humans.
Sequences are related Darwin: all organisms are related through descent with modification Related molecules have similar functions in different organisms.
Introduction to Bioinformatics Algorithms Sequence Alignment.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence Alignments Revisited
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Basics of Sequence Alignment and Weight Matrices and DOT Plot
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.
Protein Sequence Alignment and Database Searching.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Pairwise Alignment and Database Searching Henrik Nielsen Protein Post-Translational Modification & Molecular Evolution Groups Center for Biological Sequence.
Pairwise Alignment and Database Searching Slides retirados de... Henrik Nielsen Protein Post-Translational Modification & Molecular Evolution Groups Center.
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Content of the previous class Introduction The evolutionary basis of sequence alignment The Modular Nature of proteins.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence alignment SEQ1: VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKK VADALTNAVAHVDDPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHA SLDKFLASVSTVLTSKYR.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Arun Goja MITCON BIOPHARMA
The Blosum scoring matrices Morten Nielsen BioSys, DTU.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.
Construction of Substitution matrices
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment and Database Searching
Multiple Alignment Anders Gorm Pedersen / Henrik Nielsen
Introduction to sequence alignment Mike Hallett (David Walsh)
Pairwise Alignment and Database Searching
Sequence comparison: Local alignment
Biology 162 Computational Genetics Todd Vision Fall Aug 2004
Sequence Alignment 11/24/2018.
Pairwise Sequence Alignment (cont.)
Pairwise Sequence Alignment
Pairwise Alignment Global & local alignment
Sequence Alignment Algorithms Morten Nielsen BioSys, DTU
Alignment IV BLOSUM Matrices
Basic Local Alignment Search Tool (BLAST)
Presentation transcript:

Center for Biological Sequence Analysis Pairwise Alignment Anders Gorm Pedersen Henrik Nielsen Center for Biological Sequence Analysis

Sequences are related Phylogenetic tree based on ribosomal RNA: • Darwin: all organisms are related through descent with modification • => Sequences are related through descent with modification • => Similar molecules have similar functions in different organisms Phylogenetic tree based on ribosomal RNA: three domains of life

Sequences are related, II Phylogenetic tree of globin-type proteins found in humans

Why compare sequences? • Determination of evolutionary relationships • Prediction of protein Protein 1: binds oxygen function and structure (database searches). Sequence similarity Protein 2: binds oxygen ?

Dotplots: visual sequence comparison 1. Place two sequences along axes of plot 2. Place dot at grid points where two sequences have identical residues 3. Diagonals correspond to conserved regions

Pairwise alignments 43.2% identity; Global alignment score: 374 10 20 30 40 50 alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA : :.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :. beta VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP 10 20 30 40 50 60 70 80 90 100 110 alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :. beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 120 130 140 alpha PAEFTPAVHASLDKFLASVSTVLTSKYR :::: :.:. .: .:.:...:. ::. beta GKEFTPPVQAAYQKVVAGVANALAHKYH

Global versus local alignments Global alignment: align full length of both sequences. Global alignment Local alignment: find best partial alignment of two sequences Alignment length Seq 1 Local alignment Seq 2

Pairwise alignment 100.000% identity in 3 aa overlap SPA ::: Percent identity is not a good measure of alignment quality

Pairwise alignments: alignment score 43.2% identity; Global alignment score: 374 10 20 30 40 50 alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA : :.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :. beta VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP 10 20 30 40 50 60 70 80 90 100 110 alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :. beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 120 130 140 alpha PAEFTPAVHASLDKFLASVSTVLTSKYR :::: :.:. .: .:.:...:. ::. beta GKEFTPPVQAAYQKVVAGVANALAHKYH

Alignment scores: match vs. mismatch Simple scoring scheme (too simple in fact…): Matching amino acids: 5 Mismatch: 0 Scoring example: K A W S A D V : : : : : K D W S A E V 5+0+5+5+5+0+5 = 25

Pairwise alignments: conservative substitutions 43.2% identity; Global alignment score: 374 10 20 30 40 50 alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA : :.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :. beta VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP 10 20 30 40 50 60 70 80 90 100 110 alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :. beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 120 130 140 alpha PAEFTPAVHASLDKFLASVSTVLTSKYR :::: :.:. .: .:.:...:. ::. beta GKEFTPPVQAAYQKVVAGVANALAHKYH

Amino acid properties Serine (S) and Threonine (T) have Aspartic acid (D) and Glutamic similar physicochemical properties acid (E) have similar properties => Substitution of S/T or E/D occurs relatively often during evolution => Substitution of S/T or E/D should result in scores that are only moderately lower than identities

Pairwise alignments: insertions/deletions 43.2% identity; Global alignment score: 374 10 20 30 40 50 alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA : :.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :. beta VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP 10 20 30 40 50 60 70 80 90 100 110 alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :. beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 120 130 140 alpha PAEFTPAVHASLDKFLASVSTVLTSKYR :::: :.:. .: .:.:...:. ::. beta GKEFTPPVQAAYQKVVAGVANALAHKYH

Alignment scores: insertions/deletions K L A A S V I L S D A L K L A A - - - - S D A L -10 + 3 x (-1)=-13 Affine gap penalties: Multiple insertions/deletions may be one evolutionary event => Separate penalties for gap opening and gap elongation

Handout Compute 4 alignment scores: two different alignments using two different alignment matrices (and the same gap penalty system) Score 1: Alignment 1 + BLOSUM-50 matrix + gaps Score 2: Alignment 1 + ID-6,3 matrix + gaps Score 3: Alignment 2 + BLOSUM-50 matrix + gaps Score 4: Alignment 2 + ID-6,3 matrix + gaps

Handout: summary of results Alignment 1 Alignment 2 BLOSUM-50 ID-6,3

Protein substitution matrices: different types • Identity matrix (match vs. mismatch) • Genetic code matrix (how similar are the codons?) • Chemical properties matrix (use knowledge of physicochemical properties to design matrix) • Empirical matrices (based on observed pair-frequencies in hand-made alignments)  PAM series  BLOSUM series  Gonnet

Estimation of the PAM1 matrix 60 70 80 90 100 110 alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .::.::::: :.....::.:.. .....::.:: ::.::: ::.::.. :. .:: :. beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF Start from given alignments of closely related proteins Count the aligned amino acid pairs (e.g., A aligned with A makes up 1.5% of all pairs. A aligned with C makes up 0.01% of all pairs, etc.)‏ Expected pair frequencies are computed from single amino acid frequencies. (e.g, fA,C=fA x fC=7% x 3% = 0.21%). For each amino acid pair the substitution scores are essentially computed as: Pair-freq(observed)‏ Pair-freq(expected)‏ 0.01% 0.21% log SA,C = log = -1.3 To obtain the PAM1 (1 Percent Accepted Mutations) matrix, normalize pair frequencies to 1% difference before applying the logarithm To obtain higher number PAM matrices, extrapolate the PAM1 matrix via matrix multiplication

Percent Accepted Mutations (PAM) PAM (Percent Accepted Mutations) can be used as a measure of evolutionary distance. Note: 100PAM does NOT mean that sequences are 100% different! In the “Twilight Zone”, it becomes difficult to see whether sequences are related

Estimation of the BLOSUM 50 matrix Use the BLOCKS database (ungapped alignments of especially conserved regions of multiple alignments) For each alignment in the BLOCKS database the sequences are grouped into clusters with at least 50% identical residues (for BLOSUM 50)‏ All pairs of sequences are compared between clusters, and the observed pair frequencies are noted Substitution scores are calculated as for the PAM matrix ID FIBRONECTIN_2; BLOCK COG9_CANFA GNSAGEPCVFPFIFLGKQYSTCTREGRGDGHLWCATT COG9_RABIT GNADGAPCHFPFTFEGRSYTACTTDGRSDGMAWCSTT FA12_HUMAN LTVTGEPCHFPFQYHRQLYHKCTHKGRPGPQPWCATT HGFA_HUMAN LTEDGRPCRFPFRYGGRMLHACTSEGSAHRKWCATTH MANR_HUMAN GNANGATCAFPFKFENKWYADCTSAGRSDGWLWCGTT MPRI_MOUSE ETDDGEPCVFPFIYKGKSYDECVLEGRAKLWCSKTAN PB1_PIG AITSDDKCVFPFIYKGNLYFDCTLHDSTYYWCSVTTY SFP1_BOVIN ELPEDEECVFPFVYRNRKHFDCTVHGSLFPWCSLDAD SFP3_BOVIN AETKDNKCVFPFIYGNKKYFDCTLHGSLFLWCSLDAD SFP4_BOVIN AVFEGPACAFPFTYKGKKYYMCTRKNSVLLWCSLDTE SP1_HORSE AATDYAKCAFPFVYRGQTYDRCTTDGSLFRISWCSVT COG2_CHICK GNSEGAPCVFPFIFLGNKYDSCTSAGRNDGKLWCAST COG2_HUMAN GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATT COG2_MOUSE GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT COG2_RABIT GNSEGAPCVFPFTFLGNKYESCTSAGRSDGKMWCATS COG2_RAT GNSEGAPCVFPFTFLGNKYESCTSAGRNDGKVWCATT COG9_BOVIN GNADGKPCVFPFTFQGRTYSACTSDGRSDGYRWCATT COG9_HUMAN GNADGKPCQFPFIFQGQSYSACTTDGRSDGYRWCATT COG9_MOUSE GNGEGKPCVFPFIFEGRSYSACTTKGRSDGYRWCATT COG9_RAT GNGDGKPCVFPFIFEGHSYSACTTKGRSDGYRWCATT FINC_BOVIN GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT FINC_HUMAN GNSNGALCHFPFLYNNHNYTDCTSEGRRDNMKWCGTT FINC_RAT GNSNGALCHFPFLYSNRNYSDCTSEGRRDNMKWCGTT MPRI_BOVIN ETEDGEPCVFPFVFNGKSYEECVVESRARLWCATTAN MPRI_HUMAN ETDDGVPCVFPFIFNGKSYEECIIESRAKLWCSTTAD PA2R_BOVIN GNAHGTPCMFPFQYNQQWHHECTREGREDNLLWCATT PA2R_RABIT GNAHGTPCMFPFQYNHQWHHECTREGRQDDSLWCATT

Substitution matrices and sequence similarity Substitution matrices come as series of matrices calculated for different degrees of sequence similarity (different evolutionary distances). ”Hard” matrices ”Soft” matrices Designed for very similar sequences Designed for less similar sequences High numbers in the BLOSUM series (e.g., BLOSUM90)‏ Low numbers in the BLOSUM series (e.g., BLOSUM30)‏ Low numbers in the PAM series (e.g. PAM30) High numbers in the PAM series (e.g. PAM250) Severe mismatch penalties Less severe mismatch penalties Yield short alignments with high %identity Yield longer alignments with lower %identity

Pairwise alignment Optimal alignment: alignment having the highest possible score given a substitution matrix and a set of gap penalties So: best alignment can be found by exhaustively searching all possible alignments, scoring each of them and choosing the one with the highest score?

How many possible alignments are there? The problem: How many possible alignments are there? Consider two sequences of two letters each: AB and XY. How many ways are there to align them? Insert no gaps: AB XY Insert one gap in each sequence: A-B AB- A-B -AB AB- -AB XY- X-Y -XY X-Y -XY XY- Insert two gaps in each sequence: AB-- --AB A-B- -A-B A--B -AB- --XY XY-- -X-Y X-Y- -XY- X--Y In total: 13 ways!

How many possible alignments are there? The problem: How many possible alignments are there? Consider two sequences of length n1 and n2. How many ways are there to align them? n1 \ n2 1 2 3 4 5 7 9 11 13 25 41 61 63 129 231 321 681 1683

How many possible alignments are there? The problem: How many possible alignments are there? The number of possible pairwise alignments increases explosively with the length of the sequences: Two protein sequences of length 100 amino acids can be aligned in approximately 10 60 different ways Time needed to test all possibilities is same order of magnitude as the entire lifetime of the universe.

Pairwise alignment: the solution “Dynamic programming” (the Needleman-Wunsch algorithm)

Alignment depicted as path in matrix T C G C A T C A TCGCA TC-CA T C G C A T C A TCGCA T-CCA

Alignment depicted as path in matrix T C G C A T C A Meaning of point in matrix: all residues up to this point have been aligned (but there are many different possible paths). x Position labeled “x”: TC aligned with TC --TC -TC TC TC-- T-C TC

Dynamic programming: example A C G T A 1 -1 -1 -1 C -1 1 -1 -1 G -1 -1 1 -1 T -1 -1 -1 1 Gaps: -2

Dynamic programming: example

Dynamic programming: example

Dynamic programming: example -3 -6 -1

Dynamic programming: example -1

Dynamic programming: example

Dynamic programming: example

Dynamic programming: example T C G C A : : : : T C - C A 1+1-2+1+1 = 2

Global versus local alignments Global alignment: align full length of both sequences. (The “Needleman-Wunsch” algorithm). Global alignment Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm). Seq 1 Local alignment Seq 2

Local alignment overview • The recursive formula is changed by adding a fourth possibility: zero. This means local alignment scores are never negative. score(x,y-1) - gap-penalty score(x-1,y-1) + substitution-score(x,y) score(x,y) = max score(x-1,y) - gap-penalty • Trace-back is started at the highest value rather than in lower right corner • Trace-back is stopped as soon as a zero is encountered

Local alignment: example

Alignments: things to keep in mind “Optimal alignment” means “having the highest possible score, given substitution matrix and set of gap penalties”. This is NOT necessarily the biologically most meaningful alignment. Specifically, the underlying assumptions are often wrong: substitutions are not equally frequent at all positions, affine gap penalties do not model insertion/deletion well, etc. Pairwise alignment programs always produce an alignment - even when it does not make sense to align sequences.