Lecture 1 Sequence Alignment

Sequence alignment: why?
Early in the days of protein and gene sequence analysis, it was discovered that the sequences from related proteins or genes were similar, in the sense that one could align the sequences so that many corresponding residues match. This discovery was very important: strong similarity between two genes is a strong argument for their homology. Bioinformatics is based on it.
Terminology:
– Homology means that two (or more) sequences have a common ancestor. This is a statement about evolutionary history.
– Similarity simply means that two sequences are similar, by some criterion. It does not refer to any historical process, just to a comparison of the sequences by some method. It is a logically weaker statement.
However, in bioinformatics these two terms are often confused and used interchangeably. The reason is probably that significant similarity is such a strong argument for homology.

An example of a sequence alignment for two proteins (the protein kinase KRAF_HUMAN and the uncharacterized O22558 from Arabidopsis thaliana) using the BLAST program. Note: a protein is expressed as a sequence of amino acids, each represented by a single-letter code.

Many genes have a common ancestor
The basis for comparing proteins and genes by the similarity of their sequences is that the proteins or genes are related by evolution; they have a common ancestor. Random mutations in the sequences accumulate over time, so that proteins or genes whose common ancestor lies far back in time are not as similar as proteins or genes that diverged from each other more recently. Analysis of evolutionary relationships between protein or gene sequences depends critically on sequence alignments.

A dotplot displays sequence similarity
[Figure: dotplot of the hemoglobin A chain from human against erythrocruorin from Chironomus (insect). Symbols in the matrix indicate the degree of matching.]
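The idea behind a dotplot can be sketched in a few lines of Python. This is only a minimal illustration under simple assumptions: a cell is marked when the two residues are identical (the figure on the slide uses graded symbols instead), and the function names `dotplot` and `render` are made up for this example.

```python
def dotplot(seq_a, seq_b):
    """Return a matrix with 1 where seq_a[i] == seq_b[j], else 0."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Draw the dot matrix as text: '*' for a match, '.' otherwise."""
    lines = ["  " + " ".join(seq_b)]
    for a, row in zip(seq_a, dotplot(seq_a, seq_b)):
        lines.append(a + " " + " ".join("*" if v else "." for v in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Similar regions show up as runs of '*' along a diagonal.
    print(render("CKHVFCRVCI", "CKKCFCKCV"))
```

Runs of matches along a diagonal correspond to segments that align without gaps; a shift from one diagonal to another suggests an insertion or deletion.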

Some definitions
Global alignment. Assumes that the two proteins are basically similar over the entire length of one another.
Local alignment. Searches for segments of the two sequences that match well.
Gaps and insertions. Match may be improved by putting gaps or inserting extra residues into one of the sequences.
---DFAHKAMM-PTWWEGCIL
DXGHK-MMSPTW-ECAAL---

Scoring. Quantifies the goodness of an alignment. An exact match has the highest score, a substitution a lower score, and insertions and gaps may have negative scores.
Substitution matrix. A symmetrical 20 x 20 matrix (20 amino acids to each side). Each element gives a score that indicates the likelihood that the two residue types would mutate to each other in evolutionary time.
Gap penalty. Evolutionary events that make gap insertion necessary are relatively rare, so gaps have negative scores. Three types:
–Single gap-open penalty. This will tend to stop gaps from occurring, but once they have been introduced, they can grow unhindered.
–Gap penalty proportional to the gap length. Works against larger gaps.
–Gap penalty that combines a gap-open value with a gap-length value.
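The three gap-penalty types can be written as functions of the gap length k. The sketch below is illustrative only: the function names and the default penalty values are assumptions chosen for the example, not values taken from the lecture.

```python
def gap_open_only(k, open_penalty=10):
    """Single gap-open penalty: the same cost whatever the gap length."""
    return -open_penalty if k > 0 else 0


def gap_proportional(k, per_residue=2):
    """Penalty proportional to the gap length: longer gaps cost more."""
    return -per_residue * k


def gap_combined(k, open_penalty=10, per_residue=1):
    """Gap-open value plus a gap-length value (often called affine)."""
    return -(open_penalty + per_residue * k) if k > 0 else 0


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, gap_open_only(k), gap_proportional(k), gap_combined(k))
```

With the open-only form a gap of length 5 costs the same as a gap of length 1, which is why such gaps can "grow unhindered"; the combined form penalizes both opening a gap and extending it.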

Substitution “log odds” matrix BLOSUM 62
$s_{jk} = 2 \log_2 (q_{jk} / e_{jk})$
$q_{jk}$ – number of times the j-k pair of residues is seen together
$e_{jk}$ – number of times the j-k pair of residues is expected to be together
Henikoff and Henikoff (1992; PNAS 89: )
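As a small worked illustration of the log-odds formula, the sketch below computes a score from an observed and an expected pair frequency. The function name and the input numbers are made up for the example; rounding to the nearest integer is included because published BLOSUM tables list integer scores.

```python
import math


def log_odds_score(observed, expected):
    """s = 2 * log2(observed / expected), rounded to the nearest integer.

    observed: how often the residue pair is seen together (q_jk)
    expected: how often it would be expected together by chance (e_jk)
    """
    return round(2 * math.log2(observed / expected))


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    print(log_odds_score(0.04, 0.01))   # pair seen 4x more often than expected
    print(log_odds_score(0.01, 0.01))   # pair seen exactly as often as expected
    print(log_odds_score(0.005, 0.01))  # pair seen half as often as expected
```

A positive score means the pair occurs more often than chance would predict, zero means it occurs as often as chance, and a negative score means it is under-represented.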

$s_{ij} = \log_{10} M_{ij}$
$M_{ij}$ = probability of replacement j to i per occurrence of residue j.
(M.O. Dayhoff, ed., 1978, Atlas of Protein Sequence and Structure, Vol 5)

Let W(k) be the penalty for a gap of length k and sub(A,B) the substitution score for replacing residue A by B. The score for the alignment
C-KHVFCRVCI
CKKC-FC-KCV
is: 3 sub(C,C) + sub(K,K) + sub(H,C) + sub(F,F) + sub(V,K) + sub(I,V) + 3 W(1)
If we use PAM 250 and a gap penalty of 10, then the score is 3x12 + 5 - 3 + 9 - 2 + 4 + 3x(-10) = 19
The question is how to find the alignment with the highest score.
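The scoring of a given alignment can be checked with a short Python sketch. Several things here are assumptions made for the example: the table `PAM250_SUBSET` holds only the six residue pairs needed for this alignment (values as listed in the standard PAM 250 table), `W` is taken to be 10 per gap position so that W(1) = -10, and the function names are hypothetical.

```python
# Only the pairs needed for this example; not the full 20 x 20 matrix.
PAM250_SUBSET = {
    ("C", "C"): 12, ("K", "K"): 5, ("H", "C"): -3,
    ("F", "F"): 9, ("V", "K"): -2, ("I", "V"): 4,
}


def sub(a, b):
    """Symmetric lookup of the substitution score for residues a and b."""
    if (a, b) in PAM250_SUBSET:
        return PAM250_SUBSET[(a, b)]
    return PAM250_SUBSET[(b, a)]


def W(k):
    """Assumed gap penalty: 10 per gap position, so W(1) = -10."""
    return -10 * k


def score_alignment(top, bottom, sub, W):
    """Sum substitution scores over aligned columns plus W(k) per gap run."""
    total = 0
    gap_row, gap_len = None, 0
    for x, y in zip(top, bottom):
        row = "top" if x == "-" else "bottom" if y == "-" else None
        if row != gap_row and gap_len:
            total += W(gap_len)  # the previous gap run has ended
            gap_len = 0
        gap_row = row
        if row is None:
            total += sub(x, y)
        else:
            gap_len += 1
    if gap_len:
        total += W(gap_len)  # gap run reaching the end of the alignment
    return total


if __name__ == "__main__":
    print(score_alignment("C-KHVFCRVCI", "CKKC-FC-KCV", sub, W))
```

Counting gap runs rather than single gap characters keeps the function usable with length-dependent penalties W(k), not just a fixed cost per position.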

There is no single best alignment
Optimal alignment. The alignment that is the best, given the scoring convention. There is no such thing as the single best alignment in an absolute sense. A good alignment is one produced by a scoring system based on solid biology.

The Needleman-Wunsch-Sellers Algorithm (NWS)
A dynamic-programming algorithm for finding the highest-scoring alignment of two sequences (given a scoring scheme). Let W(k) be the penalty for a gap of length k and sub(A,B) the substitution score for replacing residue A by B. Suppose we want to align two short sequences: CKHVFCRVCI and CKKCFCKCV.
Needleman, S.B. & Wunsch, C.D. (1970) J.Mol.Biol. 48: ; Sellers, P.H. (1974) SIAM J.Appl.Math. 26:787

1. Line up the sequences to form an L x L' matrix, where L and L' are the lengths of the two sequences (i = 1 to L, j = 1 to L').
2. Find the score at site (i,j) from the scores at previous sites and the scoring scheme:
$$D_{i,j} = \max \{\, D_{i+1,j+1} + \mathrm{sub}(A_i, B_j);\;\; D_{i,j+k} + W(k),\ k = 1, \dots, L'-j;\;\; D_{i+k,j} + W(k),\ k = 1, \dots, L-i \,\}$$

3. Fill the empty sites in column i=L and row j=L’ with zeroes. (Here we use a simple identity substitution matrix, with 1 for a match and 0 for a mismatch, and ignore the gap penalty.)

4. Move in to the next column and row. To fill site (i,j), choose whichever of (i+1,j), (i,j+1) and (i+1,j+1) has the largest value, then add that value to the value of (i,j). E.g. the value of (6,8) was 1. Go to that site from (7,8). The value of (7,8) is 1, so the value of (6,8) is updated to 2.

5. Repeat the process

6. Backtrace from the cell with value 5 at (1,1) to identify all paths leading to (1,1); the value must increase along each path towards (1,1). These paths are highlighted in the figure.

7. Some of the paths are shown in the figure.
8. These paths give alignments such as the following; all have a score of 5.
CKHVFCRVCI
CKKCFC-KCV

CKHVFCRVCI
CKKCFCK-CV

C-KHVFCRVCI
CKKC-FC-KCV

CKH-VFCRVCI
CKKC-FC-KCV
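The steps above can be sketched in Python. This is a minimal illustration, not the original NWS code: it pads the matrix with one extra row and column (an implementation convenience that plays the role of the zero-filled edge), follows the recurrence for D(i,j) by filling from the ends of the sequences, and returns only one of the optimal paths. The names `nws_fill` and `nws_traceback` are made up, and integer scores are assumed so that the equality tests in the traceback are exact.

```python
def nws_fill(a, b, sub, W):
    """D[i][j] = best score for aligning a[i:] with b[j:] (0-based)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        D[i][m] = W(n - i)          # rest of a against a terminal gap
    for j in range(m):
        D[n][j] = W(m - j)          # rest of b against a terminal gap
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = D[i + 1][j + 1] + sub(a[i], b[j])
            for k in range(1, m - j + 1):
                best = max(best, D[i][j + k] + W(k))
            for k in range(1, n - i + 1):
                best = max(best, D[i + k][j] + W(k))
            D[i][j] = best
    return D


def nws_traceback(a, b, D, sub, W):
    """Recover one optimal alignment by retracing the choices from (0,0)."""
    n, m = len(a), len(b)
    i = j = 0
    top, bottom = [], []
    while i < n and j < m:
        if D[i][j] == D[i + 1][j + 1] + sub(a[i], b[j]):
            top.append(a[i]); bottom.append(b[j])
            i += 1; j += 1
            continue
        k = next((k for k in range(1, m - j + 1)
                  if D[i][j] == D[i][j + k] + W(k)), None)
        if k is not None:           # gap of length k in the first sequence
            top.append("-" * k); bottom.append(b[j:j + k]); j += k
            continue
        k = next(k for k in range(1, n - i + 1)
                 if D[i][j] == D[i + k][j] + W(k))
        top.append(a[i:i + k]); bottom.append("-" * k); i += k
    top.append(a[i:] + "-" * (m - j))      # whatever is left at the ends
    bottom.append("-" * (n - i) + b[j:])
    return "".join(top), "".join(bottom)


if __name__ == "__main__":
    identity = lambda x, y: 1 if x == y else 0   # 0/1 substitution matrix
    no_gap = lambda k: 0                         # gap penalty ignored
    a, b = "CKHVFCRVCI", "CKKCFCKCV"
    D = nws_fill(a, b, identity, no_gap)
    print(D[0][0])
    print(*nws_traceback(a, b, D, identity, no_gap), sep="\n")
```

With the 0/1 matrix and no gap penalty the best score for the two example sequences is 5, matching the value in the (1,1) cell of the slide. Passing a real substitution matrix and a non-zero W(k) changes which path is optimal.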

However, if we used the more realistic PAM 250 substitution matrix, these alignments would have different scores (and the NWS algorithm would have picked the alignment with the highest one).
Score with PAM 250 and a gap penalty of 10:
CKHVFCRVCI
CKKCFC-KCV
12 + 5 + 0 - 2 + 9 + 12 - 2 + 12 + 4 - 10 = 40

CKHVFCRVCI
CKKCFCK-CV
12 + 5 + 0 - 2 + 9 + 12 + 3 + 12 + 4 - 10 = 45

C-KHVFCRVCI
CKKC-FC-KCV
12 + 5 - 3 + 9 + 12 - 2 + 12 + 4 - 3 x 10 = 19

CKH-VFCRVCI
CKKC-FC-KCV
12 + 5 + 0 + 9 + 12 - 2 + 12 + 4 - 3 x 10 = 22
Gap penalty is important; biology does not like gaps.

Database searching
Probe sequence
–When we have a sequence (the probe sequence), often we want to find other sequences similar to it in a database
Match sequence
–The sequence(s) found by database search that is (are) similar to the probe sequence; also called a hit.
Homologs
–Sequences having the same ancestor (which diverged and evolved differently)

Score
–Used to determine the quality of a match and as the basis for the selection of matches. Scores are relative.
Expectation value (E.V.)
–An estimate of how many hits this good would be expected by pure chance, given the size of the database; should be as low as possible. E.V.’s are absolute.
A high score and a low E.V. indicate a true hit.
Sequence identity (%) (or Similarity)
–Number of matched residues divided by total length of probe
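Sequence identity as defined above is easy to compute from two aligned rows. A minimal sketch follows; the function name is made up, and gaps are assumed to be written as '-'.

```python
def percent_identity(probe_row, match_row):
    """Matched residues divided by the length of the probe, in percent.

    Both arguments are rows of one alignment (equal length, '-' for gaps).
    """
    probe_length = sum(1 for c in probe_row if c != "-")
    matches = sum(1 for p, m in zip(probe_row, match_row)
                  if p == m and p != "-")
    return 100.0 * matches / probe_length


if __name__ == "__main__":
    print(percent_identity("CKHVFCRVCI", "CKKCFC-KCV"))
```

For the example alignment, 5 of the 10 probe residues are matched, giving 50% identity.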

Rule-of-thumb for true hit
–A database hit having a sequence identity of 25% or more (protein lengths 200 residues or more) is almost certainly a true hit
Popular and powerful sequence search software
–BLAST (or do a Google on “BLAST”)
–FASTA (or do a Google on “FASTA”)

Most important sequence databases
GenBank – maintained by the US National Center for Biotechnology Information (NCBI)
–All biological sequences w.html
–Genomes enome
Swiss-Prot – maintained by EMBL-European Bioinformatics Institute (EBI)
–Protein sequences

Multiple sequence alignment
Often a probe sequence will yield many hits in a search. Then we want to know which residues and positions are common to all or most of the probe and match sequences. In a multiple sequence alignment, all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other, so that a coordinate system is set up, where each row is the sequence for one protein, and each column is the 'same' position in each sequence.

An example: cellulose-binding domain of cellobiohydrolase I
[Figure: multiple alignment with labels for the names of the homologous domains, the position of each residue, the residues and positions common to most homologs, and the consensus.]

A schematic image of the 3D structure of the domain. Arrows indicate beta sheets. Other parts are loops. Kraulis J, et al., Biochemistry 1989, 28(18):

A sequence logo. This shows the conserved residues as larger characters, where the total height of a column is proportional to how conserved that position is. Technically, the height is proportional to the information content of the position.
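The information content of an alignment column can be sketched as follows. This assumes the common definition for logos, log2(20) minus the Shannon entropy of the residue frequencies in the column, and leaves out the small-sample correction that logo programs usually apply; the function name is made up for the example.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, gaps ignored."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((count / n) * math.log2(count / n)
                   for count in Counter(residues).values())
    return math.log2(alphabet_size) - entropy


if __name__ == "__main__":
    print(column_information("CCCC"))  # fully conserved position
    print(column_information("CCKK"))  # two residues, equally frequent
    print(column_information("CKHV"))  # four different residues
```

A fully conserved protein column reaches the maximum of log2(20) bits, roughly 4.32; the more evenly the residues are spread, the lower the value and the shorter the column in the logo.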

Applications of multiple sequence alignment
Identify consensus segments
–Hence the most conserved sites and residues
Use for construction of phylogenies
–Convert similarity to distance
–Of genes, strains, organisms, species, life

ClustalW: A standard multiple alignment program
Original paper
–Thompson JD, Higgins DG, Gibson TJ. Nucleic Acids Res ; 22:
Where to find on the web
–bioweb.pasteur.fr/seqanal/interfaces/clustalw.html
–Do a Google on “ClustalW”

Overview of ClustalW Procedure
1. Quick pairwise alignment: calculate distance matrix
2. Neighbor-joining tree (guide tree)
3. Progressive alignment following guide tree
[Figure: CLUSTAL W overview. Distance matrix and guide tree for Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale; alpha-helices marked above the final alignment.]
1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
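The first step, turning pairwise comparisons into a distance matrix, can be illustrated with the five aligned fragments from the slide. This is a simplified sketch and not what ClustalW itself computes: it assumes the distance is the fraction of mismatching columns among columns without gaps, and it reads the distances off rows that are already aligned. The function names are made up for the example.

```python
ALIGNED = [
    "PEEKSAVTALWGKVN--VDEVGG",  # row 1
    "GEEKAAVLALWDKVN--EEEVGG",  # row 2
    "PADKTNVKAAWGKVGAHAGEYGA",  # row 3
    "AADKTNVKAAWSKVGGHAGEYGA",  # row 4
    "EHEWQLVLHVWAKVEADVAGHGQ",  # row 5
]


def p_distance(row_a, row_b):
    """Fraction of mismatches among columns where neither row has a gap."""
    compared = mismatches = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared


def distance_matrix(rows):
    """Symmetric matrix of pairwise distances between aligned rows."""
    return [[p_distance(a, b) for b in rows] for a in rows]


if __name__ == "__main__":
    for row in distance_matrix(ALIGNED):
        print(" ".join(f"{d:.2f}" for d in row))
```

Rows 1 and 2 come out close to each other, as do rows 3 and 4, while row 5 is distant from all the others. A guide tree built from such distances would join the closest pairs first, which is the order in which the progressive alignment proceeds.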

Databases of multiple alignments
–Pfam: Protein families database of alignments and HMMs
–PRINTS: multiple motifs consisting of ungapped, aligned segments of sequences, which serve as fingerprints for a protein family
–BLOCKS: multiple motifs of ungapped, locally aligned segments created automatically (fhcrc.org)

This lecture is mostly based on
–Lecture on Sequence alignment by Per Kraulis, SBC, Uppsala University
–Elementary Sequence Analysis by Brian Golding, Computational Biology, McMaster U. (helix.biology.mcmaster.ca/courses.html)
A rich resource of lectures is given at Research Computing Resource, New York University School of Medicine

Manual alignment software
–GDE: The Genetic Data Environment (UNIX)
–CINEMA: Java applet available from:
–Seqapp/Seqpup: Mac/PC/UNIX available from:
–SeAl for Macintosh, available from:
–BioEdit for PC, available from: oedit.html