Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.

Presentation transcript:

Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence Mutations change sequences

Spring 2002 Christophe Roos - 4/6 Sequence comparison
Sequence comparison
- Function by analogy: if sequences are conserved, their function is probably also conserved.
- Functional domains: if some parts of the sequences are more conserved than other parts, there must be an underlying biological reason for it.
- Establishing relationship/differences in function: by quantification of sequence relationships it is possible to estimate the function of novel genes.
- Establishing relationship between species.
Why, how?
- Compare two sequences of similar length
- Compare two sequences of very different length
- Compare several sequences
- Allow gaps or not?
- Scoring: yes-no or good-intermediate-bad?
- The best or all above a threshold?

Sequence comparison – metrics
The scoring matrix:
- The score for a match
- The penalty for a mismatch
- The penalty for the insertion of a gap (gap-open)
- The penalty for elongating a gap (gap-length)
Local or global similarities?
GA-CGGATTAG
GATCGGAATAG
In this example, column 3 is a gap, column 8 is a mismatch, and the remaining columns are matches.
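As a rough illustration of how these metrics combine, the sketch below scores the example alignment above column by column. The function name `score_alignment` and the separate `gap_open` / `gap_extend` parameters are illustrative assumptions. The default values (+5, -4, -7) are the ones quoted in the matrix walk-through later in this deck, with opening and extending a gap costing the same.

```python
def score_alignment(row_a, row_b, match=5, mismatch=-4,
                    gap_open=-7, gap_extend=-7):
    """Score two rows of an existing alignment ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    prev_gap_row = None  # which row held a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is an extension,
            # otherwise a new gap is opened.
            score += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


# 9 matches, 1 mismatch, 1 gap: 9*5 - 4 - 7
slide_score = score_alignment("GA-CGGATTAG", "GATCGGAATAG")
# Hypothetical affine penalties: opening costs more than extending.
affine_score = score_alignment("A--C", "AGGC", gap_open=-10, gap_extend=-1)
```

With equal open and extend penalties this reduces to a flat per-position gap cost. Making `gap_open` larger in magnitude than `gap_extend` favours one long gap over several short ones.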

Scoring matrices
When evaluating an aligned pair of residues, one weighs how meaningful it is to find that pair there. The matrix is a table of values that describe the probability of a residue pair occurring in an alignment. The probabilities are derived from samples of alignments known to be valid. They can then be used to evaluate the similarity of sequences with unknown function to sequences with known function.
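One common way to turn such pair probabilities into scores is a log-odds ratio: how much more often a pair is observed in trusted alignments than would be expected by chance. The sketch below is a simplified illustration on invented toy alignments, not the procedure used to build any published matrix. The function name `log_odds_scores` and the data are hypothetical.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Return {(a, b): log2(observed / expected)} from gap-free aligned pairs."""
    pair_counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        for a, b in zip(seq_a, seq_b):
            # Count each column in both orders so the table is symmetric.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())

    # Background frequency of each residue: marginal of the pair table.
    residue_freq = Counter()
    for (a, _b), n in pair_counts.items():
        residue_freq[a] += n / total

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        expected = residue_freq[a] * residue_freq[b]
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Invented toy data (two tiny gap-free alignments), for illustration only.
toy_alignments = [("AAB", "AAB"), ("ABB", "AAB")]
scores = log_odds_scores(toy_alignments)
```

Pairs seen more often than chance (here A with A, B with B) get positive scores, and pairs seen less often (A with B) get negative ones.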

Scoring matrices
For DNA, they are usually binary: either there is similarity or there is not. For proteins, they reflect the chemical nature and frequencies of the amino acids, and cover a larger range of values.

Commonly used matrices for proteins
- Blosum matrices are derived from the Blocks database, which contains ungapped alignments from families of related proteins. The number indicates the similarity threshold level: Blosum62, Blosum45.
- PAM matrices are scaled according to a model of evolutionary distance, derived from alignments of closely related sequences. One PAM unit (PAM1) is a 1% change over all positions.

Walking through an alignment matrix
- Start with a gap (-) against itself and score it 0.
- Fill in one row at a time.
- At each position, compute the scores that result from each of the choices: move one step in each sequence (diagonal), or skip one position horizontally or vertically.
- Choose the best of the three values and save it.
- Score +5 for a match, -4 for a mismatch and -7 for a gap. If the best value is negative, write 0.
- Trace back along the highest-scoring path.
Example: from a cell holding 10, a mismatch gives 10 - 4 = 6 (diagonal) and a gap gives 10 - 7 = 3 (horizontal or vertical).
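The walk-through above can be sketched in a few lines. This is a minimal illustration, assuming a flat gap penalty and the +5 / -4 / -7 scores from the slide. Writing 0 in place of negative values makes it a local alignment in the Smith-Waterman style. The function name and the test sequences are invented for the example.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-7):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Row 0 and column 0 stand for "a gap against itself" and stay 0.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):              # fill one row at a time
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            diag = H[i - 1][j - 1] + sub  # one step in each sequence
            up = H[i - 1][j] + gap        # skip one position vertically
            left = H[i][j - 1] + gap      # skip one position horizontally
            H[i][j] = max(0, diag, up, left)  # negative values become 0
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Invented sequences that share only the short stretch ACGT.
local_result = smith_waterman("TTTACGTAAA", "CCCACGTCCC")
```

The traceback here reports a single best path. Listing every local alignment above a threshold would need additional bookkeeping.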

Global alignments and local ones
Aligning 2 sequences along their whole length is done by stepping through the matrix from top left to bottom right. The best-scoring path can be traced through the matrix, resulting in an optimal alignment. The Needleman-Wunsch algorithm belongs to this class.
Sequences are often modular, so similarities may be only local, in which case global alignments will fail. The Smith-Waterman algorithm is a dynamic programming algorithm that performs local alignment of 2 sequences. If the cumulative score up to some point in the sequence is negative, that path can be abandoned. The alignment can also end anywhere in the matrix.
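For contrast with the local case, here is a minimal sketch of the global score computation in the Needleman-Wunsch style. It assumes a flat gap penalty and the +5 / -4 / -7 scores quoted in the walk-through slide, and it returns only the score of the best path from top left to bottom right. The function name is illustrative.

```python
def needleman_wunsch_score(a, b, match=5, mismatch=-4, gap=-7):
    """Score of the best end-to-end (global) alignment of a and b."""
    # First row: the start of b aligned against nothing but gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # first column: start of a aligned against gaps
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # diagonal
                           prev[j] + gap,       # vertical
                           cur[j - 1] + gap))   # horizontal
        prev = cur
    return prev[-1]  # bottom-right cell


# The two sequences from the metrics slide, with the gap removed.
global_score = needleman_wunsch_score("GACGGATTAG", "GATCGGAATAG")
# Invented sequences that share only a short central stretch.
modular_score = needleman_wunsch_score("TTTACGTAAA", "CCCACGTCCC")
```

Unlike the local variant, scores are never reset to 0 and leading gaps are penalised. That is why the second pair, which shares only a short region, ends up with a negative global score.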

Example: local alignment of 2 sequences
Web page: enter two sequences and search for local alignments. Two are found.

Iterate: compare one against many
By iterating pairwise comparisons, one can compare one sequence against a database of many sequences. Algorithms such as Smith-Waterman are too slow for this (they are optimised for quality). Multistep algorithms have been developed for this task:
- Fasta: (i) use only every k-th position (k is usually 2 for proteins and 6 for DNA) and search for short sequences (k-tups); (ii) score the 10 ungapped alignments with the most identical k-tups; (iii) try to merge them into a gapped alignment without reducing the score below a threshold.
- Blast: (i) create a list of short words that score high enough when compared to the query; (ii) search for these words in a precomputed table of all words and their positions in the database; (iii) extend the hits into ungapped or even gapped local alignments.
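The word-lookup idea shared by both tools can be sketched as follows. This is a deliberately simplified illustration, not the actual FASTA or BLAST implementation: it uses exact word matches only (BLAST also accepts similar words that score above a threshold), it indexes every position, and it stops before the extension step. All names and the toy sequences are invented.

```python
from collections import Counter, defaultdict


def build_word_index(database, k):
    """Precompute a table: word -> list of (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Look up every query word; return (name, query pos, database pos)."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((name, qpos, dpos))
    return seeds


def seeds_per_diagonal(seeds):
    """Count seeds sharing a diagonal (database pos - query pos).

    Several seeds on one diagonal suggest an ungapped region worth extending.
    """
    return Counter((name, dpos - qpos) for name, qpos, dpos in seeds)


# Invented toy database and query.
toy_db = {"seq1": "MKVLAAGIVG", "seq2": "GGGGMKVLPP"}
toy_index = build_word_index(toy_db, k=3)
toy_seeds = find_seeds("AMKVLA", toy_index, k=3)
toy_diagonals = seeds_per_diagonal(toy_seeds)
```

The speed-up comes from building the index once per database. Each query then costs a handful of table lookups instead of a full dynamic programming matrix per database sequence.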

Example: BLAST one against SwissProt
Web page: enter the sequence and search for local alignments. Several are found and listed both graphically and as text. Note the modularity of the query: two domains are apparent.

Iterate (2): compare many against many
Multiple sequence alignments
Example: The eyeless gene is also called PAX6 and can be found in several species: birds, mammals, reptiles, fish, invertebrates.

Multiple sequence alignments
CLUSTAL W (1.81) multiple sequence alignment

PAX6_CHICK
PAX6_HUMAN  MQNS HSGV 8
PAX6_MOUSE  MQNS HSGV 8
PAX6_COTJA  MQNS HSGV 8
PAX6_BRARE  MPQKEYYNRATWESGVASMMQNS HSGV 27
PAX6_ORYLA  MPQKEYHNQATWESGVASMMQNS HSGV 27
PAX6_XENLA  MQNS HSGV 8
PAX6_DROME  MRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHRHSTSSYFATTYYHLTDDECHSGV 60

PAX6_CHICK
PAX6_HUMAN  NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_MOUSE  NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_COTJA  NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_BRARE  NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 87
PAX6_ORYLA  NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 87
PAX6_XENLA  NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_DROME  NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 120

PAX6_CHICK
PAX6_HUMAN  RAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_MOUSE  RAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_COTJA  RAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_BRARE  RAIGGSKPRVATPEVVGKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 147
PAX6_ORYLA  RAIGGSKPRVATPEVVAKIAQYKRECPSIFAWEIRDRLLSEGICTNDNIPSVSSINRVLR 147
PAX6_XENLA  RAIGGSKPRVATPEVVNKIAHYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_DROME  RAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLR 180

PAX6_CHICK
PAX6_HUMAN  NLASEKQQMGA
PAX6_MOUSE  NLASEKQQMGA
PAX6_COTJA  NLASEKQQMGA
PAX6_BRARE  NLASEKQQMGA
PAX6_ORYLA  NLASEKQQMGA
PAX6_XENLA  NLASDKQQMGS
PAX6_DROME  NLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPL 240

First all sequence pairs are aligned and scored, then in a second round a multiple sequence alignment is built up. In this case (PAX6 proteins from vertebrates and fruit fly), two domains are more conserved than the rest of the sequence. Only the first domain is shown here.
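One simple way to make "more conserved" concrete is to compute, for every alignment column, the fraction of sequences that carry the most common residue. The sketch below does this for the first 20 columns of the third alignment block above (the PAX6_CHICK row is empty there and is left out). The function name and the choice of measure are illustrative assumptions. CLUSTAL W's own scoring is more elaborate.

```python
from collections import Counter


def column_conservation(rows):
    """Fraction of rows sharing the most common residue, per column."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*rows):
        most_common_count = Counter(column).most_common(1)[0][1]
        result.append(most_common_count / len(rows))
    return result


# First 20 residues of the third block, copied from the alignment above.
block = {
    "PAX6_HUMAN": "RAIGGSKPRVATPEVVSKIA",
    "PAX6_MOUSE": "RAIGGSKPRVATPEVVSKIA",
    "PAX6_COTJA": "RAIGGSKPRVATPEVVSKIA",
    "PAX6_BRARE": "RAIGGSKPRVATPEVVGKIA",
    "PAX6_ORYLA": "RAIGGSKPRVATPEVVAKIA",
    "PAX6_XENLA": "RAIGGSKPRVATPEVVNKIA",
    "PAX6_DROME": "RAIGGSKPRVATAEVVSKIS",
}
conservation = column_conservation(list(block.values()))
# Columns (0-based) where at least one sequence differs from the others.
variable_columns = [i for i, c in enumerate(conservation) if c < 1.0]
```

Averaging these per-column values over a sliding window is one way to spot conserved regions along a longer alignment.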

Multiple sequences in phylogeny
Once a multiple sequence alignment is done, it can be used for finding:
- Domains (previous slide)
- Relationship (evolutionary distance)
The distance is calculated as the number of mutations needed to evolve from a putative ancestor to all the ‘present-day’ sequences used. Then a path including all sequences is computed. Different metrics can be used (most parsimonious, maximum likelihood, etc.).
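As a much simpler stand-in for the ancestor-based distance described above, the sketch below counts the positions at which two aligned sequences differ (a Hamming distance) and tabulates it for every pair. It is not a parsimony or maximum-likelihood method and it does not build a tree. It only illustrates the kind of pairwise numbers such methods start from. The function names are hypothetical, and the data are the first 20 residues of the third PAX6 alignment block on the previous slide.

```python
from itertools import combinations


def hamming(a, b):
    """Number of aligned positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_table(aligned):
    """Pairwise distances for a dict of name -> aligned sequence."""
    return {(m, n): hamming(aligned[m], aligned[n])
            for m, n in combinations(aligned, 2)}


# Excerpt from the PAX6 alignment (first 20 columns of the third block).
excerpt = {
    "PAX6_HUMAN": "RAIGGSKPRVATPEVVSKIA",
    "PAX6_MOUSE": "RAIGGSKPRVATPEVVSKIA",
    "PAX6_BRARE": "RAIGGSKPRVATPEVVGKIA",
    "PAX6_DROME": "RAIGGSKPRVATAEVVSKIS",
}
distances = distance_table(excerpt)
closest_pair = min(distances, key=distances.get)
```

On this short excerpt the two mammals are identical and the fruit fly is the most distant from the zebrafish sequence. A real analysis would use the full alignment and a proper evolutionary model.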