Multiple Sequence Alignment Carlow IT Bioinformatics November 2006.


MSA
The central techniques in bioinformatics:
  – homology searching
  – multiple sequence alignment
  – phylogenetic trees

An example
“All you have to do” is rewrite your sequences so that similar features end up in the same columns.

Evolutionary relationship
“Similar features” ideally means homologous: derived from a shared ancestor.
ClustalW and T-Coffee mimic the process of evolution
  – by weighting similar residues by how conserved they are in evolution
      Important AAs are rarely substituted
      Less important AAs change easily, even randomly
  – by inserting judicious gaps

Criteria for alignment
Amino acids in the same column have
  – Structural similarity (used by threading programs)
      Practical exercise: inferring the position of Bsu recA AAs
  – Evolutionary similarity: residues have a common ancestor
  – Functional similarity (active site, Cys-Cys disulphide bonds)
      You may have to hand edit to line up known functions
  – Sequence similarity
The first 3 (clear biological attributes) are, you hope, reflected by the last (an abstraction), which is what MSA programs use.

Applications
Discover conserved patterns/motifs
  – A step towards describing a protein domain
  – MSA can add a distant relative to your protein family
A step towards defining DNA regulatory elements
Prediction of secondary structure; also helps 3-D structure prediction
A step towards phylogenetic trees: to describe or show the process of evolution
PCR analysis/primer design
  – find the most and least degenerate regions of your sequence

So why difficult?
Even a trivial 2-sequence alignment has 3 possibilities:

FGDERTHHS    FGDERTHHS    FGDERTHHS
FGD--DHRS    FGDD--HRS    FGD-D-HRS

Where to put the gap?
As the length and number of sequences increase, the number of possible alignments becomes astronomical.
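
A rough way to put numbers on "astronomical" is to count every possible gapped alignment of two sequences. The sketch below assumes each column is residue over residue, residue over gap, or gap over residue, and counts alignments that differ only in the order of adjacent gaps as distinct; the function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of ways to align two sequences of lengths m and n."""
    if m == 0 or n == 0:
        return 1  # only gaps are left opposite the remaining residues
    return (
        count_alignments(m - 1, n - 1)  # last column: residue over residue
        + count_alignments(m - 1, n)    # last column: residue over gap
        + count_alignments(m, n - 1)    # last column: gap over residue
    )


for length in (1, 2, 3, 9, 20):
    print(length, count_alignments(length, length))
```

Even for two sequences the count grows very fast with length, which is why alignment programs use dynamic programming rather than enumeration.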

Some data
Cat  ATGAAACGTCGGATCTAA
Dog  ATGAATCGACCCATCTAA
Mus  ATGGCGTGGCTTGGCATGTGA
Rat  ATGGCATGTCGTGGCATGTAG

Protocol step 1
Align each pair of sequences: C-D, C-M, C-R, etc.
Get a score for each alignment
And make a …
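
Step 1 can be sketched with a small global (Needleman-Wunsch style) dynamic programme that returns one score per pair. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions for illustration, not the parameters any particular program uses.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] is the best score for the prefix of a seen so far versus b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against nothing but gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


seqs = {
    "Cat": "ATGAAACGTCGGATCTAA",
    "Dog": "ATGAATCGACCCATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}
names = list(seqs)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(x, y, global_alignment_score(seqs[x], seqs[y]))
```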

Similarity matrix
        Cat   Dog   Mus   Rat
Cat     ID
Dog           ID
Mus                 ID    16
Rat                       ID

Number of identical residues
  – Which pair of sequences is most similar?
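
A minimal sketch of how such a matrix could be filled in, assuming the gapped rows from the "An alignment" slide later in this deck stand in for the individual pairwise alignments. The helper name is hypothetical.

```python
from itertools import combinations

# Gapped rows as they appear on the "An alignment" slide.
aligned = {
    "Cat": "ATGAAACGTCGG---ATCTAA",
    "Dog": "ATGAATCGACCC---ATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}


def identical_residues(x: str, y: str) -> int:
    """Count columns where both rows carry the same residue (gap columns ignored)."""
    return sum(1 for p, q in zip(x, y) if p == q and p != "-")


identity = {
    (a, b): identical_residues(aligned[a], aligned[b])
    for a, b in combinations(aligned, 2)
}
most_similar = max(identity, key=identity.get)

print(identity)
print("most similar pair:", most_similar)
```

With these rows the Mus/Rat count comes out as 16, the value shown in the matrix, and that pair is the most similar.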

Progressive alignment
Align the two most similar sequences, inserting any gaps.
    Mus/Rat: lock these sequences together (call it “RODent”)
Return to the similarity matrix to find the next most similar sequences or sequence cluster.
    Dog/Cat: align and lock (call it “CARnivore”)
  – if the next step requires a gap, the gap is inserted in both carnivore sequences
Align the next most … (now it’s iterative)
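
The "locking" rule can be sketched as a function that inserts the same gap into every row of a cluster. In a real program the gap position comes out of aligning the two clusters; here position 12 and length 3 are simply read off the finished alignment in this deck, and the function name is hypothetical.

```python
def insert_gaps(cluster: dict[str, str], position: int, length: int) -> dict[str, str]:
    """Insert `length` gap characters at `position` in every row of a locked cluster."""
    return {
        name: row[:position] + "-" * length + row[position:]
        for name, row in cluster.items()
    }


# The two carnivore sequences are already aligned to each other (no gaps needed).
carnivore = {"Cat": "ATGAAACGTCGGATCTAA", "Dog": "ATGAATCGACCCATCTAA"}

# Aligning the carnivore cluster to the rodent cluster calls for a 3-column gap,
# which goes into both carnivore rows at once.
carnivore = insert_gaps(carnivore, position=12, length=3)
for name, row in carnivore.items():
    print(name, row)
```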

An alignment
Cat  ATGAAACGTCGG---ATCTAA
Dog  ATGAATCGACCC---ATCTAA
Mus  ATGGCGTGGCTTGGCATGTGA
Rat  ATGGCATGTCGTGGCATGTAG
     ***    * *     ** *

Good: always a two-“sequence” problem
  – so computationally possible
Bad: can’t rewrite or decouple (part of) the dog/cat alignment in the light of later information. Locked in a (suboptimal?) trough.
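
The row of stars marks columns where all four sequences agree. A minimal sketch of how that line can be derived from the rows above (the function name is hypothetical):

```python
alignment = [
    "ATGAAACGTCGG---ATCTAA",  # Cat
    "ATGAATCGACCC---ATCTAA",  # Dog
    "ATGGCGTGGCTTGGCATGTGA",  # Mus
    "ATGGCATGTCGTGGCATGTAG",  # Rat
]


def conservation_line(rows: list[str]) -> str:
    """'*' under columns where every row has the same residue, ' ' elsewhere."""
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*rows)
    )


for row in alignment:
    print(row)
print(conservation_line(alignment))
```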

More complex 10 seq example

Choosing the right seqs
Use MSA to inform you!
Always use AA/protein sequences if possible
  – can copy gaps back to DNA later
Start with 6-15 sequences
Eliminate very different (<30% identity) sequences
Eliminate identical sequences
Watch out for partial sequences
  …or sequences that need many gaps to align
Check for repeats with Dotlet, LALIGN
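
The two elimination rules can be sketched as a filter over a first-pass alignment. This assumes identity is measured against one chosen reference sequence over columns where both rows have a residue; the slide does not say how the 30% is measured, and the sequences and names below are made up for illustration.

```python
def percent_identity(x: str, y: str) -> float:
    """Identical residues as a percentage of columns where both rows have a residue."""
    pairs = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(p == q for p, q in pairs) / len(pairs)


def select_sequences(aligned: dict[str, str], reference: str,
                     min_identity: float = 30.0) -> dict[str, str]:
    """Keep the reference plus rows that are neither duplicates nor too divergent."""
    kept = {reference: aligned[reference]}
    for name, row in aligned.items():
        if name == reference:
            continue
        if row in kept.values():
            continue  # identical to a sequence already kept
        if percent_identity(aligned[reference], row) < min_identity:
            continue  # below the identity cut-off
        kept[name] = row
    return kept


# Hypothetical toy rows, not real data.
toy = {
    "ref":   "MDGEEVAALV",
    "copy":  "MDGEEVAALV",
    "close": "MDSEEVAALV",
    "far":   "WPKRTHCYQN",
}
print(list(select_sequences(toy, "ref")))
```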

Less is more
Large alignments
  – take a lot of CPU and time
  – are hard to do well
  – are difficult to display
  – are difficult to use: in trees, for example
  – may include marginal seqs that wreck the whole alignment
So start small and add/eliminate seqs until you have a clear, informative picture

Level of variation is important
Choose the sequence family with the best rate of evolution for your taxonomic group
  – Histones evolve very slowly (compare kingdoms)
  – Transferrins evolve fast (compare classes, orders)
Closely related sequences may have identical protein (but variable DNA)
Distantly related sequences: no DNA signal (“saturated”)

ClustalW at embnet.ch.org
Paste in your FASTA sequences

Output choices

ClustalW at EBI
Paste in your (FASTA) sequences

EBI: loads of options

T-Coffee
Minimal input parameters and STILL a better job than ClustalW

Output: EBI ClustalW
Pairwise distances etc.
Alignment
Guide tree
What you submitted
Jalview alignment editor

An alignment fragment
ACT_CANAL   -MDGEEVAALIIDNGSGMCKA
ACT_CANDU   -MDGEEVAALVIDNGSGMCKA
ACT_PICAN   -MDGEDVAALVIDNGSGMCKA
ACT_PICPA   -MDGEDVAALVIDNGSGMCKA
ACT_KLULA   -MDS-EVAALVIDNGSGMCKA
ACT_YEAST   -MDS-EVAALVIDNGSGMCKA
ACT_YARLI   -MED-ETVALVIDNGSGMCKA
ACT2_ABSGL  MSMEEDIAALVIDNASGMCKA
ACT2_SCHCO  --MDDEIQAVVIDNGSGMCKA
            : *:::**.****** *

*  All AA in column identical
:  AA similar in size & hydrophobicity
.  AA similar in size or hydrophobicity
ClustalW format
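
The three symbols can be sketched as a per-column lookup. The "strong" and "weak" residue groups below are the ones I recall from the Clustal documentation; treat them as an assumption and check them against the program before relying on them. The function names are hypothetical, and the line is computed only from the nine rows shown.

```python
# Assumed Clustal residue groups (recalled from the Clustal documentation).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column) -> str:
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # columns containing a gap get no symbol
    if len(residues) == 1:
        return "*"  # fully conserved
    if any(residues <= set(group) for group in STRONG):
        return ":"  # all residues fall in one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."  # all residues fall in one weak group
    return " "


def consensus_line(rows: list[str]) -> str:
    return "".join(column_symbol(column) for column in zip(*rows))


rows = [
    "-MDGEEVAALIIDNGSGMCKA",  # ACT_CANAL
    "-MDGEEVAALVIDNGSGMCKA",  # ACT_CANDU
    "-MDGEDVAALVIDNGSGMCKA",  # ACT_PICAN
    "-MDGEDVAALVIDNGSGMCKA",  # ACT_PICPA
    "-MDS-EVAALVIDNGSGMCKA",  # ACT_KLULA
    "-MDS-EVAALVIDNGSGMCKA",  # ACT_YEAST
    "-MED-ETVALVIDNGSGMCKA",  # ACT_YARLI
    "MSMEEDIAALVIDNASGMCKA",  # ACT2_ABSGL
    "--MDDEIQAVVIDNGSGMCKA",  # ACT2_SCHCO
]
print(consensus_line(rows))
```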

The alignment, so what next?
Look at it very closely
Hand edit if necessary (it probably is)
Eliminate problem sequences and redo?
Use the display option best suited to the next step
  – Phylip format for trees
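
For the tree step, an alignment can be written out in strict sequential PHYLIP layout: a header with the number of sequences and the alignment length, then one line per sequence with the name padded to 10 characters. A minimal sketch, with a hypothetical function name; names longer than 10 characters are simply truncated here, which could create duplicates.

```python
def to_phylip(aligned: dict[str, str]) -> str:
    """Format an alignment as strict sequential PHYLIP text."""
    lengths = {len(row) for row in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    lines = [f" {len(aligned)} {lengths.pop()}"]
    for name, row in aligned.items():
        lines.append(f"{name[:10]:<10}{row}")  # name field is exactly 10 characters
    return "\n".join(lines) + "\n"


dna = {
    "Cat": "ATGAAACGTCGG---ATCTAA",
    "Dog": "ATGAATCGACCC---ATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}
print(to_phylip(dna), end="")
```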

Parameter changes
Substitution matrix: PAM, Gonnet, Blosum
  – ClustalW chooses which matrix within the family: PAM30 for closely related pairs; PAM120; PAM250 for more distant
  – Difficult alignment: a matrix change may help
Gap penalties (open and extend) have optimal values for each family: find them by trial and error.
  – ClustalW puts gaps (which are often external loops) near previous gaps (longer loop)
MSA does the grunt work. YOU do the fine tuning.
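
The open and extend penalties are commonly combined into an affine gap cost. The sketch below assumes the convention cost = open + extend × length (some programs charge the extension only from the second position), and the default values are illustrative rather than recommended settings.

```python
def gap_cost(length: int, gap_open: float = 10.0, gap_extend: float = 0.2) -> float:
    """Affine cost of one gap spanning `length` columns."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


one_long_gap = gap_cost(3)
three_short_gaps = 3 * gap_cost(1)
print(one_long_gap, three_short_gaps)
```

With a high opening penalty and a low extension penalty, one long gap is much cheaper than several scattered short ones, so raising or lowering either value changes where the program prefers to put gaps.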

Guide tree
To figure out which pairs of sequences to align first, a phylogenetic tree is calculated from the pairwise distance matrix.
  – Stored in a DND (dendrogram) file
      Never use this file to draw a tree
ClustalW can construct a tree from the multiple sequence alignment (better than pairwise)
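
A guide tree can be sketched by repeatedly joining the two closest clusters and writing the result as a Newick string, the text format used for dendrogram files. This sketch uses simple average linkage, not necessarily the method ClustalW itself applies, and it leaves out branch lengths. The distances are hypothetical: 21 alignment columns minus the identical-residue counts for the four DNA sequences in this deck.

```python
from itertools import combinations


def guide_tree(distances: dict[frozenset, float]) -> str:
    """Join the closest clusters first (average linkage); return a Newick string."""
    # Each cluster maps its Newick label to the leaves it contains.
    clusters = {name: [name] for pair in distances for name in sorted(pair)}

    def average_distance(x: str, y: str) -> float:
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(distances[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters[f"({x},{y})"] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters)) + ";"


# Hypothetical distances (see lead-in).
distances = {
    frozenset({"Cat", "Dog"}): 7,
    frozenset({"Cat", "Mus"}): 12,
    frozenset({"Cat", "Rat"}): 9,
    frozenset({"Dog", "Mus"}): 12,
    frozenset({"Dog", "Rat"}): 12,
    frozenset({"Mus", "Rat"}): 5,
}
print(guide_tree(distances))
```

The nesting of the brackets gives the order of alignment: the innermost pairs are aligned first.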

Alignment display: WebLogo
Always remember: a sequence represents a 3-D structure

Patterns to recognise (more reliable in MSA than in single seq)
Alternate hydrophobic residues
  – Surface β-sheet (zig-zag-zig-zag)
Runs of hydrophobic residues
  – Interior/buried β-sheet
Residues with 3.5 AA spacing (amphipathic; an α-helix has 3.6 residues per turn)
  – α-helix WNNWFNNFNNWNNNF
Gaps/indels
  – Probably surface, not core
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%
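
Two of these patterns are easy to look for by machine. The sketch below assumes a particular set of "hydrophobic" one-letter codes (the slide does not define one) and uses hypothetical function names; the first test string is made up, the second is the helix pattern from the slide.

```python
HYDROPHOBIC = set("AVILMFWYC")  # assumption: one common choice of hydrophobic residues


def hydrophobic_positions(seq: str) -> list[int]:
    """0-based positions of hydrophobic residues."""
    return [i for i, residue in enumerate(seq) if residue in HYDROPHOBIC]


def hydrophobic_runs(seq: str, min_length: int = 4) -> list[tuple[int, int]]:
    """(start, end) pairs, end exclusive, of runs of at least min_length hydrophobic residues."""
    runs, start = [], None
    for i, residue in enumerate(seq + "-"):  # trailing sentinel closes a final run
        if residue in HYDROPHOBIC:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


print(hydrophobic_runs("KDVILFAGEK"))             # a buried-strand-like run
print(hydrophobic_positions("WNNWFNNFNNWNNNF"))   # helix-like spacing
```

In the helix pattern the hydrophobic positions are separated by 3 or 4 residues (with one adjacent pair), which keeps them on one face of the helix.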

Conserved residues
W, F, Y: large hydrophobic, internal/core
  – conserved W, F, Y are the best signal for domains
G, P: turns; can mark the end of an α-helix or β-sheet
C: conserved with reliable spacing suggests C-C disulphide bridges (defensins)
H, S: often catalytic sites in proteases (and other enzymes)
K, R, D, E: charged; ligand binding or salt bridge
L: a very common AA but not conserved
  – except in the leucine zipper L234567L234567L234567L
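
The leucine zipper spacing (an L at every seventh position, four times over) translates directly into a regular expression. A minimal sketch with a hypothetical function name and a made-up test sequence; a match only flags a candidate, since L is common and the spacing can occur by chance.

```python
import re

# L, six of anything, repeated: L234567L234567L234567L
# The lookahead makes overlapping candidates visible too.
LEUCINE_ZIPPER = re.compile(r"(?=L.{6}L.{6}L.{6}L)")


def find_leucine_zippers(seq: str) -> list[int]:
    """0-based start positions of L-x(6)-L-x(6)-L-x(6)-L candidates."""
    return [match.start() for match in LEUCINE_ZIPPER.finditer(seq)]


toy = "MK" + "LAAAAAA" * 3 + "L" + "GG"  # hypothetical sequence
print(find_leucine_zippers(toy))
print(find_leucine_zippers("L234567L234567L234567L"))
```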

Finish with an alignment: defensins
3 pairs of C residues: 3 disulphide bridges