Multiple Alignments Motifs/Profiles What is multiple alignment? HOW does one do this? WHY does one do this? What do we mean by a motif or profile? BIO520.

Presentation transcript:

Multiple Alignments, Motifs/Profiles
What is multiple alignment? HOW does one do this? WHY does one do this? What do we mean by a motif or profile?
BIO520 Bioinformatics, Jim Lund
Prev. reading: Ch 1-5. Assigned reading: Ch 6.4, 6.5, 6.6

Information from Alignments
Infer biological function
  – Conserved elements critical for function
  – Divergent elements relate to divergent function
Infer structure (2°, 3°)
Infer phylogeny
  – History
  – Evolutionary forces (selection…)

How do I find similar sequences?

Multiple Alignment Global, Optimal Theory Computation Progressive Alignment

Multiple Alignment: better alignments

Alignment Methods/Programs
GAP (GCG suite)
  – Optimal alignment
MSA
  – (Nearly) optimal alignment
Clustal W/X
  – Progressive alignment
PSI-BLAST
  – Searches for matching sequences iteratively
  – Search sequence is the invariant master for the alignment

MSA Strategy
Cost of an alignment A, summed over sequence pairs: $c(A) = \sum_{i<j} c(A_{i,j})$. Minimize the score!
HUGE matrix (size ~ aa^(# of seqs)) → CRASH computer
  – time ~ product of sequence lengths
  – 1000x10,000 OK, but 200x200x200x200 NOT
Alignment procedure
  – nearly optimal (only considers a subset of all alignments)
  – weight sequences via distance
  – branch-and-bound algorithm
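The sum-of-pairs cost and the matrix-size argument on this slide can be sketched in a few lines of Python. This is only an illustration: the mismatch and gap costs are arbitrary assumptions, the function names are hypothetical, and the real MSA program also weights sequence pairs, which is omitted here.

```python
from itertools import combinations

def pair_cost(a, b, mismatch=1, gap=2):
    """Cost of one aligned column for two sequences (lower is better)."""
    if a == "-" and b == "-":
        return 0              # gap opposite gap is not penalized
    if a == "-" or b == "-":
        return gap            # residue opposite gap
    return 0 if a == b else mismatch

def sum_of_pairs_cost(alignment, mismatch=1, gap=2):
    """c(A): sum of the pairwise costs c(A_ij) over all sequence pairs."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for row_i, row_j in combinations(alignment, 2):
        total += sum(pair_cost(a, b, mismatch, gap)
                     for a, b in zip(row_i, row_j))
    return total

def dp_cells(lengths):
    """Approximate number of cells in a full dynamic-programming matrix."""
    cells = 1
    for n in lengths:
        cells *= n
    return cells

rows = ["ACG-T", "ACGAT", "A-GAT"]   # toy alignment, not real data
print(sum_of_pairs_cost(rows))
print(dp_cells([1000, 10000]))        # two long sequences
print(dp_cells([200, 200, 200, 200])) # four short sequences
```

The last two calls show why four sequences of length 200 are harder than two sequences of length 1000 and 10,000: the matrix grows with the product of the lengths, so each added sequence multiplies the work.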

Running MSA
Download and run it locally (UNIX):
  – chaffer/genetic_analysis.html
On the internet:
  – align/multi-align.html
Rerun on segments AFTER Clustal...

Clustal Strategy
1. Rapid pairwise alignments, each-to-each
2. Calculate distance matrix
  – Create guide tree (neighbor joining)
3. Align
  – Closest pairs first
  – Add pairs or align sub-alignments
  – Adjust similarity matrix as alignment proceeds
4. Add sequences
  – Introduce gaps
  – Gaps at loops, not inside known 2° structures
  – Dynamic gap weighting

Clustal Strategy Pairwise alignments Guide tree Align

Clustal W(X) Strategy 1. Pairwise alignments The pairwise alignment number here is a dissimilarity measure.
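As a rough illustration of a pairwise dissimilarity measure, the sketch below computes the fraction of differing residues for each pair of rows and collects the values in a distance matrix. It assumes the rows are already aligned to equal length; Clustal itself aligns each pair separately first, and its exact distance formula is not reproduced here. The function names and the toy sequences are hypothetical.

```python
def dissimilarity(a, b):
    """Fraction of differing residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 1.0            # nothing comparable: treat as maximally distant
    differing = sum(1 for x, y in compared if x != y)
    return differing / len(compared)

def distance_matrix(rows):
    """Symmetric matrix of pairwise dissimilarities, zero on the diagonal."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dissimilarity(rows[i], rows[j])
    return d

rows = ["ACGTAC", "ACGTTC", "AGG-TC"]   # toy aligned rows, not real data
for line in distance_matrix(rows):
    print(["%.2f" % value for value in line])
```

A value of 0 means the two rows are identical over the compared columns; larger values mean more divergent sequences.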

Clustal W(X) Strategy 2. Unrooted neighbor tree (dendrogram)

Clustal W(X) Strategy 3. Guide tree
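To make the guide-tree step concrete, here is a minimal sketch that turns a distance matrix into a nested-tuple tree. Note the substitution: the slides say Clustal uses neighbor joining, while this sketch uses plain average-linkage clustering (UPGMA) because it is shorter. The sequence names and distances are invented for illustration.

```python
def upgma(names, dist):
    """Average-linkage clustering; returns the guide tree as nested tuples."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)               # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:                     # distance from merged cluster to c
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # drop every distance that involved the two merged clusters
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]

names = ["seq1", "seq2", "seq3", "seq4"]       # hypothetical sequences
dist = [[0.0, 0.1, 0.5, 0.6],
        [0.1, 0.0, 0.5, 0.6],
        [0.5, 0.5, 0.0, 0.2],
        [0.6, 0.6, 0.2, 0.0]]
print(upgma(names, dist))
```

The innermost tuples are the closest pairs, which is the order in which a progressive aligner would align them.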

Clustal W(X) Strategy 4. Progressive alignment using guide tree

Running Clustal W/X
WWW, Win, Mac, UNIX
Input
  – Multiple sequence file (PIR, FASTA,…)
  – Can FORCE alignments
  – Specify secondary structures
Considerations
  – Fast, easy, widely used
  – Divergent proteins OK (trees misleading)

“The Right Proteins” GAPDH
Rabbit     KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117
Chick      KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117
           *********** :**********.:***.*******************************

“The Right Proteins” GAPDH
Rabbit     KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117
Chick      KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117
Human      KAEDGKLVIDG-KAITIFQERDPENIKWGDAGTAYVVESTGVFTTMEKAGAHLKGGAKRI 118
Tobacco    KVKDEKTLLFGEKSVRVFGIRNPEEIPWAEAGADFVVESTGVFTDKDKAAAHLKGGAKKV 110
Entamoeba  EAGENAIIVNGHKIV-VKAERDPAQIGWGALGVDYVVESTGVFTTIPKAEAHIKGGAKKV 105
           :. : :: * : : :*:* :* *. *. :********* ** **:*****::

Alignment Interpretation
DNA sequences
  – >50% “worth looking at” (eyeball test)
  – ~75% needed for phylogeny
Polypeptide sequences
  – 80% similar = SAME tertiary structure
  – 30-80% domains = similar structure
  – 15-30% ????
  – <15% short motifs
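The percentages on this slide are read off an alignment. As a hedged sketch, the function below computes percent identity for two aligned rows, using the rabbit and chick GAPDH fragments shown earlier in the slides. Two assumptions to keep in mind: the denominator (all columns except gap-versus-gap) is one of several conventions in use, and the protein thresholds on the slide speak of similarity, which also counts conservative substitutions, so identity is only a lower bound on it.

```python
def percent_identity(a, b):
    """Percent of columns with identical residues, ignoring gap-vs-gap columns."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)

# Aligned fragments copied from the "Right Proteins" slide
rabbit = "KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV"
chick = "KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV"
print("%.1f%% identical" % percent_identity(rabbit, chick))
```

Running the same function on more distant pairs from the five-species alignment (for example rabbit versus Entamoeba) gives a feel for where each pair falls in the ranges listed above.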

Uses of Alignment
Understanding or predicting mutant function
Finding motifs in DNA or polypeptides
Directing experiments, e.g. PCR primers
Phylogeny

“The Right Proteins”
Rabbit     KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117
Chick      KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117
Human      KAEDGKLVIDG-KAITIFQERDPENIKWGDAGTAYVVESTGVFTTMEKAGAHLKGGAKRI 118
Tobacco    KVKDEKTLLFGEKSVRVFGIRNPEEIPWAEAGADFVVESTGVFTDKDKAAAHLKGGAKKV 110
Entamoeba  EAGENAIIVNGHKIV-VKAERDPAQIGWGALGVDYVVESTGVFTTIPKAEAHIKGGAKKV 105
           :. : :: * : : :*:* :* *. *. :********* ** **:*****::

Viewing and interpreting alignments
Color residues by property
  – Conservation in the alignment
  – Known properties
  – Substitution groups: STA, HY
  – Physicochemical property: charge, hydrophobicity
Programs for visualization
  – Jalview
  – AMAS
  – Alscript
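Conservation can also be summarized as a symbol line under the alignment, in the style of the Clustal output shown earlier. The sketch below marks a column "*" when all residues are identical and ":" when they all fall inside one substitution group. It is deliberately incomplete: only the two groups named on the slide (STA, HY) are used, whereas Clustal applies a longer list of groups and a third "." symbol, so the output will not match Clustal's line exactly.

```python
SUBSTITUTION_GROUPS = ("STA", "HY")   # only the groups named on the slide

def conservation_line(rows, groups=SUBSTITUTION_GROUPS):
    """One symbol per column: '*' identical, ':' same group, ' ' otherwise."""
    symbols = []
    for column in zip(*rows):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")       # any gap: no conservation symbol
        elif len(residues) == 1:
            symbols.append("*")       # fully conserved column
        elif any(residues <= set(group) for group in groups):
            symbols.append(":")       # all residues in one substitution group
        else:
            symbols.append(" ")
    return "".join(symbols)

rows = ["SAH-K", "TAYGK", "AAHGR"]    # toy alignment, not real data
for row in rows:
    print(row)
print(conservation_line(rows))
```

A viewer that colors by conservation is doing the same per-column classification, then mapping each class to a color instead of a symbol.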

Viewing alignments JalView alignment viewer

How to build multiple alignments
1. Find sequences to align (db search).
2. Choose which regions of each protein to include. Sequences should be of similar lengths.
3. Run multiple alignment program.
4. Inspect multiple alignment for problems. Regions with many gaps have aligned poorly.
5. Remove disruptive sequences and re-run alignment.
6. Add back remaining sequences avoiding disruption.
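Steps 4 and 5 can be partly automated. As a sketch, the functions below report which columns are mostly gaps and how gappy each sequence is, which helps to spot poorly aligned regions and candidate disruptive sequences. The 0.5 threshold and the function names are assumptions for illustration, and a flagged region still needs to be inspected by eye.

```python
def gap_fraction_per_column(rows):
    """Fraction of sequences that have a gap in each alignment column."""
    n = len(rows)
    return [sum(1 for c in column if c == "-") / n for column in zip(*rows)]

def gappy_columns(rows, threshold=0.5):
    """Indices (0-based) of columns where at least `threshold` of rows are gaps."""
    fractions = gap_fraction_per_column(rows)
    return [i for i, f in enumerate(fractions) if f >= threshold]

def gap_fraction_per_sequence(rows):
    """Fraction of gap characters in each row; high values hint at misfits."""
    return [row.count("-") / len(row) for row in rows]

rows = ["AC-GT", "A--GT", "ACTGT", "A--G-"]   # toy alignment, not real data
print(gap_fraction_per_column(rows))
print(gappy_columns(rows))
print(gap_fraction_per_sequence(rows))
```

Sequences with an unusually high gap fraction are natural candidates to remove in step 5 and add back in step 6.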

Motifs vs Alignment
Motifs are short conserved segments
In proteins:
  – PROSITE (“signal sites”)
  – InterPro
In DNA:
  – TFD
Tools for finding motifs:
  – ProfileScan
  – MEME
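To show what a profile adds over a fixed motif, the sketch below builds per-position residue frequencies from a few aligned, ungapped motif instances and slides that profile along a sequence to find the best-scoring window. This is a simplification under stated assumptions: the instances are invented, the pseudo-frequency of 0.01 for unseen residues is arbitrary, and tools such as ProfileScan use log-odds scores against background frequencies rather than raw products.

```python
from collections import Counter

def build_profile(instances):
    """Per-position residue frequencies from equal-length, ungapped instances."""
    if len({len(s) for s in instances}) != 1:
        raise ValueError("motif instances must have the same length")
    n = len(instances)
    return [{res: count / n for res, count in Counter(column).items()}
            for column in zip(*instances)]

def score(profile, segment, pseudo=0.01):
    """Product of position-specific frequencies; unseen residues get `pseudo`."""
    result = 1.0
    for freqs, residue in zip(profile, segment):
        result *= freqs.get(residue, pseudo)
    return result

def best_match(profile, sequence):
    """Start index (0-based) of the highest-scoring window in `sequence`."""
    width = len(profile)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: score(profile, sequence[i:i + width]))

instances = ["GKS", "GKT", "GRS"]     # hypothetical motif instances
profile = build_profile(instances)
print(profile)
print(best_match(profile, "AAGKSAA"))
```

Unlike an exact-match motif, the profile still gives a reasonable score to a window such as "GRT", which never occurs among the instances but is built from residues seen at each position.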

Interpro Pfam 7.3 (3865 domains), PRINTS 33.0 (1650 fingerprints), PROSITE 17.5 (1565 and 252 preliminary profiles), ProDom (1346 domains), SMART 3.1 (509 domains), TIGRFAMs 1.2 (814 domains), SWISS-PROT ( entries), TrEMBL ( entries).

InterPro
A database of protein families, domains and functional sites:
  – PROSITE, home of regular expressions and profiles
  – Pfam, SMART, TIGRFAMs, PIRSF, and SUPERFAMILY, keepers of hidden Markov models (HMMs)
  – PRINTS, provider of fingerprints (groups of aligned, un-weighted motifs)
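PROSITE patterns are essentially regular expressions in their own notation, so a small translator shows how they work. The sketch below handles the common elements (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) repeats, and < or > anchors at the ends of the pattern) and ignores rarer cases such as an anchor inside brackets. The function name is hypothetical, and the test sequence is invented; the pattern is the PROSITE N-glycosylation site, N-{P}-[ST]-{P}.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # N-terminal anchor
        at_end = element.endswith(">")       # C-terminal anchor
        element = element.strip("<>")
        # Separate the residue part from an optional (n) or (n,m) repeat
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
sequence = "AANGSAAPNPSA"                    # invented test sequence
for match in re.finditer(regex, sequence):
    print(match.start(), match.group())
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping sites would need a lookahead-based search.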

Interpro

NCBI CDD (Conserved Domain Database)
Domains from:
Pfam (Protein families)
  – A database of protein families that currently contains > 7973 entries.
SMART (a Simple Modular Architecture Research Tool)
  – More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable.
  – Domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues.
COGs (Clusters of Orthologous Groups)
  – Proteins or groups of paralogs from at least 3 lineages that correspond to an ancient conserved domain.