Protein and RNA Families


Function Prediction: Protein and RNA Families

Tell me what you do and I will tell you who you are …

From multiple alignments we can derive:
* A motif
* A profile (PSSM)
* A hidden Markov model

MOTIF Rxx(F,Y,W)(R,K)SAQ
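The motif above reads naturally as a regular expression: each x is any residue and each parenthesized set is a choice of residues. A minimal sketch using Python's `re` module follows; the function name and the test sequence are made up for illustration.

```python
import re

# Slide motif Rxx(F,Y,W)(R,K)SAQ written as a regular expression:
# "x" becomes "." (any residue); each parenthesized set becomes a character class.
MOTIF = re.compile(r"R..[FYW][RK]SAQ")

def find_motif(sequence):
    """Return (start, matched_text) for every non-overlapping motif occurrence."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]

# Made-up sequence, for illustration only.
hits = find_motif("MKTRAGFKSAQLLPRDEWRSAQ")
print(hits)
```

Positions are 0-based offsets into the sequence string.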

Profile Scoring
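One common way to score a sequence against a profile (PSSM) is to sum per-position log-odds values over a sliding window. The sketch below is only an illustration of that idea: the uniform background, the pseudocount of 1, the function names and the toy alignment are all assumptions, not values taken from the slides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, ungapped aligned segments.

    Assumes a uniform background (1/20 per residue) and a flat pseudocount;
    both are illustrative choices.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum the per-position log-odds scores for one segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

def best_window(pssm, sequence):
    """Slide the profile along the sequence; return (best_score, start)."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical toy alignment, for illustration only.
toy = ["DRTR", "DRTS", "SPTR", "DPTS"]
pssm = build_pssm(toy)
best_score, best_start = best_window(pssm, "AAADRTRAAA")
print(best_start, round(best_score, 2))
```

Residues seen often in a column get positive scores and unseen residues get negative ones, so the window that best resembles the alignment scores highest.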

Profile Hidden Markov Model (profile HMM)
* An MSA can be described by an HMM.
* The HMM is a probabilistic model of the MSA consisting of a number of interconnected states.
* The different states are match, delete or insert.
* Each position is modeled independently.
* The concatenation of the probabilistic models of the positions is the protein model.
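To make the per-position model concrete, the sketch below estimates match-state emission frequencies and the share of sequences that skip each column (that is, use the delete state) from a small alignment. It is a simplified illustration under stated assumptions: every column is treated as a match column, raw counts are used without pseudocounts or sequence weighting, and the toy alignment and function names are hypothetical.

```python
from collections import Counter

def match_emissions(alignment):
    """Per-column residue frequencies, ignoring gap characters ('-')."""
    columns = []
    for col in zip(*alignment):
        residues = [c for c in col if c != "-"]
        counts = Counter(residues)
        # An all-gap column yields an empty dict instead of dividing by zero.
        columns.append({aa: n / len(residues) for aa, n in counts.items()})
    return columns

def delete_fraction(alignment):
    """Fraction of sequences with a gap in each column (delete-state usage)."""
    return [sum(c == "-" for c in col) / len(alignment) for col in zip(*alignment)]

# Hypothetical toy alignment: five sequences, four columns.
toy = ["DRTR", "DRTS", "S--S", "DPTR", "DRTS"]
emissions = match_emissions(toy)
deletes = delete_fraction(toy)
for position, (emit, gap) in enumerate(zip(emissions, deletes), start=1):
    print(position, emit, gap)
```

Production profile-HMM software typically adds priors and sequence weighting on top of counts like these, so the numbers here are only the raw starting point.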

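Profile HMM
[Figure: profile HMM for alignment columns 16-19, with match states M16-M19, delete states D16-D19 and insert states I16-I19. Aligned rows shown: DRTR, DRTS, S--S, SPTR, DRTR, DPTS, D--S, D--R. Match-state emission probabilities: M16: D 0.8, S 0.2; M17: R 0.6, P 0.4; M18: T 1.0; M19: S 0.6, R 0.4. Transition probabilities are labeled 100%, 100% and 50%.]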

Protein Domains
Domains can be considered the building blocks of proteins. Some domains are found in many proteins with different functions, while others are found only in proteins with a certain function. The presence of a particular domain can therefore indicate the function of the protein.

C2H2 Zinc-Finger

DNA Binding domain Zinc-Finger

PROSITE
PROSITE is a database of protein domains that can be searched with either regular-expression-like patterns or sequence profiles.
Zinc_Finger_C2H2: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
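A PROSITE-style pattern maps almost directly onto a regular expression. The sketch below converts the C2H2 zinc-finger pattern, written in PROSITE's hyphenated syntax as C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, into a Python regex and runs it on a made-up sequence. The converter is a hypothetical helper that handles only the pattern elements listed in its docstring; it is not a complete PROSITE parser.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Supported elements: single residues, x (any residue), [ABC] (any of),
    {ABC} (none of), each optionally followed by (n) or (n,m) repeats.
    """
    parts = []
    for element in pattern.strip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)

C2H2 = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(C2H2)
print(regex)

# Made-up sequence containing one C2H2-like segment, for illustration only.
sequence = "MKCAACGKAFSRSDELTRHIRTHTG"
match = re.search(regex, sequence)
print(match.start(), match.group())
```

Patterns are fast to scan but brittle: a single unexpected residue at a fixed position means no match, which is one reason profiles and HMMs are used alongside them.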

Pfam
* A database that contains a large collection of multiple sequence alignments and profile hidden Markov models (HMMs).
* High-quality seed alignments are used to build the HMMs, to which sequences are then aligned.
* The Pfam database is based on two distinct classes of alignments:
  * Seed alignments, which are deemed to be accurate and are used to produce Pfam-A.
  * Alignments derived by automatic clustering of Swiss-Prot, which are less reliable and give rise to Pfam-B.

Pfam coverage
* The first 2000 families covered ~65% of UniProt.
* Currently, 7503 families cover 74% of UniProt.

InterPro
* Built from protein classification databases such as PROSITE, ProDom, SMART, Pfam and PRINTS.
* A total of 10403 entries.
* Uses UniProt (Swiss-Prot and TrEMBL).

Applications of InterPro
A diagnostic protein family signature database for:
* Classification of proteins through text and sequence search tools
* Large-scale classification
* Enhancing genome annotation (fly, human, rice, mouse)
* Proteome analysis

GO (Gene Ontology)
http://www.geneontology.org/
The GO project aims to develop three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes (P), cellular components (C) and molecular functions (F) in a species-independent manner.
There are three separate aspects to this effort:
* writing and maintaining the ontologies themselves;
* making associations between the ontologies and the genes and gene products in the collaborating databases;
* developing tools that facilitate the creation, maintenance and use of ontologies.
An ontology is a description of the concepts and relationships that can exist for an agent or a community of agents.

InterPro to GO
* InterPro: IPR000003 Retinoic acid receptor > GO: DNA binding GO:0003677
* InterPro: IPR000005 AraC type helix-turn-helix > GO: transcription factor GO:0003700
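Mappings like these are convenient to load into a lookup table keyed by InterPro accession. The sketch below parses lines in an `InterPro:<accession> <name> > GO:<term> ; <GO id>` layout; that exact layout, the regular expression and the function name are assumptions made for illustration, and the two sample lines are modeled on the slide's mappings rather than copied from a real file.

```python
import re
from collections import defaultdict

# Assumed line layout: InterPro:<accession> <entry name> > GO:<term name> ; <GO id>
LINE = re.compile(r"InterPro:(IPR\d{6})\s+(.*?)\s+>\s+GO:(.*?)\s*;\s*(GO:\d{7})")

def parse_interpro2go(lines):
    """Return {InterPro accession: [(GO id, GO term name), ...]}."""
    mapping = defaultdict(list)
    for line in lines:
        if line.startswith("!"):
            continue  # treat '!' lines as comments (an assumption)
        m = LINE.match(line.strip())
        if m:
            accession, _entry_name, term, go_id = m.groups()
            mapping[accession].append((go_id, term))
    return dict(mapping)

# Sample lines modeled on the slide's two mappings, for illustration only.
sample = [
    "! illustrative mapping lines",
    "InterPro:IPR000003 Retinoic acid receptor > GO:DNA binding ; GO:0003677",
    "InterPro:IPR000005 AraC type helix-turn-helix > GO:transcription factor ; GO:0003700",
]
interpro2go = parse_interpro2go(sample)
print(interpro2go["IPR000003"])
```

With such a table, every protein that matches an InterPro entry can inherit that entry's GO terms as a first-pass functional annotation.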

Databases and tools for protein families and domains
* InterPro - Integrated Resource of Protein Domains and Functional Sites
* PROSITE - a database of protein families and domains
* BLOCKS - BLOCKS db
* Pfam - protein families db (HMM derived)
* PRINTS - protein motif fingerprint db
* ProDom - protein domain db (automatically generated)
* PROTOMAP - an automatic hierarchical classification of Swiss-Prot proteins
* SBASE - SBASE domain db
* SMART - Simple Modular Architecture Research Tool
* TIGRFAMs - TIGR protein families db

Clusters of Orthologous Groups of proteins (COGs)
Classification of conserved genes according to their homologous relationships (Koonin et al., NAR).
* Homologs - proteins with a common evolutionary origin.
* Orthologs - proteins from different species that evolved by vertical descent (speciation).
* Paralogs - proteins encoded within a given species that arose from one or more gene duplication events.

Clusters of Orthologous Groups of proteins (COGs) Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG.

COGs - Clusters of Orthologous Groups
* All-against-all sequence comparison of the proteins encoded in completed genomes (paralogs/orthologs).
* For a given protein “a” in genome A, if there are several similar proteins in genome B, the most similar one, “b”, is selected.
* If, when protein “b” is used as a query, protein “a” in genome A is selected as the best hit, then “a” and “b” can be included in a COG.
* Proteins in a COG are more similar to other proteins in the COG than to any other protein in the compared genomes.
* A COG is defined when it includes at least three homologous proteins from three distant genomes.
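The procedure above can be sketched as a reciprocal-best-hit search followed by a hunt for triangles spanning three genomes. This is a simplified illustration, not the published COG pipeline: the score table, protein names and function names are hypothetical, and real construction also merges triangles and handles paralogs.

```python
from itertools import combinations

def best_hits(scores, genome_of):
    """For each protein, find its highest-scoring hit in every other genome.

    scores maps (query, subject) -> similarity score from an all-against-all search.
    Returns {(query, subject_genome): best_subject}.
    """
    best = {}
    for (query, subject), value in scores.items():
        if genome_of[query] == genome_of[subject]:
            continue  # same-genome hits are paralog candidates, not orthologs
        key = (query, genome_of[subject])
        if key not in best or value > best[key][1]:
            best[key] = (subject, value)
    return {key: subject for key, (subject, _) in best.items()}

def reciprocal_pairs(best, genome_of):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    pairs = set()
    for (query, _genome), subject in best.items():
        if best.get((subject, genome_of[query])) == query:
            pairs.add(frozenset((query, subject)))
    return pairs

def triangles(pairs):
    """Trios in which every pair is a reciprocal best hit (three genomes)."""
    proteins = sorted(set().union(*pairs))
    return [
        trio for trio in combinations(proteins, 3)
        if all(frozenset(pair) in pairs for pair in combinations(trio, 2))
    ]

# Hypothetical proteins and scores, for illustration only.
genome_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
scores = {
    ("a1", "b1"): 90, ("b1", "a1"): 90,
    ("a1", "b2"): 50, ("b2", "a1"): 50,
    ("a1", "c1"): 80, ("c1", "a1"): 80,
    ("b1", "c1"): 85, ("c1", "b1"): 85,
    ("b2", "c1"): 40, ("c1", "b2"): 40,
}
pairs = reciprocal_pairs(best_hits(scores, genome_of), genome_of)
cogs = triangles(pairs)
print(cogs)
```

Here b2 is left out because its best hits are not reciprocated, which mirrors the rule that the most similar protein in each genome is the one selected.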

Distribution of functional categories in the COGs database
[Figure: distribution of functional categories; labeled categories include "Function unknown" and "General function, prediction only".]

Information in COGs
* Annotation of proteins by members of known structure/function.
* Phylogenetic patterns: presence or absence of proteins in a given organism, which enables following metabolic pathways.
* Multiple alignments.

Discovering common motifs in unaligned sequences
MEME can be used for protein sequences as well as for DNA sequences.

RNA families
Rfam: a general non-coding RNA database (most of the data is taken from specialized databases)
http://www.sanger.ac.uk/Software/Rfam/
Includes many families of non-coding RNAs and functional motifs, as well as their alignments and secondary structures.

Rfam (currently version 6.1)
379 different RNA families or functional motifs from mRNA UTRs etc.
Family types: gene, intron, cis elements

An example of an RNA family: miR-1 (microRNAs)