InterPro Sandra Orchard.

Why do we need predictive annotation tools?

Given a set of uncharacterised sequences, we usually want to know:
what are these proteins, and to what family do they belong?
what is their function, and how can we explain this in structural terms?

1. Pairwise alignment approaches (e.g. BLAST)
Good at recognising similarity between closely related sequences.
Perform less well at detecting divergent homologues.
2. The protein signature approach
Alternatively, we can model the conservation of amino acids at specific positions within a multiple sequence alignment, seeking 'patterns' across closely related proteins.
We can then use these models to infer relationships with previously characterised sequences.
This is the approach taken by protein signature databases.

What are protein signatures?
[Figure: workflow linking a protein family/domain, a multiple sequence alignment, model building, a UniProt search, significant matches, the mature model and protein analysis. Example aligned sequences: ITWKGPVCGLDGKTYRNECALL, AVPRSPVCGSDDVTYANECELK]

Diagnostic approaches (sequence-based)
Single motif methods: regex patterns (PROSITE)
Full domain alignment methods: profiles (Profile Library), HMMs (Pfam)
Multiple motif methods: identity matrices (PRINTS)

Patterns
Steps: sequence alignment; define the pattern around a motif; extract pattern sequences; build regular expression; pattern signature (PS00000).
Example pattern: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
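
To make the "build regular expression" step concrete, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression and searches a sequence with it. The function name `prosite_to_regex` and the test sequence are illustrative assumptions, not part of PROSITE or InterPro; only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts, `<` and `>` anchors) are handled.

```python
import re

# One pattern element: optional '<' anchor, a residue / x / [allowed] / {forbidden},
# an optional repeat count such as (2) or (1,3), and an optional '>' anchor.
_ELEMENT = re.compile(
    r"(<?)([A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex (hypothetical helper)."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element.strip())
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core in ("x", "X"):
            token = "."                           # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"       # forbidden residues
        else:
            token = core.upper()                  # single residue or [allowed set]
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + token + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
# Made-up sequence containing one occurrence of the pattern after "MK".
hit = re.search(regex, "MKCCAGGCSAAALAAACGG")
```

Here `regex` is `CC[^P].{2}C[STDNEKPI].{3}[LIVMFS].{3}C`, and `hit.start()` is 2 for the made-up sequence.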

Patterns
Advantages
Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies.
Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins, very short toxins with few conserved cysteines: C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks
High false positive/false negative rate.
Patterns mostly target functional residues: active sites, PTM sites, disulfide bridges, binding sites.

Fingerprints
Steps: sequence alignment; define motifs (Motif 1, Motif 2, Motif 3); extract motif sequences; weight matrices; fingerprint signature (PR00000).
The motifs must occur in the correct order and with the correct spacing.

The significance of motif context
Fingerprints identify small conserved regions in proteins.
Several motifs together characterise a family.
They offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours.
[Figure: motifs 1 to 5 along a sequence, annotated with their order and interval]
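
The order and interval constraints can be sketched as a simple check over motif hit positions. This is a deliberately strict illustration, not the scoring used by PRINTS: the function name `fingerprint_match`, the per-gap maximum distances and the example positions are all assumptions, and every motif is required to be present.

```python
from typing import Optional, Sequence


def fingerprint_match(
    hit_starts: Sequence[Optional[int]],
    max_gaps: Sequence[int],
) -> bool:
    """Return True if every motif was found, in order, within the allowed spacing.

    hit_starts[i] is the start position of the hit for motif i (None if absent).
    max_gaps[i] is the largest allowed distance between the starts of motif i
    and motif i + 1 (a hypothetical parameter for this sketch).
    """
    if len(max_gaps) != len(hit_starts) - 1:
        raise ValueError("need one maximum gap per pair of neighbouring motifs")
    if any(start is None for start in hit_starts):
        return False                      # a motif is missing
    for i, limit in enumerate(max_gaps):
        gap = hit_starts[i + 1] - hit_starts[i]
        if gap <= 0:
            return False                  # wrong order
        if gap > limit:
            return False                  # wrong spacing
    return True


# Three motifs found at positions 10, 40 and 90; gaps of up to 50 and 60 allowed.
ok = fingerprint_match([10, 40, 90], [50, 60])
```

With these made-up positions `ok` is `True`; swapping the first two hits, or moving the second one to position 100, makes the check fail.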

Profiles & HMMs
Steps: sequence alignment; define coverage (whole protein or entire domain); use the entire alignment for the domain or protein; build model; Profile or HMM signature.
Models insertions and deletions.
Notes: This is a good summary of how Profiles and HMMs are made. Take a multiple sequence alignment and either use the entire alignment (family model) or define the domain of interest (domain model). For a domain model, extract the sequence defining the domain from the alignment. Use the alignment to build a Profile matrix or an HMM. A signature match is either non-positional and defines family membership, or it defines the position of the domain on the protein.
[Figure: the view of a Profile or HMM hit in InterPro]
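
As a rough illustration of "use the alignment to build a Profile matrix", the sketch below builds a position-specific log-odds matrix from an ungapped alignment and slides it along a query. It is a simplification under stated assumptions: a uniform background, a flat pseudocount, and no insertions or deletions (which real profiles and HMMs do model). The function names are hypothetical; the two aligned sequences are the ones shown on the "What are protein signatures?" slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (bits) per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)            # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(seq[column] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, sequence):
    """Score every ungapped window of the sequence; return (score, start) of the best."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


alignment = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
profile = build_profile(alignment)
# Query: the first aligned sequence embedded in made-up flanking residues.
score, start = best_window(profile, "MMMM" + alignment[0] + "MMMM")
```

The sketch only accepts the 20 standard residues; a query containing any other character raises a `KeyError`.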

HMM databases
Sequence-based
PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
PANTHER: families/subfamilies model the divergence of specific functions
TIGRFAM: microbial functional family classification
Pfam: families & domains based on conserved sequence
SMART: functional domain annotation
Structure-based
SUPERFAMILY: models correspond to SCOP domains
Gene3D: models correspond to CATH domains

Why we created InterPro
By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database:
to simplify & rationalise protein analysis
to facilitate automatic functional annotation of uncharacterised proteins
to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases

InterPro Entry
Groups similar signatures together
Adds extensive annotation
Links to other databases
Structural information and viewers
Hierarchical classification

InterPro hierarchies: Families
FAMILIES can have parent/child relationships with other families.
Parent/child relationships are based on:
Comparison of protein hits: a child should be a subset of its parent, and siblings should not have matches in common
Existing hierarchies in member databases
Biological knowledge of curators
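
The two protein-hit rules (child is a subset of parent; siblings share no matches) can be expressed as set operations. The sketch below is only an illustration of those rules, not InterPro's curation tooling; the function name `check_hierarchy`, the entry names and the protein identifiers are all made up.

```python
def check_hierarchy(parent_hits, children_hits):
    """Return a list of human-readable problems; an empty list means the rules hold.

    parent_hits: set of protein identifiers matched by the parent entry.
    children_hits: dict mapping each child entry name to its set of matched proteins.
    """
    problems = []
    # Rule 1: every child must be a subset of the parent.
    for name, hits in sorted(children_hits.items()):
        outside = hits - parent_hits
        if outside:
            problems.append(f"{name} has hits outside the parent: {sorted(outside)}")
    # Rule 2: siblings must not have matches in common.
    names = sorted(children_hits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = children_hits[first] & children_hits[second]
            if shared:
                problems.append(f"{first} and {second} share hits: {sorted(shared)}")
    return problems


# Hypothetical entries and protein identifiers, for illustration only.
parent = {"protA", "protB", "protC", "protD"}
children = {
    "child_1": {"protA", "protB"},
    "child_2": {"protB", "protE"},
}
problems = check_hierarchy(parent, children)
```

With this made-up data the check reports two problems: `child_2` matches `protE`, which the parent does not, and the two children share `protB`.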

InterPro hierarchies: Domains DOMAINS can have parent/child relationships with other domains

Domains and Families may be linked through Domain Organisation Hierarchy

InterPro Entry
Groups similar signatures together
Adds extensive annotation
Links to other databases
Structural information and viewers
The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics.
Speaker note: talk more about how GO mapping is done in InterPro.

InterPro Entry
Groups similar signatures together
Adds extensive annotation
Links to other databases
Structural information and viewers
Linked resources shown: UniProt, KEGG ..., Reactome ..., IntAct ..., UniProt taxonomy, PANDIT ..., MEROPS ..., Pfam clans ..., PubMed

InterPro Entry
Groups similar signatures together
Adds extensive annotation
Links to other databases
Structural information and viewers
PDB: 3-D structures
SCOP: structural domains
CATH: structural domain classification

Searching InterPro

Searching InterPro
Protein family membership
Domain organisation
Domains, repeats & sites
GO terms

InterProScan access
Interactive: http://www.ebi.ac.uk/Tools/pfa/iprscan/
Webservice (SOAP and REST):
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_rest
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_soap
Download: ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/
