Protein Families, Motifs & Domains.


The Annotation Process DNA SEQUENCE Useful Information ANALYSIS SOFTWARE Annotator

A Common Mistake! BLAST PROTEIN SEQUENCE Function Annotator

Protein Families, Motifs & Domains. Outline: a word about BLAST and FASTA; sequence alignment; domains; Prosite; Pfam/HMMs; SignalP/TMHMM.

BLAST Local Alignment Suggests the presence of a common domain between two proteins. However, common domains can be conserved between proteins with very different functions. E.g. ATP binding is common to many proteins.

BLAST/FASTA FASTA is a global alignment tool; BLAST is local. [Diagram: BLAST and FASTA joined by an arrow labeled "Reduces sensitivity, increases specificity".]

Using FASTA Global Alignment Annotation gained from homology hits is only as good as the annotation you are transferring. E.g. there are two different genes called ESAG2 in SWALL. Small changes in "your gene" might confer functional differences.

FASTA 10^-5: Low-scoring hits can give good alignments

10^-8: High-scoring hits can give poor alignments

The big problem with searching public databases is… There is a need to reduce the number of sequences we search and to prevent bad annotation from spreading.

Protein Families, Motifs & Domains. Proteins with common functions have some common features. Domains and motifs are derived from conserved residues. Families can be grouped, and profiles and HMMs derived. There is more to life than BLAST.

Sequence Alignment Sequence alignments allow us to see which residues are important to a family of proteins. This lets us make motifs/profiles/fingerprints/HMMs to define families.
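As a rough illustration of reading conservation off an alignment, the sketch below lists the residues seen in each column of a small ungapped alignment and reports the fully conserved columns. The function names are hypothetical, and the five peptides are the ones used as the Prosite pattern example in these slides; real alignments contain gaps and would need a proper alignment library.

```python
def column_summary(aligned):
    """Return, for each column of an ungapped alignment, the set of residues seen."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    return [set(column) for column in zip(*aligned)]


def conserved_positions(aligned):
    """Return 0-based indices of columns where every sequence has the same residue."""
    return [i for i, residues in enumerate(column_summary(aligned))
            if len(residues) == 1]


# Five aligned peptides (upper-cased) taken from the Prosite example in these slides.
seqs = ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYHGDGGS", "CYRGDGNT"]

for index, residues in enumerate(column_summary(seqs)):
    print(index, "".join(sorted(residues)))
print("fully conserved columns:", conserved_positions(seqs))
```

Columns with a single residue are candidates for fixed positions in a motif; columns with a small residue set are candidates for bracketed alternatives.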

Domains A domain is a functional part of a protein. It may contain amino acid sequence motifs that can be used to identify it. A set of more than one motif is known as a fingerprint.

DOMAINS Motifs Prosite Fingerprints Blocks Pfam (HMMs) Domain Alignment Fingerprints Blocks Pfam (HMMs)

Prosite http://us.expasy.org/prosite/ Maintained at the Swiss Institute of Bioinformatics. All motifs are checked for false positives and fine-tuned. Sometimes a family can be defined by more than one expression. Fingerprints and BLOCKS automatically scan proteins for a number of motifs. http://bioinf.man.ac.uk/dbbrowser/PRINTS/ http://blocks.fhcrc.org/help/

Prosite (Bairoch et al. (1997) NAR 25(1) 217-221) Single most conserved motifs, referred to as regular expressions or patterns. E.g. the sequences cydeggis, cyedggis, cyeeggit, cyhgdggs and cyrgdgnt are all described by the pattern C-Y-x(2)-[DG]-G-x-[ST].
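Because a Prosite pattern is essentially a regular expression, it can be translated into one mechanically. The sketch below is a minimal translator, not the official Prosite scanner: the function name is hypothetical, it covers only fixed residues, x, [..] and {..} sets and repeat counts, and it accepts both the x(2) notation and the bare x2 shorthand for a repeat.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, plus an optional repeat.
ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"      # core element
    r"(?:\((\d+)(?:,(\d+))?\)|(\d+))?"       # (n), (n,m) or bare n
)


def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high, shorthand = match.groups()
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            regex = core                       # fixed residue or [allowed set]
        if shorthand:
            regex += "{" + shorthand + "}"
        elif low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("C-Y-x(2)-[DG]-G-x-[ST]")
peptides = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]
print(regex, [bool(re.fullmatch(regex, p.upper())) for p in peptides])
```

The five example peptides all match the translated expression, while a peptide that breaks the [DG] position does not.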

Prosite PROSITE: PS00002
ID   GLYCOSAMINOGLYCAN; RULE.
AC   PS00002;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
DE   Glycosaminoglycan attachment site.
PA   S-G-x-G.
RU   Additional rules:
RU   There must be at least two acidic amino acids (Glu or Asp) from -2 to
RU   -4 relative to the serine.
CC   /TAXO-RANGE=??E??;
CC   /SITE=1,glycosaminoglycan;
CC   /SKIP-FLAG=TRUE;
DO   PDOC00002;
//
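The additional rule in this entry cannot be written in the pattern itself, so it has to be checked separately. The sketch below shows one way to do that, assuming that "from -2 to -4 relative to the serine" means the three residues at positions -4, -3 and -2. The function name and the test sequences are hypothetical.

```python
import re


def gag_attachment_sites(sequence):
    """Return 0-based indices of serines that match S-G-x-G and the acidic rule."""
    sequence = sequence.upper()
    sites = []
    # A lookahead is used so that overlapping candidate sites are all examined.
    for match in re.finditer(r"(?=SG.G)", sequence):
        serine = match.start()
        # Residues at positions -4, -3 and -2 relative to the serine
        # (fewer if the serine is close to the N-terminus).
        upstream = sequence[max(0, serine - 4):max(0, serine - 1)]
        acidic = sum(residue in "DE" for residue in upstream)
        if acidic >= 2:
            sites.append(serine)
    return sites


# Made-up sequences for illustration only.
print(gag_attachment_sites("AEDASGAG"))  # two acidic residues upstream
print(gag_attachment_sites("AAAASGAG"))  # pattern matches, rule fails
```

A plain pattern scan would report both sequences; only the first passes once the rule is applied, which is why the entry is flagged as a RULE and why the documentation needs to be read.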

Prosite documentation entry
*************************************
* Glycosaminoglycan attachment site *
*************************************
Proteoglycans [1] are complex glycoconjugates containing a core protein to which a variable number of glycosaminoglycan chains (such as heparin sulfate, chondroitin sulfate, etc.) are covalently attached. The glycosaminoglycans are attached to the core proteins through a xyloside residue which is in turn linked to a serine residue of the protein. A consensus sequence for the attachment site seems to exist [2]. However, it must be noted that this consensus is only based on the sequence of three proteoglycan core proteins.
-Consensus pattern: S-G-x-G [S is the attachment site]
 Additional rule: There must be at least two acidic amino acids from -2 to -4 relative to the serine.
-Last update: June 1988 / First entry.
[ 1] Hassel J.R., Kimura J.H., Hascall V.C. Annu. Rev. Biochem. 55:539-567(1986).
[ 2] Bourdon M.A., Krusius T., Campbell S., Schwarz N.B. Proc. Natl. Acad. Sci. U.S.A. 84:3194-3198(1987).

Prosite A Prosite hit is a binary piece of information (true/false). However, some motifs are very simple, so they give many false positives. Some motifs should be found together. The documentation must always be read.

Hidden Markov Models Probabilistic models linking interconnecting states. Profile HMMs represent linear chains of match, delete or insert states. Each position in an alignment is assigned M, I or D. There is a defined probability of moving from one state to the next.
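The idea of a defined probability for each move between states can be sketched with a toy two-column profile HMM. All the numbers below are made up for illustration, the state and function names are hypothetical, and emission probabilities (the chance of seeing a given residue in a match or insert state) are left out, so this is only the transition half of a real profile HMM.

```python
import math

# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "begin": {"M1": 0.90, "I0": 0.05, "D1": 0.05},
    "I0":    {"M1": 0.80, "I0": 0.15, "D1": 0.05},
    "M1":    {"M2": 0.90, "I1": 0.05, "D2": 0.05},
    "I1":    {"M2": 0.80, "I1": 0.15, "D2": 0.05},
    "D1":    {"M2": 0.85, "I1": 0.05, "D2": 0.10},
    "M2":    {"end": 0.95, "I2": 0.05},
    "I2":    {"end": 0.85, "I2": 0.15},
    "D2":    {"end": 0.95, "I2": 0.05},
}


def rows_are_normalised(transitions, tolerance=1e-9):
    """Check that the outgoing probabilities of every state sum to 1."""
    return all(abs(sum(row.values()) - 1.0) < tolerance
               for row in transitions.values())


def path_log_probability(path, transitions=TRANSITIONS):
    """Sum the log transition probabilities along a path of state names."""
    total = 0.0
    for source, target in zip(path, path[1:]):
        probability = transitions.get(source, {}).get(target, 0.0)
        if probability == 0.0:
            return float("-inf")  # the model does not allow this move
        total += math.log(probability)
    return total


match_path = ["begin", "M1", "M2", "end"]
delete_path = ["begin", "D1", "M2", "end"]
print(math.exp(path_log_probability(match_path)))
print(math.exp(path_log_probability(delete_path)))
```

Logarithms are summed rather than multiplying raw probabilities because long paths would otherwise underflow; with these numbers the all-match path is far more probable than the path that skips the first column.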

Hidden Markov Models [Diagram: profile HMM with a begin state, match states M1 to M4, insert states I0 to I4 and delete states D1 to D4.]

Pfam Pfam 7.0 contains a total of 3360 families. Pfam is a database of two parts: Pfam-A (curated) and Pfam-B (automatically generated). All HMMs have a seed alignment, which is added to using the HMMER package.

Pfam http://www.sanger.ac.uk/Software/Pfam/ http://pfam.wustl.edu/

Pfam

Pfam

Pfam annotation

Pfam Scores E-value: the expect value, as for BLAST; the number of hits expected by chance with a score at least this good. Noise cutoff: the HMM score below which a hit is uninteresting. Trusted cutoff: the HMM score above which there should be no false positives.
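The two cutoffs split hits into three bands, which can be expressed as a small helper. This is a sketch only: the function name and labels are hypothetical, the scores are made up, and treating a score exactly equal to a cutoff as belonging to that band is an assumption.

```python
def classify_hit(score, noise_cutoff, trusted_cutoff):
    """Label an HMM score relative to a family's noise and trusted cutoffs."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:
        return "trusted"      # no false positives expected at or above this score
    if score > noise_cutoff:
        return "borderline"   # between the cutoffs: inspect by hand
    return "noise"            # at or below the noise cutoff: uninteresting


# Made-up scores against a family with noise cutoff 20 and trusted cutoff 25.
for score in (31.4, 22.0, 12.5):
    print(score, classify_hit(score, noise_cutoff=20.0, trusted_cutoff=25.0))
```

Hits in the middle band are the ones where the family documentation and the alignment itself matter most.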

                                                                          

Searching for Specific Domains Signal Peptides Secreted/targeted proteins Transmembrane Domains Membrane bound proteins

InterPro curation

TMHMM What is a transmembrane domain?

TMHMM http://www.cbs.dtu.dk/services/TMHMM/

SIGNALP What is a signal peptide? Any protein that has to be targeted to a specific part of the cell requires a signal peptide. The signal peptide ensures that the protein is translated at the ER, where it can enter the secretory pathway. I.e., the signal peptide suggests a cellular (or extracellular) location other than the cytoplasm.

SIGNALP

Using secondary databases for functional assignments Better, more detailed, professional annotation. More powerful and sensitive search methods: HMMs, profiles, weight matrices. Coverage is not as good.

The Gene Prediction Process BLAST FASTA SignalP DNA SEQUENCE Functional Assignments ANALYSIS SOFTWARE TMHMM Pfam Prosite Annotator