Identification of Protein Domains Eden Dror Menachem Schechter Computational Biology Seminar 2004.


Overview Introduction to protein domains. –Classification of homologs. Representing a domain. –PSSM –HMM Internet resources –Pfam –SMART –PROSITE –InterPro Research example.

Protein domains A discrete portion of a protein assumed to fold independently, and possessing its own function. Mobile domain ("module"): a domain that can be found associated with different domain combinations in different proteins.

Protein domains The assumption: The domain is the fundamental unit of protein structure and function. Protein family – all proteins containing a specific domain.

What can we learn from them? Common ancestry and homology relationships among a set of proteins. Homology can imply properties of a protein such as function and localization. Therefore, domains can be used to assign a new protein to a family and so infer its function.

Classification of homologs Homology is not a sufficiently well-defined term to describe the evolutionary relationships between genes. Homologous genes can arise in two major ways: –Gene duplication (in the same species). –Speciation (splitting of one species into two).

Classification of homologs

Orthologs – Two genes from two different species that derive from a single gene in the last common ancestor of the species. Paralogs – Two genes that derive from a single gene that was duplicated within a genome.

Classification of homologs [Figure: diagram labeling paralogous ("para") and orthologous ("ortho") relationships.]

Classification of homologs Inparalogs - paralogs that evolved by gene duplication after the speciation event. Outparalogs - paralogs that evolved by gene duplication before the speciation event.

Classification of homologs [Figure: diagram labeling out-paralogs ("out-para") and in-paralogs ("in-para") when comparing human with worm.]

What can we learn from them? Orthologous proteins are evolutionary, and typically functional, counterparts in different species. Paralogous proteins are important for detecting lineage-specific adaptations. Both can reveal information about a specific species or a set of species.

Protein domains – summary By identifying domains we can: –Infer the function and localization of a protein. –Learn about a specific species. –Learn about a set of species as a group.

Domain representation Different methods to represent (model) domains: Patterns (regular expressions). PSSM (Position specific score matrix). HMM (Hidden Markov model).
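To make the first representation concrete, here is a minimal sketch of how a PROSITE-style pattern can be turned into an ordinary regular expression. The helper name `prosite_to_regex` and the example pattern are assumptions made for illustration (the pattern is written in the style of a C2H2 zinc finger signature; consult PROSITE for the authoritative entry), and only the basic syntax elements are handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only the basic syntax: single residues, x (any residue),
    [ABC] (one of), {ABC} (none of) and repeat counts such as (3) or (2,4).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count, e.g. "x(2,4)" -> ("x", "2,4").
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, count = m.groups()
        if core == "x":
            core = "."                       # x means any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {AB} means anything except A or B
        # [ABC] and single letters are already valid regex fragments.
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

# Illustrative pattern only; not copied from a database release.
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = re.compile(prosite_to_regex(PATTERN))

if __name__ == "__main__":
    sequence = "AACKKCAAALAAAAAAAAHAAAHGG"  # toy sequence
    match = regex.search(sequence)
    print(regex.pattern, match.span() if match else None)
```

A pattern either matches or it does not, which is why the PSSM and HMM representations, which assign graded scores, are usually more sensitive for divergent family members.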

PSSM: Position specific score matrix A matrix giving the score for each amino acid at each position of the modeled region. Based on the independent probabilities P(a|i) of observing amino acid a in position i.

PSSM: Example

PSSM: Identifying a domain Given a sequence and a PSSM: Run over all positions. Score each sub-sequence according to the matrix.
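The scanning procedure can be sketched as follows. This is a toy illustration under stated assumptions: a four-letter alphabet instead of the twenty amino acids, a uniform background, a made-up ungapped alignment, and a pseudocount of 1 to avoid taking the logarithm of zero. The function names `build_pssm` and `scan` are hypothetical.

```python
import math

def build_pssm(aligned, background, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, ungapped aligned sequences."""
    alphabet = sorted(background)
    pssm = []
    for i in range(len(aligned[0])):
        column = [seq[i] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        # Score for residue a at position i: log( P(a|i) / q_a ).
        pssm.append({
            a: math.log(((column.count(a) + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm

def scan(sequence, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (start, sum(pssm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

# Toy data (assumptions): alphabet, background and alignment are made up.
BACKGROUND = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
ALIGNMENT = ["ACD", "ACD", "ACE"]

if __name__ == "__main__":
    hits = scan("EEACDEE", build_pssm(ALIGNMENT, BACKGROUND))
    best_start, best_score = max(hits, key=lambda h: h[1])
    print(best_start, round(best_score, 3))
```

In practice a window is reported as a domain match only when its score exceeds a chosen threshold; the threshold is a design decision that trades sensitivity against false positives.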

HMM: Hidden Markov Model Markov model: a way of describing a process that goes through a series of states. Each state has a probability of transitioning to the other states. x_i is the random variable for the state at step i. [Figure: chain of states x_1, x_2, x_3, x_4.]

HMM: Markov Model Example: states take values in {0,1}. Three sample state sequences: (x_1=0, x_2=0, x_3=0, x_4=0); (x_1=1, x_2=0, x_3=0, x_4=1); (x_1=0, x_2=1, x_3=1, x_4=1).

HMM: Markov Model Transition matrix A: $A_{ij} = P(x_{t+1} = j \mid x_t = i)$. [Figure: chain of states x_1, x_2, x_3, x_4.]

HMM: Markov Model State transition example: States are the nucleotides A, T, G, C.
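As a small worked example of the nucleotide chain, the probability of a state sequence is the probability of its first state multiplied by the transition probabilities along the path. The transition values and the uniform initial distribution below are invented for illustration and are not taken from the slides.

```python
# Hypothetical transition probabilities: TRANSITION[prev][curr].
TRANSITION = {
    "A": {"A": 0.4, "T": 0.2, "G": 0.2, "C": 0.2},
    "T": {"A": 0.1, "T": 0.5, "G": 0.2, "C": 0.2},
    "G": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "C": {"A": 0.3, "T": 0.3, "G": 0.3, "C": 0.1},
}
# Assumed uniform distribution over the first state.
INITIAL = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

def chain_probability(states, initial=INITIAL, transition=TRANSITION):
    """P(x_1, ..., x_n) = P(x_1) * product over t of P(x_t | x_{t-1})."""
    prob = initial[states[0]]
    for prev, curr in zip(states, states[1:]):
        prob *= transition[prev][curr]
    return prob

if __name__ == "__main__":
    print(chain_probability("AATG"))
```

Each row of the transition table sums to 1, since from any state the chain must move to some state.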

HMM: Hidden Markov Model Hidden Markov model: Each state x emits an output y with a specific probability. We only know the outputs (observations). Thus, the states are hidden. [Figure: hidden states x_1, x_2, x_3, x_4, each emitting an output y_1, y_2, y_3, y_4.]

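HMM: Hidden Markov Model Example: states take values in {0,1}, outputs take values in {0,1}. First sample: outputs (y_1=1, y_2=1, y_3=0, y_4=0) with states (x_1=0, x_2=1, x_3=1, x_4=1). Second sample: outputs (y_1=1, y_2=0, y_3=1, y_4=0) with states (x_1=1, x_2=0, x_3=0, x_4=1).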

HMM: Hidden Markov Model Emission matrix B: $B_{ij} = P(y_t = j \mid x_t = i)$. [Figure: hidden states x_1, x_2, x_3, x_4, each emitting an output y_1, y_2, y_3, y_4.]

HMM: What can we do with it? Given (A, B): –Probability of given states and outputs. –Most likely sequence of states that generated a given output sequence. –Probability of a given output sequence.
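The three computations listed above can be sketched for a two-state, two-output HMM. The numbers in A and B are invented, and the initial distribution PI is an extra assumption that the slides do not specify. The joint probability is a direct product, the most likely state path uses the Viterbi algorithm, and the probability of an output sequence uses the forward algorithm.

```python
# Hypothetical parameters: A[i][j] = P(next state j | state i),
# B[i][k] = P(output k | state i), PI[i] = P(first state i) (assumed).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
PI = [0.5, 0.5]

def joint_probability(states, outputs):
    """P(states, outputs): known path, known observations."""
    p = PI[states[0]] * B[states[0]][outputs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][outputs[t]]
    return p

def forward(outputs):
    """P(outputs), summing over all hidden state paths (forward algorithm)."""
    n = len(PI)
    alpha = [PI[s] * B[s][outputs[0]] for s in range(n)]
    for y in outputs[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in range(n)) * B[s][y]
                 for s in range(n)]
    return sum(alpha)

def viterbi(outputs):
    """Most likely hidden state path for the given outputs."""
    n = len(PI)
    best = [PI[s] * B[s][outputs[0]] for s in range(n)]
    back = []
    for y in outputs[1:]:
        pointers, scores = [], []
        for s in range(n):
            r = max(range(n), key=lambda r: best[r] * A[r][s])
            pointers.append(r)
            scores.append(best[r] * A[r][s] * B[s][y])
        back.append(pointers)
        best = scores
    state = max(range(n), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):   # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]

if __name__ == "__main__":
    ys = [1, 1, 0, 0]
    print(viterbi(ys), forward(ys))
```

For long sequences these products underflow, so practical implementations usually work with log probabilities or rescale at each step.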

HMM: What can we do with it? Learning: Given state and output sequences calculate the most probable (A, B). Easy when the states are known. Otherwise: use a training algorithm.
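When the states are known, the most probable (A, B) can be obtained by counting transitions and emissions and normalizing each row. The sketch below assumes integer-coded states and outputs; the function name `estimate_parameters` and the optional pseudocount are illustrative choices. When the states are hidden, an iterative training algorithm such as Baum-Welch is the usual alternative and is not shown here.

```python
def normalize(row):
    """Turn a row of counts into probabilities (left unchanged if all zero)."""
    total = sum(row)
    return [c / total for c in row] if total else row

def estimate_parameters(state_seqs, output_seqs, n_states, n_outputs,
                        pseudocount=0.0):
    """Maximum-likelihood (A, B) from paired state and output sequences."""
    trans = [[pseudocount] * n_states for _ in range(n_states)]
    emit = [[pseudocount] * n_outputs for _ in range(n_states)]
    for states, outputs in zip(state_seqs, output_seqs):
        for x, y in zip(states, outputs):
            emit[x][y] += 1            # state x emitted output y
        for prev, curr in zip(states, states[1:]):
            trans[prev][curr] += 1     # transition prev -> curr
    return [normalize(r) for r in trans], [normalize(r) for r in emit]

if __name__ == "__main__":
    # Two short training examples with known states (toy data).
    states = [[0, 1, 1, 1], [1, 0, 0, 1]]
    outputs = [[1, 1, 0, 0], [1, 0, 1, 0]]
    A_hat, B_hat = estimate_parameters(states, outputs, 2, 2)
    print(A_hat, B_hat)
```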

HMM: Profile HMM Use HMM to represent sequence families. A particular type of HMM suited to modeling multiple alignments. (Assume we have a multiple alignment).

HMM: Trivial profile HMM We begin with ungapped regions. Each position corresponds to a state. Transitions are of probability 1.

HMM: Trivial profile HMM Let $e_i(a)$ be the independent probability of observing amino acid a in position i. The probability of a new sequence x of length L, according to the model: $P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$

HMM: Trivial profile HMM We can score the sequence x by its log-odds ratio: $S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}$ where $q_a$ is the probability of amino acid a under a random model.

HMM: Trivial profile HMM Consider the values $\log \frac{e_i(a)}{q_a}$. They behave like elements in a score matrix. The trivial profile HMM is equivalent to a PSSM.
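The equivalence can be checked numerically: the log-odds ratio of the model probability to the random-model probability equals the sum of per-position entries log(e_i(a)/q_a) looked up in a matrix. The emission and background values below are made up, and a three-letter alphabet is used to keep the example short.

```python
import math

# Hypothetical per-position emission probabilities e_i(a) and background q_a.
EMISSIONS = [
    {"A": 0.7, "C": 0.2, "D": 0.1},
    {"A": 0.1, "C": 0.8, "D": 0.1},
    {"A": 0.3, "C": 0.3, "D": 0.4},
]
BACKGROUND = {"A": 0.5, "C": 0.3, "D": 0.2}

def model_probability(x, emissions=EMISSIONS):
    """P(x | M): product of e_i(x_i) over positions."""
    return math.prod(e[a] for e, a in zip(emissions, x))

def random_probability(x, q=BACKGROUND):
    """P(x | R): product of q_{x_i} over positions."""
    return math.prod(q[a] for a in x)

def to_pssm(emissions=EMISSIONS, q=BACKGROUND):
    """Score matrix with entries log(e_i(a) / q_a)."""
    return [{a: math.log(e[a] / q[a]) for a in q} for e in emissions]

def pssm_score(x, pssm):
    """Sum of the matrix entries selected by the residues of x."""
    return sum(column[a] for column, a in zip(pssm, x))

if __name__ == "__main__":
    x = "ACD"
    log_odds = math.log(model_probability(x) / random_probability(x))
    print(log_odds, pssm_score(x, to_pssm()))  # the two values agree
```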

HMM: profile HMM Let's generalize the model by allowing for gaps: insertions and deletions. We start from the PSSM-equivalent HMM.

HMM: profile HMM Handling insertions: Introduce new states I_j, which match insertions after position j. These states emit amino acids with the random-model (background) probabilities.

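HMM: profile HMM The score of a gap of length k: $\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}$ where $a_{st}$ is the probability of the transition from state s to state t.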
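A short sketch of the gap score, assuming (as in the standard profile HMM formulation) that insert states emit with background probabilities, so emissions contribute nothing to the log-odds and only transitions count: one transition into the insert state, k-1 self-loops, and one transition back to the next match state. The transition probabilities used here are invented.

```python
import math

def gap_score(k, a_match_to_insert, a_insert_to_insert, a_insert_to_match):
    """Log-odds cost of an insertion of length k >= 1.

    Equals log a(M_j -> I_j) + (k - 1) * log a(I_j -> I_j)
           + log a(I_j -> M_{j+1}).
    """
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return (math.log(a_match_to_insert)
            + (k - 1) * math.log(a_insert_to_insert)
            + math.log(a_insert_to_match))

if __name__ == "__main__":
    # Hypothetical transition probabilities.
    for k in (1, 2, 5):
        print(k, round(gap_score(k, 0.05, 0.4, 0.6), 3))
```

The result has the same shape as an affine gap penalty: a fixed cost for opening the gap plus a constant cost for each additional residue, except that here the costs can differ from position to position.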

HMM: profile HMM Handling deletions: Introduce silent states D_j. These states do not emit.

HMM: profile HMM The complete profile HMM:

Internet resources Databases of protein families. Family information and identification. Considerations: –Type of representation (pattern, PSSM, HMM). –Choice of seed multiple alignment proteins. –Quality control. –Database features (links, annotations, views). –Database Specificity (organism, functions).

Pfam: Home

Pfam Protein families database of alignments and HMMs Uses profile-HMMs to represent families. For each family in Pfam you can: –Look at multiple alignments –View protein domain architectures –Examine species distribution –Follow links to other databases –View known protein structures

Pfam: Databases 2 databases: Pfam-A – curated multiple alignments. –Grows slowly. –Quality controlled by experts. Pfam-B – automatic clustering (ProDom derived). –Complements Pfam-A. –New sequences instantly incorporated. –Unchecked: false positives, etc.

Pfam: Features Search by: Sequence, keyword, domain, taxonomy. Browsing by family or genome. Evolutionary tree

Pfam: Construction Source of seed alignments: –Pfam-B families. –Published articles. –'domain hunting' studies. –occasionally using entries from other databases (e.g. MEROPS for peptidases).

Pfam: Domain information

Pfam: Domain organization

Pfam: Multiple alignment

Pfam: HMM logo

Pfam: Species distribution

Pfam: Genome comparison

PROSITE Database of protein families. Matching according to simple patterns or PSSM profiles. Browsing all proteins of a specific family. The latest release covers 1696 protein families.

PROSITE: Features Comprehensive domain documentation. All profile matches checked by experts. Specificity/sensitivity: –Specificity: true-pos/(true-pos + false-pos) –Sensitivity: true-pos/(true-pos + false-neg)
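The two ratios can be computed directly from match counts. This sketch reads "all-pos" as all reported matches, that is true positives plus false positives, which is an interpretation of the slide rather than something it states; the counts in the usage lines are made up.

```python
def specificity(true_pos, false_pos):
    """Fraction of reported matches that are real family members."""
    return true_pos / (true_pos + false_pos)

def sensitivity(true_pos, false_neg):
    """Fraction of real family members that were reported as matches."""
    return true_pos / (true_pos + false_neg)

if __name__ == "__main__":
    # Hypothetical counts for one pattern against a sequence database.
    tp, fp, fn = 90, 10, 30
    print(specificity(tp, fp), sensitivity(tp, fn))
```

A very strict pattern tends to raise the first ratio and lower the second, while a permissive one does the opposite.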

PROSITE: Example Specificity of Zinc finger C2H2 type domain

SMART

Simple Modular Architecture Research Tool Identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART consists of a library of HMMs. It contains 665 HMMs to date.

SMART: Features –Finding proteins containing specific domains, i.e. of the same family. –Function prediction. –Sub-cellular localization. –Binding partners. –Architecture. –Alternative splicing information. –Orthology information.

SMART: Domain selection example Tyrosine kinase (TyrKc) AND Transmembrane region (TRANS)

InterPro InterPro combines 9 other databases such as SMART, Pfam, ProDom and more. Queries can use many different methods (as the member databases use different methods). However, thresholds are predefined and cannot be changed for those methods.

InterPro Provides more results, but can sometimes be redundant. Coverage statistics: 93% of Swiss-Prot v42.5 – out of proteins 81% of TrEMBL v25.5 – out of proteins

InterPro: Features Searching by Protein/DNA sequences Finding domains & homologs List of InterPro entries of type: –Family –Domain –Repeat –PTM (post-translational modification) –Binding Site –Active Site –Keyword

InterPro: Example Kringle domain

Research Example: Introduction Goal: The systematic identification of novel protein domain families. Using computational methods.

Research Example: Method –Derive set of 107 nuclear domains. –Extract proteins. –Extract unannotated regions. –Cluster sequences. –Take longest member. –PSI-BLAST. –Investigate homologous regions. –Manual confirmation.
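One step of the method, extracting unannotated regions, can be sketched as an interval computation: given a protein length and the coordinates of its known domains, return the stretches not covered by any domain. This is an illustrative reconstruction and not the authors' code; the function name, the 1-based inclusive coordinates, and the minimum-length filter are all assumptions.

```python
def unannotated_regions(length, domains, min_length=1):
    """Return (start, end) stretches of a protein not covered by any domain.

    length     : protein length in residues
    domains    : list of (start, end) pairs, 1-based and inclusive
    min_length : discard uncovered stretches shorter than this
    """
    regions = []
    cursor = 1  # first residue not yet known to be covered
    for start, end in sorted(domains):
        if start - cursor >= min_length:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)   # also handles overlapping domains
    if length - cursor + 1 >= min_length:
        regions.append((cursor, length))
    return regions

if __name__ == "__main__":
    # Hypothetical protein of 100 residues with three annotated domains.
    print(unannotated_regions(100, [(10, 30), (25, 50), (80, 90)]))
```

The resulting stretches would then be the input to the clustering and PSI-BLAST steps of the protocol.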

Research Example: Results 28 New Domains identified: 15 domains in diverse contexts, in different species. 3 domains species specific. 7 domains with weak similarity to previously described domains. 3 extension domains.

Predictions of Function On the basis of reports in literature and/or occurrence with other identified domains, functional features can be predicted for our novel domain families. Examples: –Chromatin binding –Protein Interaction –Predicted sub-cellular localization

Predictions of Function: Chromatin-Binding example The novel domain CSZ is contained in protein SPT6, which regulates transcription via chromatin structure modification. SPT6 has a histone-binding capability, experimentally confirmed. Other domains (S1, SH2) in SPT6 are unlikely to bind histones or chromatin. Conclusion: CSZ has a predicted histone binding function.

Predictions of Function: Localization example Some of the novel domains are only found within proteins from the initial set of nuclear domains. This predicts that these domains have a nuclear function. The other domains are likely to have roles in both nucleus and cytoplasm.

Conclusion Domains are the functional units of proteins. Identifying a domain within a new protein may teach us much about it. There are several types of models to represent domains. These models can also be used to identify the domain they represent. Many Internet databases are available to catalogue and identify families. New domains can be identified with a protocol that starts from known ones.

Resources Pfam: SMART: PROSITE: InterPro:

The End