Published by Ross Merritt. Modified over 6 years ago.
1
Pfam: multiple sequence alignments and HMM-profiles of protein domains
Xianhui Li
2
Outline
What is Pfam?
What is a hidden Markov model (the methodology underlying Pfam)?
How to use Pfam, and sample output
3
Pfam
Pfam is a database of multiple alignments of protein domains or conserved protein regions. Each alignment represents an evolutionarily conserved structure that has implications for the protein's function. Profile hidden Markov models (profile HMMs) built from the Pfam alignments can automatically recognize that a new protein belongs to an existing protein family, even if the homology is weak.
4
Overview of Pfam Database
Pfam-A contains curated families, each with an associated profile HMM that can be used for alignment and database searching. Each family has:
Annotation: contains several compulsory fields.
Seed alignment: a manually verified multiple alignment of a representative set of sequences.
HMM profile: turns the multiple sequence alignment into a position-specific scoring system.
Full alignment: generated automatically from the seed HMM profile by searching Swissprot for all detectable members and aligning them to the HMM profile.
Pfam-B families are clustered automatically, allowing Pfam to be comprehensive.
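To make "position-specific scoring system" concrete, here is a minimal Python sketch that turns a toy alignment into per-column log-odds scores. It is a simplification, not the Pfam or HMMER procedure: it has no insert or delete states, it assumes a uniform background distribution and a flat pseudocount, and the alignment `SEED` and the names `build_profile` and `score` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Hypothetical toy seed alignment (not taken from Pfam); "-" marks a gap.
SEED = ["ACDE", "ACDE", "TCDE", "AC-E"]


def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in ALPHABET
        })
    return profile


def score(profile, segment):
    """Sum the column scores for an ungapped segment of profile length."""
    return sum(col[res] for col, res in zip(profile, segment))
```

A segment that matches the conserved columns scores higher than an unrelated one, for example `score(build_profile(SEED), "ACDE")` versus `score(build_profile(SEED), "WWWW")`. A real profile HMM adds insert and delete states, so it can also score sequences whose length differs from that of the alignment.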
5
Pfam Sequence Database Coverage
Figure 2. Pfam 2.0 contains domains from nearly half of all Swissprot 34 proteins. The automatic clusters in Pfam-B 2.0 contain domains from 33% of the Swissprot proteins that do not contain Pfam domains. When counting residue by residue, roughly a third of Swissprot is covered by Pfam and by Pfam-B each. Pfam-B does not include proteins known to be fragments or segments shorter than 30 residues; the figures for unique sequences are therefore overestimated. (Figure panels: residue, sequence.)
Data shown are from Pfam v2.0 (1998), with 527 families. The current version, Pfam 12.0 (January 2004), contains alignments and models for 7316 protein families, based on the Swissprot 42.5 and SP-TrEMBL 25.6 protein sequence databases.
6
Markov Model
Simplest example: each state emits (or, equivalently, recognizes) a particular element with probability 1.
States in the diagram: Begin, Emit 1, Emit 2, Emit 4, Emit 3, End.
Example sequences:
7
Probabilistic Emission
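If we let the states define a set of emission probabilities for elements, we can no longer be sure which state we are in given a particular element of a sequence. For example, the sequence BCCD can be produced by more than one path through the model.
Transition probabilities in the diagram: 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8, 1.0.
States and emission probabilities: Begin; A (0.8), B (0.2); B (0.7), C (0.3); C (0.1), D (0.9); C (0.6), A (0.4); End.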
8
Hidden Markov Models (HMM)
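Emission uncertainty means the sequence doesn't identify a unique path: the states are "hidden". The probability of a sequence is the sum, over all paths that can produce it, of each path's probability (the product of its transition and emission probabilities).
Model (same diagram as on the previous slide): transition probabilities 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8, 1.0; states Begin; A (0.8), B (0.2); B (0.7), C (0.3); C (0.1), D (0.9); C (0.6), A (0.4); End.
p(bccd) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * [...] + [...] * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * [...] = [...]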
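The "sum over all paths" idea can be sketched in Python as follows. The emission probabilities are the ones listed on the slide. The transition structure in `TRANS` is an assumption, because the diagram itself is not recoverable here; it was chosen only so that it uses the listed transition probabilities. The state names `S1` to `S4` and the function names are hypothetical. The sketch computes the sequence probability twice: by enumerating every state path, and with the forward algorithm, which obtains the same sum without enumerating paths.

```python
from itertools import product

# Emission probabilities as listed on the slide (state names are made up).
EMIT = {
    "S1": {"A": 0.8, "B": 0.2},
    "S2": {"B": 0.7, "C": 0.3},
    "S3": {"C": 0.1, "D": 0.9},
    "S4": {"C": 0.6, "A": 0.4},
}

# ASSUMED transition structure: the slide's diagram is not available,
# so which state connects to which is a guess for illustration only.
TRANS = {
    "Begin": {"S1": 0.5, "S2": 0.5},
    "S1": {"S1": 0.9, "S2": 0.1},
    "S2": {"S3": 0.25, "S4": 0.75},
    "S4": {"S4": 0.2, "S3": 0.8},
    "S3": {"End": 1.0},
}


def path_probability(path, seq):
    """Probability of emitting seq along one specific state path."""
    prob = 1.0
    prev = "Begin"
    for state, symbol in zip(path, seq):
        prob *= TRANS[prev].get(state, 0.0) * EMIT[state].get(symbol, 0.0)
        prev = state
    return prob * TRANS[prev].get("End", 0.0)


def brute_force(seq):
    """Sum path probabilities over every possible state path."""
    return sum(path_probability(p, seq)
               for p in product(EMIT, repeat=len(seq)))


def forward(seq):
    """Forward algorithm for a non-empty seq: same sum, no enumeration."""
    # f[s] = probability of emitting the prefix read so far and ending in s
    f = {s: TRANS["Begin"].get(s, 0.0) * EMIT[s].get(seq[0], 0.0)
         for s in EMIT}
    for symbol in seq[1:]:
        f = {s: EMIT[s].get(symbol, 0.0)
                * sum(f[r] * TRANS[r].get(s, 0.0) for r in EMIT)
             for s in EMIT}
    return sum(f[s] * TRANS[s].get("End", 0.0) for s in EMIT)
```

Under the assumed transitions, `forward("BCCD")` and `brute_force("BCCD")` agree. Brute force visits a number of paths that grows exponentially with sequence length, whereas the forward recursion does work proportional to the sequence length times the square of the number of states.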
9
HMMs for homology
Homology model: ancestral residue (match) states, insertion states, deletion states.
States in the diagram: start, insert, match, delete, end.
10
Profile HMM
11
Searching Pfam
Web site: users can search query protein sequences against one, a few, or all Pfam HMMs. _http:// _http://genome.wustl.edu/Pfam
Software: users can search locally with the Pfam profile HMMs using the freely available HMMER software package at:
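A local search might be scripted roughly as below. This is a sketch under several assumptions: that HMMER 3 is installed and on the PATH (the slides date from an earlier HMMER release, whose commands differ), that the Pfam HMM file has already been indexed once with `hmmpress`, and that the file names are placeholders. `build_hmmscan_command` and `run_search` are hypothetical helper names, not part of HMMER.

```python
import subprocess


def build_hmmscan_command(hmm_db, query_fasta, table_out, use_gathering=True):
    """Build an hmmscan command line as an argument list (HMMER 3 assumed)."""
    cmd = ["hmmscan", "--domtblout", table_out]  # per-domain tabular output
    if use_gathering:
        cmd.append("--cut_ga")  # use the gathering thresholds stored in the models
    cmd += [hmm_db, query_fasta]
    return cmd


def run_search(hmm_db, query_fasta, table_out):
    """Run the search; hmm_db must already have been indexed with hmmpress."""
    subprocess.run(build_hmmscan_command(hmm_db, query_fasta, table_out),
                   check=True)
```

For example, `run_search("Pfam-A.hmm", "query.fasta", "hits.domtbl")` would write one line per domain hit to `hits.domtbl`, provided those (hypothetical) files exist.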
12
Sample Pfam Query Results
Score | Query Start | Query End | Hmm Start | Hmm End | Pfam Family | Description
97.57 104 153 1 50 DAG_PE-bind Phorbol ester / diacylglycerol binding domain
92.44 169 216 DAG_PE-bind
137.88 240 328 92 C2 C2 domain
276.16 413 674 247 pkinase Eukaryotic protein kinase domain
84.44 675 741 69 pkinase_C Protein kinase C terminal domain
70.99 807 857 17
13
Acknowledgements
Some slides adapted from lectures by Larry Hunter at University of Colorado Health Sciences Center.
Altmann Lab for critical comments.