Published by Sherilyn Goodman. Modified over 9 years ago.
1
Classifying the protein universe
Synapse-Associated Protein 97 (Wu et al, 2002. EMBO J 19:5740-5751)
2
Domain Analysis and Protein Families Introduction What are protein families? Protein families Description & Definition Motifs and Profiles The modular architecture of proteins Domain Properties and Classification
3
Protein Families
Protein families are defined by homology: in a family, everyone is related to everyone.
Everybody in a family shares a common ancestor.
[Figure: Protein family 1, Protein family 2]
4
Homology versus Similarity
Homologous proteins have similar 3D structures and (usually) share common ancestry.
Example: 1chg and 1sgt (Superfamily: Trypsin-like Serine Proteases), 31% identity, 43% similarity.
We can infer homology from similarity!
5
Homology versus Similarity
But homologous proteins may not share sequence similarity.
Example: 1chg and 1sgc (Superfamily: Trypsin-like Serine Proteases), 15% identity, 25% similarity.
We cannot infer similarity from homology.
6
Homology versus Similarity
Similar sequences may not have structural similarity.
Example: 1chg and 2baa, 30% similarity, 140/245 aa.
We cannot assume homology from similarity!
7
Homology versus Similarity: Summary
Sequences can be similar without being homologous.
Sequences can be homologous without being similar.
[Figure: Evolution / Homology, BLAST, Similarity, Families ??]
8
Domain Analysis and Protein Families Introduction What are protein families? Protein families Description & Definition Motifs and Profiles The modular architecture of proteins Domain Properties and Classification
9
Description of a Protein Family
Let’s assume we know some members of a protein family. What is common to them all?
Multiple alignment!
10
Techniques for searching sequence databases
Some common strategies to uncover common domains/motifs of biological significance that categorize a protein into a family:
Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string
Profile - a probabilistic generalization that assigns, to every position of a segment, a probability that each of the 20 aa will occur
Intermediate sequence search - links many profile searches
11
Motif Description of a Protein Family
Regular expressions:
........C.............S...L..I..DRY..I.......................W...
Alternative residues seen at some positions: I (for L), E (for D), W (for Y), V (for I)
/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /
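The pattern on this slide can be tried out with an ordinary regular-expression engine. The following is a minimal Python sketch, assuming that `x{n}` stands for n arbitrary residues and that `x{10} – x{12}` stands for between 10 and 12 arbitrary residues. The function name `find_motif` and the test sequence are made up for illustration; the sequence is not a real protein.

```python
import re

# Assumed translation of the slide's pattern into Python regex syntax:
# x{n} -> .{n}, and "x{10} - x{12}" -> .{10,12}
MOTIF = re.compile(r"C.{13}S.{3}[LI].{2}I.{2}[DE]R[YW].{2}[IV].{10,12}W")

def find_motif(sequence):
    """Return (start, end) of the first motif match, or None if there is none."""
    match = MOTIF.search(sequence)
    return (match.start(), match.end()) if match else None

# Hypothetical sequence constructed so that it satisfies the pattern.
synthetic = (
    "MK" + "C" + "A" * 13 + "S" + "A" * 3 + "L" + "A" * 2 + "I"
    + "A" * 2 + "DRY" + "A" * 2 + "V" + "A" * 11 + "W" + "GG"
)

if __name__ == "__main__":
    print(find_motif(synthetic))
    # Breaking the conserved DRY triplet should remove the match.
    print(find_motif(synthetic.replace("DRY", "DRF")))
```

A pattern like this is deterministic: a single substitution at a fixed position removes the hit entirely, which is one reason profiles are often preferred for remote family members.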
12
Automated Motif Discovery
Given a set of sequences:
GIBBS Sampler http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
MEME http://meme.sdsc.edu/meme/
PRATT http://www.ebi.ac.uk/pratt
TEIRESIAS http://cbcsrv.watson.ibm.com/Tspd.html
Combinatorial output!
13
Automated Profile Generation
Any multiple alignment is a profile!
PSI-BLAST algorithm:
1. Start from a single query sequence
2. Perform BLAST search
3. Build profile of neighbours
4. Repeat from 2 …
Very sensitive method for database search
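To illustrate the statement that any multiple alignment is a profile, here is a minimal Python sketch that turns an ungapped toy alignment into per-column residue probabilities and scores a segment against them. This is not the PSI-BLAST implementation: the pseudocount of 1, the uniform background of 1/20, and the names `build_profile` and `score` are assumptions chosen for illustration, and the toy alignment is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Per-column residue probabilities from equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Gap characters and non-standard residues are simply ignored here.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile

def score(profile, segment, background=1.0 / 20):
    """Sum of log-odds scores of an ungapped segment of standard residues."""
    return sum(math.log(column[aa] / background) for column, aa in zip(profile, segment))

if __name__ == "__main__":
    toy_alignment = ["DRYAI", "DRYSV", "ERWAI"]  # invented example
    profile = build_profile(toy_alignment)
    print(score(profile, "DRYAI"), score(profile, "GGGGG"))
```

In an iterative search, sequences scoring above an inclusion threshold would be added to the alignment and the profile rebuilt, which corresponds to the "repeat from 2" step.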
14
Position Specific Iterative Blast (PSI-BLAST)
The PSI-BLAST profile models only positions in the query sequence.
[Figure: Query, Profile1, Profile2, ... after n iterations, with a threshold for inclusion in the profile]
15
HMMs
Hidden Markov Models are statistical methods that consider all the possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000).
Sequence ordering and alignments are not necessary at the outset (but in many cases alignments are recommended).
The more sequences, the better the model.
One can generate a model (profile/PSSM), then search a database with it (e.g. Pfam).
16
HMM libraries
PFAM http://www.sanger.ac.uk/Pfam
The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
Pfam-A entries are high quality, manually curated families.
Pfam-B entries are generated automatically.
17
GTG steps
1. Generate alignment trace graph
   Nodes = residues
   Edges = aligned in PSI-Blast library
   Unweighted
2. Edge weighting
   Using consistency
3. Clustering
   Driven by consistency
   Single site occupancy rule
4. Post-processing
   Generate non-redundant set of inter-cluster edges
   Identify sub-trees with conserved residues
18
Alignment trace graph
-Graph representation of input pairwise alignment data
-Vertices = residues
-Edges = aligned in a pairwise alignment from input library
[Figure: residues of Protein 1 to Protein 5 drawn as vertices]
19
Consistency = neighbour overlap
For an edge between residues i and j: Weight = intersection / union of their neighbour sets
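The intersection-over-union weight can be sketched in a few lines of Python. This is an illustrative sketch, not the GTG code: vertices are written as hypothetical (protein, position) tuples, the aligned pairs are invented, and each neighbour set is assumed to include the vertex itself, so that two residues aligned only to each other get weight 1. The slide does not say whether the vertex itself is counted.

```python
from collections import defaultdict

def build_trace_graph(aligned_pairs):
    """Undirected graph: vertices are residues, edges are aligned residue pairs."""
    graph = defaultdict(set)
    for a, b in aligned_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def consistency(graph, i, j):
    """Edge weight = |N(i) & N(j)| / |N(i) | N(j)|, with closed neighbourhoods (assumption)."""
    n_i = graph.get(i, set()) | {i}
    n_j = graph.get(j, set()) | {j}
    return len(n_i & n_j) / len(n_i | n_j)

if __name__ == "__main__":
    # Invented pairwise alignments between residues of four proteins.
    pairs = [
        (("P1", 10), ("P2", 12)),
        (("P1", 10), ("P3", 9)),
        (("P2", 12), ("P3", 9)),
        (("P3", 9), ("P4", 20)),
    ]
    g = build_trace_graph(pairs)
    print(consistency(g, ("P1", 10), ("P2", 12)))  # mutually supported edge
    print(consistency(g, ("P3", 9), ("P4", 20)))   # edge with little support
```

Edges that are confirmed by many shared neighbours get weights near 1, while an alignment seen in only one pairwise comparison is down-weighted.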
20
GTG – global trace graph
Input: PSI-Blast all versus all alignments in NRDB40
Output: superalignment of all proteins
Applications:
Pairwise alignment of query and target sequences
Transitive sequence database searching (fast)
Tracking conserved residues (feature space)
21
Alignment trace graph
Edge weight = consistency (fraction of common neighbours)
Cluster ≈ hypothetical column of multiple alignment (single site occupancy)
[Figure: residues of Protein 1 to Protein 5 grouped into Cluster 1 and Cluster 2]
22
‘Motif tracking’
Each vertex is labelled with source protein and position in sequence.
Motifs are subtrees enriched in one particular amino acid type.
[Figure: tree of vertices labelled with amino acids (K, A, H, G), with a consistency axis]
23
Remote homolog detection based on GTG alignment score
GTG clustering is informative; it detects as many remote homologs as threading methods.
24
Summary
Super-families form elongated clusters in “protein space”
Profile models fluctuations around an equilibrium point
Consistency ~ path model
Exploits multiple profile models
Discriminative in database searching
Global trace graph data structure
Feature space for pattern discovery
http://ekhidna.biocenter.helsinki.fi/gtg/start
25
Relationships between families
Pfam clans: A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
Superfamily (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/hmm.html): The sequence search method uses a library (covering all proteins of known structure) consisting of 1539 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models.
Pfam-squared: Based on GTG comparisons of representative sequences from each Pfam-A family against all Pfam-A families.
Rules of thumb: motif score >1000 means probably related, motif score >500 means possibly related, score <500 means dubious.
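The rules of thumb for the motif score can be written as a small helper. This is only a sketch of the thresholds quoted on the slide; the function name `classify_motif_score` is hypothetical, and because the slide does not say how a score of exactly 500 or 1000 should be treated, the boundary handling below is an assumption.

```python
def classify_motif_score(score):
    """Map a Pfam-squared motif score to the slide's rule-of-thumb label."""
    if score > 1000:
        return "probably related"
    if score > 500:
        return "possibly related"
    # Assumption: a score of exactly 500 is grouped with the dubious range.
    return "dubious"

if __name__ == "__main__":
    for s in (1500, 750, 120):
        print(s, classify_motif_score(s))
```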
26
Benchmarking a motif/profile
You have a description of a protein family, and you do a database search…
Are all hits truly members of your protein family?
Benchmarking:
TP: true positive
TN: true negative
FP: false positive
FN: false negative
[Figure: dataset (family member, not a family member, unknown) versus search result]
27
Benchmarking a motif/profile
Precision / Selectivity: Precision = TP / (TP + FP)
Sensitivity / Recall: Sensitivity = TP / (TP + FN)
Balancing both:
Precision ~ 1, Recall ~ 0: easy but useless
Precision ~ 0, Recall ~ 1: easy but useless
Precision ~ 1, Recall ~ 1: perfect but very difficult
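The two measures can be computed directly from the set of database hits and the set of known family members. The sketch below is a minimal Python illustration; the function name `precision_recall`, the sequence identifiers, and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the slides.

```python
def precision_recall(hits, family):
    """Return (precision, recall) for a search result against known family members."""
    hits, family = set(hits), set(family)
    tp = len(hits & family)   # hits that are true family members
    fp = len(hits - family)   # hits that are not family members
    fn = len(family - hits)   # family members the search missed
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

if __name__ == "__main__":
    known_family = {"seqA", "seqB", "seqC", "seqD"}  # invented identifiers
    search_hits = {"seqA", "seqB", "seqX"}
    print(precision_recall(search_hits, known_family))
```

Returning every sequence in the database gives recall 1 with very low precision, and returning a single certain hit gives precision 1 with low recall, which is the trade-off listed on the slide.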
28
Domain Analysis and Protein Families Introduction What are protein families? Protein families Description & Definition Motifs and Profiles The modular architecture of proteins Domain Properties and Classification
29
The Modular Architecture of Proteins BLAST search of a multi-domain protein Phosphoglycerate kinase Triosephosphate isomerase
30
What are domains?
Functional - from experiments
Example: Decay Accelerating Factor (DAF) or CD55
Has six domains (units):
4x Sushi domain (complement regulation)
1x ST-rich ‘stalk’
1x GPI anchor (membrane attachment)
PDB entry 1ojy (sushi domains only)
P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696
31
There is only so much we can conclude…
Classifying domains (to aid structure prediction: predict structural domains, molecular function of the domain)
Classifying complete sequences (predicting molecular function of proteins, large scale annotation)
Majority of proteins are multi-domain proteins.
32
What are domains?
Mobile – Sequence Domains
[Figure: a mobile module shown in Protein 1, Protein 2, Protein 3, Protein 4]
33
Domains are...
...evolutionary building blocks: families of evolutionarily-related sequence segments
Domain assignment often coupled with classification
With one or more of the following properties:
Globular
Independently foldable
Recurrence in different contexts
To be precise: when we say “protein family”, we mean “protein domain family”.
34
Example: global alignment
Phthalate dioxygenase reductase (PDR_BURCE)
Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)
Global alignment fails! Only aligns largest domain.
35
Sometimes even more complex!
PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”
http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160
http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html
[Figure: domain map along the sequence, axis ticks at 980, 1960, 2940, 3920, 4391]
45 domains of 9 different types, according to Pfam
36
Properties of domains
Most domains: size approx. 75 – 200 residues
37
So, you have a sequence...
...look it up in an existing database
INTERPRO: http://www.ebi.ac.uk/interpro
...search against existing family descriptions
PFAM: http://www.sanger.ac.uk/Software/Pfam
INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/