Classifying the protein universe Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751.

Classifying the protein universe Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751

Domain Analysis and Protein Families Introduction What are protein families? Protein families Description & Definition Motifs and Profiles The modular architecture of proteins Domain Properties and Classification

Protein families are defined by homology:  I n a family, everyone is related to everyone  Everybody in a family shares a common ancestor: Protein Families Protein family 1 Protein family 2

Homology versus Similarity Homologous proteins have similar 3D structures and (usually) share common ancestry: 1chg and 1sgt  31% identity, 43% similarity We can infer homology from similarity! 1chg 1sgt 1chg1sgt Superfamily: Trypsin-like Serine Proteases

Homology versus Similarity But Homologous proteins may not share sequence similarity: 1chg 1sgc 1chg1sgc Superfamily: Trypsin-like Serine Proteases 1chg and 1sgc  15% identity, 25% similarity We cannot infer similarity from homology

Homology versus Similarity Similar sequences may not have structural similarity: 1chg 2baa 1chg and 2baa  30% similarity, 140/245 aa We cannot assume homology from similarity!

Homology versus Similarity Summary  Sequences can be similar without being homologous  Sequences can be homologous without being similar Evolution / Homology BLAST Similarity Families ??

Description of a Protein Family Let’s assume we know some members of a protein family What is common to them all? Multiple alignment!

Techniques for searching sequence databases to Some common strategies to uncover common domains/motifs of biological significance that categorize a protein into a family Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string Profile - probabilistic generalizations that assign to every segment position, a probability that each of the 20 aa will occur Intermediate sequence search - link many profile searches

Motif Description of a Protein Family Regular expressions:........C.............S...L..I..DRY..I.......................W... I E W V / C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /

Automated Motif Discovery Given a set of sequences: GIBBS Sampler http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein MEME http://meme.sdsc.edu/meme/ PRATT http://www.ebi.ac.uk/pratt TEIRESIAS http://cbcsrv.watson.ibm.com/Tspd.html Combinatorial output!

Automated Profile Generation Any multiple alignment is a profile! PSIBLAST Algorithm: Start from a single query sequence Perform BLAST search Build profile of neighbours Repeat from 2 … Very sensitive method for database search

Position Specific Iterative Blast PSI-Blast profile models only positions in the query sequence PSI-BLAST Threshold for inclusion in profile QueryProfile1 Profile2... After n iterations

HMMs Hidden Markov Models are Statistical methods that consider all the possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000) Sequence ordering and alignments are not necessary at the onset (but in many cases alignments are recommended) More the number of sequences better the models. One can Generate a model (profile/PSSM), then search a database with it (Eg: PFAM)

HMM libraries PFAM  http://www.sanger.ac.uk/Pfam http://www.sanger.ac.uk/Pfam  The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).  Pfam-A entries are high quality, manually curated families.  Pfam-B entries are generated automatically.

GTG steps 1.Generate alignment trace graph Nodes = residues Edges = aligned in PSI-Blast library Unweighted 2.Edge weighting Using consistency 3.Clustering Driven by consistency Single site occupancy rule 4.Post-processing Generate non-redundant set of inter-cluster edges Identify sub-trees with conserved residues

Alignment trace graph Protein 1 Protein 2 Protein 3 Protein 4 Protein 5 -Graph representation of input pairwise alignment data -Vertices = residues -Edges = aligned in a pairwise alignment from input library Residues more residues

Consistency = neighbour overlap i j Weight = intersection / union

Input: PSI-Blast all versus all alignments in NRDB40 Output: superalignment of all proteins Applications  Pairwise alignment of query and target sequences  Transitive sequence database searching (fast)  Tracking conserved residues (feature space) GTG – global trace graph

Edge weight = consistency (fraction of common neighbours) Cluster ≈ hypothetical column of multiple alignment (single site occupancy) Cluster 2 Cluster 1 Protein 1 Protein 2 Protein 3 Protein 4 Protein 5 Protein 1 Protein 2 Protein 3 Protein 4 Protein 5 Alignment trace graph

‘Motif tracking’ K K K K K A A A H A G A K consistency K K Each vertex is labelled with source protein and position in sequence. Motifs are subtrees enriched in one particular amino acid type.

Remote homolog detection based on GTG alignment score GTG clustering is informative; detect as many remote homologs as threading methods

Summary Super-families form elongated clusters in “protein space”  Profile models fluctuations around an equilibrium point Consistency ~ path model  Exploits multiple profile models  Discriminative in database searching Global trace graph data structure  Feature space for pattern discovery http://ekhidna.biocenter.helsinki.fi/gtg/start

Relationships between families Pfam clans  A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM. Superfamily  http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/hmm.html http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/hmm.html  The sequence search method uses a library (covering all proteins of known structure) consisting of 1539 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models.SCOPgroup Pfam-squared  Based on GTG comparisons of representative sequences from each PFAM-A family against all PFAM-A families.  Rules of thumb: motif score>1000 means probably related, motif score >500 means possibly related, score <500 means dubious

Benchmarking a motif/profile You have a description of a protein family, and you do a database search… Are all hits truly members of your protein family? Benchmarking: Dataset unknown family member not a family member TP: true positive TN: true negative FP: false positive FN: false negative Result

Precision / Selectivity Precision = TP / (TP + FP) Sensitivity / Recall Sensitivity = TP / (TP + FN) Balancing both: Precision ~ 1, Recall ~ 0: easy but useless Precision ~ 0, Recall ~ 1: easy but useless Precision ~ 1, Recall ~ 1: perfect but very difficult Benchmarking a motif/profile

The Modular Architecture of Proteins BLAST search of a multi-domain protein Phosphoglycerate kinase Triosephosphate isomerase

Functional - from experiments: example: Decay Accelerating Factor (DAF) or CD55 What are domains? Has six domains (units):  4x Sushi domain (complement regulation)  1x ST-rich ‘stalk’  1x GPI anchor (membrane attachment)  PDB entry 1ojy (sushi domains only) P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696

There is only so much we can conclude… Classifying domains [To aid structure prediction (predict structural domains, molecular function of the domain)] Classifying complete sequences (predicting molecular function of proteins, large scale annotation) Majority of proteins are multi-domain proteins.

Mobile – Sequence Domains: What are domains? Mobile module Protein 1 Protein 2 Protein 3 Protein 4

Domains are......evolutionary building blocks: Families of evolutionarily-related sequence segments Domain assignment often coupled with classification With one or more of the following properties: Globular Independently foldable Recurrence in different contexts To be precise, we say: “protein family” we mean: “protein domain family”

Example: global alignment Phthalate dioxygenase reductase (PDR_BURCE) Toluene - 4 - monooxygenase electron transfer component (TMOF_PSEME) Global alignment fails! Only aligns largest domain.

Sometimes even more complex! PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor” http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160 http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html 980 1960 2940 3920 4391 45 domains of 9 different type, according to PFam

Most domains: size approx 75 – 200 residues Properties of domains

So, you have a sequence......look it up in existing database  INTERPRO: http://www.ebi.ac.uk/interprohttp://www.ebi.ac.uk/interpro...search against existing family descriptions  PFAM: http://www.sanger.ac.uk/Software/Pfamhttp://www.sanger.ac.uk/Software/Pfam  INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/

Classifying the protein universe Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751.

Similar presentations

Presentation on theme: "Classifying the protein universe Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Classifying the protein universe Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751.

Similar presentations

Presentation on theme: "Classifying the protein universe Synapse- Associated Protein 97 Wu et al, 2002. EMBO J 19:5740-5751."— Presentation transcript:

Similar presentations

About project

Feedback