Proteomics: Analyzing protein space

Protein families
Why proteins? Shift of interest from “Genomics” to “Proteomics”.
Classification of proteins into groups/families: what is it good for?
- Explosion in biological sequence data => need to organize!
- Understanding the relations/hierarchy of groups is interesting in itself, e.g. in evolutionary research.
- For applied research:
  – Annotation of new proteins: predicting their function, structure, cellular localization etc.
  – Looking for new folds

Sequence-based classification
- By sequence similarity (domains, motifs or complete proteins): Pfam, PROSITE, SMART, InterPro etc.
- InterPro synthesizes the data from Pfam, PROSITE, PRINTS, ProDom and SMART. It is considered the “best” domain-based classification available.

Other kinds of classification
- Global classification:
  – SYSTERS, ProtoMap, CluSTr
  – MetaFam synthesizes global classification data
- By structure similarity: SCOP etc.
- By function: Albumin, RetNet, TumorGenes etc.

ProtoNet
- A long-term project at HUJI led by Michal & Nati Linial.
- Provides automatic global classification of the known proteins.
- Performs hierarchical clustering on a sequence-based metric space of proteins.
- Allows an external protein to be “placed” into the hierarchy.

Why clustering?
- We want to refine the notion of “similarity”, compared to e.g. BLAST
- Exploit transitivity to improve grouping
- Can use a low threshold on similarity:
  – uses the vast information in low similarities
  – allowable because clustering filters noise

Why hierarchical?
[Figure panels: Vertical Perspective, Horizontal Perspective]

ProtoNet: Pre-Computation
- All-against-all gapped BLAST using BLOSUM62
- SwissProt release database (114,033 proteins)
- BLAST identified ~2×10^7 relations between these proteins with relatively high sequence similarity
- E-score of 100 or less: we don't want to lose information => very permissive!
- Still far fewer than the ~6.5×10^9 possible pairs (=> infeasible)
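The ~6.5×10^9 figure is consistent with the number of unordered pairs among 114,033 proteins. A minimal Python sketch of that arithmetic follows; the helper name `unordered_pairs` is ours, and reading the slide's figure as "all possible pairs" is an assumption.

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


N_PROTEINS = 114_033          # SwissProt size quoted on the slide
BLAST_RELATIONS = 2e7         # approximate number of relations kept (from the slide)

total_pairs = unordered_pairs(N_PROTEINS)
kept_fraction = BLAST_RELATIONS / total_pairs

print(f"all pairs:     {total_pairs:.3e}")
print(f"kept by BLAST: {kept_fraction:.2%} of all pairs")
```

Under these numbers only a small fraction of a percent of all pairs carries a BLAST relation, so every other pair needs the missing-data treatment described later in the deck.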

Clustering Method
First, each protein is considered a singleton cluster.

Clustering Method
Next, we iteratively merge pairs of clusters: at each step we choose to merge the ‘most similar’ pair of clusters.

Clustering Method
As we progress, the number of singletons drops.

Clustering Method
The clustering process gradually generates a tree of clusters. We can stop whenever we like.
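The steps on the Clustering Method slides can be sketched as a generic bottom-up (agglomerative) procedure. This is an illustration in Python, not ProtoNet's implementation: the function name `agglomerate`, the toy distance table and the use of the arithmetic mean of pairwise distances as the "most similar" criterion are all assumptions.

```python
from itertools import combinations


def agglomerate(items, dist, stop_at=1):
    """Bottom-up clustering sketch.

    items   : iterable of item names
    dist    : dict mapping frozenset({x, y}) -> distance (smaller = more similar)
    stop_at : stop once this many clusters remain ("stop whenever we like")
    Returns (merges, clusters), where merges records the tree of merge events.
    """
    # Start with every item as a singleton cluster.
    clusters = [frozenset([x]) for x in items]
    merges = []

    def linkage(a, b):
        # Assumed criterion: arithmetic mean over all inter-cluster pairs.
        pairs = [dist[frozenset((x, y))] for x in a for y in b]
        return sum(pairs) / len(pairs)

    while len(clusters) > stop_at:
        # Pick the 'most similar' pair of clusters and merge it.
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return merges, clusters


# Hypothetical toy distances between four items.
toy_dist = {
    frozenset(("A", "B")): 1.0,
    frozenset(("C", "D")): 2.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 5.0,
    frozenset(("A", "D")): 9.0,
    frozenset(("B", "D")): 10.0,
}

merges, final_clusters = agglomerate("ABCD", toy_dist, stop_at=2)
print(merges)
print(final_clusters)
```

Recording each merge event, rather than only the final partition, is what yields the tree: cutting the list of merges at any point gives the clustering at that level.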

How to merge?
- The potential merging score is calculated for each pair of clusters relevant for merging at each level.
- At the bottom level it equals [formula missing in the transcript].
- At higher levels it is designed to reflect the similarity of clusters.
- It depends on the inter-cluster similarities of pairs of proteins, each from a different cluster.
[Figure: two clusters labeled m and n]

Potential Merging Score
Arithmetic Mean ≥ Geometric Mean ≥ Harmonic Mean [formulas missing in the transcript]
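The three averaging rules named on this slide can be compared on a toy example. The Python sketch below assumes the merging score is an average of pairwise scores over all inter-cluster protein pairs; the exact ProtoNet formulas are not in the transcript, and the function name and score values here are hypothetical.

```python
from statistics import geometric_mean, harmonic_mean, mean


def merging_scores(cluster_a, cluster_b, score):
    """Average the pairwise scores over all inter-cluster pairs in three ways.

    score maps frozenset({x, y}) -> a positive pairwise score (e.g. an E-score).
    """
    pairs = [score[frozenset((x, y))] for x in cluster_a for y in cluster_b]
    return {
        "arithmetic": mean(pairs),
        "geometric": geometric_mean(pairs),
        "harmonic": harmonic_mean(pairs),
    }


# Hypothetical pairwise scores between clusters {a1, a2} and {b1, b2}.
toy_scores = {
    frozenset(("a1", "b1")): 1e-30,
    frozenset(("a1", "b2")): 1e-10,
    frozenset(("a2", "b1")): 1e-2,
    frozenset(("a2", "b2")): 1.0,
}

result = merging_scores({"a1", "a2"}, {"b1", "b2"}, toy_scores)
for name, value in result.items():
    print(f"{name:>10}: {value:.3e}")
```

For positive values the arithmetic mean is never smaller than the geometric mean, which is never smaller than the harmonic mean. On E-score-like data spanning many orders of magnitude, the arithmetic mean is dominated by the weakest (largest) values and the harmonic mean by the strongest (smallest) ones.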

Missing Data Treatment
- For a very-low-similarity pair (outside the ~2×10^7 relations), its length is defined as [value missing in the transcript].
- Practically, the merging process should finish when the weight of the “infinite” lengths in the calculation of the score between new clusters becomes very large (the signal is lost).
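One way to make the stopping idea concrete is to track how many inter-cluster pairs have no recorded relation. The Python sketch below is hypothetical: the function name, the toy relation set and the 0.5 cut-off are illustrative assumptions, not values from the slides.

```python
def missing_fraction(cluster_a, cluster_b, known_relations):
    """Fraction of inter-cluster pairs with no recorded similarity.

    known_relations is a set of frozenset({x, y}) pairs that have a score.
    """
    total = len(cluster_a) * len(cluster_b)
    missing = sum(
        1
        for x in cluster_a
        for y in cluster_b
        if frozenset((x, y)) not in known_relations
    )
    return missing / total


# Hypothetical example: only one of the four inter-cluster pairs has a score.
known = {frozenset(("A", "C"))}
fraction = missing_fraction({"A", "B"}, {"C", "D"}, known)

MAX_MISSING = 0.5  # assumed cut-off, chosen only for illustration
print(fraction, "-> stop merging" if fraction > MAX_MISSING else "-> keep merging")
```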

Results: ProtoNet top 20
20 largest clusters in the ProtoNet (Arithmetic) tree at a preselected level

Problem of result assessment: what is a “good” cluster?
- Contains all proteins in the family and no proteins outside the family
- But what is a family? Does any keyword define a family?
- Stable as the merging events occur (long lifetime)?

Problem of result assessment: what is a “good” tree?
- Should we trust the resulting forest?
  – Which clustering technique is better? Combined?
  – Bootstrap?
- Do the clusters correspond to meaningful families of proteins?
  – Validation against InterPro, SCOP etc.
  – But we do not want to merely reconstruct them automatically!
- What is the right level/cut at which to look at the forest?

InterPro Validation
- InterPro annotation allows systematic validation of the generated clustering
- The ‘geometric’ method exhibits high cluster purity
  – Corresponds to few false positives (FP)
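Cluster purity against an external annotation can be sketched as follows. The slides do not give the exact definition used, so this Python example assumes purity is the share of annotated cluster members carrying the cluster's most common label, with the remaining annotated members counted as false positives. The protein names and labels are made up.

```python
from collections import Counter


def cluster_purity(members, annotation):
    """Return (dominant_label, purity) for one cluster, or None if unannotated.

    annotation maps protein name -> family label; unannotated members are skipped.
    """
    labels = [annotation[m] for m in members if m in annotation]
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical annotation: p5 has no label and is ignored.
toy_annotation = {"p1": "kinase", "p2": "kinase", "p3": "kinase", "p4": "SH3"}
toy_cluster = ["p1", "p2", "p3", "p4", "p5"]

purity_result = cluster_purity(toy_cluster, toy_annotation)
print(purity_result)
```

Skipping unannotated members is a design choice: it avoids penalizing a cluster for containing proteins the reference database has not yet labeled.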

The Domain Problem
- Many proteins are composed of several domains
- The sequence similarity tools used are therefore local in nature: the score for comparing two sequences is the edit distance between their most similar subsequences
- This creates a false similarity problem:

The Modular Nature of Proteins
Proteins: CSKP_HUMAN, DLG3_MOUSE, K6A1_MOUSE, MPP3_HUMAN
Domain legend: Serine/Threonine protein kinase family active site; Protein kinase C-terminal domain; PDZ domain; SH3 domain; Guanylate kinase

False Transitivity of Local Alignment
Proteins: CSKP_HUMAN, DLG3_MOUSE, MPP3_HUMAN, K6A1_MOUSE
Pairwise E-scores shown in the figure: 8e-78, 2e-47, 9e-41, 1e-42
- We ran BLAST using default parameters: all these pairwise similarities have an E-score better than 1e-40
- If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
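The false-transitivity effect can be reproduced on a toy graph. In the Python sketch below, the proteins P1 to P3 and their domain sets are hypothetical (they are not the four proteins on the slide), and "locally similar" is simplified to "shares at least one domain". Grouping by connected components then places two proteins with no common domain in one cluster.

```python
from itertools import combinations


def connected_components(nodes, edges):
    """Group nodes linked by any chain of edges (simple union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


# Hypothetical multi-domain proteins.
toy_domains = {
    "P1": {"kinase"},
    "P2": {"kinase", "guanylate_kinase"},
    "P3": {"guanylate_kinase"},
}

# Simplifying assumption: a local hit exists whenever two proteins share a domain.
local_hits = [
    (a, b)
    for a, b in combinations(sorted(toy_domains), 2)
    if toy_domains[a] & toy_domains[b]
]

components = connected_components(toy_domains, local_hits)
print(local_hits)
print(components)
```

P1 and P3 are never directly similar, yet both hit the two-domain protein P2, so a purely transitive grouping joins them.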

Alternative methods
- Different types of clustering
  – Non-binary
  – Goal-oriented => semi-guided
  – Graph theory insights
- Non-clustering ways of exploring the space of proteins
- Why BLAST E-score???
- Enrichment of the metric using structure