Mining frequent patterns in protein structures: A study of protease families Dr. Charles Yan CS6890 (Section 001) ST: Bioinformatics The Machine Learning.

Presentation transcript:

Mining frequent patterns in protein structures: A study of protease families
Dr. Charles Yan, CS6890 (Section 001) ST: Bioinformatics, The Machine Learning Approach
Presented by: Bhavendra Matta

Presentation Structure
Problem
Introduction
Method Proposed
Results
Findings
About Authors
Questions

Problem
Mining frequent patterns in protein structure: Analysis of protein sequence and structure databases usually reveals frequent patterns (FPs) associated with biological function. Data mining techniques generally consider the physicochemical and structural properties of amino acids and their microenvironment in the folded structures.

Important Terminology
Frequent patterns in protein structures: The primary structure of a protein is the sequence of amino acids in the polypeptide chain. FP here refers to the frequent patterns found for each type of amino acid.
Conserved residue: A residue that stays the same across the sequences of a multiple sequence alignment. Conserved residues are used to determine structural relationships between the aligned sequences. Example:
VHAVOYJBIO
BHAVJOYBIO
OYJVHAVBIO
Here the segment BIO is conserved across all three sequences.
Protease: Protease refers to a group of enzymes whose catalytic function is to break down the peptide bonds of proteins.

Important Terminology (continued)
Catalytic triad: It refers to three amino acid residues found inside the active site of certain proteases. These include Asp 102, His 57, and Ser 195 (chymotrypsin numbering).
Unsupervised learning: A method of machine learning in which a model is fit to observations without labeled outputs. Here the unsupervised learning takes the form of clustering.
Microenvironment: The local structure assumed by residues that are close in space but not necessarily contiguous along the sequence. There are strong correlations between function and microenvironment.

Introduction
The paper presents a novel unsupervised learning approach to discover frequent patterns in protein families.
FP calculations are based on three kinds of features (with no prior knowledge of functional motifs):
1. Biochemical features
2. Geometric features
3. Dynamic features
FPs are identified for each amino acid type in three protease subfamilies: the chymotrypsin and subtilisin subfamilies of the serine proteases, and the papain subfamily of the cysteine proteases.
The catalytic triad residues are distinguished by their strong spatial coupling (high interconnectivity) to other conserved residues.

Introduction (continued)
Protein function is associated with particular sequence or structure motifs.
A few catalytic residue databases and tools:
PDB (Protein Data Bank)
PROCAT: geometric hashing
WEBFEATURE: Bayesian network
PINTS:
TRILOGY:

Method
Training dataset
Feature extraction
FP discovery
Conserved residue identification
Rank of conserved residues

Dataset
A set of proteins belonging to a given family is selected as the training dataset. Features are extracted from all the amino acids in this dataset. Two classes of enzymes, serine proteases and cysteine proteases, are analyzed here. These proteases typically have a catalytic triad at the active site. The enzymes are classified into evolutionary subfamilies: S1 (chymotrypsin) and S8 (subtilisin) of the serine proteases, and C1 (papain) of the cysteine proteases.

Feature Extraction
Each amino acid is characterized in terms of the dynamic, biochemical, and geometric features of the residues in its microenvironment.

Dynamic features
The Gaussian network model (GNM), an elastic network model for describing the equilibrium dynamics of proteins, is used to characterize the dynamic features. In the GNM, the α-carbons (Cα) form the network nodes, and nodes located within an interaction cut-off distance of 7.0 Å are connected via uniform elastic springs. Another structural property with a strong impact on equilibrium dynamics is the coordination number (CN), defined as the number of amino acids (or α-carbons) that coordinate the central amino acid within a first interaction shell of 7.0 Å.
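
The contact topology described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy coordinates, and the choice of "less than or equal to" for the cut-off test are assumptions. It builds the GNM connectivity (Kirchhoff) matrix from Cα coordinates and reads the coordination number off its diagonal; computing the slow modes would additionally require an eigendecomposition, which is omitted here.

```python
import math

CUTOFF = 7.0  # interaction cut-off in angstroms, as stated on the slide


def kirchhoff(ca_coords, cutoff=CUTOFF):
    """Connectivity matrix: -1 for each pair of nodes within the cut-off,
    and the number of contacts of node i on the diagonal."""
    n = len(ca_coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: a pair exactly at the cut-off counts as connected.
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma


def coordination_numbers(ca_coords, cutoff=CUTOFF):
    """CN of each residue = number of alpha-carbons within the cut-off."""
    gamma = kirchhoff(ca_coords, cutoff)
    return [gamma[i][i] for i in range(len(gamma))]


# Hypothetical toy chain: four alpha-carbons spaced 3.8 angstroms apart on a line.
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_cn = coordination_numbers(toy_chain)
```

In the toy chain only sequential neighbours fall within 7.0 Å, so the end residues have one contact and the inner residues have two.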

Biochemical features
These define the amino acid type and property. The classification is based on both the specific amino acid identity and its chemical features or functional groups, which allows mining of multiple-level association rules.
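
One way to picture the two levels (specific identity and broader chemical class) is to emit both as items for the pattern miner. The sketch below is illustrative only: the `PROPERTY_CLASS` table, its class labels, and the `aa=` / `class=` item format are assumptions, since the slides do not list the grouping actually used.

```python
# Illustrative (assumed) property classes; not the grouping used in the paper.
PROPERTY_CLASS = {
    "ASP": "negative", "GLU": "negative",
    "LYS": "positive", "ARG": "positive", "HIS": "positive",
    "SER": "polar", "THR": "polar", "ASN": "polar", "GLN": "polar", "CYS": "polar",
    "ALA": "hydrophobic", "VAL": "hydrophobic", "LEU": "hydrophobic", "ILE": "hydrophobic",
}


def biochemical_items(residue_name):
    """Return items at two levels: the exact amino acid and, if known, its class."""
    name = residue_name.upper()
    items = {f"aa={name}"}
    cls = PROPERTY_CLASS.get(name)
    if cls is not None:
        items.add(f"class={cls}")
    return items


ser_items = biochemical_items("ser")
```

With items at both levels, a pattern can be frequent at the class level (for example, a polar neighbour) even when no single amino acid identity is frequent on its own.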

Geometric features
A 3D reference frame is defined for each residue using its three backbone atoms: N, Cα and C (carbonyl C). The frame uniquely defines the position and orientation of the residue in 3D space.
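
A residue-centred frame of this kind can be built by orthonormalizing the backbone bond vectors. The sketch below assumes one particular convention (origin at Cα, x axis along Cα to C, z axis normal to the N, Cα, C plane); the slides do not state which convention the paper uses, and the helper names are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)


def residue_frame(n, ca, c):
    """Right-handed orthonormal frame with origin at CA (assumed convention)."""
    x_axis = unit(sub(c, ca))                    # along the CA -> C bond
    z_axis = unit(cross(x_axis, sub(n, ca)))     # normal to the N-CA-C plane
    y_axis = cross(z_axis, x_axis)               # completes the right-handed set
    return ca, (x_axis, y_axis, z_axis)


def to_local(frame, point):
    """Express a point (e.g. a neighbouring CA) in the residue's own frame."""
    origin, axes = frame
    d = sub(point, origin)
    return tuple(dot(d, axis) for axis in axes)


# Hypothetical coordinates chosen so the frame coincides with the global axes.
frame = residue_frame(n=(1.0, 2.0, 1.0), ca=(1.0, 1.0, 1.0), c=(2.0, 1.0, 1.0))
local = to_local(frame, (1.0, 1.0, 3.0))
```

Expressing neighbours in this local frame gives direction features that do not depend on how the whole protein is oriented in the coordinate file.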

FP Discovery
FP discovery uses the Apriori algorithm.
Algorithm:
Calculate the occurrence and support of each feature to build the initial FPs.
Discard FPs whose support is smaller than a predefined minimum support.
Join the remaining FPs to generate augmented FPs: if the length of an FP is x, the augmented FP has length x+1.
The minimum support is chosen according to the degree of FP to be considered.
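
The steps above follow the standard Apriori scheme, which can be sketched as below. This is a generic textbook version, not the authors' code: each residue is treated as a set of feature items, and the item names and the 0.5 minimum support in the toy data are made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every single feature that occurs in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only candidates that reach the minimum support.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-itemsets and keep
        # only those whose k-subsets are all frequent (Apriori pruning).
        k += 1
        current = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                current.add(union)
    return frequent


# Hypothetical feature sets for four residues of one amino acid type.
residue_features = [
    {"polar", "CN_high", "slow_mode_low"},
    {"polar", "CN_high"},
    {"polar", "CN_low"},
    {"hydrophobic", "CN_high"},
]
frequent_patterns = apriori(residue_features, min_support=0.5)
```

Raising `min_support` yields fewer and shorter patterns, which is the trade-off the last bullet of the slide refers to.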

FP Discovery

Identification of Conserved Residues
Applying the Apriori algorithm to the proteins reveals the FPs of maximum length.
An FP that occurs at least once in every protein of the examined subfamily is considered a conserved FP, and the residues that carry it are conserved residues.
Next, the conserved residues are removed from the original dataset, and the Apriori algorithm is applied again to the modified dataset. All the conserved patterns of the 20 types of amino acids are identified by this iterative search for each family.
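
The iterative search can be sketched as a loop that finds the longest pattern present in every protein, records the residues that carry it, removes them, and repeats. This is a rough sketch under stated assumptions: a brute-force subset enumeration stands in for Apriori to keep the example short, "conserved" is read as "occurs in at least one residue of every protein", and the protein IDs and feature names are hypothetical.

```python
from itertools import combinations


def longest_conserved_pattern(residues, proteins):
    """Longest feature pattern found in at least one residue of every protein.

    residues: list of (protein_id, frozenset_of_features).
    Brute force over subsets; a stand-in for Apriori on small feature sets.
    """
    candidates = set()
    for _, feats in residues:
        for k in range(1, len(feats) + 1):
            candidates.update(frozenset(c) for c in combinations(sorted(feats), k))
    # Try longer patterns first; ties are broken alphabetically for determinism.
    for pattern in sorted(candidates, key=lambda p: (-len(p), sorted(p))):
        covered = {pid for pid, feats in residues if pattern <= feats}
        if covered == proteins:
            return pattern
    return None


def iterative_conserved(residues):
    """Repeat: find a conserved pattern, record its residues, remove them."""
    proteins = {pid for pid, _ in residues}
    remaining = list(residues)
    found = []
    while True:
        pattern = longest_conserved_pattern(remaining, proteins)
        if pattern is None:
            break
        hits = [r for r in remaining if pattern <= r[1]]
        found.append((pattern, hits))
        remaining = [r for r in remaining if not pattern <= r[1]]
    return found


# Hypothetical serine residues from two proteins of one subfamily.
toy_residues = [
    ("P1", frozenset({"Ser", "CN_high", "buried"})),
    ("P1", frozenset({"Ser", "CN_low"})),
    ("P2", frozenset({"Ser", "CN_high", "buried"})),
    ("P2", frozenset({"Ser", "CN_low", "exposed"})),
]
conserved = iterative_conserved(toy_residues)
```

Each pass removes at least one residue per protein, so the loop always terminates; in practice it would be run separately for each of the 20 amino acid types.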

Rank of Conserved Residues
Once the conserved residues are identified by the Apriori algorithm, a ranking method is needed to distinguish the catalytic residues. It is assumed that the catalytic residues are optimally coupled with other conserved residues to achieve the highest cooperativity. The amino acids that show the lowest interconnectivity (smallest number of connected neighbors) are removed from the list of considered residues, and this removal is repeated until only the ‘core’ residues remain. The ‘core’ residues are assigned the score zero, and the others are scored according to the number of iterations required to reach the ‘core’ residues.
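
The ranking can be pictured as peeling a contact graph of conserved residues from the outside in. The sketch below is one possible reading, not the authors' procedure: the stopping rule (stop when all remaining residues have the same number of neighbours) is an assumption, and the residue labels and contacts are hypothetical.

```python
def rank_by_peeling(neighbors):
    """Score residues by how many peeling rounds separate them from the core.

    neighbors: dict mapping each conserved residue to the set of conserved
    residues it is connected to (assumed symmetric).
    """
    remaining = set(neighbors)
    rounds = []
    while remaining:
        degree = {r: len(neighbors[r] & remaining) for r in remaining}
        lowest = min(degree.values())
        # Assumed stopping rule: no residue is less connected than the rest.
        if lowest == max(degree.values()):
            break
        removed = {r for r in remaining if degree[r] == lowest}
        rounds.append(removed)
        remaining -= removed

    scores = {r: 0 for r in remaining}  # core residues score zero
    total = len(rounds)
    for i, removed in enumerate(rounds):
        for r in removed:
            scores[r] = total - i  # rounds still needed to reach the core
    return scores


# Hypothetical contact map among five conserved residues.
toy_contacts = {
    "r1": {"r2", "r3", "r4"},
    "r2": {"r1", "r3", "r4"},
    "r3": {"r1", "r2"},
    "r4": {"r1", "r2", "r5"},
    "r5": {"r4"},
}
toy_scores = rank_by_peeling(toy_contacts)
```

In this toy graph r5 is peeled first, then r3 and r4, leaving r1 and r2 as the core; low scores therefore flag the most tightly coupled residues, which is where the catalytic residues are expected to appear.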

Results
Consider the serine residues in the serine protease family. Information for a set of 111 serine residues is extracted from the 5 proteins in S1, and for a set of 250 serine residues from the 7 proteins in S8. This is consistent with the fact that the conservation of the microenvironment and global dynamics is a more restrictive (and discriminative) feature than sequence conservation. Another observation is that amino acids that sequentially neighbor the catalytic residues tend to be conserved. The present unsupervised learning algorithm identified 22, 22 and 26 conserved residues in the S1, S8 and C1 subfamilies, respectively.

continues…

Result Continues…

Conclusion
A novel unsupervised learning approach is presented to discover biologically meaningful FPs in protein structures. The approach incorporates features associated with collective dynamics (GNM slow mode shapes) as well as the biochemical (amino acid types and physicochemical properties) and geometric (3D coordination directions) features in the microenvironment. This approach can be used to discover and annotate all frequent patterns in the protein structure database. It can help to predict the structure and function of uncharacterized proteins, and to identify the important amino acids or structural regions.

About Authors
Ivet Bahar
She is currently Chair and Professor of the Department of Computational Biology, University of Pittsburgh, Pittsburgh.
She has more than 21 years of research experience.
Current research areas:
Characterization of protein structural classes
Characterization of anti-cancer agents
Conformational dynamics of proteins
Protein folding kinetics

About Authors (continued)
Shann-Ching Chen
Carnegie Mellon University, Pittsburgh
Main focus on machine learning.
Current project areas:
Retrieval of 3D protein and nucleic acid structures
Multimodal biometrics

Questions??? Thank You