Consensus Fold Recognition Methods
Dongbo Bu
School of Computer Science, University of Waterloo
Joint work with S.C. Li, X. Gao, L. Yu, J. Xu, M. Li
Nov. 2006

Outline
– Background
– Consensus Prediction Methods
– ACE7: a consensus method by identifying latent servers
– Experimental Results
– Future Work

Background

From Sequence to Structure
The Rate Gap (the motivation for computational methods):
– gene prediction is fast,
– but experimental structure determination is slow.
The First Principle (the possibility):
– sequence almost determines structure.
The CASP Competition (the benchmark):
– a fair and objective examination.

Homology Modeling --- sequence-sequence alignment

Threading --- sequence-structure alignment

Ab initio --- database independent

Why Consensus?
Observation:
– no single server can reliably predict the best models for all targets.
– a particular structure prediction server may perform well on some targets but badly on others.
A natural idea to address this:
– combine the strengths of different prediction methods to obtain better structural models.

What is a Consensus Method?

Formal Description
Notation:
– Target: the query protein sequence
– Server: an implementation of a prediction method
– Model: a predicted structure

Classical Consensus Methods

Research History
Early exploration of the consensus idea (many methods in one server):
– INBGU (SHGU), D. Fischer, 2000
– 3D-PSSM (Phyre), L. Kelley, 2000
The first consensus server:
– CAFASP-CONSENSUS, D. Fischer, 2001
Successors:
– Pcons/Pmodeller, J. Lundstrom, A. Elofsson, 2001
– 3D-Jury, K. Ginalski, A. Elofsson, 2003
– 3D-Shotgun, D. Fischer, 2003
– ACE, L. Yu, J. Xu, M. Li, 2004

Three-step Process
Step 1: Model Comparison
– determine model similarities.
Step 2: Feature Extraction
– build a formal description of each model.
Step 3: Model Selection
– select a model, or part of one.
Many machine learning techniques have been introduced in the third step; the sketch below shows the overall skeleton.
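A minimal Python sketch of this three-step skeleton, assuming a pairwise similarity function sim() (e.g. MaxSub) and a scoring rule score() are supplied; all names here are illustrative rather than any particular server's API:

def consensus_select(models, sim, score):
    # Step 1: model comparison -- pairwise similarity matrix.
    n = len(models)
    S = [[sim(models[i], models[j]) for j in range(n)] for i in range(n)]
    # Step 2: feature extraction -- here, the total support each model receives.
    support = [sum(S[i][j] for j in range(n) if j != i) for i in range(n)]
    # Step 3: model selection -- rank models by the supplied scoring rule.
    best = max(range(n), key=lambda i: score(support[i]))
    return models[best]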

3D-Shotgun: Majority Voting
Basic Idea:
– reminiscent of "cooperative algorithms".
Five Input Servers:
– GONP, GONPM, PRFSEQ, SEQPPRF, SEQPMPRF
Step 1: Model Comparison
– for each initial model, find models with LOCAL similarity.

3D-Shotgun (cont)
Step 2: Feature Extraction
– for each model M, superimpose the similar models onto M, using the shared similarity to compute each transformation;
– build a multiple structure alignment A(M) as the result.
– Feature: the number of models that share a structural element with A(M).

3D-Shotgun (cont)
Step 3: Selection
– majority voting: choose the structural element with the highest count.
– The underlying rationale: recurring structural elements are the most likely to be correct.

Confidence Assignment
For each assembled model M', a confidence score S' is given by
S' = Σ_{k,l} S_{k,l} · Sim(M', M_{k,l})
where
– k, l run over all the input models,
– S_{k,l} is the confidence score given by the individual server, and
– Sim() adopts MaxSub.
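A small Python sketch of this sum, assuming server_scores[k][l] holds S_{k,l} and sim() is a MaxSub-style similarity; the function and variable names are ours, not 3D-Shotgun's:

def shotgun_confidence(assembled, models, server_scores, sim):
    # S' = sum over all input models (k, l) of S_{k,l} * Sim(M', M_{k,l}).
    total = 0.0
    for k, per_server in enumerate(models):      # k: server index
        for l, model in enumerate(per_server):   # l: model index within server k
            total += server_scores[k][l] * sim(assembled, model)
    return total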

Performance of 3D-Shotgun

CAFASP-Consensus and Pcons: Neural Network
Step 1: Model Comparison
– CAFASP-Consensus: check the SCOP id, or run MaxSub
– Pcons: LGScore2 to detect similarity
Step 2: Feature Extraction
– CAFASP-Consensus: the number of similar models
– Pcons: the ratio of similar models, the weighted f1, and the ratio of similar first models

CAFASP-Consensus and Pcons (cont)
Step 3: Model Selection
– formulated as a machine learning problem
– Attribute: log(LGScore2), which works significantly better than LGScore2 itself.

Pmodeller = Pcons + ProQ
ProQ:
– a neural network package that measures the quality of a structure.
Pmodeller has an advantage over Pcons because a number of high-scoring but false-positive models are eliminated.

Performance of Pcons/Pmod

ACE: SVM Regression
Step 1: Model Comparison
– MaxSub
Step 2: Feature Extraction
– f1: the normalized similarity to all the other models
– f2: the normalized similarity to the most similar model
– f3: for each target, the divergence of the server predictions

ACE (cont)
Step 3: Selection
– SVM regression to predict model quality
– Regression target: the model's MaxSub score against the native structure
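A minimal sketch of this regression step using scikit-learn's SVR, with toy numbers standing in for the (f1, f2, f3) features and the MaxSub-to-native training targets; this illustrates the idea, not ACE's actual code:

import numpy as np
from sklearn.svm import SVR

# X: one row of features (f1, f2, f3) per training model;
# y: each model's MaxSub score against the native structure.
X_train = np.array([[0.62, 0.80, 0.15],
                    [0.30, 0.45, 0.15],
                    [0.75, 0.90, 0.10]])   # toy values
y_train = np.array([0.55, 0.20, 0.68])

reg = SVR(kernel='rbf', C=1.0, epsilon=0.01)
reg.fit(X_train, y_train)

# At prediction time, select the candidate with the highest predicted quality.
X_candidates = np.array([[0.70, 0.85, 0.10], [0.20, 0.35, 0.10]])
best = int(np.argmax(reg.predict(X_candidates)))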

Performance of ACE
In CASP6, ACE was ranked 2nd among 87 automatic servers. On the LiveBench test set:

Other Techniques
3D-Jury:
– Rationale: the average of the low-energy conformations is similar to the native structure.
– Basic idea: mimic this averaging step with a scoring function of the form shown below.
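In the spirit of the published 3D-Jury scheme, each model is scored by its average similarity to the other models (a plausible form of the function; Sim is a MaxSub-style count of superimposable C-alpha pairs):

S(M_i) = (1 / (N − 1)) · Σ_{j ≠ i} Sim(M_i, M_j)

The model closest to the "average" of the ensemble, i.e. the one with the highest score, is selected.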

Other Techniques (cont)
Robetta:
– for each fragment, choose a local structure from a set, and assemble the fragments to minimize an energy function.
BPROMPT:
– Bayesian belief network
JPred:
– decision tree

CASP7 Performance

ACE7: A Consensus Method by Identifying Latent Servers

Motivation
Server Correlation:
– although consensus servers assume that each individual server is independent of the others, the CASP6 results show that different servers are correlated to some degree.
Negative Effect:
– this correlation sometimes makes a native-like model receive less support than the incorrect models.

Examination of ACE on the CASP6 Dataset
Observation:
– if a native-like model receives support from only one or two servers, it is difficult to select.

Source of Server Correlation
Server Correlation:
– some servers tend to generate similar results.
Reason:
– roughly speaking, the correlations arise because these servers adopt similar techniques, including sequence alignment tools, secondary structure prediction methods, and scoring functions.
Latent Servers:
– we use independent latent servers to represent the common features shared by the individual servers.

ACE7: Reducing the Server Correlation
Step 1: adopt maximum likelihood to estimate the server correlation.
Step 2: employ principal component analysis (PCA) to derive the latent servers.
Step 3: use an ILP model to weight the latent servers.

Two Assumptions of ACE7
Assumption 1:
– here, we approximate C_{i,m} by:
Assumption 2:

Maximum Likelihood Estimation of Server Correlation
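If each server's per-target scores are modeled as jointly Gaussian, the maximum-likelihood estimate of the server correlation matrix reduces to the sample correlation of the observed scores; a minimal numpy sketch under that assumption (variable names are ours):

import numpy as np

# scores: (n_targets, n_servers); entry (t, s) is server s's model
# quality score on target t (toy random data stands in for CASP6).
scores = np.random.rand(100, 6)

# Under the joint-Gaussian assumption, the MLE of the correlation
# matrix is the sample correlation (rowvar=False: columns = servers).
corr = np.corrcoef(scores, rowvar=False)   # shape (6, 6)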

Server Correlation
Observation:
– the server correlation is significant, given that there are thousands of candidate models.
– some servers are correlated more tightly than others: mGenThreader and RAPTOR (0.383) vs. FUGUE3 and Prospect (0.182).
Implication:
– the individual servers may be clustered into cliques according to their correlations;
– the servers in a small clique may be underestimated under the simple "majority voting" rule.

Uncovering the Latent Servers

Uncovering the Latent Servers (cont)
Using the PCA technique, the latent servers can be estimated as follows.
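One standard way to realize this step, sketched under the same Gaussian assumption (not ACE7's exact procedure), is to eigendecompose the estimated correlation matrix and treat the principal axes as the latent servers:

import numpy as np

scores = np.random.rand(100, 6)               # toy (n_targets, n_servers)
corr = np.corrcoef(scores, rowvar=False)      # from the previous step

eigvals, eigvecs = np.linalg.eigh(corr)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # strongest components first
components = eigvecs[:, order]                # column j = latent server H_{j+1}

# Project standardized scores onto the principal axes; the resulting
# latent-server scores are mutually uncorrelated.
standardized = (scores - scores.mean(axis=0)) / scores.std(axis=0)
latent_scores = standardized @ components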

Explanation of Latent Servers
Observation:
– H1: represents MGTH and RAPT
– H2: SPKS
– H3: FUG3
– H4: ST02
– H5: PROS
– H6: no preference

Construct a More Accurate Server
Since the latent servers are mutually independent, it is reasonable to assume that a model's final score is a weighted combination of the latent-server scores.
Key Point:
– how should the weight of each latent server be set?
– An ILP model: maximize the gap between the scores of the native-like models and the incorrect models.

ILP Model (soft-margin idea)
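A plausible reconstruction of such a soft-margin program, in our notation (w_j: weight of latent server j; f_j(M): its support for model M; ξ_t: per-target slack; M_t^+: a native-like model of target t; I_t: the incorrect models of target t):

maximize    γ − C · Σ_t ξ_t
subject to  Σ_j w_j (f_j(M_t^+) − f_j(M)) ≥ γ − ξ_t   for all t and all M ∈ I_t,
            w_j ≥ 0,  ξ_t ≥ 0.

Maximizing γ widens the gap between native-like and incorrect models, while the slack variables ξ_t tolerate targets for which no weighting can achieve the full margin.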

Experiment on the CASP7 Dataset
Observation:
– for T0363, ACE7 succeeds even though only one server votes for the native-like model.

Sensitivity of ACE7
Observation:
– ACE7 has a higher sensitivity than any individual server.

Future Work

Conclusion
Although consensus methods rely on the structure clustering property, server correlation also brings a negative effect.

Future Work
– Find a better approximation of C_{i,m}.
– Use MaxSub instead of GDT.
– RAPTOR performs well at choosing the top 5 models but often struggles to pick the top 1; helping it choose the best of its top 5 models remains an open problem.

Thanks.