1
Bayesian Classification of Protein Data
Thomas Huber
Computational Biology and Bioinformatics Environment (ComBinE)
Department of Mathematics, The University of Queensland
huber@maths.uq.edu.au
2
Today’s talk
Protein score functions from mining protein data
–Bayesian classification
  A toy example
  A protein scoring function for fold recognition
Where are score/energy functions useful?
–A few examples
3
Why do we care about Protein Structures/Prediction?
Academic curiosity?
–Understanding how nature works
Urgency of prediction
–About 10^4 structures have been determined: insignificant compared to all proteins
–sequencing = fast & cheap
–structure determination = hard & expensive
[Figure legend: transistors in Intel processors; TrEMBL sequences (computer annotated); SwissProt sequences (annotated); structures in PDB]
5
Three basic choices in (molecular) modelling
Representation
–Which degrees of freedom are treated explicitly
Scoring
–Which scoring function (force field)
Searching
–Which method to search or sample conformational space
6
Protein Scoring Functions from Mining Protein Data
Classification Theory
–Find a set of classes and their descriptors (a classification) for n data points with q attributes (shape, amino acid type, etc.)
Theory of finite mixtures
–Each class is described by the attribute probability distribution of all its members
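The formulas on this slide did not survive extraction. As a hedged reminder of what a finite mixture looks like in standard notation (the symbols \pi_j for class weights, \theta_j for class parameters and m for the number of classes are assumptions, not taken from the slide), the probability of one data point X_i is a weighted sum over classes c_j:

```latex
P(X_i \mid \theta, \pi, S)
  = \sum_{j=1}^{m} \pi_j \, P(X_i \mid X_i \in c_j, \theta_j, S),
\qquad \sum_{j=1}^{m} \pi_j = 1 .
```

Here S stands for the chosen model, as in the P(... | S) notation used on the next slide.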
7
Bayesian approach
Simplifications
–Stating a simplified model
–Assume attributes are independently distributed
P(X_i ∈ c_j | S) requires class description
–Expectation Maximization (EM)
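A minimal sketch of the EM idea named above, assuming one Gaussian attribute and one discrete attribute that are independent within each class. The function name em_mixture, the add-one smoothing, the fixed iteration count and the lower bound on the standard deviation are illustrative assumptions; this is not AutoClass's actual procedure.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_mixture(data, init_means, n_symbols, n_iter=50, min_sd=1e-3):
    """Fit a mixture by EM.

    data: list of (x, k) pairs, x a float, k an int in range(n_symbols).
    init_means: starting mean of x for each class (one entry per class).
    Returns (weights, means, sds, probs, resp), where resp[i][j] is the
    estimated membership probability P(X_i in c_j).
    """
    m = len(init_means)
    weights = [1.0 / m] * m
    means = list(init_means)
    sds = [1.0] * m
    probs = [[1.0 / n_symbols] * n_symbols for _ in range(m)]
    resp = []
    for _ in range(n_iter):
        # E-step: membership probabilities; the attributes are assumed
        # independent, so their class likelihoods simply multiply.
        resp = []
        for x, k in data:
            joint = [
                weights[j] * gaussian_pdf(x, means[j], sds[j]) * probs[j][k]
                for j in range(m)
            ]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate each class description from weighted data.
        for j in range(m):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, (x, _) in zip(resp, data)) / nj
            var = sum(
                r[j] * (x - means[j]) ** 2 for r, (x, _) in zip(resp, data)
            ) / nj
            sds[j] = max(math.sqrt(var), min_sd)
            for k in range(n_symbols):
                count = sum(r[j] for r, (_, kk) in zip(resp, data) if kk == k)
                # Add-one smoothing (an assumption) avoids zero probabilities.
                probs[j][k] = (count + 1.0) / (nj + n_symbols)
    return weights, means, sds, probs, resp
```

Calling em_mixture(data, [1.0, 9.0], n_symbols=2) on points clustered near 0 and near 10 separates them into two classes. The sketch does not guard against a class that loses all its members or a point far from every class, which a real implementation would need to handle.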
8
How many classes?
Again Bayes’ rule
P(m) favours smaller number of classes
–No over-fitting of data (as happens with maximum likelihood methods)
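The equation for this slide is missing from the transcript. Bayes' rule applied to the number of classes m can be written as follows, where X denotes the full data set and S the model (both symbols are assumptions carried over from the neighbouring slides):

```latex
P(m \mid X, S) = \frac{P(X \mid m, S)\, P(m)}{P(X \mid S)}
\;\propto\; P(X \mid m, S)\, P(m) .
```

A prior P(m) that decreases with m penalises extra classes, so adding a class must improve the fit enough to outweigh that penalty. A pure maximum likelihood criterion has no such term and its fit never gets worse as m grows.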
9
A Toy Example
Dihedral preference of Valine
Four interesting degrees of freedom
–φ-, ψ-dihedral angle
–Adjacent amino acid types (positions i-1 and i+1)
Data: 893 non-redundant proteins
–12074 four-dimensional data points
10
Valine Data Classification
AutoClass classification
–Model: Gaussian distribution for φ/ψ, discrete probabilities for amino acids
–Total of 50 tries with #classes ∈ [2:11]
–Each try refined until fully converged
Best classification has 5 classes
11
Amino Acid Attribute vectors of α-helix Classes
[Figure: log-preferences]
12
Re-invention of the Wheel
Textbook secondary structure pattern
–Helices are likely on outside of proteins
–i, i+3 and i+4 hydrophobic interface
From C.-I. Branden and J. Tooze, Introduction to Protein Structure
13
Fragment-based Protein Scoring
Find classification for fragments of size 7 residues
–237566 fragments (1494 non-redundant protein chains)
–28 descriptors
  7 amino acid types
  14 φ/ψ-dihedral angles
  7 numbers of neighbours (one per amino acid)
200 CPU hours on National Facility computers
325 classes (modelling the probability distribution of native fragments)
Use this classification to evaluate likelihood of a fragment sequence-structure match
Total score = Σ fragment scores
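One plausible reading of the scoring scheme above is sketched here: each overlapping window is scored by its log-likelihood under the class mixture, and the window scores are summed. This is a simplified illustration, not the author's implementation. It uses only one discrete descriptor per position (the amino acid letter) and leaves out the dihedral and neighbour descriptors; the function names, the class data layout and the floor probability for unseen symbols are all assumptions.

```python
import math

FRAGMENT_LEN = 7  # fragment size stated on the slide


def fragment_log_likelihood(fragment, classes, floor=1e-6):
    """Log-likelihood of one fragment under a mixture of classes.

    fragment: sequence of symbols, one per position.
    classes: list of (weight, position_probs), where position_probs is a
             list of dicts mapping symbol -> probability, one per position.
    floor: probability used for symbols a class has never seen (assumed).
    """
    total = 0.0
    for weight, position_probs in classes:
        p = weight
        # Positions are treated as independent within a class.
        for symbol, probs in zip(fragment, position_probs):
            p *= probs.get(symbol, floor)
        total += p
    return math.log(total)


def total_score(sequence, classes, length=FRAGMENT_LEN):
    """Sum of fragment scores over all overlapping windows."""
    return sum(
        fragment_log_likelihood(sequence[i:i + length], classes)
        for i in range(len(sequence) - length + 1)
    )
```

With the default length of 7, a chain of N residues contributes N - 6 overlapping fragments. A sequence shorter than one fragment yields a score of 0.0 because there are no windows to sum.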
14
Fold Recognition = Computer Matchmaking
[Figure: Structure Disco]
15
Does it work?
[Figure panels: Discrimination (TIM 1amk_); Generalisation]
16
Sequence-Structure Matching
The search problem
Gapped alignment = combinatorial nightmare
17
Why is Fold Recognition better than Sequence Comparison?
Comparison is done in structure space, not in sequence space
18
Finding Remote Homologues with sausage
572 sequence-structure pairs
–Structures are similar (FSSP): > 70% structurally aligned
–< 20% sequence identity
19
RNA-dependent RNA Polymerases
20
A Real Case Example
RNA-dependent RNA polymerases
–Dengue virus
–Bacteriophage φ6
21
Is this Yet Another Profile Method?
Yes, but a much more general profile method
–Profile is not residue based (like profile-like threading force fields)
–Profiles not for protein families (like in HMMs or Ψ-Blast)
–BUT local sequence profiles for optimally chosen classes of fragments
Local profiles can be arbitrarily assembled
–Extreme flexibility
Sequence-structure alignment (= assembling best profile matches)
–Deterministic, using dynamic programming
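To illustrate why dynamic programming makes the alignment deterministic, here is a generic global alignment sketch (Needleman-Wunsch style with a linear gap penalty). It is not sausage's algorithm: it assumes a precomputed table score[i][j] for placing sequence residue i at structure site j, whereas fragment-based scores span several residues. The function name align and the gap value are hypothetical.

```python
def align(score, gap=-1.0):
    """Global alignment by dynamic programming.

    score[i][j]: score for placing sequence residue i at structure site j.
    gap: penalty added for each unmatched residue or site (assumed linear).
    Returns (best_total_score, list of aligned (i, j) index pairs).
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # best[i][j] = best score aligning the first i residues to the first j sites
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # match i with j
                best[i - 1][j] + gap,                      # residue i unmatched
                best[i][j - 1] + gap,                      # site j unmatched
            )
    # Trace back from the corner to recover one optimal set of pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

The table has (n+1)(m+1) cells and each is filled once, so the cost is proportional to n times m rather than to the number of possible gapped alignments. The same input always produces the same alignment.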
22
People
sausage
–Andrew Torda (RSC)
–Oliver Martin (RSC)
GlnB/GlnK, RdR polymerases
–Subhash Vasudevan (JCU)
Sausage and Cassandra freely available
http://rsc.anu.edu.au/~torda
huber@maths.uq.edu.au