EC in Bioinformatics and Drug Design Dr. Michael L. Raymer Department of Computer Science and Engineering
Bioinformatics (Computational Molecular Biology) Genomics, functional genomics, proteomics, structural bioinformatics.
DNA is the blueprint for life. Every cell in your body has 23 pairs of chromosomes in its nucleus, and the genes on these chromosomes determine all of your physical attributes. Image source: Crane digital, http://www.cranedigital.com/
Mapping the Genome The Human Genome Project has provided us with a draft of the entire human genome (as of July 2000): four bases (A, T, C, G), 3.12 billion base pairs. About 99% of these are the same between any two people; polymorphisms are the positions where they differ. Everything that makes you different from every other human (as far as we know) is encoded in this information.
How does the code work? DNA is the template for the construction of proteins. Inherited disease: broken DNA → broken proteins. Viral disease: foreign DNA/RNA → foreign proteins. Bacterial disease: invading organisms with their own proteins.
The genome is information: TATAAGCTGACTGTCACTGA. Four bases (A, G, C, T); each triplet of bases forms one codon, and codons encode the 20 amino acids (illustrated with structure 3apr.pdb). The representation is in many ways much more complex than the representations used in EC.
Proteins: Molecular machinery The proteins in your muscles, myosin and actin, allow you to move.
Proteins: Molecular machinery Enzymes (digestion, catalysis) Structure (collagen)
Proteins: Molecular machinery Signaling (hormones, kinases) Transport (energy, oxygen) Image source: Crane digital, http://www.cranedigital.com/
Growth of biological databases [Figures: GenBank base-pair growth (source: GenBank); growth in 3D structures (source: http://www.rcsb.org/pdb/holdings.html)]
Applications What can we do with all this information? Cure diseases – computational drug design Model protein structure Understand relationships between organisms (phylogenies)
Example Case: HIV Protease The three most important classes of human disease are genetic, viral, and bacterial. Exposure and infection: HIV enters your cell; your own cell reads the HIV “code” and creates the HIV proteins; the new viral proteins prepare HIV for infection of other cells. © George Eade, Eade Creative Services, Inc. http://whyfiles.org/035aids/index.html
HIV Protease as a drug target Treatment strategies differ by disease class. Genetic: fix the DNA, introduce new DNA, or synthesize the missing protein (example: diabetes – we use insulin now, but introducing DNA would be a better solution). Viral and bacterial: attack protein drug targets. Many drugs bind to protein active sites. With a peptidyl inhibitor already bound in its active site, HIV protease can no longer prepare HIV proteins for infection (HIV protease + peptidyl inhibitor, 1A8G.PDB).
Drug Discovery Target identification: what protein can we attack to stop the disease from progressing? Lead discovery & optimization: what sort of molecule will bind to this protein? Toxicology: does it kill the patient? Does it have side effects? Does it get to the problem spots?
Drug Development Life Cycle (7-15 years, $600-700 million!) Discovery (2 to 10 years); preclinical testing (lab and animal testing); Phase I (20-30 healthy volunteers used to check for safety and dosage); Phase II (100-300 patient volunteers used to check for efficacy and side effects); Phase III (1000-5000 patient volunteers used to monitor reactions to long-term drug use); FDA review & approval; post-marketing testing.
Drug lead screening 5,000 to 10,000 compounds screened → 250 lead candidates in preclinical testing → 5 drug candidates enter clinical testing (80% pass Phase I, 30% pass Phase II, 80% pass Phase III) → one drug approved by the FDA.
Finding drug leads Once we have a target, how do we find some compounds that might bind to it? The old way: exhaustive screening The new way: computational screening!
Chemistry 101 Like everything else in the universe, proteins and drugs are made up of atoms. Some atoms, like oxygen, tend to have a negative charge. Some, like nitrogen, tend to be positively charged. When the two come together, they attract like magnets.
The Goal [Figure: a drug lead bound to its target protein.]
Drug Lead Screening & Docking When searching for a drug lead, we look for complementarity between the candidate molecule and the protein: shape complementarity, chemical complementarity (hydrogen bonding potential, hydrophobic interaction), and electrostatic complementarity.
But it gets more complicated About 60% of our body weight is water. Most proteins are surrounded by water molecules, which interact with protein-drug complexes.
Protein-water interactions Water has both negatively and positively charged atoms, so it can bridge gaps between drugs and proteins. [Figure: a water molecule bridging the ligand and the protein surface.]
Will the water stay? When a drug comes close to a protein, some of the water molecules are displaced. [Figure: the ligand displacing water molecules at the protein surface.]
Pattern Recognition Model A sample is measured (by a transducer, etc.) to produce raw measurement data, which is reduced to a feature vector (f1 f2 f3 f4 f5): the pattern. The number of features is denoted d. Before describing how we improved on the pattern recognizers we used, this defines the pattern recognition terminology.
Training and Testing Labeled training data (feature vectors f1 through f5, each tagged with its class, e.g. C or N) are used to train the classifier; the trained classifier then produces a classification/prediction for each unknown test sample.
Temperature Factor (B-Value) How wiggly is it? Here a protein (dihydrofolate reductase) is colored by temperature factor.
Atomic Density (ADN) How crowded is it? The atomic density of this water molecule is 5.
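A minimal sketch of computing this feature; the slide gives no cutoff radius, so the 3.6 Å value below is purely an assumed parameter:

```python
import numpy as np

def atomic_density(water_pos, atom_coords, radius=3.6):
    """Count atoms within `radius` angstroms of the water position."""
    dists = np.linalg.norm(np.asarray(atom_coords) - np.asarray(water_pos), axis=1)
    return int(np.sum(dists <= radius))
```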
Prediction of water molecules Blue spheres are water molecules predicted to stay in the active site. Wire mesh spheres are water molecules predicted to be displaced — booted out by the ligand.
Nearest Neighbor Classification The specific pattern recognizer we chose to experiment with first was kNN: it is simple, well understood in the literature, and gives good results. [Figure: class 1 and class 2 training samples plotted on feature 1 vs. feature 2, with a new unknown (test) sample.] A sketch follows.
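A minimal sketch of kNN classification as described (data and feature names are hypothetical):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]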
Feature-Weighted kNN Scaling (extending) a feature axis changes which training samples are the nearest neighbors of an unknown point, and a weight of 0 means: don't use the feature. GOAL: a minimum set of features that classifies well. Benefits: faster calculation, data mining, and sometimes even better accuracy. [Figure: class 1, class 2, and unknown samples plotted on features 1 and 2, (a) before and (b) after feature 2 is extended.]
GA & kNN Interaction The genetic algorithm maintains a population of chromosomes, each encoding a masked weight vector (W1 W2 W3 W4 W5) and k. Each chromosome is passed to the kNN classifier, which evaluates it and returns a fitness. How is the fitness calculated?
Weighting and Masking How do we sample feature subsets? Treating any weight below a threshold value as zero samples subsets only slowly. Masking is more direct: distinct mutation rates, or multiple mask bits per feature (an intron effect). Classifier parameters (k) also go on the chromosome: W1 W2 W3 W4 W5 | M1 M2 M3 M4 M5 | k. See the sketch below.
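A minimal sketch of decoding such a chromosome and using it in the distance computation (the flat layout and the 0.5 mask threshold are assumptions for illustration):

```python
import numpy as np

def decode(chrom, d):
    """Split a flat chromosome into weights, mask bits, and k (assumed layout)."""
    weights = np.asarray(chrom[:d])
    mask = np.asarray(chrom[d:2*d]) > 0.5  # mask bit decides: use feature or not
    k = max(1, int(chrom[2*d]))
    return weights, mask, k

def masked_weighted_dist(a, b, weights, mask):
    """Euclidean distance over unmasked features, each scaled by its weight."""
    diff = (a - b) * weights * mask
    return np.sqrt(np.sum(diff**2))
```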
The Cost Function We can direct the search toward any objective: classification accuracy, class balance, and feature subset parsimony (reducing d). The GA minimizes the cost function (a plausible form is sketched below).
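The cost function itself is not shown on the slide; a plausible weighted sum of the three stated objectives, with illustrative coefficients c_a, c_b, c_p, is:

```latex
\mathrm{Cost} = c_a\,(1 - \mathrm{accuracy})
             + c_b\,\bigl|\mathrm{acc}_{\omega_1} - \mathrm{acc}_{\omega_2}\bigr|
             + c_p\,\frac{d_{\mathrm{used}}}{d}
```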
UCI Data Set Results Typically we see modest gains in classification accuracy, with significant reduction in features. Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html
Classical Methods & SFFS Thyroid data: best feature selection, with performance within 1% of the best accuracy. Appendicitis data: best feature selection. Weiss, S. and Kapouleas, I. (1990). An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. Morgan Kaufmann.
Feature Extraction Framework [Diagram: the genetic algorithm evolves features and classifier parameters; candidate solutions are passed to the classifier, and fitness is based on accuracy.]
Reducing Computation Time Most of the compute time is spent calculating distances and finding nearest neighbors. Branch-and-bound kNN [1] scales linearly with d and polynomially with n: O(n^c), 1.0 < c < 2.0. Can we use a faster classifier? [1] Fukunaga, K. and Narendra, P. M. (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 750–753.
The Bayes Classifier Properties are well understood and thoroughly explored in the literature Training data are summarized, classification of test samples is rapid Provably optimal when the multivariate feature distribution is known for each class
Class-conditional Distributions [Figure: class-conditional densities P(x) plotted against x.] Maximum-likelihood classification (assign x to the class whose density at x is highest) is optimal when the classes have equal prior probabilities and equal error costs. The approach generalizes to d dimensions.
Multiple Dimensions
The “Naïve” Bayes Classifier We now have P(xi|ωj) for each feature i and each class j. Naïve approach: assume all features are independent. This assumption is almost always false, but as long as monotonicity holds, the decision rule is still valid.
Unequal Prior Probabilities When we know the prior probabilities for the classes, we can use Bayes rule: P(ωj|x) = P(x|ωj)P(ωj) / P(x). Bayes decision rule: assign each sample to the class for which the posterior probability is highest.
Hybridizing the Bayes Classifier Unfortunately, the Bayes classifier is invariant to feature weighting: rescaling a feature rescales its histogram, but the probability mass is unchanged. For example, P(80 < x < 100) = 0.045 before scaling, and P(8 < x < 10) = 0.045 after the feature is scaled by 0.1. [Figure: proportion of training samples vs. feature value, before and after scaling.]
Bayes Discriminant Function Bayes decision rule: decide ω1 if P(ω1|x) > P(ω2|x), otherwise decide ω2. Two-class discriminant function (in its standard form): g(x) = P(x|ω1)P(ω1) - P(x|ω2)P(ω2); decide ω1 when g(x) > 0.
Naïve Bayes Discriminant Independence assumption: P(x|ωj) = Πi P(xi|ωj), so each class-conditional density is replaced by the product of its marginals. A sketch of the resulting classifier follows.
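A minimal sketch of the resulting naive Bayes classifier, using histogram estimates of the marginals (the binning and add-one smoothing are assumptions for illustration):

```python
import numpy as np

def fit_marginals(X, bins=10):
    """Histogram estimate of P(x_i | class) per feature, with add-one smoothing."""
    hists, edges = [], []
    for i in range(X.shape[1]):
        h, e = np.histogram(X[:, i], bins=bins)
        hists.append((h + 1) / (h.sum() + bins))  # smoothed bin probabilities
        edges.append(e)
    return hists, edges

def log_likelihood(x, hists, edges):
    """Sum of log marginals: the independence (naive Bayes) assumption."""
    ll = 0.0
    for i, xi in enumerate(x):
        b = np.clip(np.searchsorted(edges[i], xi) - 1, 0, len(hists[i]) - 1)
        ll += np.log(hists[i][b])
    return ll

def classify(x, models, log_priors):
    """Assign the class with the largest log posterior (up to a constant)."""
    scores = {c: log_priors[c] + log_likelihood(x, *models[c]) for c in models}
    return max(scores, key=scores.get)
```

Here models maps each class to the (hists, edges) pair returned by fit_marginals on that class's training samples.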
A Parameterized Discriminant C1, C2 … Cd are optimized by an evolutionary algorithm.
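The discriminant itself is not shown on the slide. A plausible parameterization, consistent with per-feature exponents C1, …, Cd being evolved, raises each marginal probability to a power (this form is an assumption, not the author's confirmed formula):

```latex
g(\mathbf{x}) = P(\omega_1)\prod_{i=1}^{d} P(x_i \mid \omega_1)^{C_i}
             - P(\omega_2)\prod_{i=1}^{d} P(x_i \mid \omega_2)^{C_i}
```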
An Alternative Discriminant Sum of weighted marginal probabilities
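Again the exact formula is not shown; a natural reading of “sum of weighted marginal probabilities,” with the weights wi evolved as before, would be:

```latex
g(\mathbf{x}) = \sum_{i=1}^{d} w_i \left[ P(x_i \mid \omega_1) - P(x_i \mid \omega_2) \right]
```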
Conserved vs. Non-Conserved Accuracy can be improved by using the k value as a confidence indicator and allowing “don’t know” results.
Cosine-based kNN Classifier A cosine similarity metric finds the k points with the most similar angles. The cosine between two vectors xi and xj is defined as cos(xi, xj) = (xi · xj) / (||xi|| ||xj||). [Figure: class A and class B points with a test pattern; k = 5 classification: among the 5 points with the smallest angles with respect to the test point, 3 are class A, so the test point is labeled class A.] Once the k most similar points have been identified, the class label is assigned by a vote q over those neighbors, where c(xi) = 1 if xi belongs to the positive class and c(xi) = -1 if it belongs to the negative class. If q is positive, the query point is assigned to the positive class; otherwise it is assigned to the negative class.
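A minimal sketch of this rule; the slide does not say whether the vote is similarity-weighted, so the plain sum q = Σ c(xi) over the k neighbors is an assumption:

```python
import numpy as np

def cos_knn_classify(X_train, labels, x, k=5):
    """labels: array of +1/-1 per training point. Returns +1 or -1 for query x."""
    sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
    nearest = np.argsort(sims)[-k:]  # k largest cosines = k smallest angles
    q = labels[nearest].sum()        # q = sum of c(x_i) over the neighbors
    return 1 if q > 0 else -1
```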
Feature Extraction Techniques Shifting the origin in the search space may affect classification: after the origin is shifted, the test point, originally labeled class A, is relabeled class B. Assigning different weights to each feature may also affect classification, to a lesser extent: after feature 2 is extended, the test point, originally labeled class B, is relabeled class A. [Figure: class A and class B points with a test pattern, before and after the origin shift and the feature 2 extension.]
GA/Classifier Hybrid Architecture The genetic algorithm evolves a population of chromosomes, each holding a weight vector (W1 W2 ... W8: the weights to use for each feature axis during classification), an offset vector (O1 O2 ... O8: feature offsets for the cosine point of reference during classification), and K, which is also optimized. Each chromosome is evaluated by the cosine kNN classifier; fitness is based on the number of correct predictions using the weight vector and the number of masked features.
Population-Adaptive Mutation When a feature is chosen for mutation, its range of possible mutation depends on the variance of that feature across the genetic algorithm’s population. In early generations variance is high, so the range is larger; later, as the population begins to converge, variance decreases and the range is smaller. A sketch follows.
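A minimal sketch of population-adaptive mutation as described; the scale factor on the per-gene standard deviation is an assumed parameter:

```python
import numpy as np

def adaptive_mutate(population, rng, rate=0.1, scale=1.0):
    """Mutate genes with a step range tied to the population's per-gene spread."""
    pop = np.asarray(population, dtype=float)
    stds = pop.std(axis=0)                 # per-gene spread; shrinks as the GA converges
    mutate = rng.random(pop.shape) < rate  # which genes get mutated this generation
    steps = rng.uniform(-1, 1, pop.shape) * stds * scale
    return pop + mutate * steps
```

Because the step size is tied to the population's per-gene standard deviation, mutation automatically becomes finer as the population converges, with no explicit annealing schedule.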
Probe Site Generation Aspartic protease (2apr) with crystallographically observed (yellow) and computer-generated (green) water molecules.
Standard Classifier Results All of the above classifiers are from the WEKA collection of machine learning algorithms and use 10-fold cross-validation, except for the Euclidean kNN/GA classifier, another EC-optimized classifier from Michael Raymer's research, which uses bootstrap validation.
cosKnn/GA Classifier Results All of the above results are from runs during July 2002; the training sets from each run were not preserved.
Results
Protein Structure Modeling The holy grail of computational biology: given the sequence of amino acids in a protein, how will it fold?
Target-to-Template Alignment [Diagram: the protein structure modeling pipeline, from fold recognition (profile vs. profile alignment and scoring of the target protein against known structures) through target-to-template alignment optimization, core evaluation, fragment selection, and key residue comparison, to the final model.] Traditionally, an expert carries out the middle section of this pipeline; that is the part we are automating. MST automates the evaluation of the core.
The first four strands of OB-folds Shown is the superimposed core of 20 OB-folds; these fragments define the family. The first beta strand is colored red purely for visualization. Molnir automatically locates the common core from a multiple structure alignment; 20 cores are shown.
The fifth strand, clustered Clustering a variable secondary structure unit: we want to select one of these three clusters for each chimeric structure. Molnir performs the clustering automatically.
The second helix, clustered Another variable secondary structure unit (vSSU), shown clustered. With 3 clusters of the last vSSU and 6 of this one, 18 combinations are possible, some of which are not realized in nature. There is also a third vSSU, and a total of 8 loops. If any set of loops can occur with any set of SSUs, we have hundreds of millions of possible combinations; even using only representative structures (one from each cluster), millions of possible combinations remain.
Selecting a Model A genetic algorithm evolves a population of models, letting us search quickly through millions of possibilities. Start with ~50 models: the original 20 plus 30 random. Double the population size by mutation and crossover, test each of the new structures, and dismiss half. Fitness function: for now, the target-to-template alignment score (see the Target-to-Template Alignment slide) serves as the fitness; we will eventually use the MST, because it searches for an alignment that has a good hydrophobic core (the “evaluate the core” step above). A sketch of the loop follows.
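A minimal sketch of this double-and-cull loop; mutate, crossover, and score stand in for the real operators and the alignment-based fitness:

```python
import random

def select_model(models, score, mutate, crossover, generations=100):
    """Double the population by variation, keep the better half each generation."""
    pop = list(models)                   # e.g., the 20 originals plus 30 random models
    for _ in range(generations):
        children = []
        while len(children) < len(pop):  # double the population size
            a, b = random.sample(pop, 2)
            children.append(mutate(crossover(a, b)))
        # test old and new structures together, dismiss the worse half
        pop = sorted(pop + children, key=score, reverse=True)[:len(pop)]
    return max(pop, key=score)           # best model found
```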