1
EC in Bioinformatics and Drug Design
Dr. Michael L. Raymer, Department of Computer Science and Engineering
2
Bioinformatics / Computational Molecular Biology
Genomics, functional genomics, proteomics, structural bioinformatics
3
DNA is the blueprint for life
Every cell in your body has 23 pairs of chromosomes in the nucleus. The genes in these chromosomes determine all of your physical attributes. Image source: Crane Digital.
4
Mapping the Genome. The human genome project has provided us with a draft of the entire human genome (as of July 2000). Four bases: A, T, C, G; 3.12 billion base-pairs; 99% of these are the same; polymorphisms are where they differ. Everything that makes you different from every other human (as far as we know) is encoded in this information.
5
How does the code work? Template for construction of proteins
Inherited disease: broken DNA → broken proteins. Viral disease: foreign DNA/RNA → foreign proteins. Bacterial disease: organisms with their own proteins.
6
The genome is information:
TATAAGCTGACTGTCACTGA. Four bases (A, G, C, T); three bases form one codon; 20 amino acids (structure: 3apr.pdb). The representation is much more complex in many ways than in EC.
7
Proteins: Molecular machinery
The proteins in your muscles, myosin and actin, allow you to move.
8
Proteins: Molecular machinery
Enzymes (digestion, catalysis) Structure (collagen)
9
Proteins: Molecular machinery
Signaling (hormones, kinases); transport (energy, oxygen). Image source: Crane Digital.
10
Growth of biological databases
(Charts: GenBank base-pair growth, source: GenBank; growth in the number of 3D structures.)
11
Applications What can we do with all this information?
Cure diseases (computational drug design); model protein structure; understand relationships between organisms (phylogenies).
12
Example Case: HIV Protease
Exposure & infection HIV enters your cell Your own cell reads the HIV “code” and creates the HIV proteins. New viral proteins prepare HIV for infection of other cells. 3 most important classes of human disease Genetic Viral Bacterial © George Eade, Eade Creative Services, Inc.
13
HIV Protease as a drug target
Many drugs bind to protein active sites. This HIV protease can no longer prepare HIV proteins for infection because an inhibitor is already bound in its active site. Genetic disease: fix the DNA, introduce new DNA, or synthesize the missing protein (example: diabetes, treated with insulin today, though introducing DNA would be a better solution). Viral and bacterial disease: proteins are the drug targets. Image: HIV protease + peptidyl inhibitor (1A8G.PDB).
14
Drug Discovery
Target identification: What protein can we attack to stop the disease from progressing?
Lead discovery & optimization: What sort of molecule will bind to this protein?
Toxicology: Does it kill the patient? Does it have side effects? Does it get to the problem spots?
15
Drug Development Life Cycle
Discovery (2 to 10 years)
Preclinical testing (lab and animal testing)
Phase I (20-30 healthy volunteers used to check for safety and dosage)
Phase II (patient volunteers used to check for efficacy and side effects)
Phase III (patient volunteers used to monitor reactions to long-term drug use)
$ Million!
FDA review & approval
Post-marketing testing
7 – 15 years!
16
Drug lead screening 5,000 to 10,000 compounds screened
250 lead candidates in preclinical testing
5 drug candidates enter clinical testing
80% pass Phase I; 30% pass Phase II; 80% pass Phase III
One drug approved by the FDA
17
Finding drug leads Once we have a target, how do we find some compounds that might bind to it? The old way: exhaustive screening The new way: computational screening!
18
Chemistry 101 Like everything else in the universe, proteins and drugs are made up of atoms. Some atoms, like oxygen, tend to have a negative charge. Some, like nitrogen, tend to be positively charged. When the two come together, they attract like magnets.
19
The Goal: Drug lead Protein
20
Drug Lead Screening & Docking
When searching for a drug lead, we look for complementarity between drug and protein: shape complementarity, electrostatic complementarity, hydrogen-bonding potential, and hydrophobic interactions.
21
But it gets more complicated
About 60% of our body weight is water. Most proteins are surrounded by water molecules, which interact with protein–drug complexes.
22
Protein-water interactions
Water has both negatively and positively charged atoms. It can bridge gaps between drugs and proteins. (Diagram: protein surface, ligand, water molecule.)
23
Will the water stay? When a drug comes close to a protein, some of the water molecules are displaced. (Diagram: protein surface, ligand, water molecule.)
24
Pattern Recognition Model
Sample → transducer → raw measurement data → feature vector (f1, f2, f3, f4, f5) = pattern, with d features. Before describing how we improved on the pattern recognizers we used, we define the pattern-recognition terminology.
25
Training and testing: labeled training data (feature vectors f1…f5, each labeled C or N) are used to build the classifier; the trained classifier then produces a classification/prediction for new, unlabeled samples.
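A minimal sketch of this training/testing workflow in plain NumPy, using a 1-nearest-neighbor rule; the feature values and "C"/"N" labels are made up for illustration and are not from the original data.

```python
import numpy as np

# Hypothetical labeled training data: rows are feature vectors (f1..f5),
# labels are "C" (class 1) or "N" (class 2).
X_train = np.array([[0.2, 1.1, 3.0, 0.5, 7.2],
                    [0.8, 0.9, 2.5, 0.4, 6.8],
                    [2.1, 4.0, 0.3, 1.9, 1.1],
                    [2.4, 3.8, 0.2, 2.2, 0.9]])
y_train = np.array(["C", "C", "N", "N"])

def classify_1nn(x, X, y):
    """Assign the label of the single nearest training sample."""
    dists = np.linalg.norm(X - x, axis=1)
    return y[np.argmin(dists)]

# Classify a held-out (unknown) sample and compare to its true label.
x_test, y_true = np.array([0.5, 1.0, 2.8, 0.45, 7.0]), "C"
y_pred = classify_1nn(x_test, X_train, y_train)
print(y_pred == y_true)  # True -> counts toward test accuracy
```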
26
Temperature Factor (B-Value)
How wiggly is it? Here a protein (dihydrofolate reductase) is colored by temperature factor.
27
Atomic Density (ADN) How crowded is it?
The atomic density of this water molecule is 5.
28
Prediction of water molecules
Blue spheres are water molecules predicted to stay in the active site. Wire mesh spheres are water molecules predicted to be displaced — booted out by the ligand.
29
Nearest Neighbor Classification
Feature 1 Feature 2 The specific pattern recognizer we chose to experiment with first was knn Simple Well understood in the literature Good results = class 1 training sample = class 2 training sample = new unknown (test) sample
30
Feature-Weighted kNN. (Diagrams a and b: class 1, class 2, and an unknown sample, with the feature-2 scale extended.)
Weight = 0 means "don't use the feature". GOAL: a minimum set of features that classifies well, giving faster calculation, data mining, and sometimes even better accuracy. A minimal weighted-kNN sketch follows below.
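The sketch below illustrates the idea under a simple assumption: each feature axis is multiplied by its weight before distances are computed, so a weight of 0 removes the feature from the distance entirely. The data and weights are illustrative only.

```python
import numpy as np
from collections import Counter

def weighted_knn(x, X_train, y_train, weights, k=3):
    """kNN with per-feature weights; weight 0 drops a feature from the distance."""
    diff = (X_train - x) * weights            # scale each feature axis by its weight
    dists = np.sqrt((diff ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative data: only features 1 and 3 carry signal; the rest are masked out.
X_train = np.array([[1.0, 9.0, 0.2, 5.0], [1.2, 1.0, 0.3, 5.1],
                    [4.0, 8.5, 3.9, 5.0], [4.2, 0.5, 4.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
weights = np.array([1.0, 0.0, 2.5, 0.0])      # weight = 0 -> "don't use feature"

print(weighted_knn(np.array([1.1, 5.0, 0.25, 5.0]), X_train, y_train, weights, k=3))
```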
31
GA & knn Interaction ... Masked Weight Vector & k
Genetic Algorithm Masked Weight Vector & k W1 W2 W3 W4 W5 KNN Classifier W1 W2 W3 W4 W5 W1 W2 W3 W4 W5 W1 W2 W3 W4 W5 W2 ... W1 Fitness — How is it calculated?
32
Weighting and Masking How do we sample feature subsets?
Weight below a threshold value: slow sampling. Masking: distinct mutation rates, or multiple mask bits (intron effect). The classifier parameter k is also placed on the chromosome: W1 W2 W3 W4 W5 | M1 M2 M3 M4 M5 | k. One plausible decoding of this layout is sketched below.
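A sketch of one way to decode such a chromosome before handing it to the classifier; the exact encoding (value ranges, mask threshold) used in the original work is not given here, so this layout is an assumption.

```python
import numpy as np

def decode_chromosome(chrom, n_features=5):
    """Split a flat chromosome into effective weights and k.

    Assumed layout: [W1..Wn | M1..Mn | k], with mask genes thresholded at 0.5
    and the effective per-feature weight equal to weight * mask.
    """
    weights = np.asarray(chrom[:n_features], dtype=float)
    masks = (np.asarray(chrom[n_features:2 * n_features]) > 0.5).astype(float)
    k = max(1, int(round(chrom[2 * n_features])))
    return weights * masks, k

chrom = [0.8, 2.3, 0.1, 1.7, 0.4,   # W1..W5
         1, 0, 1, 1, 0,             # M1..M5 (0 masks the feature out)
         3]                         # k
effective_weights, k = decode_chromosome(chrom)
print(effective_weights, k)         # -> [0.8 0.  0.1 1.7 0. ] 3
```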
33
The Cost Function We can direct the search toward any objective.
Classification accuracy, class balance, and feature-subset parsimony (reduce d). The GA minimizes the cost function; a sketch of one such cost function follows below.
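A minimal sketch of a cost function combining the three objectives named above (error rate, class-balance penalty, and number of features used). The weighting coefficients are illustrative, not the values used in the original work.

```python
import numpy as np

def cost(y_true, y_pred, mask, w_err=1.0, w_bal=0.5, w_feat=0.01):
    """Lower is better: error rate + class-balance penalty + feature count."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    error_rate = np.mean(y_true != y_pred)

    # Class balance: penalize differences in per-class accuracy.
    per_class_acc = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    imbalance = max(per_class_acc) - min(per_class_acc)

    n_features_used = int(np.sum(np.asarray(mask) > 0))
    return w_err * error_rate + w_bal * imbalance + w_feat * n_features_used

print(cost([0, 0, 1, 1], [0, 1, 1, 1], mask=[1, 0, 1, 1, 0]))
```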
34
UCI Data Set Results Typically we see modest gains in classification accuracy, with significant reduction in features. Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences.
35
Classical Methods & SFFS
Thyroid data: best feature-selection performance, within 1% of the best accuracy. Appendicitis data: best feature selection. Weiss, S. and Kapouleas, I. (1990). An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. Morgan Kaufmann.
36
Feature Extraction Framework
The genetic algorithm proposes features and classifier parameters; the classifier evaluates each candidate; fitness is based on accuracy.
37
Reducing Computation Time
Most of the compute time is spent calculating distances and finding nearest neighbors. Branch-and-bound kNN [1] scales linearly with d and polynomially with n: O(n^c), 1.0 < c < 2.0. Can we use a faster classifier? [1] Fukunaga, K. and Narendra, P. M. (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 750–753.
38
The Bayes Classifier. Its properties are well understood and thoroughly explored in the literature; training data are summarized, so classification of test samples is rapid; it is provably optimal when the multivariate feature distribution is known for each class.
39
Class-conditional Distributions
(Plot: class-conditional density P(x) versus x.) MLE is optimal when prior probabilities are equal and error costs are equal; the approach generalizes to d dimensions.
40
Multiple Dimensions
41
The “Naïve” Bayes Classifier
We now have P(xi|ωj) for each feature i and each class j. Naïve approach: assume all features are independent. This assumption is almost always false, but as long as monotonicity holds, the decision rule is still valid. A minimal sketch follows below.
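A minimal naïve Bayes sketch matching the description above: per-feature class-conditional probabilities are estimated from training data and multiplied under the independence assumption. Gaussian marginals are used here purely as a stand-in (an assumption); the slides estimate marginals as proportions of training samples.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate per-class priors and per-feature Gaussian parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0) + 1e-9)
    return params

def predict(x, params):
    """Pick the class maximizing prior * product of marginal likelihoods (in log space)."""
    def log_score(prior, mu, sigma):
        log_marg = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
        return np.log(prior) + log_marg.sum()   # independence: sum of log P(xi | class)
    return max(params, key=lambda c: log_score(*params[c]))

X = np.array([[1.0, 5.0], [1.2, 4.8], [3.9, 1.0], [4.1, 1.2]])
y = np.array([0, 0, 1, 1])
print(predict(np.array([1.1, 4.9]), fit_naive_bayes(X, y)))   # -> 0
```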
42
Unequal Prior Probabilities
When we know the prior probabilities for the classes, we can use Bayes' rule: P(ωj|x) = P(x|ωj) P(ωj) / P(x). Bayes decision rule: assign x to the class for which the posterior probability is highest.
43
Hybridizing the Bayes Classifier
P(80 < x < 100) = 0.045 Unfortunately the Bayes classifier is invariant to feature weighting Proportion of Training Samples P(8 < x < 10) = 0.045 Feature Value
44
Bayes Discriminant Function
Bayes decision rule: decide ω1 if P(ω1|x) > P(ω2|x), otherwise decide ω2. Two-class discriminant function: g(x) = P(x|ω1) P(ω1) − P(x|ω2) P(ω2); decide ω1 when g(x) > 0.
45
Naïve Bayes Discriminant
Independence assumption: P(x|ωj) = ∏i P(xi|ωj), so the discriminant becomes g(x) = P(ω1) ∏i P(xi|ω1) − P(ω2) ∏i P(xi|ω2).
46
A Parameterized Discriminant
C1, C2 … Cd are optimized by an evolutionary algorithm.
47
An Alternative Discriminant
Sum of weighted marginal probabilities
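The exact parameterizations used on these two slides are not reproduced here, so the following sketch only guesses at two plausible forms consistent with the descriptions: exponent coefficients on the marginals for the product (naïve Bayes) discriminant, and coefficient weights for the "sum of weighted marginal probabilities" alternative. Both forms, and the numbers, are assumptions for illustration.

```python
import numpy as np

def product_discriminant(marg1, marg2, prior1, prior2, C):
    """Hypothetical parameterized naive-Bayes discriminant (an assumption):
    g(x) = P(w1) * prod_i P(xi|w1)**Ci  -  P(w2) * prod_i P(xi|w2)**Ci,
    with the exponents C tuned by the evolutionary algorithm."""
    return prior1 * np.prod(marg1 ** C) - prior2 * np.prod(marg2 ** C)

def sum_discriminant(marg1, marg2, C):
    """Alternative form: sum of weighted marginal probabilities,
    g(x) = sum_i Ci * (P(xi|w1) - P(xi|w2))."""
    return np.sum(C * (marg1 - marg2))

marg1 = np.array([0.30, 0.10, 0.40])   # P(xi | class 1) for each feature
marg2 = np.array([0.05, 0.20, 0.35])   # P(xi | class 2)
C = np.array([1.5, 0.2, 1.0])          # coefficients evolved by the EA
print(product_discriminant(marg1, marg2, 0.5, 0.5, C) > 0,
      sum_discriminant(marg1, marg2, C) > 0)
```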
48
Conserved vs. Non-Conserved
Accuracy can be improved by using the k value as a confidence indicator and allowing "don't know" results. (Figure: features colored from higher weights to lower weights.)
49
Cosine-based kNN Classifier
A cosine similarity metric finds the k points with the most similar angles. The cosine between two vectors xi and xj is defined as cos(xi, xj) = (xi · xj) / (||xi|| ||xj||). (Diagram: class A, class B, and a test pattern plotted against feature 1 and feature 2.) k = 5 classification: among the 5 points with the smallest angles with respect to the test point, 3 are class A, so the test point is labeled class A. Once the k most similar points have been identified, the class label is assigned by a score q computed over those neighbors, where c(xi) = 1 if xi belongs to the positive class and −1 if xi belongs to the negative class. If q is positive, the query point is assigned to the positive class; otherwise it is assigned to the negative class. A sketch follows below.
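A minimal sketch of this cosine-based kNN rule: find the k training points with the highest cosine similarity to the query, then sum c(xi) = ±1 over those neighbors and take the sign. Whether the original classifier also weights each vote by its similarity is not stated here, so the unweighted vote is an assumption, and the data are illustrative.

```python
import numpy as np

def cosine_knn(x, X_train, y_train, k=5):
    """y_train entries are +1 / -1; returns +1 or -1 for the query point."""
    sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
    nearest = np.argsort(sims)[-k:]          # k largest cosines = smallest angles
    q = np.sum(y_train[nearest])             # sum of c(xi) over the k neighbors
    return 1 if q > 0 else -1

X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.15],
                    [0.1, 1.0], [0.2, 0.9]])
y_train = np.array([1, 1, 1, -1, -1])
print(cosine_knn(np.array([0.7, 0.3]), X_train, y_train, k=3))   # -> 1
```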
50
Feature Extraction Techniques
Assigning different weights to each feature may also affect classification, to a lesser extent. Shifting the origin in the search space may affect classification. (Diagrams: class A, class B, and a test pattern plotted against feature 1 and feature 2, with an origin shift and with feature 2 extended.) After the origin is shifted, the test point, originally labeled class A, is relabeled class B. After feature 2 is extended, the test point, originally labeled class B, is relabeled class A.
51
GA/Classifier Hybrid Architecture
The genetic algorithm evolves a population of chromosomes, each holding a weight vector (W1 W2 … W8: weights applied to each feature axis during classification), an offset vector (O1 O2 … O8: feature offsets for the cosine point of reference during classification), and K, which is also optimized. Each chromosome is passed to the cosine kNN classifier; fitness is based on the number of correct predictions obtained with the weight vector and on the number of masked features.
52
Population-Adaptive Mutation
When a feature is chosen for mutation, its range of possible mutation depends on the variance of that feature across the genetic algorithm's population. In early generations, variance is high, so the range is larger; later, as the population begins to converge, variance decreases and the range is smaller. (Diagram: mutation range between min and max.) A sketch follows below.
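A minimal sketch of population-adaptive mutation as described above: when a gene is mutated, the perturbation is drawn from a range proportional to that gene's spread (standard deviation here, as a stand-in for variance) across the current population, so the range shrinks as the population converges. The mutation rate and scale factor are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_mutate(population, mutation_rate=0.1, scale=1.0):
    """Mutate genes; the perturbation range tracks each gene's population spread."""
    pop = np.asarray(population, dtype=float)
    gene_std = pop.std(axis=0)                        # per-gene spread across the population
    mutate_mask = rng.random(pop.shape) < mutation_rate
    perturb = rng.uniform(-scale, scale, size=pop.shape) * gene_std
    return pop + mutate_mask * perturb

population = rng.uniform(0.0, 5.0, size=(50, 8))      # 50 chromosomes, 8 genes each
print(population.std(axis=0).round(2))                # wide early spread -> wide mutation range
print(adaptive_mutate(population).std(axis=0).round(2))
```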
53
Probe Site Generation Aspartic protease (2apr) with crystallographically observed (yellow) and computer-generated (green) water molecules.
54
Standard Classifier Results
All of the above classifiers are from the WEKA collection of machine learning algorithms and use 10-fold cross-validation, except for the Euclidean kNN/GA classifier, which is another EC-optimized classifier from Michael Raymer's research; it uses bootstrap validation.
55
cosKnn/GA Classifier Results
All of the above results are from runs during July. The training sets from each run were not preserved. (Figure: features colored from higher weights to lower weights.)
56
Results
57
Protein Structure Modeling
The holy grail of computational biology: given the sequence of amino acids in a protein, how will it fold? (Diagram: sequence TATAAGCTGACTGTCACTGA; traditional fold recognition.)
58
Target-to-Template Alignment
(Flowchart of protein structure modeling.) Fold recognition (profile vs. profile): the target protein is aligned against structures 1, 2, … and scored; this step is automated. Target-to-template alignment: optimize the alignment; traditionally the expert section of modeling, and this is what we're automating. MST automates the evaluation of the core: evaluate the core, then fragment selection; traditionally this relies on expert key-residue comparison. The result is the final model.
59
The first four strands of OB-folds...
Shown is the superimposed core of 20 OB-folds; these fragments define the family. Red marks the first beta strand, colored that way purely for visualization purposes. Molnir automatically locates the common core from a multiple structure alignment; 20 cores are shown.
60
The fifth strand, clustered
Clustering a variable secondary structure unit: we want to select one of these three clusters for each chimeric structure. Molnir clusters them automatically.
61
The second helix, clustered
Another variable secondary structure unit (vSSU), shown clustered. With 3 clusters of the last vSSU and 6 of this one, 18 combinations are possible, some of which are not realized in nature. There is also a third vSSU, and a total of 8 loops. If any set of loops can occur with any set of vSSUs, we have hundreds of millions of possible combinations. Even if we use only representative structures (one from each cluster), we still have millions of possible combinations.
62
Selecting a Model A genetic algorithm evolves a population of models
Start with ~50 models: the original 20 plus 30 random. Double the population size by mutation and crossover. Test each of the new structures and dismiss half. Fitness function: the GA lets us quickly search through millions of possibilities. Alignment is used as the fitness function (the target-to-template alignment step shown earlier). We will eventually use the MST, because it searches for an alignment that has a good hydrophobic core (see "evaluate the core" earlier). A generation-loop sketch follows below.
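A minimal sketch of the generation loop described above (start with ~50 models, double by mutation and crossover, keep the better half). The model representation and the fitness function are toy stand-ins for the alignment-based fitness mentioned in the slide.

```python
import random

random.seed(1)

def fitness(model):
    """Stand-in for the alignment-based score; here, just a toy objective."""
    return -sum((g - 0.5) ** 2 for g in model)

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(model, rate=0.2):
    return [g + random.uniform(-0.1, 0.1) if random.random() < rate else g for g in model]

# Start with ~50 candidate models (random vectors standing in for structures).
population = [[random.random() for _ in range(10)] for _ in range(50)]

for generation in range(20):
    # Double the population by mutation and crossover ...
    children = [mutate(crossover(*random.sample(population, 2))) for _ in population]
    # ... then test every structure and dismiss the worse half.
    population = sorted(population + children, key=fitness, reverse=True)[:len(population)]

print(round(fitness(population[0]), 4))
```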