EC in Bioinformatics and Drug Design Dr. Michael L. Raymer Department of Computer Science and Engineering
Bioinformatics (Computational Molecular Biology) Genomics, functional genomics, proteomics, structural bioinformatics.
DNA is the blueprint for life. Every cell in your body has 23 pairs of chromosomes in its nucleus, and the genes on these chromosomes determine all of your physical attributes. Image source: Crane digital, http://www.cranedigital.com/
Mapping the Genome The Human Genome Project has provided us with a draft of the entire human genome (as of July 2000): four bases (A, T, C, G), 3.12 billion base pairs. About 99% of these are the same between any two people; polymorphisms are the positions where they differ. Everything that makes you different from every other human (as far as we know) is encoded in this information.
How does the code work? DNA is the template for the construction of proteins. Inherited disease: broken DNA → broken proteins. Viral disease: foreign DNA/RNA → foreign proteins. Bacterial disease: invading organisms with their own proteins.
The genome is information: TATAAGCTGACTGTCACTGA. Four bases (A, G, C, T); each triplet of bases forms one codon, and codons encode the 20 amino acids (illustrated with structure 3apr.pdb). The representation is in many ways much more complex than the representations used in EC.
Proteins: Molecular machinery The proteins in your muscles, myosin and actin, allow you to move.
Proteins: Molecular machinery Enzymes (digestion, catalysis) Structure (collagen)
Proteins: Molecular machinery Signaling (hormones, kinases) Transport (energy, oxygen) Image source: Crane digital, http://www.cranedigital.com/
Growth of biological databases [Figures: GenBank base-pair growth (source: GenBank); growth in 3D structures (source: http://www.rcsb.org/pdb/holdings.html)]
Applications What can we do with all this information? Cure diseases – computational drug design Model protein structure Understand relationships between organisms (phylogenies)
Example Case: HIV Protease The three most important classes of human disease are genetic, viral, and bacterial. Exposure and infection: HIV enters your cell; your own cell reads the HIV “code” and creates the HIV proteins; the new viral proteins prepare HIV for infection of other cells. © George Eade, Eade Creative Services, Inc. http://whyfiles.org/035aids/index.html
HIV Protease as a drug target Treatment strategies differ by disease class. Genetic: fix the DNA, introduce new DNA, or synthesize the missing protein (example: diabetes – we use insulin now, but introducing DNA would be a better solution). Viral and bacterial: attack protein drug targets. Many drugs bind to protein active sites. With a peptidyl inhibitor already bound in its active site, HIV protease can no longer prepare HIV proteins for infection (HIV protease + peptidyl inhibitor, 1A8G.PDB).
Drug Discovery Target identification: what protein can we attack to stop the disease from progressing? Lead discovery & optimization: what sort of molecule will bind to this protein? Toxicology: does it kill the patient? Does it have side effects? Does it get to the problem spots?
Drug Development Life Cycle (7-15 years, $600-700 million!) Discovery (2 to 10 years); preclinical testing (lab and animal testing); Phase I (20-30 healthy volunteers used to check for safety and dosage); Phase II (100-300 patient volunteers used to check for efficacy and side effects); Phase III (1000-5000 patient volunteers used to monitor reactions to long-term drug use); FDA review & approval; post-marketing testing.
Drug lead screening 5,000 to 10,000 compounds screened → 250 lead candidates in preclinical testing → 5 drug candidates enter clinical testing (80% pass Phase I, 30% pass Phase II, 80% pass Phase III) → one drug approved by the FDA.
Finding drug leads Once we have a target, how do we find some compounds that might bind to it? The old way: exhaustive screening The new way: computational screening!
Chemistry 101 Like everything else in the universe, proteins and drugs are made up of atoms. Some atoms, like oxygen, tend to have a negative charge. Some, like nitrogen, tend to be positively charged. When the two come together, they attract like magnets.
The Goal [Figure: a drug lead bound to its target protein.]
Drug Lead Screening & Docking When searching for a drug lead, we look for complementarity between the candidate molecule and the protein: shape complementarity, chemical complementarity (hydrogen bonding potential, hydrophobic interaction), and electrostatic complementarity.
But it gets more complicated About 60% of our body weight is water. Most proteins are surrounded by water molecules, which interact with protein-drug complexes.
Protein-water interactions Water has both negatively and positively charged atoms, so it can bridge gaps between drugs and proteins. [Figure: a water molecule bridging the ligand and the protein surface.]
Will the water stay? When a drug comes close to a protein, some of the water molecules are displaced. [Figure: the ligand displacing water molecules at the protein surface.]
Pattern Recognition Model A sample is measured (by a transducer, etc.) to produce raw measurement data, which is reduced to a feature vector (f1 f2 f3 f4 f5): the pattern. The number of features is denoted d. Before describing how we improved on the pattern recognizers we used, this defines the pattern recognition terminology.
Training and Testing Labeled training data (feature vectors f1 through f5, each tagged with its class, e.g. C or N) are used to train the classifier; the trained classifier then produces a classification/prediction for each unknown test sample.
Temperature Factor (B-Value) How wiggly is it? Here a protein (dihydrofolate reductase) is colored by temperature factor.
Atomic Density (ADN) How crowded is it? The atomic density of this water molecule is 5.
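A minimal sketch of computing this feature; the slide gives no cutoff radius, so the 3.6 Å value below is purely an assumed parameter:

```python
import numpy as np

def atomic_density(water_pos, atom_coords, radius=3.6):
    """Count atoms within `radius` angstroms of the water position."""
    dists = np.linalg.norm(np.asarray(atom_coords) - np.asarray(water_pos), axis=1)
    return int(np.sum(dists <= radius))
```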
Prediction of water molecules Blue spheres are water molecules predicted to stay in the active site. Wire mesh spheres are water molecules predicted to be displaced — booted out by the ligand.
Nearest Neighbor Classification The specific pattern recognizer we chose to experiment with first was kNN: it is simple, well understood in the literature, and gives good results. [Figure: class 1 and class 2 training samples plotted on feature 1 vs. feature 2, with a new unknown (test) sample.] A sketch follows.
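A minimal sketch of kNN classification as described (data and feature names are hypothetical):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]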
Feature-Weighted kNN Scaling (extending) a feature axis changes which training samples are the nearest neighbors of an unknown point, and a weight of 0 means: don't use the feature. GOAL: a minimum set of features that classifies well. Benefits: faster calculation, data mining, and sometimes even better accuracy. [Figure: class 1, class 2, and unknown samples plotted on features 1 and 2, (a) before and (b) after feature 2 is extended.]
GA & kNN Interaction The genetic algorithm maintains a population of chromosomes, each encoding a masked weight vector (W1 W2 W3 W4 W5) and k. Each chromosome is passed to the kNN classifier, which evaluates it and returns a fitness. How is the fitness calculated?
Weighting and Masking How do we sample feature subsets? Treating any weight below a threshold value as zero samples subsets only slowly. Masking is more direct: distinct mutation rates, or multiple mask bits per feature (an intron effect). Classifier parameters (k) also go on the chromosome: W1 W2 W3 W4 W5 | M1 M2 M3 M4 M5 | k. See the sketch below.
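A minimal sketch of decoding such a chromosome and using it in the distance computation (the flat layout and the 0.5 mask threshold are assumptions for illustration):

```python
import numpy as np

def decode(chrom, d):
    """Split a flat chromosome into weights, mask bits, and k (assumed layout)."""
    weights = np.asarray(chrom[:d])
    mask = np.asarray(chrom[d:2*d]) > 0.5  # mask bit decides: use feature or not
    k = max(1, int(chrom[2*d]))
    return weights, mask, k

def masked_weighted_dist(a, b, weights, mask):
    """Euclidean distance over unmasked features, each scaled by its weight."""
    diff = (a - b) * weights * mask
    return np.sqrt(np.sum(diff**2))
```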
The Cost Function We can direct the search toward any objective: classification accuracy, class balance, and feature subset parsimony (reducing d). The GA minimizes the cost function (a plausible form is sketched below).
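The cost function itself is not shown on the slide; a plausible weighted sum of the three stated objectives, with illustrative coefficients c_a, c_b, c_p, is:

```latex
\mathrm{Cost} = c_a\,(1 - \mathrm{accuracy})
             + c_b\,\bigl|\mathrm{acc}_{\omega_1} - \mathrm{acc}_{\omega_2}\bigr|
             + c_p\,\frac{d_{\mathrm{used}}}{d}
```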
UCI Data Set Results Typically we see modest gains in classification accuracy, with significant reduction in features. Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html
Classical Methods & SFFS Thyroid data: best feature selection, with performance within 1% of the best accuracy. Appendicitis data: best feature selection. Weiss, S. and Kapouleas, I. (1990). An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. Morgan Kaufmann.
Feature Extraction Framework [Diagram: the genetic algorithm evolves features and classifier parameters; candidate solutions are passed to the classifier, and fitness is based on accuracy.]
Reducing Computation Time Most of the compute time is spent calculating distances and finding nearest neighbors. Branch-and-bound kNN [1] scales linearly with d and polynomially with n: O(n^c), 1.0 < c < 2.0. Can we use a faster classifier? [1] Fukunaga, K. and Narendra, P. M. (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 750–753.
The Bayes Classifier Properties are well understood and thoroughly explored in the literature Training data are summarized, classification of test samples is rapid Provably optimal when the multivariate feature distribution is known for each class
Class-conditional Distributions [Figure: class-conditional densities P(x) plotted against x.] Maximum-likelihood classification (assign x to the class whose density at x is highest) is optimal when the classes have equal prior probabilities and equal error costs. The approach generalizes to d dimensions.
Multiple Dimensions
The “Naïve” Bayes Classifier We now have P(xi|ωj) for each feature i and each class j. Naïve approach: assume all features are independent. This assumption is almost always false, but as long as monotonicity holds, the decision rule is still valid.
Unequal Prior Probabilities When we know the prior probabilities for the classes, we can use Bayes rule: P(ωj|x) = P(x|ωj)P(ωj) / P(x). Bayes decision rule: assign each sample to the class for which the posterior probability is highest.
Hybridizing the Bayes Classifier Unfortunately, the Bayes classifier is invariant to feature weighting: rescaling a feature rescales its histogram, but the probability mass is unchanged. For example, P(80 < x < 100) = 0.045 before scaling, and P(8 < x < 10) = 0.045 after the feature is scaled by 0.1. [Figure: proportion of training samples vs. feature value, before and after scaling.]
Bayes Discriminant Function Bayes decision rule: decide ω1 if P(ω1|x) > P(ω2|x), otherwise decide ω2. Two-class discriminant function (in its standard form): g(x) = P(x|ω1)P(ω1) - P(x|ω2)P(ω2); decide ω1 when g(x) > 0.
Naïve Bayes Discriminant Independence assumption: P(x|ωj) = Πi P(xi|ωj), so each class-conditional density is replaced by the product of its marginals. A sketch of the resulting classifier follows.
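A minimal sketch of the resulting naive Bayes classifier, using histogram estimates of the marginals (the binning and add-one smoothing are assumptions for illustration):

```python
import numpy as np

def fit_marginals(X, bins=10):
    """Histogram estimate of P(x_i | class) per feature, with add-one smoothing."""
    hists, edges = [], []
    for i in range(X.shape[1]):
        h, e = np.histogram(X[:, i], bins=bins)
        hists.append((h + 1) / (h.sum() + bins))  # smoothed bin probabilities
        edges.append(e)
    return hists, edges

def log_likelihood(x, hists, edges):
    """Sum of log marginals: the independence (naive Bayes) assumption."""
    ll = 0.0
    for i, xi in enumerate(x):
        b = np.clip(np.searchsorted(edges[i], xi) - 1, 0, len(hists[i]) - 1)
        ll += np.log(hists[i][b])
    return ll

def classify(x, models, log_priors):
    """Assign the class with the largest log posterior (up to a constant)."""
    scores = {c: log_priors[c] + log_likelihood(x, *models[c]) for c in models}
    return max(scores, key=scores.get)
```

Here models maps each class to the (hists, edges) pair returned by fit_marginals on that class's training samples.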
A Parameterized Discriminant C1, C2 … Cd are optimized by an evolutionary algorithm.
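The discriminant itself is not shown on the slide. A plausible parameterization, consistent with per-feature exponents C1, …, Cd being evolved, raises each marginal probability to a power (this form is an assumption, not the author's confirmed formula):

```latex
g(\mathbf{x}) = P(\omega_1)\prod_{i=1}^{d} P(x_i \mid \omega_1)^{C_i}
             - P(\omega_2)\prod_{i=1}^{d} P(x_i \mid \omega_2)^{C_i}
```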
An Alternative Discriminant Sum of weighted marginal probabilities
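Again the exact formula is not shown; a natural reading of “sum of weighted marginal probabilities,” with the weights wi evolved as before, would be:

```latex
g(\mathbf{x}) = \sum_{i=1}^{d} w_i \left[ P(x_i \mid \omega_1) - P(x_i \mid \omega_2) \right]
```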
Conserved vs. Non-Conserved Accuracy can be improved by using the k value as a confidence indicator and allowing “don’t know” results.
Cosine-based kNN Classifier A cosine similarity metric finds the k points with the most similar angles. The cosine between two vectors xi and xj is defined as cos(xi, xj) = (xi · xj) / (||xi|| ||xj||). [Figure: class A and class B points with a test pattern; k = 5 classification: among the 5 points with the smallest angles with respect to the test point, 3 are class A, so the test point is labeled class A.] Once the k most similar points have been identified, the class label is assigned by a vote q over those neighbors, where c(xi) = 1 if xi belongs to the positive class and c(xi) = -1 if it belongs to the negative class. If q is positive, the query point is assigned to the positive class; otherwise it is assigned to the negative class.
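A minimal sketch of this rule; the slide does not say whether the vote is similarity-weighted, so the plain sum q = Σ c(xi) over the k neighbors is an assumption:

```python
import numpy as np

def cos_knn_classify(X_train, labels, x, k=5):
    """labels: array of +1/-1 per training point. Returns +1 or -1 for query x."""
    sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
    nearest = np.argsort(sims)[-k:]  # k largest cosines = k smallest angles
    q = labels[nearest].sum()        # q = sum of c(x_i) over the neighbors
    return 1 if q > 0 else -1
```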
Feature Extraction Techniques Shifting the origin in the search space may affect classification: after the origin is shifted, the test point, originally labeled class A, is relabeled class B. Assigning different weights to each feature may also affect classification, to a lesser extent: after feature 2 is extended, the test point, originally labeled class B, is relabeled class A. [Figure: class A and class B points with a test pattern, before and after the origin shift and the feature 2 extension.]
GA/Classifier Hybrid Architecture The genetic algorithm evolves a population of chromosomes, each holding a weight vector (W1 W2 ... W8: the weights to use for each feature axis during classification), an offset vector (O1 O2 ... O8: feature offsets for the cosine point of reference during classification), and K, which is also optimized. Each chromosome is evaluated by the cosine kNN classifier; fitness is based on the number of correct predictions using the weight vector and the number of masked features.
Population-Adaptive Mutation When a feature is chosen for mutation, its range of possible mutation depends on the variance of that feature across the genetic algorithm’s population. In early generations variance is high, so the range is larger; later, as the population begins to converge, variance decreases and the range is smaller. A sketch follows.
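A minimal sketch of population-adaptive mutation as described; the scale factor on the per-gene standard deviation is an assumed parameter:

```python
import numpy as np

def adaptive_mutate(population, rng, rate=0.1, scale=1.0):
    """Mutate genes with a step range tied to the population's per-gene spread."""
    pop = np.asarray(population, dtype=float)
    stds = pop.std(axis=0)                 # per-gene spread; shrinks as the GA converges
    mutate = rng.random(pop.shape) < rate  # which genes get mutated this generation
    steps = rng.uniform(-1, 1, pop.shape) * stds * scale
    return pop + mutate * steps
```

Because the step size is tied to the population's per-gene standard deviation, mutation automatically becomes finer as the population converges, with no explicit annealing schedule.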
Probe Site Generation Aspartic protease (2apr) with crystallographically observed (yellow) and computer-generated (green) water molecules.
Standard Classifier Results All of the above classifiers are from the WEKA collection of machine learning algorithms and use 10-fold cross-validation, except for the Euclidean kNN/GA classifier, another EC-optimized classifier from Michael Raymer's research, which uses bootstrap validation.
cosKnn/GA Classifier Results All of the above results are from runs during July 2002; the training sets from each run were not preserved.
Results
Protein Structure Modeling The holy grail of computational biology: given the sequence of amino acids in a protein, how will it fold?
Target-to-Template Alignment [Diagram: the protein structure modeling pipeline, from fold recognition (profile vs. profile alignment and scoring of the target protein against known structures) through target-to-template alignment optimization, core evaluation, fragment selection, and key residue comparison, to the final model.] Traditionally, an expert carries out the middle section of this pipeline; that is the part we are automating. MST automates the evaluation of the core.
The first four strands of OB-folds Shown is the superimposed core of 20 OB-folds; these fragments define the family. The first beta strand is colored red purely for visualization. Molnir automatically locates the common core from a multiple structure alignment; 20 cores are shown.
The fifth strand, clustered Clustering a variable secondary structure unit: we want to select one of these three clusters for each chimeric structure. Molnir performs the clustering automatically.
The second helix, clustered Another variable secondary structure unit (vSSU), shown clustered. With 3 clusters of the last vSSU and 6 of this one, 18 combinations are possible, some of which are not realized in nature. There is also a third vSSU, and a total of 8 loops. If any set of loops can occur with any set of SSUs, we have hundreds of millions of possible combinations; even using only representative structures (one from each cluster), millions of possible combinations remain.
Selecting a Model A genetic algorithm evolves a population of models, letting us search quickly through millions of possibilities. Start with ~50 models: the original 20 plus 30 random. Double the population size by mutation and crossover, test each of the new structures, and dismiss half. Fitness function: for now, the target-to-template alignment score (see the Target-to-Template Alignment slide) serves as the fitness; we will eventually use the MST, because it searches for an alignment that has a good hydrophobic core (the “evaluate the core” step above). A sketch of the loop follows.
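A minimal sketch of this double-and-cull loop; mutate, crossover, and score stand in for the real operators and the alignment-based fitness:

```python
import random

def select_model(models, score, mutate, crossover, generations=100):
    """Double the population by variation, keep the better half each generation."""
    pop = list(models)                   # e.g., the 20 originals plus 30 random models
    for _ in range(generations):
        children = []
        while len(children) < len(pop):  # double the population size
            a, b = random.sample(pop, 2)
            children.append(mutate(crossover(a, b)))
        # test old and new structures together, dismiss the worse half
        pop = sorted(pop + children, key=score, reverse=True)[:len(pop)]
    return max(pop, key=score)           # best model found
```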