THE BUILDING BLOCKS OF LIFE. BUILT FOR YOU Putting Engineering back into Protein Engineering Jun Liao, UC Santa Cruz Manfred K. Warmuth, UC Santa Cruz.

Presentation transcript:

THE BUILDING BLOCKS OF LIFE. BUILT FOR YOU Putting Engineering back into Protein Engineering Jun Liao, UC Santa Cruz Manfred K. Warmuth, UC Santa Cruz Jeremy Minshull, DNA 2.0

Protein Engineering: Current Paradigms
1. Mechanism-based
  – (Rational) detailed structural analysis
2. Empiricism-based
  – (Non-rational) libraries based

Mechanism-Based Protein Engineering
Based on thermodynamic principles
Calculations are approximate
  – calculation cost
  – structures are really not rigid (MDS)
Calculations are primarily able to predict binding
  – catalysis is a special case of binding to a transition state
Changes in amino acids are designed based on these principles
  – very small numbers (<5) of new proteins are synthesized and tested

Empiricism-Based Protein Engineering
Uses similar principles to evolution
  – make many variants
  – screen to find those with the best properties
No mechanistic understanding needed
Produces large numbers of variants (>1,000), which are very difficult and expensive to screen for practically relevant properties
[Figure: proteins related to wild type, simulated crossover, new variants]

The Key Challenge in Protein Engineering = Reality
What we need is not what we assay for.
Molecular mechanistic models (do not model activity)
High-throughput screens (surrogate assays)

What we want in Protein Engineering
Wish List
  – No need to develop a surrogate assay
  – Variants are tested directly under application conditions
  – Rapid process
Requirements
  – Identification of appropriate amino acid substitutions
  – Design and synthesis of information-rich variants
  – Interpretation of quantitative functional data using machine learning techniques

Protein Engineering using Machine Learning
Starting point: Select a protein with some correct initial properties.
Initial design: a) Choose substitutions. b) Design an initial variant set (<50) containing those substitutions.
Reality check: Synthesize and test the variant set for function(s) of interest.
Machine learning: Model the effect of sequence changes on function(s) of interest.
New design: Propose a new variant set (<50) based on the model.
Iterate
End: Select the best variant(s).

Engineering of Proteinase K
Long-term goal: engineer proteinase K to degrade polylactic acid
Member of the serine protease family
  – Large amounts of phylogenetic and sequence information available
Several different measurable activities available for optimization

Expert System for Substitution Selection
Expert system:
- Calculation of 9 independent scores that measure changes that have succeeded in other places in Nature
- Weight and combine scores to pick best changes
Proteins related to proteinase K
19 switches = search space of 2^19 ≈ 500,000
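
The slide's two steps (score each candidate change, then weight and combine the scores) can be sketched as below. This is only an illustration: the slides do not say what the nine scores are or how they are weighted, so the candidate names, the three toy scores per candidate, and the weights are all hypothetical. The last lines compute the size of the search space for 19 on/off switches.

```python
def combine_scores(scores, weights):
    """Weighted sum of one candidate's scores (hypothetical combination rule)."""
    return sum(w * s for w, s in zip(weights, scores))


def rank_substitutions(candidates, weights, top_n):
    """Return the names of the top_n candidates by combined score."""
    ranked = sorted(
        candidates,
        key=lambda name: combine_scores(candidates[name], weights),
        reverse=True,
    )
    return ranked[:top_n]


# Toy data: three made-up scores per candidate instead of the nine on the slide.
toy_candidates = {
    "sub_a": [0.9, 0.2, 0.5],
    "sub_b": [0.1, 0.8, 0.3],
    "sub_c": [0.6, 0.6, 0.6],
}
toy_weights = [1.0, 0.5, 2.0]  # hypothetical weights
top_two = rank_substitutions(toy_candidates, toy_weights, top_n=2)

# Each of the 19 chosen substitutions is either present or absent in a variant.
n_switches = 19
search_space = 2 ** n_switches  # 524288, i.e. roughly 500,000
```

With 19 switches the full combinatorial space is small enough to enumerate in software but far too large to synthesize and test, which is what motivates the small designed variant sets in the slides that follow.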

Finding Optima in Complex Landscapes: Design of Experiment
Changing 1 amino acid at a time
Making multiple changes simultaneously
…Now try to envision doing this not with 2, but 200 amino acids / dimensions
[Figure: two panels of sampled points (x) on axes Aa 1 and Aa 2, one per strategy]
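
To make the contrast between the two strategies concrete, here is a minimal sketch that builds both kinds of variant set for the same number of variants. The cyclic construction is an assumption chosen for simplicity; the slides do not state which experimental design was actually used.

```python
def one_at_a_time(n_subs):
    """One variant per substitution; each variant carries exactly one change."""
    return [{i} for i in range(n_subs)]


def cyclic_design(n_subs, per_variant):
    """n_subs variants, each carrying per_variant substitutions.

    Variant number `start` carries substitutions start, start+1, ...
    (wrapping around), so every substitution appears in exactly
    per_variant variants. This is an illustrative construction only.
    """
    return [
        {(start + j) % n_subs for j in range(per_variant)}
        for start in range(n_subs)
    ]


def occurrences(design, n_subs):
    """How many variants contain each substitution."""
    return [sum(i in variant for variant in design) for i in range(n_subs)]


single = one_at_a_time(19)
multi = cyclic_design(19, per_variant=6)
```

Both designs use 19 variants, but in the second each substitution is observed six times, and always in combination with other changes, so the same synthesis effort yields more measurements per substitution.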

Design of Initial Proteinase K Variants

Protein Engineering using Machine Learning
Back to Proteinase K
Starting point: Select a protein with some correct initial properties.
Initial design: a) Choose substitutions. b) Design an initial variant set (<50) containing those substitutions.
Reality check: Synthesize and test the variant set for function(s) of interest.
Machine learning: Model the effect of sequence changes on function(s) of interest.
New design: Propose a new variant set (<50) based on the model.
Iterate
End: Select the best variant(s).

First proteinase K dataset

Sequence-Activity Modeling: How Does it Work?
1. Represent the sequence as a matrix
Seq1 AGRWGIGAYHKLIMA
Seq2 AGRTGVGVYHKLIMA
Seq3 AGRWGIGVYHRLIMA
Seq4 AGRTGVGAYHRLIMA
becomes
      T   W   V   I   V   A   R   K
      x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8
Seq1  0   1   0   1   0   1   0   1
Seq2  1   0   1   0   1   0   0   1
Seq3  0   1   0   1   1   0   1   0
Seq4  1   0   1   0   0   1   1   0
2. Measure the activity or activities of interest under the final application conditions
3. y = c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4 + … + c_i x_i
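
Step 1 can be sketched in a few lines. The four sequences and the column order (T, W, V, I, V, A, R, K) come from the slide; the function names and the 0-based position indices are assumptions made for this illustration.

```python
def variable_positions(seqs):
    """0-based positions at which the aligned sequences differ."""
    return [
        i for i in range(len(seqs[0]))
        if len({s[i] for s in seqs}) > 1
    ]


def encode(seq, features):
    """One 0/1 entry per (position, amino acid) feature."""
    return [1 if seq[pos] == aa else 0 for pos, aa in features]


seqs = [
    "AGRWGIGAYHKLIMA",  # Seq1
    "AGRTGVGVYHKLIMA",  # Seq2
    "AGRWGIGVYHRLIMA",  # Seq3
    "AGRTGVGAYHRLIMA",  # Seq4
]

# Features x_1..x_8 in the slide's column order: T W V I V A R K.
features = [
    (3, "T"), (3, "W"),
    (5, "V"), (5, "I"),
    (7, "V"), (7, "A"),
    (10, "R"), (10, "K"),
]

X = [encode(s, features) for s in seqs]
positions = variable_positions(seqs)
```

Each row of `X` is the vector (x_1, …, x_8) for one sequence, ready to be paired with its measured activity y in step 3.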

Assessing the Proteinase K Sequence-Activity Relationship
y = c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4 + … + c_i x_i
[Figure: predicted activity vs. measured activity, with wild type (wt) marked]
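
One common way to summarize a predicted-versus-measured plot is the Pearson correlation between the two. The sketch below assumes that summary; the slides show only the plot and do not name a metric, and the function names are illustrative.

```python
from math import sqrt


def predict(x, c):
    """Linear model from the slide: y = c_1 x_1 + ... + c_i x_i."""
    return sum(ci * xi for ci, xi in zip(c, x))


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)
```

For an honest assessment the predictions should be made for variants that were held out when the weights c were fitted; otherwise the correlation overstates how well the model generalizes.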

Learning Methods
Variety of regression methods
  – Ridge Regression & Lasso
  – SVM Regression & LPSVM Regression
  – Matching Loss Regression & One-norm Matching Loss Regression
  – Partial Least Squares Regression
  – LPBoost Regression
Use bagging to improve the prediction stability
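
As an illustration of one method from this list combined with bagging, here is a minimal pure-Python sketch of ridge regression fitted on bootstrap resamples. It is not the authors' implementation: the regularization strength `lam`, the number of models, the absence of an intercept term, and the small linear solver are all assumptions made to keep the example self-contained.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ridge_fit(X, y, lam=1.0):
    """Weights minimizing ||Xw - y||^2 + lam * ||w||^2 (no intercept)."""
    d = len(X[0])
    A = [
        [sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
         for j in range(d)]
        for i in range(d)
    ]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(A, b)


def bagged_ridge(X, y, n_models=25, lam=1.0, seed=0):
    """Fit ridge on bootstrap resamples; return mean and spread per weight."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    fits = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        fits.append(ridge_fit([X[i] for i in idx], [y[i] for i in idx], lam))
    mean = [sum(w[j] for w in fits) / n_models for j in range(d)]
    spread = [
        (sum((w[j] - mean[j]) ** 2 for w in fits) / n_models) ** 0.5
        for j in range(d)
    ]
    return mean, spread
```

Averaging over resamples smooths out the effect of any single variant on the fitted weights, and the per-weight spread gives a rough indication of how reliably each substitution's contribution has been estimated.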

Variants Design I
Main issue: Exploitation vs. Exploration
Optimum design (Exploitation)
  – Take the combination of substitutions predicted to have maximal activity
  – Also consider:
    substitution frequency in the dataset
    variation of weight estimation
  – Used in 2nd & 3rd iterations

Variants Design II
Diversity design (Exploration)
  – Calculate the combination of substitutions predicted to have maximal activity that is also:
    no more than 5 changes from a sequence that has already been tested
    no closer than 3 changes from a sequence that has already been tested or selected for synthesis
  – Used in 2nd iteration
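
The two design modes can be sketched over 0/1 substitution vectors with a linear activity model. This is a greedy illustration under stated assumptions, not the authors' procedure: "changes" is read as Hamming distance, "no closer than 3" as a distance of at least 3, and candidates are found by brute-force enumeration, which is practical only for a modest number of switches.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two variants differ."""
    return sum(x != y for x, y in zip(a, b))


def predicted(variant, weights):
    """Predicted activity under a linear model (sum of weights of changes present)."""
    return sum(w * x for w, x in zip(weights, variant))


def optimum_design(weights):
    """Exploitation: the single combination with the highest predicted activity."""
    return max(product((0, 1), repeat=len(weights)),
               key=lambda v: predicted(v, weights))


def diversity_design(weights, tested, n_pick, max_dist=5, min_dist=3):
    """Exploration: best-predicted variants that respect both distance limits."""
    candidates = sorted(product((0, 1), repeat=len(weights)),
                        key=lambda v: predicted(v, weights), reverse=True)
    selected = []
    for v in candidates:
        if len(selected) == n_pick:
            break
        # Within max_dist changes of at least one tested sequence.
        near = min(hamming(v, t) for t in tested) <= max_dist
        # At least min_dist changes from everything tested or already picked.
        far = all(hamming(v, s) >= min_dist for s in list(tested) + selected)
        if near and far:
            selected.append(v)
    return selected
```

The upper limit keeps new variants in the region where the model has data to support its predictions, while the lower limit prevents spending synthesis effort on near-duplicates of variants already covered.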

Three Iterations of Activity Engineering
1st set: 34 variants
2nd set: 24 variants
3rd set: 38 variants
ONLY 58 variants were tested to allow design of the fourth set, which contained 3 variants x improved over wild-type
50% of variants more active than the best of previous sets
70% of variants more active than wild type
3-11 changes found in variants better than WT
[Figure: activity relative to wild type, variants in order synthesized]

Improving Activity
[Figure: activity improvement]

Variants are Improved in Multiple Properties
[Figure: activity (pmol/s/ml) and half-life at 68°C (s)]

Conclusions
Machine learning
  – Making a very small number of variants (58) allows a productive search of a total space with 500,000 possible combinations
Synthetic Biology
  – Recent advances in gene synthesis methods were essential for this type of exploration

The Future
Proteins are the building blocks of life, with a wide array of applications (therapeutics, diagnostics, industrial catalysts).
Finding a reliable mechanism for optimizing proteins for human applications would be an amazing feat.
We steal ideas about how proteins evolve from nature, but optimize proteins outside their in vivo constraints (the proteins don't have to be compatible with life).