THE BUILDING BLOCKS OF LIFE. BUILT FOR YOU
Putting Engineering back into Protein Engineering
Jun Liao, UC Santa Cruz
Manfred K. Warmuth, UC Santa Cruz
Jeremy Minshull, DNA 2.0
Protein Engineering: Current Paradigms
1. Mechanism-based
   – (Rational) detailed structural analysis
2. Empiricism-based
   – (Non-rational) library-based
Mechanism-Based Protein Engineering
- Based on thermodynamic principles
- Calculations are approximate
  – calculation cost
  – structures are really not rigid (MDS)
- Calculations are primarily able to predict binding
  – catalysis is a special case of binding to a transition state
- Changes in amino acids are designed based on these principles
  – very small numbers (<5) of new proteins are synthesized and tested
Empiricism-Based Protein Engineering
- Uses principles similar to evolution
  – make many variants
  – screen to find those with the best properties
- No mechanistic understanding needed
- Produces large numbers of variants (>1,000), which are very difficult and expensive to screen for practically relevant properties
[Figure: proteins related to wild type, simulated crossover, new variants]
The Key Challenge in Protein Engineering = Reality
What we need is not what we assay for...
- Molecular mechanistic models (do not model activity)
- High-throughput screens (surrogate assays)
What We Want in Protein Engineering
Wish list
- No need to develop a surrogate assay
- Variants are tested directly under application conditions
- Rapid process
Requirements
- Identification of appropriate amino acid substitutions
- Design and synthesis of information-rich variants
- Interpretation of quantitative functional data using machine learning techniques
Protein Engineering using Machine Learning
- Starting point: select a protein with some correct initial properties
- Initial design: a) choose substitutions; b) design an initial variant set (<50) containing those substitutions
- Reality check: synthesize and test the variant set for function(s) of interest
- Machine learning: model the effect of sequence changes on function(s) of interest
- New design: propose a new variant set (<50) based on the model
- Iterate: return to the reality check with the new variant set
- End: select the best variant(s)
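The design / test / model / redesign cycle can be sketched as a small loop. This is only an illustration of the control flow: the function name `engineer` and the callables `assay`, `fit_model` and `propose` are hypothetical placeholders, and the toy versions below (activity = number of substitutions) are invented purely so the sketch runs.

```python
def engineer(initial_set, assay, fit_model, propose, n_rounds=3):
    """Run design -> test -> model -> redesign for a fixed number of rounds."""
    data = {}                      # variant -> measured function
    batch = list(initial_set)      # initial design
    for _ in range(n_rounds):
        for variant in batch:      # reality check: test every variant in the set
            data[variant] = assay(variant)
        model = fit_model(data)    # machine learning step
        # new design: keep only proposals that have not been tested yet
        batch = [v for v in propose(model, data) if v not in data]
        if not batch:
            break
    best = max(data, key=data.get)  # end: select the best tested variant
    return best, data


# Toy stand-ins (assumptions, not the method from the slides).
def toy_assay(variant):
    return sum(variant)            # pretend each substitution adds 1 unit of activity


def toy_fit(data):
    return max(data, key=data.get)  # the "model" is just the best variant so far


def toy_propose(best, data):
    # propose every variant that adds one more substitution to the current best
    return [best[:i] + (1,) + best[i + 1:] for i, bit in enumerate(best) if bit == 0]


best, data = engineer([(0, 0, 0, 0), (1, 0, 0, 0)], toy_assay, toy_fit, toy_propose)
print(best, data[best], len(data))
```

In a real campaign each callable would be replaced by synthesis and measurement, a regression model, and a design rule respectively; the loop structure is the only part taken from the slide.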
Engineering of Proteinase K
- Long-term goal: engineering proteinase K to degrade polylactic acid
- Member of the serine protease family
  – large amounts of phylogenetic and sequence information available
- Several different measurable activities available for optimization
Expert System for Substitution Selection
Expert system:
- Calculation of 9 independent scores that measure changes that have succeeded elsewhere in nature
- Weight and combine scores to pick the best changes
[Figure: proteins related to proteinase K]
19 switches = search space of 2^19 = 524,288 (about 500,000)
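The size of the search space follows directly from treating each of the 19 substitution sites as a binary switch. A minimal sketch of that arithmetic is below; the variable names and the helper `enumerate_variants` are illustrative, not from the slides.

```python
from itertools import product

n_switches = 19                     # binary substitution sites named on the slide
search_space = 2 ** n_switches      # each site is either wild type (0) or substituted (1)
print(search_space)                 # 524288, i.e. roughly 500,000

# Fraction of the space covered by one variant set of 50 (the slide's upper bound).
fraction_per_set = 50 / search_space
print(f"{fraction_per_set:.6%}")


def enumerate_variants(n_sites):
    """List every on/off combination for a small number of sites."""
    return list(product((0, 1), repeat=n_sites))


print(len(enumerate_variants(3)))   # 8 combinations for 3 switches
```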
Finding Optima in Complex Landscapes: Design of Experiment
[Figure: two panels showing tested variants (x) plotted against amino acid 1 (Aa 1) and amino acid 2 (Aa 2): changing 1 amino acid at a time vs. making multiple changes simultaneously]
...Now try to envision doing this not with 2, but with 200 amino acids / dimensions
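The contrast between the two designs can be made concrete by counting how many pairs of substitutions are ever observed together. The sketch below is an assumption-laden illustration: `one_at_a_time`, `multi_substitution` and `pairs_covered` are hypothetical helper names, and the multi-substitution design is plain random sampling, not a balanced design of the kind a real experiment would use.

```python
import random
from itertools import combinations


def one_at_a_time(n_sites):
    """Wild type plus one variant per single substitution."""
    design = [tuple([0] * n_sites)]
    for i in range(n_sites):
        variant = [0] * n_sites
        variant[i] = 1
        design.append(tuple(variant))
    return design


def multi_substitution(n_sites, n_variants, per_variant, seed=0):
    """Random variants that each carry `per_variant` substitutions at once."""
    rng = random.Random(seed)
    design = []
    for _ in range(n_variants):
        chosen = set(rng.sample(range(n_sites), per_variant))
        design.append(tuple(1 if i in chosen else 0 for i in range(n_sites)))
    return design


def pairs_covered(design):
    """Number of distinct substitution pairs that co-occur in at least one variant."""
    covered = set()
    for variant in design:
        on = [i for i, bit in enumerate(variant) if bit]
        covered.update(combinations(on, 2))
    return len(covered)


ofat = one_at_a_time(19)
multi = multi_substitution(19, n_variants=20, per_variant=6)
print(len(ofat), pairs_covered(ofat))    # 20 variants, no pair ever seen together
print(len(multi), pairs_covered(multi))  # same budget, many pairs seen together
```

With the same number of variants, the one-at-a-time design never observes two substitutions in the same protein, so it cannot say anything about how they combine.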
Design of Initial Proteinase K Variants
First proteinase K dataset
Sequence-Activity Modeling: How Does it Work?
1. Represent the sequences as a matrix

   Seq1  AGRWGIGAYHKLIMA
   Seq2  AGRTGVGVYHKLIMA
   Seq3  AGRWGIGVYHRLIMA
   Seq4  AGRTGVGAYHRLIMA

   becomes

         T    W    V    I    V    A    R    K
         x_1  x_2  x_3  x_4  x_5  x_6  x_7  x_8
   Seq1  0    1    0    1    0    1    0    1
   Seq2  1    0    1    0    1    0    0    1
   Seq3  0    1    0    1    1    0    1    0
   Seq4  1    0    1    0    0    1    1    0

2. Measure the activity or activities of interest under the final application conditions
3. Model the activity as $y = c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4 + \dots + c_i x_i$
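The encoding and the linear model can be sketched in a few lines. The four sequences and the feature order (T, W, V, I, V, A, R, K) come from the slide; the 0-based positions are read off the sequences, and the coefficient values in `COEFFS` are made up for illustration only, not fitted to any data.

```python
SEQS = {
    "Seq1": "AGRWGIGAYHKLIMA",
    "Seq2": "AGRTGVGVYHKLIMA",
    "Seq3": "AGRWGIGVYHRLIMA",
    "Seq4": "AGRTGVGAYHRLIMA",
}

# (0-based position, residue) for x_1 .. x_8, in the order shown on the slide.
FEATURES = [(3, "T"), (3, "W"), (5, "V"), (5, "I"),
            (7, "V"), (7, "A"), (10, "R"), (10, "K")]


def encode(seq, features=FEATURES):
    """Turn a sequence into a 0/1 vector: 1 if the residue is present at that position."""
    return [1 if seq[pos] == aa else 0 for pos, aa in features]


def predict(x, coeffs):
    """Linear model y = c_1 x_1 + c_2 x_2 + ... + c_i x_i."""
    return sum(c * xi for c, xi in zip(coeffs, x))


matrix = {name: encode(seq) for name, seq in SEQS.items()}
for name, row in matrix.items():
    print(name, row)

# Hypothetical coefficients, chosen only to show the arithmetic.
COEFFS = [1.0, 2.0, 0.5, 1.5, 0.0, 1.0, 2.0, 0.5]
print(predict(matrix["Seq1"], COEFFS))
```

In practice the coefficients would be estimated from the measured activities by a regression method rather than set by hand.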
Assessing the Proteinase K Sequence-Activity Relationship
[Figure: predicted activity and measured activity of the variants, with wild type (wt) marked]
$y = c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4 + \dots + c_i x_i$
Learning Methods
- Variety of regression methods
  – Ridge Regression & Lasso
  – SVM Regression & LPSVM Regression
  – Matching Loss Regression & One-norm Matching Loss Regression
  – Partial Least Squares Regression
  – LPBoost Regression
- Use bagging to improve prediction stability
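As one illustration of these ideas, ridge regression combined with bagging can be sketched with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the regularization strength `alpha`, the number of bootstrap models, the absence of an intercept term, and the synthetic data are all choices made for the example.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: c = (X^T X + alpha I)^-1 X^T y (no intercept)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


def bagged_ridge(X, y, n_models=50, alpha=1.0, seed=0):
    """Fit ridge on bootstrap resamples; return mean and spread of each coefficient."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    coefs = []
    for _ in range(n_models):
        idx = rng.integers(0, n_samples, size=n_samples)  # sample rows with replacement
        coefs.append(ridge_fit(X[idx], y[idx], alpha))
    coefs = np.array(coefs)
    return coefs.mean(axis=0), coefs.std(axis=0)


# Synthetic toy data (invented): 40 variants, 8 binary sequence features.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(40, 8)).astype(float)
true_c = np.array([1.0, -0.5, 0.8, 0.0, 0.3, -0.2, 0.6, 0.1])
y = X @ true_c + rng.normal(0.0, 0.1, size=40)

mean_c, std_c = bagged_ridge(X, y)
print(np.round(mean_c, 2))
print(np.round(std_c, 2))
```

The per-coefficient standard deviation across bootstrap models gives a rough indication of how stable each estimated substitution effect is.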
Variants Design I
Main issue: Exploitation vs. Exploration
- Optimum design (Exploitation)
  – Take the combination of substitutions predicted to have maximal activity
  – Also consider substitution frequency in the dataset and variation of the weight estimates
  – Used in 2nd & 3rd iterations
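For an additive model, the combination with maximal predicted activity is found by switching on every substitution with a positive weight. The sketch below adds one possible way to account for variation in the weight estimates, by requiring the weight to stay positive after subtracting `k` standard deviations. That rule, the function names, and the numbers are assumptions for illustration; the slides do not say how these factors were actually combined.

```python
from itertools import product


def optimum_design(mean_w, std_w, k=1.0):
    """Include a substitution only if its weight is positive by a margin of k std devs."""
    return [1 if m - k * s > 0 else 0 for m, s in zip(mean_w, std_w)]


def predicted_activity(variant, mean_w):
    """Additive prediction: sum of the weights of the substitutions present."""
    return sum(w * v for w, v in zip(mean_w, variant))


# Hypothetical weight estimates and their spread (e.g. from bagging).
mean_w = [0.8, -0.3, 0.2, 0.5]
std_w = [0.1, 0.1, 0.4, 0.2]

greedy = optimum_design(mean_w, std_w, k=0.0)    # pure exploitation
cautious = optimum_design(mean_w, std_w, k=1.0)  # drops the poorly determined weight
print(greedy, predicted_activity(greedy, mean_w))
print(cautious, predicted_activity(cautious, mean_w))

# Sanity check by brute force: with k=0 no other combination predicts higher activity.
best_brute = max(product((0, 1), repeat=4), key=lambda v: predicted_activity(v, mean_w))
print(list(best_brute) == greedy)
```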
Variants Design II
- Diversity design (Exploration)
  – Calculate the combination of substitutions predicted to have maximal activity that is also
      - no more than 5 changes from a sequence that has already been tested
      - no closer than 3 changes from a sequence that has already been tested or selected for synthesis
  – Used in 2nd iteration
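The two distance constraints can be sketched as a greedy filter over candidates ranked by predicted activity. The reading of the constraints used here is an assumption: a candidate must be within 5 changes of at least one tested sequence, and at least 3 changes away from every tested or already selected sequence. The function names, weights and tested variants are invented, and exhaustive enumeration is only practical for a small number of sites.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two variants differ."""
    return sum(x != y for x, y in zip(a, b))


def diversity_design(weights, tested, n_select, max_dist=5, min_dist=3):
    """Greedily pick high-scoring variants that satisfy both distance constraints."""
    candidates = sorted(
        product((0, 1), repeat=len(weights)),
        key=lambda v: sum(w * x for w, x in zip(weights, v)),
        reverse=True,                      # best predicted activity first
    )
    selected = []
    for cand in candidates:
        if len(selected) == n_select:
            break
        near_known = any(hamming(cand, t) <= max_dist for t in tested)
        far_enough = all(hamming(cand, s) >= min_dist for s in tested + selected)
        if near_known and far_enough:
            selected.append(cand)
    return selected


# Hypothetical model weights and previously tested variants (8 sites for brevity).
weights = [0.9, 0.4, -0.2, 0.7, 0.1, -0.5, 0.3, 0.6]
tested = [(0,) * 8, (1, 1, 1, 1, 0, 0, 0, 0)]
picked = diversity_design(weights, tested, n_select=3)
for variant in picked:
    print(variant)
```

The minimum-distance rule spreads the new variants out, while the maximum-distance rule keeps them in a region where the model has some supporting data.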
Three Iterations of Activity Engineering
[Figure: activity relative to wild type for variants in the order synthesized; 1st set: 34 variants, 2nd set: 24 variants, 3rd set: 38 variants; wild type marked]
- Only 58 variants were tested to allow design of the fourth set, which contained 3 variants [?]x improved over wild type
- 50% of variants more active than the best of previous sets
- 70% of variants more active than wild type
- 3-11 changes found in variants better than WT
Improving Activity
[Figure: activity improvement]
Variants are Improved in Multiple Properties
[Figure: activity (pmol/s/ml) and half-life at 68°C (s) of the variants]
Conclusions
- Machine learning
  – Making a very small number of variants (58) allows a productive search of a total space with about 500,000 possible combinations
- Synthetic biology
  – Recent advances in gene synthesis methods were essential for this type of exploration
The Future
- Proteins are the building blocks of life, with a wide array of applications (therapeutics, diagnostics, industrial catalysts)
- Finding a reliable mechanism for optimizing proteins for human applications would be an amazing feat
- We steal ideas about how proteins evolve from nature, but optimize proteins outside their in vivo constraints (the proteins don't have to be compatible with life)