
Thesis Defense

Parametrization and Classification of Peptides for the Detection of Linear B-cell Epitopes
Johannes Söllner
Peptide Classification, Johannes Söllner 01/02/2019

Scientific question(s)
Can machine learning approaches improve state-of-the-art B-cell epitope prediction for proteins/peptides?
Which features are most relevant to making a peptide antigenic?

Basic definitions
Antigen: “any substance (as a toxin or enzyme) that stimulates the production of antibodies”
Epitope: “a molecule or portion of a molecule capable of binding to the combining site of an antibody ...” (B-cell epitopes)
Structural epitopes: binding surfaces formed by amino acids that are non-adjacent at the sequence level (3D conformation)
Linear (sequential) epitopes: formed by adjacent amino acids
We only investigated predictions neglecting 3D structure, i.e. only linear epitopes

State of the Art
1. Published mainly in the 1980s and 1990s
2. Almost exclusively monoparametric propensity scales (hydrophilicity, secondary structure prevalence, ...)
3. No classification systems except for the “gold standard” by Kolaskar and Tongaonkar (1990)

Machine learning – principle
Knowledge based, and it works like this:
“See, this is an antigen. See, this is NOT an antigen. And see, here I have another item: is this an antigen?”

Translation to practice
1. Split the protein sequence to be investigated into overlapping peptides
2. Calculate descriptors (values) for each peptide
3. Compare to descriptors of known epitopes and non-epitopes by means of machine learning
4. Assign the class (i.e. epitopic/non-epitopic) of the most similar examples
What are peptide descriptors/attributes/parameters?
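The sliding-window step (1) can be sketched as follows; the window length and step size below are illustrative choices, not the settings used in the thesis:

```python
def sliding_peptides(sequence, length=15, step=1):
    """Split a protein sequence into overlapping peptides of fixed length."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]

# Toy sequence, not taken from the thesis dataset
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
peptides = sliding_peptides(protein, length=10, step=5)
```

Each resulting peptide is then described by the parameters introduced on the next slide.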

Peptide Parameters
1. Published propensity scales (hydrophilicity, flexibility, ...) → Protscale
2. MOE-based (small-molecule attributes for amino acids)
3. Neighbourhood parameters
4. Amino acid frequencies (likelihood to be a member of an epitope)
5. Complexity parameters
→ Calculated mean values for each peptide
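The per-peptide mean over a propensity scale can be sketched as below; the `HYDRO` values are invented for illustration and are NOT a published Protscale scale:

```python
# Hypothetical per-residue propensity values -- illustration only,
# not a published scale.
HYDRO = {"A": 0.5, "G": 0.0, "F": -1.0, "K": 1.5, "V": -0.5}

def mean_propensity(peptide, scale):
    """Peptide descriptor: mean of the per-residue scale values."""
    return sum(scale[aa] for aa in peptide) / len(peptide)
```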

Focus: Neighbourhoods
Definition of neighbourhood: a word m neighbours a central character i at a distance of x characters if it can be found in the same string x characters upstream AND/OR downstream of character i (non-exclusive OR).
e.g. for m = AF, i = G, x = 3: AAGGIAFVAHGCKLAFAIAIAGGATGHA
Calculated probabilities: upstream, downstream, combined, mean of upstream and downstream
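The neighbourhood test can be sketched as below; the exact offset convention (m starting x positions after i, or ending x positions before it) is one possible reading of the definition above:

```python
def is_neighbour(seq, m, c, x):
    """True if word m occurs x characters upstream and/or downstream
    (inclusive OR) of any occurrence of the central character c.
    Offset convention is an assumption: downstream = m starts x positions
    after c; upstream = m ends x positions before c."""
    for p, ch in enumerate(seq):
        if ch != c:
            continue
        if seq[p + x:p + x + len(m)] == m:                  # downstream
            return True
        start = p - x - len(m) + 1
        if start >= 0 and seq[start:start + len(m)] == m:   # upstream
            return True
    return False
```

With the slide's example, `is_neighbour("AAGGIAFVAHGCKLAFAIAIAGGATGHA", "AF", "G", 3)` finds "AF" three characters downstream of the first G.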

Reduced alphabets (for neighbourhoods)
Treat several amino acids as one, based e.g. on chemical similarity, hydrophilicity, flexibility, secondary structure prevalence, ...
BLOSUM62-based alphabets
Functional abstractions become possible
The statistical basis of neighbourhoods improves
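A reduced alphabet can be implemented as a simple residue-to-group mapping. The grouping below is an invented chemical-similarity example, not the BLOSUM62-derived alphabets used in the work:

```python
# Illustrative 4-letter reduced alphabet by broad chemical character --
# an example grouping, NOT the alphabets used in the thesis.
GROUPS = {"H": "AVLIMFWC",   # hydrophobic
          "P": "STNQYG",     # polar
          "C": "DEKRH",      # charged
          "S": "P"}          # special (proline)
REDUCE = {aa: sym for sym, members in GROUPS.items() for aa in members}

def reduce_sequence(seq):
    """Rewrite a peptide in the reduced alphabet, one symbol per residue."""
    return "".join(REDUCE[aa] for aa in seq)
```

Because several residues collapse onto one symbol, each reduced "word" is observed more often, which is why the statistical basis of the neighbourhood counts improves.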

Total number of parameters
In total 18,920 attributes → selection necessary

The dataset
Machine learning is empirical, i.e. examples have to be provided
BciPep: 211 epitopes (multi-organism)
FIMM: 293
Intercell Antigen Identification Program (AIP): 1333 (13 bacterial pathogens)
Reference (non-immunogenic) peptides:
– excised from AIP proteins that were not selected, but from screened organisms
– randomly excised from proteins
Excised peptides had the same length as epitopes
Balanced, i.e. the same number of reference peptides as epitopes

Separation Power of Parameters
Between epitopes and non-epitopes

Parameter Ranking
mc – mean correct; sd – standard deviation; mcv – mean correct (validation)

Physico-Chemical Parameters
Neighbourhood parameters are at least equally prominent
The best physico-chemical parameters (MC – mean value of correct classifications)

Parameter Selection
Too many parameters for practical application
Selection of complementary and powerful parameters
The Waikato Environment for Knowledge Analysis (WEKA)
Selection based on multiple samples

Parameter Selection

Creation of Classifiers
What is a classifier? A black box telling (predicting) the class of a data item
14 different parameter selections, 4–80 members per set
Dimension reduction through PCA (Principal Component Analysis)
Used classification algorithms:
– J48/C4.5 (decision trees)
– IBk (nearest-neighbour classifiers)
More powerful methods exist (e.g. neural networks), but lead to less understandable models
For each of the two algorithms, different setups (e.g. algorithm options) were possible, resulting in more than 10⁶ setups
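The IBk idea can be sketched from scratch; the toy two-attribute data, class labels, and choice of k below are illustrative, not the thesis setups:

```python
from collections import Counter

def ibk_predict(train, query, k=3):
    """IBk-style lazy learning: majority vote among the k training examples
    nearest to the query (squared Euclidean distance on two attributes)."""
    nearest = sorted(train,
                     key=lambda ex: (ex[0][0] - query[0]) ** 2
                                    + (ex[0][1] - query[1]) ** 2)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: ((attribute 1, attribute 2), class)
train = [((0, 0), "non-epitope"), ((0, 1), "non-epitope"),
         ((1, 0), "non-epitope"), ((5, 5), "epitope"),
         ((5, 6), "epitope"), ((6, 5), "epitope")]
```

"Lazy learning" here means no model is built up front: all work happens at prediction time, when distances to the stored examples are computed.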

Decision Trees

IBk (k-nearest neighbour or lazy learning)
[scatter plot of training examples: attribute 1 vs. attribute 2]

The Confusion Matrix
The rates of true and false classifications have to be taken into consideration. One way is the representation in a confusion matrix.
P – positives (e.g. epitopes), N – negatives (e.g. non-epitopes)

                 Predictions
Known classes    P     N
            P    TP    FN
            N    FP    TN

TP – true positives, FP – false positives, TN – true negatives, FN – false negatives
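The standard rates derived from the four confusion-matrix cells, which feed into the ROC analysis on the next slides, can be computed as:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Rates commonly derived from the confusion matrix."""
    return {"sensitivity": tp / (tp + fn),            # true positive rate
            "specificity": tn / (tn + fp),            # true negative rate
            "accuracy": (tp + tn) / (tp + fp + tn + fn)}
```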

Finding Optimal Classifiers
The confusion matrix can be used to obtain an ROC (Receiver Operating Characteristic) graph
It can also be used to find the best setups by means of the ROC convex hull

ROC Convex Hull
All used setups in an ROC graph. The green curve forms the ROC convex hull. Classifiers forming the hull are by definition optimal.
[graph annotation: decreasing conservativity]
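A minimal sketch of computing the ROC convex hull from (false positive rate, true positive rate) points; the sample points in the test are invented:

```python
def roc_convex_hull(points):
    """Upper convex hull of (FPR, TPR) points, anchored at (0, 0) and (1, 1).
    Classifiers on this hull are optimal for some class/cost distribution."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for px, py in pts:
        # Pop the last point while it lies on or below the line from the
        # second-to-last point to the new point (not part of the upper hull).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (py - y1) >= (px - x1) * (y2 - y1):
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull
```

Sweeping a decision threshold along the hull from left to right corresponds to the decreasing conservativity annotated on the slide.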

Classifiers
17 classifiers selected from 1,164,041
Combined by voting (ensemble learning)
Validation on an independent dataset
Comparison to the gold standard
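The voting combination can be sketched as below; the threshold and labels are illustrative, not the scheme actually tuned in the thesis:

```python
def ensemble_vote(predictions, min_positive=9):
    """Combine individual classifier calls by voting: predict 'epitope' when
    at least min_positive classifiers say so. With 17 classifiers,
    min_positive=9 is a simple majority; raising the threshold makes the
    ensemble more conservative. The threshold here is illustrative."""
    positives = sum(1 for p in predictions if p == "epitope")
    return "epitope" if positives >= min_positive else "non-epitope"
```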

Validation
For method validation, a dataset of overlapping peptides from 7 published proteins¹ was assembled → 4923 peptides of determined antigenicity
¹ HIV gp160, S. mutans SpaP, HCV E1, HCV E2, N. meningitidis p64K, SARS E2, HHV-5 UL32 aa490–660
All our classifiers lie above the diagonal of the ROC graph (the line of random guessing)
The gold standard by Kolaskar and Tongaonkar does not

HIV gp160 – ensemble classifiers
[plot legend: number of selections, sec. structure, gold standard]

AIP data
[graph annotation: increasing conservativity]

Summary
Compared new and published parameters
Determined optimal single parameters
Selected parameter sets for machine learning
Evaluated two machine learning strategies
Neighbourhood parameters can be useful for antigenicity analysis
BLOSUM62-based alphabets are similarly powerful as unreduced alphabets
Best physico-chemical parameters: contact energy, hydrophobicity, beta-turn, beta-sheet, accessible surface area and van der Waals potential
Machine learning classifiers are consistently better than the gold standard

Acknowledgements
Intercell
Emergentec
Bernd Mayer
Martin Hafner
Martin Willensdorfer
Jörg Fritz
Alexander von Gabain
Biovertis
Uwe von Ahsen
University of Leipzig
Peter F. Stadler