School of Pharmacy Medical University of Sofia

Presentation transcript:

School of Pharmacy, Medical University of Sofia
Application of machine learning techniques for allergenicity prediction
Ivan Dimitrov
2nd Regional Conference “Supercomputing Applications in Science and Industry”, Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011

Allergen processing pathways
Allergy is a form of hypersensitivity to normally innocuous substances such as dust, pollen, foods or drugs. Allergens are small antigens that commonly provoke an IgE antibody response. Such antigens normally enter the body at very low doses by diffusion across mucosal surfaces and trigger a Th2 response. The allergen-specific Th2 cells drive allergen-specific B cells to produce IgE, which binds to the high-affinity surface receptor FcεRI on mast cells, basophils and activated eosinophils. On activation, these cells release stored mediators, which cause inflammation and tissue damage manifested by different symptoms. Inhalant allergens cause rhinitis, conjunctivitis and asthmatic symptoms, while food allergens lead to abdominal pain, bloating, vomiting and diarrhea. Food allergens rarely cause respiratory reactions and inhalant allergens rarely affect the gut (Rusznak and Davies, 1998, Wiki).
C. M. Hawrylowicz & A. O'Garra, Nature Reviews Immunology 2005, 271-283

FAO and WHO Codex Alimentarius guidelines for evaluating the potential allergenicity of novel proteins
Although there is no consensus allergen structure, FAO and WHO have produced Codex Alimentarius guidelines for evaluating the potential allergenicity of any novel protein. According to these guidelines, a query protein is potentially allergenic if, when compared with known allergens, it either:
- has an identity of 6 to 8 contiguous amino acids, or
- has > 35% sequence similarity over a window of 80 amino acids.
Codex Principles and Guidelines on Foods Derived from Biotechnology. 2003 Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food Standards Programme, Food and Agriculture Organization.
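The first criterion above can be illustrated with a short Python sketch. It is only an illustration, not the procedure used by any particular server: the function name and the two example sequences are made up, and the second criterion (> 35% over an 80-residue window) is deliberately left out because it requires a real sequence-alignment tool.

```python
def shares_contiguous_match(query: str, allergen: str, k: int = 6) -> bool:
    """Return True if the two sequences share an identical run of k residues.

    Hypothetical helper illustrating the "6 to 8 contiguous amino acids"
    rule; k = 6 is the most permissive setting of that rule.
    """
    if len(query) < k or len(allergen) < k:
        return False
    # All k-residue windows of the known allergen.
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    # The query is flagged if any of its windows occurs in the allergen.
    return any(query[i:i + k] in allergen_kmers
               for i in range(len(query) - k + 1))


# Made-up sequences, for illustration only.
query = "MKTAYIAKQRQISFVK"
flagged = shares_contiguous_match(query, "GGSAYIAKQRLLE")    # shares "AYIAKQ"
not_flagged = shares_contiguous_match(query, "GGSAYIAKLLE")  # longest shared run is 5
print(flagged, not_flagged)
```

A check of this kind is cheap, which is one reason short exact matches also produce many false positives.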

Bioinformatics approaches to allergen prediction
1. Sequence-alignment search of the query protein
Two bioinformatics approaches to allergen prediction currently exist. The first follows the FAO/WHO guidelines and searches for sequence similarity. The Structural Database of Allergenic Proteins (SDAP) and Allermatch contain extensive databases of known allergen proteins and use them as references in a sequence-alignment search of the query protein.
Characteristics:
- High sensitivity (true positives / (true positives + false negatives)).
- Many false positives and therefore low precision (true positives / (true positives + false positives)).
- Discovery of novel antigens is restricted by their lack of similarity to known allergens.
Ivanciuc et al. Nucleic Acids Res. 2003, 31, 359–362
Fiers et al. BMC Bioinformatics 2004, 5, 133

Bioinformatics approaches to allergen prediction
2. Identification of conserved allergenicity-related linear motifs
The second approach is based on the identification of conserved allergenicity-related linear motifs. These methods use different techniques for the identification, representation and analysis of such motifs:
- Comparing allergens to non-allergens with the MEME motif discovery tool.
- Clustering of known allergens, wavelet analysis and hidden Markov model.
- Automated Selection of Allergen-Representative Peptides (DASARP).
- Motif search by Support Vector Machines (SVM), MEME/MAST, IgE epitopes and Allergen-Representative Peptides (ARP).
- Iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine.
Both approaches are based on the assumption that allergenicity is a linearly coded property.
Stadler and Stadler FASEB J. 2003, 17, 1141-1143
Saha and Raghava Nucleic Acids Research, 2006, 34, 202-209
Li et al. Bioinformatics 2004, 20, 2572-2578
Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861
Björklund et al. Bioinformatics 2005, 21, 39–50

AIM of the study
Our aim was to create an alignment-free method for in silico identification of allergens, based on the main chemical properties of amino acid sequences, and to implement it in a web server.
Obstacles:
- The choice of appropriate descriptors to represent the physicochemical properties of amino acid sequences.
- Allergens are proteins of different lengths.

The z-scales
The principal properties of the amino acids were represented by z descriptors, originally derived by Hellberg et al. to describe amino acid hydrophobicity (z1), molecular size (z2) and polarity (z3). These scales were derived by principal component analysis (PCA) of a data matrix consisting of 29 physicochemical variables, such as molecular weight, pKa's, 13C NMR shifts, etc. The z-scales reflect the most important properties of amino acids and are therefore often referred to as the "principal properties" of amino acids. With the three z-scales it is possible to numerically quantify the structural variation within a series of related peptides by arranging the z-scales according to the amino acid sequence, for example …Phe – Arg – Trp…:

Residue   z1 (hydrophobicity)   z2 (molecular size)   z3 (polarity)
Phe       -4.22                 1.94                  1.08
Arg        3.62                 2.60                 -3.60
Trp       -4.36                 3.94                  0.69

Hellberg et al. J. Med. Chem. 1987; 30, 1126-1135
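Arranging the z-scales along a sequence can be sketched in Python as follows. The table holds only the three residues whose values appear on the slide (Phe, Arg, Trp); a usable encoder would need the published values for all 20 amino acids. The names `Z_SCALES` and `encode_z` are hypothetical.

```python
# z1, z2, z3 values exactly as shown on the slide; only three residues are
# available here, so this table is intentionally incomplete.
Z_SCALES = {
    "F": (-4.22, 1.94, 1.08),   # Phe
    "R": (3.62, 2.60, -3.60),   # Arg
    "W": (-4.36, 3.94, 0.69),   # Trp
}


def encode_z(sequence: str) -> list:
    """Map a single-letter sequence to a list of (z1, z2, z3) tuples."""
    try:
        return [Z_SCALES[aa] for aa in sequence.upper()]
    except KeyError as err:
        raise ValueError(
            f"no z-scale values available for residue {err.args[0]!r}"
        ) from None


encoded = encode_z("FRW")
for residue, z in zip("FRW", encoded):
    print(residue, z)
```

The result has one row per residue, so its length still depends on the protein length; that is the problem the ACC transformation addresses.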

ACC transformation
The auto cross covariance (ACC) transformation turns protein sequences into uniform, equal-length vectors. ACC is a protein sequence mining method developed by Wold et al., which has been applied to quantitative structure-activity relationship (QSAR) studies of peptides of different lengths and to protein classification. The ACC transformation accounts for neighbour effects, i.e. the lack of independence between different sequence positions, through the lag variable. It consists of auto-covariance terms (the same z-scale at both positions) and cross-covariance terms (different z-scales at the two positions). In the equations, the indices j and k refer to the z-scales (j, k = 1, 2, 3), n is the number of amino acids in the sequence, i is the amino acid position (i = 1, 2, …, n) and lag is the distance between the two positions (lag = 1, 2, …, L). In our study short lags (L = 5) were chosen, as only the influence of close amino acid proximity was investigated.
Example: in the six-residue protein Phe – Arg – Trp – Phe – Arg – Trp, every residue is represented by z1 z2 z3. With lag = 1 there are five pairs of neighbouring residues, so ACC11(1) is the sum of the five z1 × z1 products of neighbouring residues divided by 5, and ACC13(1) is the sum of the five z1 × z3 products divided by 5.
Wold et al. Anal. Chim. Acta 1993, 277:239-225
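The equations themselves did not survive in this transcript. The sketch below assumes the form ACC_jk(lag) = Σ_{i=1..n−lag} z_{j,i} × z_{k,i+lag} / (n − lag), with j = k giving the auto-covariance terms. This form is inferred from the index definitions and from the "/5" shown for a six-residue example at lag 1; whether the original also mean-centres the z values is not stated, so no centring is applied here. The function name and the ordering of the output values are arbitrary choices.

```python
# z values for the three residues shown on the z-scales slide.
Z_SCALES = {
    "F": (-4.22, 1.94, 1.08),
    "R": (3.62, 2.60, -3.60),
    "W": (-4.36, 3.94, 0.69),
}


def acc_transform(z, max_lag=5):
    """Turn per-residue descriptors into a fixed-length ACC vector.

    z is a list of (z1, z2, z3) tuples, one per residue. The result has
    3 * 3 * max_lag values (45 for max_lag = 5), whatever the length of z.
    Assumed formula: ACC_jk(lag) = sum_i z[i][j] * z[i + lag][k] / (n - lag).
    """
    n = len(z)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_desc = len(z[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):          # descriptor at position i
            for k in range(n_desc):      # descriptor at position i + lag
                total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
                features.append(total / (n - lag))
    return features


# The six-residue example from the slide: Phe-Arg-Trp-Phe-Arg-Trp.
z = [Z_SCALES[aa] for aa in "FRWFRW"]
features = acc_transform(z)
print(len(features))   # number of ACC variables
print(features[0])     # ACC11(1): five z1*z1 products divided by 5
```

Because the output length depends only on the number of descriptors and the maximum lag, proteins of any length above the maximum lag map to vectors of the same size.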

Preliminary study
A set of 595 food allergens was collected from the CSL (Central Science Laboratory) allergen database (http://allergen.csl.gov.uk). A corresponding set of 595 non-allergens from the same species was collected from the NCBI database (http://www.ncbi.nlm.nih.gov/). A training set of 475 allergens and 475 non-allergens was formed, based on equal representation of all species in the initial set. The amino acid sequences were represented by z descriptors and, after ACC transformation, formed a matrix with 45 variables (3² × 5) and 950 observations. We applied different statistical and machine learning methods to that matrix and validated the resulting models on an external test set of 120 food allergens and 120 non-allergens.
Methods:
- PLS discriminant analysis, performed with the SIMCA software.
- k nearest neighbours, performed by a Python script based on a BioPython module.
- Logistic regression, naïve Bayes and decision tree algorithms, performed with the Orange visualization and analysis tool.
The results were evaluated using the sensitivity, specificity and accuracy of each method.

Results from preliminary study
TP – true positive, FP – false positive, TN – true negative, FN – false negative.
Comparison of the methods shows the best results for k nearest neighbours at k = 5. All of the methods show some imbalance between sensitivity and specificity, but for PLS-DA it is significant. The most homogeneous results for specificity and sensitivity are observed for the k nearest neighbours algorithm. The difference between specificity and sensitivity across all of the methods suggests that the training set needs further improvement.
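For reference, the three evaluation measures follow directly from the TP, FP, TN and FN counts. The sketch below uses the standard definitions; the counts in the example are hypothetical and chosen only to add up to a 120 + 120 test set, they are not results from the study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)                 # allergens correctly recognised
    specificity = tn / (tn + fp)                 # non-allergens correctly recognised
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # all correct predictions
    return sensitivity, specificity, accuracy


# Hypothetical counts for a test set of 120 allergens and 120 non-allergens.
sens, spec, acc = classification_metrics(tp=90, fp=20, tn=100, fn=30)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

With equally sized classes, as in this test set, accuracy is simply the mean of sensitivity and specificity, so a large gap between the two is hidden by the accuracy figure alone.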

Web servers on the test set
Servers tested on the test set of 120 food allergens and 120 non-allergens:
- AlgPred (SVM with single amino acid composition; SVM with dipeptide composition)
- Evaller
- APPEL
- Allerhunter
We tested the performance of the available web servers on our test set and compared them with our best result, kNN (k = 5). All of the servers use support vector machines (SVM) as the machine learning method, combined with different methods for peptide representation. The comparison shows an imbalance between sensitivity and specificity for almost all of the servers. The servers with the most homogeneous values for specificity and sensitivity are also the ones with the best performance. The highest results among the servers were achieved by Allerhunter: 87% sensitivity, 92% specificity and 89.9% accuracy.
Saha and Raghava Nucleic Acids Research, 2006, 34, 202-209
Barrio et al. Nucleic Acids Research 2007, 35, 694-700
http://jing.cz3.nus.edu.sg/cgi-bin/APPEL
Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861

Conclusions from the preliminary study
1. The model developed by the k nearest neighbours method shows the best performance on the test set compared with the other methods. It has a good balance between specificity and sensitivity, and the highest accuracy. kNN was used further in the study.
2. The Allerhunter server is the best performing among the known servers for allergen prediction. kNN needs some more improvement.
3. A great imbalance exists between sensitivity and specificity for almost all servers. This indicates that the dataset needs some improvement too.

The kNN algorithm
The protein sequences of the training set, containing 475 allergens and 475 non-allergens, are represented by vectors of z descriptors. These vectors are subjected to ACC transformation, which turns the training set into a matrix with 45 variables (3² × 5) and 950 observations. Every protein from the test set is likewise represented by z descriptors and transformed into a vector of 45 ACC values. The unknown protein is then classified as follows:
1. Calculate the Euclidean distance between the vector of the unknown protein and each of the 950 observations.
2. Sort the distances in ascending order.
3. Determine the k nearest neighbours, i.e. the k observations with the smallest distances.
4. Assign the unknown protein to the class of the majority of its nearest neighbours.
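The four steps can be sketched in plain Python as below. This is a minimal illustration rather than the script used in the study: the function name is hypothetical, and the training data are made-up two-dimensional points standing in for the 45-dimensional ACC vectors.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=5):
    """Classify query by majority vote among its k nearest training vectors."""
    # Steps 1 and 2: Euclidean distance to every observation, sorted ascending.
    ranked = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    # Step 3: keep the k observations with the smallest distances.
    nearest_labels = [label for _, label in ranked[:k]]
    # Step 4: the majority class wins (use an odd k to avoid ties
    # when there are two classes).
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D data, for illustration only.
train_vectors = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["non-allergen"] * 3 + ["allergen"] * 3

print(knn_predict(train_vectors, train_labels, (4.5, 5.0), k=3))
print(knn_predict(train_vectors, train_labels, (0.2, 0.1), k=3))
```

kNN has no training phase in the usual sense: the whole training matrix is kept and every prediction scans all observations, which is affordable at the scale of 950 observations.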

Next: Extend the data sets
We extracted data for food and inhalant allergens from four databases: the CSL allergen database, the FARRP allergen database, the SDAP database and the ADFS database. This gave 684 food allergens, 1157 inhalant allergens and 553 toxin, venom or salivary allergens. The allergen species in the resulting sets were used to build a local database of NCBI protein records for all allergen species (at most 1000 records per species). The proteins in this local database were searched with BLAST against the collected set of all allergens (food and inhalant) to form sets of non-allergens that come from the same species but have no sequence similarity to the allergens. The result was 684 non-allergens of food origin, 1157 non-allergens of inhalant origin and 553 non-allergens from species with toxin, venom or salivary allergens, so that each allergen set is matched by the same number of non-allergens from the same species.
http://allergen.csl.gov.uk
http://www.allergenonline.org/
http://fermi.utmb.edu/SDAP/
http://allergen.nihs.go.jp/ADFS/index.jsp
http://www.ncbi.nlm.nih.gov/

Next: kNN optimization
We used the set of 684 food allergens and 684 non-allergens to optimize the kNN algorithm, which showed the best performance among the machine learning methods. The set was divided into a training set of 528 allergens and 528 non-allergens and a test set of 156 allergens and 156 non-allergens, used for external validation. kNN models with different k values were trained and tested to find the best k for this set. Increasing k led to a slight increase in specificity, but sensitivity decreased significantly; as a result, accuracy fell as k increased. The best accuracy was achieved for k = 3, although the most homogeneous results with respect to all three parameters (sensitivity, specificity and accuracy) were achieved for k = 5 and k = 7.
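The search over k can be sketched as a simple loop that evaluates each candidate on the held-out set and records the three measures. Everything below is illustrative: the helper names are hypothetical, label 1 stands for allergen and 0 for non-allergen, and the one-dimensional toy data are made up, with one allergen deliberately placed among the non-allergens so that the choice of k matters.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def evaluate(train_x, train_y, test_x, test_y, k):
    """Return (sensitivity, specificity, accuracy); label 1 = allergen."""
    tp = fp = tn = fn = 0
    for x, y in zip(test_x, test_y):
        predicted = knn_predict(train_x, train_y, x, k)
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(test_y)


# Toy 1-D data: non-allergens near 0-3, allergens near 6-9, one allergen at 2.5.
train_x = [(0.0,), (1.0,), (2.0,), (3.0,), (2.5,), (6.0,), (7.0,), (8.0,), (9.0,)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
test_x = [(2.4,), (0.5,), (7.5,), (8.5,)]
test_y = [0, 0, 1, 1]

results = {k: evaluate(train_x, train_y, test_x, test_y, k) for k in (1, 3, 5)}
for k, (sens, spec, acc) in results.items():
    print(f"k={k}: sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")

# Pick the k with the highest accuracy (the first such k if several tie).
best_k = max(results, key=lambda k: results[k][2])
```

Selecting k by accuracy alone mirrors the choice of k = 3 described above; a criterion that also penalises the gap between sensitivity and specificity would instead favour the more homogeneous settings.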

kNN models
Each of the two sets of allergens and non-allergens was divided into a training set and a test set:
- Food: 684 allergens and 684 non-allergens; training set of 528 allergens and 528 non-allergens; test set of 156 allergens and 156 non-allergens.
- Inhalant: 1157 allergens and 1157 non-allergens; training set of 933 allergens and 933 non-allergens; test set of 224 allergens and 224 non-allergens.
The training sets of food and inhalant allergens were used to create kNN models with k = 3, since this value performed best during the optimization step. The models for food and inhalant allergens were validated externally with their respective test sets, and each model was also validated with the whole set of the other kind (the food model with the inhalant set and the inhalant model with the food set). The results are reported as sensitivity, specificity and accuracy.

kNN models
The results show that, while the choice of test set has no significant effect on specificity, sensitivity clearly depends on it. Both models show high specificity, i.e. both recognize non-allergens correctly (almost 90%). The lower sensitivity when the models are tested on sets consisting of a different kind of allergen agrees with the literature data that food allergens rarely cause respiratory reactions and inhalant allergens rarely affect the gut. The highest values for all three parameters were achieved by the kNN model trained on food allergens and validated on the food test set. The model based on the aggregated training set shows good performance, and its values for sensitivity, specificity and accuracy are very close to one another.

AllerTOP web tool for allergenicity prediction
We implemented the kNN model based on the aggregated training set of 1952 food, inhalant and other allergens and 1952 non-allergens in a web tool for online prediction of allergens. The server takes a protein sequence in single-letter format, transforms it into a vector of 45 ACC values and returns the model's output for the tested protein.
AllerTOP http://www.pharmfac.net/allertop
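Before a sequence can be turned into z descriptors it has to be a clean string of single-letter amino acid codes. The sketch below shows one way such input might be checked; it is not taken from the AllerTOP server. The function name is hypothetical, and accepting an optional FASTA header line starting with ">" is an assumption.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def clean_sequence(raw: str) -> str:
    """Return an upper-case single-letter sequence or raise ValueError.

    Blank lines and FASTA-style header lines (assumed to start with '>')
    are skipped; remaining lines are concatenated.
    """
    parts = []
    for line in raw.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):
            parts.append(line)
    sequence = "".join(parts).upper()
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if not sequence or invalid:
        raise ValueError(f"not a plain amino acid sequence: {invalid}")
    return sequence


print(clean_sequence(">demo\nfrw\nFRW\n"))
```

Rejecting non-standard letters up front matters here because every residue must have z-scale values for the ACC transformation to be defined.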

Servers performance on united test set
The performance of the servers on the aggregated test set of 441 food and inhalant allergens and 441 non-allergens is presented. Two of the servers from the preliminary study, APPEL and Evaller, were not available during this study. The highest results were achieved by Allerhunter and by the allergen-representative peptide method of the AlgPred server. The former even reached 100% specificity. The kNN model based on the aggregated training set of 1952 allergens shows very stable results for specificity and sensitivity, but this is not enough to reach the highest scores. The results for the Allerhunter server were obtained with a smaller test set because it cannot process short sequences (< 21 amino acids).

Conclusions
1. An alignment-free method for in silico prediction of allergens, based on the main physicochemical properties of proteins, was developed.
2. The method uses z descriptors to represent the amino acids in the protein sequences and ACC transformation to convert proteins into uniform vectors.
3. The k nearest neighbours method showed the best performance among the classification algorithms tested in the study, the others being PLS discriminant analysis, logistic regression, naïve Bayes and the decision tree algorithm.
4. The kNN algorithm was optimized and its performance was compared with that of the freely available web servers for prediction of allergens.
5. The kNN algorithm was implemented on a web server, freely available at: http://www.pharmfac.net/allertop

Drug Design Group, School of Pharmacy, Medical University of Sofia: Irini Doytchinova, Ivan Dimitrov, Mariyana Atanasova, Panaiot Garnev
Acknowledgements: Darren R. Flower, Aston University, Birmingham, UK
Funding: National Research Fund, Ministry of Education and Science, Bulgaria, Grant 02-1/2009