Sequence encoding, Cross Validation
Morten Nielsen, BioSys, DTU

Outline
- Sequence encoding: how to represent biological data
- Overfitting: cross-validation
- Method evaluation

Sequence encoding: encoding of sequence data
- Sparse encoding
- Blosum encoding
- Sequence profile encoding
- Reduced amino acid alphabets

Sparse encoding
Input neuron (columns) versus amino acid (rows):

AAcid  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A      1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R      0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
N      0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
D      0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
C      0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Q      0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
E      0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
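A minimal sketch of how a peptide could be turned into sparse input values. The column order is assumed to continue in the usual BLOSUM ordering for the rows the slide does not show, and the function name `sparse_encode` is hypothetical, not taken from the slides.

```python
# Amino acid order: the slide shows A, R, N, D, C, Q, E; the remaining
# letters are an assumption (standard BLOSUM ordering).
ALPHABET = "ARNDCQEGHILKMFPSTWYV"


def sparse_encode(peptide):
    """Return a flat list of 20 * len(peptide) zeros and ones."""
    encoded = []
    for residue in peptide:
        vector = [0] * len(ALPHABET)
        # ALPHABET.index raises ValueError for letters outside the alphabet.
        vector[ALPHABET.index(residue)] = 1
        encoded.extend(vector)
    return encoded


inputs = sparse_encode("ALAKAAAAM")
print(len(inputs))   # 9 residues * 20 inputs each
print(inputs[:20])   # the 20 inputs for the first residue, A
```

Each residue switches on exactly one of its 20 inputs, so a 9-mer peptide gives 180 inputs, nine of which are 1.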

BLOSUM encoding (labelled "Blosum50 matrix" on the slide; the scores shown are those of BLOSUM62)

AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R   -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N   -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D   -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C    0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q   -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E   -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G    0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H   -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4

Sequence encoding (continued)

Sparse encoding
V:  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
L:  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0
V.L = 0 (unrelated)

Blosum encoding
V:  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
L: -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
V.L = 0.88 (highly related)
V.R = -0.08 (close to unrelated)
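The comparison of the two encodings can be sketched as follows. The slide does not say how its Blosum dot products were normalized. This sketch assumes a cosine normalization (dot product divided by the two vector lengths), which gives values close to, but not exactly, the figures quoted on the slide. The helper names are hypothetical.

```python
import math

ALPHABET = "ARNDCQEGHILKMFPSTWYV"

# Rows copied from the matrix shown on the BLOSUM encoding slide.
BLOSUM_ROWS = {
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
    "L": [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1],
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
}


def sparse_row(residue):
    """One-hot vector for a single amino acid."""
    return [1 if letter == residue else 0 for letter in ALPHABET]


def cosine(u, v):
    """Dot product normalized by vector lengths (assumed normalization)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


print("sparse V.L:", cosine(sparse_row("V"), sparse_row("L")))
print("blosum V.L:", round(cosine(BLOSUM_ROWS["V"], BLOSUM_ROWS["L"]), 2))
print("blosum V.R:", round(cosine(BLOSUM_ROWS["V"], BLOSUM_ROWS["R"]), 2))
```

In the sparse encoding any two different amino acids are orthogonal, whereas the Blosum rows make V and L look similar and V and R look nearly unrelated.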

Sequence encoding (continued)
- Each amino acid is encoded by 20 variables
- This might be highly inefficient
- Can this number be reduced without losing predictive performance?
- Use a reduced amino acid alphabet: charge, volume, hydrophobicity, chemical descriptors
- Appealing, but in my experience it does not work :-)

Evaluation of predictive performance
- A prediction method contains a very large set of parameters
- A matrix for predicting binding of 9-mer peptides has 9 x 20 = 180 weights
- Overfitting is a problem
[Figure: temperature plotted against years]

Evaluation of predictive performance
- Train PSSM on raw data
- No pseudo counts, no sequence weighting
- Fit 9*20 parameters to 9*10 data points
- Evaluate on training data: PCC = 0.97, AUC = 1.0
- Close to a perfect prediction method

Binders     Non-binders
ALAKAAAAM   MRSGRVHAV
ALAKAAAAN   VRFNIDETP
ALAKAAAAR   ANYIGQDGL
ALAKAAAAT   AELCGDPGD
ALAKAAAAV   QTRAVADGK
GMNERPILT   GRPVPAAHP
GILGFVFTM   MTAQWWLDA
TLNAWVKVV   FARGVVHVI
KLNEPVLLL   LQRELTRLQ
AVVPFIVSV   AVAEEMTKS

Evaluation of predictive performance
- Train PSSM on permuted data
- No pseudo counts, no sequence weighting
- Fit 9*20 parameters to 9*10 data points
- Evaluate on training data: PCC = 0.97, AUC = 1.0
- Close to a perfect prediction method
- And the same performance as on the original data

Binders (permuted)   Non-binders
AAAMAAKLA            MRSGRVHAV
AAKNLAAAA            VRFNIDETP
AKALAAAAR            ANYIGQDGL
AAAAKLATA            AELCGDPGD
ALAKAVAAA            QTRAVADGK
IPELMRTNG            GRPVPAAHP
FIMGVFTGL            MTAQWWLDA
NVTKVVAWL            FARGVVHVI
LEPLNLVLK            LQRELTRLQ
VAVIVSVPF            AVAEEMTKS
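The overfitting experiment can be reproduced in spirit with a few lines of code. This is a sketch under stated assumptions: the slides do not give the exact scoring scheme, so the PSSM here is a plain log-odds matrix against a uniform background of 1/20, with no pseudo counts and no sequence weighting. The function names are hypothetical. Only the AUC is computed, because unseen residues score minus infinity, which makes a correlation coefficient undefined.

```python
import math

BINDERS = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
           "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]
PERMUTED_BINDERS = ["AAAMAAKLA", "AAKNLAAAA", "AKALAAAAR", "AAAAKLATA", "ALAKAVAAA",
                    "IPELMRTNG", "FIMGVFTGL", "NVTKVVAWL", "LEPLNLVLK", "VAVIVSVPF"]
NON_BINDERS = ["MRSGRVHAV", "VRFNIDETP", "ANYIGQDGL", "AELCGDPGD", "QTRAVADGK",
               "GRPVPAAHP", "MTAQWWLDA", "FARGVVHVI", "LQRELTRLQ", "AVAEEMTKS"]
BACKGROUND = 1 / 20  # assumed uniform background frequency


def train_pssm(peptides):
    """Log-odds matrix from raw counts: no pseudo counts, no weighting."""
    pssm = []
    for pos in range(len(peptides[0])):
        column = [p[pos] for p in peptides]
        pssm.append({aa: math.log(column.count(aa) / len(column) / BACKGROUND)
                     for aa in set(column)})
    return pssm


def score(pssm, peptide):
    # A residue never seen at a position has frequency 0, i.e. log-odds -inf.
    return sum(pssm[pos].get(aa, float("-inf")) for pos, aa in enumerate(peptide))


def auc(pos_scores, neg_scores):
    """Fraction of (binder, non-binder) pairs ranked correctly; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def training_auc(binders, non_binders):
    """Train on the binders and evaluate on the very same data."""
    pssm = train_pssm(binders)
    return auc([score(pssm, p) for p in binders],
               [score(pssm, p) for p in non_binders])


print("raw data:     ", training_auc(BINDERS, NON_BINDERS))
print("permuted data:", training_auc(PERMUTED_BINDERS, NON_BINDERS))
```

With these peptides both calls give an AUC of 1.0: every non-binder contains at least one residue never seen at that position among the ten training peptides, so the matrix separates the two groups perfectly even when the binders carry no motif at all.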

Repeat on large training data (229 ligands)

Cross validation
- Train on 4/5 of data
- Test/evaluate on 1/5
- => Produce 5 different methods, each with a different prediction focus
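The 4/5 versus 1/5 split can be sketched as below. The function name `five_fold_partitions` and the placeholder peptide identifiers are hypothetical. The slides do not say how the data are assigned to folds, so a simple interleaved assignment is assumed here.

```python
def five_fold_partitions(items, n_folds=5):
    """Yield (train, test) lists; each item is used for testing exactly once."""
    # Interleaved assignment: item i goes to fold i % n_folds (an assumption).
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


peptides = [f"pep{i:02d}" for i in range(20)]  # placeholder identifiers
for train, test in five_fold_partitions(peptides):
    print(len(train), "training peptides,", len(test), "test peptides")
```

Training one method per partition gives the five different methods mentioned above, each of which has never seen its own test fold.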

Method evaluation
- Use cross validation
- Evaluate on the concatenated data, not as an average over the individual cross-validated performances
- Even better, use an external evaluation set that is not part of the training data
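The difference between the two ways of summarizing cross-validated performance can be sketched as follows. The numbers are invented toy values chosen to exaggerate the effect, not results from the slides, and the function names are hypothetical. One plausible reason for preferring the concatenated evaluation, shown here, is that the five methods may produce scores on different scales, which a per-fold average hides.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def concatenated_pcc(folds):
    """folds: list of (measured, predicted) pairs, one pair per test fold."""
    measured = [m for ms, _ in folds for m in ms]
    predicted = [p for _, ps in folds for p in ps]
    return pearson(measured, predicted)


def mean_fold_pcc(folds):
    """Average of the per-fold correlation coefficients."""
    return sum(pearson(ms, ps) for ms, ps in folds) / len(folds)


# Toy example (invented): each fold ranks its own test peptides perfectly,
# but the two methods predict on very different scales.
toy_folds = [
    ([0.1, 0.2, 0.3], [0.5, 0.6, 0.7]),
    ([0.7, 0.8, 0.9], [0.1, 0.2, 0.3]),
]
print("mean of per-fold PCC:", round(mean_fold_pcc(toy_folds), 2))
print("PCC on concatenated: ", round(concatenated_pcc(toy_folds), 2))
```

In this toy case the per-fold average looks perfect while the concatenated correlation is negative, so the concatenated figure is the more honest summary of how the predictions behave as one data set.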

Method evaluation
- How is an external evaluation set evaluated on a 5-fold cross-validated training?
- The cross-validation generates 5 individual methods
- Predict the evaluation set with each of the 5 methods and take the average
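Averaging the five methods on an external peptide can be sketched as below. The five stand-in models are hypothetical callables invented for illustration. In practice each would be one of the methods trained during cross-validation.

```python
def ensemble_predict(models, peptide):
    """Average the predictions of all cross-validation models for one peptide."""
    return sum(model(peptide) for model in models) / len(models)


# Hypothetical stand-ins for five trained methods: each returns a score that
# depends on the peptide, plus a model-specific offset.
models = [lambda pep, offset=offset: offset + 0.1 * pep.count("L")
          for offset in (0.1, 0.2, 0.3, 0.4, 0.5)]

external_set = ["KLNEPVLLL", "MRSGRVHAV"]
for peptide in external_set:
    print(peptide, round(ensemble_predict(models, peptide), 3))
```

No single model has to be picked: every external peptide gets one prediction, the mean over all five.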

5-fold training: which PSSM to choose?

5-fold training: use them all

Summary
- Define which encoding scheme to use: sparse, Blosum, reduced alphabet, and combinations of these
- Deal with overfitting
- Evaluate the method using cross-validation
- Evaluate external data using the method ensemble