Stabilization matrix method (Ridge regression) Morten Nielsen Department of Systems Biology, DTU.

Presentation transcript:

Stabilization matrix method (Ridge regression) Morten Nielsen Department of Systems Biology, DTU

A prediction method contains a very large set of parameters
–A matrix for predicting binding of 9-mer peptides has 9x20 = 180 weights
Overfitting is a problem
[Figure: a data-driven method fit to temperature versus training years, illustrating overfitting]
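To make the 9x20 = 180 numbers concrete, here is a minimal sketch (not part of the lecture material) of a sparse one-hot encoding of a 9-mer peptide in Python/numpy; the alphabet ordering and the function name are my own choices.

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues, in an arbitrary fixed order

def sparse_encode(peptide):
    """One-hot (sparse) encoding: a 9-mer becomes a 9*20 = 180 element vector."""
    x = np.zeros(len(peptide) * len(AMINO_ACIDS))
    for pos, aa in enumerate(peptide):
        x[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return x

x = sparse_encode("ALAKAAAAM")
print(x.shape, int(x.sum()))           # (180,) 9 -- one non-zero entry per peptide position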

Model over-fitting (early stopping)
Evaluate on 600 MHC:peptide binding data, PCC = 0.89
[Figure: training curve with the "Stop training" point marked where test performance peaks]
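A minimal sketch of the early-stopping idea on this slide, using synthetic stand-in data and a plain iterative least-squares fit instead of the lecture's neural network; the data sizes, learning rate, and patience value are assumptions, not values from the lecture.

import numpy as np

rng = np.random.default_rng(0)
H = rng.random((600, 30))                          # stand-in for encoded peptides
t = H @ rng.normal(size=30) + rng.normal(scale=0.3, size=600)
train, test = np.arange(400), np.arange(400, 600)  # hold part of the data out for evaluation

w = np.zeros(30)
best_pcc, best_w, since_best = -1.0, w.copy(), 0
for epoch in range(500):
    # one epoch of plain (unregularized) gradient descent on the training part
    grad = H[train].T @ (H[train] @ w - t[train]) / len(train)
    w -= 0.1 * grad
    pcc = np.corrcoef(H[test] @ w, t[test])[0, 1]  # performance on the held-out data
    if pcc > best_pcc:
        best_pcc, best_w, since_best = pcc, w.copy(), 0
    else:
        since_best += 1
        if since_best >= 20:                       # held-out performance stopped improving: stop training
            break

print("best held-out PCC:", round(best_pcc, 2))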

Stabilization matrix method. The mathematics
y = ax + b
2 parameter model
Good description, poor fit
y = ax^6 + bx^5 + cx^4 + dx^3 + ex^2 + fx + g
7 parameter model
Poor description, good fit

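A small numerical illustration of the point above (my own example with made-up noisy data, not from the slides): fit the 2-parameter and the 7-parameter model with numpy and compare how they behave on points that were not used for the fit.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)   # noisy measurements of a straight line

line = np.polyfit(x, y, deg=1)   # y = ax + b                      : 2 parameter model
poly = np.polyfit(x, y, deg=6)   # y = ax^6 + bx^5 + ... + fx + g  : 7 parameter model

x_new = np.linspace(0.05, 0.95, 7)                        # points not used in the fit
y_new = 2.0 * x_new + 1.0

def rmse(coef):
    return np.sqrt(np.mean((np.polyval(coef, x_new) - y_new) ** 2))

# The 7-parameter model typically gets closer to the training points (good fit)
# but wiggles between them (poor description); the straight line does the opposite.
print("2-parameter model, error on new points:", round(rmse(line), 3))
print("7-parameter model, error on new points:", round(rmse(poly), 3))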

SMM training
Evaluate on 600 MHC:peptide binding data
λ = 0: PCC = 0.70
λ = 0.1: PCC = 0.78

Stabilization matrix method. The analytic solution
Each peptide is represented as 9*20 numbers (180)
H is a stack of such vectors of 180 values
t is the target value (the measured binding)
λ is a parameter introduced to suppress the effect of noise in the experimental data and lower the effect of overfitting
The weights minimizing ||H·w − t||^2 + λ·||w||^2 are given in closed form by w = (H^T·H + λ·I)^(-1)·H^T·t
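A minimal numpy sketch of this closed-form solution, w = (H^T·H + λ·I)^(-1)·H^T·t, using random stand-in data in place of the encoded peptides and measured binding values; the sizes and λ are just example values.

import numpy as np

rng = np.random.default_rng(0)
n_pep, n_feat = 600, 180                 # 600 peptides, 9*20 = 180 features each
H = rng.integers(0, 2, size=(n_pep, n_feat)).astype(float)   # stand-in for sparse-encoded peptides
t = rng.random(n_pep)                    # stand-in for measured binding values

def smm_solve(H, t, lam):
    """Analytic SMM / ridge solution: w = (H^T H + lambda*I)^-1 H^T t."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ t)

w = smm_solve(H, t, lam=0.1)
print(w.shape)                           # (180,) -- one weight per position/amino-acid combination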

SMM - Stabilization matrix method
[Diagram: inputs I1 and I2 with weights w1 and w2 feeding a linear output o]
Linear function: o = Σ_i w_i·I_i
The error to minimize contains a sum over data points (the prediction error) and a sum over weights (the regularization term)

SMM - Stabilization matrix method
[Diagram: inputs I1 and I2 with weights w1 and w2 feeding a linear output o]
Linear function: o = Σ_i w_i·I_i
Per target error: E = 1/2·(o − t)^2 + λ·Σ_i w_i^2
Global error: E = Σ_data 1/2·(o − t)^2 + λ·Σ_i w_i^2 (a sum over data points plus λ times a sum over weights)
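A short sketch of these two error functions in Python (the helper names are my own; it assumes the per-target error also carries the λ penalty, matching the per-target gradient used for gradient descent further below):

import numpy as np

def per_target_error(w, I, t, lam):
    """Per target error for one data point: 1/2*(o - t)^2 + lambda * sum_i w_i^2."""
    o = np.dot(w, I)                              # linear prediction o = sum_i w_i * I_i
    return 0.5 * (o - t) ** 2 + lam * np.sum(w ** 2)

def global_error(w, H, t, lam):
    """Global error: sum over data points of 1/2*(o - t)^2, plus lambda * sum over weights."""
    o = H @ w
    return 0.5 * np.sum((o - t) ** 2) + lam * np.sum(w ** 2)

# Tiny usage example with two inputs (I1, I2) and five data points
rng = np.random.default_rng(0)
H = rng.random((5, 2))
t = rng.random(5)
w = np.array([0.3, -0.1])
print(per_target_error(w, H[0], t[0], lam=0.05), global_error(w, H, t, lam=0.05))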

SMM - Stabilization matrix method. Do it yourself
[Diagram: inputs I1 and I2 with weights w1 and w2 feeding a linear output o]
Linear function: o = Σ_i w_i·I_i
Per target error: E = 1/2·(o − t)^2 + λ·Σ_i w_i^2

And now you

SMM - Stabilization matrix method
[Diagram: inputs I1 and I2 with weights w1 and w2 feeding a linear output o]
Linear function: o = Σ_i w_i·I_i
Per target error: E = 1/2·(o − t)^2 + λ·Σ_i w_i^2
Gradient: ∂E/∂w_i = (o − t)·I_i + 2·λ·w_i

SMM - Stabilization matrix method
[Diagram: inputs I1 and I2 with weights w1 and w2 feeding a linear output o]
Linear function: o = Σ_i w_i·I_i
Gradient descent update: w_i ← w_i − ε·∂E/∂w_i
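A sketch of per-target (online) gradient descent training using the update above; the function name, learning rate, epoch count, and the random stand-in data are my own assumptions, not the lecture's code.

import numpy as np

def smm_gradient_descent(H, t, lam, epsilon=0.01, epochs=200, seed=0):
    """Per-target gradient descent: for each data point, step along -dE/dw where
    dE/dw_i = (o - t)*I_i + 2*lambda*w_i (gradient of the per target error)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.05, size=H.shape[1])
    for _ in range(epochs):
        for I, target in zip(H, t):                    # one data point at a time
            o = np.dot(w, I)                           # linear prediction
            grad = (o - target) * I + 2.0 * lam * w    # gradient of the per target error
            w -= epsilon * grad                        # gradient descent step
    return w

# Usage on random stand-in data with a known underlying weight vector
rng = np.random.default_rng(1)
H = rng.random((50, 4))
t = H @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.01, size=50)
print(np.round(smm_gradient_descent(H, t, lam=0.01), 2))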

SMM - Stabilization matrix method. Monte Carlo
[Diagram: inputs I1 and I2 with weights w1 and w2 feeding a linear output o]
Linear function: o = Σ_i w_i·I_i
Global error: E = Σ_data 1/2·(o − t)^2 + λ·Σ_i w_i^2
Make a random change to the weights
Calculate the change in the "global" error
Update the weights if the MC move is accepted
Note the difference between MC and GD in the use of the "global" versus the "per target" error
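A sketch of the Monte Carlo procedure described on this slide, written against the global error; the greedy acceptance rule (keep a move only if the global error drops) and the step size are assumptions, a Metropolis criterion with a temperature could be used instead.

import numpy as np

def smm_monte_carlo(H, t, lam, steps=5000, step_size=0.05, seed=0):
    """Monte Carlo minimization of the global SMM error:
    make a random change to one weight, keep it if the global error improves."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.05, size=H.shape[1])

    def global_error(w):
        return 0.5 * np.sum((H @ w - t) ** 2) + lam * np.sum(w ** 2)

    e = global_error(w)
    for _ in range(steps):
        i = rng.integers(H.shape[1])                   # pick a random weight ...
        trial = w.copy()
        trial[i] += rng.normal(scale=step_size)        # ... and make a random change to it
        e_trial = global_error(trial)
        if e_trial < e:                                # accept the MC move if the global error drops
            w, e = trial, e_trial
    return w

# Usage on random stand-in data
rng = np.random.default_rng(1)
H = rng.random((50, 4))
t = H @ np.array([0.5, -0.2, 0.1, 0.3])
print(np.round(smm_monte_carlo(H, t, lam=0.01), 2))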

Training/evaluation procedure
Define method
Select data
Deal with data redundancy
–In method (sequence weighting)
–In data (Hobohm)
Deal with over-fitting either
–in method (SMM regularization term) or
–in training (stop fitting on test set performance)
Evaluate method using cross-validation (a sketch of the cross-validation step follows below)
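A sketch of the cross-validation step at the end of this list: select λ by 5-fold cross-validation using the analytic SMM solver from earlier; the fold count, λ grid, and stand-in data are my own choices.

import numpy as np

def cross_validate_lambda(H, t, lambdas, n_folds=5, seed=0):
    """For each lambda, train on n_folds-1 partitions with the closed-form SMM
    solution and record the mean squared error on the left-out partition."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), n_folds)
    results = {}
    for lam in lambdas:
        errors = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            w = np.linalg.solve(H[train].T @ H[train] + lam * np.eye(H.shape[1]),
                                H[train].T @ t[train])
            errors.append(np.mean((H[test] @ w - t[test]) ** 2))
        results[lam] = float(np.mean(errors))
    return results

# Usage on random stand-in data
rng = np.random.default_rng(2)
H = rng.random((100, 20))
t = H @ rng.normal(size=20) + rng.normal(scale=0.1, size=100)
scores = cross_validate_lambda(H, t, lambdas=[0.0, 0.01, 0.1, 1.0, 10.0])
print("best lambda:", min(scores, key=scores.get))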

A small doit tcsh script (/home/projects/mniel/ALGO/code/SMM)

#! /bin/tcsh -f
set DATADIR = /home/projects/mniel/ALGO/data/SMM/

# Loop over the alleles
foreach a ( A0101 A3002 )
   mkdir -p $a
   cd $a
   # Loop over the cross-validation partitions
   foreach n ( )
      mkdir -p n.$n
      cd n.$n
      # Training (f) and evaluation (c) files for this partition
      cat $DATADIR/$a/f00$n | gawk '{print $1,$2}' > f00$n
      cat $DATADIR/$a/c00$n | gawk '{print $1,$2}' > c00$n
      touch $$.perf
      rm -f $$.perf
      # Train with each lambda and record the mean squared error on the evaluation partition
      foreach l ( )
         smm -nc 500 -l $l f00$n > mat.$l
         pep2score -mat mat.$l c00$n > c00$n.pred
         echo $l `cat c00$n.pred | grep -v "#" | args 2,3 | gawk 'BEGIN{n=0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}' ` >> $$.perf
      end
      # Pick the lambda with the lowest error and redo the evaluation predictions with it
      set best_l = `cat $$.perf | sort -nk2 | head -1 | gawk '{print $1}'`
      echo "Allele" $a "n" $n "best_l" $best_l
      pep2score -mat mat.$best_l c00$n > c00$n.pred
      cd ..
   end
   # Concatenated cross-validated performance: Pearson correlation (xycorr) and mean squared error
   echo $a `cat n.?/c00?.pred | grep -v "#" | args 2,3 | xycorr | grep -v "#"` \
        `cat n.?/c00?.pred | grep -v "#" | args 2,3 | \
         gawk 'BEGIN{n=0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}' `
   cd ..
end