Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
For written notes on this lecture, please read "Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains", a tutorial at PKDD04 by Jinyan Li and Limsoon Wong, September 2004.
CS2220: Computation Foundation in Bioinformatics
Limsoon Wong, Institute for Infocomm Research
Lecture slides for 29 January 2005

Course Plan

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Outline
Overview of Supervised Learning
Decision Trees
Ensembles – Bagging, Boosting, Random forest, Randomization trees, CS4
Other Methods – K-Nearest Neighbour, Support Vector Machines, Bayesian Approach, Hidden Markov Models, Artificial Neural Networks

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Overview of Supervised Learning

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Computational Supervised Learning
Also called classification
Learn from past experience, and use the learned knowledge to classify new data
Knowledge is learned by intelligent algorithms
Examples: clinical diagnosis for patients, cell type classification

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Data
A classification application involves > 1 class of data, e.g., normal vs disease cells for a diagnosis problem
Training data is a set of instances (samples, points) with known class labels
Test data is a set of instances whose class labels are to be predicted

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Typical Notations
Training data {⟨x1, y1⟩, ⟨x2, y2⟩, …, ⟨xm, ym⟩}, where the xj are n-dimensional vectors and the yj are from a discrete space Y, e.g., Y = {normal, disease}
Test data {⟨u1, ?⟩, ⟨u2, ?⟩, …, ⟨uk, ?⟩}

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Process
Training data X with class labels Y → learn f, a classifier (a mapping, a hypothesis) → apply f to test data U to obtain predicted class labels f(U)
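A minimal Python sketch of this train-then-predict process; the decision-tree learner from scikit-learn and the toy arrays are illustrative stand-ins, not part of the lecture:

# Hypothetical sketch: learn a classifier f from labelled training data (X, Y),
# then apply it to unlabelled test data U.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[2.1, 0.3], [1.0, 4.5], [3.3, 1.2], [0.5, 3.9]])  # training vectors
Y = np.array(["normal", "disease", "normal", "disease"])        # known class labels
U = np.array([[2.8, 0.9], [0.7, 4.1]])                          # test vectors, labels unknown

f = DecisionTreeClassifier()   # any supervised learner could stand in here
f.fit(X, Y)                    # learn the mapping f: X -> Y
print(f.predict(U))            # predicted class labels f(U)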

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Relational Representation of Gene Expression Data
An m × n matrix of values xij: the m rows are samples, the n columns are features (gene 1, gene 2, …, gene n; n is of the order of 1000), and each row carries a class label (P or N)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Features
Also called attributes
Categorical features, e.g., color = {red, blue, green}
Continuous or numerical features, e.g., gene expression, age, blood pressure
Discretization converts numerical features into categorical ones

An Example Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Overall Picture of Supervised Learning
Data from many domains (biomedical, financial, government, scientific) is fed to classifiers (the "medical doctors"): decision trees, emerging patterns, SVM, neural networks, …

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Evaluation of a Classifier
Performance on independent blind test data
K-fold cross validation: divide the dataset into k even parts; k−1 of them are used for training and the remaining part is treated as test data; rotate and average
LOOCV (leave-one-out cross validation), a special case of k-fold CV with k = number of samples
Accuracy, error rate
False positive rate, false negative rate, sensitivity, specificity, precision
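A small sketch of these evaluation ideas; the fold split and the metric formulas follow the standard definitions, and the example labels are hypothetical:

import numpy as np

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly even parts (each part serves once as the test fold)."""
    return np.array_split(np.arange(n), k)

def metrics(y_true, y_pred, positive):
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return {
        "accuracy":    (tp + tn) / len(y_true),
        "error rate":  (fp + fn) / len(y_true),
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "precision":   tp / (tp + fp),
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (fn + tp),
    }

print([len(fold) for fold in k_fold_indices(14, 3)])   # 14 instances into 3 folds

# Hypothetical predictions on a 10-sample blind test set
y_true = np.array(list("PPPPPNNNNN"))
y_pred = np.array(list("PPPNPNNNPN"))
print(metrics(y_true, y_pred, positive="P"))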

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Requirements of Biomedical Classification High accuracy/sensitivity/specificity/precision High comprehensibility

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Importance of Rule-Based Methods
Systematic selection of a small number of features used for the decision making ⇒ increases the comprehensibility of the knowledge patterns
C4.5 and CART are two commonly used rule induction algorithms, a.k.a. decision tree induction algorithms

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Structure of Decision Trees
A tree of tests: the root node and internal nodes test features (e.g., x1 > a1, x2 > a2), and the leaf nodes carry class labels (A or B)
If x1 > a1 & x2 > a2, then it's the A class
C4.5 and CART are two of the most widely used
Easy interpretation, but accuracy generally unattractive

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Elegance of Decision Trees

Brief History of Decision Trees
CLS (Hunt et al., 1966) --- cost driven
ID3 (Quinlan, 1986) --- information driven
C4.5 (Quinlan, 1993) --- gain ratio + pruning ideas
CART (Breiman et al., 1984) --- Gini index

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
A Simple Dataset
9 Play samples, 5 Don't samples, a total of 14

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
A Decision Tree
Root node: outlook. If sunny, test humidity (<= 75: Play; > 75: Don't); if overcast: Play; if rain, test windy (false: Play; true: Don't)
Construction of a tree is equivalent to determination of the root node of the tree and the root node of its sub-trees
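Read as rules, this tree can be sketched directly as nested if-else code; the feature names follow the 14-instance table a few slides below, and the function name is merely illustrative:

def predict_play(outlook, humidity, windy):
    """The decision tree above, written as nested if-else rules."""
    if outlook == "sunny":
        return "Play" if humidity <= 75 else "Don't"
    elif outlook == "overcast":
        return "Play"
    else:  # outlook == "rain"
        return "Don't" if windy else "Play"

print(predict_play("sunny", 70, True))   # -> Play  (instance 1 in the table)
print(predict_play("rain", 80, True))    # -> Don't (instance 10 in the table)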

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Most Discriminatory Feature Every feature can be used to partition the training data If the partitions contain a pure class of training instances, then this feature is most discriminatory

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Example of Partitions
Categorical feature: the number of partitions of the training data is equal to the number of values of this feature
Numerical feature: two partitions (below and above a threshold)

Instance  Outlook   Temp  Humidity  Windy  Class
1         Sunny     75    70        true   Play
2         Sunny     80    90        true   Don't
3         Sunny     85    85        false  Don't
4         Sunny     72    95        true   Don't
5         Sunny     69    70        false  Play
6         Overcast  72    90        true   Play
7         Overcast  83    78        false  Play
8         Overcast  64    65        true   Play
9         Overcast  81    75        false  Play
10        Rain      71    80        true   Don't
11        Rain      65    70        true   Don't
12        Rain      75    80        false  Play
13        Rain      68    80        false  Play
14        Rain      70    96        false  Play
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Partition by Outlook (total 14 training instances):
Outlook = sunny: instances 1,2,3,4,5 with classes P,D,D,D,P
Outlook = overcast: instances 6,7,8,9 with classes P,P,P,P
Outlook = rain: instances 10,11,12,13,14 with classes D,D,P,P,P

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Partition by Temperature (total 14 training instances):
Temperature <= 70: instances 5,8,11,13,14 with classes P,P,D,P,P
Temperature > 70: instances 1,2,3,4,6,7,9,10,12 with classes P,D,D,D,P,P,P,D,P

Steps of Decision Tree Construction
Select the "best" feature as the root node of the whole tree
After partitioning by this feature, select the best feature (wrt the subset of training data) as the root node of this sub-tree
Recurse until the partitions become pure or almost pure
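A minimal ID3-style sketch of this recursion, using information gain (one of the measures on the next slides) as the "best feature" criterion; the toy rows at the end are hypothetical, and real inducers such as C4.5 add numeric splits, gain ratio, and pruning:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature, target="class"):
    labels = [r[target] for r in rows]
    gain = entropy(labels)
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def build_tree(rows, features, target="class"):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not features:           # partition is pure (or no feature left)
        return Counter(labels).most_common(1)[0][0]     # leaf: majority class
    best = max(features, key=lambda f: info_gain(rows, f, target))   # "best" feature as root
    node = {best: {}}
    for value in set(r[best] for r in rows):            # recurse on each partition
        subset = [r for r in rows if r[best] == value]
        node[best][value] = build_tree(subset, [f for f in features if f != best], target)
    return node

# Hypothetical toy rows (not the full 14-instance table, which also has numeric features)
rows = [{"outlook": o, "windy": w, "class": c} for o, w, c in [
    ("sunny", "true", "Don't"), ("sunny", "false", "Play"), ("overcast", "true", "Play"),
    ("overcast", "false", "Play"), ("rain", "true", "Don't"), ("rain", "false", "Play")]]
print(build_tree(rows, ["outlook", "windy"]))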

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Three Measures to Evaluate Which Feature is Best Gini index Information gain Information gain ratio

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Gini Index

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Information Gain

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Information Gain Ratio
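The standard formulas behind these three measures can be sketched as follows (Gini index = 1 − Σ pi^2; information gain = parent entropy − weighted child entropies; gain ratio = gain / split information); the example call scores the outlook split of the 14-instance dataset:

from collections import Counter
from math import log2

def gini(labels):                     # Gini index: 1 - sum_i p_i^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):                  # H = -sum_i p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def score_split(parent, children):
    """Return (weighted Gini, information gain, gain ratio) for one candidate split."""
    n = len(parent)
    weights = [len(ch) / n for ch in children]
    gini_split = sum(w * gini(ch) for w, ch in zip(weights, children))
    gain = entropy(parent) - sum(w * entropy(ch) for w, ch in zip(weights, children))
    split_info = -sum(w * log2(w) for w in weights if w > 0)   # penalises many-valued features
    return gini_split, gain, gain / split_info

# Outlook split of the 14 training instances (P = Play, D = Don't)
parent = list("PPPPPPPPPDDDDD")                          # 9 Play, 5 Don't
sunny, overcast, rain = list("PPDDD"), list("PPPP"), list("PPPDD")
print(score_split(parent, [sunny, overcast, rain]))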

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Characteristics of C4.5 Trees
Single coverage of training data (elegance)
Divide-and-conquer splitting strategy
Fragmentation problem ⇒ locally reliable but globally insignificant rules ⇒ many globally significant rules are missed, which can mislead the system

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Example Use of Decision Tree Methods: Proteomics Approaches to Biomarker Discovery
In prostate and bladder cancers (Adam et al., Proteomics, 2001)
In serum samples to detect breast cancer (Zhang et al., Clinical Chemistry, 2002)
In serum samples to detect ovarian cancer (Petricoin et al., Lancet; Li & Rao, PAKDD 2004)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Decision Tree Ensembles Bagging Boosting Random forest Randomization trees CS4

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Motivating Example
Suppose h1, h2, h3 are independent classifiers, each with accuracy 60%, C1 and C2 are the only classes, and t is a test instance in C1.
Let h(t) = argmax_{C ∈ {C1,C2}} |{ hj ∈ {h1, h2, h3} : hj(t) = C }| (majority vote).
Then prob(h(t) = C1)
= prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C1) + prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C2) + prob(h1(t)=C1 & h2(t)=C2 & h3(t)=C1) + prob(h1(t)=C2 & h2(t)=C1 & h3(t)=C1)
= 60% * 60% * 60% + 60% * 60% * 40% + 60% * 40% * 60% + 40% * 60% * 60%
= 64.8%
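The 64.8% figure is easy to check; a small sketch that enumerates the three-vote patterns and, more generally, computes majority-vote accuracy for any odd committee of independent classifiers:

from itertools import product
from math import comb

p = 0.6            # accuracy of each independent base classifier
# enumerate all correctness patterns of (h1, h2, h3); True = that classifier is correct
majority_acc = sum(
    (p if a else 1 - p) * (p if b else 1 - p) * (p if c else 1 - p)
    for a, b, c in product([True, False], repeat=3)
    if a + b + c >= 2)
print(majority_acc)                      # -> 0.648

# same quantity via the binomial distribution, for any odd committee size k
def committee_accuracy(p, k):
    return sum(comb(k, j) * p**j * (1 - p)**(k - j) for j in range(k // 2 + 1, k + 1))
print(committee_accuracy(0.6, 3), committee_accuracy(0.6, 11))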

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Bagging
Proposed by Breiman (1996)
Also called bootstrap aggregating
Makes use of randomness injected into the training data

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Main Ideas
From the original training set (50 p + 50 n), draw 100 samples with replacement to form bootstrap training sets such as 48 p + 52 n, 49 p + 51 n, 53 p + 47 n, …
Run a base inducer such as C4.5 on each bootstrap set to obtain a committee H of classifiers: h1, h2, …, hk

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Decision Making by Bagging
Given a new test sample T, each classifier in the committee votes, and the class with the most votes wins (equal voting)
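A minimal bagging sketch along these lines; scikit-learn's DecisionTreeClassifier stands in for C4.5, and the training data is randomly generated purely for illustration:

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # 100 hypothetical training samples
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # hypothetical class labels

def bagging(X, y, k=25):
    """Draw k bootstrap samples (with replacement) and train one tree on each."""
    committee = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))          # n of n, with replacement
        committee.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return committee

def bagging_predict(committee, t):
    """Equal voting: each classifier h_j votes, the majority class wins."""
    votes = [h.predict(t.reshape(1, -1))[0] for h in committee]
    return Counter(votes).most_common(1)[0][0]

H = bagging(X, y)
print(bagging_predict(H, np.array([0.5, 0.5, 0.0, 0.0, 0.0])))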

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Boosting AdaBoost by Freund & Schapire (1995) Also called Adaptive Boosting Make use of weighted instances and weighted voting

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Main Ideas
Start with 100 instances of equal weight, train a classifier h1, and measure its (weighted) error e1
If the error is 0 or > 0.5, stop
Otherwise re-weight the correctly classified instances by e1/(1−e1) and renormalize the total weight
Train the next classifier h2 on the instances with their different weights, so the samples misclassified by hi count more in the next round; repeat

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Decision Making by AdaBoost.M1
Given a new test sample T, each classifier votes with a weight determined by its training error, and the class with the largest total weight wins (weighted voting)
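A sketch of AdaBoost.M1 following this scheme: correctly classified instances are shrunk by e/(1−e), and each classifier later votes with weight log((1−e)/e). The weak learner (a depth-1 scikit-learn tree) and the data are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, rounds=10):
    n = len(X)
    w = np.full(n, 1.0 / n)                       # start with equal instance weights
    committee = []
    for _ in range(rounds):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = h.predict(X) != y
        e = w[wrong].sum() / w.sum()              # weighted error of this round
        if e == 0 or e >= 0.5:                    # stopping rule from the slide
            if not committee:
                committee.append((h, 1.0))        # degenerate case: keep the single classifier
            break
        beta = e / (1 - e)
        w[~wrong] *= beta                         # shrink weights of correctly classified instances
        w /= w.sum()                              # renormalize the total weight
        committee.append((h, np.log(1 / beta)))   # classifier + its voting weight
    return committee

def adaboost_predict(committee, t):
    votes = {}
    for h, alpha in committee:                    # weighted voting
        c = h.predict(t.reshape(1, -1))[0]
        votes[c] = votes.get(c, 0.0) + alpha
    return max(votes, key=votes.get)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # hypothetical data
H = adaboost_m1(X, y)
print(adaboost_predict(H, np.array([1.0, 1.0, 0.0, 0.0, 0.0])))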

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Bagging vs Boosting
Bagging: construction of the bagging classifiers is independent; equal voting
Boosting: construction of a new boosting classifier depends on the performance of its previous classifier, i.e., sequential construction (a series of classifiers); weighted voting

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Random Forest
Proposed by Breiman (2001)
Similar to bagging, but the base inducer is not the standard C4.5
Makes use of randomness twice

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Main Ideas
As in bagging, draw bootstrap training sets (e.g., 48 p + 52 n, 49 p + 51 n, 53 p + 47 n, …) from the original training set (50 p + 50 n)
The base inducer is not C4.5 but a revised tree inducer; the result is a committee H of classifiers: h1, h2, …, hk

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
A Revised C4.5 as Base Classifier
At the root node (and at every node below it), the split is selected not from the original n features but from m_try randomly chosen features

Decision Making by Random Forest
Given a new test sample T, every tree in the forest votes, and the majority class wins
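Both sources of randomness (a bootstrap sample of the instances per tree, plus an m_try-sized random feature subset at every split) are available off the shelf; a sketch using scikit-learn's RandomForestClassifier on hypothetical data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 16))                    # hypothetical data, n = 16 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=100,        # size k of the committee H
    max_features="sqrt",     # m_try = sqrt(n) features considered at each node
    bootstrap=True,          # randomness 1: bootstrap sample per tree
    random_state=0,
).fit(X, y)                  # randomness 2: random feature subset per split

print(forest.predict(rng.normal(size=(1, 16))))   # committee decision over the k trees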

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Randomization Trees Proposed by Dietterich (2000) Make use of randomness in the selection of the best split point

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Main Ideas
At the root node (and at every node below it), instead of always taking the single best split over the original n features, list the best candidate splits (e.g., feature 1: choices 1, 2, 3; feature 2: choices 1, 2; …; feature 8: choices 1, 2, 3 — 20 candidates in total) and select one of them at random
Equal voting on the committee of such decision trees

CS4 Proposed by Li et al (2003) CS4: Cascading and Sharing for decision trees Doesn’t make use of randomness

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Main Ideas
Selection of root nodes is done in a cascading manner: the top-ranked feature becomes the root of tree-1, the 2nd-ranked feature the root of tree-2, …, the k-th ranked feature the root of tree-k, giving a total of k trees

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Decision Making by CS4
Not equal voting

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Summary of Ensemble Classifiers
Bagging, Random Forest, AdaBoost.M1: rules may not be correct when applied to the training data
Randomization Trees, CS4: rules correct

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Any Question?

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Other Machine Learning Approaches K-Nearest Neighbour Support Vector Machine Bayesian Approach Hidden Markov Model Artificial Neural Networks

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Outline K-Nearest Neighbour Support Vector Machines Bayesian Approach Hidden Markov Models Artificial Neural Networks

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong K-Nearest Neighbours

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
How kNN Works
Given a new case, find its k "nearest" neighbours, i.e., the k most similar points in the training data set, and assign the new case to the class to which most of these neighbours belong
A common "distance" measure between samples x and y combines the differences x_f − y_f, where f ranges over the features of the samples (e.g., Euclidean distance)
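A minimal kNN sketch, with Euclidean distance standing in for the distance measure and hypothetical training data:

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, new_case, k=8):
    """Find the k nearest training points and return their majority class."""
    dists = np.sqrt(((train_X - new_case) ** 2).sum(axis=1))   # distance over all features f
    nearest = np.argsort(dists)[:k]                            # indices of the k most similar points
    return Counter(train_y[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(3)
train_X = rng.normal(size=(50, 4))                             # hypothetical training samples
train_y = np.where(train_X[:, 0] > 0, "disease", "normal")     # hypothetical class labels
print(knn_predict(train_X, train_y, np.array([1.0, 0.0, 0.0, 0.0])))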

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Illustration of kNN (k = 8): the neighbourhood contains 5 points of one class and 3 of the other (image credit: Zaki)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Some Issues
Simple to implement, but need to compare the new case against all training cases ⇒ may be slow during prediction
No need to train, but need to design the distance measure properly ⇒ may need an expert for this
Can't explain the prediction outcome ⇒ can't provide a model of the data

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Example Use of kNN: Segmentation of White Matter Lesions in MRI
Anbeek et al., NeuroImage 21, 2004
Use kNN for automated segmentation of white matter lesions in cranial MR images
Rely on info from T1-weighted, inversion recovery, proton density-weighted, T2-weighted, & fluid attenuation inversion recovery scans

Example Use of kNN: Ovarian Cancer Diagnosis Based on SELDI Proteomic Data
Li et al., Bioinformatics 20, 2004
Use kNN to diagnose ovarian cancers using proteomic spectra
Data set is from Petricoin et al., Lancet 359, 2002
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Example Use of kNN: Prediction of Compound Signature Based on Gene Expression Profiles
Hamadeh et al., Toxicological Sciences 67, 2002
Store gene expression profiles corresponding to biological responses to exposures to known compounds (e.g., enzyme inducers, peroxisome proliferators) whose toxicological and pathological endpoints are well characterized
Use kNN to infer the effects of an unknown compound based on the gene expression profiles it induces

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Support Vector Machines

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Basic Idea (image credit: Zien)
(a) Linear separation not possible without errors
(b) Better separation by nonlinear surfaces in input space
(c) Nonlinear surface corresponds to a linear surface in feature space; map from input to feature space by a "kernel" function Φ
⇒ "linear learning machine" + kernel function as classifier

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Linear Learning Machines
The hyperplane separating the x's and o's points is given by (W·X) + b = 0, with (W·X) = Σj W[j]*X[j]
⇒ the decision function is llm(X) = sign((W·X) + b)

Linear Learning Machines
The solution is a linear combination of training points Xk with labels Yk: W[j] = Σk αk*Yk*Xk[j], with αk > 0 and Yk = ±1
⇒ llm(X) = sign(Σk αk*Yk*(Xk·X) + b)
The "data" appears only in the dot product!

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Kernel Function
llm(X) = sign(Σk αk*Yk*(Xk·X) + b)
svm(X) = sign(Σk αk*Yk*(Φ(Xk)·Φ(X)) + b)
⇒ svm(X) = sign(Σk αk*Yk*K(Xk,X) + b), where K(Xk,X) = (Φ(Xk)·Φ(X))

Kernel Function
svm(X) = sign(Σk αk*Yk*K(Xk,X) + b)
⇒ K(A,B) can be computed without computing Φ
In fact, it can be replaced with lots of more "powerful" kernels besides the dot product (A·B), e.g.:
K(A,B) = (A·B)^d
K(A,B) = exp(−||A − B||² / (2σ²)), …

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
How SVM Works
svm(X) = sign(Σk αk*Yk*K(Xk,X) + b)
Finding the αk is a quadratic programming problem:
max: Σk αk − 0.5 * Σk Σh αk*αh*Yk*Yh*K(Xk,Xh)
subject to: Σk αk*Yk = 0 and, for all k, C ≥ αk ≥ 0
To find b, estimate by averaging Yh − Σk αk*Yk*K(Xh,Xk) over all h with αh > 0
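Once the αk and b are known (normally from the quadratic program above), the decision function is a few lines of code. In the sketch below the support vectors, multipliers, and offset are made-up numbers used only to show the mechanics, with a Gaussian kernel assumed:

import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """K(A,B) = exp(-||A - B||^2 / (2*sigma^2))"""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def svm_decision(x, support_X, alpha, y, b, kernel=rbf_kernel):
    """svm(X) = sign( sum_k alpha_k * Y_k * K(X_k, X) + b )"""
    s = sum(a_k * y_k * kernel(x_k, x) for a_k, y_k, x_k in zip(alpha, y, support_X))
    return int(np.sign(s + b))

# Made-up support vectors, multipliers, and offset (normally the QP solver's output)
support_X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.5, 0.5]])
alpha     = np.array([0.7, 0.9, 0.2])
y         = np.array([+1, -1, +1])
b         = 0.1

print(svm_decision(np.array([0.8, 0.9]), support_X, alpha, y, b))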

Example Use of SVM: Prediction of Protein-Protein Interaction Sites From Sequences
Koike et al., Protein Engineering Design & Selection 17, 2004
Identification of protein-protein interaction sites is important for mutant design & prediction of protein-protein networks
Interaction sites were predicted here using SVM & profiles of sequentially/spatially neighbouring residues
Figure legend: green = TP, white = TN, yellow = FN, red = FP; A: human macrophage migration inhibitory factor; B & C: the binding proteins
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Example Use of SVM: Prediction of Gene Function From Gene Expression
Brown et al., PNAS 91, 2000
Use SVM to identify sets of genes with a common function based on their expression profiles
Use SVM to predict functional roles of uncharacterized yeast ORFs based on their expression profiles
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Example Use of SVM: Recognition of Protein Translation Initiation Sites
Zien et al., Bioinformatics 16, 2000
Use SVM to recognize protein translation initiation sites (TIS) from genomic sequences
Raw data set is the same as Liu & Wong, JBCB 1, 2003
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Bayesian Approach

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Bayes Theorem
P(h|d) = P(d|h) * P(h) / P(d)
P(h) = prior prob that hypothesis h holds
P(d|h) = prob of observing data d given h holds
P(h|d) = posterior prob that h holds given observed data d

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Bayesian Approach
Let H be all possible classes. Given a test instance with feature vector {f1 = v1, …, fn = vn}, the most probable classification is given by
argmax_{hj ∈ H} P(hj | f1 = v1, …, fn = vn)
Using Bayes Theorem, this rewrites to
argmax_{hj ∈ H} P(f1 = v1, …, fn = vn | hj) * P(hj) / P(f1 = v1, …, fn = vn)
Since the denominator is independent of hj, this simplifies to
argmax_{hj ∈ H} P(f1 = v1, …, fn = vn | hj) * P(hj)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Naïve Bayes
But estimating P(f1=v1, …, fn=vn | hj) accurately may not be feasible unless the training data set is sufficiently large
"Solved" by assuming f1, …, fn are independent
Then the classification becomes argmax_{hj ∈ H} P(hj) * Πi P(fi=vi | hj), where P(hj) and P(fi=vi | hj) can often be estimated reliably from a typical training data set
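A sketch of this computation for categorical features, with counting plus add-one (Laplace) smoothing; the tiny training set is hypothetical:

from collections import Counter, defaultdict

def train_nb(rows, target):
    """Estimate P(h_j) and P(f_i = v_i | h_j) by counting, for categorical features."""
    class_counts = Counter(r[target] for r in rows)
    feature_values = defaultdict(set)               # feature -> set of observed values
    cond_counts = defaultdict(Counter)              # (feature, class) -> Counter of values
    for r in rows:
        for f, v in r.items():
            if f != target:
                feature_values[f].add(v)
                cond_counts[(f, r[target])][v] += 1
    return class_counts, cond_counts, feature_values, len(rows)

def predict_nb(model, case):
    class_counts, cond_counts, feature_values, n = model
    best, best_score = None, -1.0
    for h, ch in class_counts.items():
        score = ch / n                              # P(h_j)
        for f, v in case.items():                   # times each P(f_i = v_i | h_j), smoothed
            score *= (cond_counts[(f, h)][v] + 1) / (ch + len(feature_values[f]))
        if score > best_score:
            best, best_score = h, score
    return best

rows = [                                            # hypothetical toy training set
    {"outlook": "sunny", "windy": "true",  "class": "Play"},
    {"outlook": "sunny", "windy": "false", "class": "Don't"},
    {"outlook": "rain",  "windy": "true",  "class": "Don't"},
    {"outlook": "rain",  "windy": "false", "class": "Play"},
    {"outlook": "overcast", "windy": "false", "class": "Play"},
]
model = train_nb(rows, target="class")
print(predict_nb(model, {"outlook": "sunny", "windy": "true"}))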

Example Use of Bayesian: Design of Screens for Macromolecular Crystallization
Hennessy et al., Acta Cryst D56, 2000
Crystallization of proteins requires a search of experimental settings to find the right conditions for diffraction-quality crystals
BMCD is a database of known crystallization conditions
Use Bayes to determine the probability of success of a set of experimental conditions based on BMCD
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Hidden Markov Models

How HMM Works
HMM is a stochastic generative model for sequences
Defined by: a finite set of states S, a finite alphabet A, a transition prob matrix T, an emission prob matrix E
Move from state to state (s1, s2, …, sk) according to T while emitting symbols (a1, a2, …) according to E
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

How HMM Works
In an nth order HMM, T & E depend on all n previous states
E.g., for a 1st order HMM, given emissions X = x1, x2, … & states S = s1, s2, …, the prob of this sequence is Prob(X, S) = Πi T(si−1, si) * E(si, xi)
If the sequence of emissions X is given, use the Viterbi algorithm to get the sequence of states S such that S = argmax_S Prob(X, S)
If the emissions are unknown, use the Baum-Welch algorithm

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
Example: Dishonest Casino
The casino has two dice:
Fair die: P(i) = 1/6, i = 1..6
Loaded die: P(i) = 1/10, i = 1..5; P(6) = 1/2
The casino switches between the fair & loaded die with prob 1/2; initially, the die is always fair
Game: you bet $1, you roll, the casino rolls, and the highest number wins $2
Question: suppose we played 2 games, and the sequence of rolls was 1, 6, 2, 6. Were we likely to be cheated?

“Visualization” of Dishonest Casino Copyright © 2004 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong 1, 6, 2, 6? We were probably cheated...
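A Viterbi sketch for the casino HMM, under the simplifying assumption that the whole roll sequence 1, 6, 2, 6 is treated as emissions of the casino's die and that the first roll uses the fair die; the most probable state path flags which rolls look loaded:

import numpy as np

states = ["fair", "loaded"]
start  = np.array([1.0, 0.0])                    # the die is always fair initially
T = np.array([[0.5, 0.5],                        # transition probs: switch with prob 1/2
              [0.5, 0.5]])
E = np.array([[1/6] * 6,                         # fair die:   P(i) = 1/6
              [1/10] * 5 + [1/2]])               # loaded die: P(1..5) = 1/10, P(6) = 1/2

def viterbi(rolls):
    obs = [r - 1 for r in rolls]                 # roll value -> emission index
    v = start * E[:, obs[0]]                     # best path prob ending in each state
    back = []
    for o in obs[1:]:
        scores = v[:, None] * T                  # scores[i, j]: come from state i, move to j
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) * E[:, o]
    path = [int(v.argmax())]
    for b in reversed(back):                     # trace the best path backwards
        path.append(int(b[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi([1, 6, 2, 6]))                     # -> ['fair', 'loaded', 'fair', 'loaded']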

Example Use of HMM: Protein Families Modelling
Baldi et al., PNAS 91, 1994: HMM is used to model families of biological sequences, such as kinases, globins, & immunoglobulins
Bateman et al., NAR 32:D138-D141, 2004: HMM is used to model 6190 families of protein domains in Pfam
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Example Use of HMM: Gene Finding in Bacterial Genomes
Borodovsky et al., NAR 23, 1995
Investigated statistical features of 3 classes (wrt level of codon usage bias) of E. coli genes
An HMM for the nucleotide sequences of each class was developed
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Artificial Neural Networks

What are ANNs?
ANNs are highly connected networks of "neural computing elements" that have the ability to respond to input stimuli and learn to adapt to the environment
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Computing Element
Behaves as a monotone function y = f(net), where net is the cumulative input stimulus to the neuron
net is usually defined as the weighted sum of the inputs
f is usually a sigmoid
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

How ANN Works
Computing elements are connected into layers in a network
The network is used for classification as follows: the inputs xi are fed into the input layer; each computing element produces its corresponding output, which is fed as input to the next layer, and so on, until outputs are produced at the output layer
What makes an ANN work is how the weights on the links are learned, usually achieved using "back propagation"
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Back Propagation
vji = weight on the link between xi and the jth computing element in the 1st layer
wj = weight of the link between the jth computing element in the 1st layer and the computing element in the last layer
zj = output of the jth computing element in the 1st layer
Then zj = f(Σi vji * xi) and the network output is y = f(Σj wj * zj)
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Back Propagation
For a given sample, y may differ from the target output t by some amount
We need to propagate this error backwards by adjusting the weights in proportion to the error gradient
For mathematical convenience, define the squared error as E = ½ (t − y)²
To find an expression for the weight adjustment, we differentiate E wrt vji and wj to obtain the error gradients for these weights
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Applying the chain rule a few times and recalling the definitions of y, zj, E, and f, we derive the update rules for the weights wj and vji

Back Propagation
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
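A sketch of the resulting updates for a one-hidden-layer network with the sigmoid f, hidden outputs zj, and E = ½(t − y)² as defined above; the input, target, and layer sizes are made up for illustration:

import numpy as np

def f(net):                                       # sigmoid activation
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(4)
x = rng.normal(size=3)                            # one input sample (3 features)
t = 1.0                                           # its target output
V = rng.normal(size=(4, 3)) * 0.1                 # v_ji: input -> hidden weights (4 hidden units)
w = rng.normal(size=4) * 0.1                      # w_j : hidden -> output weights
lr = 0.5                                          # learning rate

for _ in range(1000):
    # forward pass
    z = f(V @ x)                                  # z_j = f(sum_i v_ji * x_i)
    y = f(w @ z)                                  # y   = f(sum_j w_j  * z_j)
    # backward pass: propagate the error (t - y) through the net
    delta_out = (t - y) * y * (1 - y)             # dE/dnet at the output (f'(net) = y(1-y))
    delta_hid = delta_out * w * z * (1 - z)       # error attributed to each hidden unit
    w += lr * delta_out * z                       # adjust w_j  along the error gradient
    V += lr * np.outer(delta_hid, x)              # adjust v_ji along the error gradient

print(y)                                          # approaches the target t = 1.0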

Example Use of ANN: T-Cell Epitopes Prediction
Honeyman et al., Nature Biotechnology 16, 1998
Use ANN to predict candidate T-cell epitopes
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Any Question?

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
References
L. Breiman, et al. Classification and Regression Trees. Wadsworth and Brooks, 1984.
L. Breiman, "Bagging predictors". Machine Learning, 24, 1996.
L. Breiman, "Random forests". Machine Learning, 45:5-32, 2001.
J. R. Quinlan, "Induction of decision trees". Machine Learning, 1, 1986.
J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
C. Gini, "Measurement of inequality of incomes". The Economic Journal, 31, 1921.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong
References
Y. Freund, et al. "Experiments with a new boosting algorithm". ICML 1996.
T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization". Machine Learning, 40, 2000.
J. Li, et al. "Ensembles of cascading trees". ICDM 2003.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong 2-4pm every Friday at LT33