ELVIRA II. San Sebastián Meeting, May 2004. Andrés Masegosa.



Naive-Bayes classifier for gene expression data
Classifiers with continuous variables. Feature selection. Wrapper search method.

Index
1. The Naive-Bayes classifier
   1.1 Hypotheses for the creation of the NB classifier
   1.2 Description of a NB classifier
2. Previous work
   2.1 The lymphoma data set
   2.2 Wright et al. paper
3. Selective Gaussian Naive-Bayes
   3.1 ANOVA phase
   3.2 Search phase
   3.3 Stop condition
4. Implementation
   4.1 Implemented classes
   4.2 Methods
   4.3 Results
5. Conclusions
6. Future work

Hypotheses for the creation of the Naive-Bayes classifier
Hypothesis 1: the attribute variables are independent given the class variable.
Hypothesis 2: given the class variable, the attribute variables are distributed either as a Normal distribution (Andrés) or as linear exponential mixtures, i.e. mixtures of truncated exponentials, MTE (Javi).
Normal distribution given the class. Advantages: the model is simple. Drawbacks: it imposes constraints and cannot have discrete children.
MTE given the class. Advantages: it has no constraints, it can have discrete children, exact propagation is possible, and it can adjust better to general distributions. Drawbacks: the model is more complex.

Naive-Bayes classifier
There are three basic steps in the construction of the Naive-Bayes classifier:
Step 1, structural learning: the classifier structure is learned. The Naive-Bayes model only has arcs from the class variable to the predictive variables; the predictive variables are assumed independent given the class.
Step 2, parametric learning: the distribution of each predictive variable given the class is estimated.
Step 3, propagation: the class variable is predicted from the predictive variables. In our case, the known predictive variables are observed in the Bayesian network and a propagation method (variable elimination) is run to obtain the a posteriori distribution of the class variable. The class with the greatest probability is the predicted value.
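These three steps can be made concrete with a small, self-contained sketch (this is not the Elvira code; the class and field names below are ours). Parametric learning is represented by the per-class mean and standard deviation of each predictive variable, and, because all predictive variables are observed, propagation reduces to multiplying the class prior by the Gaussian densities of the observed values and normalising, which gives the same posterior as variable elimination on the naive structure.

// Minimal Gaussian Naive-Bayes sketch (illustrative, not the Elvira implementation).
public class GaussianNBSketch {

    double[] prior;        // prior[c] = P(class = c)
    double[][] mean, sd;   // mean[c][j], sd[c][j]: Normal parameters of attribute j given class c

    GaussianNBSketch(double[] prior, double[][] mean, double[][] sd) {
        this.prior = prior; this.mean = mean; this.sd = sd;
    }

    static double normalDensity(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
    }

    /** A posteriori distribution of the class given the observed attribute values. */
    double[] posterior(double[] x) {
        double[] p = new double[prior.length];
        double total = 0.0;
        for (int c = 0; c < prior.length; c++) {
            p[c] = prior[c];
            for (int j = 0; j < x.length; j++)            // attributes independent given the class
                p[c] *= normalDensity(x[j], mean[c][j], sd[c][j]);
            total += p[c];
        }
        for (int c = 0; c < p.length; c++) p[c] /= total; // normalise
        return p;
    }

    /** Predicted value = class with the greatest a posteriori probability. */
    int predict(double[] x) {
        double[] p = posterior(x);
        int best = 0;
        for (int c = 1; c < p.length; c++) if (p[c] > p[best]) best = c;
        return best;
    }
}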

Naive-Bayes with MTE
An MTE density is learned for each predictive variable given the class. An example: estimating a Normal distribution with an MTE. NB classifier learned from the Iris data base.
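As a hedged illustration of what such an MTE density looks like (the slide's fitted parameters for the Normal approximation and the Iris model are not reproduced here), the fragment below only evaluates a generic mixture of truncated exponentials: on each interval of the domain, the density is a constant plus a sum of exponential terms. Any numbers supplied to it would be placeholders, not a fitted approximation.

// Sketch of an MTE (mixture of truncated exponentials) density evaluator (illustrative).
// On each interval [low, high) the density has the form  a0 + sum_i a_i * exp(b_i * x).
public class MtePiece {
    double low, high;   // interval where this piece is defined
    double a0;          // independent term
    double[] a, b;      // coefficients and exponents of the exponential terms

    MtePiece(double low, double high, double a0, double[] a, double[] b) {
        this.low = low; this.high = high; this.a0 = a0; this.a = a; this.b = b;
    }

    double value(double x) {
        if (x < low || x >= high) return 0.0;
        double v = a0;
        for (int i = 0; i < a.length; i++) v += a[i] * Math.exp(b[i] * x);
        return v;
    }

    /** Density of an MTE defined piecewise over a partition of the domain. */
    static double density(MtePiece[] pieces, double x) {
        double v = 0.0;
        for (MtePiece p : pieces) v += p.value(x);   // at most one piece covers x
        return v;
    }
}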

Lymphoma data set I
Alizadeh et al. (2000). Hierarchical clustering of gene expression data. Depicted are the 1.8 million measurements of gene expression from 128 microarray analyses of 96 samples of normal and malignant lymphocytes. The dendrogram at the left lists the samples studied and provides a measure of the relatedness of gene expression in each sample. The dendrogram is colour coded according to the category of mRNA sample studied (see upper right key). Each row represents a separate cDNA clone on the microarray and each column a separate mRNA sample. The results presented represent the ratio of hybridization of fluorescent cDNA probes prepared from each experimental mRNA sample to a reference mRNA sample. These ratios are a measure of relative gene expression in each experimental sample and were depicted according to the colour scale shown at the bottom. As indicated, the scale extends from fluorescence ratios of 0.25 to 4 (-2 to +2 in log base 2 units). Grey indicates missing or excluded data.

Lymphoma data set II
Alizadeh et al. (2000): a partition of the diffuse large B-cell lymphoma cases into two clusters, based on gene expression profiling, is proposed: Germinal Centre B-like (high survival rate) and Activated B-like (low survival rate).
Rosenwald et al. (2002): a new partition of the diffuse large B-cell lymphoma cases into three clusters is proposed (274 patients): Germinal Centre B-like (GCB), high survival rate (134 patients); Activated B-cell (ABC), low survival rate (83 patients); Type 3 (Type III), medium survival rate (57 patients).
Wright et al. (2003): a Bayesian predictor is proposed that estimates the probability of membership in one of two cancer subgroups (GCB or ABC), using the data set of Rosenwald et al.

Wright et al. (2003)
Gene expression data: … genes; 134 cases of GCB, 83 cases of ABC and 57 cases of Type III.
DLBCL subgroup predictor, based on a linear predictor score: LPS(X) = sum_j a_j X_j, where X = (X_1, X_2, ..., X_n) and a_j is the t statistic of gene j.
Only the K genes with the most significant t statistics were used to form the LPS; the optimal K was determined by a leave-one-out method. A model including 27 genes had the lowest average error rate.
The probability of belonging to the GCB subgroup is estimated as N(LPS(X), mu_GCB, sigma_GCB) / [N(LPS(X), mu_GCB, sigma_GCB) + N(LPS(X), mu_ABC, sigma_ABC)], where N(x, mu, sigma) represents a Normal density function with mean mu and standard deviation sigma.
Training set: 67 GCB + 42 ABC. Validation set: 67 GCB + 41 ABC + 57 Type III.
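A sketch of the predictor as described on this slide, under the assumption (consistent with Wright et al.) that the LPS weights are the genes' t statistics, that the two Normal densities are fitted to the LPS values of the GCB and ABC training samples, and that the subgroup priors are equal; all names and the cutoff handling are illustrative.

// Sketch of the linear-predictor-score subgroup predictor (illustrative, not the original code).
public class LpsPredictorSketch {
    double[] weight;                        // a_j: weight (t statistic) of each selected gene
    double muGCB, sdGCB, muABC, sdABC;      // Normal fits of the LPS within each training subgroup

    LpsPredictorSketch(double[] weight, double muGCB, double sdGCB, double muABC, double sdABC) {
        this.weight = weight;
        this.muGCB = muGCB; this.sdGCB = sdGCB; this.muABC = muABC; this.sdABC = sdABC;
    }

    double lps(double[] x) {                // LPS(X) = sum_j a_j * X_j over the selected genes
        double s = 0.0;
        for (int j = 0; j < weight.length; j++) s += weight[j] * x[j];
        return s;
    }

    static double normalDensity(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
    }

    /** Subgroup call with a certainty cutoff; below the cutoff the sample is 'unclassified'. */
    String classify(double[] x, double cutoff) {
        double s = lps(x);
        double dGCB = normalDensity(s, muGCB, sdGCB);
        double dABC = normalDensity(s, muABC, sdABC);
        double pGCB = dGCB / (dGCB + dABC);  // P(GCB | LPS(X)) assuming equal priors
        if (pGCB >= cutoff) return "GCB";
        if (1.0 - pGCB >= cutoff) return "ABC";
        return "unclassified";
    }
}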

Wright et al. (2003). This predictor chooses a cutoff of 90% certainty. The samples for which there was <90% probability of being in either subgroup are termed 'unclassified'. Results: (table shown on the slide).

Index
1. The Naive-Bayes classifier
   1.1 Hypotheses for the creation of the NB classifier
   1.2 Description of a NB classifier
2. Previous work
   2.1 The lymphoma data set
   2.2 Wright et al. paper
3. Selective Gaussian Naive-Bayes
   3.1 ANOVA phase
   3.2 Search phase
   3.3 Stop condition
   3.4 The M best explanations
4. Implementation
   4.1 Implemented classes
   4.2 Methods
   4.3 Results
5. Conclusions
6. Future work

Selective Gaussian Naive-Bayes
It is a modified wrapper method to construct an optimal Naive-Bayes classifier with a minimum number of predictive genes. The main steps of the algorithm are:
Step 1: a first feature selection procedure that selects the most significant, least correlated variables (ANOVA phase).
Step 2: application of a wrapper search method to select the final subset of variables that minimizes the training error rate (search phase).
Step 3: learning of the distribution of each selected gene (parametric phase).

ANOVA phase: analysis of variance
A dispersion measurement is established for each gene, Anova(X) in [0, +inf). The gene set is pre-ordered from higher to lower ANOVA measurement. The gene set is then partitioned into K gene subsets such that each pair of genes in a subset has a correlation coefficient greater than a given bound U. For each gene subset, the variable with the greatest ANOVA coefficient is selected, so only K variables of the initial training set are kept.
(Figure: space of the genes, with clusters of correlated genes and one selected gene per cluster.)
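A sketch of this filtering step, assuming a one-way ANOVA F statistic as the per-gene dispersion measure and the absolute Pearson correlation for the grouping. Instead of building the K subsets explicitly, the greedy variant below keeps a gene only if it is not correlated above the bound U with any already-kept, higher-scoring gene, which plays the same role of keeping one representative per cluster of correlated genes. All names are illustrative.

import java.util.ArrayList;
import java.util.List;

// Sketch of the ANOVA filtering phase (illustrative greedy variant, not the Elvira code).
public class AnovaFilter {

    /** One-way ANOVA F statistic of one gene: between-class over within-class variance. */
    static double anovaF(double[][] valuesByClass) {
        int k = valuesByClass.length, n = 0;
        double grand = 0.0;
        for (double[] g : valuesByClass) { for (double v : g) { grand += v; n++; } }
        grand /= n;
        double between = 0.0, within = 0.0;
        for (double[] g : valuesByClass) {
            double m = 0.0;
            for (double v : g) m += v;
            m /= g.length;
            between += g.length * (m - grand) * (m - grand);
            for (double v : g) within += (v - m) * (v - m);
        }
        return (between / (k - 1)) / (within / (n - k));
    }

    /** Pearson correlation between the expression profiles of two genes. */
    static double correlation(double[] a, double[] b) {
        double ma = 0, mb = 0;
        for (int i = 0; i < a.length; i++) { ma += a[i]; mb += b[i]; }
        ma /= a.length; mb /= b.length;
        double sab = 0, sa = 0, sb = 0;
        for (int i = 0; i < a.length; i++) {
            sab += (a[i] - ma) * (b[i] - mb);
            sa  += (a[i] - ma) * (a[i] - ma);
            sb  += (b[i] - mb) * (b[i] - mb);
        }
        return sab / Math.sqrt(sa * sb);
    }

    /**
     * Genes are visited from highest to lowest ANOVA score (geneOrder); a gene is kept
     * only if its correlation with every already-kept gene does not exceed the bound U,
     * so each kept gene represents one cluster of mutually correlated genes.
     */
    static List<Integer> select(double[][] expression, int[] geneOrder, double U) {
        List<Integer> kept = new ArrayList<>();
        for (int g : geneOrder) {
            boolean redundant = false;
            for (int s : kept)
                if (Math.abs(correlation(expression[g], expression[s])) > U) { redundant = true; break; }
            if (!redundant) kept.add(g);
        }
        return kept;
    }
}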

Search phase: a wrapper method
Preliminaries:
Let A(m, n) be the original training set, with m distinct cases and n features per case.
Let B(m, k) be the projection of A onto the K genes selected in the previous phase: B(m, k) = pi(A(m, n), K).
Let KFC(D) be the error rate obtained by evaluating a simple Naive-Bayes classifier over the data set D with a T-fold cross-validation procedure.
Algorithm:
Let P = {} and Q = {X_1, ..., X_k}.
While not stopCondition(#(P), r, Dr):
  i = argmin { KFC(pi(B(m, k), P + {X_j})) : X_j in Q }
  r = KFC(pi(B(m, k), P))
  P = P + {X_i}; Q = Q \ {X_i}
  Dr = KFC(pi(B(m, k), P)) - r
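A code sketch of this greedy loop: starting from an empty set P, each iteration moves into P the candidate gene from Q whose addition gives the lowest T-fold cross-validated error. The KFC estimator and the stop condition are left as interfaces, since their concrete forms appear on the following slides; names are illustrative.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the wrapper forward-selection phase (illustrative, not the Elvira code).
public class WrapperSearch {

    /** Placeholder for KFC(pi(B, P)): T-fold cross-validated error of a NB restricted to P. */
    interface ErrorEstimator { double error(Set<Integer> selectedGenes); }

    /** Placeholder for stopCondition(#(P), r, Dr). */
    interface StopCondition { boolean stop(int numSelected, double errorRate, double deltaError); }

    static Set<Integer> search(List<Integer> candidates, ErrorEstimator kfc, StopCondition stop) {
        Set<Integer> P = new HashSet<>();               // selected genes
        List<Integer> Q = new ArrayList<>(candidates);  // remaining candidates
        double r = kfc.error(P);                        // current error rate
        double deltaR = 0.0;
        while (!stop.stop(P.size(), r, deltaR) && !Q.isEmpty()) {
            int best = -1;
            double bestErr = Double.POSITIVE_INFINITY;
            for (int gene : Q) {                        // try adding each remaining gene
                P.add(gene);
                double e = kfc.error(P);
                P.remove(gene);
                if (e < bestErr) { bestErr = e; best = gene; }
            }
            P.add(best);                                // keep the best candidate
            Q.remove(Integer.valueOf(best));
            deltaR = bestErr - r;                       // increment of the error rate
            r = bestErr;
        }
        return P;
    }
}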

Search phase: stop condition
The parameters are: #(P), the number of elements of P; r, the current error rate; Dr, the increment of the error rate.
General stop conditions: Dr >= 0 or Dr > 0. Problems: stopping when Dr >= 0 stops too early (only 3-5 genes are selected); stopping only when Dr > 0 leads to overfitting.
Implemented stop condition: a guard that avoids overfitting (a threshold on r that depends on #(P)) combined with a guard that avoids early stopping (a threshold on Dr that depends on #(P)).
(Plots on the slide: r against #(P) and Dr against #(P), with the True/False regions of each guard.)

The M best explanations: abductive inference
Due to the high dimensionality of gene expression data sets, cross-validation methods are usually used to estimate the training error rate of a classification model. If a T-fold cross-validation procedure is used, the wrapper search method returns T final gene subsets. The question is: how do we select a unique gene subset to apply to the validation data set?
Method:
Let C_i, i in {1, ..., T}, be the subset returned by the wrapper method in fold i of the cross-validation procedure. Let C be the union of the C_i and N = #(C).
Let Z be a data base of T cases, where case j is the tuple {a_1, ..., a_N} with a_i = 1 if X_i belongs to C_j, and a_i = 0 otherwise.
Let B be a Bayesian network learned from Z with the K2 learning method. An abductive inference method returns the M most probable explanations of the BN, which is equivalent to obtaining the M most probable gene subsets.
The final subset is the one with the minimum leave-one-out training error rate.
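A small sketch of how the binary data base Z is assembled from the T fold-wise subsets; the K2 structural learning and the abductive inference over the resulting network are not reproduced here, only the preparation of Z. Names are illustrative.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Builds the binary data base Z from the T subsets returned by the wrapper (illustrative).
public class ExplanationDatabase {

    static int[][] buildZ(List<Set<Integer>> foldSubsets) {          // C_1, ..., C_T
        Set<Integer> union = new LinkedHashSet<>();                  // C = union of the C_i
        for (Set<Integer> c : foldSubsets) union.addAll(c);
        List<Integer> genes = new ArrayList<>(union);                // N = #(C) columns
        int[][] Z = new int[foldSubsets.size()][genes.size()];
        for (int j = 0; j < foldSubsets.size(); j++)                 // case j: a_i = 1 iff gene i in C_j
            for (int i = 0; i < genes.size(); i++)
                Z[j][i] = foldSubsets.get(j).contains(genes.get(i)) ? 1 : 0;
        return Z;   // Z is then fed to a K2-learned Bayesian network; abductive inference over
                    // that network yields the M most probable explanations (gene subsets).
    }
}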

Implementation I
Included in the 'learning.classification.supervised.mixed' package. This package contains the following classes:
MixedClassifier class: an abstract class, designed to be the parent of all the mixed classifiers to be implemented. It inherits from the DiscreteClassifier class.
Mixed_Naive_Bayes class: a public class to learn a mixed Naive-Bayes classification model. It inherits from the MixedClassifier class. This class implements the structural learning and the selective structural learning methods. The latter contains the implementation of our wrapper search method and needs the following methods (to be implemented by subclasses):
double evaluationKFC(DataBaseCases cases, int numclass)
double evaluationLOO(DataBaseCases cases, int numclass)
Bnet getNewClassifier(DataBaseCases cases)

Implementation II
Gaussian_Naive_Bayes class: a public class that implements the parametric learning of a mixed NB classifier. It is assumed that the predictive variables are distributed as a Normal distribution given the class. It inherits from the Mixed_Naive_Bayes class.
Selective_GNB class: a public class that implements a Gaussian naive Bayes with feature selection. This class implements the Selective_Classifier interface and overwrites the following methods: structuralLearning (which now calls the selectiveStructuralLearning method), evaluationKFC, evaluationLOO and getNewClassifier.
Selective_Classifier interface: a public interface to define variable selection in a classifier method.
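As a purely structural illustration of the class layout described on these two slides (stand-in types are declared so the fragment compiles on its own; the real Elvira signatures, modifiers and package layout may differ):

// Outline of the class relationships from the slides. The stand-in types below
// (DataBaseCases, Bnet, DiscreteClassifier) only mimic the real Elvira classes so that
// the sketch is self-contained; this is not the actual Elvira API.
class DataBaseCases { }
class Bnet { }
abstract class DiscreteClassifier { }

abstract class MixedClassifier extends DiscreteClassifier {
    // Parent of all mixed (discrete + continuous) classifiers.
}

class Mixed_Naive_Bayes extends MixedClassifier {
    // Structural and selective structural learning live here; the selective (wrapper)
    // search is written in terms of the three hooks below, redefined by subclasses.
    double evaluationKFC(DataBaseCases cases, int numclass) { return 0.0; } // k-fold CV error
    double evaluationLOO(DataBaseCases cases, int numclass) { return 0.0; } // leave-one-out error
    Bnet getNewClassifier(DataBaseCases cases) { return new Bnet(); }       // fresh NB model
}

class Gaussian_Naive_Bayes extends Mixed_Naive_Bayes {
    // Parametric learning: each predictive variable is Normal given the class.
}

interface Selective_Classifier {
    // Contract for classifiers that perform variable selection.
}

class Selective_GNB extends Gaussian_Naive_Bayes implements Selective_Classifier {
    // structuralLearning delegates to the selective (wrapper) search, and the three
    // evaluation hooks above are overridden with Gaussian-NB-specific versions.
}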

Methods
10 training and validation sets were randomly generated. The three phases were applied to each training set. The parameters were:
10-fold cross-validation. M was fixed to 20. U was fixed to ….
The predictor chose a cutoff of 80% certainty: the samples for which there was <80% probability of being in either subgroup were termed 'unclassified'.
The stop condition was implemented as:
Avoid overfitting: r > n*u2/n2, with u2 = 0.03 and n2 = 20.
Avoid early stopping: incRate (Dr) < (n1-n)*u1/n1, with u1 = 0.1 and n1 = 10.
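A hedged sketch of this stop condition with the parameter values listed above. The slides give the two guards ('avoid overfitting' on r and 'avoid early stopping' on Dr) but not their exact boolean combination, so reading them as 'keep searching only while both guards hold' is an assumption of this sketch.

// Sketch of the stop condition used by the wrapper search (parameter values from the slides;
// the boolean combination of the two guards is an assumption, see the note above).
public class StopConditionSketch {
    static final double U1 = 0.1;   // early-stopping tolerance
    static final int    N1 = 10;
    static final double U2 = 0.03;  // overfitting tolerance
    static final int    N2 = 20;

    /** 'Avoid overfitting': keep searching while the error rate r is still above n*u2/n2. */
    static boolean errorStillHigh(int n, double r) {
        return r > n * U2 / N2;
    }

    /** 'Avoid early stopping': tolerate an error increment below (n1-n)*u1/n1. */
    static boolean incrementTolerable(int n, double deltaR) {
        return deltaR < (N1 - n) * U1 / N1;
    }

    /** The search stops when either guard no longer holds. */
    static boolean stop(int n, double r, double deltaR) {
        return !(errorStillHigh(n, r) && incrementTolerable(n, deltaR));
    }
}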

Results I
ANOVA phase (confidence intervals at 95%):
Size (gene number): [74.3, 83.1]
Train accuracy rate (%): [96.8, 98.6]
Test accuracy rate (%): [92.8, 95.4]
Test -log likelihood: [41.6, 72.3]
Type III test accuracy rate (%): [17.75, 18.66]
Type III test -log likelihood: 'Infinity'
(Tables on the slide: model prediction vs. DLBCL subgroup, for the training set and the validation set.)

Results II
ANOVA phase + search phase (confidence intervals at 95%):
Size (gene number): [6.17, 7.82]
Train accuracy rate (%): [95.2, 98.0]
Test accuracy rate (%): [88.83, 91.9]
Test -log likelihood: [26.72, 40.46]
Type III test accuracy rate (%): [20.0, 26.2]
Type III test -log likelihood: [214.0, 264.6]
(Tables on the slide: model prediction vs. DLBCL subgroup, for the training set and the validation set.)

Conclusions
It is a simple classification method that provides good results. Its main problem is that, due to the search process, the training error rate goes down quickly and the mean number of selected genes is too low (around eight genes), although this tendency is corrected by the ANOVA phase, the k-fold cross-validation and the flexible stop condition.
Getting the M best explanations is a very good technique to fuse the several groups of genes extracted by a feature selection method.

Future work
Develop more sophisticated models: include replacement variables to manage missing data; consider multidimensional Gaussian distributions; improve the MTE Gaussian Naive-Bayes model.
Apply this model to other data sets, such as breast cancer, colon cancer, etc.
Compare with other models with discrete variables.