
A Kolmogorov-Smirnov Correlation-Based Filter for Microarray Data
Jacek Biesiada, Division of Computer Methods, Dept. of Electrotechnology, The Silesian University of Technology, Katowice, Poland
Włodzisław Duch, Dept. of Informatics, Nicolaus Copernicus University, Google: Duch
ICONIP 2007

Motivation
Attention is a basic cognitive skill: without focus on relevant information cognition would not be possible.
In natural perception (vision, auditory scenes, tactile signals) a large number of features may be dynamically selected depending on the task.
In large feature spaces (genes, proteins, chemistry, etc.) different features are relevant.
Filters leave a large number of potentially relevant features; redundancy should be removed!
Fast filters with removal of redundancy are needed!
Microarrays are a popular testing ground, although not reliable due to the small number of samples.
Goal: fast filter + redundancy removal + tests on microarray data to identify problems.

Microarray matrices: genes in rows, samples in columns, DNA/RNA type.

Selection of information
Find relevant information:
– discard attributes that do not contain information,
– use weights to express the relative importance of attributes,
– create new, more informative attributes,
– reduce dimensionality by aggregating information.
Ranking: treat each feature as independent.
Selection: search for subsets, remove redundant features.
Filters: universal, model-independent criteria.
Wrappers: criteria specific to the data model are used.
Frapper: filter + wrapper in the final stage.
Redfilapper: redundancy removal + filter + wrapper.
Goal: create a fast redfilapper.

Filters & Wrappers Filter approach for data D : define your problem C, for example assignment of class labels; define an index of relevance for each feature J i =J(X i )=J(X i |D,C) calculate relevance indices for all features and order J i1  J i2 .. J id remove all features with relevance below threshold J(X i ) < t R Wrapper approach: select predictor P and performance measure J(D|X)=P(Data|X). define search scheme: forward, backward or mixed selection. evaluate starting subset of features X s, ex. single best or all features add/remove feature X i, accept new set X s  {X s +X i } if P(Data|X s +X i )>P(Data|X s )

Information gain
Information gained by considering the joint probability distribution p(C, f) is the difference between the sum of the individual entropies and the joint entropy:
IG(C, X_j) = I(C) + I(X_j) − I(C, X_j),
where I(·) denotes entropy. A feature is more important if its information gain is larger.
Modifications of the information gain, used as criteria in some decision trees, include:
IGR(C, X_j) = IG(C, X_j)/I(X_j), the gain ratio;
IGn(C, X_j) = IG(C, X_j)/I(C), an asymmetric dependency coefficient;
D_M(C, X_j) = 1 − IG(C, X_j)/I(C, X_j), the normalized Mantaras distance.
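A small sketch of these entropy-based indices for discrete features (natural logarithms; the function names are illustrative, not from the paper):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy I(.) of a discrete variable."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def information_gain(x, c):
    """IG(C, X) = I(C) + I(X) - I(C, X) for a discrete feature x and labels c."""
    return entropy(c) + entropy(x) - entropy(list(zip(x, c)))

def gain_ratio(x, c):
    """IGR(C, X) = IG(C, X) / I(X)."""
    return information_gain(x, c) / entropy(x)

def mantaras_distance(x, c):
    """D_M(C, X) = 1 - IG(C, X) / I(C, X) (normalized Mantaras distance)."""
    return 1.0 - information_gain(x, c) / entropy(list(zip(x, c)))

if __name__ == "__main__":
    x = [0, 0, 1, 1, 2, 2, 2, 0]
    c = [0, 0, 1, 1, 1, 1, 0, 0]
    print(information_gain(x, c), gain_ratio(x, c), mantaras_distance(x, c))
```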

Information indices
Information gained by considering attribute X_j and the classes C together is also known as 'mutual information', equal to the Kullback-Leibler divergence between the joint and the product probability distributions:
MI(C, X_j) = Σ_{c,x} p(c, x) log [ p(c, x) / (p(c) p(x)) ].
The entropy distance measure is a sum of conditional informations:
D(C, X_j) = I(C|X_j) + I(X_j|C).
The symmetrical uncertainty coefficient is obtained from the entropy distance:
SU(C, X_j) = 1 − D(C, X_j)/(I(C) + I(X_j)) = 2·MI(C, X_j)/(I(C) + I(X_j)).
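A matching sketch of the symmetrical uncertainty coefficient for discrete variables; this is the SU index used by FCBF below and, as "SUC", one of the ranking options in the conclusions (names are illustrative):

```python
import numpy as np
from collections import Counter

def entropy(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def symmetrical_uncertainty(x, c):
    """SU(C, X) = 2 * MI(C, X) / (I(C) + I(X)); 0 = independent, 1 = identical."""
    mi = entropy(c) + entropy(x) - entropy(list(zip(x, c)))
    return 2.0 * mi / (entropy(c) + entropy(x))

if __name__ == "__main__":
    x = [-1, -1, 0, 0, 1, 1, 1, 0]
    c = [0, 0, 0, 1, 1, 1, 1, 0]
    print(symmetrical_uncertainty(x, c))
```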

Purity indices
Many information-based quantities may be used to evaluate attributes. Consistency or purity-based indices are one alternative. For the selection of a subset of attributes F = {X_i} the sum runs over all Cartesian products of attribute values, i.e. the cells r_k(F) of the multidimensional partition.
Advantages: this is the simplest approach, usable both for ranking and for selection. Hashing techniques are used to calculate the p(r_k(F)) probabilities.
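For illustration, a sketch of one simple purity measure (the majority-class fraction in each cell of the partition), with the cells r_k(F) found by hashing the tuples of attribute values; the exact index in the slide's original (missing) formula may differ:

```python
from collections import Counter, defaultdict

def purity(X_rows, c):
    """Majority-class purity of the partition induced by a feature subset.
    Each cell r_k(F) groups the samples sharing one value pattern."""
    cells = defaultdict(list)
    for row, label in zip(X_rows, c):
        cells[tuple(row)].append(label)          # hash the value pattern
    # Sum over cells of p(r_k) * max_C p(C | r_k) = majority count / n.
    majority = sum(Counter(labels).most_common(1)[0][1] for labels in cells.values())
    return majority / len(c)

if __name__ == "__main__":
    X = [[0, 1], [0, 1], [1, 0], [1, 0], [1, 1]]
    c = [0, 0, 1, 0, 1]
    print(purity(X, c))   # 4/5 = 0.8
```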

Correlation coefficient
Perhaps the simplest index is based on Pearson's correlation coefficient (CC), which compares the expectation value of the product of feature and class values with the product of their expectations:
CC(X_j, C) = [E(X_j C) − E(X_j)E(C)] / (σ(X_j) σ(C)).
For feature values that are linearly dependent on the class the correlation coefficient is +1 or −1; for complete independence of the class and the X_j distribution CC = 0.
How significant are small correlations? It depends on the number of samples n. The answer (see Numerical Recipes) is given by the probability that the correlation is significant,
P(CC, n) = erf(|CC|·√(n/2)).
For n = 1000 even a small CC = 0.02 gives P ≈ 0.5, but for n = 10 such a CC gives only P ≈ 0.05.
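A sketch of the correlation index and the significance estimate, assuming the erf form quoted above (it reproduces the n = 1000 and n = 10 numbers); names are illustrative:

```python
import numpy as np
from math import erf, sqrt

def cc_index(x, c):
    """Pearson correlation between feature values x and numeric class labels c."""
    return float(np.corrcoef(x, c)[0, 1])

def cc_significance(cc, n):
    """P = erf(|CC| * sqrt(n/2)): probability that a correlation of this size,
    estimated from n samples, is significantly different from zero."""
    return erf(abs(cc) * sqrt(n / 2.0))

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    x = rng.normal(size=100)
    c = (x + rng.normal(size=100) > 0).astype(float)
    cc = cc_index(x, c)
    print(cc, cc_significance(cc, n=100))
    print(cc_significance(0.02, 1000), cc_significance(0.02, 10))   # ~0.47, ~0.05
```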

F-score
Mutual information is based on the Kullback-Leibler distance; any distance measure between distributions may also be used, e.g. the Jeffreys-Matusita distance. The F-score is the ratio of the between-class variance of the feature's class means to the pooled within-class variance,
F(X_j) = [Σ_k n_k (μ_jk − μ_j)² / (K − 1)] / σ²_pooled,  σ²_pooled = Σ_k (n_k − 1) σ²_jk / (n − K),
where μ_jk, σ²_jk are the mean and variance of X_j in class k, n_k are the class sizes and K is the number of classes. For two classes F = t², the square of the t-score. Many other such (dis)similarity measures exist. Which is the best? In practice they are all similar, although the accuracy of calculating the indices is important; relevance indices should be insensitive to noise and unbiased in their treatment of features with many values.
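A brief sketch of per-gene F-score ranking (for two classes F = t²); scipy.stats.f_oneway computes the same between/within variance ratio. The data and array names are placeholders:

```python
import numpy as np
from scipy.stats import f_oneway

def f_scores(X, y):
    """One F-score per feature (column of X): between-class variance of the
    class means divided by the pooled within-class variance."""
    groups = [X[y == cls] for cls in np.unique(y)]
    return np.array([f_oneway(*[g[:, j] for g in groups]).statistic
                     for j in range(X.shape[1])])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(62, 200))            # e.g. 62 samples, 200 genes
    y = np.array([0] * 40 + [1] * 22)         # two classes, as in the Colon data
    X[y == 1, 0] += 2.0                       # make gene 0 informative
    ranking = np.argsort(f_scores(X, y))[::-1]
    print(ranking[:5])
```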

State-of-the-art methods
1. FCBF, Fast Correlation-Based Filter (Yu & Liu 2003). Compute feature-class indices J_i = SU(X_i, C) and feature-feature indices SU(X_i, X_j); rank the features so that J_i1 ≥ J_i2 ≥ ... ≥ J_im ≥ minimum threshold. Compare each feature X_i to all X_j lower in the ranking; if SU(X_i, X_j) ≥ SU(C, X_j), then X_j is redundant and is removed.
2. ConnSF, Consistency feature selection (Dash, Liu & Motoda 2000). The inconsistency count J_I(S) for a discrete-valued feature subset S is J_I(S) = n − n(C), where a pattern of values V_S of the subset S appears n times in the data, most often (n(C) times) with the label of class C. The total inconsistency count is the sum of the inconsistency counts over all distinct patterns of the feature subset S; consistency corresponds to the least inconsistency count.
3. CorrSF (Hall 1999), based on the correlation coefficient, with 5-step backtracking.
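A minimal sketch of the FCBF redundancy step described in point 1, built on the symmetrical uncertainty index; the relevance threshold is illustrative:

```python
import numpy as np
from collections import Counter

def entropy(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def su(x, y):
    """Symmetrical uncertainty SU = 2 * MI / (I(x) + I(y))."""
    mi = entropy(x) + entropy(y) - entropy(list(zip(x, y)))
    return 2.0 * mi / (entropy(x) + entropy(y))

def fcbf(X_discrete, c, threshold=0.01):
    """Rank features by SU(X_i, C); for each kept feature X_i remove every
    lower-ranked X_j with SU(X_i, X_j) >= SU(C, X_j)."""
    cols = [list(X_discrete[:, j]) for j in range(X_discrete.shape[1])]
    relevance = np.array([su(col, c) for col in cols])
    order = [int(j) for j in np.argsort(relevance)[::-1] if relevance[j] >= threshold]
    selected = []
    while order:
        i = order.pop(0)
        selected.append(i)
        order = [j for j in order if su(cols[i], cols[j]) < relevance[j]]
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.integers(0, 3, size=(60, 8))      # discretized expression levels
    c = list((X[:, 0] + X[:, 1] > 2).astype(int))
    X[:, 7] = X[:, 0]                         # feature 7 is redundant with 0
    print(fcbf(X, c))
```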

Kolmogorov-Smirnov test
Are the distributions of values of two different features roughly equal? If yes, one of them is redundant.
The discretization process creates k clusters (vectors from roughly the same class), each typically covering a similar range of values.
A sufficiently large number of independent observations, n_1, n_2 > 40, is taken from the two distributions, measuring the frequencies of the different classes. From this frequency table the empirical cumulative distribution functions F_1i and F_2i are constructed.
The K-S statistic λ is proportional to the largest absolute difference |F_1i − F_2i|:
λ = √(n_1 n_2 / (n_1 + n_2)) · max_i |F_1i − F_2i|,
and if λ < λ_α the two distributions are considered equal (equality cannot be rejected at the α confidence level).
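In practice the two-sample test is available directly, e.g. scipy.stats.ks_2samp; a short sketch of using it to flag a feature as redundant when its value distribution is statistically indistinguishable from another feature's (the α value is illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_redundant(f1, f2, alpha=0.05):
    """True if the K-S test cannot distinguish the value distributions of the
    two feature vectors at significance level alpha (one may be dropped)."""
    statistic, pvalue = ks_2samp(f1, f2)
    return pvalue > alpha

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    gene_a = rng.normal(0.0, 1.0, size=72)
    gene_b = gene_a + rng.normal(0.0, 0.05, size=72)   # nearly the same distribution
    gene_c = rng.normal(2.0, 1.0, size=72)             # clearly different
    print(ks_redundant(gene_a, gene_b))                # expected: True
    print(ks_redundant(gene_a, gene_c))                # expected: False
```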

KS-CBS
Kolmogorov-Smirnov Correlation-Based Selection algorithm.
Relevance analysis:
1. Order the features according to decreasing values of the relevance indices, creating the S list.
Redundancy analysis:
1. Initialize F_i to the first feature in the S list.
2. Use the K-S test to find and remove from S all features for which F_i forms an approximate redundant cover C(F_i).
3. Move F_i to the set of selected features; take as F_i the next feature remaining in the S list.
4. Repeat steps 2 and 3 until the end of the S list is reached.
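Putting both stages together, a minimal sketch of KS-CBS as described above, with the F-score as the relevance index (one of the options mentioned in the conclusions); parameter values and data are illustrative:

```python
import numpy as np
from scipy.stats import f_oneway, ks_2samp

def ks_cbs(X, y, alpha=0.05, top_k=200):
    """KS-CBS sketch: relevance ranking followed by K-S redundancy removal."""
    groups = [X[y == cls] for cls in np.unique(y)]
    # Relevance analysis: rank features by F-score, keep the top_k as the S list.
    scores = np.array([f_oneway(*[g[:, j] for g in groups]).statistic
                       for j in range(X.shape[1])])
    s_list = [int(j) for j in np.argsort(scores)[::-1][:top_k]]
    selected = []
    # Redundancy analysis: F_i covers every later feature whose distribution the
    # K-S test cannot distinguish from F_i's (p-value above alpha).
    while s_list:
        fi = s_list.pop(0)
        selected.append(fi)
        s_list = [fj for fj in s_list
                  if ks_2samp(X[:, fi], X[:, fj]).pvalue <= alpha]
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = rng.normal(size=(72, 500))
    y = np.array([0] * 47 + [1] * 25)
    X[y == 1, :3] += 1.5                             # three informative genes
    X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=72)   # near-copy of gene 0
    print(ks_cbs(X, y, alpha=0.05, top_k=50))
```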

3 Datasets
Leukemia: training set of 38 bone marrow samples (27 of the ALL and 11 of the AML type), using 7129 probes from 6817 human genes; 34 test samples are provided, with 20 ALL and 14 AML cases. The data set is too small for such a fixed split.
Colon Tumor: 62 samples collected from colon cancer patients, with 40 biopsies from tumor areas (labelled as "negative") and 22 from healthy parts of the colons of the same patients; 2000 out of around 6500 genes were pre-selected, based on the confidence in the measured expression levels.
Diffuse Large B-cell Lymphoma (DLBCL): two distinct types of diffuse large B-cell lymphoma (the most common subtype of non-Hodgkin's lymphoma); 47 samples, 24 from the "germinal centre B-like" group and 23 from the "activated B-like" group; 4026 genes.

Discretization & classifiers For comparison of information selection techniques simple discretization of gene expression levels into 3 intervals is used. Variance σ, mean μ, discrete values -1, 0, +1 for ( ,μ − σ/2), [μ − σ/2, μ + σ/2 ], (μ+ σ/2,  ) Represents under-expression, baseline and over-expression of genes. Results after such discretization are in some cases significantly improved and are given in parenthesis in the tables below. Classifiers used: C4.5 decision tree (Weka), Naive Bayes with single Gaussian kernel, or discretized prob., k-NN, or 1 nearest neighbor algorithm (Ghostminer implementation) Linear SVM with C = 1 (also GM)

No. of features selected
For the standard α = 0.05 confidence level for redundancy rejection a relatively large number of features is left for Leukemia. Even for the α = 0.001 confidence level 47 features are left; it is best to optimize this level with a wrapper.
A larger number of features may lead to a more reliable "profile" (e.g. by chance a single gene in Leukemia reaches 100% accuracy on the training set).
Large improvements, up to 30% in accuracy, are observed, but with such a small number of samples statistical significance is only ~5%.
Discretization improves the results in most cases.

Results

More results

Leukemia: Bayes rules. Top: test, bottom: train; green = p(C|X) for Gaussian-smoothed density with σ = 0.01, 0.02, 0.05, 0.20 (Zyxin).

Leukemia, SVM leave-one-out (LOO): problems with stability.

Leukemia boosting: 3 best genes, evaluation using bootstrap.

Conclusions
The KS-CBS algorithm combines a relevance index (F-measure, SUC or another index) to rank and reduce the number of features with the Kolmogorov-Smirnov test to reduce the number of features further.
It is computationally efficient and gives quite good results.
Variants of this algorithm may identify approximate redundant covers for consecutive features X_i and leave in the S set only the one that gives the best results.
Problems with the stability of solutions remain, especially for data with few samples and many features; no significant difference is seen between many feature selection methods.
Frapper selects on the training data those features that are helpful, in O(m) steps; it stabilizes the LOO results a bit, but it is not a complete solution.
Will anything work reliably for microarray feature selection? Are the results published so far worth anything?