Information selection for the analysis of microarray data. Włodzisław Duch, Jacek Biesiada. Dept. of Informatics, Nicolaus Copernicus University. Google: Duch. BIT 2007

Microarrays: probes that detect complementary DNA/RNA through hybridization.

Selection of information
Attention: a basic cognitive skill (focus, constrained activations).
Find relevant information:
– discard attributes that do not contain information,
– use weights to express the relative importance of attributes,
– create new, more informative attributes,
– reduce dimensionality by aggregating information.
Ranking: treat each feature as independent.
Selection: search for feature subsets, remove redundant features.
Filters: universal, model-independent criteria.
Wrappers: criteria specific to the data model are used.
Frapper: filter + wrapper in the final stage.

Filters & Wrappers
Filter approach:
– define your problem, for example the assignment of class labels;
– define an index of relevance R(X_i) for each feature;
– rank features according to their relevance, R(X_i1) ≥ R(X_i2) ≥ ... ≥ R(X_id);
– keep all features with relevance above a threshold, R(X_i) ≥ t_R.
Wrapper approach:
– select a predictor P and a performance measure P(Data);
– define the search scheme: forward, backward or mixed selection;
– evaluate a starting subset of features X_s, e.g. the single best feature or all features;
– add/remove a feature X_i, accepting the new set X_s ← {X_s + X_i} if P(Data | X_s + X_i) > P(Data | X_s).
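A minimal sketch of the wrapper forward-selection scheme, assuming a scikit-learn-style predictor and cross-validated accuracy as the performance measure P(Data); the kNN classifier and cv=5 are illustrative choices, not part of the original experiments:

```python
# Wrapper forward selection: greedily add the feature that improves CV accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, predictor=None, cv=5):
    """Greedily add the feature X_i that most improves P(Data | X_s + X_i)."""
    if predictor is None:
        predictor = KNeighborsClassifier(n_neighbors=3)
    n_features = X.shape[1]
    selected, best_score = [], -np.inf
    improved = True
    while improved:
        improved = False
        for i in range(n_features):
            if i in selected:
                continue
            candidate = selected + [i]
            score = cross_val_score(predictor, X[:, candidate], y, cv=cv).mean()
            if score > best_score:      # accept X_s + X_i only if P(Data) improves
                best_score, best_candidate = score, candidate
                improved = True
        if improved:
            selected = best_candidate
    return selected, best_score
```

On microarray data with thousands of genes such wrappers quickly become very expensive, which is one motivation for the filter indices discussed next.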

Information theory – filters
X – vectors, X_j – attributes, X_j = f – attribute values, C_i – classes, i = 1..K; p(C, X_j) – their joint probability distribution. The amount of information contained in this joint distribution, summed over all classes, gives an estimation of feature importance. For continuous attribute values the integrals are approximated by sums; this implies discretization into r_k(f) regions, an issue in itself. Alternative: fitting the p(C_i, f) density using Gaussian or other kernels. Which method is more accurate, and what are the expected errors?
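Written out for a discretized feature (the standard definition, with base-2 logarithms), the information contained in the joint distribution is

I(C, X_j) = - \sum_{i=1}^{K} \sum_{k} p(C_i, r_k(f)) \log_2 p(C_i, r_k(f))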

Information gain
Information gained by considering the joint probability distribution p(C, f) is the difference between the information in the class and feature distributions taken separately and the information in their joint distribution (see below). A feature is more important if its information gain is larger.
Modifications of the information gain, frequently used as criteria in decision trees, include:
IGR(C, X_j) = IG(C, X_j) / I(X_j) – the gain ratio,
IGn(C, X_j) = IG(C, X_j) / I(C) – the asymmetric dependency coefficient,
D_M(C, X_j) = 1 - IG(C, X_j) / I(C, X_j) – the normalized Mantaras distance.
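In formulas (standard definitions, base-2 logarithms):

IG(C, X_j) = I(C) + I(X_j) - I(C, X_j)
           = - \sum_{i} p(C_i) \log_2 p(C_i) - \sum_{k} p(x_k) \log_2 p(x_k) + \sum_{i,k} p(C_i, x_k) \log_2 p(C_i, x_k)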

Information indices
Information gained by considering attribute X_j and the classes C together is also known as 'mutual information', equal to the Kullback-Leibler divergence between the joint and the product probability distributions. The entropy distance measure is a sum of conditional informations, and the symmetrical uncertainty coefficient is obtained from the entropy distance (formulas below).
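In standard notation (base-2 logarithms):

MI(C, X_j) = \sum_{i,k} p(C_i, x_k) \log_2 \frac{p(C_i, x_k)}{p(C_i)\, p(x_k)} = D_{KL}\bigl( p(C, X_j) \,\|\, p(C)\, p(X_j) \bigr)

D_H(C, X_j) = I(C|X_j) + I(X_j|C) = I(C) + I(X_j) - 2\, MI(C, X_j)

U(C, X_j) = 1 - \frac{D_H(C, X_j)}{I(C) + I(X_j)} = \frac{2\, MI(C, X_j)}{I(C) + I(X_j)}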

Purity indices
Many information-based quantities may be used to evaluate attributes; consistency or purity-based indices are one alternative. For selection of a subset of attributes F = {X_i} the sum runs over all Cartesian products, i.e. over the multidimensional partitions r_k(F).
Advantages: the simplest approach, usable for both ranking and selection.
Hashing techniques are used to calculate the p(r_k(F)) probabilities.
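One common form of such a purity index, given here only as an illustration (the exact formula used in the experiments may differ), sums the probability of the majority class over all cells of the partition:

P(F) = \sum_{k} \max_{i} \, p(C_i, r_k(F))

The closer P(F) is to 1, the purer the partition induced by the subset F.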

Correlation coefficient
Perhaps the simplest index is based on Pearson's correlation coefficient (CC), which calculates expectation values for the product of feature values and class values. For feature values that depend linearly on the class the correlation coefficient is +1 or -1; for complete independence of the class and the X_j distribution CC = 0.
How significant are small correlations? That depends on the number of samples n (see Numerical Recipes). For n = 1000 even a small CC = 0.02 gives P ~ 0.5, but for n = 10 such a CC gives only P ~ 0.05.
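In explicit form, with class labels treated as numerical values, and with the significance written as an erf-based estimate of the kind given in Numerical Recipes (a reconstruction consistent with the numbers quoted above):

CC(X_j, C) = \frac{\sum_n (x_n - \bar{x})(c_n - \bar{c})}{\sqrt{\sum_n (x_n - \bar{x})^2 \, \sum_n (c_n - \bar{c})^2}}, \qquad P \approx \operatorname{erf}\bigl( |CC| \sqrt{n/2} \bigr)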

Other relevance indices
Mutual information is based on the Kullback-Leibler distance; any distance measure between distributions may also be used, e.g. the Jeffreys-Matusita distance, or the Bayesian concentration measure (see below). Many other such dissimilarity measures exist. Which is the best? In practice they are all similar, although the accuracy of calculating the indices is important; relevance indices should be insensitive to noise and unbiased in their treatment of features with many values; information-theoretic indices are fine in this respect.
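Used in the same way as the KL divergence above, i.e. comparing the joint distribution with the product of the marginals, these indices may be sketched as follows (standard definitions; the exact normalizations on the original slides may differ):

D_{JM}(C, X_j) = \sqrt{ \sum_{i,k} \Bigl( \sqrt{p(C_i, x_k)} - \sqrt{p(C_i)\, p(x_k)} \Bigr)^2 }, \qquad J_{BC}(C, X_j) = \sum_{k} p(x_k) \sum_{i} p(C_i | x_k)^2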

Discretization
All indices of feature relevance require summation over probability distributions. What to do if the feature is continuous? There are two solutions:
1. Fit some functions to the histogram distributions using Parzen windows, e.g. fit a sum of several Gaussians, and integrate them to obtain the indices.
2. Discretize: put feature values in some bins and calculate sums. Possible choices:
– histograms with equal width of bins (intervals);
– histograms with an equal number of samples per bin;
– Maxdiff histograms: bin boundaries placed in the middle, (x_i + x_{i+1})/2, of the largest gaps;
– V-optimal: the sum of variances within bins should be minimal (good but costly).
A short sketch of the two simplest binning schemes is given below.
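Assuming plain NumPy, the equal-width and equal-frequency schemes can be sketched as follows; the number of bins and the skewed test feature are arbitrary choices:

```python
# Two simple discretization schemes for a continuous feature.
import numpy as np

def equal_width_bins(x, n_bins=8):
    """Assign each value to one of n_bins intervals of equal width."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

def equal_frequency_bins(x, n_bins=8):
    """Assign each value to bins holding (roughly) equal numbers of samples."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

x = np.random.default_rng(0).lognormal(size=1000)   # a skewed continuous feature
print(np.bincount(equal_width_bins(x)))             # very uneven bin counts
print(np.bincount(equal_frequency_bins(x)))         # ~125 samples per bin
```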

Discretized information
With a partition of the X_j feature values x into r_k bins, the joint information and the mutual information are calculated as sums over the bins and classes (see the formulas below).
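In standard notation, with p(C_i, r_k) estimated from the fraction of samples of class C_i that fall into bin r_k:

I(C, X_j) = - \sum_{i=1}^{K} \sum_{k} p(C_i, r_k) \log_2 p(C_i, r_k)

MI(C, X_j) = \sum_{i=1}^{K} \sum_{k} p(C_i, r_k) \log_2 \frac{p(C_i, r_k)}{p(C_i)\, p(r_k)}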

Tree (entropy-based) discretization
V-optimal histograms are good but difficult to create (dynamic programming techniques should be used). A simple approach: use decision trees with a single feature, or a small subset of features, to find good splits – this avoids local discretization.
Example: C4.5 decision tree discretization maximizing the information gain, or the SSV tree based on a separability criterion, vs. constant-width bins. Hypothyroid screening data, 5 continuous features, MI shown. Compare the amount of mutual information, or the correlation between class labels and discretized feature values, using equal-width discretization and decision tree discretization into 4, 8, 16, 24 and 32 bins.
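A sketch of entropy-based discretization of a single feature, using a shallow scikit-learn decision tree in place of the C4.5/SSV trees mentioned above; the max_leaf_nodes limit controlling the number of bins is an illustrative choice:

```python
# Tree-based discretization: use the split thresholds of an entropy tree as bin edges.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_bin_edges(x, y, max_bins=8):
    """Return split thresholds found by an entropy-based tree on one feature."""
    tree = DecisionTreeClassifier(criterion="entropy",
                                  max_leaf_nodes=max_bins).fit(x.reshape(-1, 1), y)
    thresholds = tree.tree_.threshold[tree.tree_.feature == 0]  # internal nodes only
    return np.sort(thresholds)

# Bin assignment for the feature: np.digitize(x, tree_bin_edges(x, y))
```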

Ranking algorithms
Based on:
MI(C,f): mutual information (information gain);
IGR(C,f): information gain ratio;
IC(C|f): information from the maximum of the posterior distribution of C;
GD(C,f): transinformation matrix with the Mahalanobis distance;
JBC(C,f): Bayesian purity index;
+ 7 other methods based on IC and correlation-based distances, and Relieff selection methods.
Several new simple indices; estimation errors and convergence. Markov blankets, boosting, margins.

Selection algorithms
Maximize the evaluation criterion for single features and remove redundant features.
1. MI(C;f) & MI(f;g) algorithm: rank by MI(C;f), remove features g that are redundant according to MI(f;g).
2. IC(C,f) & IC(f,g): the same algorithm, but with the IC criterion.
3. Max IC(C;F): add the single attribute that maximizes IC.
4. Max MI(C;F): add the single attribute that maximizes MI.
5. SSV decision tree based on the separability criterion.
A sketch of algorithm 1 is given below.
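A minimal sketch of algorithm 1, assuming equal-width discretization and a fixed redundancy threshold on MI(f;g); the threshold value and bin count are illustrative assumptions, not the settings used in the experiments:

```python
# Filter selection: rank by MI(C;f), skip features redundant by MI(f;g).
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(x, bins=8):
    """Equal-width discretization of a continuous feature."""
    edges = np.linspace(x.min(), x.max(), bins + 1)
    return np.digitize(x, edges[1:-1])

def mi_filter(X, y, n_select=10, bins=8, redundancy_threshold=0.5):
    Xd = np.column_stack([discretize(X[:, j], bins) for j in range(X.shape[1])])
    relevance = np.array([mutual_info_score(y, Xd[:, j]) for j in range(Xd.shape[1])])
    order = np.argsort(relevance)[::-1]          # most relevant features first
    selected = []
    for j in order:
        if len(selected) >= n_select:
            break
        # keep j only if it is not redundant with any already selected feature
        if all(mutual_info_score(Xd[:, j], Xd[:, f]) < redundancy_threshold
               for f in selected):
            selected.append(j)
    return selected
```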

4 Gaussians in 8D
Artificial data: a set of 4 Gaussians in 8D, 1000 points per Gaussian, each Gaussian a separate class. Dimensions 1-4 are independent, with the Gaussians centered at (0,0,0,0), (2,1,0.5,0.25), (4,2,1,0.5) and (6,3,1.5,0.75). Ranking and overlapping strength are inversely related, so the ranking is X_1 > X_2 > X_3 > X_4. Attributes X_{i+4} = 2 X_i + uniform noise ±1.5.
Best ranking: X_1 > X_5 > X_2 > X_6 > X_3 > X_7 > X_4 > X_8.
Best selection: X_1 > X_2 > X_3 > X_4 > X_5 > X_6 > X_7 > X_8.
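This benchmark is easy to regenerate; a sketch with NumPy, where the unit variance of each Gaussian is an assumption (the slide specifies only the centres and the noise on the copied features):

```python
# 4 Gaussian classes in 8D: features 1-4 informative, features 5-8 noisy linear copies.
import numpy as np

rng = np.random.default_rng(0)
centres = np.array([[0, 0, 0, 0],
                    [2, 1, 0.5, 0.25],
                    [4, 2, 1, 0.5],
                    [6, 3, 1.5, 0.75]])
X4 = np.vstack([rng.normal(loc=c, scale=1.0, size=(1000, 4)) for c in centres])
y = np.repeat(np.arange(4), 1000)                       # class labels
# X_{i+4} = 2 X_i + uniform noise in [-1.5, 1.5]
X8 = np.hstack([X4, 2 * X4 + rng.uniform(-1.5, 1.5, size=X4.shape)])
```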

Dimensions X_1 vs. X_2

Dimensions X_1 vs. X_5

Ranking for 8D Gaussians
Partitions of each attribute into 4, 8, 16, 24 and 32 parts of equal width.
Methods that found the perfect ranking: MI(C;f), IGR(C;f), WI(C,f), GD transinformation distance.
IC(f): correct, except for P8, where features 2 and 6 are reversed (6 is the noisy version of 2).
Other, more sophisticated algorithms made more errors. Selection for Gaussian distributions is rather easy using any evaluation measure; simpler algorithms work better.

Selection for 8D Gaussians
Partitions of each attribute into 4, 8, 16, 24 and 32 parts of equal width.
Ideal selection: subsets with {1}, {1+2}, {1+2+3} or {1+2+3+4} attributes.
1. MI(C;f) & MI(f;g) algorithm: P24 no errors; for P8, P16 and P32 a small error (4 replaced by 8).
2. Max MI(C;F): P8-P24 no errors, P32 (3,4 replaced by 7,8).
3. Max IC(C;F): P24 no errors, P8 (2 replaced by 6), P16 (3 replaced by 7), P32 (3,4 replaced by 7,8).
4. SSV decision tree based on the separability criterion: creates its own discretization; selects 1, 2, 6, 3, 7, the others are not important. Univariate trees have a bias for slanted distributions.
Selection should take into account the type of classification system that will be used.

Discretization example
MI index for 5 continuous features of the hypothyroid screening data.
EP = equal-width partition into 4, 8, 16, 24 and 32 bins.
SSV = decision tree partition (discretization) into 4, 8, 16, 24 and 32 bins.

Hypothyroid: equal bins
Mutual information for different numbers of equal-width partitions, ordered from largest to smallest, for the hypothyroid data: 6 continuous and 15 binary attributes.

Hypothyroid: SSV bins
Mutual information for different numbers of SSV decision tree partitions, ordered from largest to smallest, for the hypothyroid data. Values are twice as large since the bins are purer.

Hypothyroid: ranking
Best ranking = the largest area under the curve of accuracy(best n features).
SBL: evaluating and adding one attribute at a time (costly).
Best 2 features: SBL; best 3: SSV BFS; best 4: SSV beam; BA – failure.

Hypothyroid: ranking Results from FSM neurofuzzy system. Best 2: SBL, best 3: SSV BFS, best 4: SSV beam; BA – failure Global correlation misses local usefulness...

Hypothyroid: SSV ranking More results using FSM and selection based on SSV. SSV with beam search P24 finds the best small subsets, depending on the search depth; here best results for 5 attributes are achieved.

Leukemia: Bayes rules
Top: test, bottom: train; green = p(C|X) for a Gaussian-smoothed density with σ = 0.01, 0.02, 0.05, 0.20 (Zyxin).

Leukemia boosting 3 best genes, evaluation using bootstrap.

Leukemia boosting 3 best genes, evaluation using bootstrap.

Leukemia: SVM LVO
Problems with stability.

GM example: Leukemia data
Two types of leukemia, ALL and AML, 7129 genes, 38 training / 34 test samples. Try SSV – decision trees are usually fast and need no preprocessing.
1. Run SSV selecting 3 CVs; train accuracy is 100%, test 91% (3 errors), and only 1 feature, F4847, is used.
2. Removing it manually and running SSV creates a tree with F2020: 1 error on train, 9 errors on test. This is not very useful.
3. The SSV Stump Field shows the ranking of 1-level trees; SSV may also be used for selection, but trees separate the training data with F.
4. Standardize all the data and run SVM: 100% on train with all 38 vectors as support vectors, but the test result is 59% (14 errors).
5. Use SSV for selection of 10 features, then run SVM with these features: train 100%, test 97% (1 error only).
Small samples... but a profile of genes should be sufficient.
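An analogous pipeline to steps 4-5, sketched with scikit-learn; here a univariate mutual-information filter stands in for the SSV-based selection used in the experiments, so the numbers above should not be expected to reproduce exactly:

```python
# Standardize, keep the 10 best genes by a univariate filter, then a linear SVM.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC

pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(mutual_info_classif, k=10),   # keep the 10 best genes
    SVC(kernel="linear"),
)
# pipeline.fit(X_train, y_train); pipeline.score(X_test, y_test)
```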

Conclusions
About 20 ranking and selection methods have been checked. The actual feature evaluation index (information, consistency, correlation, etc.) is not so important.
Discretization is very important; naive equal-width or equal-distance discretization may give unpredictable results; entropy-based discretization is fine, but separability-based discretization is less expensive.
Continuous, kernel-based approximations to the calculation of feature evaluation indices are a useful alternative.
Ranking is easy if a global evaluation is sufficient, but different sets of features may be important for the separation of different classes, and some are important only in small regions – cf. decision trees.
The lack of stability is the main problem: boosting does not help.
Local selection/ranking with margins and stabilization is the most promising technique.

Open questions
Discretization or kernel estimation?
Best discretization: V-optimal histograms, entropy, separability?
Margin-based selection, as in SVM; stabilization using margins for individual vectors.
Use of feature weighting from ranking/selection to scale the input data.
Use of PCA on groups of redundant features.
Evaluation indices more sensitive to local information?
Statistical tests, such as Kolmogorov-Smirnov, to identify redundant features.
Will anything work reliably for microarray feature selection? Different methods may use the information contained in the selected attributes in different ways; are filters sufficient?