Computational Intelligence: Methods and Applications Lecture 33 Decision Tables & Information Theory Włodzisław Duch Dept. of Informatics, UMK Google: W Duch

Knowledge from kNN? Since all training vectors are used, decision borders are complex and hard to understand in logical (symbolic) terms. Reduce the number of reference vectors R_i: with one vector per class and a Euclidean distance measure this becomes the LDA linear machine solution! Select prototype vectors by: removing vectors and checking the accuracy with the leave-one-out test, or adding cluster centers and adapting their positions to maximize the leave-one-out accuracy. Example: Wisconsin breast cancer dataset, simplest known description: 699 samples, 458 benign (65.5%) and 241 malignant (34.5%) cancer cases, 9 integer-valued features. The rule IF D(X, R_681) < threshold THEN malignant, ELSE benign reaches Accuracy 97.4%, Specificity 98.8%, Sensitivity 96.8%, so far the best! [Figure: prototype vectors R_1 and R_2.]
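A minimal Python sketch of such a single-prototype rule. The prototype vector and the threshold below are hypothetical placeholders; in the lecture the prototype is training case R_681 and the threshold comes from the leave-one-out search described above.

```python
import numpy as np

def prototype_rule(x, prototype, threshold):
    """Classify by distance to a single prototype vector.

    Returns 'malignant' if the Euclidean distance to the prototype
    is below the threshold, otherwise 'benign'.
    """
    distance = np.linalg.norm(np.asarray(x, float) - np.asarray(prototype, float))
    return "malignant" if distance < threshold else "benign"

# Hypothetical values for illustration only: 9 integer-valued features,
# threshold chosen arbitrarily here (not the value found in the lecture).
R = np.array([5, 1, 1, 1, 2, 1, 3, 1, 1])
print(prototype_rule([8, 7, 5, 10, 7, 9, 5, 5, 4], R, threshold=12.0))
```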

Decision tables kNN is a local majority classifier. A decision table is a similar algorithm that works only for nominal features: given training data Table(R, C), try to reduce the number of features and the number of samples without decreasing accuracy. The decision rule is: find all stored vectors R equal to the reduced X and predict the majority class among these R. Reducing features may make some vectors X identical. Example: if only one feature X_1 = 0 or 1 is left, then the original training vectors R for each value of X_1 will carry both ω_1 and ω_2 labels. If ω_1 is in the majority for X_1 = 0, and ω_2 for X_1 = 1, then any new sample with X_1 = 0 (or X_1 = 1) is classified as ω_1 (or ω_2). Search techniques are used to find the reduced feature set: find and remove a feature whose removal does not change (or even increases) accuracy, and repeat.
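A minimal sketch of the decision-table prediction step, assuming the reduced feature subset has already been found (the feature search itself is omitted); the toy data and names are illustrative only.

```python
from collections import Counter

def decision_table_predict(train_X, train_y, x, feature_subset):
    """Predict the class of x by majority vote over training rows that
    match x on the selected (reduced) nominal features.

    Falls back to the global majority class when no row matches.
    """
    key = tuple(x[f] for f in feature_subset)
    matching = [y for row, y in zip(train_X, train_y)
                if tuple(row[f] for f in feature_subset) == key]
    votes = Counter(matching) if matching else Counter(train_y)
    return votes.most_common(1)[0][0]

# Toy nominal data: two features, the second one is irrelevant and dropped.
train_X = [(0, 'a'), (0, 'b'), (0, 'a'), (1, 'b'), (1, 'a')]
train_y = ['w1', 'w1', 'w2', 'w2', 'w2']
print(decision_table_predict(train_X, train_y, (0, 'b'), feature_subset=[0]))  # 'w1'
```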

Decision tables in WEKA/Yale If the reduced X does not match any R in the table, the majority class is predicted. The decision table algorithm looks for a reduced set of features; this is similar to what the "rough set" programs do. WEKA selects features and instances using information-theoretic measures, and searches using the best-first technique. Options:
-S num: number of fully expanded non-improving subsets to consider before terminating a best-first search (default = 5).
-X num: use cross-validation to evaluate feature usefulness; use number of folds = 1 for leave-one-out CV (default = leave-one-out CV).
-I: use nearest neighbor instead of the global table majority.

Concept of information Information may be measured by the average amount of surprise of observing X (data, signal, object). 1. If P(X)=1 there is no surprise, so s(X)=0. 2. If P(X)=0 then this is a big surprise, so s(X)=∞. 3. If two observations X, Y are independent, then P(X,Y)=P(X)P(Y), but the amount of surprise should be a sum, s(X,Y)=s(X)+s(Y). The only suitable surprise function that fulfills these requirements is the negative logarithm, s(X) = -log P(X). The average amount of surprise is called information or entropy. Entropy is a measure of disorder; information is the change in disorder.
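A quick worked check (added here, not part of the original slide) that s(X) = -log P(X) meets all three requirements:

```latex
\begin{align*}
P(X) = 1 &\;\Rightarrow\; s(X) = -\log 1 = 0 \quad\text{(no surprise)}\\
P(X) \to 0 &\;\Rightarrow\; s(X) = -\log P(X) \to \infty \quad\text{(infinite surprise)}\\
P(X,Y) = P(X)P(Y) &\;\Rightarrow\; s(X,Y) = -\log P(X) - \log P(Y) = s(X) + s(Y)
\end{align*}
```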

Information Information derived from observations of variable X (a vector variable, signal or some object) that has n possible values is thus defined as:

H(X) = -\sum_{i=1}^{n} P(X^{(i)}) \log P(X^{(i)})

If the variable X is continuous with distribution P(x), an integral is taken instead of the sum:

H(X) = -\int P(x) \log P(x) \, dx

What type of logarithm should be used? Consider a binary event with P(X^{(1)}) = P(X^{(2)}) = 0.5, like tossing a coin: how much do we learn each time? A bit. Exactly one bit. Taking log_2 gives:

H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 bit.
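A short numerical illustration (my own sketch, not from the slides): entropy in bits for a fair coin, a biased coin, and a certain outcome.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum_i p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: exactly 1 bit
print(entropy([0.9, 0.1]))   # biased coin: about 0.47 bits
print(entropy([1.0]))        # certain outcome: 0 bits, no surprise
```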

Distributions For a scalar variable, P(X^{(i)}) may be displayed in the form of a histogram, and the information (entropy) calculated for each histogram.
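A sketch of estimating entropy from a sample via such a histogram; the bin count here is an arbitrary illustrative choice.

```python
import numpy as np

def histogram_entropy(samples, bins=10):
    """Estimate entropy (in bits) of a scalar variable from data by binning
    it into a histogram and treating bin frequencies as P(X_i)."""
    counts, _ = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                     # ignore empty bins
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
print(histogram_entropy(rng.normal(size=10_000)))   # peaked histogram: lower entropy
print(histogram_entropy(rng.uniform(size=10_000)))  # flat histogram: higher entropy
```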

Joint information Other ways of introducing the concept of information start from the number of bits needed to code a signal. Suppose now that two variables, X and Y, are observed. The joint information is:

H(X,Y) = -\sum_{i,j} P(X^{(i)}, Y^{(j)}) \log P(X^{(i)}, Y^{(j)})

For two independent (uncorrelated) features this is equal to the sum of the individual informations, H(X,Y) = H(X) + H(Y), since P(X,Y) = P(X)P(Y) and the logarithm turns the product into a sum.
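A small numerical check (an illustrative sketch with an arbitrary toy distribution) that H(X,Y) = H(X) + H(Y) when the joint distribution factorizes:

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a (possibly multi-dimensional) probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px = np.array([0.5, 0.5])
py = np.array([0.8, 0.2])
pxy = np.outer(px, py)                       # independent joint: P(X,Y) = P(X)P(Y)
print(entropy_bits(pxy))                     # H(X,Y)
print(entropy_bits(px) + entropy_bits(py))   # equals H(X) + H(Y)
```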

Conditional information If the value of the Y variable is fixed and X is not quite independent of it, the conditional information (average "conditional surprise") may be useful:

H(X|Y) = -\sum_{i,j} P(X^{(i)}, Y^{(j)}) \log P(X^{(i)} | Y^{(j)})

Prove that H(X|Y) = H(X,Y) - H(Y).
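A numerical sanity check of this identity (a self-contained sketch on an arbitrary joint distribution table, with rows indexing X and columns indexing Y):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])                 # some joint distribution P(X,Y)
py = pxy.sum(axis=0)                         # marginal P(Y)
p_x_given_y = pxy / py                       # P(X|Y), column-wise
H_cond = -np.sum(pxy * np.log2(p_x_given_y))
print(H_cond)                                # H(X|Y)
print(entropy_bits(pxy) - entropy_bits(py))  # H(X,Y) - H(Y): the same value
```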

Mutual information The information in the X variable that is shared with Y is defined as:

MI(X;Y) = H(X) + H(Y) - H(X,Y) = \sum_{i,j} P(X^{(i)}, Y^{(j)}) \log \frac{P(X^{(i)}, Y^{(j)})}{P(X^{(i)}) P(Y^{(j)})}
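An illustrative sketch (same toy joint distribution as above) computing mutual information from a joint probability table via MI(X;Y) = H(X) + H(Y) - H(X,Y):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """MI(X;Y) in bits from a joint probability table P(X,Y)."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    return entropy_bits(px) + entropy_bits(py) - entropy_bits(pxy)

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
print(mutual_information(pxy))   # > 0: X and Y share some information
independent = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))
print(mutual_information(independent))   # ~0 for independent X, Y
```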

Kullback-Leibler divergence If two distributions P and Q for the X variable are compared, their divergence is expressed by:

D_{KL}(P \| Q) = \sum_i P(X^{(i)}) \log \frac{P(X^{(i)})}{Q(X^{(i)})}

The KL divergence is the expected value (under P) of the logarithm of the ratio of the two distributions; it is non-negative, but not symmetric in P and Q, so it is not a distance, although it is sometimes referred to as the "KL-distance". Mutual information is equal to the KL divergence between the joint and the product (independent) distributions:

MI(X;Y) = D_{KL}( P(X,Y) \| P(X) P(Y) )
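A short sketch of the KL divergence, checking on the same toy table that MI equals the KL divergence between the joint and the product distribution, and that KL is not symmetric:

```python
import numpy as np

def kl_divergence_bits(p, q):
    """D_KL(P||Q) in bits; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
product = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))   # P(X)P(Y)
print(kl_divergence_bits(pxy, product))   # equals MI(X;Y) computed earlier
print(kl_divergence_bits(product, pxy))   # different value: KL is not symmetric
```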

Graphical relationships The quantities H(X), H(Y), H(X,Y), H(X|Y), H(Y|X) and MI(X;Y) can be drawn as overlapping regions in a single diagram. So the total joint information is (prove it!):

H(X,Y) = H(X|Y) + MI(X;Y) + H(Y|X)

[Figure: H(X) and H(Y) as overlapping regions; the overlap is MI(X;Y), the non-overlapping parts are H(X|Y) and H(Y|X), and together they make up H(X,Y).]