1 ICPR 2006 Tin Kam Ho Bell Laboratories Lucent Technologies.

Presentation transcript:

1 ICPR 2006 Tin Kam Ho Bell Laboratories Lucent Technologies

2 Supervised Classification Discrimination, Anomaly Detection

3 Training Data in a Feature Space Given to learn the class boundary

4 Modeling the Classes Stochastic Discrimination

5 Make random guesses of the class models. Use the training set to evaluate them. Select and combine them to build a classifier … "decision fusion". But does it work? Will any set of random guesses do? Under what conditions does it work?

6 History. Mathematical theory [Kleinberg 1990 AMAI, 1996 AoS, 2000 MCS]. Development of theory [Berlind 1994 Thesis, Chen 1997 Thesis]. Algorithms, experimentation, variants [Kleinberg, Ho, Berlind, Bowen, Chen, Favata, Shekhawat, …]. Outlines of algorithm [Kleinberg 2000 PAMI]. Papers available at

7 Part I. The SD Principles.

8 Key Concepts and Tools in SD Set-theoretic abstraction Symmetry of probabilities in model space and feature space Enrichment / Uniformity / Projectability Convergence of discriminant by the law of large numbers

9 Set-Theoretic Abstraction Study classifiers by their decision regions Ignore all algorithmic details Two classifiers are equivalent if their decision regions are the same

10 The Combinatorics: set covering

11 Covering 5 points {A,B,C,D,E} using size-3 models {m1, m2, m3}: each model contains 3 of the 5 points (3/5 of the set). Fraction of models containing each point: A 1/3, B 2/3, C 2/3, D 1/3, E 3/3.

12 Promoting uniformity of coverage. Models m1,…,m6, each containing 3 of the 5 points {A,B,C,D,E} (3/5 of the set); the per-point coverage counts are tallied on the next slide.

13 Promoting uniformity of coverage. Models m1,…,m6, each containing 3 of the 5 points (3/5). Fraction of models containing each point: A 4/6, B 3/6, C 4/6, D 3/6, E 4/6.

14 A Uniform Cover of 5 points {A,B,C,D,E} using 10 models {m1,…,m10}: each model contains 3 of the 5 points (3/5), and every point is contained in 6 of the 10 models (6/10 = 3/5).

15 Uniformity Implies Symmetry: The Counting Argument. Count the number of pairs (q,m) such that "model m covers point q"; call this number N. If each point is covered by the same number α of models (the collection is a uniform cover), then N = 5 points in space × α covering models per point = 3 points in each model × β models in the collection. So 5α = 3β, hence α/β = 3/5 (here α = 6, β = 10).
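A quick way to check this argument is to enumerate a uniform cover explicitly. The sketch below is illustrative Python (not from the talk); it takes all C(5,3) = 10 size-3 subsets of {A,…,E} as the model collection of slide 14 and counts the (q, m) incidences in both ways.

```python
# Minimal sketch (assumed toy code): all C(5,3) = 10 size-3 subsets form a uniform cover.
from itertools import combinations

points = ["A", "B", "C", "D", "E"]
models = [set(m) for m in combinations(points, 3)]      # beta = 10 models

# Count the (q, m) pairs with "m covers q" in two ways.
coverage = {q: sum(q in m for m in models) for q in points}
N_by_points = sum(coverage.values())                    # 5 points x alpha models each
N_by_models = sum(len(m) for m in models)               # 3 points x beta models

print(coverage)                    # every point is covered by alpha = 6 models: a uniform cover
print(N_by_points, N_by_models)    # both equal 30, so 5*6 = 3*10 and alpha/beta = 3/5
```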

16 Example. Given a feature space F containing a set A with 10 points q0,…,q9, consider all subsets m of F that cover exactly 5 points of A, e.g., m = {q1, q2, q6, q8, q9}. Each such model m has captured 5/10 = 0.5 of A: Prob_F(q ∈ m | q ∈ A) = 0.5. Call this set of models M_{0.5,A}.

17 Some Members of M_{0.5,A}

18 (figure: members of M_{0.5,A} drawn over the points q0–q9)

19 There are C(10,5) = 252 models in M_{0.5,A}. Permute this set randomly to give m_1, m_2, …, m_252.

20 First 10 items, listed by the indices i of the points q_i they contain.

21 Make collections of increasing size: M_1 = {m_1}, M_2 = {m_1, m_2}, …, M_252 = {m_1, m_2, …, m_252}. For each point q in A, count how many members of each M_t cover q, and normalize the count by the size of M_t to obtain $Y(q, M_t) = \frac{1}{t}\sum_{k=1}^{t} C_{m_k}(q) = \mathrm{Prob}_M(q \in m \mid m \in M_t)$, where $C_m(q) = 1$ if $q \in m$, and $0$ otherwise.
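A small sketch of this construction (assumed toy Python, not the talk's original code): enumerate the 252 models of M_{0.5,A}, shuffle them, and track the running coverage fraction Y(q, M_t) for every point.

```python
# Assumed illustration of slides 16-24: running coverage fraction over a shuffled M_{0.5,A}.
import random
from itertools import combinations

A = [f"q{i}" for i in range(10)]
models = [set(m) for m in combinations(A, 5)]   # the 252 models, each capturing 0.5 of A
random.seed(0)
random.shuffle(models)                          # m_1, m_2, ..., m_252

def Y(q, t):
    """Fraction of the first t models that cover q: (1/t) * sum_k C_{m_k}(q)."""
    return sum(q in m for m in models[:t]) / t

for t in (10, 50, 252):
    print(t, [round(Y(q, t), 2) for q in A])    # values approach 0.5; exactly 0.5 at t = 252
```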

22 (table: values of Y(q, M_t) for the points q0–q9 at the first few t)

23 As t goes to 252, Y values become … The Y table continues …

24 Trace the value of Y(q, M_t) for each q as t increases. The values of Y converge to 0.5, and they are very close to 0.5 well before t = 252.

25 When t is large, we have Y(q, M_t) = Prob_M(q ∈ m | m ∈ M_t) = 0.5 = Prob_F(q ∈ m | q ∈ A). We have a symmetry of probabilities in two different spaces, M and F. This is due to the uniform coverage of M_t on A, i.e., any two points in A are covered by the same number of models in M_t.

26 Two-class discrimination. Label the points in A with 2 classes: TR_1 = {q0, q1, q2, q7, q8}, TR_2 = {q3, q4, q5, q6, q9}. Calculate a rating of each model m for each class: r_1 = Prob_F(q ∈ m | q ∈ TR_1), r_2 = Prob_F(q ∈ m | q ∈ TR_2).

27 Enriched Models. The ratings r_1 and r_2 describe how well m captures classes c_1 and c_2, as observed with TR_1 and TR_2: r_1(m) = Prob_F(q ∈ m | q ∈ TR_1), r_2(m) = Prob_F(q ∈ m | q ∈ TR_2). E.g., for m = {q1, q2, q6, q8, q9}: r_1(m) = 3/5, r_2(m) = 2/5, and the enrichment degree is d_12(m) = r_1(m) − r_2(m) = 0.2.
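The example ratings can be verified directly; a minimal check in illustrative Python, with TR_1, TR_2, and m taken from the slides:

```python
# Assumed toy check of the slide's example ratings and enrichment degree.
TR1 = {"q0", "q1", "q2", "q7", "q8"}
TR2 = {"q3", "q4", "q5", "q6", "q9"}
m = {"q1", "q2", "q6", "q8", "q9"}

r1 = len(m & TR1) / len(TR1)     # 3/5 = 0.6
r2 = len(m & TR2) / len(TR2)     # 2/5 = 0.4
d12 = r1 - r2                    # enrichment degree = 0.2
print(r1, r2, d12)
```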

28 The Discriminant. Recall $C_m(q) = 1$ if $q \in m$, and $0$ otherwise. Define $X_{12}(q,m) = \dfrac{C_m(q) - r_2(m)}{r_1(m) - r_2(m)}$ and define a discriminant $Y_{12}(q, M_t) = \dfrac{1}{t}\sum_{k=1}^{t} X_{12}(q, m_k)$.
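Below is a self-contained sketch of this discriminant on the 10-point example (an assumed illustration, not the talk's code); it shows Y_12 drifting toward 1 on TR_1 points and toward 0 on TR_2 points.

```python
# Assumed toy sketch of the two-class discriminant of slide 28 on M_{0.5,A}.
import random
from itertools import combinations

A = [f"q{i}" for i in range(10)]
TR1 = {"q0", "q1", "q2", "q7", "q8"}
TR2 = {"q3", "q4", "q5", "q6", "q9"}
models = [set(m) for m in combinations(A, 5)]   # the 252 models of M_{0.5,A}
random.seed(0)
random.shuffle(models)

def X12(q, m):
    """Per-model discriminant (C_m(q) - r2(m)) / (r1(m) - r2(m))."""
    C = 1.0 if q in m else 0.0
    r1 = len(m & TR1) / len(TR1)
    r2 = len(m & TR2) / len(TR2)
    return (C - r2) / (r1 - r2)     # r1 - r2 is never 0 for these size-5 models

def Y12(q, t):
    """Average of X12 over the first t models."""
    return sum(X12(q, m) for m in models[:t]) / t

for t in (10, 50, 252):
    print(t, {q: round(Y12(q, t), 2) for q in A})
    # drifts toward 1 on TR1 points and 0 on TR2 points; exact at t = 252
```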

29 (table: values of X_12 and Y_12 over the first few models)

30 As t goes to 252, the Y values become … The Y table continues … (for the points q0–q9)

31 Trace the value of Y(q, M_t) for each q as t increases. The values of Y converge to 1 or 0 (1 for TR_1, 0 for TR_2), and they are very close to 1 or 0 well before t = 252.

32 $X_{12}(q,m) = \dfrac{C_m(q) - r_2(m)}{r_1(m) - r_2(m)}$ versus $X_{12}(q,m) = C_m(q)$, with $Y_{12}(q, M_t) = \dfrac{1}{t}\sum_{k=1}^{t} X_{12}(q, m_k)$. Why?

33 Profile of Coverage. Find the fraction of models of each rating that cover a fixed point q: f_{M_t, r_1, TR_1}(q) and f_{M_t, r_2, TR_2}(q). Since M_t is expanded in a uniform way, as t increases, for all x, f_{M_t, x, TR_i}(q) → x.
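One way to see this numerically (an assumed toy sketch on the same 10-point example, not the talk's code): group the first t models by their rating r_1(m) and, within each group, measure the fraction that covers a fixed point.

```python
# Assumed sketch of the coverage profile f_{M_t, x, TR1}(q) of slide 33.
import random
from itertools import combinations

A = [f"q{i}" for i in range(10)]
TR1 = {"q0", "q1", "q2", "q7", "q8"}
models = [set(m) for m in combinations(A, 5)]
random.seed(0)
random.shuffle(models)

def profile(q, t):
    """For each rating x = r1(m) among the first t models, the fraction covering q."""
    by_rating = {}
    for m in models[:t]:
        x = len(m & TR1) / len(TR1)
        covered, total = by_rating.get(x, (0, 0))
        by_rating[x] = (covered + (q in m), total + 1)
    return {x: round(c / n, 2) for x, (c, n) in sorted(by_rating.items())}

print(profile("q0", 50))    # each fraction f_{M_t, x, TR1}(q0) is already near its rating x
print(profile("q0", 252))   # exactly equal at t = 252 (the full uniform cover)
```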

34 Ratings of m in M t We have models of 6 different “types”

35 Profile of Coverage of q0 at t = 10 (figure grouping the models m2, m8; m3, m5; …; m10 by rating)

36 Ratings of m (repeated for reference)

37 Profile of Coverage for a fixed point q in TR_i: f_{M_t, r, TR_i}(q) = r (plot axes: r, t, f(q))

38 Profile of coverage as a function of r_1: f_{M_t, r_1, TR_1}(q); profile of coverage as a function of r_2: f_{M_t, r_2, TR_2}(q) (plots for the points q0–q9)

39 Profile of coverage as a function of r_1: f_{M_t, r_1, TR_1}(q); profile of coverage as a function of r_2: f_{M_t, r_2, TR_2}(q) (plots for the points q0–q9)

40 Decomposition of Y: for q ∈ TR_1 the decomposition sums to 1; it can be shown to be 0 for q ∈ TR_2 in a similar way. Duality due to uniformity.
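The decomposition equation itself did not survive in the transcript; the following is a hedged LaTeX sketch of its likely form, reconstructed from the definitions on slides 28 and 33 (the names t_x and f_x(q) are introduced here for the per-rating model counts and coverage fractions).

```latex
% Assumed reconstruction, not the original slide equation.
% Group the models in M_t by rating pair x = (r_1, r_2); t_x of them have type x,
% and f_x(q) is the fraction of those that cover q.
Y_{12}(q, M_t)
  = \frac{1}{t}\sum_{k=1}^{t}\frac{C_{m_k}(q) - r_2(m_k)}{r_1(m_k) - r_2(m_k)}
  = \sum_{x}\frac{t_x}{t}\cdot\frac{f_x(q) - r_2}{r_1 - r_2}
% Uniformity makes f_x(q) -> r_1 for q in TR_1, so each ratio tends to 1 and
% Y_{12} -> sum_x t_x / t = 1; symmetrically, f_x(q) -> r_2 for q in TR_2 gives Y_{12} -> 0.
```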

41 Projectability of Models. If F has more points than the training points q: if the models m are larger – including not only the q points but also their neighboring points p – the same discriminant Y_12 can be used to classify the p points. The points p and q are M_t-indiscernible. (figure: paired points q0,p0 … q9,p9)

42 Example Definition of a Model (figure: paired points q0,p0 … q9,p9 with a model m drawn around some of them). Points within m are m-indiscernible.

43 M-indiscernibility. Points A and B within a model m are m-indiscernible: the two points cannot be distinguished w.r.t. the definition of m; within the descriptive power of m, the two points are considered the same. Berlind's hierarchy of indiscernibility.

44 M-Indiscernibility: similarity w.r.t. a specific criterion Give me a book similar to this one. A clone? A photocopy on loose paper? A translation? Another adventure story? Another paperback book? …

45 Simulated vs. Real-World Data. Model-based anomaly detection: How much of the real world do we need to reproduce in the simulator? What properties is our discriminator sensitive to by construction? Which ones are don't-cares? How do we measure success? "Transductive learning": no estimation of underlying distributions; good as long as you can make correct predictions.

46 In the SD method: Model Size and Projectability. Points within the same model share the same interpretation. Larger models -> the ratings are more stable w.r.t. sampling differences -> classification is more similar between TR and TE. Tradeoff with the ease of achieving uniformity.

47 Some Possible Forms of Weak Models Choice of model form can be based on domain knowledge: What type of local generalization is expected?

48 Enrichment and Convergence. Larger enrichment degree -> smaller variance in X -> Y converges faster. But models with a large enrichment degree are more difficult to obtain, and thus it is more difficult to achieve uniformity.

49 The 3-way Tension: Enrichment / Projectability / Uniformity

50 The 3-way Tension, in more familiar terms. Enrichment: Discriminating Power. Uniformity: Complementary Information. Projectability: Generalization Power.

51 Stochastic Discrimination A mathematical theory that relates several key concepts in pattern recognition: –Discriminative power –Complementary information –Generalization power The analysis gives guidance on constructing and understanding ensemble learning algorithms

52 Review Key Concepts and Tools in SD Set-theoretic abstraction Symmetry of probabilities in model or feature spaces Enrichment / Uniformity / Projectability Convergence of discriminant by the law of large numbers

53 Weak Models A weak model m is a subset of the feature space F It contains points sharing the same interpretation It should have a simple form, easy-to-compute membership function It should have some good size It may be cheaply produced by a stochastic process
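As a concrete and purely illustrative instance of such a weak model, one simple choice is a random axis-parallel box with a trivially cheap membership function; the sketch below is an assumption about one possible form, not the talk's prescription.

```python
# Assumed example of a weak-model form: a random axis-parallel box in feature space.
import random

def random_box_model(low, high, rng=random.Random(0)):
    """Generate a random axis-parallel box within per-dimension bounds [low, high]."""
    bounds = []
    for lo, hi in zip(low, high):
        a, b = sorted(rng.uniform(lo, hi) for _ in range(2))
        bounds.append((a, b))
    # Membership function: a point is "in" the model iff it lies inside the box.
    return lambda x: all(a <= xi <= b for xi, (a, b) in zip(x, bounds))

m = random_box_model(low=[0.0, 0.0], high=[1.0, 1.0])
print(m([0.3, 0.7]), m([0.9, 0.1]))   # cheap to produce, cheap to evaluate
```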

54 Enriched Weak Models. Rate a weak model by how well it captures points of each class. The degree of enrichment is how much the model is biased between two classes. A weak model is enriched (for class i over class j) if its enrichment degree d_ij(m) = r_i(m) − r_j(m) > 0.

55 The Stochastic Discriminant. For point q and model m, classes i and j: $X_{ij}(q,m) = \dfrac{C_m(q) - r_j(m)}{r_i(m) - r_j(m)}$. For a collection of t weak models M_t = {m_1, m_2, …, m_t}: $Y_{ij}(q, M_t) = \dfrac{1}{t}\sum_{k=1}^{t} X_{ij}(q, m_k)$.

56 A Uniform Cover. The collection of models should cover the space uniformly – any two points of the same class should fall equally likely in models of a specific rating. M is A-uniform if, for every x = r(m,A) such that M_{x,A} is nonempty and for any two points p, q in A, $\mathrm{Prob}_M(p \in m \mid m \in M_{x,A}) = \mathrm{Prob}_M(q \in m \mid m \in M_{x,A})$. We need a collection of models that is both TR_i-uniform and TR_j-uniform.

57 Symmetry between Probabilities w.r.t. F and 2^F. If M is A-uniform, then for all q in A, $\mathrm{Prob}_M(q \in m \mid m \in M_{x,A}) = x$ (counting models!). But by definition, for all m in M_{x,A}, $r(m,A) = \mathrm{Prob}_F(q \in m \mid q \in A) = x$ (counting points!).

58 Duality between Distributions of $\mathrm{Prob}_M(q \in m \mid m \in M_{x,A})$ and $\mathrm{Prob}_F(q \in m \mid q \in A)$

59 Convergence of the Discriminant. With enriched weak models, the values of X_ij are distributed around 1 for points of class i and around 0 for points of class j. Y_ij converges to E(X_ij), with variance 1/t times that of X_ij, according to the law of large numbers.

60 Convergence of the Discriminant. A classifier is obtainable within time proportional to 1/u (u = upper bound on error) and 1/d_ij^2 (d_ij = minimum enrichment degree).
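One plausible route to this bound (a hedged sketch; the talk's own derivation is not in the transcript) combines the 1/t variance reduction from slide 59 with Chebyshev's inequality:

```latex
% Assumed reasoning, not the original derivation.
% X_{ij} ranges over an interval of length on the order of 1/d_{ij}, so Var(X_{ij}) = O(1/d_{ij}^2),
% and Var(Y_{ij}) = Var(X_{ij}) / t. Chebyshev's inequality, at the misclassification margin 1/2, gives
P\left( \left| Y_{ij}(q, M_t) - E[X_{ij}] \right| \ge \tfrac{1}{2} \right)
   \le \frac{4\,\mathrm{Var}(X_{ij})}{t}
   = O\!\left( \frac{1}{t\, d_{ij}^{2}} \right),
% which falls below an error bound u once t grows in proportion to 1/(u d_{ij}^2).
```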