1 ICPR 2006. Tin Kam Ho, Bell Laboratories, Lucent Technologies
2 Supervised Classification: Discrimination, Anomaly Detection
3 Training Data in a Feature Space. The training data are given; the task is to learn the class boundary.
4 Modeling the Classes: Stochastic Discrimination
5 Make random guesses of the class models; use the training set to evaluate them; select and combine them to build a classifier … “decision fusion”. But does it work? Will any set of random guesses do? Under what conditions does it work?
6 History. Mathematical theory [Kleinberg 1990 AMAI, 1996 AoS, 2000 MCS]. Development of theory [Berlind 1994 Thesis, Chen 1997 Thesis]. Algorithms, experimentation, variants [Kleinberg, Ho, Berlind, Bowen, Chen, Favata, Shekhawat, …]. Outlines of algorithm [Kleinberg 2000 PAMI]. Papers available at
7 Part I. The SD Principles.
8 Key Concepts and Tools in SD: set-theoretic abstraction; symmetry of probabilities in model space and feature space; enrichment / uniformity / projectability; convergence of the discriminant by the law of large numbers.
9 Set-Theoretic Abstraction: study classifiers by their decision regions, ignoring all algorithmic details; two classifiers are equivalent if their decision regions are the same.
10 The Combinatorics: set covering
11 Covering 5 points {A,B,C,D,E} using size-3 models {m1,…}

    Member?   A    B    C    D    E    How many pts are in this m?
    m1        1    1              1    3/5
    m2             1    1         1    3/5
    m3                  1    1    1    3/5
    How many m's contain this pt?
              1/3  2/3  2/3  1/3  3/3

(The per-point counts are recoverable from the slide; the membership pattern shown is one assignment consistent with them.)
12-13 Promoting uniformity of coverage: add models m4, m5, m6 that favor the less-covered points.

    Member?   A    B    C    D    E    How many pts are in this m?
    m1        1    1              1    3/5
    m2             1    1         1    3/5
    m3                  1    1    1    3/5
    m4        1         1    1         3/5
    m5        1    1         1         3/5
    m6        1         1         1    3/5
    How many m's contain this pt?
              4/6  3/6  4/6  3/6  4/6

(The per-point counts are recoverable from the slide; the membership pattern shown is one assignment consistent with them.)
14 A Uniform Cover of 5 points {A,B,C,D,E} using 10 models {m1,…,m10}: the C(5,3) = 10 size-3 subsets of the points (listed here in lexicographic order).

    Member?   A    B    C    D    E    How many pts are in this m?
    m1        1    1    1              3/5
    m2        1    1         1         3/5
    m3        1    1              1    3/5
    m4        1         1    1         3/5
    m5        1         1         1    3/5
    m6        1              1    1    3/5
    m7             1    1    1         3/5
    m8             1    1         1    3/5
    m9             1         1    1    3/5
    m10                 1    1    1    3/5
    How many m's contain this pt?
              6/10 6/10 6/10 6/10 6/10  (all equal: a uniform cover)
15 Uniformity Implies Symmetry: The Counting Argument. Count the number of pairs (q,m) such that “model m covers point q”; call this number N. If each point is covered by the same number of models (the collection is a uniform cover), then
N = 5 points in space × 6 covering models for each point
  = 3 points in each model × 10 models in the collection,
i.e., 5 × 6 = 3 × 10 = 30, so 6/10 = 3/5: the fraction of models covering each point equals the fraction of points covered by each model.
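A minimal sketch in Python of this counting argument, assuming (as on slide 14) that the ten models are exactly the C(5,3) = 10 size-3 subsets:

```python
from itertools import combinations

points = "ABCDE"
# The uniform cover of slide 14: all C(5,3) = 10 size-3 models
models = list(combinations(points, 3))

# How many models contain each point (uniform cover: all counts equal)
coverage = {p: sum(p in m for m in models) for p in points}
print(coverage)  # {'A': 6, 'B': 6, 'C': 6, 'D': 6, 'E': 6}

# Counting the pairs (q, m) with "m covers q" two ways: 5*6 == 3*10 == 30
assert sum(coverage.values()) == sum(len(m) for m in models) == 30
```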
16 Example. Given a feature space F containing a set A with 10 points q0,…,q9. Consider all subsets m of F that cover exactly 5 points of A, e.g., m = {q1, q2, q6, q8, q9}. Each such model m has captured 5/10 = 0.5 of A: Prob_F(q ∈ m | q ∈ A) = 0.5. Call this set of models M_{0.5,A}.
17 Some Members of M_{0.5,A}
18 [figure: sample members of M_{0.5,A} drawn over the points q0,…,q9]
19 There are C(10,5) = 252 models in M_{0.5,A}. Permute this set randomly to give m1, m2, …, m252.
20 First 10 items, listed by the indices i of the points q_i they contain [list omitted]
21 Make collections of increasing size: M1 = {m1}, M2 = {m1, m2}, …, M252 = {m1, m2, …, m252}. For each point q in A, count how many members of each Mt cover q. Normalize the count by the size of Mt to obtain
Y(q, Mt) = (1/t) Σ_{k=1}^{t} C_{m_k}(q) = Prob_M(q ∈ m | m ∈ Mt),
where C_m(q) = 1 if q ∈ m, and 0 otherwise.
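A small simulation of slides 19-24 (a sketch; the permutation is random, so individual runs differ):

```python
import random
from itertools import combinations

A = [f"q{i}" for i in range(10)]
# M_{0.5,A}: all C(10,5) = 252 subsets capturing exactly 5 of the 10 points
models = [set(m) for m in combinations(A, 5)]
random.shuffle(models)  # the random permutation m1, ..., m252

# Y(q, M_t) = fraction of the first t models that cover q
for t in (10, 50, 252):
    Y = [sum(q in m for m in models[:t]) / t for q in A]
    print(t, [round(y, 2) for y in Y])  # drifts toward 0.5; exactly 0.5 at t = 252
```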
22-23 [tables: values of Y(q, Mt) for each q as t grows; as t goes to 252, all Y values approach 0.5]
24 Trace the value of Y(q, Mt) for each q as t increases: the values of Y converge to 0.5, and they are very close to 0.5 well before t = 252.
25 When t is large, we have Y(q, Mt) = Prob_M(q ∈ m | m ∈ Mt) = 0.5 = Prob_F(q ∈ m | q ∈ A). We have a symmetry of probabilities in two different spaces, M and F. This is due to the uniform coverage of Mt on A, i.e., any two points in A are covered by the same number of models in Mt.
26 Two-class discrimination. Label the points of A with 2 classes: TR1 = {q0, q1, q2, q7, q8}, TR2 = {q3, q4, q5, q6, q9}. Calculate a rating of each model m for each class: r1 = Prob_F(q ∈ m | q ∈ TR1), r2 = Prob_F(q ∈ m | q ∈ TR2).
27 Enriched Models. The ratings r1 and r2 describe how well m captures classes c1 and c2, as observed with TR1 and TR2: r1(m) = Prob_F(q ∈ m | q ∈ TR1), r2(m) = Prob_F(q ∈ m | q ∈ TR2). E.g., for m = {q1, q2, q6, q8, q9}: r1(m) = 3/5, r2(m) = 2/5, and the enrichment degree is d12(m) = r1(m) − r2(m) = 0.2.
28 The Discriminant. Recall C_m(q) = 1 if q ∈ m, 0 otherwise. Define
X12(q,m) = (C_m(q) − r2(m)) / (r1(m) − r2(m)).
Define a discriminant
Y12(q, Mt) = (1/t) Σ_{k=1}^{t} X12(q, m_k).
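Continuing the simulation above, a sketch of the two-class discriminant of slides 26-28 (reusing models from the previous snippet):

```python
# The two training sets of slide 26
TR1 = {"q0", "q1", "q2", "q7", "q8"}
TR2 = {"q3", "q4", "q5", "q6", "q9"}

def rating(m, TR):
    # r_i(m) = Prob_F(q in m | q in TR_i)
    return len(m & TR) / len(TR)

def X12(q, m):
    r1, r2 = rating(m, TR1), rating(m, TR2)
    # In this example every model is enriched: its 5 points split k / (5 - k)
    # between two size-5 classes, so r1 = k/5 never equals r2 = (5 - k)/5.
    return ((q in m) - r2) / (r1 - r2)

for t in (10, 50, 252):
    for q in ("q0", "q3"):  # q0 is in TR1, q3 is in TR2
        Y = sum(X12(q, m) for m in models[:t]) / t
        print(t, q, round(Y, 2))  # tends to 1 for q0 and 0 for q3
```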
29-30 [tables: values of Y12(q, Mt) for q0,…,q9 as t grows; as t goes to 252, the Y values approach 1 or 0]
31 Trace the value of Y(q, Mt) for each q as t increases: the values of Y converge to 1 or 0 (1 for TR1, 0 for TR2), and they are very close to 1 or 0 well before t = 252.
32 X12(q,m) = (C_m(q) − r2(m)) / (r1(m) − r2(m)),  Y12(q, Mt) = (1/t) Σ_{k=1}^{t} X12(q, m_k). Why does this converge to 1 on TR1 and 0 on TR2?
33 Profile of Coverage. Find the fraction of models of each rating that cover a fixed point q: f_{Mt, r1, TR1}(q) and f_{Mt, r2, TR2}(q). Since Mt is expanded in a uniform way, as t increases, for all x: f_{Mt, x, TRi}(q) → x.
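The profile of coverage can be computed in the same sketch (reusing models, rating, and TR1 from the snippets above):

```python
# Among the first t models whose rating r1(m) equals r,
# the fraction that cover q; uniform expansion drives this toward r.
def profile(q, r, t):
    group = [m for m in models[:t] if rating(m, TR1) == r]
    return sum(q in m for m in group) / len(group) if group else None

for r in (0.2, 0.4, 0.6, 0.8):
    print(r, profile("q0", r, 252))  # for q0 in TR1 this equals r exactly at t = 252
```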
34 Ratings of m in Mt [table omitted]: the models fall into 6 different “types” (rating pairs).
35 Profile of Coverage of q0 at t = 10 [figure: the first 10 models grouped by rating type, e.g. m2, m8 / m3, m5 / … / m10]
36 Ratings of m (repeated for reference) [table omitted]
37 Profile of Coverage for a fixed point q in TRi: as t grows, f_{Mt, r, TRi}(q) → r [figure: f(q) plotted against the rating r]
38-39 [figures: profile of coverage as a function of r1, f_{Mt, r1, TR1}(q), and as a function of r2, f_{Mt, r2, TR2}(q), one panel per point q0,…,q9]
40 Decomposition of Y. Group the models in Mt by rating pair (r1, r2); for q ∈ TR1, uniformity makes the fraction of each group covering q tend to r1, so each group's average of X12 tends to (r1 − r2)/(r1 − r2) = 1, and hence Y12(q, Mt) → 1. It can be shown to be 0 for q ∈ TR2 in a similar way. Duality due to uniformity.
41 Projectability of Models. If F contains more than the training points q, and if the models m are larger – including not only the q points but also their neighboring points p – then the same discriminant Y12 can be used to classify the p points. The points p and q are Mt-indiscernible. [figure: paired points q0,p0; q1,p1; …; q9,p9]
42 Example Definition of a Model. Points within m are m-indiscernible. [figure: a model m drawn over the paired points qi, pi]
43 M-indiscernibility. Points A, B within a model m are m-indiscernible: the two points cannot be distinguished w.r.t. the definition of m; within the descriptive power of m, they are considered the same. (Berlind's hierarchy of indiscernibility.)
44 M-Indiscernibility: similarity w.r.t. a specific criterion. “Give me a book similar to this one.” A clone? A photocopy on loose paper? A translation? Another adventure story? Another paperback book? …
45 Simulated vs. Real-World Data: model-based anomaly detection. How much of the real world do we need to reproduce in the simulator? What properties is our discriminator sensitive to by construction? Which ones are don't-cares? How do we measure success? “Transductive learning”: no estimation of underlying distributions; good as long as you can make correct predictions.
46 In the SD method: Model Size and Projectability. Points within the same model share the same interpretation. Larger models -> more stable ratings w.r.t. sampling differences -> more similar classification between TR and TE. Tradeoff with the ease of achieving uniformity.
47 Some Possible Forms of Weak Models Choice of model form can be based on domain knowledge: What type of local generalization is expected?
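As one illustration (a hypothetical form, not taken from the slides), axis-parallel boxes give weak models with cheap membership tests and a tunable size:

```python
import random

def random_box(dim, lo=0.0, hi=1.0, min_side=0.3):
    """One possible weak-model form: an axis-parallel box in [lo, hi]^dim.
    min_side bounds the model size from below, which supports projectability."""
    bounds = []
    for _ in range(dim):
        side = random.uniform(min_side, hi - lo)  # edge length of this dimension
        a = random.uniform(lo, hi - side)         # lower corner
        bounds.append((a, a + side))
    # Membership function: cheap to evaluate, as a weak model should be
    return lambda x: all(a <= xi <= b for xi, (a, b) in zip(x, bounds))

m = random_box(dim=2)
print(m((0.5, 0.5)), m((0.05, 0.95)))  # membership of two feature vectors
```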
48 Enrichment and Convergence. Larger enrichment degree -> smaller variance in X -> Y converges faster. But models with a large enrichment degree are more difficult to obtain, so it is more difficult to achieve uniformity.
49 The 3-way Tension: Enrichment / Projectability / Uniformity
50 The 3-way Tension, in more familiar terms. Enrichment: discriminating power. Projectability: generalization power. Uniformity: complementary information.
51 Stochastic Discrimination A mathematical theory that relates several key concepts in pattern recognition: –Discriminative power –Complementary information –Generalization power The analysis gives guidance on constructing and understanding ensemble learning algorithms
52 Review: Key Concepts and Tools in SD. Set-theoretic abstraction; symmetry of probabilities in model and feature spaces; enrichment / uniformity / projectability; convergence of the discriminant by the law of large numbers.
53 Weak Models. A weak model m is a subset of the feature space F. It contains points sharing the same interpretation. It should have a simple form with an easy-to-compute membership function. It should have a reasonably large size. It may be cheaply produced by a stochastic process.
54 Enriched Weak Models. Rate a weak model by how well it captures points of each class. The degree of enrichment, d_ij(m) = r_i(m) − r_j(m), measures how much the model is biased between the two classes. A weak model is enriched if d_ij(m) ≠ 0.
55 The Stochastic Discriminant. For a point q and model m, classes i and j:
X_ij(q,m) = (C_m(q) − r_j(m)) / (r_i(m) − r_j(m)).
For a collection of t weak models Mt = {m1, m2, …, mt}:
Y_ij(q, Mt) = (1/t) Σ_{k=1}^{t} X_ij(q, m_k).
56 A Uniform Cover. The collection of models should cover the space uniformly: any two points of the same class should fall equally likely in models of a specific rating. M is A-uniform if, for every x = r(m, A) such that M_{x,A} is nonempty, and for any two points p, q in A:
Prob_M(q ∈ m | m ∈ M_{x,A}) = Prob_M(p ∈ m | m ∈ M_{x,A}).
We need a collection of models that is both TRi-uniform and TRj-uniform. (A finite check of this definition is sketched below.)
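A direct transcription of the definition as a finite check (a sketch: models are represented as point sets, and rating_of is any rating function x = r(m, A)):

```python
def is_A_uniform(models, A, rating_of):
    """Check A-uniformity: within every nonempty rating class M_{x,A},
    every point of A must be covered by the same number of models."""
    by_rating = {}
    for m in models:
        by_rating.setdefault(rating_of(m), []).append(m)
    for group in by_rating.values():
        counts = {sum(q in m for m in group) for q in A}
        if len(counts) > 1:  # some pair of points is covered unequally
            return False
    return True

# E.g., the 252-model collection from the earlier sketch is A-uniform:
print(is_A_uniform(models, A, lambda m: len(m & set(A)) / len(A)))
```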
57 Symmetry between Probabilities w.r.t. F and 2^F. If M is A-uniform, then for all q in A: Prob_M(q ∈ m | m ∈ M_{x,A}) = x (counting models!). But by definition, for all m in M_{x,A}: Prob_F(q ∈ m | q ∈ A) = x (counting points!).
58 Duality between the distributions of the model-space coverage probability Prob_M(q ∈ m | m ∈ M_{x,A}) and the feature-space capture probability Prob_F(q ∈ m | q ∈ A) [figure omitted]
59 Convergence of the Discriminant. With enriched weak models, the values of X_ij are distributed around 1 for points of class i and around 0 for points of class j. Y_ij converges to E(X_ij), with variance 1/t times that of X_ij, by the law of large numbers.
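Spelled out, the law-of-large-numbers step (a sketch assuming, as uniformity provides, that within each rating class E[C_m(q)] = r_i for q in TR_i):

```latex
% For q in TR_i, averaging over models with a fixed rating pair (r_i, r_j):
\mathbb{E}_m\big[X_{ij}(q,m)\big]
  = \frac{\mathbb{E}_m\big[C_m(q)\big] - r_j}{r_i - r_j}
  = \frac{r_i - r_j}{r_i - r_j} = 1
  \quad\text{(similarly } 0 \text{ for } q \in TR_j\text{)},
\qquad
\operatorname{Var}\big(Y_{ij}(q,M_t)\big)
  = \tfrac{1}{t}\,\operatorname{Var}\big(X_{ij}(q,m)\big).
```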
60 Convergence of the Discriminant. A classifier is obtainable within time proportional to 1/u (u = upper bound on error) and 1/d_ij² (d_ij = minimum enrichment degree).