
1 ICPR 2006. Tin Kam Ho, Bell Laboratories, Lucent Technologies

2 Supervised Classification: Discrimination, Anomaly Detection

3 Training Data in a Feature Space. Given: the training points, from which to learn the class boundary.

4 Modeling the Classes: Stochastic Discrimination

5 Make random guesses of the class models. Use the training set to evaluate them. Select and combine them to build a classifier: "decision fusion". But does it work? Will any set of random guesses do? Under what conditions does it work?
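
A minimal sketch of this outline in Python; all names here are illustrative, models are assumed to be plain point sets, and the vote normalization anticipates the discriminant defined on slide 28:

```python
def stochastic_discrimination(train1, train2, points, make_random_model, t=1000):
    """Guess random models, rate them on the training set, keep the enriched
    ones, and average their normalized votes into a discriminant Y."""
    Y = {q: 0.0 for q in points}
    kept = 0
    while kept < t:
        m = make_random_model()                         # a random guess of a class model
        r1 = sum(q in m for q in train1) / len(train1)  # rating on class 1
        r2 = sum(q in m for q in train2) / len(train2)  # rating on class 2
        if r1 == r2:                                    # not enriched: discard
            continue
        kept += 1
        for q in points:
            x = ((q in m) - r2) / (r1 - r2)             # normalized vote (slide 28)
            Y[q] += (x - Y[q]) / kept                   # running mean over kept models
    return Y                                            # classify q by whether Y[q] > 0.5
```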

6 History. Mathematical theory: [Kleinberg 1990 AMAI, 1996 AoS, 2000 MCS]. Development of the theory: [Berlind 1994 thesis, Chen 1997 thesis]. Algorithms, experimentation, variants: [Kleinberg, Ho, Berlind, Bowen, Chen, Favata, Shekhawat, 1993-]. Outline of the algorithm: [Kleinberg 2000 PAMI]. Papers available at http://kappa.math.buffalo.edu/sd

7 Part I. The SD Principles.

8 Key Concepts and Tools in SD: set-theoretic abstraction; symmetry of probabilities in model space and feature space; enrichment / uniformity / projectability; convergence of the discriminant by the law of large numbers.

9 Set-Theoretic Abstraction. Study classifiers by their decision regions; ignore all algorithmic details. Two classifiers are equivalent if their decision regions are the same.

10 The Combinatorics: set covering

11 Covering 5 points {A, B, C, D, E} using size-3 models {m1, m2, m3}. Each model contains 3 of the 5 points (3/5). Counting how many of the 3 models contain each point: A: 1/3, B: 2/3, C: 2/3, D: 1/3, E: 3/3. The coverage is not uniform.

12 Promoting uniformity of coverage: add models m4, m5, m6, each again containing 3 of the 5 points (3/5). How many of the 6 models contain each point?

13 Promoting uniformity of coverage (continued). With the 6 models m1, ..., m6, the counts are: A: 4/6, B: 3/6, C: 4/6, D: 3/6, E: 4/6. Still not uniform.

14 A Uniform Cover of the 5 points {A, B, C, D, E} using 10 models {m1, ..., m10}, each containing 3 of the 5 points (3/5). Now every point is contained in exactly 6 of the 10 models: A = B = C = D = E = 6/10.

15 Uniformity Implies Symmetry: The Counting Argument. Count the number of pairs (q, m) such that "model m covers point q"; call this number N. If each point is covered by the same number α of models (the collection is a uniform cover), then counting N two ways: N = 5 points in the space × α covering models per point = 3 points in each model × β models in the collection. So 5α = 3β, hence α/β = 3/5 (here α = 6, β = 10).
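
A quick check of this arithmetic in Python; taking all C(5,3) = 10 size-3 subsets of {A, ..., E} is one way to realize the uniform cover of slide 14:

```python
from itertools import combinations

points = "ABCDE"
models = list(combinations(points, 3))   # all C(5,3) = 10 size-3 models
cover = {p: sum(p in m for m in models) for p in points}
print(cover)                             # every point is covered 6 times

# N counted two ways: 5 points x alpha models each = 3 points x beta models.
alpha, beta = 6, 10
assert sum(cover.values()) == 5 * alpha == 3 * beta   # N = 30
assert alpha / beta == 3 / 5
```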

16 Example. Given a feature space F containing a set A of 10 points q0, q1, ..., q9. Consider all subsets m of F that cover exactly 5 points of A, e.g., m = {q1, q2, q6, q8, q9}. Each such model m captures 5/10 = 0.5 of A: Prob_F(q ∈ m | q ∈ A) = 0.5. Call this set of models M_{0.5,A}.

17 Some members of M_{0.5,A}.

18 Listed by the indices i of the points q_i: {0,1,2,3,4}, {0,1,2,3,5}, {0,1,2,3,6}, {2,3,4,5,6}, {1,2,6,7,8}, {1,3,5,7,9}.

19 There are C(10,5) = 252 models in M_{0.5,A}. Permute this set randomly to give m1, m2, ..., m252.
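
A sketch of this enumeration in Python (the same running example is reused in the snippets that follow):

```python
import random
from itertools import combinations

A = list(range(10))                                  # indices of q0, ..., q9
models = [frozenset(c) for c in combinations(A, 5)]  # all 5-point subsets of A
assert len(models) == 252                            # C(10,5) = 252
random.shuffle(models)                               # m_1, m_2, ..., m_252
```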

20 First 10 items, listed by the indices i of q_i. [Table shown as a figure.]

21 Make collections of increasing size: M_1 = {m1}, M_2 = {m1, m2}, ..., M_252 = {m1, m2, ..., m252}. For each point q in A, count how many members of each M_t cover q. Normalize the count by the size of M_t to obtain

Y(q, M_t) = (1/t) Σ_{k=1}^{t} C_{m_k}(q) = Prob_M(q ∈ m | m ∈ M_t),

where C_m(q) = 1 if q ∈ m, and 0 otherwise.
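
Continuing the sketch above, Y(q, M_t) is simply the fraction of the first t models that cover q:

```python
def Y(q, t):
    """Y(q, M_t): fraction of the first t models covering q."""
    return sum(q in m for m in models[:t]) / t

for t in (10, 50, 252):
    print(t, [round(Y(q, t), 2) for q in A])  # every row drifts toward 0.5;
                                              # at t = 252 it is exactly 126/252
```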

22 [Figure: the table of Y(q, M_t) values for the first few t.]

23 As t goes to 252, the Y values become ... (the Y table continues).

24 Trace the value of Y(q, M_t) for each q as t increases. The values of Y converge to 0.5, and they are very close to 0.5 well before t = 252.

25 When t is large, we have Y(q, M_t) = Prob_M(q ∈ m | m ∈ M_t) = 0.5 = Prob_F(q ∈ m | q ∈ A). We have a symmetry of probabilities in two different spaces, M and F. This is due to the uniform coverage of M_t on A, i.e., any two points in A are covered by the same number of models in M_t.

26 Two-class discrimination. Label the points of A with 2 classes: TR_1 = {q0, q1, q2, q7, q8} and TR_2 = {q3, q4, q5, q6, q9}. Calculate a rating of each model m for each class: r_1 = Prob_F(q ∈ m | q ∈ TR_1), r_2 = Prob_F(q ∈ m | q ∈ TR_2).

27 Enriched Models. The ratings r_1 and r_2 describe how well m captures classes c_1 and c_2, as observed with TR_1 and TR_2: r_1(m) = Prob_F(q ∈ m | q ∈ TR_1), r_2(m) = Prob_F(q ∈ m | q ∈ TR_2). E.g., for m = {q1, q2, q6, q8, q9}: r_1(m) = 3/5 and r_2(m) = 2/5, so the enrichment degree is d_12(m) = r_1(m) - r_2(m) = 0.2.

28 The Discriminant. Recall C_m(q) = 1 if q ∈ m, and 0 otherwise. Define

X_12(q, m) = (C_m(q) - r_2(m)) / (r_1(m) - r_2(m)),

and define the discriminant

Y_12(q, M_t) = (1/t) Σ_{k=1}^{t} X_12(q, m_k).
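
Continuing the running sketch, the ratings and the discriminant follow directly from these definitions; models with r_1 = r_2 are skipped here to avoid division by zero (the definition of X_12 requires r_1 ≠ r_2):

```python
TR1 = {0, 1, 2, 7, 8}
TR2 = {3, 4, 5, 6, 9}

def ratings(m):
    """r_1(m) and r_2(m): fraction of each training class captured by m."""
    return len(m & TR1) / len(TR1), len(m & TR2) / len(TR2)

def Y12(q, t):
    """Y_12(q, M_t): mean of X_12 over the first t (enriched) models."""
    votes = []
    for m in models[:t]:
        r1, r2 = ratings(m)
        if r1 != r2:                                # skip unenriched models
            votes.append(((q in m) - r2) / (r1 - r2))
    return sum(votes) / len(votes)

print([round(Y12(q, 252), 2) for q in sorted(TR1)])  # values near 1
print([round(Y12(q, 252), 2) for q in sorted(TR2)])  # values near 0
```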

29 [Figure: the table of Y_12(q, M_t) values for the first few t.]

30 As t goes to 252, the Y values become ... (the Y table, with columns q0, ..., q9, continues).

31 Trace the value of Y_12(q, M_t) for each q as t increases. The values of Y converge to 1 or 0 (1 for TR_1, 0 for TR_2), and they are very close to 1 or 0 well before t = 252.

32 X_12(q, m) = (C_m(q) - r_2(m)) / (r_1(m) - r_2(m)); compare with the one-class count X(q, m) = C_m(q). Y_12(q, M_t) = (1/t) Σ_{k=1}^{t} X_12(q, m_k). Why does this converge to 1 or 0?

33 Profile of Coverage. Find the fraction of models of each rating that cover a fixed point q: f_{M_t, r_1, TR_1}(q) and f_{M_t, r_2, TR_2}(q). Since M_t is expanded in a uniform way, as t increases, f_{M_t, x, TR_i}(q) → x for all x.
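
Continuing the sketch, the coverage profile groups the first t models by rating and measures what fraction of each group covers a fixed q (profile is an illustrative helper name, not from the talk):

```python
from collections import defaultdict

def profile(q, t):
    """f_{M_t, r_1, TR_1}(q): per-rating fraction of models covering q."""
    groups = defaultdict(list)
    for m in models[:t]:
        r1, _ = ratings(m)
        groups[r1].append(q in m)
    return {r: sum(g) / len(g) for r, g in sorted(groups.items())}

print(profile(0, 252))   # each rating r maps to a fraction that tends to r
```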

34 Ratings of the m in M_t: we have models of 6 different "types" (ratings). [Table shown as a figure.]

35 Profile of coverage of q0 at t = 10. [Figure: the first 10 models grouped by rating, e.g., m2, m8; m3, m5; ...; m10.]

36 Ratings of m (repeated for reference).

37 Profile of coverage for a fixed point q in TR_i: f_{M_t, r, TR_i}(q) → r. [Figure: f(q) plotted against r as t grows.]

38-39 Profile of coverage as a function of r_1: f_{M_t, r_1, TR_1}(q). Profile of coverage as a function of r_2: f_{M_t, r_2, TR_2}(q). [Plots for each of the points q0, ..., q9.]

40 Decomposition of Y. [The derivation, shown as a figure, groups the sum in Y_12 by model rating: for q ∈ TR_1, uniformity gives f_{M_t, x, TR_1}(q) → x, and the weighted sum collapses to 1.] It can be shown to be 0 for q ∈ TR_2 in a similar way; this duality is due to uniformity.

41 Projectability of Models. If F has more points than the training points q: if the models m are larger, including not only the q points but also their neighboring points p, then the same discriminant Y_12 can be used to classify the p points. The points p and q are M_t-indiscernible. [Figure: pairs q0,p0; q1,p1; ...; q9,p9.]

42 Example Definition of a Model. [Figure: a region m containing some of the pairs q_i, p_i.] Points within m are m-indiscernible.

43 M-indiscernibility. Points A, B within m are m-indiscernible: the two points cannot be distinguished w.r.t. the definition of m. Within the descriptive power of m, the two points are considered the same. (Berlind's hierarchy of indiscernibility.)

44 M-indiscernibility: similarity w.r.t. a specific criterion. "Give me a book similar to this one." A clone? A photocopy on loose paper? A translation? Another adventure story? Another paperback book? ...

45 Simulated vs. Real-World Data; model-based anomaly detection. How much of the real world do we need to reproduce in the simulator? What properties is our discriminator sensitive to by construction? Which ones are don't-cares? How do we measure success? "Transductive learning": no estimation of the underlying distributions; good as long as you can make correct predictions.

46 In the SD method: Model Size and Projectability. Points within the same model share the same interpretation. Larger models -> ratings are more stable w.r.t. sampling differences -> classification is more similar between TR and TE. Tradeoff: larger models make uniformity harder to achieve.

47 Some Possible Forms of Weak Models. The choice of model form can be based on domain knowledge: what type of local generalization is expected? [Figure: example forms.]
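
One concrete possibility, sketched under the assumption of a two-dimensional real-valued feature space (the talk does not prescribe any particular form): random axis-aligned boxes, which give nearby points identical memberships and hence projectability:

```python
import random

def random_box(lo=0.0, hi=1.0, min_side=0.2):
    """Return a membership test for a random axis-aligned rectangle.
    Nearby points fall into the same boxes (m-indiscernibility)."""
    def interval():
        a = random.uniform(lo, hi)
        return a, min(a + random.uniform(min_side, hi - lo), hi)
    (x0, x1), (y0, y1) = interval(), interval()
    return lambda p: x0 <= p[0] <= x1 and y0 <= p[1] <= y1

m = random_box()
print(m((0.5, 0.5)), m((0.51, 0.5)))   # close points usually agree
```

Half-planes, balls, or decision-tree leaves would serve the same role; the box is only one easy-to-compute choice.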

48 Enrichment and Convergence. A larger enrichment degree -> smaller variance in X -> Y converges faster. But models with a large enrichment degree are more difficult to obtain, and thus it is more difficult to achieve uniformity.

49 The 3-Way Tension: Enrichment / Projectability / Uniformity.

50 The 3-way tension, in more familiar terms. Enrichment: discriminating power. Projectability: generalization power. Uniformity: complementary information.

51 Stochastic Discrimination: a mathematical theory that relates several key concepts in pattern recognition: discriminative power, complementary information, and generalization power. The analysis gives guidance on constructing and understanding ensemble learning algorithms.

52 Review: Key Concepts and Tools in SD. Set-theoretic abstraction; symmetry of probabilities in the model and feature spaces; enrichment / uniformity / projectability; convergence of the discriminant by the law of large numbers.

53 Weak Models. A weak model m is a subset of the feature space F. It contains points sharing the same interpretation. It should have a simple form with an easy-to-compute membership function. It should have a reasonable size. It may be cheaply produced by a stochastic process.

54 Enriched Weak Models. Rate a weak model by how well it captures points of each class. The degree of enrichment is how much the model is biased between two classes. A weak model is enriched if its two ratings differ, i.e., d_ij(m) = r_i(m) - r_j(m) ≠ 0.

55 The Stochastic Discriminant. For a point q, model m, and classes i and j:

X_ij(q, m) = (C_m(q) - r_j(m)) / (r_i(m) - r_j(m)).

For a collection of t weak models M_t = {m_1, m_2, ..., m_t}:

Y_ij(q, M_t) = (1/t) Σ_{k=1}^{t} X_ij(q, m_k).

56 A Uniform Cover. The collection of models should cover the space uniformly: any two points of the same class should fall equally likely in models of a specific rating. M is A-uniform if, for every x = r(m, A) such that M_{x,A} is nonempty, and for any two points p, q in A:

Prob_M(p ∈ m | m ∈ M_{x,A}) = Prob_M(q ∈ m | m ∈ M_{x,A}).

We need a collection of models that is both TR_i-uniform and TR_j-uniform.

57 Symmetry between probabilities w.r.t. F and 2^F. If M is A-uniform, then for all q in A,

Prob_M(q ∈ m | m ∈ M_{x,A}) = x   (counting models!).

But by definition, for all m in M_{x,A},

Prob_F(q ∈ m | q ∈ A) = x   (counting points!).

58 Duality between the distributions of Prob_F(q ∈ m | q ∈ A) (over the models m) and Prob_M(q ∈ m | m ∈ M_t) (over the points q). [Shown as a figure.]

59 Convergence of the Discriminant. With enriched weak models, the values of X_ij are distributed around 1 for points of class i and around 0 for points of class j. Y_ij converges to E(X_ij), with variance 1/t times that of X_ij, by the law of large numbers.
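
Written out, the law-of-large-numbers step is (assuming the m_k are drawn independently):

```latex
\mathbb{E}\,Y_{ij}(q, M_t) = \mathbb{E}\,X_{ij}(q, m), \qquad
\operatorname{Var} Y_{ij}(q, M_t)
  = \operatorname{Var}\Big(\frac{1}{t}\sum_{k=1}^{t} X_{ij}(q, m_k)\Big)
  = \frac{1}{t}\,\operatorname{Var} X_{ij}(q, m).
```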

60 Convergence of the Discriminant. A classifier is obtainable within time proportional to 1/u (u = upper bound on error) and 1/d_ij^2 (d_ij = minimum enrichment degree).

