
1 PAC-Bayesian Analysis of Unsupervised Learning Yevgeny Seldin Joint work with Naftali Tishby

2 Motivation
We do not cluster data just for the sake of it, but to facilitate the solution of some higher-level task.
The quality of a clustering should be evaluated by its contribution to the solution of that task.
The main obstacle in the development of unsupervised learning is the absence of a good formulation of the context of its application.

3 Outline
Two problems behind co-clustering
– Discriminative prediction
– Density estimation
PAC-Bayesian analysis of discriminative prediction with co-clustering
– Combinatorial priors
PAC-Bayesian bound for discrete density estimation
– Density estimation with co-clustering
Extensions

4 Discriminative Prediction with Co-clustering
Example: collaborative filtering
Goal: find a discriminative prediction rule q(Y|X1,X2)
[Figure: rating matrix with viewers X1 as rows and movies X2 as columns; entries Y are observed ratings]

5 Discriminative Prediction with Co-clustering
Example: collaborative filtering
Goal: find a discriminative prediction rule q(Y|X1,X2)
Evaluation: expected loss, with the expectation taken w.r.t. the true distribution p(X1,X2,Y) and w.r.t. the classifier q(Y|X1,X2), for a given loss l(Y,Y')
[Figure: rating matrix with viewers X1 as rows and movies X2 as columns; entries Y are observed ratings]

6 Discriminative Prediction with Co-clustering
Example: collaborative filtering
Goal: find a discriminative prediction rule q(Y|X1,X2)
Evaluation: model-independent comparison via the expected loss
[Figure: rating matrix with viewers X1 as rows and movies X2 as columns; entries Y are observed ratings]
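The evaluation criterion on slides 5–6 appeared as an equation in the original slides; from the surviving annotations (expectation w.r.t. the true distribution p(X1,X2,Y), expectation w.r.t. the classifier q(Y|X1,X2), a given loss l(Y,Y')), it is presumably the expected loss:

```latex
L(q) \;=\; \mathbb{E}_{p(X_1,X_2,Y)}\,\mathbb{E}_{q(Y'|X_1,X_2)}\bigl[\,l(Y,Y')\,\bigr].
```

The comparison is model-independent in the sense that L(q) can be evaluated for any prediction rule q, regardless of the model family that produced it.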

7 Density Estimation with Co-clustering
Example: word–document co-occurrence data
Goal: find an estimator q(X1,X2) for the joint distribution p(X1,X2)
[Figure: co-occurrence matrix with documents X1 as rows and words X2 as columns]

8 Density Estimation with Co-clustering
Example: word–document co-occurrence data
Goal: find an estimator q(X1,X2) for the joint distribution p(X1,X2)
Evaluation: −E_{p(X1,X2)} ln q(X1,X2), where p(X1,X2) is the true distribution
[Figure: co-occurrence matrix with documents X1 as rows and words X2 as columns]

9 Density Estimation with Co-clustering
Example: word–document co-occurrence data
Goal: find an estimator q(X1,X2) for the joint distribution p(X1,X2)
Evaluation: model-independent comparison
[Figure: co-occurrence matrix with documents X1 as rows and words X2 as columns]
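The evaluation criterion here is stated explicitly on slide 40 as the expected log-loss; it differs from the KL divergence to the true distribution only by the constant entropy term, which is presumably what "model-independent comparison" refers to:

```latex
-\,\mathbb{E}_{p(X_1,X_2)}\ln q(X_1,X_2)
\;=\; H\bigl(p(X_1,X_2)\bigr)\;+\;\mathrm{KL}\bigl(p(X_1,X_2)\,\big\|\,q(X_1,X_2)\bigr).
```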

10 Discriminative Prediction Based on Co-clustering
Model: cluster each viewer X1 into C1 via q1(C1|X1), cluster each movie X2 into C2 via q2(C2|X2), and predict the rating from the cluster pair via q(Y|C1,C2)
[Figure: graphical model over X1, X2, C1, C2, Y; the rating matrix is partitioned into row clusters C1 and column clusters C2, and Y is predicted at the cluster-cell level]
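A minimal sketch of such a prediction rule, assuming (as suggested by the analogous formula on slide 47) that the rule marginalizes over the cluster assignments, i.e. q(Y|X1,X2) = Σ_{C1,C2} q1(C1|X1) q2(C2|X2) q(Y|C1,C2). Names and shapes are illustrative assumptions, not the authors' code:

```python
import numpy as np

def predict_rating_distribution(q_c1_given_x1, q_c2_given_x2, q_y_given_c, x1, x2):
    """Predicted distribution over Y for a (viewer x1, movie x2) pair.

    q_c1_given_x1: (|X1|, |C1|) array, rows sum to 1
    q_c2_given_x2: (|X2|, |C2|) array, rows sum to 1
    q_y_given_c:   (|C1|, |C2|, |Y|) array, last axis sums to 1
    """
    w1 = q_c1_given_x1[x1]  # soft cluster membership of viewer x1, shape (|C1|,)
    w2 = q_c2_given_x2[x2]  # soft cluster membership of movie x2, shape (|C2|,)
    # Mix the cell-level label distributions by the cluster memberships.
    return np.einsum("a,b,aby->y", w1, w2, q_y_given_c)

# Tiny usage example with random, properly normalized parameters.
rng = np.random.default_rng(0)
q1 = rng.random((5, 2)); q1 /= q1.sum(axis=1, keepdims=True)      # 5 viewers, 2 viewer clusters
q2 = rng.random((7, 3)); q2 /= q2.sum(axis=1, keepdims=True)      # 7 movies, 3 movie clusters
qy = rng.random((2, 3, 4)); qy /= qy.sum(axis=-1, keepdims=True)  # 4 possible ratings
print(predict_rating_distribution(q1, q2, qy, x1=0, x2=4))        # a distribution over ratings
```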

11 PAC-Bayesian Analysis
H – all hard partitions of X1 and X2, plus labels for the partition cells
P – combinatorial prior (next slide)
Q – {q(C1|X1), q(C2|X2), q(Y|C1,C2)}
q(xi,ci) = (1/|Xi|) q(ci|xi)
[Figure: the co-clustering model q1(C1|X1), q2(C2|X2), q(Y|C1,C2) over the X1 × X2 matrix]

12 Prior Construction
|Xi| possibilities to choose |Ci|

13 Prior Construction
|Xi| possibilities to choose |Ci|
≤ |Xi|^(|Ci|−1) possibilities to choose a cardinality profile along dimension i

14 Prior Construction
|Xi| possibilities to choose |Ci|
≤ |Xi|^(|Ci|−1) possibilities to choose a cardinality profile along dimension i
The multinomial coefficient (|Xi| choose n_{i,1},…,n_{i,|Ci|}) possibilities to assign the Xi's to the Ci's, given the cardinality profile

15 Prior Construction
|Xi| possibilities to choose |Ci|
≤ |Xi|^(|Ci|−1) possibilities to choose a cardinality profile along dimension i
The multinomial coefficient (|Xi| choose n_{i,1},…,n_{i,|Ci|}) possibilities to assign the Xi's to the Ci's, given the cardinality profile
|Y|^(|C1||C2|) possibilities to assign labels to partition cells

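Assuming the prior spreads its mass uniformly over each of the choices above (a reading of slides 12–15, not a verbatim quote of the paper), the counts combine as:

```latex
P(h)\;\ge\;\prod_{i=1,2}
\Bigl(\,|X_i|\cdot|X_i|^{|C_i|-1}\cdot
\binom{|X_i|}{n_{i,1},\dots,n_{i,|C_i|}}\Bigr)^{-1}
\cdot|Y|^{-|C_1||C_2|},
```

where n_{i,c} is the number of Xi's assigned to cell c. Using the standard bound on the multinomial coefficient, ln (|Xi| choose n_{i,1},…,n_{i,|Ci|}) ≤ |Xi| H(Ci), with H(Ci) the entropy of the cardinality profile (n_{i,c}/|Xi|)_c, this gives −ln P(h) ≤ Σ_i [ |Xi| H(Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|, the quantity bounded on slide 19.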

17 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)

18 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)
D(Q||P) = E_{Q(h)}[−ln P(h)] − H(Q)

19 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)
D(Q||P) = E_{Q(h)}[−ln P(h)] − H(Q)
E_{Q(h)}[−ln P(h)] ≤ Σ_i [ |Xi| H(Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|

20 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)
D(Q||P) = E_{Q(h)}[−ln P(h)] − H(Q)
E_{Q(h)}[−ln P(h)] ≤ Σ_i [ |Xi| H(Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|
−H(Q) ≤ −Σ_i |Xi| H(Ci|Xi)

21 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)
D(Q||P) = E_{Q(h)}[−ln P(h)] − H(Q)
E_{Q(h)}[−ln P(h)] ≤ Σ_i [ |Xi| H(Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|
−H(Q) ≤ −Σ_i |Xi| H(Ci|Xi)
D(Q||P) ≤ Σ_i [ |Xi| I(Xi;Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|
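A small sketch (illustrative names, not the authors' code) of evaluating the right-hand side of slide 21, D(Q||P) ≤ Σ_i [ |Xi| I(Xi;Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|, with I(Xi;Ci) computed under the uniform marginal q(xi) = 1/|Xi| from slide 11:

```python
import numpy as np

def mutual_information_uniform(q_c_given_x):
    """I(X;C) in nats for the joint q(x,c) = q(c|x) / |X| (uniform q(x))."""
    n_x = q_c_given_x.shape[0]
    q_c = q_c_given_x.mean(axis=0)           # q(c) = (1/|X|) * sum_x q(c|x)
    eps = 1e-12                               # numerical guard for empty clusters
    log_ratio = np.log((q_c_given_x + eps) / (q_c + eps))
    return float((q_c_given_x * log_ratio).sum() / n_x)

def kl_upper_bound(q_c1_given_x1, q_c2_given_x2, n_labels):
    """Upper bound on D(Q||P) from slide 21."""
    total = 0.0
    for q in (q_c1_given_x1, q_c2_given_x2):
        n_x, n_c = q.shape
        total += n_x * mutual_information_uniform(q) + n_c * np.log(n_x)
    total += q_c1_given_x1.shape[1] * q_c2_given_x2.shape[1] * np.log(n_labels)
    return total

# Example: 943 viewers in 13 clusters, 1682 movies in 6 clusters, 5-star ratings (slides 25-26).
rng = np.random.default_rng(0)
q1 = rng.random((943, 13)); q1 /= q1.sum(axis=1, keepdims=True)
q2 = rng.random((1682, 6)); q2 /= q2.sum(axis=1, keepdims=True)
print(kl_upper_bound(q1, q2, n_labels=5))
```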

22 Generalization Bound
With probability ≥ 1−δ:
[Bound equation; callouts mark the terms that are logarithmic in |Xi|, the dependence on the number of partition cells, and the PAC-Bayesian part of the bound]

23 Generalization Bound
With probability ≥ 1−δ:
[Bound equation]
Low complexity: I(Xi;Ci) = 0; high complexity: I(Xi;Ci) = ln|Xi|

24 Generalization Bound
With probability ≥ 1−δ:
[Bound equation]
Optimization trade-off: empirical loss vs. "effective" partition complexity (ranging from lower to higher complexity)
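The bound equation itself did not survive the transcript. A plausible reconstruction, assuming the standard PAC-Bayes-kl form combined with the D(Q||P) bound from slide 21 (the exact additive log terms may differ from the paper):

```latex
\mathrm{kl}\bigl(\hat{L}(Q)\,\big\|\,L(Q)\bigr)\;\le\;
\frac{\sum_i\bigl[\,|X_i|\,I(X_i;C_i)+|C_i|\ln|X_i|\,\bigr]
+|C_1||C_2|\ln|Y|+\ln\frac{N+1}{\delta}}{N},
```

where kl(·||·) is the binary KL divergence, L̂(Q) is the empirical loss on the N observed ratings, and L(Q) the expected loss. The |Ci| ln|Xi| terms are logarithmic in |Xi| and the |C1||C2| ln|Y| term counts the partition cells, matching the callouts on slide 22.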

25 Application
Replace the bound minimization with a trade-off (a plausible form is sketched after this slide)
MovieLens dataset
– 100,000 ratings on a 5-star scale
– 80,000 train ratings, 20,000 test ratings
– 943 viewers × 1682 movies
– State-of-the-art Mean Absolute Error (0.72)
– The optimal performance is achieved even with a 300×300 cluster space
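A plausible form of the trade-off referred to above, with β1, β2 as tunable trade-off parameters (a reading of the slide, not the paper's exact objective):

```latex
\min_{Q}\;\;\hat{L}(Q)\;+\;\beta_1\,I(X_1;C_1)\;+\;\beta_2\,I(X_2;C_2).
```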

26 13×6 Clusters
[Plot: Mean Absolute Error — bound vs. test MAE]

27 50×50 Clusters
[Plot: Mean Absolute Error — bound vs. test MAE]

28 283×283 Clusters
[Plot: Mean Absolute Error — bound vs. test MAE]

29 PAC-Bayesian Bound for Discrete Density Estimation
– X – sample space, H – hypothesis space
– Each h ∈ H is a function h: X → Z; Z is finite
– p_h(Z) = P_{X~p(X)}{h(X) = Z}

30 PAC-Bayesian Bound for Discrete Density Estimation
– X – sample space, H – hypothesis space
– Each h ∈ H is a function h: X → Z; Z is finite
– p_h(Z) = P_{X~p(X)}{h(X) = Z}
– P – prior over H, Q – posterior over H
– p_Q(Z) = E_{Q(h)} p_h(Z)

31 PAC-Bayesian Bound for Discrete Density Estimation
– X – sample space, H – hypothesis space
– Each h ∈ H is a function h: X → Z; Z is finite
– p_h(Z) = P_{X~p(X)}{h(X) = Z}
– P – prior over H, Q – posterior over H
– p_Q(Z) = E_{Q(h)} p_h(Z)
PAC-Bayesian Bound for Density Estimation:
– With probability ≥ 1−δ, for all Q simultaneously:
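The bound equation was an image in the original. A plausible reconstruction, consistent with the moment bound E_S e^{Φ(h)} ≤ (N+1)^{|Z|−1} used in the proof sketch on slides 33–36 (the exact form in the paper may differ slightly):

```latex
\mathrm{KL}\bigl(\hat{p}_Q(Z)\,\big\|\,p_Q(Z)\bigr)\;\le\;
\frac{D(Q\|P)+(|Z|-1)\ln(N+1)-\ln\delta}{N},
```

where p̂_Q(Z) = E_{Q(h)} p̂_h(Z) and p̂_h is the empirical distribution of h(X) over an i.i.d. sample of size N.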

32 PAC-Bayesian Bound for Classification
Density estimation: with probability ≥ 1−δ, the bound from slide 31 holds.
Classification = density estimation of the error variable
If Z is the error variable, then |Z| = 2 and p_Q(Z) = L(Q):
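With |Z| = 2, the previous bound specializes (under the same caveat that the constants are reconstructed, not transcribed) to the familiar PAC-Bayes-kl classification bound:

```latex
\mathrm{kl}\bigl(\hat{L}(Q)\,\big\|\,L(Q)\bigr)\;\le\;
\frac{D(Q\|P)+\ln(N+1)-\ln\delta}{N}.
```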

33 Proof Idea
Change of measure inequality: E_{Q(h)} Φ(h) ≤ D(Q||P) + ln E_{P(h)} e^{Φ(h)}

34 Proof Idea
Change of measure inequality: E_{Q(h)} Φ(h) ≤ D(Q||P) + ln E_{P(h)} e^{Φ(h)}
Choose Φ(h) [equation not recoverable from the transcript]
Show that E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1), where S = {X_1, …, X_N}

35 Proof Idea
Change of measure inequality: E_{Q(h)} Φ(h) ≤ D(Q||P) + ln E_{P(h)} e^{Φ(h)}
Choose Φ(h) [equation not recoverable from the transcript]
Show that E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1), where S = {X_1, …, X_N}
E_S E_{P(h)} e^{Φ(h)} = E_{P(h)} E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1)

36 Proof Idea
Change of measure inequality: E_{Q(h)} Φ(h) ≤ D(Q||P) + ln E_{P(h)} e^{Φ(h)}
Choose Φ(h) [equation not recoverable from the transcript]
Show that E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1), where S = {X_1, …, X_N}
E_S E_{P(h)} e^{Φ(h)} = E_{P(h)} E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1)
By Markov's inequality, w.p. ≥ 1−δ:
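Putting the steps together (a sketch; the chosen function is presumably Φ(h) = N·kl(p̂_h || p_h), whose definition was an image in the original): applying Markov's inequality to E_S E_{P(h)} e^{Φ(h)} ≤ (N+1)^{|Z|−1} gives, with probability ≥ 1−δ over S, E_{P(h)} e^{Φ(h)} ≤ (N+1)^{|Z|−1}/δ, and substituting into the change-of-measure inequality yields

```latex
\mathbb{E}_{Q(h)}\,\Phi(h)\;\le\;D(Q\|P)+\ln\frac{(N+1)^{|Z|-1}}{\delta}
\qquad\text{simultaneously for all }Q.
```

Dividing by N and using convexity of the KL divergence to move the expectation over Q(h) inside gives the bound on slide 31.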

37 A Bound on

38

39

40 Density Estimation with Co-clustering
Given: a sample of (X1, X2) co-occurrence pairs
Estimate: p(X1,X2)
Minimize: −E_{p(X1,X2)} ln q(X1,X2)

41 Density Estimation with Co-clustering
Model: Q = {q(C1|X1), q(C2|X2)}
[Figure: graphical model X1 – C1 – C2 – X2, with cluster assignments q1(C1|X1), q2(C2|X2) over the X1 × X2 matrix]

42 Density Estimation with Co-clustering
Estimator:
[Figure: graphical model X1 – C1 – C2 – X2, with q1(C1|X1) and q2(C2|X2)]
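The estimator equation was an image. A plausible form, consistent with the graphical model X1 – C1 – C2 – X2 on slide 41 (an assumption, not a transcription):

```latex
q(X_1,X_2)\;=\;\sum_{C_1,C_2} q(X_1|C_1)\,q(C_1,C_2)\,q(X_2|C_2),
```

where q(Xi|Ci) is obtained from q(Ci|Xi) by Bayes' rule with the uniform marginal q(xi) = 1/|Xi|, and q(C1,C2) is a distribution over cluster pairs estimated from the data.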

43 Density Estimation with Co-clustering
Model: Q = {q(C1|X1), q(C2|X2)}
With probability ≥ 1−δ:

44 Density Estimation with Co-clustering
With probability ≥ 1−δ:
Related work
– Information-Theoretic Co-clustering [Dhillon et al. ’03]: maximize I(C1;C2) alone
– The PAC-Bayesian approach provides regularization and model-order selection

45 Graphical Models
[Figure: two graphical models — discriminative prediction over X1, X2, C1, C2, Y; density estimation over X1, X2, C1, C2]

46 Tree-Shaped Graphical Models
Generalization bound:
– Trade-off between empirical performance and the mutual information propagated up the tree
Future work:
– Optimization algorithms
– Applications
– More general graphical models
[Figure: a tree over X1–X4, cluster variables C1–C4, intermediate variables D1, D2, and label Y]

47 Graph Clustering, Pairwise Clustering
The weights of the links are generated according to:
q(w_ij | Xi, Xj) = Σ_{Ca,Cb} q(w_ij | Ca, Cb) q(Ca | Xi) q(Cb | Xj)
This is the co-clustering model with a shared q(C|X)
– The same bounds and algorithms apply

48 Summary of Main Contributions
PAC-Bayesian analysis of unsupervised learning
– Co-clustering
– Tree-shaped graphical models
– Graph clustering, pairwise clustering
PAC-Bayesian bound for discrete density estimation
Combinatorial priors → mutual-information regularization terms

49 Future Directions
PAC-Bayesian analysis of unsupervised learning
– Clustering
– Continuous density estimation
– General graphical models
PAC-Bayesian analysis of reinforcement learning
References: Seldin and Tishby, ICML 2008; AISTATS 2009; JMLR 2010

