1
PAC-Bayesian Analysis of Unsupervised Learning
Yevgeny Seldin
Joint work with Naftali Tishby
2
Motivation
We do not cluster data just for the sake of it, but rather to facilitate the solution of some higher-level task.
The quality of a clustering should be evaluated by its contribution to the solution of that task.
The main obstacle in the development of unsupervised learning is the absence of a good formulation of the context of its application.
3
Outline
Two problems behind co-clustering
– Discriminative prediction
– Density estimation
PAC-Bayesian analysis of discriminative prediction with co-clustering
– Combinatorial priors
PAC-Bayesian bound for discrete density estimation
– Density estimation with co-clustering
Extensions
4
Discriminative Prediction with Co-clustering
Example: collaborative filtering
(Figure: ratings matrix with entries Y; rows X1 = viewers, columns X2 = movies.)
Goal: find a discriminative prediction rule q(Y|X1,X2)
Evaluation: L(q) = E_{p(X1,X2,Y)} E_{q(Y'|X1,X2)} l(Y,Y')
– the outer expectation is w.r.t. the true distribution p(X1,X2,Y), the inner w.r.t. the classifier q(Y'|X1,X2), for a given loss l(Y,Y')
– this evaluation is model-independent, so competing prediction rules can be compared directly
7
Density Estimation with Co-clustering
Example: word-document co-occurrence data
(Figure: co-occurrence matrix with rows X1 = documents, columns X2 = words.)
Goal: find an estimator q(X1,X2) of the joint distribution p(X1,X2)
Evaluation: −E_{p(X1,X2)} ln q(X1,X2), the expectation taken w.r.t. the true distribution p(X1,X2)
– this evaluation is model-independent
10
Discriminative prediction based on co-clustering
Model: q(Y|X1,X2) = Σ_{C1,C2} q(Y|C1,C2) q(C1|X1) q(C2|X2)
(Figure: graphical model X1 → C1, X2 → C2, with (C1,C2) → Y.)
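Concretely, the prediction rule marginalizes over the two cluster assignments. A minimal runnable sketch (all sizes and the random initialization are illustrative, not from the talk):

```python
import numpy as np

# Sketch of the co-clustering prediction rule
# q(Y|X1,X2) = sum_{c1,c2} q(Y|c1,c2) q(c1|X1) q(c2|X2).

rng = np.random.default_rng(0)
n_x1, n_x2, n_c1, n_c2, n_y = 6, 8, 3, 4, 5

def row_stochastic(rows, cols):
    """Each row is a conditional distribution."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

q_c1_given_x1 = row_stochastic(n_x1, n_c1)                               # q(C1|X1)
q_c2_given_x2 = row_stochastic(n_x2, n_c2)                               # q(C2|X2)
q_y_given_c = row_stochastic(n_c1 * n_c2, n_y).reshape(n_c1, n_c2, n_y)  # q(Y|C1,C2)

# Marginalize the cluster assignments to get q(Y|X1,X2).
q_y_given_x = np.einsum("ia,jb,aby->ijy", q_c1_given_x1, q_c2_given_x2, q_y_given_c)
assert np.allclose(q_y_given_x.sum(axis=2), 1.0)  # a valid conditional distribution
```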
11
PAC-Bayesian Analysis
H – all hard partitions plus labels for partition cells
P – combinatorial prior (next slide)
Q – {q(C1|X1), q(C2|X2), q(Y|C1,C2)}
q(xi, ci) = (1/|Xi|) q(ci|xi)
12
Prior Construction
|Xi| possibilities to choose |Ci|
≤ |Xi|^(|Ci|−1) possibilities to choose a cardinality profile along dimension i
|Xi|! / (n1! ⋯ n|Ci|!) possibilities to assign Xi-s to Ci-s for a given cardinality profile (n1, …, n|Ci|)
|Y|^(|C1||C2|) possibilities to assign labels to partition cells
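Treating the prior as uniform over each of the choices above (an assumption consistent with the bound on the next slide) and using the method-of-types bound on the multinomial coefficient, the log prior cost of a hypothesis works out as:

```latex
% Multinomial bound: |X_i|! / (n_1! \cdots n_{|C_i|}!) \le e^{|X_i| H(C_i)},
% where H(C_i) is the entropy of the cardinality profile n/|X_i|.
\begin{aligned}
-\ln P(h)
 &\le \sum_{i=1}^{2}\Big[\ln|X_i| + (|C_i|-1)\ln|X_i| + |X_i| H(C_i)\Big] + |C_1||C_2|\ln|Y| \\
 &= \sum_{i=1}^{2}\big[\,|X_i| H(C_i) + |C_i|\ln|X_i|\,\big] + |C_1||C_2|\ln|Y|.
\end{aligned}
```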
17
Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)
q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)
D(Q||P) = E_{Q(h)}[−ln P(h)] − H(Q)
E_{Q(h)}[−ln P(h)] ≤ Σ_i [ |Xi| H(Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|
−H(Q) ≤ −Σ_i |Xi| H(Ci|Xi)
D(Q||P) ≤ Σ_i [ |Xi| I(Xi;Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|
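A small sketch of how this complexity bound can be computed from the soft assignment matrices q(Ci|Xi), with the uniform marginal q(xi) = 1/|Xi| from the slide (function names are my own):

```python
import numpy as np

def mutual_information(q_c_given_x: np.ndarray) -> float:
    """I(X;C) in nats under q(x,c) = q(c|x) / |X| (uniform marginal over X)."""
    n_x = q_c_given_x.shape[0]
    joint = q_c_given_x / n_x                        # q(x,c)
    q_c = joint.sum(axis=0)                          # q(c) = (1/|X|) sum_x q(c|x)
    nz = joint > 0
    ratio = q_c_given_x[nz] / np.broadcast_to(q_c, joint.shape)[nz]  # q(c|x)/q(c)
    return float(np.sum(joint[nz] * np.log(ratio)))

def kl_upper_bound(q1: np.ndarray, q2: np.ndarray, n_y: int) -> float:
    """sum_i [ |Xi| I(Xi;Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|."""
    total = 0.0
    for q in (q1, q2):
        n_x, n_c = q.shape
        total += n_x * mutual_information(q) + n_c * np.log(n_x)
    return total + q1.shape[1] * q2.shape[1] * np.log(n_y)
```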
22
Generalization Bound
With probability ≥ 1−δ:
kl( L̂(Q) || L(Q) ) ≤ ( Σ_i [ |Xi| I(Xi;Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y| + ln((N+1)/δ) ) / N
– |Ci| ln|Xi| is only logarithmic in |Xi|
– |C1||C2| is the number of partition cells
– the kl(·||·) form is the usual PAC-Bayesian bound part
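This is the PAC-Bayes-kl bound with the D(Q||P) estimate from the previous slide substituted in; a sketch of the substitution (the ln((N+1)/δ) factor follows from the proof idea later in the talk):

```latex
% PAC-Bayes-kl:  kl(\hat L(Q) \,\|\, L(Q)) \le \frac{D(Q\|P) + \ln\frac{N+1}{\delta}}{N}.
% Substituting  D(Q\|P) \le \sum_i [ |X_i| I(X_i;C_i) + |C_i|\ln|X_i| ] + |C_1||C_2|\ln|Y| :
kl\big(\hat L(Q) \,\big\|\, L(Q)\big)
  \le \frac{\sum_i \big[\, |X_i| I(X_i;C_i) + |C_i|\ln|X_i| \,\big]
            + |C_1||C_2|\ln|Y| + \ln\frac{N+1}{\delta}}{N}
```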
23
Generalization Bound
The mutual information terms give the bound a complexity scale:
– Low complexity: I(Xi;Ci) = 0 (all of Xi assigned to a single cluster)
– High complexity: I(Xi;Ci) = ln|Xi| (each xi in its own cluster)
24
Generalization Bound
Optimization tradeoff: empirical loss vs. "effective" partition complexity, moving along the scale from lower to higher complexity.
25
Application
Replace the bound with a trade-off: empirical loss weighted against the mutual information ("effective" complexity) terms.
MovieLens dataset:
– 100,000 ratings on a 5-star scale
– 80,000 train ratings, 20,000 test ratings
– 943 viewers × 1682 movies
– State-of-the-art Mean Absolute Error (0.72)
– Optimal performance is achieved even with a 300×300 cluster space
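In code, the objective this suggests is a weighted version of the bound; a sketch (beta and the helper names are illustrative, reusing kl_upper_bound() from the earlier sketch; the talk does not specify the exact optimization procedure here):

```python
def tradeoff_objective(empirical_loss: float, q1, q2, n_y: int,
                       beta: float, n_samples: int) -> float:
    """Empirical loss traded off against the bound's complexity term.

    beta is a tunable weight replacing the bound's fixed combination of the
    two terms; kl_upper_bound() is defined in the earlier sketch.
    """
    return empirical_loss + beta * kl_upper_bound(q1, q2, n_y) / n_samples
```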
26
13×6 Clusters
(Figure: Mean Absolute Error; curves for the bound and the test MAE.)
27
50×50 Clusters
(Figure: Mean Absolute Error; curves for the bound and the test MAE.)
28
283×283 Clusters
(Figure: Mean Absolute Error; curves for the bound and the test MAE.)
29
PAC-Bayesian Bound for Discrete Density Estimation
– X – sample space, H – hypothesis space
– Each h ∈ H is a function h: X → Z, with Z finite
– p_h(z) = P_{X∼p(X)}{ h(X) = z }
– P – prior over H, Q – posterior over H
– p_Q(z) = E_{Q(h)} p_h(z)
PAC-Bayesian Bound for Density Estimation:
– With probability ≥ 1−δ for all Q simultaneously (p̂ denoting the empirical distribution over a sample of size N):
KL( p̂_Q || p_Q ) ≤ ( D(Q||P) + (|Z|−1) ln(N+1) + ln(1/δ) ) / N
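A quick numerical sanity check of this bound in its simplest instance, Q = P concentrated on a single h, so that D(Q||P) = 0 (the distribution p and the sizes below are arbitrary test values):

```python
import numpy as np

# With Q = P a point mass on one h:  with prob >= 1-delta,
#   KL(p_hat || p) <= ( (|Z|-1) ln(N+1) + ln(1/delta) ) / N.

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])                  # true p_h over |Z| = 3 values
N, delta, trials = 200, 0.05, 10_000

def kl(a, b):
    nz = a > 0
    return float(np.sum(a[nz] * np.log(a[nz] / b[nz])))

rhs = ((len(p) - 1) * np.log(N + 1) + np.log(1 / delta)) / N
violations = sum(kl(rng.multinomial(N, p) / N, p) > rhs for _ in range(trials))
print(f"violation rate: {violations / trials:.4f} (should be <= {delta})")
```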
32
PAC-Bayesian bound for classification
Density estimation, with probability ≥ 1−δ:
KL( p̂_Q || p_Q ) ≤ ( D(Q||P) + (|Z|−1) ln(N+1) + ln(1/δ) ) / N
Classification = density estimation of the error variable.
If Z is the error variable, then |Z| = 2 and p_Q(Z = 1) = L(Q):
kl( L̂(Q) || L(Q) ) ≤ ( D(Q||P) + ln(N+1) + ln(1/δ) ) / N
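To read an explicit upper bound on L(Q) off the kl form, the binary kl is inverted numerically; a minimal sketch (the bisection helper is my own, not from the talk):

```python
import math

def binary_kl(q: float, p: float) -> float:
    """kl(q || p) for Bernoulli distributions, in nats."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse_upper(emp_loss: float, rhs: float, tol: float = 1e-9) -> float:
    """Largest p in [emp_loss, 1] with kl(emp_loss || p) <= rhs, by bisection."""
    lo, hi = emp_loss, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(emp_loss, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Example: empirical loss 0.2 and a bound right-hand side of 50/1000.
print(kl_inverse_upper(0.2, 50 / 1000))
```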
33
Proof Idea
Change of measure inequality: E_{Q(h)} φ(h) ≤ D(Q||P) + ln E_{P(h)} e^{φ(h)}
Choose: φ(h) = N · KL( p̂_h || p_h )
Show that E_S e^{φ(h)} ≤ (N+1)^{|Z|−1}, where S = {X_1, …, X_N}
Then E_S E_{P(h)} e^{φ(h)} = E_{P(h)} E_S e^{φ(h)} ≤ (N+1)^{|Z|−1}
By Markov's inequality, w.p. ≥ 1−δ: E_{P(h)} e^{φ(h)} ≤ (N+1)^{|Z|−1} / δ
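The moment bound E_S e^{φ(h)} ≤ (N+1)^{|Z|−1} is the standard method-of-types argument; a worked sketch (not spelled out on the slides):

```latex
% For S = {X_1,...,X_N} i.i.d., \hat p_h is the empirical distribution (type) of h(X).
% The number of types is at most (N+1)^{|Z|-1}, and  P_S(\hat p_h = T) \le e^{-N\,KL(T \| p_h)}.
\begin{aligned}
\mathbb{E}_S\, e^{N\,KL(\hat p_h \| p_h)}
 &= \sum_{\text{types } T} P_S(\hat p_h = T)\, e^{N\,KL(T \| p_h)} \\
 &\le \sum_{\text{types } T} e^{-N\,KL(T \| p_h)}\, e^{N\,KL(T \| p_h)}
  = \#\{\text{types}\} \le (N+1)^{|Z|-1}.
\end{aligned}
```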
37
A Bound on … (equation slide; content not preserved in the transcript)
40
Density Estimation with Co-clustering
Given: a sample of co-occurrences over X1 × X2
Estimate: p(X1,X2)
Minimize: −E_{p(X1,X2)} ln q(X1,X2)
41
Density Estimation with Co-clustering
Model: Q = {q(C1|X1), q(C2|X2)}
(Figure: graphical model over X1, X2, C1, C2 with factors q1(C1|X1) and q2(C2|X2).)
42
Density Estimation with Co-clustering
Estimator: q(X1,X2) = Σ_{C1,C2} q(C1,C2) q(X1|C1) q(X2|C2)
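A runnable sketch of this factored estimator (shapes and random values are illustrative only):

```python
import numpy as np

# q(X1,X2) = sum_{c1,c2} q(c1,c2) q(X1|c1) q(X2|c2)

rng = np.random.default_rng(2)
n_x1, n_x2, n_c1, n_c2 = 6, 8, 3, 4

def row_stochastic(rows, cols):
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

q_c = rng.random((n_c1, n_c2))
q_c /= q_c.sum()                              # q(C1,C2): joint over cluster pairs
q_x1_given_c1 = row_stochastic(n_c1, n_x1)    # q(X1|C1)
q_x2_given_c2 = row_stochastic(n_c2, n_x2)    # q(X2|C2)

q_x = np.einsum("ab,ai,bj->ij", q_c, q_x1_given_c1, q_x2_given_c2)
assert np.isclose(q_x.sum(), 1.0)             # a proper joint distribution
```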
43
Density Estimation with Co-clustering
Model: Q = {q(C1|X1), q(C2|X2)}
With probability ≥ 1−δ:
KL( p̂_Q || p_Q ) ≤ ( Σ_i [ |Xi| I(Xi;Ci) + |Ci| ln|Xi| ] + (|C1||C2|−1) ln(N+1) + ln(1/δ) ) / N
44
Density Estimation with Co-clustering
The bound above holds with probability ≥ 1−δ for all Q simultaneously.
Related work:
– Information-Theoretic Co-clustering [Dhillon et al. '03]: maximizes I(C1;C2) alone
– The PAC-Bayesian approach provides regularization and model order selection
45
Graphical Models
(Figure: the two graphical models side by side: discriminative prediction over X1, X2, C1, C2, Y; density estimation over X1, X2, C1, C2.)
46
Tree-Shaped Graphical Models
(Figure: a tree with leaves X1, …, X4, internal nodes C1, …, C4, D1, D2, and root Y.)
Generalization bound:
– Trade-off between empirical performance and the mutual information propagated up the tree
Future work:
– Optimization algorithms
– Applications
– More general graphical models
47
Graph Clustering, Pairwise Clustering
The weights of the links are generated according to:
q(w_ij | X_i, X_j) = Σ_{C_a,C_b} q(w_ij | C_a, C_b) q(C_a | X_i) q(C_b | X_j)
This is the co-clustering model with a shared q(C|X) (see the sketch below)
– The same bounds and algorithms apply
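A minimal sketch of the shared-assignment version of the model (all names and sizes are hypothetical; only the formula above is from the talk):

```python
import numpy as np

# Pairwise clustering: both endpoints of a link share one assignment q(C|X).
# q(w_ij | X_i, X_j) = sum_{a,b} q(w_ij | C_a, C_b) q(C_a|X_i) q(C_b|X_j)

rng = np.random.default_rng(3)
n_nodes, n_clusters, n_w = 10, 3, 2   # n_w: e.g. binary edge weights

def row_stochastic(rows, cols):
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

q_c_given_x = row_stochastic(n_nodes, n_clusters)   # shared q(C|X) for both endpoints
q_w_given_cc = row_stochastic(n_clusters * n_clusters, n_w).reshape(n_clusters, n_clusters, n_w)

q_w = np.einsum("ia,jb,abw->ijw", q_c_given_x, q_c_given_x, q_w_given_cc)
assert np.allclose(q_w.sum(axis=2), 1.0)   # a distribution over w for every pair (i,j)
```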
48
Summary of main contributions
PAC-Bayesian analysis of unsupervised learning:
– Co-clustering
– Tree-shaped graphical models
– Graph clustering, pairwise clustering
PAC-Bayesian bound for discrete density estimation
Combinatorial priors → mutual information regularization terms
49
Future Directions
PAC-Bayesian analysis of unsupervised learning:
– Clustering
– Continuous density estimation
– General graphical models
PAC-Bayesian analysis of reinforcement learning
References: Seldin and Tishby, ICML 2008; AISTATS 2009; JMLR 2010