
1 PAC-Bayesian Analysis of Unsupervised Learning Yevgeny Seldin Joint work with Naftali Tishby

2 Motivation
We do not cluster data just for the sake of it, but to facilitate the solution of some higher-level task.
The quality of a clustering should be evaluated by its contribution to the solution of that task.
The main obstacle in the development of unsupervised learning is the absence of a good formulation of the context of its application.

3 Outline
Two problems behind co-clustering
– Discriminative prediction
– Density estimation
PAC-Bayesian analysis of discriminative prediction with co-clustering
– Combinatorial priors
PAC-Bayesian bound for discrete density estimation
– Density estimation with co-clustering
Extensions

4 Discriminative Prediction with Co-clustering
Example: collaborative filtering
Goal: find a discriminative prediction rule q(Y|X1,X2)
[Figure: rating matrix with viewers X1 as rows and movies X2 as columns; entries Y are observed ratings]

5 Discriminative Prediction with Co-clustering
Example: collaborative filtering
Goal: find a discriminative prediction rule q(Y|X1,X2)
Evaluation: expected loss, with the expectation taken w.r.t. the true distribution p(X1,X2,Y) and w.r.t. the classifier q(Y|X1,X2), for a given loss l(Y,Y')
[Figure: rating matrix with viewers X1 as rows and movies X2 as columns; entries Y are observed ratings]

6 Discriminative Prediction with Co-clustering
Example: collaborative filtering
Goal: find a discriminative prediction rule q(Y|X1,X2)
Evaluation: model-independent comparison via the expected loss
[Figure: rating matrix with viewers X1 as rows and movies X2 as columns; entries Y are observed ratings]
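The evaluation criterion on slides 5–6 appeared as an equation in the original slides; from the surviving annotations (expectation w.r.t. the true distribution p(X1,X2,Y), expectation w.r.t. the classifier q(Y|X1,X2), a given loss l(Y,Y')), it is presumably the expected loss:

```latex
L(q) \;=\; \mathbb{E}_{p(X_1,X_2,Y)}\,\mathbb{E}_{q(Y'|X_1,X_2)}\bigl[\,l(Y,Y')\,\bigr].
```

The comparison is model-independent in the sense that L(q) can be evaluated for any prediction rule q, regardless of the model family that produced it.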

7 Density Estimation with Co-clustering
Example: word–document co-occurrence data
Goal: find an estimator q(X1,X2) for the joint distribution p(X1,X2)
[Figure: co-occurrence matrix with documents X1 as rows and words X2 as columns]

8 Density Estimation with Co-clustering
Example: word–document co-occurrence data
Goal: find an estimator q(X1,X2) for the joint distribution p(X1,X2)
Evaluation: −E_{p(X1,X2)} ln q(X1,X2), where p(X1,X2) is the true distribution
[Figure: co-occurrence matrix with documents X1 as rows and words X2 as columns]

9 Density Estimation with Co-clustering
Example: word–document co-occurrence data
Goal: find an estimator q(X1,X2) for the joint distribution p(X1,X2)
Evaluation: model-independent comparison
[Figure: co-occurrence matrix with documents X1 as rows and words X2 as columns]
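The evaluation criterion here is stated explicitly on slide 40 as the expected log-loss; it differs from the KL divergence to the true distribution only by the constant entropy term, which is presumably what "model-independent comparison" refers to:

```latex
-\,\mathbb{E}_{p(X_1,X_2)}\ln q(X_1,X_2)
\;=\; H\bigl(p(X_1,X_2)\bigr)\;+\;\mathrm{KL}\bigl(p(X_1,X_2)\,\big\|\,q(X_1,X_2)\bigr).
```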

10 Discriminative Prediction Based on Co-clustering
Model: cluster each viewer X1 into C1 via q1(C1|X1), cluster each movie X2 into C2 via q2(C2|X2), and predict the rating from the cluster pair via q(Y|C1,C2)
[Figure: graphical model over X1, X2, C1, C2, Y; the rating matrix is partitioned into row clusters C1 and column clusters C2, and Y is predicted at the cluster-cell level]
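A minimal sketch of such a prediction rule, assuming (as suggested by the analogous formula on slide 47) that the rule marginalizes over the cluster assignments, i.e. q(Y|X1,X2) = Σ_{C1,C2} q1(C1|X1) q2(C2|X2) q(Y|C1,C2). Names and shapes are illustrative assumptions, not the authors' code:

```python
import numpy as np

def predict_rating_distribution(q_c1_given_x1, q_c2_given_x2, q_y_given_c, x1, x2):
    """Predicted distribution over Y for a (viewer x1, movie x2) pair.

    q_c1_given_x1: (|X1|, |C1|) array, rows sum to 1
    q_c2_given_x2: (|X2|, |C2|) array, rows sum to 1
    q_y_given_c:   (|C1|, |C2|, |Y|) array, last axis sums to 1
    """
    w1 = q_c1_given_x1[x1]  # soft cluster membership of viewer x1, shape (|C1|,)
    w2 = q_c2_given_x2[x2]  # soft cluster membership of movie x2, shape (|C2|,)
    # Mix the cell-level label distributions by the cluster memberships.
    return np.einsum("a,b,aby->y", w1, w2, q_y_given_c)

# Tiny usage example with random, properly normalized parameters.
rng = np.random.default_rng(0)
q1 = rng.random((5, 2)); q1 /= q1.sum(axis=1, keepdims=True)      # 5 viewers, 2 viewer clusters
q2 = rng.random((7, 3)); q2 /= q2.sum(axis=1, keepdims=True)      # 7 movies, 3 movie clusters
qy = rng.random((2, 3, 4)); qy /= qy.sum(axis=-1, keepdims=True)  # 4 possible ratings
print(predict_rating_distribution(q1, q2, qy, x1=0, x2=4))        # a distribution over ratings
```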

11 PAC-Bayesian Analysis
H – all hard partitions of X1 and X2, plus labels for the partition cells
P – combinatorial prior (next slide)
Q – {q(C1|X1), q(C2|X2), q(Y|C1,C2)}
q(xi,ci) = (1/|Xi|) q(ci|xi)
[Figure: the co-clustering model q1(C1|X1), q2(C2|X2), q(Y|C1,C2) over the X1 × X2 matrix]

12 Prior Construction
|Xi| possibilities to choose |Ci|

13 Prior Construction
|Xi| possibilities to choose |Ci|
≤ |Xi|^(|Ci|−1) possibilities to choose a cardinality profile along dimension i

14 Prior Construction
|Xi| possibilities to choose |Ci|
≤ |Xi|^(|Ci|−1) possibilities to choose a cardinality profile along dimension i
The multinomial coefficient (|Xi| choose n_{i,1},…,n_{i,|Ci|}) possibilities to assign the Xi's to the Ci's, given the cardinality profile

15 Prior Construction
|Xi| possibilities to choose |Ci|
≤ |Xi|^(|Ci|−1) possibilities to choose a cardinality profile along dimension i
The multinomial coefficient (|Xi| choose n_{i,1},…,n_{i,|Ci|}) possibilities to assign the Xi's to the Ci's, given the cardinality profile
|Y|^(|C1||C2|) possibilities to assign labels to partition cells

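Assuming the prior spreads its mass uniformly over each of the choices above (a reading of slides 12–15, not a verbatim quote of the paper), the counts combine as:

```latex
P(h)\;\ge\;\prod_{i=1,2}
\Bigl(\,|X_i|\cdot|X_i|^{|C_i|-1}\cdot
\binom{|X_i|}{n_{i,1},\dots,n_{i,|C_i|}}\Bigr)^{-1}
\cdot|Y|^{-|C_1||C_2|},
```

where n_{i,c} is the number of Xi's assigned to cell c. Using the standard bound on the multinomial coefficient, ln (|Xi| choose n_{i,1},…,n_{i,|Ci|}) ≤ |Xi| H(Ci), with H(Ci) the entropy of the cardinality profile (n_{i,c}/|Xi|)_c, this gives −ln P(h) ≤ Σ_i [ |Xi| H(Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|, the quantity bounded on slide 19.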

17 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)

18 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)
D(Q||P) = E_{Q(h)}[−ln P(h)] − H(Q)

19 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)
D(Q||P) = E_{Q(h)}[−ln P(h)] − H(Q)
E_{Q(h)}[−ln P(h)] ≤ Σ_i [ |Xi| H(Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|

20 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)
D(Q||P) = E_{Q(h)}[−ln P(h)] − H(Q)
E_{Q(h)}[−ln P(h)] ≤ Σ_i [ |Xi| H(Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|
−H(Q) ≤ −Σ_i |Xi| H(Ci|Xi)

21 Calculation of D(Q||P)
q(Xi,Ci) = (1/|Xi|) q(Ci|Xi)  ⇒  q(Ci) = (1/|Xi|) Σ_{xi} q(Ci|xi)
D(Q||P) = E_{Q(h)}[−ln P(h)] − H(Q)
E_{Q(h)}[−ln P(h)] ≤ Σ_i [ |Xi| H(Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|
−H(Q) ≤ −Σ_i |Xi| H(Ci|Xi)
D(Q||P) ≤ Σ_i [ |Xi| I(Xi;Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|
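A small sketch (illustrative names, not the authors' code) of evaluating the right-hand side of slide 21, D(Q||P) ≤ Σ_i [ |Xi| I(Xi;Ci) + |Ci| ln|Xi| ] + |C1||C2| ln|Y|, with I(Xi;Ci) computed under the uniform marginal q(xi) = 1/|Xi| from slide 11:

```python
import numpy as np

def mutual_information_uniform(q_c_given_x):
    """I(X;C) in nats for the joint q(x,c) = q(c|x) / |X| (uniform q(x))."""
    n_x = q_c_given_x.shape[0]
    q_c = q_c_given_x.mean(axis=0)           # q(c) = (1/|X|) * sum_x q(c|x)
    eps = 1e-12                               # numerical guard for empty clusters
    log_ratio = np.log((q_c_given_x + eps) / (q_c + eps))
    return float((q_c_given_x * log_ratio).sum() / n_x)

def kl_upper_bound(q_c1_given_x1, q_c2_given_x2, n_labels):
    """Upper bound on D(Q||P) from slide 21."""
    total = 0.0
    for q in (q_c1_given_x1, q_c2_given_x2):
        n_x, n_c = q.shape
        total += n_x * mutual_information_uniform(q) + n_c * np.log(n_x)
    total += q_c1_given_x1.shape[1] * q_c2_given_x2.shape[1] * np.log(n_labels)
    return total

# Example: 943 viewers in 13 clusters, 1682 movies in 6 clusters, 5-star ratings (slides 25-26).
rng = np.random.default_rng(0)
q1 = rng.random((943, 13)); q1 /= q1.sum(axis=1, keepdims=True)
q2 = rng.random((1682, 6)); q2 /= q2.sum(axis=1, keepdims=True)
print(kl_upper_bound(q1, q2, n_labels=5))
```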

22 Generalization Bound
With probability ≥ 1−δ:
[Bound equation; callouts mark the terms that are logarithmic in |Xi|, the dependence on the number of partition cells, and the PAC-Bayesian part of the bound]

23 Generalization Bound
With probability ≥ 1−δ:
[Bound equation]
Low complexity: I(Xi;Ci) = 0; high complexity: I(Xi;Ci) = ln|Xi|

24 Generalization Bound
With probability ≥ 1−δ:
[Bound equation]
Optimization trade-off: empirical loss vs. "effective" partition complexity (ranging from lower to higher complexity)
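The bound equation itself did not survive the transcript. A plausible reconstruction, assuming the standard PAC-Bayes-kl form combined with the D(Q||P) bound from slide 21 (the exact additive log terms may differ from the paper):

```latex
\mathrm{kl}\bigl(\hat{L}(Q)\,\big\|\,L(Q)\bigr)\;\le\;
\frac{\sum_i\bigl[\,|X_i|\,I(X_i;C_i)+|C_i|\ln|X_i|\,\bigr]
+|C_1||C_2|\ln|Y|+\ln\frac{N+1}{\delta}}{N},
```

where kl(·||·) is the binary KL divergence, L̂(Q) is the empirical loss on the N observed ratings, and L(Q) the expected loss. The |Ci| ln|Xi| terms are logarithmic in |Xi| and the |C1||C2| ln|Y| term counts the partition cells, matching the callouts on slide 22.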

25 Application
Replace the bound minimization with a trade-off (a plausible form is sketched after this slide)
MovieLens dataset
– 100,000 ratings on a 5-star scale
– 80,000 train ratings, 20,000 test ratings
– 943 viewers × 1682 movies
– State-of-the-art Mean Absolute Error (0.72)
– The optimal performance is achieved even with a 300×300 cluster space
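A plausible form of the trade-off referred to above, with β1, β2 as tunable trade-off parameters (a reading of the slide, not the paper's exact objective):

```latex
\min_{Q}\;\;\hat{L}(Q)\;+\;\beta_1\,I(X_1;C_1)\;+\;\beta_2\,I(X_2;C_2).
```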

26 13×6 Clusters
[Plot: Mean Absolute Error — bound vs. test MAE]

27 50×50 Clusters
[Plot: Mean Absolute Error — bound vs. test MAE]

28 283×283 Clusters
[Plot: Mean Absolute Error — bound vs. test MAE]

29 PAC-Bayesian Bound for Discrete Density Estimation
– X – sample space, H – hypothesis space
– Each h ∈ H is a function h: X → Z; Z is finite
– p_h(Z) = P_{X~p(X)}{h(X) = Z}

30 PAC-Bayesian Bound for Discrete Density Estimation
– X – sample space, H – hypothesis space
– Each h ∈ H is a function h: X → Z; Z is finite
– p_h(Z) = P_{X~p(X)}{h(X) = Z}
– P – prior over H, Q – posterior over H
– p_Q(Z) = E_{Q(h)} p_h(Z)

31 PAC-Bayesian Bound for Discrete Density Estimation
– X – sample space, H – hypothesis space
– Each h ∈ H is a function h: X → Z; Z is finite
– p_h(Z) = P_{X~p(X)}{h(X) = Z}
– P – prior over H, Q – posterior over H
– p_Q(Z) = E_{Q(h)} p_h(Z)
PAC-Bayesian Bound for Density Estimation:
– With probability ≥ 1−δ, for all Q simultaneously:
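The bound equation was an image in the original. A plausible reconstruction, consistent with the moment bound E_S e^{Φ(h)} ≤ (N+1)^{|Z|−1} used in the proof sketch on slides 33–36 (the exact form in the paper may differ slightly):

```latex
\mathrm{KL}\bigl(\hat{p}_Q(Z)\,\big\|\,p_Q(Z)\bigr)\;\le\;
\frac{D(Q\|P)+(|Z|-1)\ln(N+1)-\ln\delta}{N},
```

where p̂_Q(Z) = E_{Q(h)} p̂_h(Z) and p̂_h is the empirical distribution of h(X) over an i.i.d. sample of size N.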

32 PAC-Bayesian Bound for Classification
Density estimation: with probability ≥ 1−δ, the bound from slide 31 holds.
Classification = density estimation of the error variable
If Z is the error variable, then |Z| = 2 and p_Q(Z) = L(Q):
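With |Z| = 2, the previous bound specializes (under the same caveat that the constants are reconstructed, not transcribed) to the familiar PAC-Bayes-kl classification bound:

```latex
\mathrm{kl}\bigl(\hat{L}(Q)\,\big\|\,L(Q)\bigr)\;\le\;
\frac{D(Q\|P)+\ln(N+1)-\ln\delta}{N}.
```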

33 Proof Idea
Change of measure inequality: E_{Q(h)} Φ(h) ≤ D(Q||P) + ln E_{P(h)} e^{Φ(h)}

34 Proof Idea
Change of measure inequality: E_{Q(h)} Φ(h) ≤ D(Q||P) + ln E_{P(h)} e^{Φ(h)}
Choose Φ(h) [equation not recoverable from the transcript]
Show that E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1), where S = {X_1, …, X_N}

35 Proof Idea
Change of measure inequality: E_{Q(h)} Φ(h) ≤ D(Q||P) + ln E_{P(h)} e^{Φ(h)}
Choose Φ(h) [equation not recoverable from the transcript]
Show that E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1), where S = {X_1, …, X_N}
E_S E_{P(h)} e^{Φ(h)} = E_{P(h)} E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1)

36 Proof Idea
Change of measure inequality: E_{Q(h)} Φ(h) ≤ D(Q||P) + ln E_{P(h)} e^{Φ(h)}
Choose Φ(h) [equation not recoverable from the transcript]
Show that E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1), where S = {X_1, …, X_N}
E_S E_{P(h)} e^{Φ(h)} = E_{P(h)} E_S e^{Φ(h)} ≤ (N+1)^(|Z|−1)
By Markov's inequality, w.p. ≥ 1−δ:
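Putting the steps together (a sketch; the chosen function is presumably Φ(h) = N·kl(p̂_h || p_h), whose definition was an image in the original): applying Markov's inequality to E_S E_{P(h)} e^{Φ(h)} ≤ (N+1)^{|Z|−1} gives, with probability ≥ 1−δ over S, E_{P(h)} e^{Φ(h)} ≤ (N+1)^{|Z|−1}/δ, and substituting into the change-of-measure inequality yields

```latex
\mathbb{E}_{Q(h)}\,\Phi(h)\;\le\;D(Q\|P)+\ln\frac{(N+1)^{|Z|-1}}{\delta}
\qquad\text{simultaneously for all }Q.
```

Dividing by N and using convexity of the KL divergence to move the expectation over Q(h) inside gives the bound on slide 31.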

37 A Bound on

38

39

40 Density Estimation with Co-clustering
Given: a sample of (X1, X2) co-occurrence pairs
Estimate: p(X1,X2)
Minimize: −E_{p(X1,X2)} ln q(X1,X2)

41 Density Estimation with Co-clustering
Model: Q = {q(C1|X1), q(C2|X2)}
[Figure: graphical model X1 – C1 – C2 – X2, with cluster assignments q1(C1|X1), q2(C2|X2) over the X1 × X2 matrix]

42 Density Estimation with Co-clustering
Estimator:
[Figure: graphical model X1 – C1 – C2 – X2, with q1(C1|X1) and q2(C2|X2)]
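The estimator equation was an image. A plausible form, consistent with the graphical model X1 – C1 – C2 – X2 on slide 41 (an assumption, not a transcription):

```latex
q(X_1,X_2)\;=\;\sum_{C_1,C_2} q(X_1|C_1)\,q(C_1,C_2)\,q(X_2|C_2),
```

where q(Xi|Ci) is obtained from q(Ci|Xi) by Bayes' rule with the uniform marginal q(xi) = 1/|Xi|, and q(C1,C2) is a distribution over cluster pairs estimated from the data.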

43 Density Estimation with Co-clustering
Model: Q = {q(C1|X1), q(C2|X2)}
With probability ≥ 1−δ:

44 Density Estimation with Co-clustering
With probability ≥ 1−δ:
Related work
– Information-Theoretic Co-clustering [Dhillon et al. ’03]: maximize I(C1;C2) alone
– The PAC-Bayesian approach provides regularization and model-order selection

45 Graphical Models
[Figure: two graphical models — discriminative prediction over X1, X2, C1, C2, Y; density estimation over X1, X2, C1, C2]

46 Tree-Shaped Graphical Models
Generalization bound:
– Trade-off between empirical performance and the mutual information propagated up the tree
Future work:
– Optimization algorithms
– Applications
– More general graphical models
[Figure: a tree over X1–X4, cluster variables C1–C4, intermediate variables D1, D2, and label Y]

47 Graph Clustering, Pairwise Clustering
The weights of the links are generated according to:
q(w_ij | Xi, Xj) = Σ_{Ca,Cb} q(w_ij | Ca, Cb) q(Ca | Xi) q(Cb | Xj)
This is the co-clustering model with a shared q(C|X)
– The same bounds and algorithms apply

48 Summary of Main Contributions
PAC-Bayesian analysis of unsupervised learning
– Co-clustering
– Tree-shaped graphical models
– Graph clustering, pairwise clustering
PAC-Bayesian bound for discrete density estimation
Combinatorial priors → mutual-information regularization terms

49 Future Directions
PAC-Bayesian analysis of unsupervised learning
– Clustering
– Continuous density estimation
– General graphical models
PAC-Bayesian analysis of reinforcement learning
References: Seldin and Tishby, ICML 2008; AISTATS 2009; JMLR 2010

