Multivariate Information Bottleneck
Noam Slonim (Princeton University, Lewis-Sigler Institute for Integrative Genomics)
Nir Friedman, Naftali Tishby (Hebrew University, School of Computer Science and Engineering)

2 Multivariate Information Bottleneck - Preview
- A general framework for specifying a new family of clustering problems
- Almost all of these problems are not treated by standard clustering approaches
- Insights into, and demonstrations of, why these problems are important
- A general optimal solution for all these problems, based on a single information-theoretic principle
- Applications to text analysis, gene expression data and more...

3 Multivariate IB – introduction
- Second half of the talk starts here…
- Summary so far: a well-defined method, formulated as a variational principle, with three different algorithmic approaches; however, it was limited to one specific optimization problem.
- We could think of other problems (e.g. symmetric compression), and in what follows we describe an extension of the first half that handles a much richer family of problems (still work in progress).
- Original IB: compressing one variable while preserving the information about some other single variable.

4 Multivariate IB – introduction (cont.)
- However, we could also think of other problems, e.g. symmetric compression.
- Question: how can we formulate and solve all such problems under one unifying principle?

5 (A few words about…) Bayesian Networks
- A Bayes net over $(X_1,\dots,X_n)$ is a DAG $G$ in which vertices correspond to the random variables.
- $P(X_1,\dots,X_n)$ is consistent with $G$ iff each $X_i$ is independent of all the other (non-descendant) variables, given its parents $\mathbf{Pa}_i$.

6 Multi-information and Bayes nets
- The information that $X_1,\dots,X_n$ contain about each other is captured by the multi-information:
  $\mathcal{I}(X_1;\dots;X_n) = D_{KL}\!\left[\,P(X_1,\dots,X_n)\;\|\;P(X_1)\cdots P(X_n)\,\right]$
- If $P(X_1,\dots,X_n)$ is consistent with $G$, then:
  $\mathcal{I}(X_1;\dots;X_n) = \sum_i I(X_i;\mathbf{Pa}_i)$
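As a concrete illustration of these two formulas (not part of the original slides), here is a small NumPy sketch that computes the multi-information of a toy joint distribution and checks the decomposition for a chain DAG X1 -> X2 -> X3; the distribution and all names are invented for the example.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for arrays of the same shape (nats)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def multi_information(joint):
    """Multi-information I(X_1;...;X_n) = KL(P(X_1..X_n) || prod_i P(X_i))."""
    n = joint.ndim
    prod_marginals = np.ones_like(joint)
    for i in range(n):
        axes = tuple(j for j in range(n) if j != i)
        marginal = joint.sum(axis=axes)
        shape = [1] * n
        shape[i] = -1
        prod_marginals = prod_marginals * marginal.reshape(shape)
    return kl(joint, prod_marginals)

def mutual_information(joint2d):
    """I(X;Y) for a 2-D joint distribution."""
    px = joint2d.sum(axis=1, keepdims=True)
    py = joint2d.sum(axis=0, keepdims=True)
    return kl(joint2d, px * py)

# A joint consistent with the chain DAG X1 -> X2 -> X3:
# P(x1, x2, x3) = P(x1) P(x2|x1) P(x3|x2)
rng = np.random.default_rng(0)
p1 = rng.dirichlet(np.ones(3))
p2_given_1 = rng.dirichlet(np.ones(4), size=3)      # shape (3, 4)
p3_given_2 = rng.dirichlet(np.ones(2), size=4)      # shape (4, 2)
joint = np.einsum('a,ab,bc->abc', p1, p2_given_1, p3_given_2)

# Multi-information vs. the sum over edges I(X1;X2) + I(X2;X3)
mi_total = multi_information(joint)
mi_edges = mutual_information(joint.sum(axis=2)) + mutual_information(joint.sum(axis=0))
print(mi_total, mi_edges)   # the two numbers agree (up to float error)
```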

7 Original IB through Bayes net formulation
- New generalized formulation: minimize $\mathcal{L}^{(1)} = \mathcal{I}^{G_{in}} - \beta\,\mathcal{I}^{G_{out}}$
- Which in this case means: $\mathcal{L}^{(1)} = I(X;Y) + I(T;X) - \beta\, I(T;Y)$, where $I(X;Y)$ is a constant, $I(T;X)$ is "what compresses what", and $I(T;Y)$ is "what predicts what".

8 Alternative formulation: preliminaries
- For a given DAG $G$, define $\mathcal{I}^{G} \equiv \sum_i I(X_i;\mathbf{Pa}_i^{G})$.
- For a $P$ which is consistent with $G_{in}$: $\mathcal{I}^{G_{in}}$ is the real multi-information in $P(\mathbf{X},\mathbf{T})$, while $\mathcal{I}^{G_{out}}$ is the multi-information as though $P(\mathbf{X},\mathbf{T})$ were consistent with $G_{out}$.

9 Alternative formulation for original IB
- Alternative formulation: minimize the violations of the independencies that $G_{out}$ demands, $\mathcal{L}^{(2)} = \mathcal{I}^{G_{in}} + \gamma\, D_{KL}\!\left[P \,\|\, P^{G_{out}}\right]$, i.e. keep the actual distribution as close as possible to one that satisfies the desired independencies.
- Which in this case means: minimize $I(T;X) + \gamma\, I(X;Y \mid T)$, up to the constant $I(X;Y)$.
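One way to see why the gap between the two quantities of the previous slide measures "independency violations" is the following short derivation, added here to fill in the reasoning; the notation $P^G$ for the projection of $P$ onto $G$ is an assumption, not copied from the slides.

```latex
% For any P over (X_1,...,X_n) and a DAG G over the same variables, let
%   P^G(x_1,...,x_n) = \prod_i P(x_i | pa_i^G)
% be its projection onto G. Then the violation of G's independencies is
D_{KL}\!\left[P \,\middle\|\, P^{G}\right]
  = \mathbb{E}_{P}\!\left[\log \frac{P(X_1,\dots,X_n)}{\prod_i P(X_i \mid \mathbf{Pa}_i^{G})}\right]
  = \mathcal{I}(X_1;\dots;X_n) - \sum_i I(X_i;\mathbf{Pa}_i^{G})
  = \mathcal{I} - \mathcal{I}^{G}.
```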

10 Comparing the two principles
- Given $G_{in}$, different choices of $G_{out}$ will yield different optimization problems…
- Given $G_{in}$ and $G_{out}$, each principle will yield a different optimization problem…
[Figure: the original IB problem shown as an example.]

11 Comparing the two principles (cont.)

12 Beyond the original IB [Slonim, Friedman, Tishby]
[Schematic: the input variables, the compression (bottleneck) variables and the parameters, with G_in specifying the dependencies to minimize and G_out the dependencies to maximize.]

13 A simple example: Symmetric IB
[Figure: G_in and G_out for symmetric IB, annotated with "what compresses what" and "what predicts what": T_1 compresses X_1 and T_2 compresses X_2, while T_1 and T_2 are required to predict each other.]
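Assuming the standard choice of networks for symmetric IB under the first principle (the exact functional is not legible in the transcript), the trade-off presumably reads:

```latex
\mathcal{L}^{(1)}_{\mathrm{sym}}
  = \underbrace{I(T_1;X_1) + I(T_2;X_2)}_{\text{what compresses what}}
  \;-\; \beta\, \underbrace{I(T_1;T_2)}_{\text{what predicts what}}
  \;(+\ \mathrm{const.})
```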

14 A multivariate formal optimal solution
- The optimal solution keeps the form of the original IB: $P(t_j \mid \mathbf{pa}_j) \propto P(t_j)\, \exp\!\left(-\beta\, d(\mathbf{pa}_j, t_j)\right)$, where now $d(\mathbf{pa}_j, t_j)$ is a generalized (KL-based) distortion measure…
- For example, in symmetric IB: [slide shows the corresponding KL distortion for this case].
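For concreteness, here is a minimal sketch of these self-consistent updates in the special case of the original IB, where d(x, t) = KL[p(y|x) || p(y|t)]; the multivariate case swaps in the generalized distortion. The function names and the toy distribution are invented, and this is an illustration rather than the authors' implementation.

```python
import numpy as np

def kl_rows(p, q):
    """Row-wise KL divergence D[p_i || q_i] for two (n, k) arrays of distributions."""
    ratio = np.where(p > 0, p / np.clip(q, 1e-12, None), 1.0)
    return np.sum(np.where(p > 0, p * np.log(ratio), 0.0), axis=1)

def iterative_ib(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Self-consistent updates for the original IB:
       p(t|x) is proportional to p(t) * exp(-beta * KL[p(y|x) || p(y|t)])."""
    rng = np.random.default_rng(seed)
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)                                       # p(x)
    p_y_given_x = p_xy / p_x[:, None]                            # p(y|x)
    q_t_given_x = rng.dirichlet(np.ones(n_clusters), size=n_x)   # random soft assignment
    for _ in range(n_iter):
        q_t = q_t_given_x.T @ p_x                                # p(t)
        p_ty = q_t_given_x.T @ p_xy                              # p(t, y)
        p_y_given_t = p_ty / np.clip(q_t[:, None], 1e-12, None)  # p(y|t)
        # distortion d(x, t) = KL[p(y|x) || p(y|t)], one column per cluster t
        d = np.stack([kl_rows(p_y_given_x, np.tile(p_y_given_t[t], (n_x, 1)))
                      for t in range(n_clusters)], axis=1)
        logits = np.log(np.clip(q_t, 1e-12, None))[None, :] - beta * d
        logits -= logits.max(axis=1, keepdims=True)              # numerical stability
        q_t_given_x = np.exp(logits)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x

# Toy joint p(x, y): 4 values of X, 3 values of Y
p_xy = np.random.default_rng(1).dirichlet(np.ones(12)).reshape(4, 3)
print(iterative_ib(p_xy, n_clusters=2, beta=20.0))
```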

15 Multivariate IB algorithms – example for aIB [Slonim, Friedman, Tishby, 2002]
[Figure: agglomerative clustering of W_1,…,W_N: start from singletons, merge a pair (e.g. W_3, W_4), and continue until everything is merged into {W_1,…,W_N}; the slide shows two such merging sequences.]
- Which pair to merge? The merge cost is now a generalized (JS-based) distortion measure…
- For example, in symmetric aIB: [slide shows the corresponding JS merge cost].
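Here is a sketch of the greedy agglomerative step for the original (single-sided) aIB, whose merge cost is the weighted Jensen-Shannon divergence between the conditionals of the two candidate clusters; the symmetric variant generalizes this cost. Names and the toy joint are invented for the example.

```python
import numpy as np

def js_divergence(p1, p2, w1, w2):
    """Weighted Jensen-Shannon divergence JS_{w}[p1, p2] (nats)."""
    pbar = w1 * p1 + w2 * p2
    def kl(p, q):
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))
    return w1 * kl(p1, pbar) + w2 * kl(p2, pbar)

def merge_cost(p_t, p_y_given_t, i, j):
    """Information lost by merging clusters i and j:
       (p(t_i) + p(t_j)) * JS_{pi}[p(y|t_i), p(y|t_j)]."""
    w = p_t[i] + p_t[j]
    return w * js_divergence(p_y_given_t[i], p_y_given_t[j], p_t[i] / w, p_t[j] / w)

def agglomerative_ib(p_xy, n_clusters):
    """Greedy bottom-up clustering of the rows (X values) of p(x, y)."""
    clusters = [[x] for x in range(p_xy.shape[0])]
    p_t = p_xy.sum(axis=1)          # p(t), initially p(x)
    p_ty = p_xy.copy()              # p(t, y)
    while len(clusters) > n_clusters:
        p_y_given_t = p_ty / p_t[:, None]
        costs = [(merge_cost(p_t, p_y_given_t, i, j), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(costs)        # cheapest merge
        clusters[i] += clusters.pop(j)
        p_t[i] += p_t[j];   p_t = np.delete(p_t, j)
        p_ty[i] += p_ty[j]; p_ty = np.delete(p_ty, j, axis=0)
    return clusters

p_xy = np.random.default_rng(2).dirichlet(np.ones(18)).reshape(6, 3)
print(agglomerative_ib(p_xy, n_clusters=2))
```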

16 Symmetric aIB compression: documents, words
- [Chart: accuracy of symmetric aIB vs. original aIB over 3 small datasets.]
- Word clusters provide a more robust representation…

17 Symmetric IB through Deterministic Annealing
- Data: 20,000 messages from 20 different discussion groups [Lang, 95]
- W – a word in the corpus; C – the class (newsgroup) of the message
- P(W='bible', C='alt.atheism'): the probability that choosing a random position in the corpus would select the word 'bible' in a message of the newsgroup (class) 'alt.atheism'…
- [Figure: the words × classes joint count matrix.]
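A minimal sketch of how such a joint P(W, C) can be estimated from a labeled corpus; the tiny corpus below is an invented stand-in for the 20-newsgroups data.

```python
from collections import Counter

# Toy labeled corpus standing in for the 20-newsgroups data; the messages are invented.
corpus = [
    ("alt.atheism", "bible faith atheism debate bible"),
    ("rec.autos",   "car engine car dealer"),
    ("sci.space",   "orbit shuttle launch orbit"),
]

# Count (word, class) co-occurrences over every word position in the corpus
counts = Counter()
for label, text in corpus:
    for word in text.split():
        counts[(word, label)] += 1

total = sum(counts.values())
p_wc = {pair: n / total for pair, n in counts.items()}   # joint P(W=w, C=c)

# e.g. the probability that a random position holds 'bible' in an 'alt.atheism' message
print(p_wc[("bible", "alt.atheism")])   # 2/13 for this toy corpus
```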

18 Symmetric IB through Deterministic Annealing
[Figure: the joint distribution P(T_C, T_W) over newsgroup clusters and word clusters at a given value of β.]

19 Symmetric IB through Deterministic Annealing
[Figure: P(T_C, T_W) as β increases. One newsgroup cluster contains alt.atheism, rec.autos, rec.motorcycles, rec.sport.*, sci.med, sci.space, soc.religion.christian and talk.politics.*, with characteristic words such as car, turkish, game, team, jesus, gun, hockey, …; the other contains comp.*, misc.forsale, sci.crypt and sci.electronics, with words such as x, file, image, encryption, window, dos, mac, …]

20 Symmetric IB through Deterministic Annealing
[Figure: P(T_C, T_W) at a larger β. The newsgroup cluster containing comp.graphics, comp.os.ms-windows.misc, comp.windows.x, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, misc.forsale, sci.crypt and sci.electronics is shown with word clusters such as {windows, image, window, jpeg, graphics, …} and {encryption, db, ide, escrow, monitor, …}]

21 Symmetric IB through Deterministic Annealing
[Figure: P(T_C, T_W) at a further stage of the annealing.]

22 Symmetric IB through Deterministic Annealing
[Figure: P(T_C, T_W) at a larger β. One newsgroup cluster contains alt.atheism, rec.sport.baseball, rec.sport.hockey, soc.religion.christian, talk.politics.mideast and talk.religion.misc, with words such as armenian, turkish, jesus, hockey, israeli, armenians, …; another contains rec.autos, rec.motorcycles, sci.med, sci.space, talk.politics.guns and talk.politics.misc, with words such as car, q, gun, bike, fbi, health, …]

23 Symmetric IB through Deterministic Annealing
[Figure: P(T_C, T_W) at a further stage of the annealing.]

24 Symmetric IB through Deterministic Annealing
[Figure: P(T_C, T_W) at a further stage of the annealing.]

25 Symmetric IB through Deterministic Annealing
[Figure: P(T_C, T_W) at a late stage: the newsgroup cluster containing alt.atheism, soc.religion.christian and talk.religion.misc is matched with the word cluster {atheists, christianity, jesus, bible, sin, faith, …}]

26 Symmetric aIB compression: genes, samples
- Data: gene expression of 500 "informative" genes vs. 72 leukemia samples (Golub et al., 1999)
- [Figure: the genes × samples expression matrix.]

27 Symmetric aIB compression: genes, samples
Data after symmetric aIB compression:
[Figure: the matrix reduced to 10 gene clusters × 8 sample clusters; sample clusters align with annotations such as ALL B-cell (hosp1), ALL T-cell (hosp1), Male, BM B-cell, AML (hosp2) and AML (hosp3), and the gene clusters are labeled by representative probes (X00437_s_at, M12886_at, X76223_s_at, M59807_at, U23852_s_at, D00749_s_at, U89922_s_at, X03934_at, U50743_at, M21624_at, M28826_at, M37271_s_at, X59871_at, X14975_at, M16336_s_at, L05148_at, M28825_at).]

28 Another example: parallel IB
- Consider a document collection with different topics and different writing styles.
[Figure: a 'Science' document collection with documents from topic1 to topic4.]

29 Another example: parallel IB (cont.)
- One possible "legitimate" partition is by the topic.
[Figure: the documents grouped into four clusters, Topic1, Topic2, Topic3, Topic4.]

30 Another example: parallel IB (cont.)
- And another possible "legitimate" partition is by the writing style.
[Figure: the same documents grouped into three clusters, Style1, Style2, Style3, each mixing the different topics.]
- There might be more than one "legitimate" partition…

31 Parallel IB: solution
[Figure: the G_in network ("minimize dependencies") and the G_out network ("maximize dependencies") for parallel IB, and the resulting effective distortion.]

32 Parallel sIB: Text analysis results
- Data: ~1,500 "documents" taken from E. R. Burroughs (The Beasts of Tarzan, The Gods of Mars) and R. Kipling (The Jungle Book, Rewards and Fairies)
- X_1 corresponds to "documents", X_2 corresponds to words
[Table: counts of documents from each of the four books in the clusters of the two parallel partitions (T_1,a / T_1,b and T_2,a / T_2,b), cross-tabulated against the two authors (Burroughs, Kipling).]

33 Parallel sIB: Gene Expression data results
- Data: gene expression of 500 "informative" genes vs. 72 leukemia samples (Golub et al., 1999)
- X_1 corresponds to samples, X_2 corresponds to genes
[Table: sample counts in the clusters of the parallel partitions T_1,…,T_4, cross-tabulated against the ALL / AML and B-cell / T-cell annotations.]

34 Another Example: Triplet IB
- Consider the following sequence data: s(1) s(2) s(3) … s(t-1) s(t) s(t+1) …
- Can we extract features such that their combination is informative about a symbol between them?
[Figure: the preceding symbol X_p and the following symbol X_n are compressed into T_p and T_n, which together should be informative about the middle symbol X_m.]

35 Triplet IB: solution
[Figure: the G_in network ("minimize dependencies") and the G_out network ("maximize dependencies") for triplet IB.]
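Under the first principle, and assuming the G_in/G_out structure sketched on the previous slides (this is a reconstruction, not the slide's own formula), the triplet IB trade-off would read:

```latex
\mathcal{L}^{(1)}_{\mathrm{triplet}}
  = I(T_p;X_p) + I(T_n;X_n) - \beta\, I(T_p,T_n;X_m) \;(+\ \mathrm{const.})
```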

36 Triplet IB
- Data (E. R. Burroughs, "Tarzan the Terrible"): "… As Tarzan ascended the platform his eyes narrowed angrily at the sight which met them… 'What means this?' he cried angrily…"
- 1st word in triplet: X_p; 2nd word in triplet: X_m; 3rd word in triplet: X_n
- X_m = {apemans, apes, eyes, girl, great, jungle, tarzan, time, two, way}
- Data: Tarzan and the Jewels of Opar, Tarzan of the Apes, Tarzan the Terrible, Tarzan the Untamed, The Beasts of Tarzan, The Jungle Tales of Tarzan, The Return of Tarzan
- Joint distribution P(X_p, X_m, X_n) of dimension 90 x 10 x 233
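A small sketch of how the triplet statistics can be collected from running text; the snippet and the reduced middle-word vocabulary are invented stand-ins for the Tarzan corpus described above.

```python
from collections import Counter

# Toy text standing in for the Tarzan corpus used on the slide
text = "as tarzan ascended the platform his eyes narrowed angrily at the sight"
words = text.split()
middle_vocab = {"eyes", "tarzan"}      # stand-in for the 10 middle words X_m on the slide

# Collect (previous word, middle word, next word) triplets whose middle word is in X_m
triplets = Counter()
for i in range(1, len(words) - 1):
    if words[i] in middle_vocab:
        triplets[(words[i - 1], words[i], words[i + 1])] += 1

total = sum(triplets.values())
p_pmn = {t: n / total for t, n in triplets.items()}   # empirical joint P(X_p, X_m, X_n)
print(p_pmn)
```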

37 Triplet sIB: Text analysis results
- Given X_p and X_n, two schemes to predict the middle word:
  - word-based: X_m = argmax P(x_m' | x_p, x_n)
  - cluster-based: X_m = argmax P(x_m' | t_p, t_n)
- Test on a NEW sequence, "The Son of Tarzan":
[Table: precision and recall (%) of the two schemes per middle word X_m; transcribed row values: Apes (78): 14, 17, 26, 43; Eyes (177): 28, 32, 81, 83; Girl (240): 1, 5, 30, 43; Great (219): 48, 50, 92; Jungle (241): 24, 27, 54, 49; Tarzan (48): 25, 40, 67, 41; Time (145): 26, 48, 82, 70; Two (148): 8, 11, 92, 41; Way (101): 21, 28, 81, 60; Average: 22, 28, 55, 53.]
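A minimal sketch contrasting the two prediction schemes; the cluster assignments and counts below are hypothetical, chosen only to show why the (t_p, t_n) scheme can generalize to word pairs never seen in training.

```python
from collections import Counter, defaultdict

def build_predictor(triplet_counts, map_p=None, map_n=None):
    """Return a function x_m = argmax P(x_m | context), where the context is either
       the raw pair (x_p, x_n) or the cluster pair (t_p, t_n) when maps are given."""
    table = defaultdict(Counter)
    for (xp, xm, xn), n in triplet_counts.items():
        key = (map_p[xp] if map_p else xp, map_n[xn] if map_n else xn)
        table[key][xm] += n
    def predict(xp, xn):
        key = (map_p[xp] if map_p else xp, map_n[xn] if map_n else xn)
        return table[key].most_common(1)[0][0] if key in table else None
    return predict

# Hypothetical training counts and cluster assignments
train = Counter({("his", "eyes", "narrowed"): 3, ("her", "eyes", "widened"): 2,
                 ("the", "great", "ape"): 4})
t_p = {"his": "possessive", "her": "possessive", "the": "article"}
t_n = {"narrowed": "verb", "widened": "verb", "ape": "noun"}

word_based    = build_predictor(train)                 # x_m = argmax P(x_m | x_p, x_n)
cluster_based = build_predictor(train, t_p, t_n)       # x_m = argmax P(x_m | t_p, t_n)

print(word_based("his", "widened"))      # None: this exact word pair was never seen
print(cluster_based("his", "widened"))   # 'eyes': the cluster pair generalizes
```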

38 Summary
- The IB method is a principled framework for extracting "informative" structure out of a joint distribution P(X_1, X_2).
- The multivariate IB extends this framework to extract "informative" structure from more complex joint distributions, P(X_1,…,X_n), in various ways.
- This enables us to define and solve a new family of optimization problems under a single unifying information-theoretic principle.
- "Clustering" conceals a family of distinct problems which deserve special consideration. The multivariate IB framework makes it possible to define these sub-problems, solve them, and demonstrate their importance.