DATA MINING LECTURE 7: Minimum Description Length Principle, Information Theory, Co-Clustering


MINIMUM DESCRIPTION LENGTH

Occam's razor. Most data mining tasks can be described as creating a model for the data; e.g., the EM algorithm models the data as a mixture of Gaussians, and K-means models the data as a set of centroids. Model vs. hypothesis. What is the right model? Occam's razor: all other things being equal, the simplest model is the best. A good principle for life as well.

Occam's razor and MDL. What is a simple model? Minimum Description Length principle: every model provides a (lossless) encoding of our data, and the model that gives the shortest encoding (best compression) of the data is the best. Related: Kolmogorov complexity, i.e., the length of the shortest program that produces the data (uncomputable). MDL restricts the family of models considered. Encoding cost: the cost for party A to transmit the data to party B.

Minimum Description Length (MDL). The description length consists of two terms: the cost of describing the model (model cost) and the cost of describing the data given the model (data cost): L(D) = L(M) + L(D|M). There is a tradeoff between the two costs: very complex models describe the data in a lot of detail but are expensive to describe, while very simple models are cheap to describe but require a lot of work to describe the data given the model.

Example. Regression: find the polynomial that describes the data. This is a tradeoff between the complexity of the model and the goodness of fit. [Figure, from Grünwald et al. (2005), Advances in Minimum Description Length: Theory and Applications: three fits of the same points; one has low model cost but high data cost (underfit), one has high model cost but low data cost (overfit), and one has low model cost and low data cost.] MDL avoids overfitting automatically!
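A minimal two-part-code sketch of this example in Python, under assumed scoring choices (a fixed 16 bits per stored polynomial coefficient for the model cost, and the Gaussian code length of the residuals for the data cost); the constants are illustrative, not the scheme used by Grünwald et al.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)   # quadratic + noise

def two_part_mdl(degree, bits_per_coeff=16):
    """Crude two-part code: L(M) = bits to store the coefficients,
    L(D|M) = Gaussian code length of the residuals, in bits."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(residuals.var(), 1e-12)
    model_cost = bits_per_coeff * (degree + 1)
    data_cost = 0.5 * x.size * np.log2(2 * np.pi * np.e * sigma2)
    return model_cost + data_cost

for d in (1, 2, 5, 10, 20):
    print(d, round(two_part_mdl(d), 1))
# The total is smallest near the true degree (2): higher degrees keep shrinking
# the data cost a little, but pay much more in model cost.
```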

MDL and Data Mining. Why does the shorter encoding make sense? A shorter encoding implies regularities in the data, regularities in the data imply patterns, and patterns are interesting. Example: a sequence that just repeats the same block 12 times has a short description ("repeat it 12 times"); a random sequence has no patterns and cannot be compressed (see the sketch below).
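A quick way to see "regularity implies compressibility" is to run a general-purpose compressor as a rough stand-in for an ideal code; the two sequences below are made up for this illustration.

```python
import os
import zlib

pattern = b"ABC" * 100           # highly regular: "repeat ABC 100 times"
noise = os.urandom(300)          # no structure for the compressor to exploit

print(len(zlib.compress(pattern)))   # far smaller than 300 bytes
print(len(zlib.compress(noise)))     # roughly 300 bytes, possibly slightly more
```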

MDL and Clustering. If we have a clustering of the data, we can transmit the clusters instead of the data: we need to transmit the description of the clusters, and then the data within each cluster. If we have a good clustering, the transmission cost is low. Why? What happens if all elements of a cluster are identical? What happens if we have very few elements per cluster? Homogeneous clusters are cheaper to encode, but we should not have too many of them.

Issues with MDL. What is the right model family? This determines the kind of solutions that we can have, e.g., polynomials or clusterings. What is the encoding cost? This determines the function that we optimize, and is where information theory comes in.

INFORMATION THEORY A short introduction

Encoding. Consider the following sequence: AAABBBAAACCCABACAABBAACCABAC. Suppose you wanted to encode it in binary form; how would you do it? A is 50% of the sequence, so we should give it a shorter representation. For the frequencies 50% A, 25% B, 25% C, use the code A → 0, B → 10, C → 11. This is actually provably the best encoding!

Encoding. The code A → 0, B → 10, C → 11 is a prefix code: no codeword is a prefix of another, so the encoded stream is uniquely and directly decodable. Moreover, for every uniquely decodable code we can find a prefix code with the same codeword lengths.
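A small sketch that encodes the sequence above with this prefix code, decodes it greedily (possible precisely because no codeword is a prefix of another), and compares the average code length with the entropy of the empirical symbol frequencies.

```python
from math import log2

seq = "AAABBBAAACCCABACAABBAACCABAC"
code = {"A": "0", "B": "10", "C": "11"}

encoded = "".join(code[s] for s in seq)

# Prefix property: no codeword is a prefix of another, so we can decode greedily.
inverse = {v: k for k, v in code.items()}
decoded, buf = [], ""
for bit in encoded:
    buf += bit
    if buf in inverse:
        decoded.append(inverse[buf])
        buf = ""
assert "".join(decoded) == seq

# Average bits per symbol vs. entropy of the empirical symbol distribution.
freqs = {s: seq.count(s) / len(seq) for s in set(seq)}
avg_len = len(encoded) / len(seq)
entropy = -sum(p * log2(p) for p in freqs.values())
print(avg_len, entropy)   # both 1.5: the code meets the entropy bound exactly
```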

Entropy
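Only the slide title remains here; for reference, the standard definition (presumably what the slide showed), applied to the A/B/C example above:

```latex
H(X) = -\sum_{x} p(x)\,\log_2 p(x),
\qquad
H\!\left(\tfrac{1}{2},\tfrac{1}{4},\tfrac{1}{4}\right)
  = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2
  = 1.5 \text{ bits per symbol.}
```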

Claude Shannon, the father of information theory, envisioned the idea of communicating information with 0/1 bits and introduced the word "bit". The word "entropy" was suggested by von Neumann, for its similarity to the physics notion, but also because "nobody really knows what entropy really is, so in any conversation you will have an advantage".

Some information theoretic measures
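Only the title remains on this slide as well; the quantities most likely intended, and the ones used later in the lecture (H(p) in the co-clustering cost, the "KL-divergences" in the shuffle step), are, in standard notation:

```latex
\begin{align*}
H(X)        &= -\sum_{x} p(x)\log_2 p(x)                        && \text{(entropy)}\\
H(X \mid Y) &= -\sum_{x,y} p(x,y)\log_2 p(x \mid y)             && \text{(conditional entropy)}\\
I(X;Y)      &= H(X) - H(X \mid Y)                               && \text{(mutual information)}\\
D_{\mathrm{KL}}(p \,\|\, q) &= \sum_{x} p(x)\log_2 \frac{p(x)}{q(x)} && \text{(Kullback-Leibler divergence)}
\end{align*}
```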

USING MDL FOR CO-CLUSTERING (CROSS-ASSOCIATIONS) Thanks to Spiros Papadimitriou.

Co-clustering: the simultaneous grouping of the rows and columns of a matrix into homogeneous groups. [Figure: a Customers × Products matrix reordered into customer groups and product groups, with block densities such as 96%, 54%, and 3%; dense blocks correspond to patterns such as students buying books and CEOs buying BMWs.]

Co-clustering Step 1: How to define a “good” partitioning? Intuition and formalization Step 2: How to find it?

Co-clustering intuition. A good clustering (1) groups similar nodes together and (2) uses as few groups as necessary; this implies a few, homogeneous blocks, which implies good compression. [Figure: row and column groups of one matrix shown versus an alternative grouping.] Why is this better?

Co-clustering MDL formalization: cost objective. Consider an n × m binary matrix with k = 3 row groups of sizes n_1, n_2, n_3 and ℓ = 3 column groups of sizes m_1, m_2, m_3, and let p_{i,j} be the density of ones in block (i, j). Total bits = model cost + data cost, where the model cost is log* k + log* ℓ (transmit the number of partitions) + the row-partition and column-partition descriptions + Σ_{i,j} log(n_i m_j) bits to transmit the number of ones e_{i,j} in each block, and the data cost is Σ_{i,j} n_i m_j H(p_{i,j}), i.e., block size times block entropy; for example, block (1,2) costs n_1 m_2 H(p_{1,2}) bits.
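A minimal Python sketch of this cost, assuming a dense binary NumPy matrix and given row and column group assignments. The partition-description term is coded as a simple upper bound (one group id per row and per column), so the exact constants are illustrative rather than the paper's precise header encoding.

```python
import numpy as np

def log_star(n):
    """Rissanen's universal code length log*(n) for a positive integer, in bits."""
    total, x = np.log2(2.865064), float(n)
    while x > 1.0:
        x = np.log2(x)
        total += x
    return total

def h(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def cocluster_cost(A, row_groups, col_groups):
    """Total bits = model cost (k, l, partitions, per-block #ones) + data cost."""
    k, l = row_groups.max() + 1, col_groups.max() + 1
    n_i = np.bincount(row_groups, minlength=k)      # rows per row group
    m_j = np.bincount(col_groups, minlength=l)      # columns per column group
    model = log_star(k) + log_star(l)               # transmit #partitions
    # Partition descriptions, as a simple upper bound: one group id per row/column.
    model += A.shape[0] * np.log2(k) + A.shape[1] * np.log2(l)
    data = 0.0
    for i in range(k):
        for j in range(l):
            size = n_i[i] * m_j[j]
            if size == 0:
                continue
            e_ij = A[row_groups == i][:, col_groups == j].sum()
            model += np.log2(size + 1)              # transmit #ones e_ij in block (i,j)
            data += size * h(e_ij / size)           # block size times block entropy
    return model + data

# Example: a block-diagonal 4x3 matrix with k = l = 2 (data cost is zero here).
A = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]])
print(cocluster_cost(A, np.array([0, 0, 1, 1]), np.array([0, 0, 1])))
```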

Co-clustering MDL formalization: cost objective. Total cost = description cost (block structure) + code cost (block contents). With one row group and one column group, the description cost is low but the code cost is high; with n row groups and m column groups, the code cost is low but the description cost is high.

Co-clustering MDL formalization: cost objective. With k = 3 row groups and ℓ = 3 column groups, both the description cost (block structure) and the code cost (block contents) are low.

Co-clustering MDL formalization: cost objective. [Plot: total bit cost versus the number of groups k and ℓ; the cost is high with one row group and one column group, high with n row groups and m column groups, and lowest at k = 3 row groups and ℓ = 3 column groups.]

Co-clustering Step 1: How to define a “good” partitioning? Intuition and formalization Step 2: How to find it?

Search for a solution. Overview: assignments with a fixed number of groups (shuffles). Starting from the original groups, alternate row shuffles and column shuffles: a row shuffle reassigns all rows while holding the column assignments fixed, and a column shuffle reassigns all columns while holding the row assignments fixed. If a shuffle brings no cost improvement, it is discarded.

Search for a solution (continued). Row and column shuffles are repeated in alternation; once a shuffle brings no cost improvement it is discarded, and the current assignment is the final shuffle result.

Search for a solution: shuffles. Let (Φ_I, Ψ_I) denote the row and column partitions at the I-th iteration. Fix Ψ_I and, for every row x: splice the row into ℓ parts, one for each column group, and let x_j, for j = 1, ..., ℓ, be the number of ones in each part. Assign row x to the row group i* = Φ_{I+1}(x) such that, for all i = 1, ..., k,

Σ_j [ x_j log(1/p_{i*,j}) + (m_j - x_j) log(1/(1 - p_{i*,j})) ] ≤ Σ_j [ x_j log(1/p_{i,j}) + (m_j - x_j) log(1/(1 - p_{i,j})) ],

i.e., to the row group whose blocks give the row's fragments the shortest code (a similarity of row fragments to the blocks of a row group, in the spirit of KL divergences). In the slide's example, the row is assigned to the second row group.
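A sketch of the row-shuffle step, consistent with the cost sketch above: every row is reassigned to the row group whose block densities give it the cheapest code, with the column assignments held fixed. The epsilon clipping and the 0.5 default density for empty blocks are implementation choices, not part of the original formulation.

```python
import numpy as np

def row_shuffle(A, row_groups, col_groups):
    """Reassign every row to its cheapest row group, holding column groups fixed."""
    k, l = row_groups.max() + 1, col_groups.max() + 1
    eps = 1e-9
    # Current block densities p[i, j] = fraction of ones in block (i, j).
    p = np.full((k, l), 0.5)
    for i in range(k):
        for j in range(l):
            block = A[row_groups == i][:, col_groups == j]
            if block.size:
                p[i, j] = block.mean()
    p = np.clip(p, eps, 1.0 - eps)

    m_j = np.bincount(col_groups, minlength=l)      # columns per column group
    new_groups = row_groups.copy()
    for x in range(A.shape[0]):
        # x_j = number of ones of row x falling in column group j.
        ones = np.array([A[x, col_groups == j].sum() for j in range(l)])
        # Code length of row x under each candidate row group's densities.
        costs = -(ones * np.log2(p) + (m_j - ones) * np.log2(1.0 - p)).sum(axis=1)
        new_groups[x] = int(costs.argmin())
    return new_groups
```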

Search for a solution. Overview: choosing the number of groups k and ℓ (splits and shuffles). [Figure: an example result with k = 5, ℓ = 5 groups.]

Search for a solution: splits and shuffles. A split increases k or ℓ; a shuffle rearranges rows or columns as above. Start from k = 1, ℓ = 1 and alternate column splits and row splits, running shuffles after every split: k = 1, ℓ = 2; k = 2, ℓ = 2; k = 2, ℓ = 3; k = 3, ℓ = 3; k = 3, ℓ = 4; k = 4, ℓ = 4; k = 4, ℓ = 5; k = 5, ℓ = 5. When a further split (to k = 5, ℓ = 6 or to k = 6, ℓ = 5) brings no cost improvement, it is discarded.

Search for a solution: splits and shuffles (continued). The search stops when neither splits (increasing k or ℓ) nor shuffles (rearranging rows or columns) improve the cost; in this example the final result is k = 5, ℓ = 5 groups. A sketch of this outer loop follows.
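Putting the pieces together, a high-level sketch of the outer split-and-shuffle loop, reusing cocluster_cost and row_shuffle from the sketches above. The naive_split helper (cut the largest group in half) is a deliberately simplified placeholder; the actual cross-associations algorithm chooses which group to split, and how, using the per-block entropies.

```python
import numpy as np

def naive_split(groups):
    """Simplified split: cut the largest group in two (placeholder; the paper
    picks the group and the split using per-block entropies)."""
    g = np.bincount(groups).argmax()
    idx = np.flatnonzero(groups == g)
    new = groups.copy()
    new[idx[len(idx) // 2:]] = groups.max() + 1
    return new

def col_shuffle(A, row_groups, col_groups):
    """A column shuffle is a row shuffle on the transposed matrix."""
    return row_shuffle(A.T, col_groups, row_groups)

def cross_associations(A, max_outer=20):
    """Alternate splits (grow k or l) with shuffles; keep a change only if cost drops."""
    row_g = np.zeros(A.shape[0], dtype=int)     # start with k = 1, l = 1
    col_g = np.zeros(A.shape[1], dtype=int)
    best = cocluster_cost(A, row_g, col_g)
    grow_cols = True                            # alternate column and row splits
    for _ in range(max_outer):
        cand_r = row_g if grow_cols else naive_split(row_g)
        cand_c = naive_split(col_g) if grow_cols else col_g
        # Shuffle the candidate until its cost stops dropping.
        while True:
            cand_r = row_shuffle(A, cand_r, cand_c)
            cand_c = col_shuffle(A, cand_r, cand_c)
            cost = cocluster_cost(A, cand_r, cand_c)
            if cost < best - 1e-9:
                best, row_g, col_g = cost, cand_r, cand_c
            else:
                break                           # no cost improvement: discard
        grow_cols = not grow_cols
    return row_g, col_g, best
```

A column shuffle comes for free by transposing the matrix and swapping the two group vectors, which keeps the sketch symmetric in rows and columns.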

Co-clustering CLASSIC. The CLASSIC corpus is a documents × words matrix with 3,893 documents, 4,303 words, and 176,347 "dots" (edges). It is a combination of 3 sources: MEDLINE (medical), CISI (information retrieval), and CRANFIELD (aerodynamics).

Graph co-clustering on CLASSIC. [Figure: the "CLASSIC" graph of documents and words after co-clustering, with k = 15 document groups and ℓ = 19 word groups.]

Co-clustering CLASSIC: example word groups found in the "CLASSIC" graph of documents and words (k = 15, ℓ = 19), by source. MEDLINE (medical): insipidus, alveolar, aortic, death, prognosis, intravenous; blood, disease, clinical, cell, tissue, patient. CISI (information retrieval): providing, studying, records, development, students, rules; abstract, notation, works, construct, bibliographies. CRANFIELD (aerodynamics): shape, nasa, leading, assumed, thin; paint, examination, fall, raise, leave, based.

Co-clustering CLASSIC: cluster quality. [Table: for each document cluster, the number of documents from each class (CRANFIELD, CISI, MEDLINE) and the resulting precision, with recall reported per class.]