1 Fully Automatic Cross-Associations Deepayan Chakrabarti (CMU) Spiros Papadimitriou (CMU) Dharmendra Modha (IBM) Christos Faloutsos (CMU and IBM)

2 Problem Definition: Simultaneously group customers and products, or documents and words, or users and preferences … (Figure: a Customers × Products matrix rearranged into Customer Groups × Product Groups.)

3 Problem Definition. Desiderata: 1. Simultaneously discover row and column groups; 2. Fully automatic: no “magic numbers”; 3. Scalable to large matrices.

4 Closely Related Work: Information-Theoretic Co-clustering [Dhillon+/2003], where the number of row and column groups must be specified. Desiderata: simultaneously discover row and column groups; fully automatic: no “magic numbers”; scalable to large graphs.

5 Other Related Work: K-means and variants [Pelleg+/2000, Hamerly+/2003] do not cluster rows and columns simultaneously. “Frequent itemsets” [Agrawal+/1994] require the user to specify “support”. Information Retrieval methods [Deerwester+/1990, Hofmann/1999] require choosing the number of “concepts”. Graph Partitioning [Karypis+/1998] requires the number of partitions and a measure of imbalance between clusters.

6 What makes a cross-association “good”? Good clustering means: 1. similar nodes are grouped together; 2. as few groups as necessary. The result is a few, homogeneous blocks, i.e., good compression: good clustering implies good compression.

7 Main Idea: Good Clustering implies Good Compression. For a binary matrix cut into blocks i, let p_i^1 = n_i^1 / (n_i^1 + n_i^0). Then
Total Encoding Cost = Σ_i (n_i^1 + n_i^0) · H(p_i^1)   [Code Cost]   +   Σ_i cost of describing n_i^1, n_i^0 and the groups   [Description Cost].
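To make the cost concrete, here is a minimal Python sketch (not the authors' code) of the code-cost term: it sums (n_i^1 + n_i^0) · H(p_i^1) over all blocks of a given cross-association. The names binary_entropy, code_cost, row_groups and col_groups are illustrative assumptions, and the description-cost term is deliberately left out.

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits; 0 by convention when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def code_cost(matrix, row_groups, col_groups, k, l):
    """Sum over all k*l blocks of (n1 + n0) * H(n1 / (n1 + n0)).

    matrix     : 2-D numpy array of 0s and 1s
    row_groups : array mapping each row to a group id in 0..k-1
    col_groups : array mapping each column to a group id in 0..l-1
    """
    total = 0.0
    for i in range(k):
        rows = (row_groups == i)
        for j in range(l):
            cols = (col_groups == j)
            block = matrix[np.ix_(rows, cols)]
            if block.size == 0:
                continue
            n1 = block.sum()                  # number of 1s in block (i, j)
            total += block.size * binary_entropy(n1 / block.size)
    return total
```

Minimizing the code cost alone would push k and l toward one group per row and per column; the description cost (the bits needed to transmit the group assignments and the per-block counts n_i^1, n_i^0) penalizes over-splitting, which is why the method minimizes the sum of the two.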

8 Examples: With one row group and one column group, the code cost is high but the description cost is low; with m row groups and n column groups (one group per row and per column), the code cost is low but the description cost is high. In both cases, Total Encoding Cost = Σ_i (n_i^1 + n_i^0) · H(p_i^1) [Code Cost] + Σ_i cost of describing n_i^1, n_i^0 and groups [Description Cost].

9 What makes a cross-association “good”? The cross-association with a few homogeneous blocks is better because it achieves a low Total Encoding Cost, i.e., a low code cost plus description cost.

10 Algorithms: the search grows the number of groups step by step, e.g. k=1, l=2 → k=2, l=2 → k=2, l=3 → k=3, l=3 → k=3, l=4 → k=4, l=4 → k=4, l=5 → … up to k = 5 row groups and l = 5 column groups in this example.

11 Algorithms (outer loop): start with the initial matrix; find good groups for fixed k and l; choose better values for k and l; repeat, lowering the encoding cost at each step, until the final cross-association (here k = 5, l = 5).

12 Fixed k and l: we first look at the “find good groups for fixed k and l” step of the loop above.

13 Fixed k and l. Shuffles: for each row, shuffle it to the row group which minimizes the code cost.

14 Fixed k and l. Ditto for column shuffles … and repeat …
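A rough Python sketch of one shuffle pass over the rows, reusing the import and cost conventions of the earlier sketch; shuffle_rows_once is a hypothetical name, only per-row counts per column group are really needed, and the (small) change in description cost is ignored here.

```python
def shuffle_rows_once(matrix, row_groups, col_groups, k, l):
    """One pass: move each row to the row group that minimizes its code cost."""
    eps = 1e-9                                    # avoid log(0) for empty or full blocks
    n_rows = matrix.shape[0]
    col_sizes = np.array([(col_groups == j).sum() for j in range(l)], dtype=float)
    # number of 1s of each row inside each column group
    row_counts = np.zeros((n_rows, l))
    for j in range(l):
        row_counts[:, j] = matrix[:, col_groups == j].sum(axis=1)
    # current block densities p[i, j]
    p = np.full((k, l), eps)
    for i in range(k):
        rows = (row_groups == i)
        if rows.any():
            p[i] = row_counts[rows].sum(axis=0) / (rows.sum() * np.maximum(col_sizes, 1.0))
    p = np.clip(p, eps, 1 - eps)
    # assign each row to the group whose densities give it the shortest code length
    new_groups = row_groups.copy()
    for x in range(n_rows):
        n1 = row_counts[x]                        # ones per column group for row x
        n0 = col_sizes - n1                       # zeros per column group for row x
        costs = -(n1 @ np.log2(p).T) - (n0 @ np.log2(1 - p).T)   # one cost per row group
        new_groups[x] = int(np.argmin(costs))
    return new_groups
```

Column shuffles are symmetric: run the same pass on the transposed matrix with the roles of row and column groups exchanged, and alternate the two until the cost stops dropping.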

15 Choosing k and l: next, the “choose better values for k and l” step of the loop.

16 Choosing k and l. Split: 1. Find the row group R with the maximum entropy per row. 2. Choose the rows in R whose removal reduces the entropy per row in R. 3. Send these rows to the new row group, and set k = k+1.

17 Choosing k and l. Split: similar for column groups too.
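A hedged sketch of the row-split heuristic, with hypothetical names: “entropy per row” is approximated by each row's code length under the current column grouping, and “rows whose removal reduces the entropy per row” is approximated by the rows whose cost lies above the group's average.

```python
def split_row_group(matrix, row_groups, col_groups, k, l):
    """Split the row group with the highest cost per row; returns (new_groups, new_k)."""
    eps = 1e-9
    per_row_cost = np.zeros(matrix.shape[0])
    for j in range(l):
        cols = (col_groups == j)
        n1 = matrix[:, cols].sum(axis=1)          # ones of each row in column group j
        n0 = cols.sum() - n1
        for i in range(k):
            rows = (row_groups == i)
            block = matrix[np.ix_(rows, cols)]
            if block.size == 0:
                continue
            p = np.clip(block.mean(), eps, 1 - eps)   # density of block (i, j)
            per_row_cost[rows] += -(n1[rows] * np.log2(p) + n0[rows] * np.log2(1 - p))
    # 1. row group R with the maximum cost ("entropy") per row
    avg = np.array([per_row_cost[row_groups == i].mean() if (row_groups == i).any()
                    else -np.inf for i in range(k)])
    R = int(np.argmax(avg))
    # 2./3. move the above-average rows of R into a new group with id k
    new_groups = row_groups.copy()
    in_R = (row_groups == R)
    new_groups[in_R & (per_row_cost > avg[R])] = k
    return new_groups, k + 1
```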

18 Algorithms: the full loop alternates Shuffles (find good groups for fixed k and l) and Splits (choose better values for k and l), starting from the initial matrix and lowering the encoding cost until the final cross-association.
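Putting the pieces together, a sketch of the outer loop built on the earlier sketches; total_cost and cross_associate are illustrative names, the description-cost term below is a crude stand-in rather than the paper's exact formula, and the real algorithm also splits and re-shuffles the column groups, not just the rows.

```python
def total_cost(matrix, row_groups, col_groups, k, l):
    """Crude MDL total: code cost plus a rough description-cost term (illustrative only)."""
    n_rows, n_cols = matrix.shape
    description = np.log2(n_rows) + np.log2(n_cols)           # transmit k and l (roughly)
    description += n_rows * np.log2(k) + n_cols * np.log2(l)  # transmit group assignments
    description += k * l * np.log2(matrix.size + 1)           # transmit per-block counts
    return code_cost(matrix, row_groups, col_groups, k, l) + description

def cross_associate(matrix, max_rounds=50):
    """Alternate splits and shuffles while the total encoding cost keeps dropping."""
    k, l = 1, 1
    row_groups = np.zeros(matrix.shape[0], dtype=int)
    col_groups = np.zeros(matrix.shape[1], dtype=int)
    best = total_cost(matrix, row_groups, col_groups, k, l)
    for _ in range(max_rounds):
        # try a row split (the real loop alternates row and column splits) ...
        cand_rows, cand_k = split_row_group(matrix, row_groups, col_groups, k, l)
        # ... then re-shuffle the rows to clean up the grouping
        cand_rows = shuffle_rows_once(matrix, cand_rows, col_groups, cand_k, l)
        cost = total_cost(matrix, cand_rows, col_groups, cand_k, l)
        if cost < best:                   # keep only cost-reducing changes (MDL)
            row_groups, k, best = cand_rows, cand_k, cost
        else:
            break                         # no improvement: stop
    return row_groups, col_groups, k, l
```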

19 Experiments: “Customer-Product” graph with Zipfian sizes, no noise; the method finds k = 5 row groups and l = 5 column groups.

20 Experiments: “Quasi block-diagonal” graph with Zipfian sizes, noise = 10%; k = 6 row groups, l = 8 column groups.

21 Experiments: “White Noise” graph: we find the existing spurious patterns; k = 2 row groups, l = 3 column groups.

22 Experiments: “CLASSIC” dataset (documents × words): 3,893 documents, 4,303 words, 176,347 “dots”; a combination of 3 sources: MEDLINE (medical), CISI (information retrieval), CRANFIELD (aerodynamics).

23 Experiments: “CLASSIC” graph of documents & words: k = 15, l = 19.

24 Experiments: “CLASSIC” graph of documents & words, k = 15, l = 19. Example MEDLINE (medical) word groups: insipidus, alveolar, aortic, death, prognosis, intravenous; blood, disease, clinical, cell, tissue, patient.

25 Experiments: “CLASSIC” graph of documents & words, k = 15, l = 19. Example CISI (information retrieval) word groups: providing, studying, records, development, students, rules; abstract, notation, works, construct, bibliographies.

26 Experiments: “CLASSIC” graph of documents & words, k = 15, l = 19. Example CRANFIELD (aerodynamics) word group: shape, nasa, leading, assumed, thin.

27 Experiments: “CLASSIC” graph of documents & words, k = 15, l = 19. A word group shared across MEDLINE (medical), CISI (IR) and CRANFIELD (aerodynamics): paint, examination, fall, raise, leave, based.

28 Experiments: “GRANTS” dataset (NSF grant proposals × words in abstract): 13,297 documents, 5,298 words, 805,063 “dots”.

29 Experiments: “GRANTS” graph of documents (NSF grant proposals) & words (in abstracts): k = 41, l = 28.

30 Experiments: “GRANTS” graph of documents & words, k = 41, l = 28. The cross-associations correspond to topics: genetics, physics, mathematics, …

31 Experiments: “Who-trusts-whom” graph of Epinions.com users: k = 18, l = 16.

32 Experiments: running time is linear in the number of “dots” for both splits and shuffles, so the method is scalable. (Plot: time in seconds vs. number of “dots”.)

33 Conclusions. Desiderata: 1. simultaneously discover row and column groups; 2. fully automatic: no “magic numbers”; 3. scalable to large matrices.

34 Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.

35 Experiments

36 Experiments: “Clickstream” graph of users and webpages: k = 15, l = 13.

37 Fixed k and l (alternative view): start with the initial matrix; find good groups for fixed k and l (via swaps); choose better values for k and l; lower the encoding cost; final cross-associations.

38 Experiments: “Caveman” graph with Zipfian cave sizes, no noise; k = 5 row groups, l = 5 column groups.

39 Aim: Given any binary matrix, a “good” cross-association will have low cost. But how can we find such a cross-association? (Example: k = 5 row groups, l = 5 column groups.)

40 Main Idea: Better Clustering implies Good Compression. Total Encoding Cost = Σ_i size_i · H(p_i)   [Code Cost]   +   cost of describing the cross-associations   [Description Cost]. Minimize the total cost.

41 Main Idea: How well does a cross-association compress the matrix? Encode the matrix in a lossless fashion and compute the encoding cost; a low encoding cost means good compression, and good compression means good clustering. Better Clustering implies Good Compression.