Slide 1: Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)
Slide 2: Problem Definition
[Figure: a customers-by-products matrix, with rows gathered into customer groups and columns into product groups.]
Simultaneously group customers and products, or documents and words, or users and preferences …
Slide 3: Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no "magic numbers"
3. Scalable to large matrices
Slide 4: Closely Related Work
Information-theoretic co-clustering [Dhillon+/2003]: the number of row and column groups must be specified, so it falls short of our desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no "magic numbers"
3. Scalable to large graphs
Slide 5: Other Related Work
- K-means and variants [Pelleg+/2000, Hamerly+/2003]: do not cluster rows and columns simultaneously
- "Frequent itemsets" [Agrawal+/1994]: the user must specify "support"
- Information retrieval [Deerwester+/1990, Hofmann/1999]: requires choosing the number of "concepts"
- Graph partitioning [Karypis+/1998]: requires the number of partitions and a measure of imbalance between clusters
Slide 6: What makes a cross-association "good"?
[Figure: two alternative groupings of the same matrix into row groups and column groups. Why is one better than the other?]
Good clustering means:
1. Similar nodes are grouped together
2. As few groups as necessary
This implies a few, homogeneous blocks, which in turn implies good compression.
Slide 7: Main Idea
Good compression implies good clustering. Encode the binary matrix block by block:
    Total encoding cost = Σ_i (n_i^1 + n_i^0) · H(p_i^1) + Σ_i [cost of describing n_i^1, n_i^0, and the groups]
The first sum is the code cost; the second is the description cost. Here, for each block i, n_i^1 and n_i^0 count its 1s and 0s, p_i^1 = n_i^1 / (n_i^1 + n_i^0) is its density of 1s, and H is the binary entropy function.
Slide 8: Examples
Two extremes of the total encoding cost (code cost + description cost):
- One row group, one column group: minimal description cost, but high code cost.
- m row groups, n column groups (one per row and per column): minimal code cost, but high description cost.
Slide 9: What makes a cross-association "good"?
[Figure: the two candidate groupings again.] Why is one better? Because it achieves a low total encoding cost, i.e., low code cost plus low description cost. A runnable sketch of the code-cost term follows.
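A minimal sketch of the code-cost computation, assuming a dense 0/1 numpy array A and integer group-assignment vectors row_groups and col_groups (the names block_code_cost and total_code_cost are illustrative, not from the authors' code). The description-cost term, which encodes the per-block counts and the group sizes, is omitted here:

    import numpy as np

    def block_code_cost(n1, n0):
        # Bits to encode a block holding n1 ones and n0 zeros:
        # (n1 + n0) * H(p1), with p1 = n1 / (n1 + n0).
        n = n1 + n0
        if n == 0 or n1 == 0 or n0 == 0:
            return 0.0  # a homogeneous (all-0 or all-1) block needs ~0 bits
        p1 = n1 / n
        return -n * (p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

    def total_code_cost(A, row_groups, col_groups):
        # Sum the per-block code costs over all (row group, column group) blocks.
        cost = 0.0
        for r in np.unique(row_groups):
            for c in np.unique(col_groups):
                block = A[np.ix_(row_groups == r, col_groups == c)]
                n1 = int(block.sum())
                cost += block_code_cost(n1, block.size - n1)
        return cost

A homogeneous block contributes nothing to the code cost, which is exactly why a few homogeneous blocks compress well.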
Slide 10: Algorithms
[Figure: the search grows the grouping step by step, e.g. k=1,l=2 → k=2,l=2 → k=2,l=3 → k=3,l=3 → k=3,l=4 → k=4,l=4 → k=4,l=5 → … ending at k = 5 row groups and l = 5 column groups.]
Slide 11: Algorithms
[Flowchart: start with the initial matrix → find good groups for fixed k and l → choose better values for k and l → repeat, lowering the encoding cost → final cross-association (k = 5, l = 5).]
Slide 12: Fixed k and l
[Same flowchart, highlighting the step "find good groups for fixed k and l".]
Slide 13: Fixed k and l
Shuffles: for each row, shuffle it to the row group that minimizes the code cost.
Slide 14: Fixed k and l
Ditto for column shuffles … and repeat. A sketch of one row-shuffle pass follows.
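A hedged sketch of one such pass, under the same illustrative names as before (the paper's implementation keeps smarter incremental bookkeeping). Each row moves to the row group whose block densities encode it in the fewest bits; column shuffles are the symmetric operation on A.T:

    import numpy as np

    def shuffle_rows_once(A, row_groups, col_groups, k, eps=1e-9):
        # One pass of row shuffles for fixed k and l.
        l = int(col_groups.max()) + 1
        col_idx = [np.flatnonzero(col_groups == c) for c in range(l)]
        sizes = np.array([len(ci) for ci in col_idx])
        # Density of 1s in each (row group, column group) block, kept off 0 and 1.
        P = np.empty((k, l))
        for g in range(k):
            rows = np.flatnonzero(row_groups == g)
            for c in range(l):
                block = A[np.ix_(rows, col_idx[c])]
                P[g, c] = np.clip(block.mean() if block.size else 0.5, eps, 1 - eps)
        for i in range(A.shape[0]):
            n1 = np.array([A[i, ci].sum() for ci in col_idx])
            # Bits to encode row i under each candidate row group's densities.
            bits = -(np.log2(P) @ n1 + np.log2(1 - P) @ (sizes - n1))
            row_groups[i] = int(bits.argmin())
        return row_groups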
Slide 15: Choosing k and l
[Same flowchart, highlighting the step "choose better values for k and l".]
Slide 16: Choosing k and l
Split:
1. Find the row group R with the maximum entropy per row.
2. Choose the rows in R whose removal reduces the entropy per row in R.
3. Send these rows to the new row group, and set k = k + 1.
Slide 17: Choosing k and l
Split: similar for column groups too. A sketch follows.
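One way to realize the split step, as a sketch under the same assumptions; moving the rows whose code cost exceeds their group's per-row average is a simple stand-in for "rows whose removal reduces the entropy per row":

    import numpy as np

    def split_worst_row_group(A, row_groups, col_groups, k, eps=1e-9):
        # Pick the row group R with the highest code cost (entropy) per row,
        # move its costlier-than-average rows into a new group, return k+1.
        l = int(col_groups.max()) + 1
        col_idx = [np.flatnonzero(col_groups == c) for c in range(l)]
        sizes = np.array([len(ci) for ci in col_idx])
        P = np.empty((k, l))
        for g in range(k):
            rows = np.flatnonzero(row_groups == g)
            for c in range(l):
                block = A[np.ix_(rows, col_idx[c])]
                P[g, c] = np.clip(block.mean() if block.size else 0.5, eps, 1 - eps)
        bits = np.empty(A.shape[0])
        for i in range(A.shape[0]):
            n1 = np.array([A[i, ci].sum() for ci in col_idx])
            g = row_groups[i]
            bits[i] = -(np.log2(P[g]) @ n1 + np.log2(1 - P[g]) @ (sizes - n1))
        avg = np.array([bits[row_groups == g].mean() if np.any(row_groups == g)
                        else 0.0 for g in range(k)])
        R = int(avg.argmax())                   # max entropy per row
        movers = (row_groups == R) & (bits > avg[R])
        row_groups[movers] = k                  # the new row group
        return row_groups, k + 1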
Slide 18: Algorithms
[Flowchart recap: shuffles implement "find good groups for fixed k and l"; splits implement "choose better values for k and l"; alternating them keeps lowering the encoding cost until the final cross-association.] A sketch of the outer loop follows.
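Putting the loop together for the row side (columns are handled symmetrically, e.g. by applying the same moves to A.T), reusing the helpers sketched above. The description-cost term below is a deliberately crude stand-in assumption, not the paper's formula; it only serves to stop the search once extra groups no longer pay for themselves:

    import numpy as np

    def total_cost(A, rg, cg, k, l):
        # Code cost plus a rough description cost: group assignments for
        # every row/column, plus per-block counts. (Crude stand-in only.)
        desc = (A.shape[0] * np.log2(max(k, 2)) +
                A.shape[1] * np.log2(max(l, 2)) +
                k * l * np.log2(A.size + 1))
        return total_code_cost(A, rg, cg) + desc

    def cross_associate_rows(A, max_splits=20, shuffle_passes=4):
        # Alternate splits and shuffles; keep a split only if the total
        # encoding cost drops, and stop as soon as it does not.
        rg = np.zeros(A.shape[0], dtype=int)
        cg = np.zeros(A.shape[1], dtype=int)
        k, l = 1, 1
        best = total_cost(A, rg, cg, k, l)
        for _ in range(max_splits):
            rg_try, k_try = split_worst_row_group(A, rg.copy(), cg, k)
            for _ in range(shuffle_passes):
                rg_try = shuffle_rows_once(A, rg_try, cg, k_try)
            cost = total_cost(A, rg_try, cg, k_try, l)
            if cost >= best:
                break                       # cost stopped decreasing: done
            rg, k, best = rg_try, k_try, cost
        return rg, k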
Slide 19: Experiments
"Customer-Product" graph with Zipfian sizes, no noise: k = 5 row groups, l = 5 column groups.
Slide 20: Experiments
"Quasi block-diagonal" graph with Zipfian sizes, noise = 10%: k = 6 row groups, l = 8 column groups.
Slide 21: Experiments
"White noise" graph: we find the existing spurious patterns (k = 2 row groups, l = 3 column groups).
Slide 22: Experiments
"CLASSIC": a documents-by-words matrix with 3,893 documents, 4,303 words, and 176,347 "dots" (nonzero entries). It combines three sources: MEDLINE (medical), CISI (information retrieval), and CRANFIELD (aerodynamics).
Slide 23: Experiments
"CLASSIC" graph of documents & words: k = 15, l = 19. [Figure: the reordered documents-by-words matrix.]
Slide 24: Experiments
"CLASSIC" graph of documents & words: k = 15, l = 19. Word groups characteristic of MEDLINE (medical): "insipidus, alveolar, aortic, death, prognosis, intravenous" and "blood, disease, clinical, cell, tissue, patient".
Slide 25: Experiments
"CLASSIC", continued. Word groups characteristic of CISI (information retrieval): "providing, studying, records, development, students, rules" and "abstract, notation, works, construct, bibliographies".
Slide 26: Experiments
"CLASSIC", continued. A word group characteristic of CRANFIELD (aerodynamics): "shape, nasa, leading, assumed, thin".
Slide 27: Experiments
"CLASSIC", continued. One word group cuts across all three sources (MEDLINE, CISI, CRANFIELD): "paint, examination, fall, raise, leave, based".
Slide 28: Experiments
"GRANTS": NSF grant proposals by words in their abstracts, with 13,297 documents, 5,298 words, and 805,063 "dots".
Slide 29: Experiments
"GRANTS" graph of documents & words: k = 41, l = 28. [Figure: the reordered proposals-by-words matrix.]
Slide 30: Experiments
"GRANTS" graph of documents & words: k = 41, l = 28. The cross-associations refer to topics: genetics, physics, mathematics, …
Slide 31: Experiments
"Who-trusts-whom" graph of epinions.com users: k = 18, l = 16.
Slide 32: Experiments
[Plot: running time (secs) versus number of "dots", shown separately for splits and shuffles.] Running time is linear in the number of "dots": the method is scalable.
Slide 33: Conclusions
All three desiderata are met:
1. Simultaneously discover row and column groups
2. Fully automatic: no "magic numbers"
3. Scalable to large matrices
Slide 34: Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.
Slide 35: Experiments
Slide 36: Experiments
"Clickstream" graph of users and webpages: k = 15, l = 13.
Slide 37: Fixed k and l
[The earlier flowchart, with the "find good groups for fixed k and l" step labeled "swaps".]
Slide 38: Experiments
"Caveman" graph with Zipfian cave sizes, no noise: k = 5 row groups, l = 5 column groups.
Slide 39: Aim
Given any binary matrix, a "good" cross-association will have low cost. But how can we find such a cross-association (e.g., k = 5 row groups, l = 5 column groups)?
Slide 40: Main Idea
    Total encoding cost = Σ_i size_i · H(p_i) + cost of describing the cross-associations
The first term is the code cost, the second the description cost. Good compression implies better clustering, so: minimize the total cost.
Slide 41: Main Idea
How well does a cross-association compress the matrix? Encode the matrix in a lossless fashion and compute the encoding cost: a low encoding cost means good compression, and good compression implies better clustering.