Slide 1: Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)
Slide 2: Problem Definition
[Figure: a customers-by-products matrix, with rows gathered into customer groups and columns into product groups.]
Simultaneously group customers and products, or documents and words, or users and preferences …
Slide 3: Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no "magic numbers"
3. Scalable to large matrices
Slide 4: Closely Related Work
Information-theoretic co-clustering [Dhillon+/2003]: the number of row and column groups must be specified, so it falls short of our desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no "magic numbers"
3. Scalable to large graphs
Slide 5: Other Related Work
- K-means and variants [Pelleg+/2000, Hamerly+/2003]: do not cluster rows and columns simultaneously
- "Frequent itemsets" [Agrawal+/1994]: the user must specify "support"
- Information retrieval [Deerwester+/1990, Hofmann/1999]: requires choosing the number of "concepts"
- Graph partitioning [Karypis+/1998]: requires the number of partitions and a measure of imbalance between clusters
Slide 6: What makes a cross-association "good"?
[Figure: two alternative groupings of the same matrix into row groups and column groups. Why is one better than the other?]
Good clustering means:
1. Similar nodes are grouped together
2. As few groups as necessary
This implies a few, homogeneous blocks, which in turn implies good compression.
Slide 7: Main Idea
Good compression implies good clustering. Encode the binary matrix block by block:
    Total encoding cost = Σ_i (n_i^1 + n_i^0) · H(p_i^1) + Σ_i [cost of describing n_i^1, n_i^0, and the groups]
The first sum is the code cost; the second is the description cost. Here, for each block i, n_i^1 and n_i^0 count its 1s and 0s, p_i^1 = n_i^1 / (n_i^1 + n_i^0) is its density of 1s, and H is the binary entropy function.
Slide 8: Examples
Two extremes of the total encoding cost (code cost + description cost):
- One row group, one column group: minimal description cost, but high code cost.
- m row groups, n column groups (one per row and per column): minimal code cost, but high description cost.
Slide 9: What makes a cross-association "good"?
[Figure: the two candidate groupings again.] Why is one better? Because it achieves a low total encoding cost, i.e., low code cost plus low description cost. A runnable sketch of the code-cost term follows.
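A minimal sketch of the code-cost computation, assuming a dense 0/1 numpy array A and integer group-assignment vectors row_groups and col_groups (the names block_code_cost and total_code_cost are illustrative, not from the authors' code). The description-cost term, which encodes the per-block counts and the group sizes, is omitted here:

    import numpy as np

    def block_code_cost(n1, n0):
        # Bits to encode a block holding n1 ones and n0 zeros:
        # (n1 + n0) * H(p1), with p1 = n1 / (n1 + n0).
        n = n1 + n0
        if n == 0 or n1 == 0 or n0 == 0:
            return 0.0  # a homogeneous (all-0 or all-1) block needs ~0 bits
        p1 = n1 / n
        return -n * (p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

    def total_code_cost(A, row_groups, col_groups):
        # Sum the per-block code costs over all (row group, column group) blocks.
        cost = 0.0
        for r in np.unique(row_groups):
            for c in np.unique(col_groups):
                block = A[np.ix_(row_groups == r, col_groups == c)]
                n1 = int(block.sum())
                cost += block_code_cost(n1, block.size - n1)
        return cost

A homogeneous block contributes nothing to the code cost, which is exactly why a few homogeneous blocks compress well.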
Slide 10: Algorithms
[Figure: the search grows the grouping step by step, e.g. k=1,l=2 → k=2,l=2 → k=2,l=3 → k=3,l=3 → k=3,l=4 → k=4,l=4 → k=4,l=5 → … ending at k = 5 row groups and l = 5 column groups.]
Slide 11: Algorithms
[Flowchart: start with the initial matrix → find good groups for fixed k and l → choose better values for k and l → repeat, lowering the encoding cost → final cross-association (k = 5, l = 5).]
Slide 12: Fixed k and l
[Same flowchart, highlighting the step "find good groups for fixed k and l".]
Slide 13: Fixed k and l
Shuffles: for each row, shuffle it to the row group that minimizes the code cost.
Slide 14: Fixed k and l
Ditto for column shuffles … and repeat. A sketch of one row-shuffle pass follows.
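A hedged sketch of one such pass, under the same illustrative names as before (the paper's implementation keeps smarter incremental bookkeeping). Each row moves to the row group whose block densities encode it in the fewest bits; column shuffles are the symmetric operation on A.T:

    import numpy as np

    def shuffle_rows_once(A, row_groups, col_groups, k, eps=1e-9):
        # One pass of row shuffles for fixed k and l.
        l = int(col_groups.max()) + 1
        col_idx = [np.flatnonzero(col_groups == c) for c in range(l)]
        sizes = np.array([len(ci) for ci in col_idx])
        # Density of 1s in each (row group, column group) block, kept off 0 and 1.
        P = np.empty((k, l))
        for g in range(k):
            rows = np.flatnonzero(row_groups == g)
            for c in range(l):
                block = A[np.ix_(rows, col_idx[c])]
                P[g, c] = np.clip(block.mean() if block.size else 0.5, eps, 1 - eps)
        for i in range(A.shape[0]):
            n1 = np.array([A[i, ci].sum() for ci in col_idx])
            # Bits to encode row i under each candidate row group's densities.
            bits = -(np.log2(P) @ n1 + np.log2(1 - P) @ (sizes - n1))
            row_groups[i] = int(bits.argmin())
        return row_groups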
Slide 15: Choosing k and l
[Same flowchart, highlighting the step "choose better values for k and l".]
Slide 16: Choosing k and l
Split:
1. Find the row group R with the maximum entropy per row.
2. Choose the rows in R whose removal reduces the entropy per row in R.
3. Send these rows to the new row group, and set k = k + 1.
Slide 17: Choosing k and l
Split: similar for column groups too. A sketch follows.
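One way to realize the split step, as a sketch under the same assumptions; moving the rows whose code cost exceeds their group's per-row average is a simple stand-in for "rows whose removal reduces the entropy per row":

    import numpy as np

    def split_worst_row_group(A, row_groups, col_groups, k, eps=1e-9):
        # Pick the row group R with the highest code cost (entropy) per row,
        # move its costlier-than-average rows into a new group, return k+1.
        l = int(col_groups.max()) + 1
        col_idx = [np.flatnonzero(col_groups == c) for c in range(l)]
        sizes = np.array([len(ci) for ci in col_idx])
        P = np.empty((k, l))
        for g in range(k):
            rows = np.flatnonzero(row_groups == g)
            for c in range(l):
                block = A[np.ix_(rows, col_idx[c])]
                P[g, c] = np.clip(block.mean() if block.size else 0.5, eps, 1 - eps)
        bits = np.empty(A.shape[0])
        for i in range(A.shape[0]):
            n1 = np.array([A[i, ci].sum() for ci in col_idx])
            g = row_groups[i]
            bits[i] = -(np.log2(P[g]) @ n1 + np.log2(1 - P[g]) @ (sizes - n1))
        avg = np.array([bits[row_groups == g].mean() if np.any(row_groups == g)
                        else 0.0 for g in range(k)])
        R = int(avg.argmax())                   # max entropy per row
        movers = (row_groups == R) & (bits > avg[R])
        row_groups[movers] = k                  # the new row group
        return row_groups, k + 1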
Slide 18: Algorithms
[Flowchart recap: shuffles implement "find good groups for fixed k and l"; splits implement "choose better values for k and l"; alternating them keeps lowering the encoding cost until the final cross-association.] A sketch of the outer loop follows.
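Putting the loop together for the row side (columns are handled symmetrically, e.g. by applying the same moves to A.T), reusing the helpers sketched above. The description-cost term below is a deliberately crude stand-in assumption, not the paper's formula; it only serves to stop the search once extra groups no longer pay for themselves:

    import numpy as np

    def total_cost(A, rg, cg, k, l):
        # Code cost plus a rough description cost: group assignments for
        # every row/column, plus per-block counts. (Crude stand-in only.)
        desc = (A.shape[0] * np.log2(max(k, 2)) +
                A.shape[1] * np.log2(max(l, 2)) +
                k * l * np.log2(A.size + 1))
        return total_code_cost(A, rg, cg) + desc

    def cross_associate_rows(A, max_splits=20, shuffle_passes=4):
        # Alternate splits and shuffles; keep a split only if the total
        # encoding cost drops, and stop as soon as it does not.
        rg = np.zeros(A.shape[0], dtype=int)
        cg = np.zeros(A.shape[1], dtype=int)
        k, l = 1, 1
        best = total_cost(A, rg, cg, k, l)
        for _ in range(max_splits):
            rg_try, k_try = split_worst_row_group(A, rg.copy(), cg, k)
            for _ in range(shuffle_passes):
                rg_try = shuffle_rows_once(A, rg_try, cg, k_try)
            cost = total_cost(A, rg_try, cg, k_try, l)
            if cost >= best:
                break                       # cost stopped decreasing: done
            rg, k, best = rg_try, k_try, cost
        return rg, k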
Slide 19: Experiments
"Customer-Product" graph with Zipfian sizes, no noise: k = 5 row groups, l = 5 column groups.
Slide 20: Experiments
"Quasi block-diagonal" graph with Zipfian sizes, noise = 10%: k = 6 row groups, l = 8 column groups.
Slide 21: Experiments
"White noise" graph: we find the existing spurious patterns (k = 2 row groups, l = 3 column groups).
Slide 22: Experiments
"CLASSIC": a documents-by-words matrix with 3,893 documents, 4,303 words, and 176,347 "dots" (nonzero entries). It combines three sources: MEDLINE (medical), CISI (information retrieval), and CRANFIELD (aerodynamics).
Slide 23: Experiments
"CLASSIC" graph of documents & words: k = 15, l = 19. [Figure: the reordered documents-by-words matrix.]
Slide 24: Experiments
"CLASSIC" graph of documents & words: k = 15, l = 19. Word groups characteristic of MEDLINE (medical): "insipidus, alveolar, aortic, death, prognosis, intravenous" and "blood, disease, clinical, cell, tissue, patient".
Slide 25: Experiments
"CLASSIC", continued. Word groups characteristic of CISI (information retrieval): "providing, studying, records, development, students, rules" and "abstract, notation, works, construct, bibliographies".
Slide 26: Experiments
"CLASSIC", continued. A word group characteristic of CRANFIELD (aerodynamics): "shape, nasa, leading, assumed, thin".
Slide 27: Experiments
"CLASSIC", continued. One word group cuts across all three sources (MEDLINE, CISI, CRANFIELD): "paint, examination, fall, raise, leave, based".
Slide 28: Experiments
"GRANTS": NSF grant proposals by words in their abstracts, with 13,297 documents, 5,298 words, and 805,063 "dots".
Slide 29: Experiments
"GRANTS" graph of documents & words: k = 41, l = 28. [Figure: the reordered proposals-by-words matrix.]
Slide 30: Experiments
"GRANTS" graph of documents & words: k = 41, l = 28. The cross-associations refer to topics: genetics, physics, mathematics, …
Slide 31: Experiments
"Who-trusts-whom" graph of epinions.com users: k = 18, l = 16.
Slide 32: Experiments
[Plot: running time (secs) versus number of "dots", shown separately for splits and shuffles.] Running time is linear in the number of "dots": the method is scalable.
Slide 33: Conclusions
All three desiderata are met:
1. Simultaneously discover row and column groups
2. Fully automatic: no "magic numbers"
3. Scalable to large matrices
Slide 34: Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.
Slide 35: Experiments
Slide 36: Experiments
"Clickstream" graph of users and webpages: k = 15, l = 13.
Slide 37: Fixed k and l
[The earlier flowchart, with the "find good groups for fixed k and l" step labeled "swaps".]
Slide 38: Experiments
"Caveman" graph with Zipfian cave sizes, no noise: k = 5 row groups, l = 5 column groups.
Slide 39: Aim
Given any binary matrix, a "good" cross-association will have low cost. But how can we find such a cross-association (e.g., k = 5 row groups, l = 5 column groups)?
Slide 40: Main Idea
    Total encoding cost = Σ_i size_i · H(p_i) + cost of describing the cross-associations
The first term is the code cost, the second the description cost. Good compression implies better clustering, so: minimize the total cost.
Slide 41: Main Idea
How well does a cross-association compress the matrix? Encode the matrix in a lossless fashion and compute the encoding cost: a low encoding cost means good compression, and good compression implies better clustering.