1
Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)
2
Problem Definition
Simultaneously group customers and products, or documents and words, or users and preferences …
[Figure: a customers × products matrix rearranged into customer groups × product groups]
3
Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no “magic numbers”
3. Scalable to large graphs
4
Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.
5
Related Work
K-means and variants: dimensionality curse; choosing the number of clusters
“Frequent itemsets”: user must specify “support”
Information Retrieval: choosing the number of “concepts”
Graph Partitioning: number of partitions; measure of imbalance between clusters
6
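What makes a cross-association “good”?
[Figure: two groupings of the same matrix into row groups and column groups, one versus the other]
Why is this better?
Better clustering:
1. Similar nodes are grouped together
2. As few groups as necessary
This gives a few, homogeneous blocks, that is, good compression.
Good compression implies better clustering.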
7
Main Idea
Good compression implies better clustering.
[Figure: binary matrix divided into row groups and column groups]
For each block $i$ with $n_i^1$ ones and $n_i^0$ zeros: $p_i^1 = n_i^1 / (n_i^1 + n_i^0)$
Code cost: $\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$
Description cost: $\sum_i$ (cost of describing $n_i^1$ and $n_i^0$)
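As a minimal sketch of the code-cost term on this slide, the Python below counts the zeros and ones in each block and sums $(n_i^1 + n_i^0) \, H(p_i^1)$ over the blocks. The function names (`binary_entropy`, `block_counts`, `code_cost`) and the list-of-lists matrix representation are assumptions made for illustration. The description-cost term is left out because the slide does not spell it out.

```python
from math import log2


def binary_entropy(p):
    """Shannon entropy in bits of a 0/1 source with P(1) = p; H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def block_counts(matrix, row_groups, col_groups, k, l):
    """Return a k x l table whose entries are [n0, n1] for each block.

    row_groups[i] is the group (0..k-1) of row i, and
    col_groups[j] is the group (0..l-1) of column j.
    """
    counts = [[[0, 0] for _ in range(l)] for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            counts[row_groups[i]][col_groups[j]][value] += 1
    return counts


def code_cost(matrix, row_groups, col_groups, k, l):
    """Code cost in bits: sum over blocks of (n1 + n0) * H(n1 / (n1 + n0))."""
    total = 0.0
    for block_row in block_counts(matrix, row_groups, col_groups, k, l):
        for n0, n1 in block_row:
            size = n0 + n1
            if size:  # skip empty blocks
                total += size * binary_entropy(n1 / size)
    return total


# Hypothetical 4 x 4 example with two dense diagonal blocks.
example = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
homogeneous = code_cost(example, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
single_block = code_cost(example, [0, 0, 0, 0], [0, 0, 0, 0], 1, 1)
```

In this example the grouping that matches the two dense blocks leaves every block homogeneous, so its code cost is zero, while a single block that is half ones costs one bit per cell.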
8
Examples
Total Encoding Cost = $\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$ [code cost] $+ \sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [description cost]
One row group, one column group: code cost high, description cost low
m row groups, n column groups: code cost low, description cost high
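The trade-off between the two extremes can be illustrated with a small numeric sketch. The description cost used here is an assumption, not the formula from the talk: it charges $\log_2(\text{size} + 1)$ bits per block, enough to state how many ones the block holds, and ignores the cost of stating the group memberships. The function name `costs` and the toy matrix are likewise hypothetical.

```python
from math import log2


def entropy(p):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)


def costs(matrix, row_groups, col_groups):
    """Return (code_cost, description_cost) in bits for one grouping."""
    blocks = {}  # (row group, column group) -> [n0, n1]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_groups[i], col_groups[j])
            blocks.setdefault(key, [0, 0])[value] += 1
    code = sum((n0 + n1) * entropy(n1 / (n0 + n1)) for n0, n1 in blocks.values())
    # ASSUMPTION: simplified description cost, log2(size + 1) bits per block
    # to state the number of ones; the talk's exact formula is not shown.
    description = sum(log2(n0 + n1 + 1) for n0, n1 in blocks.values())
    return code, description


matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
one_group = costs(matrix, [0, 0, 0, 0], [0, 0, 0, 0])      # 1 x 1 blocks
two_groups = costs(matrix, [0, 0, 1, 1], [0, 0, 1, 1])     # 2 x 2 blocks
every_cell = costs(matrix, [0, 1, 2, 3], [0, 1, 2, 3])     # 4 x 4 blocks
```

Under this assumed description cost, the single-group extreme pays mostly code cost, the one-cell-per-block extreme pays only description cost, and the 2 x 2 grouping has the lowest total of the three.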
9
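What makes a cross-association “good”?
[Figure: two groupings of the same matrix into row groups and column groups, one versus the other]
Why is this better? It has a low total encoding cost.
Total Encoding Cost = $\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$ [code cost] $+ \sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [description cost]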
10
Algorithms
[Figure: successive cross-associations found during the search]
k=1, l=2 → k=2, l=2 → k=2, l=3 → k=3, l=3 → k=3, l=4 → k=4, l=4 → k=4, l=5 → k = 5 row groups, l = 5 col groups
11
Algorithms
Start with initial matrix → Find good groups for fixed k and l → Choose better values for k and l → (repeat to lower the encoding cost) → Final cross-associations
[Figure: final cross-associations with k = 5, l = 5]
12
Fixed k and l
Start with initial matrix → Find good groups for fixed k and l → Choose better values for k and l → (repeat to lower the encoding cost) → Final cross-associations
[Figure: final cross-associations with k = 5, l = 5]
13
Fixed k and l
Swaps: for each row, swap it to the row group that minimizes the code cost.
[Figure: matrix divided into row groups and column groups]
14
Fixed k and l
Ditto for column swaps … and repeat …
[Figure: matrix divided into row groups and column groups]
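The swap step for fixed k and l can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not the implementation from the talk: it recomputes the full code cost for every candidate move, so it does not have the linear scaling reported in the experiments, and the names `swap_rows`, `swap_until_stable`, and `max_iters` are hypothetical. Column swaps reuse the row routine on the transposed matrix.

```python
from math import log2


def code_cost(matrix, row_groups, col_groups):
    """Code cost in bits: sum over blocks of size * H(fraction of ones)."""
    blocks = {}  # (row group, column group) -> [n0, n1]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_groups[i], col_groups[j])
            blocks.setdefault(key, [0, 0])[value] += 1
    total = 0.0
    for n0, n1 in blocks.values():
        size = n0 + n1
        if 0 < n1 < size:  # homogeneous blocks cost nothing
            p = n1 / size
            total -= size * (p * log2(p) + (1 - p) * log2(1 - p))
    return total


def swap_rows(matrix, row_groups, col_groups, k):
    """One pass: move each row to the row group giving the lowest code cost."""
    row_groups = list(row_groups)
    for i in range(len(matrix)):
        best_group = row_groups[i]
        best_cost = code_cost(matrix, row_groups, col_groups)
        for g in range(k):
            row_groups[i] = g
            cost = code_cost(matrix, row_groups, col_groups)
            if cost < best_cost:  # strict, so ties keep the current group
                best_group, best_cost = g, cost
        row_groups[i] = best_group
    return row_groups


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def swap_until_stable(matrix, row_groups, col_groups, k, l, max_iters=20):
    """Alternate row and column swaps until nothing changes.

    ASSUMPTION: max_iters is an illustrative safety cap, not from the talk.
    """
    row_groups, col_groups = list(row_groups), list(col_groups)
    for _ in range(max_iters):
        new_rows = swap_rows(matrix, row_groups, col_groups, k)
        # Column swaps: the same routine applied to the transposed matrix.
        new_cols = swap_rows(transpose(matrix), col_groups, new_rows, l)
        if new_rows == row_groups and new_cols == col_groups:
            break
        row_groups, col_groups = new_rows, new_cols
    return row_groups, col_groups
```

Because each move is accepted only when it strictly lowers the code cost, the cost never increases from one pass to the next; like any greedy local search, the sketch can still stop at a local minimum.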
15
Choosing k and l
Start with initial matrix → Find good groups for fixed k and l → Choose better values for k and l → (repeat to lower the encoding cost) → Final cross-associations
[Figure: final cross-associations with k = 5, l = 5]
16
Choosing k and l
Split:
1. Find the row group R with the maximum entropy per row
2. Choose the rows in R whose removal reduces the entropy per row in R
3. Send these rows to the new row group, and set k = k + 1
[Figure: example with k = 5, l = 5]
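The three split steps above can be sketched in Python as follows. This is an illustration under assumptions: "entropy per row" is taken to be the code cost of a row group's blocks divided by its number of rows, each row is tested against the group's original entropy per row, and the function names are hypothetical. Leaving k unchanged when no row qualifies is also a choice made for this sketch.

```python
from math import log2


def group_code_cost(rows, col_groups, l):
    """Code cost in bits of one row group: sum over its l blocks of size * H(p)."""
    counts = [[0, 0] for _ in range(l)]  # per column group: [n0, n1]
    for row in rows:
        for j, value in enumerate(row):
            counts[col_groups[j]][value] += 1
    total = 0.0
    for n0, n1 in counts:
        size = n0 + n1
        if 0 < n1 < size:
            p = n1 / size
            total -= size * (p * log2(p) + (1 - p) * log2(1 - p))
    return total


def entropy_per_row(rows, col_groups, l):
    """ASSUMPTION: entropy per row = code cost of the group / number of rows."""
    return group_code_cost(rows, col_groups, l) / len(rows) if rows else 0.0


def split_row_group(matrix, row_groups, col_groups, k, l):
    """Return (new_row_groups, new_k) after one split step."""
    members = [[i for i, g in enumerate(row_groups) if g == r] for r in range(k)]

    def per_row(r):
        return entropy_per_row([matrix[i] for i in members[r]], col_groups, l)

    # Step 1: the row group R with the maximum entropy per row.
    worst = max(range(k), key=per_row)
    baseline = per_row(worst)

    # Step 2: rows whose removal reduces the entropy per row of R.
    new_groups = list(row_groups)
    moved = False
    for i in members[worst]:
        rest = [matrix[m] for m in members[worst] if m != i]
        if entropy_per_row(rest, col_groups, l) < baseline:
            new_groups[i] = k  # Step 3: send the row to the new group k
            moved = True

    # Sketch-specific choice: keep k if no row qualified for the new group.
    return (new_groups, k + 1) if moved else (new_groups, k)
```

For example, with three identical rows and one row of the opposite pattern all in a single row group, only the odd row lowers the group's entropy per row when removed, so it alone is sent to the new group.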
17
Choosing k and l
Split: similar for column groups too.
[Figure: example with k = 5, l = 5]
18
Algorithms
Start with initial matrix → Find good groups for fixed k and l (swaps) → Choose better values for k and l (splits) → (repeat to lower the encoding cost) → Final cross-associations
[Figure: final cross-associations with k = 5, l = 5]
19
Experiments
“Customer-Product” graph with Zipfian sizes, no noise: k = 5 row groups, l = 5 col groups
20
Experiments
“Caveman” graph with Zipfian cave sizes, noise = 10%: k = 6 row groups, l = 8 col groups
21
Experiments
“White Noise” graph: k = 2 row groups, l = 3 col groups
22
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
[Figure axes: documents, words]
23
Experiments
“GRANTS” graph of documents & words: k=41, l=28
[Figure axes: NSF grant proposals, words in abstract]
24
Experiments
“Who-trusts-whom” graph from epinions.com: k=18, l=16
[Figure axis: Epinions.com user]
25
Experiments
“Clickstream” graph of users and websites: k=15, l=13
[Figure axes: users, webpages]
26
Experiments
Running time is linear in the number of “ones”: scalable.
[Plot: time (secs) versus number of non-zeros, with curves for splits and swaps]
27
Conclusions
Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no “magic numbers”
3. Scalable to large graphs
28
Fixed k and l
Start with initial matrix → Find good groups for fixed k and l (swaps) → Choose better values for k and l → (repeat to lower the encoding cost) → Final cross-associations
[Figure: final cross-associations with k = 5, l = 5]
29
Experiments
“Caveman” graph with Zipfian cave sizes, no noise: k = 5 row groups, l = 5 col groups
30
Aim
Given any binary matrix, a “good” cross-association will have low cost.
But how can we find such a cross-association?
[Figure: example with k = 5 row groups, l = 5 col groups]
31
Main Idea
Good compression implies better clustering.
Total Encoding Cost = $\sum_i \text{size}_i \cdot H(p_i)$ [code cost] $+$ cost of describing cross-associations [description cost]
Minimize the total cost.
32
Main Idea
How well does a cross-association compress the matrix?
1. Encode the matrix in a lossless fashion
2. Compute the encoding cost
3. Low encoding cost → good compression → good clustering
Good compression implies better clustering.