Fully Automatic Cross-Associations Deepayan Chakrabarti (CMU) Spiros Papadimitriou (CMU) Dharmendra Modha (IBM) Christos Faloutsos (CMU and IBM)

Problem Definition
[Figure: a customers × products matrix, rearranged into customer groups and product groups.]
Simultaneously group customers and products, or documents and words, or users and preferences …

Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no “magic numbers”
3. Scalable to large graphs

Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.

Related Work
K-means and variants: dimensionality curse; choosing the number of clusters.
“Frequent itemsets”: user must specify “support”.
Information Retrieval: choosing the number of “concepts”.
Graph Partitioning: number of partitions; measure of imbalance between clusters.

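What makes a cross-association “good”?
[Figure: two groupings of the same matrix into row groups and column groups, one versus the other.]
Why is this better?
Better clustering:
1. Similar nodes are grouped together
2. As few groups as necessary
The result is a few, homogeneous blocks, that is, good compression.
Good compression implies better clustering.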

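Main Idea
Good compression implies better clustering.
[Figure: a binary matrix divided by row groups and column groups into blocks, indexed by i.]
For block i, with $n_i^1$ ones and $n_i^0$ zeros: $p_i^1 = n_i^1 / (n_i^1 + n_i^0)$.
Code cost: $\sum_i (n_i^1 + n_i^0)\, H(p_i^1)$
Description cost: $\sum_i$ (cost of describing $n_i^1$ and $n_i^0$)
The two costs are added.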
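The cost on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, blocks are indexed by a (row group, column group) pair instead of a single i, and the description cost is an assumption (the slide only says "cost of describing n_i^1 and n_i^0"), taken here as the bits needed to write down the number of ones in each block.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def block_counts(matrix, row_group, col_group, k, l):
    """Count ones and cells in every (row group, column group) block."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            i, j = row_group[r], col_group[c]
            size[i][j] += 1
            ones[i][j] += value
    return ones, size

def code_cost(matrix, row_group, col_group, k, l):
    """Sum over blocks of (n1 + n0) * H(p1), as on the slide."""
    ones, size = block_counts(matrix, row_group, col_group, k, l)
    total = 0.0
    for i in range(k):
        for j in range(l):
            if size[i][j]:
                total += size[i][j] * entropy(ones[i][j] / size[i][j])
    return total

def description_cost(matrix, row_group, col_group, k, l):
    """ASSUMPTION: ceil(log2(size + 1)) bits per block to state its count of ones."""
    _, size = block_counts(matrix, row_group, col_group, k, l)
    return sum(math.ceil(math.log2(n + 1)) for row in size for n in row if n)

def total_cost(matrix, row_group, col_group, k, l):
    """Total encoding cost = code cost + description cost."""
    return (code_cost(matrix, row_group, col_group, k, l)
            + description_cost(matrix, row_group, col_group, k, l))
```

For a matrix made of two perfectly homogeneous diagonal blocks, grouping rows and columns to match those blocks drives the code cost to zero, while a single group pays one bit per cell when half the cells are ones.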

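Examples
Total encoding cost = $\sum_i (n_i^1 + n_i^0)\, H(p_i^1)$ (code cost) $+ \sum_i$ (cost of describing $n_i^1$ and $n_i^0$) (description cost)
One row group, one column group: code cost high, description cost low.
m row groups, n column groups (every row and every column in its own group): code cost low, description cost high.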
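The two extremes can be checked numerically for the code-cost term. The sketch below uses a small made-up matrix and hypothetical helper names; it only evaluates the code cost, since the slide does not spell out the description-cost formula.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_cost(matrix, row_group, col_group):
    """Sum over non-empty blocks of (number of cells) * H(fraction of ones)."""
    ones, size = {}, {}
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            block = (row_group[r], col_group[c])
            size[block] = size.get(block, 0) + 1
            ones[block] = ones.get(block, 0) + value
    return sum(n * entropy(ones[block] / n) for block, n in size.items())

# Illustrative 3 x 3 binary matrix (made up for this example).
matrix = [[1, 0, 1],
          [0, 0, 1],
          [1, 0, 0]]

# Extreme 1: one row group and one column group, so a single mixed block.
one_group = code_cost(matrix, [0, 0, 0], [0, 0, 0])

# Extreme 2: every row and column in its own group, so every block is one cell.
singletons = code_cost(matrix, [0, 1, 2], [0, 1, 2])
```

With singleton groups every block is a single cell, so each p is 0 or 1 and the code cost vanishes; the price is that the description has to cover one block per cell.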

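What makes a cross-association “good”?
[Figure: two groupings of the same matrix into row groups and column groups, one versus the other.]
Why is this better? Its total encoding cost is low.
Total encoding cost = $\sum_i (n_i^1 + n_i^0)\, H(p_i^1)$ (code cost) $+ \sum_i$ (cost of describing $n_i^1$ and $n_i^0$) (description cost)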

Algorithms
[Figure: the search passes through (k=1, l=2), (k=2, l=2), (k=2, l=3), (k=3, l=3), (k=3, l=4), (k=4, l=4), (k=4, l=5) and ends at k = 5 row groups, l = 5 column groups.]

Algorithms
Start with the initial matrix. Repeat: find good groups for fixed k and l, then choose better values for k and l, lowering the encoding cost. End with the final cross-associations (k = 5, l = 5 in the example).

Fixed k and l
[The algorithm diagram again: start with the initial matrix, find good groups for fixed k and l, choose better values for k and l, lower the encoding cost, reach the final cross-associations. This part covers finding good groups for fixed k and l.]

Fixed k and l
[Figure: matrix with row groups and column groups.]
Swaps: for each row, swap it to the row group which minimizes the code cost.

Fixed k and l
[Figure: matrix with row groups and column groups.]
Ditto for column swaps … and repeat …
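The swap step for fixed k and l might look like the sketch below. It is a brute-force illustration under stated assumptions: the function names are hypothetical, the code cost is recomputed from scratch for every candidate move (a practical implementation would presumably update it incrementally), ties keep a row in its current group, and the stopping rule (stop when a full pass no longer lowers the code cost) is an assumption, since the slide only says "and repeat".

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_cost(matrix, row_group, col_group):
    """Sum over non-empty blocks of (number of cells) * H(fraction of ones)."""
    ones, size = {}, {}
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            block = (row_group[r], col_group[c])
            size[block] = size.get(block, 0) + 1
            ones[block] = ones.get(block, 0) + value
    return sum(n * entropy(ones[block] / n) for block, n in size.items())

def swap_rows(matrix, row_group, col_group, k):
    """One pass: move each row to the row group that gives the lowest code cost."""
    row_group = list(row_group)
    for r in range(len(matrix)):
        current = row_group[r]
        best_group = current
        best_cost = code_cost(matrix, row_group, col_group)
        for g in range(k):
            if g == current:
                continue
            row_group[r] = g  # tentatively move row r to group g
            cost = code_cost(matrix, row_group, col_group)
            if cost < best_cost - 1e-12:
                best_group, best_cost = g, cost
        row_group[r] = best_group
    return row_group

def alternate_swaps(matrix, row_group, col_group, k, l, max_iters=20):
    """Alternate row and column swaps until the code cost stops decreasing."""
    transposed = [list(col) for col in zip(*matrix)]
    cost = code_cost(matrix, row_group, col_group)
    for _ in range(max_iters):
        row_group = swap_rows(matrix, row_group, col_group, k)
        # Column swaps are row swaps on the transposed matrix with roles exchanged.
        col_group = swap_rows(transposed, col_group, row_group, l)
        new_cost = code_cost(matrix, row_group, col_group)
        if new_cost >= cost - 1e-12:
            break
        cost = new_cost
    return row_group, col_group
```

Because a row only moves when the move strictly lowers the code cost, the cost never increases from pass to pass, which is what makes the simple stopping rule safe in this sketch.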

Choosing k and l
[The algorithm diagram again: start with the initial matrix, find good groups for fixed k and l, choose better values for k and l, lower the encoding cost, reach the final cross-associations. This part covers choosing better values for k and l.]

Choosing k and l
[Figure: matrix with k = 5 row groups and l = 5 column groups.]
Split:
1. Find the row group R with the maximum entropy per row.
2. Choose the rows in R whose removal reduces the entropy per row in R.
3. Send these rows to the new row group, and set k = k + 1.

Choosing k and l
[Figure: matrix with k = 5 row groups and l = 5 column groups.]
Split: similar for column groups too.
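The row split can be sketched as follows. The helper names are hypothetical, and two details are assumptions not stated on the slides: "entropy per row" is taken as the code cost of the group's blocks divided by the number of rows in the group, and rows are tested one at a time against the rows still remaining in R (a greedy, sequential reading of step 2).

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_per_row(matrix, rows, col_group, l):
    """ASSUMPTION: code cost of the blocks spanned by `rows`, divided by len(rows)."""
    if not rows:
        return 0.0
    ones = [0] * l
    size = [0] * l
    for r in rows:
        for c, value in enumerate(matrix[r]):
            size[col_group[c]] += 1
            ones[col_group[c]] += value
    cost = sum(n * entropy(o / n) for o, n in zip(ones, size) if n)
    return cost / len(rows)

def split_row_group(matrix, row_group, col_group, k, l):
    """Split the row group with maximum entropy per row; return (new groups, new k)."""
    members = {g: [r for r, gr in enumerate(row_group) if gr == g] for g in range(k)}
    # Step 1: the row group R with the maximum entropy per row.
    target = max(range(k), key=lambda g: entropy_per_row(matrix, members[g], col_group, l))
    remaining = list(members[target])
    base = entropy_per_row(matrix, remaining, col_group, l)
    moved = []
    # Step 2: rows whose removal lowers the entropy per row of what remains in R.
    for r in members[target]:
        rest = [x for x in remaining if x != r]
        if not rest:
            continue  # never empty R completely
        reduced = entropy_per_row(matrix, rest, col_group, l)
        if reduced < base - 1e-12:
            moved.append(r)
            remaining, base = rest, reduced
    if not moved:
        return list(row_group), k  # nothing to split
    # Step 3: send the chosen rows to a new group and set k = k + 1.
    new_group = list(row_group)
    for r in moved:
        new_group[r] = k
    return new_group, k + 1
```

The column split can reuse the same function on the transposed matrix, passing the column groups as `row_group`, the row groups as `col_group`, and exchanging k and l.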

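Algorithms
[The algorithm diagram again: start with the initial matrix, find good groups for fixed k and l, choose better values for k and l, lower the encoding cost, reach the final cross-associations.]
Find good groups for fixed k and l: swaps.
Choose better values for k and l: splits.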

Experiments
“Customer-Product” graph with Zipfian sizes, no noise: k = 5 row groups, l = 5 column groups.

Experiments
“Caveman” graph with Zipfian cave sizes, noise = 10%: k = 6 row groups, l = 8 column groups.

Experiments
“White Noise” graph: k = 2 row groups, l = 3 column groups.

Experiments
“CLASSIC” graph of documents & words: k = 15, l = 19.
[Figure axes: documents, words.]

Experiments
“GRANTS” graph of documents & words: k = 41, l = 28.
[Figure axes: NSF grant proposals, words in abstract.]

Experiments
“Who-trusts-whom” graph from epinions.com: k = 18, l = 16.
[Figure axis: Epinions.com user.]

Experiments
“Clickstream” graph of users and websites: k = 15, l = 13.
[Figure axes: users, webpages.]

Experiments
[Plot: time (secs) versus number of non-zeros, with curves for splits and swaps.]
Linear in the number of “ones”: scalable.

Conclusions
Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no “magic numbers”
3. Scalable to large graphs

Fixed k and l
[The algorithm diagram again: start with the initial matrix, find good groups for fixed k and l, choose better values for k and l, lower the encoding cost, reach the final cross-associations.]
Find good groups for fixed k and l: swaps.

Experiments
“Caveman” graph with Zipfian cave sizes, no noise: k = 5 row groups, l = 5 column groups.

Aim
Given any binary matrix, a “good” cross-association will have low cost.
But how can we find such a cross-association?
[Figure: matrix with k = 5 row groups and l = 5 column groups.]

Main Idea
Good compression implies better clustering.
Total encoding cost = $\sum_i \text{size}_i \cdot H(p_i)$ (code cost) + cost of describing the cross-associations (description cost)
Minimize the total cost.

Main Idea
Good compression implies better clustering.
How well does a cross-association compress the matrix?
• Encode the matrix in a lossless fashion
• Compute the encoding cost
• Low encoding cost → good compression → good clustering
