Dino Ienco, Ruggero G. Pensa and Rosa Meo {ienco,pensa,meo}@di.unito.it University of Turin, Italy Department of Computer Science ECML-PKDD 2009 – Bled (Slovenia)
Outline
- Motivations
- Our Idea
- Co-Clustering Background
- Hierarchical Co-Clustering
- Results
- Conclusions
MOTIVATIONS
Co-Clustering:
- an effective approach that obtains interesting results
- commonly applied to high-dimensional data
- partitions rows and columns simultaneously
Many co-clustering algorithms exist:
- spectral approach (Dhillon et al., KDD 2001)
- information-theoretic approach (Dhillon et al., KDD 2003)
- minimum sum-squared residue approach (Cho et al., SDM 2004)
- Bayesian approach (Shan et al., ICDM 2008)
All previous techniques:
- require the number of row/column clusters as a parameter
- produce flat partitions, without any structure information
In general:
- parameters are difficult to set
- structured output (like hierarchies) helps the user understand the data
Hierarchical structures are useful to:
- index and visualize data
- explore parent-child relationships
- derive generalization/specialization concepts
OUR IDEA
Proposed approach: extend a previous flat co-clustering algorithm (Robardet, 2002) to the hierarchical setting.
Combining co-clustering with a hierarchical approach allows us to build two hierarchies, one on each dimension, simultaneously.
BACKGROUND: CO-CLUSTERING
τ-CoClust (Robardet, 2002):
- co-clustering for count or frequency data
- no number of row/column clusters needed
- maximizes a statistical measure, the Goodman and Kruskal τ, between the row and column partitions
Goodman and Kruskal τ:
- measures the proportional reduction in the prediction error of a dependent variable given an independent variable

Example: data matrix over objects O1..O4 and features F1..F3:

        F1    F2    F3
  O1   d11   d12   d13
  O2   d21   d22   d23
  O3   d31   d32   d33
  O4   d41   d42   d43

The co-clusters CO1 = {O1, O2}, CO2 = {O3, O4} on rows and CF1 = {F2}, CF2 = {F1, F3} on columns give the contingency table (with row totals TO and column totals TF):

        CF1   CF2
  CO1   t11   t12 | TO1
  CO2   t21   t22 | TO2
        TF1   TF2
Goodman and Kruskal τ:
- E_CO: prediction error on CO without knowledge of the CF partition
- E_CO|CF: prediction error on CO with knowledge of the CF partition
- τ_CO|CF = (E_CO - E_CO|CF) / E_CO: the proportional reduction in the prediction error of the dependent variable (CO) given the independent variable (CF)
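As an illustration of the measure above, here is a minimal sketch (not the authors' code) that computes τ_CO|CF from the co-cluster contingency table; the function name `goodman_kruskal_tau` and the NumPy representation are our assumptions:

```python
import numpy as np

def goodman_kruskal_tau(T):
    """Goodman-Kruskal tau_{CO|CF} from a contingency table T whose
    rows are row clusters (CO, dependent variable) and whose columns
    are column clusters (CF, independent variable)."""
    P = T / T.sum()                    # joint probabilities p_ij
    p_row = P.sum(axis=1)              # marginals p_i. of CO
    p_col = P.sum(axis=0)              # marginals p_.j of CF
    e_co = 1.0 - np.sum(p_row ** 2)                        # E_CO
    e_co_cf = 1.0 - np.sum((P ** 2).sum(axis=0) / p_col)   # E_CO|CF
    return (e_co - e_co_cf) / e_co     # proportional error reduction
```

For example, a diagonal contingency table (perfect association between the two partitions) yields τ = 1, while a uniform table yields τ = 0.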
Optimization strategy:
- τ is asymmetrical; for this reason the algorithm alternates the optimization of two functions, τ_CO|CF and τ_CF|CO
- stochastic optimization (example on rows):

  # Start with an initial partition on rows
  for i in 1..n_times
      # Augment the current partition with an empty cluster
      # Move one element at random from one cluster to another
      # If the objective function improves, keep the solution; else undo the move
      # If there is an empty cluster, remove it
  end

- this optimization allows the number of clusters to grow or shrink
- in (Robardet, 2002) an efficient way to incrementally update the objective function was introduced
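The pseudocode above can be sketched in Python; this is a simplified illustration under assumptions, with a generic objective `obj` standing in for the incremental τ update (the name `stochastic_local_search` is ours):

```python
import random

def stochastic_local_search(n_rows, obj, n_times=1000, seed=0):
    """Move-based local search over row partitions, in the spirit of
    tau-CoClust: `obj` scores a partition (higher is better)."""
    rng = random.Random(seed)
    partition = [0] * n_rows          # initial partition: one cluster
    best = obj(partition)
    for _ in range(n_times):
        empty = max(partition) + 1    # augment with an empty cluster
        i = rng.randrange(n_rows)     # pick one element at random
        old = partition[i]
        new = rng.randrange(empty + 1)
        if new == old:
            continue
        partition[i] = new            # tentative move
        score = obj(partition)
        if score > best:
            best = score              # keep the improving move
        else:
            partition[i] = old        # undo the move
        # relabel clusters so any empty cluster disappears
        labels = {c: k for k, c in enumerate(sorted(set(partition)))}
        partition = [labels[c] for c in partition]
    return partition, best
```

Because a move may enter the fresh empty cluster or empty an existing one, the number of clusters can grow or shrink during the search, as noted above.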
HIERARCHICAL CO-CLUSTERING
HiCC:
- a hierarchical co-clustering algorithm that extends τ-CoClust
- divisive approach
- no parameter settings needed
- no predefined number of splits for each node of the hierarchy
HiCC:
  At the beginning, use τ-CoClust
  repeat
      from the current row/column partitions:
      fix the column partition
      for each cluster in the row partition:
          re-cluster with τ-CoClust, optimizing the objective function τ_CO|CF
      update the row hierarchy
      fix the new row partition
      for each cluster in the column partition:
          re-cluster with τ-CoClust, optimizing the objective function τ_CF|newCO
      update the column hierarchy
  until (TERMINATION)
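The divisive loop can be illustrated on one dimension; this toy sketch (our own, under assumptions) threads a generic `split` callback where HiCC would run τ-CoClust on a cluster with the other dimension's partition held fixed:

```python
def divisive_levels(clusters, split, max_levels=10):
    """Toy sketch of HiCC's divisive refinement on one dimension.
    `clusters` is a partition as a list of lists of element ids;
    `split(c)` re-clusters one cluster and returns its sub-clusters.
    Returns the hierarchy as a list of levels (root level first)."""
    hierarchy = [clusters]
    for _ in range(max_levels):          # stands in for TERMINATION
        next_level = []
        for c in hierarchy[-1]:
            subs = split(c)
            # keep the cluster unchanged if it could not be refined
            next_level.extend(subs if len(subs) > 1 else [c])
        if next_level == hierarchy[-1]:  # nothing was refined: stop
            break
        hierarchy.append(next_level)
    return hierarchy
```

In HiCC proper, this refinement alternates between rows and columns, so both hierarchies grow together level by level.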
A simple example (figure): the splitting goes on until the termination condition is satisfied.
RESULTS
Experimental setup:
- no previous hierarchical co-clustering algorithm exists
- we therefore compare against a flat co-clustering algorithm, run with the same number of clusters our approach obtains at each level
- we chose the information-theoretic approach (KDD 2003); for each level we perform 50 runs and compute the average
- we use document-word datasets to validate our approach:
  * OHSUMED (collection of PubMed abstracts): {oh0, oh15}
  * REUTERS-21578 (collected and labeled by the Carnegie Group): {re0, re1}
  * TREC (Text Retrieval Conference): {tr11, tr21}
An example of row hierarchy on OHSUMED (figure). We label each cluster with its majority class: Enzyme-Activation, Cell-Movement, Adenosine-Diphosphate, Staphylococcal-Infection, Uremia, Staphylococcal-Infection, Memory.
An example of column hierarchy on REUTERS (figure). We label each cluster with its top 10 words ranked by mutual information:
- oil, compani, opec, gold, ga, barrel, strike, mine, lt, explor
- tonne, wheate, sugar, corn, mln, crop, grain, agricultur, usda, soybean
- coffee, buffer, cocoa, deleg, consum, ico, stock, quota, icco, produc
- oil, opec, tax, price, dlr, crude, bank, industri, energi, saudi
- compani, gold, mine, barrel, strike, ga, lt, ounce, ship, explor
- tonne, wheate, sugar, corn, grain, crop, agricultur, usda, soybean, soviet
- mln, export, farm, ec, import, market, total, sale, trader, trade
- quota, stock, produc, meet, intern, talk, bag, agreem, negoti, brazil
- coffee, deleg, buffer, cocoa, consum, ico, icco, pact, council, rubber
External validation indices:
- Purity
- Normalized Mutual Information (NMI)
- Adjusted Rand Index
Hierarchical setting: we combine the per-level results of an external validation index using a weight α_i for each hierarchy level i; in our case α_i = 1/i.
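As an illustration of the level-weighted combination, here is a hedged sketch; the slide gives only the weights α_i = 1/i, so the weighted-average normalization and the name `combine_levels` are our assumptions:

```python
def combine_levels(level_scores):
    """Combine per-level external validation scores (Purity, NMI, or
    Adjusted Rand Index computed at each hierarchy level, level 1
    first) with weights alpha_i = 1/i, normalized to sum to one."""
    weights = [1.0 / i for i in range(1, len(level_scores) + 1)]
    total = sum(w * s for w, s in zip(weights, level_scores))
    return total / sum(weights)
```

Deeper levels thus contribute less, so the top of the hierarchy dominates the combined score.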
Performance results (figure).
Performance results on the re1 dataset (figure).
CONCLUSIONS
We propose a new approach to hierarchical co-clustering that:
- is parameter-free
- has no a priori fixed number of splits (n-ary splits)
- obtains good results
- builds hierarchies on both dimensions simultaneously
- improves the exploration of co-clustering results
Future work:
- parallelize the algorithm to improve time performance
- push constraints into the algorithm to exploit background knowledge
- extend the framework to manage continuous data
Any questions? Thank you for your attention.