Dino Ienco, Ruggero G. Pensa and Rosa Meo
University of Turin, Italy, Department of Computer Science
ECML-PKDD 2009 – Bled (Slovenia)

Outline: Motivations, Our Idea, Co-Clustering and Background, Hierarchical Co-Clustering, Results, Conclusions

MOTIVATIONS
Co-Clustering:
- an effective approach that obtains interesting results
- commonly applied to high-dimensional data
- partitions rows and columns simultaneously

MOTIVATIONS
Many co-clustering algorithms exist:
- spectral approach (Dhillon et al., KDD '01)
- information-theoretic approach (Dhillon et al., KDD '03)
- minimum sum-squared residue approach (Cho et al., SDM '04)
- Bayesian approach (Shan et al., ICDM '08)

MOTIVATIONS
All the previous techniques:
- require the number of row/column clusters as a parameter
- produce flat partitions, without any structural information

MOTIVATIONS
In general:
- parameters are difficult to set
- structured output (like hierarchies) helps the user understand the data
Hierarchical structures are useful to:
- index and visualize data
- explore parent-child relationships
- derive generalization/specialization concepts

OUR IDEA
Proposed approach: extend a previous flat co-clustering algorithm (Robardet02) to the hierarchical setting.
Co-clustering + a hierarchical approach ALLOWS building two hierarchies, one per dimension, SIMULTANEOUSLY.

BACKGROUND: CO-CLUSTERING
τ-CoClust (Robardet02):
- co-clustering for count or frequency data
- no number of row/column clusters needed
- maximizes a statistical measure, the Goodman-Kruskal τ, between row and column partitions

CO-CLUSTERING
Goodman and Kruskal τ:
- measures the proportional reduction in the prediction error of a dependent variable given an independent variable

Data matrix (objects O1..O4, features F1..F3):

          F1    F2    F3
    O1   d11   d12   d13
    O2   d21   d22   d23
    O3   d31   d32   d33
    O4   d41   d42   d43

Co-cluster contingency table, with CO1 = {O1, O2}, CO2 = {O3, O4}, CF1 = {F2}, CF2 = {F1, F3}:

           CF1   CF2   total
    CO1    t11   t12   TO1
    CO2    t21   t22   TO2
    total  TF1   TF2

CO-CLUSTERING
Goodman and Kruskal τ:
- measures the proportional reduction in the prediction error of a dependent variable given an independent variable
- E_CO: prediction error on CO without knowledge of the CF partition
- E_CO|CF: prediction error on CO with knowledge of the CF partition
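The formula itself did not survive the transcript. In the usual notation for the Goodman-Kruskal τ (a reconstruction, with p_ij the co-cluster proportions and p_i., p_.j the marginals of the contingency table above), the proportional reduction in error reads:

```latex
\tau_{CO|CF}
  = \frac{E_{CO} - \mathbb{E}\!\left[E_{CO|CF}\right]}{E_{CO}}
  = \frac{\sum_{i,j} \frac{p_{ij}^2}{p_{\cdot j}} \;-\; \sum_i p_{i\cdot}^2}
         {1 - \sum_i p_{i\cdot}^2}
```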

CO-CLUSTERING
Optimization strategy:
- τ is asymmetrical, so the algorithm alternates the optimization of two functions, τ_CO|CF and τ_CF|CO
- stochastic optimization (example on rows):

    # start with an initial partition on rows
    for i in 1..n_times
        # augment the current partition with an empty cluster
        # move one element at random from one cluster to another
        # if the objective function improves, keep the solution; otherwise undo the move
        # if there is an empty cluster, remove it
    end

- this optimization allows the number of clusters to grow or shrink
- in (Robardet02) an efficient way to update the objective function incrementally was introduced
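A minimal, runnable sketch of this loop, assuming the standard Goodman-Kruskal τ formula on the co-cluster contingency table. Function names, the naive (non-incremental) τ recomputation, and the toy acceptance rule are illustrative, not the authors' code:

```python
import random

def tau(D, rows, cols):
    """Goodman-Kruskal tau of the row partition given the column
    partition, computed on the co-cluster contingency table."""
    nr, nc = max(rows) + 1, max(cols) + 1
    T = [[0.0] * nc for _ in range(nr)]
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            T[r][c] += D[i][j]
    total = sum(map(sum, T))
    pr = [sum(T[r]) / total for r in range(nr)]                    # row-cluster marginals
    pc = [sum(T[r][c] for r in range(nr)) / total for c in range(nc)]
    num = sum((T[r][c] / total) ** 2 / pc[c]
              for r in range(nr) for c in range(nc) if pc[c] > 0)
    err0 = 1.0 - sum(p * p for p in pr)    # prediction error without CF knowledge
    if err0 == 0.0:                        # degenerate single-cluster partition
        return 0.0
    return (num - sum(p * p for p in pr)) / err0

def optimize_rows(D, rows, cols, n_times=1000, seed=0):
    """Stochastic local search on the row partition: move one object at
    random (possibly into a fresh, empty cluster), keep the move only if
    tau improves, and drop empty clusters by relabelling."""
    rng = random.Random(seed)
    rows = list(rows)
    best = tau(D, rows, cols)
    for _ in range(n_times):
        i = rng.randrange(len(rows))
        old = rows[i]
        new = rng.randrange(max(rows) + 2)   # max(rows)+1 is the added empty cluster
        if new == old:
            continue
        rows[i] = new
        ids = {c: k for k, c in enumerate(sorted(set(rows)))}  # contiguous ids
        cand = [ids[c] for c in rows]
        score = tau(D, cand, cols)
        if score > best:
            best, rows = score, cand         # keep the improving move
        else:
            rows[i] = old                    # undo the move
    return rows, best
```

On a block-diagonal toy matrix, τ is 1 for the perfectly aligned row partition and 0 for a fully mixed one, and the search only ever keeps improving moves.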

HIERARCHICAL CO-CLUSTERING
HiCC:
- a hierarchical co-clustering algorithm that extends τ-CoClust
- divisive approach
- no parameter settings needed
- no predefined number of splits for each node of the hierarchy

HIERARCHICAL CO-CLUSTERING
HiCC:
    at the beginning, run τ-CoClust
    repeat
        from the current row/column partitions:
        fix the column partition
        for each cluster in the row partition:
            re-cluster with τ-CoClust, optimizing the objective function τ_CO|CF
        update the row hierarchy
        fix the new row partition
        for each cluster in the column partition:
            re-cluster with τ-CoClust, optimizing the objective function τ_CF|newCO
        update the column hierarchy
    until TERMINATION
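The divisive loop for one dimension can be sketched as follows (the row side; columns are handled symmetrically). Here `split_cluster` stands in for a flat τ-CoClust call and is a placeholder, not the authors' implementation:

```python
def refine(partition, split_cluster):
    """One divisive level: independently re-cluster each cluster of
    `partition` (a list of index lists) with `split_cluster`, a flat
    clustering routine returning one sub-cluster label per index."""
    children = []
    for cluster in partition:
        labels = split_cluster(cluster)
        groups = {}
        for idx, lab in zip(cluster, labels):
            groups.setdefault(lab, []).append(idx)
        children.extend(groups.values())
    return children

def hicc_levels(n_rows, split_cluster, depth):
    """Build `depth` levels of a row hierarchy, starting from one root
    cluster holding all rows."""
    level = [list(range(n_rows))]
    hierarchy = [level]
    for _ in range(depth):
        level = refine(level, split_cluster)
        hierarchy.append(level)
    return hierarchy
```

Because each cluster is re-clustered on its own, the number of children per node is whatever the flat step finds, which is what makes the splits n-ary rather than fixed in advance.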

HIERARCHICAL CO-CLUSTERING
A simple example: the process goes on until the termination condition is satisfied.

RESULTS
Experimental setup:
- no previous hierarchical co-clustering algorithm exists to compare against
- we use a flat co-clustering algorithm with, at each level, the same number of clusters obtained by our approach
- we choose the information-theoretic approach (KDD '03) and, for each level, perform 50 runs and compute the average
- we use document-word datasets to validate our approach:
    * OHSUMED (collection of PubMed abstracts): oh0, oh15
    * REUTERS (collected and labeled by the Carnegie Group): re0, re1
    * TREC (Text Retrieval Conference): tr11, tr21

RESULTS
An example of row hierarchy on OHSUMED; we label each cluster with the majority class:
- Enzyme-Activation
- Cell-Movement
- Adenosine-Diphosphate
- Staphylococcal-Infection
- Uremia
- Staphylococcal-Infection
- Memory

RESULTS
An example of column hierarchy on REUTERS; we label each cluster with the top 10 words ranked by mutual information:
- oil, compani, opec, gold, ga, barrel, strike, mine, lt, explor
- tonne, wheate, sugar, corn, mln, crop, grain, agricultur, usda, soybean
- coffee, buffer, cocoa, deleg, consum, ico, stock, quota, icco, produc
- oil, opec, tax, price, dlr, crude, bank, industri, energi, saudi
- compani, gold, mine, barrel, strike, ga, lt, ounce, ship, explor
- tonne, wheate, sugar, corn, grain, crop, agricultur, usda, soybean, soviet
- mln, export, farm, ec, import, market, total, sale, trader, trade
- quota, stock, produc, meet, intern, talk, bag, agreem, negoti, brazil
- coffee, deleg, buffer, cocoa, consum, ico, icco, pact, council, rubber
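The labeling heuristic can be sketched as follows. The exact mutual-information variant used on the slide is not specified, so ranking words by their per-word contribution to the cluster/word mutual information is an assumption:

```python
from math import log

def top_words_by_mi(counts, k=10):
    """counts[c][w]: count of word w in cluster c. Rank words by the
    contribution of word w to the mutual information between the
    cluster variable and the word variable (a common labeling
    heuristic; assumed here, not taken from the paper)."""
    n_clusters = len(counts)
    n_words = len(counts[0])
    total = sum(sum(row) for row in counts)
    pc = [sum(row) / total for row in counts]                       # cluster marginals
    pw = [sum(counts[c][w] for c in range(n_clusters)) / total
          for w in range(n_words)]                                  # word marginals

    def mi(w):
        # sum_c p(c, w) * log( p(c, w) / (p(c) * p(w)) )
        s = 0.0
        for c in range(n_clusters):
            p = counts[c][w] / total
            if p > 0:
                s += p * log(p / (pc[c] * pw[w]))
        return s

    return sorted(range(n_words), key=mi, reverse=True)[:k]
```

A word concentrated in a single cluster scores high; a word spread evenly across clusters scores near zero and never makes the label.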

RESULTS
External validation indices:
- Purity
- Normalized Mutual Information (NMI)
- Adjusted Rand Index
Hierarchical setting: we combine the per-level results as a weighted sum, Index = Σ_i α_i · Index_i, where Index_i is one of the external validation indices computed at hierarchy level i and α_i is a weight for level i; in our case α_i = 1/i.
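A small sketch of this weighted combination. Whether and how the slide normalizes the weighted sum is not shown; dividing by the weight sum, as below, is an assumption that keeps the combined score in [0, 1]:

```python
def combined_index(level_scores):
    """Combine per-level external validation scores (purity, NMI, or
    adjusted Rand at hierarchy level i, 1-based) with weight 1/i, then
    normalize by the total weight (normalization is assumed)."""
    weights = [1.0 / i for i in range(1, len(level_scores) + 1)]
    weighted = sum(w * s for w, s in zip(weights, level_scores))
    return weighted / sum(weights)
```

With α_i = 1/i, levels near the root dominate the score, so a method has to get the coarse splits right to do well.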

RESULTS
Performance results

RESULTS
Performance results on the re1 dataset

CONCLUSIONS
We propose a new approach to hierarchical co-clustering that:
- is parameter-free
- has no a priori fixed number of splits (n-ary splits)
- obtains good results
- builds hierarchies on both dimensions simultaneously
- improves the exploration of co-clustering results

CONCLUSIONS
Future work:
- parallelize the algorithm to improve running time
- push constraints into the algorithm to exploit background knowledge
- extend the framework to handle continuous data

Any questions? Thank you for your attention.