An Architecture and Algorithms for Multi-Run Clustering
Rachsuda Jiamthapthaksin, Christoph F. Eick and Vadeerat Rinsurongkawong
Computer Science Department, University of Houston, TX

Outline
1. Motivation
2. Goals
3. Overview
4. Related work
5. An architecture and algorithms for multi-run clustering
6. Experimental results
7. Conclusion and future work

1. Motivation
The region discovery framework provides a family of clustering algorithms and a family of plug-in fitness functions. Today, domain experts manually select the parameters of these clustering algorithms; multi-run clustering instead relies on active learning to select the parameters automatically. The system is implemented within Cougar^2, an open source data mining and machine learning framework.

2. Goals
Given a spatial dataset O = {o_1, …, o_n}, a clustering algorithm seeks a clustering X that maximizes a fitness function q(X), where X = {x_1, x_2, …, x_k} with x_i ∩ x_j = ∅ for i ≠ j and x_1 ∪ … ∪ x_k ⊆ O.
The goal is to automatically find a set of distinct, high-quality clusters that originate from different runs.
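As a compact restatement of this objective (using the notation above, and under the convention of this region discovery framework that a clustering need not cover all of O):

```latex
X^{*} \;=\; \arg\max_{X = \{x_1,\dots,x_k\}} q(X)
\quad \text{s.t.} \quad x_i \cap x_j = \emptyset \ (i \neq j), \qquad \bigcup_{i=1}^{k} x_i \subseteq O
```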

3. Overview of Multi-Run Clustering – 1
Key hypothesis: better clustering results can be obtained by combining clusters that originate from multiple runs of a clustering algorithm.

3. Overview of Multi-Run Clustering – 2
Challenges:
- Selecting appropriate parameters for an arbitrary clustering algorithm.
- Determining which clusters should be stored as candidate clusters.
- Generating a final clustering from the candidate clusters.
- Providing alternative clusters, e.g. hotspots in spatial datasets at different granularities.

4. Related Work
Meta clustering [Caruana et al.]: first creates diverse clusterings, then groups them, and finally lets users choose the group of clusterings that best fits their needs.
Ensemble clustering [Gionis et al. 2005; Zeng et al.]: aggregates different clusterings into one consolidated clustering.

Definition of a State
A state s in a state space S (S ⊆ R^{2bm}): s = {s_1_min, s_1_max, …, s_m_min, s_m_max}, s_i ∈ R^{2b}.
A state s for CLEVER: s = {k'_min, k'_max, p_min, p_max, p'_min, p'_max}.
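A minimal Python sketch (names are illustrative, not the paper's implementation) of how a CLEVER state can be represented and how one concrete parameter setting might be drawn from it; the state is frozen so it can also serve as a key in a state utility table:

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class CleverState:
    # Interval bounds on CLEVER's parameters: k' (written k), p, and p' (written pp).
    k_min: int
    k_max: int
    p_min: int
    p_max: int
    pp_min: int
    pp_max: int

    def sample_parameters(self, rng=random):
        """Draw one concrete (k', p, p') setting from this state's intervals."""
        return {
            "k": rng.randint(self.k_min, self.k_max),
            "p": rng.randint(self.p_min, self.p_max),
            "pp": rng.randint(self.pp_min, self.pp_max),
        }

# The slide's state s2 = {k' in [11,20], p in [41,50], p' in [31,40]}
s2 = CleverState(11, 20, 41, 50, 31, 40)
print(s2.sample_parameters())   # e.g. {'k': 12, 'p': 45, 'pp': 40}
```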

5. An Architecture of the Multi-Run Clustering System
The system consists of a State Utility Learning unit, a Clustering Algorithm, a Storage Unit and a Cluster Summarization Unit, connected by steps S1–S6.
Steps in multi-run clustering:
S1: Parameter selection.
S2: Run a clustering algorithm.
S3: Compute a state feedback.
S4: Update the state utility table.
S5: Update the cluster list M.
S6: Summarize the discovered clusters into the final clustering M'.
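To make the control flow concrete, here is a minimal Python sketch of the S1–S6 loop. The component functions are passed in as callables, and their names (select_state, run_clustering, compute_feedback, update_utility, update_cluster_list, summarize) are placeholders for the units described on the following slides:

```python
# Sketch of the multi-run clustering loop (S1-S6); components are injected
# because their implementations (CLEVER, RCQ, CLM, DCR) live on later slides.
def multi_run_clustering(dataset, n_runs, states, utility,
                         select_state, run_clustering, compute_feedback,
                         update_utility, update_cluster_list, summarize):
    M = []                                          # stored cluster list
    for _ in range(n_runs):
        s = select_state(states, utility)           # S1: parameter selection
        params = s.sample_parameters()
        X = run_clustering(dataset, params)         # S2: one run of the clustering algorithm
        fb = compute_feedback(X, M)                 # S3: state feedback (e.g. RCQ)
        update_utility(utility, s, fb)              # S4: update the state utility table
        M = update_cluster_list(M, X)               # S5: keep distinct, high-quality clusters
    return summarize(M)                             # S6: final clustering M'
```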

Pre-processing step (S0): compute the statistics needed to set up the multi-run clustering system. We run m rounds of CLEVER, randomly selecting k', p and p'.

Step 1 (S1): Select the parameters of the clustering algorithm.
Examples of the policies (Fig. 2):
π1: Randomly select a state.
π2: Choose the state with the maximum state utility value.
π3: Choose a state in the neighborhood of the state having the maximum state utility value.
P(π1) = 0.2, P(π2) = 0.6, P(π3) = 0.2.
s1 = {k'_min=1, k'_max=10, p_min=1, p_max=10, p'_min=11, p'_max=20}
s2 = {k'_min=11, k'_max=20, p_min=41, p_max=50, p'_min=31, p'_max=40}
Selected parameter setting (drawn from s2): {k'=12, p=45, p'=40}
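A small Python sketch of this mixed selection policy, using the slide's probabilities; the states list, utility table and neighbors function are illustrative placeholders:

```python
import random

def select_state(states, utility, neighbors, p_random=0.2, p_best=0.6):
    """pi1 (prob 0.2): random state; pi2 (prob 0.6): best-utility state;
    pi3 (remaining prob): a neighbor of the best-utility state."""
    best = max(states, key=lambda s: utility.get(s, 0.0))
    r = random.random()
    if r < p_random:                          # pi1: explore a random state
        return random.choice(states)
    if r < p_random + p_best:                 # pi2: exploit the best known state
        return best
    return random.choice(neighbors(best))     # pi3: explore around the best state
```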

Step 2 (S2): Run CLEVER to generate a clustering X with respect to the given parameters (k'=12, p=45, p'=40), maximizing the plug-in fitness function q(X).
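The fitness function itself is not reproduced in this transcript; as a hedged sketch, the group's region discovery work typically scores a clustering by summing a cluster interestingness measure i(c) weighted by cluster size raised to a power β (the exact q(X) used here may differ):

```latex
q(X) \;=\; \sum_{c \in X} i(c)\cdot|c|^{\beta}, \qquad \beta \ge 1
```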

Step 3 (S3): Compute a state utility feedback using a relative clustering quality function (RCQ).
Let X = {x_1, …, x_k}, and let y_i be the cluster in the stored cluster list M that is most similar to x_i ∈ X.
Novelty(X, M) = (1 − similarity(X, M)) × Enhancement(X, M)
RCQ(X, M) = Novelty(X, M) × ||Speed(X)|| × ||q(X)||
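A minimal Python sketch of this feedback computation; similarity, enhancement, speed, q and the normalization norm (the ||·|| above) are passed in as placeholders because their exact definitions are not reproduced in the transcript:

```python
def rcq(X, M, similarity, enhancement, speed, q, norm):
    """Relative clustering quality of a new clustering X given stored clusters M."""
    novelty = (1.0 - similarity(X, M)) * enhancement(X, M)
    return novelty * norm(speed(X)) * norm(q(X))
```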

Step 4 (S4): Update the state utility. The utility update incorporates the new feedback into the state utility table, producing the updated utility U' for the selected state.
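The transcript does not give the update rule; the sketch below assumes a simple running-average (temporal-difference style) update purely for illustration:

```python
def update_utility(utility, state, feedback, alpha=0.2):
    """Move the stored utility of `state` toward the latest feedback value (assumed rule)."""
    old = utility.get(state, 0.0)
    utility[state] = old + alpha * (feedback - old)
```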

Step 5 (S5): Update the cluster list to maintain a set of distinct, high-quality clusters.
Let M be the current set of multi-run clusters, X a new clustering to be processed for updating M, θ_sim a similarity threshold, and r_th a reward storage threshold. X is processed as follows (Cluster List Management algorithm, CLM; Fig. 3):
FOR c ∈ X DO
  Let m be the most similar cluster in M to c;
  IF sim(m, c) > θ_sim AND Reward(m) < Reward(c) THEN replace(m, c, M)
  ELSE IF Reward(c) > r_th THEN insert(c, M)
  ELSE discard(c);
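A runnable Python rendering of CLM; the cluster objects and the sim and reward functions are placeholders for the paper's definitions:

```python
def clm_update(M, X, sim, reward, theta_sim, r_th):
    """Merge a new clustering X into the stored cluster list M (CLM, Fig. 3)."""
    for c in X:
        if M:
            m = max(M, key=lambda stored: sim(stored, c))   # most similar stored cluster
            if sim(m, c) > theta_sim and reward(m) < reward(c):
                M[M.index(m)] = c                           # replace(m, c, M)
                continue
        if reward(c) > r_th:
            M.append(c)                                     # insert(c, M)
        # otherwise: discard(c)
    return M
```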

Step 6 (S6): Generate a final clustering M' from the stored cluster list M using the Dominance-guided Cluster Reduction algorithm (DCR). DCR organizes the stored clusters into dominance graphs, each linking a dominant cluster to the clusters it dominates, and uses them to produce a final clustering with restricted cluster overlap.
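The slides do not spell out how dominance is decided; the sketch below assumes a greedy interpretation in which a higher-reward cluster dominates clusters that overlap it beyond a threshold, matching the "restricting cluster overlap" description in the conclusion:

```python
def dcr_reduce(M, reward, overlap, max_overlap=0.2):
    """Greedy reduction: keep a cluster only if it does not overlap an
    already-kept (dominant) cluster by more than max_overlap."""
    final = []
    for c in sorted(M, key=reward, reverse=True):   # highest-reward clusters first
        if all(overlap(c, kept) <= max_overlap for kept in final):
            final.append(c)                         # c becomes a dominant cluster
        # otherwise c is dominated by an already-kept cluster and is dropped
    return final
```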

6. Experimental Evaluation – 1
Evaluation of multi-run clustering on an earthquake dataset*:
- Show how multi-run clustering can discover interesting and alternative clusters in spatial data.
- We are interested in areas where deep earthquakes are in close proximity to shallow earthquakes.
- We use the High Variance interestingness function i(c) [Rinsurongkawong 2008] to find such regions (a sketch of such a function follows).
*The earthquake dataset is available on the website of the U.S. Geological Survey Earthquake Hazards Program.
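A hedged Python sketch of a high-variance style interestingness measure: regions whose depth variance clearly exceeds the dataset-wide depth variance score high. The threshold and exponent are illustrative, and the actual definition in [Rinsurongkawong 2008] may differ:

```python
from statistics import pvariance

def high_variance_interestingness(depths_in_c, depths_in_O, threshold=1.0, eta=1.0):
    """Reward regions whose depth variance exceeds the dataset-wide depth variance."""
    ratio = pvariance(depths_in_c) / pvariance(depths_in_O)
    return (ratio - threshold) ** eta if ratio > threshold else 0.0
```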

6. Experimental Evaluation – 2
Fig. 6: Top 5 clusters of X_TheBestRun (ordered by reward). Fig. 7: Multi-run clustering results: clusters in M'.

6. Experimental Evaluation – 3
Our system can find 70% of the new, high-quality clusters that do not exist in the best single run. With an overlap threshold of 0.2, 43% of the positive-reward clusters of the best run are not in M'.

6. Experimental Evaluation – 4
Fig. 8: The multi-run clustering result (in color) overlaid with the top 5 reward clusters of the best run (in black).

7. Conclusion – 1
We propose an architecture and a concrete system for multi-run clustering that copes with parameter selection for a clustering algorithm and obtains alternative clusters in a highly automated fashion. The system uses active learning to automate parameter selection, and various techniques to find clusters that are both distinct and of high quality on the fly.
We also propose the Dominance-guided Cluster Reduction algorithm, which post-processes the clusters from the multiple runs to generate a final clustering by restricting cluster overlap.

7. Conclusion – 2
The experimental results on the earthquake dataset support our claim that multi-run clustering outperforms single-run clustering with respect to clustering quality: it discovers additional novel, alternative, high-quality clusters and enhances the quality of the clusters found by single-run clustering.

7. Future Work
Systematically evaluate the use of utility learning in choosing the parameters of a clustering algorithm. The ultimate goal is to combine multi-run and multi-objective clustering in one system.

Thank you