18th International Conference on Database and Expert Systems Applications
Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star Clustering


Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star Clustering
Tok Wee Hyong, Derry Tanti Wijaya, Stéphane Bressan

Vector Space Clustering
Clustering in vector space naturally translates into a graph clustering problem on a dense graph: the vertices are the vectors, and each edge weight is the cosine of the corresponding vectors.
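As a minimal sketch of this translation (the toy vectors and the `cosine` and `similarity_graph` helpers are illustrative, not from the paper):

```python
import math

def cosine(u, v):
    # Cosine of the angle between vectors u and v.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_graph(vectors):
    # Dense weighted graph: one vertex per vector,
    # edge weight = cosine of the corresponding vectors.
    n = len(vectors)
    return {(i, j): cosine(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)}

docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
graph = similarity_graph(docs)
# Near-parallel vectors get a heavy edge; orthogonal vectors get weight 0.
```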

Star Clustering for a Graph [1]
Computes a vertex cover by a simple construction of star-shaped dense sub-graphs:
1. Lower-weight edges are pruned.
2. Vertices with higher degree (that are not satellites) are chosen in turn as Star centers.
3. Vertices connected to a center become satellites.
4. The algorithm terminates when every vertex is either a center or a satellite.
5. Each center and its satellites form a cluster.
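The five steps above can be sketched as follows (a minimal greedy implementation, assuming the graph is given as a weight map; not the authors' exact code):

```python
def star_clusters(edges, sigma):
    # edges: {(u, v): weight}; sigma: pruning threshold.
    # Step 1: prune lower-weight edges.
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if w >= sigma:
            adj[u].add(v)
            adj[v].add(u)
    clusters, satellites, centers = [], set(), set()
    # Step 2: consider vertices in order of decreasing degree,
    # skipping those already marked as satellites.
    for v in sorted(adj, key=lambda x: len(adj[x]), reverse=True):
        if v in satellites or v in centers:
            continue
        centers.add(v)
        # Step 3: neighbors of the new center become satellites.
        satellites |= adj[v]
        # Step 5: the center and its satellites form one cluster.
        clusters.append({v} | adj[v])
    # Step 4: the loop ends once every vertex is a center or a satellite.
    return clusters

clusters = star_clusters({("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "d"): 0.2},
                         sigma=0.5)
# Two clusters: {a, b, c} around center b, and the isolated vertex d.
```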

Star Clustering
- Does not require an a priori number of clusters.
- Allows clusters to overlap.
- Analytically guarantees a lower bound on the similarity between objects in each cluster.
- Computes more accurate clusters than either single-link or average-link hierarchical clustering.

Star Clustering
Two critical elements:
- the threshold σ for pruning edges
- the metric for selecting Star centers
Aslam et al. [1] derived a theoretical lower bound on the expected similarity between two satellites in a cluster, empirically shown to be a good estimate of the actual similarity. Current metrics for selecting Star centers do not leverage this finding. Our focus is therefore on the metric for selecting Star centers.

Extended Star Clustering
- Chooses Star centers using the complement degree of vertices.
- Allows Star centers to be adjacent to one another.
- Comes in two versions: unrestricted and restricted.

Our Proposal
Degree may not be the best metric. We propose metrics that consider the weights of edges in order to maximize intra-cluster similarity:
- Markov Stationary Distribution
- Lower Bound
- Average
- Sum

Markov Stationary Distribution
Similar to the idea of Google's PageRank algorithm [2]. Method:
1. The similarity graph is normalized into a symmetric Markov matrix A.
2. The stationary distribution of the matrix is computed as A* = (I − A)^−1.
3. Vertices are sorted by their stationary values and chosen in turn as Star centers.
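Setting the slide's closed form aside, a stationary distribution can also be approximated by simple power iteration; the sketch below makes that assumption on a toy row-normalized similarity matrix and is not the paper's exact procedure:

```python
def stationary(P, iters=200):
    # P: row-stochastic transition matrix (list of rows).
    # Returns an approximate stationary distribution via power iteration.
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Toy symmetric similarity matrix, row-normalized into a Markov matrix.
S = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.8],
     [0.1, 0.8, 0.0]]
P = [[w / sum(row) for w in row] for row in S]
pi = stationary(P)
# Vertices would then be sorted by pi and chosen in turn as Star centers;
# here the middle vertex, with the heaviest edges, ranks highest.
```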

Lower Bound
Theoretical lower bound on the expected similarity between satellite vertices:
cos(γij) ≥ cos(αi) cos(αj) + (σ / (σ + 1)) sin(αi) sin(αj)
It can be used to estimate the average intra-cluster similarity. The lower bound metric is this estimated average intra-cluster similarity when v is a Star center and v.adj are its satellites (n = |v.adj|):
lb(v) = ((Σ vi ∈ v.adj cos(αi))² + (σ / (σ + 1)) (Σ vi ∈ v.adj sin(αi))²) / n²
Computed on the pruned graph.
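lb(v) can be computed directly from the cosines between a candidate center and its neighbors on the pruned graph (a sketch with made-up cosine values; sin(αi) is recovered as √(1 − cos²(αi))):

```python
import math

def lb(cosines, sigma):
    # cosines: cos(alpha_i) between candidate center v and each neighbor v_i
    # on the pruned graph. Returns the estimated average intra-cluster
    # similarity if v were chosen as a Star center.
    n = len(cosines)
    sum_cos = sum(cosines)
    sum_sin = sum(math.sqrt(1.0 - c * c) for c in cosines)
    return (sum_cos ** 2 + (sigma / (sigma + 1.0)) * sum_sin ** 2) / n ** 2

score = lb([0.9, 0.8, 0.85], sigma=0.7)
```

The vertex with the highest lb score among those not yet satellites would be picked as the next Star center.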

Average and Sum
Approximations of the lower bound metric, computed on the pruned graph. For each vertex v:
ave(v) = (Σ vi ∈ v.adj cos(αi)) / degree(v)
sum(v) = Σ vi ∈ v.adj cos(αi)
The average metric is the square root of the first term in the lower bound metric.
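Both metrics reduce to one pass over a vertex's neighbor cosines (a sketch; `sum_metric` is named to avoid shadowing Python's built-in `sum`):

```python
def ave(cosines):
    # ave(v): mean cosine between v and its neighbors on the pruned graph.
    # Equals the square root of the first term of lb(v): (sum cos)^2 / n^2.
    return sum(cosines) / len(cosines)

def sum_metric(cosines):
    # sum(v): total cosine between v and its neighbors on the pruned graph.
    return sum(cosines)

a = ave([0.9, 0.8, 0.85])        # mean of the three cosines, 0.85
s = sum_metric([0.9, 0.8, 0.85])
```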

Markov, Lower Bound, Average, Sum Metrics
We integrate our proposed metrics into the Star algorithm and its variants to produce:
- Star-lb, Star-sum, Star-ave, Star-markov
- Star-extended-sum-(r), Star-extended-ave-(r)
- Star-extended-sum-(u), Star-extended-ave-(u)
- Star-online-sum, Star-online-ave

Experiments
- Compare performance with off-line and on-line Star clustering, and with restricted and unrestricted Extended Star clustering.
- Data: Reuters-21578, Tipster-AP, and our own collection, Google.
- Effectiveness measures: recall, precision, F1.
- Efficiency measure: running time.
- Also measured: sensitivity to σ.

Off-line Algorithms
Star-lb and Star-ave are the most effective, but Star-ave is much more efficient. Star-random performs comparably to the original Star when the threshold σ is the average similarity.

Off-line Algorithms
Effectiveness comparison

Off-line Algorithms
Efficiency comparison

Order of Stars
We empirically demonstrate that Star-ave indeed approximates Star-lb better than the other algorithms do, as evidenced by a similar choice of Star centers.

Order of Stars (on Tipster-AP)

Sensitivity to σ
Compared to the original Star:
- Star-ave and Star-markov converge to a maximum F1 at a lower threshold.
- The maximum F1 of Star-ave is higher.
- The F1 gradient of Star-ave and Star-markov is smaller.

Sensitivity to σ (F1 on Reuters)

Sensitivity to σ (F1 gradient on Reuters)

Extended Star
- Star-ave is more effective and efficient than Star-extended-(r).
- Star-extended-ave-(r) improves the effectiveness of Star-extended-(r).
- Similar findings are observed with Star-extended-(u).

Extended Star
Effectiveness comparison

Extended Star
Efficiency comparison

On-line Algorithms
Star-online-ave is more effective and efficient than the original on-line Star algorithm.

On-line Algorithms
Effectiveness comparison

On-line Algorithms
Efficiency comparison

Conclusion
- Current metrics for selecting Star centers are not optimal.
- We propose several new metrics for selecting Star centers that maximize intra-cluster similarity.
- The average metric is a fast and good approximation of the lower bound metric.
- Since intra-cluster similarity is maximized, precision is the measure that improves most.
- Our proposed average metrics yield up to 19.1% improvement in precision for off-line algorithms, 20.9% for on-line algorithms, and 102% for the Extended Star algorithm.

References
1. Aslam, J., Pelekhov, K., Rus, D.: The Star Clustering Algorithm. Journal of Graph Algorithms and Applications 8(1), 95–129 (2004)
2. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Proceedings of the Seventh International Conference on World Wide Web (1998)

Credits
This work was funded by the National University of Singapore ARG project R, "Mind Your Language: Corpora and Algorithms for Fundamental Natural Language Processing Tasks in Information Retrieval and Extraction for the Indonesian and Malay Languages".
Copyright © 2007 by Stéphane Bressan