Andreas Papadopoulos - [WI 2013] IEEE/WIC/ACM International Conference on Web Intelligence Nov. 17-20, 2013 Atlanta, GA USA A. Papadopoulos,

Slides:



Advertisements
Similar presentations
An Online News Recommender System for Social Networks Department of Computer Science University of Illinois at Urbana-Champaign Manish Agrawal, Maryam.
Advertisements

BiG-Align: Fast Bipartite Graph Alignment
The IEEE International Conference on Big Data 2013 Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz Computer Science Department.
Multi-label Relational Neighbor Classification using Social Context Features Xi Wang and Gita Sukthankar Department of EECS University of Central Florida.
The Small World of Software Reverse Engineering Ahmed E. Hassan and Richard C. Holt SoftWare Architecture Group (SWAG) University Of Waterloo.
Analysis and Modeling of Social Networks Foudalis Ilias.
University of Buffalo The State University of New York Spatiotemporal Data Mining on Networks Taehyong Kim Computer Science and Engineering State University.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.
SCS CMU Proximity Tracking on Time- Evolving Bipartite Graphs Speaker: Hanghang Tong Joint Work with Spiros Papadimitriou, Philip S. Yu, Christos Faloutsos.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
1 Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006.
Overview of Web Data Mining and Applications Part I
Determining the Significance of Item Order In Randomized Problem Sets Zachary A. Pardos, Neil T. Heffernan Worcester Polytechnic Institute Department of.
Analysis and Modeling of the Open Source Software Community Yongqin Gao, Greg Madey Computer Science & Engineering University of Notre Dame Vincent Freeh.
Temporal Event Map Construction For Event Search Qing Li Department of Computer Science City University of Hong Kong.
GRAPH ALGORITHMS AND DATABASES Dušan Zeleník. Common Tasks  Sorting  Searching  Inferring  Routing.
Page 1 WEB MINING by NINI P SURESH PROJECT CO-ORDINATOR Kavitha Murugeshan.
Tennessee Technological University1 The Scientific Importance of Big Data Xia Li Tennessee Technological University.
CS525: Big Data Analytics Machine Learning on Hadoop Fall 2013 Elke A. Rundensteiner 1.
Evolutionary Clustering and Analysis of Bibliographic Networks Manish Gupta (UIUC) Charu C. Aggarwal (IBM) Jiawei Han (UIUC) Yizhou Sun (UIUC) ASONAM 2011.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Efficient Identification of Overlapping Communities Jeffrey Baumes Mark Goldberg Malik Magdon-Ismail Rensselaer Polytechnic Institute, Troy, NY.
Automatically Extracting Data Records from Web Pages Presenter: Dheerendranath Mundluru
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
2015/10/111 DBconnect: Mining Research Community on DBLP Data Osmar R. Zaïane, Jiyang Chen, Randy Goebel Web Mining and Social Network Analysis Workshop.
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.
Glasgow 02/02/04 NN k networks for content-based image retrieval Daniel Heesch.
On Finding Fine-Granularity User Communities by Profile Decomposition Seulki Lee, Minsam Ko, Keejun Han, Jae-Gil Lee Department of Knowledge Service Engineering.
Discovering Meta-Paths in Large Heterogeneous Information Network
Topology and Evolution of the Open Source Software Community Advisors: Dr. Vincent W. Freeh Dr. Kevin Bowyer Supported in part by the National Science.
Uncovering Overlap Community Structure in Complex Networks using Particle Competition Fabricio A. Liang
On Node Classification in Dynamic Content-based Networks.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Stratified K-means Clustering Over A Deep Web Data Source Tantan Liu, Gagan Agrawal Dept. of Computer Science & Engineering Ohio State University Aug.
Zibin Zheng DR 2 : Dynamic Request Routing for Tolerating Latency Variability in Cloud Applications CLOUD 2013 Jieming Zhu, Zibin.
Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.
University at BuffaloThe State University of New York Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo Frequent.
EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
1 LinkClus: Efficient Clustering via Heterogeneous Semantic Links Xiaoxin Yin, Jiawei Han Univ. of Illinois at Urbana-Champaign Philip S. Yu IBM T.J. Watson.
1 Panther: Fast Top-K Similarity Search on Large Networks Jing Zhang 1, Jie Tang 1, Cong Ma 1, Hanghang Tong 2, Yu Jing 1, and Juanzi Li 1 1 Department.
Mining and Visualizing the Evolution of Subgroups in Social Networks Falkowsky, T., Bartelheimer, J. & Spiliopoulou, M. (2006) IEEE/WIC/ACM International.
The Structure of the Web. Getting to knowing the Web How big is the web and how do you measure it? How many people use the web? How many use search engines?
Page 1 PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi.
Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda.
CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.
1 Friends and Neighbors on the Web Presentation for Web Information Retrieval Bruno Lepri.
Relation Strength-Aware Clustering of Heterogeneous Information Networks with Incomplete Attributes ∗ Source: VLDB.
Stefanos Antaris Distributed Publish/Subscribe Notification System for Online Social Networks Stefanos Antaris *, Sarunas Girdzijauskas † George Pallis.
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.
Paper Presentation Social influence based clustering of heterogeneous information networks Qiwei Bao & Siqi Huang.
A Self-organizing Semantic Map for Information Retrieval Xia Lin, Dagobert Soergel, Gary Marchionini presented by Yi-Ting.
Term Project Proposal By J. H. Wang Apr. 7, 2017.
Exploring Social Tagging Graph for Web Object Classification
Discrete ABC Based on Similarity for GCP
PEGASUS: A PETA-SCALE GRAPH MINING SYSTEM
by Hyunwoo Park and Kichun Lee Knowledge-Based Systems 60 (2014) 58–72
Collective Network Linkage across Heterogeneous Social Platforms
DISTRIBUTED CLUSTERING OF UBIQUITOUS DATA STREAMS
RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng,
Community Distribution Outliers in Heterogeneous Information Networks
DataMining, Morgan Kaufmann, p Mining Lab. 김완섭 2004년 10월 27일
Department of Computer Science University of York
Adaptive entity resolution with human computation
Graph Clustering Based on Structural/Attribute Similarities
Efficient Cache-Supported Path Planning on Roads
Kostas Kolomvatsos, Christos Anagnostopoulos
Presentation transcript:

Andreas Papadopoulos - [WI 2013] IEEE/WIC/ACM International Conference on Web Intelligence Nov , 2013 Atlanta, GA USA A. Papadopoulos, G. Pallis, M. D. Dikaiakos Identifying Clusters with Attribute Homogeneity and Similar Connectivity in Information Networks

Andreas Papadopoulos - [WI 2013] DBLP Bibliographic Network An Online Social Network The Real World: Information Networks 2 Model such Information Networks as attributed multi-graphs

Andreas Papadopoulos - [WI 2013] Clustering The process of identifying groups of related data/objects in a dataset/information network Why? Discover hidden knowledge! 3 Co-author Networks Social Networks Clustering Recommend friendships, group targeted advertisement Recommending new collaborations NetworkApplications

Andreas Papadopoulos - [WI 2013] Challenges A vertex may belong to more than one cluster  Fuzzy clustering Cluster based on:  Structure  Attributes IR DM AI IR 4

Andreas Papadopoulos - [WI 2013] Challenges IR DM AI IR 5 Cluster based on: -Attributes -Structure

Andreas Papadopoulos - [WI 2013] Challenges IR DM AI IR Communities? 6 Cluster based on: -Attributes -Structure

Andreas Papadopoulos - [WI 2013] Challenges IR DM AI IR Similar Connectivity? 7 Cluster based on: -Attributes -Structure

Andreas Papadopoulos - [WI 2013] Challenges IR DM AI IR 8 Cluster based on: -Structure -Attributes

Andreas Papadopoulos - [WI 2013] Challenges How to balance the attribute and structural properties of the vertices? How to identify which link type is more important? A request to join a political group is more important than sharing a funny video How to identify which attribute is more important? The attribute political views of a person is clearly more important than its name or gender 9

Andreas Papadopoulos - [WI 2013] HASCOP Related Work Distance Based SA-Cluster (ACM TKDD 2011) Graph augmentation with attributes and random walks Different attributes importance PICS (SIAM SDM 2012) MDL Compression Similar connectivity Parameter Free Model Based BAGC (SIGMOD 2012) Bayesian Inference Model Directed graphs GenClus (VLDB 2012) EM algorithm Multi-graphs Different link types importance 10

Andreas Papadopoulos - [WI 2013] Related Work [1] H. Cheng, Y. Zhou, and J. X. Yu. Clustering large attributed graphs: A balance between structural and attribute similarities. ACM Trans. Knowl. Discov. Data, Feb [2] Z. Xu, Y. Ke, Y. Wang, H. Cheng, and J. Cheng. A model-based approach to attributed graph clustering. In Proceedings of SIGMOD ’12, [3] Y. Sun, C. C. Aggarwal, and J. Han. Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. Proc. VLDB Endow., Jan [4] L. Akoglu, H. Tong, B. Meeder, and C. Faloutsos. PICS: Parameter-free identification of cohesive subgroups in large attributed graphs. SDM, Apr DirectedWeighted Multi graph Attribute weights Link-Type weights Similar Connectivity Parameter Free SA-Cluster[1] ✔✔✔ BAGC[2] ✔ GenClus[3] ✔✔✔ PICS[4] ✔✔✔ HASCOP ✔✔✔✔✔✔

Andreas Papadopoulos - [WI 2013] HASCOP Objective Function Similar Connectivity Attribute Coherence Weight Adjustment Mechanism Clustering Process

Andreas Papadopoulos - [WI 2013] HASCOP Given function s(v i, c j ) the clustering objective function is: 13 Assigns vertices in the same cluster so as to exhibit both similar connectivity and attribute coherence

Andreas Papadopoulos - [WI 2013] IR DM AI IR Two vertices v i, v j have similar connectivity pattern if S(v i ) and S(v j ) highly overlap, where S(v i ) is the set containing v i itself and the vertices that v i connects to 14 Similar Connectivity represents how similar two vertices are based on their outgoing links Similar Connectivity

Andreas Papadopoulos - [WI 2013] 15 L0L0 Similar Connectivity

Andreas Papadopoulos - [WI 2013] Attribute Coherence Weighted Euclidean distance It is close to one if the attribute vector of v i is very close to the attribute centroid of c j 16

Andreas Papadopoulos - [WI 2013] HASCOP: Approach A vertex has high similarity with a cluster if both their similar connectivity and attribute coherence are high. 17

Andreas Papadopoulos - [WI 2013] Weight Adjustment Voting mechanism The weights are adjusted towards the direction of increasing the clustering objective function: If vertices in the same cluster are connected by link-type A then the weight of link-type A is increased If vertices in the same cluster share the same value for an attribute X then the weight of attribute X is increased 18

Andreas Papadopoulos - [WI 2013] Clustering Process Calculate Similarities Based on Structural and Attribute Properties Update clusters based on the new clustering configuration Θ Update Weights of Attributes and Link types Resulted Clusters 19 No Each vertex is assigned in a cluster Clusters converge? Yes Parallelized

Andreas Papadopoulos - [WI 2013] Evaluation Datasets Evaluation Measures Evaluations

Andreas Papadopoulos - [WI 2013] Datasets GoogleSP-23: Google Software Packages Built from software files installed on Cloud Software files are not densely connected components Vertex: software file Attributes: File Size File Type Last Access Time Last Content Modified Time Time of the most recent metadata change Link-types: File name similarities File path similarities

Andreas Papadopoulos - [WI 2013] Datasets DBLP: Bibliography Network Vertex: author Attributes: Number of publications Research area Link-types: Co-author relationship DatasetDBLP-1000GoogleSP-23 Nodes Edges Attributes25 Link Types12 Type of Graph Undirected 22

Andreas Papadopoulos - [WI 2013] Evaluation Measures Entropy Attribute properties Close to zero for attribute cohesive clusters For GoogleSP-23 dataset we measure: The percentage of clusters overlapping with a software package The percentage of software packages that were actually identified 23

Andreas Papadopoulos - [WI 2013] Evaluation – DBLP Very close to the lowest average entropy

Andreas Papadopoulos - [WI 2013] Evaluation – DBLP Successfully identified the importance of “Area of interest” against “No of publications”

Andreas Papadopoulos - [WI 2013] Evaluation – GoogleSP-23 Comparison to the “ground truth” Must identify the software packages HASCOP is closest to the “optimal” entropy 26

Andreas Papadopoulos - [WI 2013] Evaluation – GoogleSP-23 HASCOP found 51 clusters More than 80% of returned clusters by HASCOP and PICS are consisted of files from the same software packages 27

Andreas Papadopoulos - [WI 2013] Evaluation – GoogleSP-23 Almost all clusters ( >90% ) returned by HASCOP have full overlap with a software package Almost all (21 of 23) software packages have been identified 28

Andreas Papadopoulos - [WI 2013] Conclusions Future Work

Andreas Papadopoulos - [WI 2013] Conclusions HASCOP succeeded in returning clusters useful to many applications studying such information networks Correctly identified software packages installed on a Cloud infrastructure Experiments confirmed that HASCOP finds clusters characterized by attribute homogeneity Similar Connectivity is important 30

Andreas Papadopoulos - [WI 2013] Future Work Integrate into MinerSoft 1 (a software file search engine) Extend HASCOP to handle: Weighted multi-graphs Heterogeneous information networks Deploy to a large scale Hadoop cluster 1: Minersoft is available at: 31

Andreas Papadopoulos - [WI 2013] Andreas Papadopoulos A. Papadopoulos, G. Pallis, M. D. Dikaiakos { andpapad, gpallis, mdd cs.ucy.ac.cy Identifying Clusters with Attribute Homogeneity and Similar Connectivity in Information Networks Thank You! Laboratory for Internet Computing Department of Computer Science University of Cyprus 32

Andreas Papadopoulos - [WI 2013] References [1] L. Akoglu, H. Tong, B. Meeder, and C. Faloutsos. Pics: Parameter-free identification of cohesive subgroups in large attributed graphs. In Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012, Anaheim, CA, April [2] H. Cheng, Y. Zhou, and J. X. Yu. Clustering large attributed graphs: A balance between structural and attribute similarities. ACM Trans. Knowl. Discov. Data, 5(2):12:1–12:33, Feb [3] I. Choi, B. Moon, and H.-J. Kim. A clustering method based on path similarities of xml data. Data Knowl. Eng., 60(2):361–376, Feb [4] M. D. Dikaiakos, A. Katsifodimos, and G. Pallis. Minersoft: Software retrieval in grid and cloud computing infrastructures. ACM Trans. Internet Technol., 12(1):2:1–2:34, July [5] S. Jenkins and S. Kirk. Software architecture graphs as complex networks: A novel partitioning scheme to measure stability and evolution. Information Sciences, 177(12):2587 – 2601,

Andreas Papadopoulos - [WI 2013] References [6] S. E. Schaeffer. Graph clustering. Computer Science Review, 1(1):27 – 64, [7] Y. Sun, C. C. Aggarwal, and J. Han. Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. Proc. VLDB Endow., 5(5):394–405, Jan [8] Z. Xu, Y. Ke, Y. Wang, H. Cheng, and J. Cheng. A model-based approach to attributed graph clustering. In Proceedings of the 2012 international conference on Management of Data, SIGMOD ’12, pages 505–516, New York, NY, USA, ACM. [9] X. Zheng, D. Zeng, H. Li, and F. Wang. Analyzing open-source software systems as complex networks. Physica A: Statistical Mechanics and its Applications, 387(24):6190– 6200, [10] Y. Zhou, H. Cheng, and J. Yu. Clustering large attributed graphs: An efficient incremental approach. In Data Mining (ICDM), IEEE 10th International Conference on, pages 689–698, Dec [11] Y. Zhou, H. Cheng, and J. X. Yu. Graph clustering based on structural/attribute similarities. Proc. VLDB Endow., 2(1):718–729, Aug

Andreas Papadopoulos - [WI 2013] Minersoft

Andreas Papadopoulos - [WI 2013] Software Graph 36

Andreas Papadopoulos - [WI 2013] Evaluation – Runtime For the GoogleSP-23 dataset containing two different types of links, and five attributes HASCOP converged at the fourth iteration Weight adjustment process is time consuming The importance of each link type and attribute has been successfully identified HASCOP converges faster HASCOP reveals important characteristics of the information network under study 37