INARC Charu C. Aggarwal (I2 Contributions) Scalable Graph Querying and Indexing Task I2.2 Charu C. Aggarwal IBM Collaborators (across all tasks): Jiawei.

Slides:



Advertisements
Similar presentations
Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.
Advertisements

1 Copyright by Jiawei Han, modified by Charles Ling for cs411a/538a Data Mining and Data Warehousing v Introduction v Data warehousing and OLAP for data.
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Kaushik Chakrabarti(Univ Of Illinois) Minos Garofalakis(Bell Labs) Rajeev Rastogi(Bell Labs) Kyuseok Shim(KAIST and AITrc) Presented at 26 th VLDB Conference,
Multi-label Relational Neighbor Classification using Social Context Features Xi Wang and Gita Sukthankar Department of EECS University of Central Florida.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Case Study: BibFinder BibFinder: A popular CS bibliographic mediator –Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect,
Dynamic Bayesian Networks (DBNs)
Data preprocessing before classification In Kennedy et al.: “Solving data mining problems”
CPSC 689: Discrete Algorithms for Mobile and Wireless Systems Spring 2009 Prof. Jennifer Welch.
Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
On Community Outliers and their Efficient Detection in Information Networks Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
HCS Clustering Algorithm
Scaling Personalized Web Search Glen Jeh, Jennfier Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.
Scalable Information-Driven Sensor Querying and Routing for ad hoc Heterogeneous Sensor Networks Maurice Chu, Horst Haussecker and Feng Zhao Xerox Palo.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Visual Recognition Tutorial
Scalable Network Distance Browsing in Spatial Database Samet, H., Sankaranarayanan, J., and Alborzi H. Proceedings of the 2008 ACM SIGMOD international.
Recommender systems Ram Akella November 26 th 2008.
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
Jinhui Tang †, Shuicheng Yan †, Richang Hong †, Guo-Jun Qi ‡, Tat-Seng Chua † † National University of Singapore ‡ University of Illinois at Urbana-Champaign.
Data Mining Techniques
Developing Analytical Framework to Measure Robustness of Peer-to-Peer Networks Niloy Ganguly.
Task 1 (I1.1): Fundamentals of Context-aware Real-time Data Fusion.
Evolutionary Clustering and Analysis of Bibliographic Networks Manish Gupta (UIUC) Charu C. Aggarwal (IBM) Jiawei Han (UIUC) Yizhou Sun (UIUC) ASONAM 2011.
Mehdi Kargar Aijun An York University, Toronto, Canada Discovering Top-k Teams of Experts with/without a Leader in Social Networks.
1 A Bayesian Method for Guessing the Extreme Values in a Data Set Mingxi Wu, Chris Jermaine University of Florida September 2007.
An Integration Framework for Sensor Networks and Data Stream Management Systems.
Glasgow 02/02/04 NN k networks for content-based image retrieval Daniel Heesch.
On Graph Query Optimization in Large Networks Alice Leung ICS 624 4/14/2011.
P-Rank: A Comprehensive Structural Similarity Measure over Information Networks CIKM’ 09 November 3 rd, 2009, Hong Kong Peixiang Zhao, Jiawei Han, Yizhou.
Wei Feng , Jiawei Han, Jianyong Wang , Charu Aggarwal , Jianbin Huang
MURI: Integrated Fusion, Performance Prediction, and Sensor Management for Automatic Target Exploitation 1 Dynamic Sensor Resource Management for ATE MURI.
On Node Classification in Dynamic Content-based Networks.
Finding Top-k Shortest Path Distance Changes in an Evolutionary Network SSTD th August 2011 Manish Gupta UIUC Charu Aggarwal IBM Jiawei Han UIUC.
Zhuo Peng, Chaokun Wang, Lu Han, Jingchao Hao and Yiyuan Ba Proceedings of the Third International Conference on Emerging Databases, Incheon, Korea (August.
SCALABLE INFORMATION-DRIVEN SENSOR QUERYING AND ROUTING FOR AD HOC HETEROGENEOUS SENSOR NETWORKS Paper By: Maurice Chu, Horst Haussecker, Feng Zhao Presented.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
Computing & Information Sciences Kansas State University IJCAI HINA 2015: 3 rd Workshop on Heterogeneous Information Network Analysis KSU Laboratory for.
Most of contents are provided by the website Graph Essentials TJTSD66: Advanced Topics in Social Media.
Concept Switching Azadeh Shakery. Concept Switching: Problem Definition C1C2Ck …
Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.
SCALABLE INFORMATION-DRIVEN SENSOR QUERYING AND ROUTING FOR AD HOC HETEROGENEOUS SENSOR NETWORKS Paper By: Maurice Chu, Horst Haussecker, Feng Zhao Presented.
Lead: Roth (UIUC) Abdelzaher (UIUC) Huang (UIUC) Lei (IBM) Presented by: Tarek Abdelzaher.
DISTIN: Distributed Inference and Optimization in WSNs A Message-Passing Perspective SCOM Team
Graphs. Graphs Similar to the graphs you’ve known since the 5 th grade: line graphs, bar graphs, etc., but more general. Those mathematical graphs are.
Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda.
1 Finding Spread Blockers in Dynamic Networks (SNAKDD08)Habiba, Yintao Yu, Tanya Y., Berger-Wolf, Jared Saia Speaker: Hsu, Yu-wen Advisor: Dr. Koh, Jia-Ling.
Relation Strength-Aware Clustering of Heterogeneous Information Networks with Incomplete Attributes ∗ Source: VLDB.
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
NN k Networks for browsing and clustering image collections Daniel Heesch Communications and Signal Processing Group Electrical and Electronic Engineering.
CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Link mining ( based on slides.
Query-based Graph Cuboid Outlier Detection
Paper Presentation Social influence based clustering of heterogeneous information networks Qiwei Bao & Siqi Huang.
Presented by: Siddhant Kulkarni Spring Authors: Publication:  ICDE 2015 Type:  Research Paper 2.
1 On Demand Classification of Data Streams Charu C. Aggarwal Jiawei Han Philip S. Yu Proc Int. Conf. on Knowledge Discovery and Data Mining (KDD'04),
Dense-Region Based Compact Data Cube
Data Transformation: Normalization
Probabilistic Data Management
ICICLES: Self-tuning Samples for Approximate Query Answering
Query-Friendly Compression of Graph Streams
CSE572, CBS572: Data Mining by H. Liu
Range-Efficient Computation of F0 over Massive Data Streams
Topological Signatures For Fast Mobility Analysis
CSE572: Data Mining by H. Liu
Overview: Chapter 2 Localization and Tracking
Presentation transcript:

INARC Charu C. Aggarwal (I2 Contributions) Scalable Graph Querying and Indexing Task I2.2 Charu C. Aggarwal IBM Collaborators (across all tasks): Jiawei Han (UIUC), Guojun Qi (UIUC), Thomas Huang (UIUC), Tarek Abdelzaher (UIUC), Xifeng Yan (UCSB), Arjit Khan (UCSB), Nan Li (UCSB), Amotz Bar-Noy (CUNY), Simon Shamoun (CUNY) Other Tasks: I1.2 (funded), I3 (collaborator)

INARC Overview of Contributions  Project I1 Contributions:  Methods for Sensor Selection in Dynamic Information Networks (Collaboration with A. Bar-Noy (CUNY) and S. Shamoun (CUNY))=>Submitted to DCOSS 2011  Data fusion of heterogeneous data with the use of network links: specific application to text and visual data  QoI robust inference with the use of heterogeneous data fusion  Joint work with G. Qi (UIUC) and T. Huang (UIUC)  Papers accepted in CVPR, WWW 2011; one submitted to KDD 2011  Collaboration with Tarek Abdelzaher (UIUC) on data selection for regression and fact-finders for fusion=> Submitted to DCOSS 2011, Fusion 2011  Project I2 Contributions: Large Scale Indexing (Focus of this talk):  Methods for Indexing Massive Disk-Resident Graphs (Aggarwal (IBM), Zhao (UIUC), and Han (UIUC))=> Submit to PVLDB 2012  Methods for Indexing Dynamic Network Streams (Aggarwal (IBM), Khan (UCSB), Yan (UCSB))=> Submit to PVLDB 2012  Methods for label-based query index (Joint work with Li (UCSB))=>SDM 2011

INARC Overview of Contributions  Project I3 Contributions:  Unfunded Collaborator for project I3=> Actively collaborated with Jiawei Han on mining of networks with heterogeneous links and incomplete attributes (Task I3.1)  Designed methods for clustering heterogeneous information networks with heterogeneous links and incomplete attributes (joint work with Yizhou Sun (UIUC) and Jiawei Han (UIUC)) => Submitted to KDD 2011  Designed methods for link inference in the noise and heterogeneous information network scenario (joint work with Barbier (UIUC), Gupta (UIUC), Sun (UIUC) and Han (UIUC))=> Submitted to ASONAM 2011

INARC Scalable Indexing and Querying Massive Graphs (I2.2)  Indexing Methods for Large Scale Static and Dynamic Networks  Methods for Indexing Massive Disk-Resident Graphs (Aggarwal (IBM), Zhao (UIUC), and Han (UIUC))  Application to Shortest Path Queries  Can be extended to connectivity and other structural queries  Methods for Indexing Dynamic Network Streams (Aggarwal (IBM), Khan (UCSB), Yan (UCSB))  Applications such as social and information networks show continuous edge-base activity  Results in edge streams  Methods for frequency-based and structural indexing of graph streams

INARC Label-based Querying of Massive Dynamic Graphs  Many networks have labels associated with some of the nodes which may need to be learned or queried  Eg. Which node belongs to a specific topic?  Challenges:  Network may be dynamic with edges and nodes which continuously evolve over time  The nodes in the network may contain content, which needs to be used in the classification process  The classification-queries need to be resolved in real-time.  Designed methods for constructing dynamic structural index, which can be continuously updated and used for label-based queries (Aggarwal (IBM) and Li (UCSB)): Accepted to SIAM Conference on Data Mining, 2011.

INARC Indexing Graphs: The compression approach  Core idea is to coarsen the graph into a smaller number of nodes  Method for coarsening depends upon application:  In static (disk-resident) methods, we use dense pattern mining methods in order to coarsen graph into a smaller number of nodes  In dynamic (stream-based) methods, we use a hash-based probabilistic sketch in order to coarsen graph into a smaller number of nodes.  Use probabilistic query-processing on the coarsened graph in order to provide approximate responses to queries

INARC Static Indexing of Disk-Resident Graphs  Design methods for indexing massive disk-resident networks  Typical social and information networks may be massive and may need to be stored on disk  For example, for a network containing 10^7 nodes, we may have of the order of 10^(13) edges  Network cannot be held in main memory, and in some cases not even on disk  First present methods where network cannot be held in main memory, but is stored on disk  Design methods for shortest path indexing methods: For a given pair of nodes s and t, determine shortest distance between them  Very fast for memory resident graphs=> Very slow for disk- resident case

INARC Static Indexing of Disk-Resident Graphs  Core Approach for static indexing: Determine dense sets of nodes in the network using a pattern mining approach  Compress sets of nodes into supernodes  Node compression will result in self-loops: Eliminate self-loops  Perform query-processing on compressed graph  Note that passing through super-node is equivalent to passing through a set of edges: associate node-penalty with super-node  Observation: Vast majority of edges lie in dense regions which are compressed into self-loops  Massive reduction in size of network: allows main-memory storage for query-processing

INARC Setting Node Penalties (Observations)  When we run the query-processing technique on the coarsened graph as a surrogate, passing through a super-node is modeled as entering through a random node in the compressed subgraph and exiting another random node  Use probabilistic node penalty associated with such nodes corresponding to pairwise shortest-path distances in the compressed portion  Modified Query Processing: Determine shortest-path distance between s’ and t’ in compressed graph with node-penalties  Note that node-penalties are random variables and the shortest- path distance could be average or worst-case.  Important: Dense region compression leads to histogram with small number of buckets (most distances are 1 or 2 within region).

INARC Setting Node Penalties

INARC Query Processing  Start at node s’ and progressively maintain the random variables corresponding to the shortest average (or worst-case) distances to the nodes.  Maintain probabilistic label with each node containing the probabilistic histogram with shortest-path distance  Grow set S of nodes with probabilistic shortest path distances  In each iteration, we grow set S by one node after updating the node labels of its nearest neighbors (requires probabilistic addition of distances of shortest path to edge weight and node penalty).  Terminate on reaching sink node t’  Provides either approximation using compressed graph or access compressed fragments from disk to get exact distances

INARC Adding shortest path distances to node penalty

INARC Other details of approach  Other algorithmic components of approach  A dense pattern mining algorithm which is designed in order to determine the dense regions of the network for compression purposes  Design a sampling-based approach which uses a limited number of passes over the disk-resident data set  A method for reconstructing the shortest path from the compressed shortest path distance  Retrieve fragments from disk-resident data set  Use shortest path distance within each fragment to reconstruct the distance

INARC Experimental Results (DBLP Data Set)

INARC Dynamic Indexing of Massive Graph Streams (Joint work with A. Khan (UCSB) and X. Yan (UCSB))  The previous methods were designed for static indexing of massive graphs  In many applications, we need indexes for the case in which the graphs may continuously be updated over time (edge activity)  Edge-based activity in networks result in a stream of edges  Activity in a social network can be modeled as an edge-stream  Assumption is much more rigorous in previous case  Not enough space to hold even the edges on disk and their arrival times  Not enough space to hold even the total number of possible distinct edges on disk  Hard to perform structural analysis when the network is so dynamic and cannot even be stored at a given point in time

INARC Dynamic Indexing of Massive Graph Streams  Use a 3-dimensional hash-based sketch structure in order to create the dynamic index  Borrows the concept of a sketch structure from data stream analysis and generalizes it to the case of structural data  Queries which can be handled:  Determine the frequency of a given edge in the graph  Determine the aggregate frequency of edges in a subgraph  Determine the minimum frequency of edges in a subgraph  Determine all edges with frequency above a given threshold  Determine the connected components with frequency above a given threshold

INARC The sketch based approach: Core Idea  Model: Assume that edges are received continuously over time  Use a 3-dimensional matrix of size h X h X w  The parameter h indicates the range of the hashing  The parameter w indicates the number of hash functions  For an edge (i, j) the node i is hashed into the first dimension using k-th hash function g^k() with range [0, h-1]  The node j is hashed into the second dimension using k-th hash function g^k() in the range [0, h-1]  There are w slices of size h X h corresponding to different hash functions g^k() where k lies in the range [1, w]  Multiple hash functions provide robustness  The entire frequency behavior of edges is mapped into the structure of size h X h X w

INARC The sketch based approach: Core Idea  Model: Assume that edges are received continuously over time  Use a 3-dimensional matrix of size h X h X w  The parameter h indicates the range of the hashing  The parameter w indicates the number of hash functions  For an edge (i, j) the node i is hashed into the first dimension using k-th hash function g^k() with range [0, h-1]  The node j is hashed into the second dimension using k-th hash function g^k() in the range [0, h-1]  There are w slices of size h X h corresponding to different hash functions g^k() where k lies in the range [1, w]  Multiple hash functions provide robustness  The entire frequency behavior of edges is mapped into the structure of size h X h X w

INARC The sketch based approach: Stream Incremental Update  The update process is straightforward and requires an incremental step involving the application of w different hash functions for each incoming edge (i, j)  Start off by setting each cell in the h X h x w synopsis matrix V to 0  Assume incoming edge (i, j) and frequency r(i, j)  For each incoming edge (i, j) and k-th hash function g^k(), update the cell V(g^k(i), g^k(j), k) by r(i, j)  The update process is applied for each of the w different hash functions

INARC Subgraph Frequency-based Queries  Query: Determine frequency of edges in subgraph with node set S  For each edge (i, j) in subgraph S, determine the value of V(g^k(i), g^k(j), k) for different hash functions k  Determine the minimum value of V(g^k(i), g^k(j), k) over the different hash functions k  Sum up value over different edges  Key Result: Derived mathematical error bound => With probability at least 1 – (|S|*|S|/(h*h*f))^w the error is at most L.f, where L is the total frequency of edges received=> Details of proof available in paper.

INARC Inverse Frequency-based Queries  Query: Determine all edges with frequency above a given threshold  Key Augmentation to Data Structure: Each hash cell contains an inverted list of all the nodes which map to it  For a given frequency threshold determine all hash cells (for each hash function) above the threshold  Determine the intersection of all the inverted lists pointed to by determined hash cells=> using multiple hash functions to reduce collision error.  Report only edges which are present in the intersection of the hash functions => error reduces logarithmically with number of hash functions

INARC Connected-subgraph Queries  Query: Determine all connected subgraphs containing only edges with frequency above a given threshold  Determine edges using the approach discussed in previous query  Determine connected components with the use of the found edges  Numerous other structural queries can be handled with the use of the compressed graph.  Error bounds hold for a subset of queries

INARC Experimental Results (last.fm Data Set)

INARC Label-based querying of Massive Graphs (Joint work With N. Li (UCSB))  We assume that we have a massive graph with nodes interconnected by edges, and nodes which contain content.  The nodes and edges of the graph are continuously received over time.  Need to respond to label-based queries dynamically using both the structure and content  Need to use the content in the classification process  Core Idea:  Construct semi-bipartite augmentation of the network which uses the content in the form of synthetic word nodes  Create a fast inverted index which can respond to label-based queries.

INARC Augmented Bipartite Network

INARC Key Results  Method is designed for dynamic classification queries (no static pre-processing)  A dynamic classification index  Maintain the semi-bipartite augmentation dynamically  Maintain inverted representation with the word nodes=>Incorporation of content with structure  Perform random walk starting at query node and report majority label as the appropriate class  Theoretical Result: Sampling a certain number of walks reduces the error estimation exponentially=> Proof available in paper => Use of Hoeffding bound  Paper accepted in SDM, 2011.

INARC Experimental Results

INARC Military Relevance  Military information networks are extremely large and require dynamic and time-varying methods for querying and indexing  Require real-time responses to structural queries for data which is large and may be added to rapidly => network stream scenario  Methods are also applicable to resource-constrained environments even for very large data sets  Classification query provides methods to identify nodes of relevance in a given information network query => which is the node most relevant to a given topic? => Where is the information I want?

INARC The Path Ahead: Future Work  Extend the kinds of queries which can be handled by different kinds of methods:  Structural connectivity queries  Shortest paths in dynamic graph streams  Extend first method to dynamic case  Extend methods to uncertain graphs  Challenge of uncertainty: Expected path length methods do not work!  Even a small probability of disconnection of source and sink leads to expected path length of infinity  Design probabilistic model which uses the shortest path with threshold probability

INARC Publications (Accepted)  C. C. Aggarwal (IBM), N. Li (UCSB). On Dynamic Node-Classification in Content-based Networks, Accepted to the SDM Conference,  G. Qi (UIUC), C. C. Aggarwal (IBM), T. Huang. (UIUC) Towards Semantic Knowledge Propagation between Text and Web Images, Accepted to the WWW Conference,  G. Qi (UIUC), C. C. Aggarwal (IBM), T. Huang (UIUC). Towards Cross- Category Learning of Visual Concepts. Accepted to the CVPR Conference,  C. C. Aggarwal (IBM), A. Khan (UCSB), X. Yan (UCSB). On Flow Authority Discovery in Social Networks, Accepted to the SDM Conference,  C. C. Aggarwal, Y. Xie, P. Yu. On Community Discovery in Locally Heterogeneous Networks, Accepted to the SDM Conference,  C. C Aggarwal, Y. Zhao, P. Yu. On Wavelet Decomposition of Uncertain Streams, CIKM Conference,  M. Masud, L. Khan, B. Thuraisingham, C. Aggarwal (IBM), J Gao (UIUC), J. Han (UIUC), On Novel Class Detection in Concept Drifting Data Streams, ICDM Conference, 2010

INARC Publications (Submitted)  G. Qi (UIUC), C. C. Aggarwal (IBM), T. Huang (UIUC). Community Detection with Edge Content, Submitted to the SIGIR Conference,  Y. Sun (UIUC), C. C. Aggarwal (IBM), J. Han (UIUC), On Community Discovery in Heterogeneous Networks with Incomplete Attributes, Submitted to the KDD Conference,  G. Qi (UIUC), C, C, Aggarwal (IBM), T. Huang (UIUC), Transfer Learning with Distance Functions between Text and Images, Submitted to the KDD Conference,  C. C. Aggarwal (IBM), A. Bar-Noy (CUNY), S. Shamoun (CUNY). On Sensor Selection in Linked Information Networks, Submitted to the DCOSS Conference, 2011.

INARC Publications (Submitted)  Y. Sun (UIUC), C. C. Aggarwal (IBM), J. Han (UIUC), On Link Inference in Bibliographic Networks, Submitted to the ASONAM Conference,  M. Gupta (UIUC), C, C, Aggarwal (IBM), J. Han (UIUC), On Evolutionary Clustering and Analysis of Heterogeneous Information Networks, Submitted to the ASONAM Conference,  C. C. Aggarwal (IBM), A. Bar-Noy (CUNY), S. Shamoun (CUNY). On Sensor Selection in Linked Information Networks, Submitted to the DCOSS Conference,  C. C. Aggarwal (IBM), P. Zhao (UIUC), J. Han (UIUC). On Shortest Path Indexing with Disk-Resident Graphs, Submitted to the PVLDB Conference,  C. C. Aggarwal (IBM), A. Khan (UCSB), X. Yan (UCSB). On Query Processing of Massive Graph Streams, Submitted to the PVLDB Conference, 2012.

INARC Publications (Submitted)  T. Abdelzaher (UIUC), D. Wang (UIUC), H. Ahmadi (UIUC), J. Pasternack (UIUC), M. Gupta (UIUC), J. Han (UIUC), O. Fetemieh (UIUC), H. Le (UIUC) C. C. Aggarwal (IBM), On Bayesian Information of Fact-Finding in Information networks, Submitted to the Fusion Conference,  D. Wang (UIUC), H. Ahmadi (UIUC), T. Abdelzaher (UIUC), H. Chenji. R. Stoleru, C, C, Aggarwal (IBM). Data Models for Optimizing Quality-of- Information in Cost-Sensitive Data Fusion, Submitted to the DCOSS Conference, 2011.