Graph Summaries for Subgraph Frequency Estimation 1 Angela Maduko, 2 Kemafor Anyanwu, 3 Amit Sheth, 4 Paul Schliekelman 1 LSDIS Lab, University of Georgia.

Slides:

Advertisements

Similar presentations

Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science.

Advertisements

Learning Rules from System Call Arguments and Sequences for Anomaly Detection Gaurav Tandon and Philip Chan Department of Computer Sciences Florida Institute.

Multi-Document Person Name Resolution Michael Ben Fleischman (MIT), Eduard Hovy (USC) From Proceedings of ACL-42 Reference Resolution workshop 2004.

Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.

Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.

Evaluating “find a path” reachability queries P. Bouros 1, T. Dalamagas 2, S.Skiadopoulos 3, T. Sellis 1,2 1 National Technical University of Athens 2.

gSpan: Graph-based substructure pattern mining

Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, Tianyu Wo Capturing Topology in Graph Pattern Matching University of Edinburgh.

Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.

Fast Algorithms For Hierarchical Range Histogram Constructions

ABSTRACT We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space. Our first result.

Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.

Data Mining Association Analysis: Basic Concepts and Algorithms

Estimating the Selectivity of XML Path Expressions for Internet Scale Applications Ashraf Aboulnaga Alaa R. Alameldeen Jeffrey F. Naughton Computer Sciences.

Department of Computer Science, University of Maryland, College Park, USA TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.:

Approximating Sensor Network Queries Using In-Network Summaries Alexandra Meliou Carlos Guestrin Joseph Hellerstein.

Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz.

CS Lecture 9 Storeing and Querying Large Web Graphs.

1 Jun Wang, 2 Sanjiv Kumar, and 1 Shih-Fu Chang 1 Columbia University, New York, USA 2 Google Research, New York, USA Sequential Projection Learning for.

Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.

Data Mining Association Analysis: Basic Concepts and Algorithms

Comparing path-based and vertically-partitioned RDF databases Preetha Lakshmi & Chris Mueller 12/10/2007 CSCI 8715 Shashi Shekhar.

Predicting the Semantic Orientation of Adjective Vasileios Hatzivassiloglou and Kathleen R. McKeown Presented By Yash Satsangi.

Ensemble Learning: An Introduction

Novel Self-Configurable Positioning Technique for Multihop Wireless Networks Authors ： Hongyi Wu Chong Wang Nian-Feng Tzeng IEEE/ACM TRANSACTIONS ON NETWORKING,

Comparing path-based and vertically-partitioned RDF databases Preetha Lakshmi & Chris Mueller 12/10/2007 CSCI 8715 Shashi Shekhar.

CS 580S Sensor Networks and Systems Professor Kyoung Don Kang Lecture 7 February 13, 2006.

Elec471 Embedded Computer Systems Chapter 4, Probability and Statistics By Prof. Tim Johnson, PE Wentworth Institute of Technology Boston, MA Theory and.

By Ravi Shankar Dubasi Sivani Kavuri A Popularity-Based Prediction Model for Web Prefetching.

AM Recitation 2/10/11.

Subgraph Containment Search Dayu Yuan The Pennsylvania State University 1© Dayu Yuan9/7/2015.

G O D D A R D S P A C E F L I G H T C E N T E R 1 Global Precipitation Measurement (GPM) GV Data Exchange Protocol Mathew Schwaller GPM Formulation Project.

-1- Philipp Heim, Thomas Ertl, Jürgen Ziegler Facet Graphs: Complex Semantic Querying Made Easy Philipp Heim 1, Thomas Ertl 1 and Jürgen Ziegler 2 1 Visualization.

Materialized View Selection for XQuery Workloads Asterios Katsifodimos 1, Ioana Manolescu 1 & Vasilis Vassalos 2 1 Inria Saclay & Université Paris-Sud,

SPIN: Mining Maximal Frequent Subgraphs from Graph Databases Jun Huan, Wei Wang, Jan Prins, Jiong Yang KDD 2004.

Xiangnan Kong,Philip S. Yu Department of Computer Science University of Illinois at Chicago KDD 2010.

SemRank: Ranking Complex Relationship Search Results on the Semantic Web Kemafor Anyanwu, Angela Maduko, Amit Sheth LSDIS labLSDIS lab, University of Georgia.

Xiangnan Kong,Philip S. Yu Multi-Label Feature Selection for Graph Classification Department of Computer Science University of Illinois at Chicago.

LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.

Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina SantamariaJulio GonzaloJavier Artiles nlp.uned.es UNED,c/Juan del Rosal,

University at BuffaloThe State University of New York Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo Frequent.

23 November Md. Tanvir Al Amin (Presenter) Anupam Bhattacharjee Department of Computer Science and Engineering,

Searching and Ranking Documents based on Semantic Relationships PaperPaper presentation ICDE Ph.D. Workshop 2006 April 3rd, 2006, Atlanta, GA, USA This.

GStore: Answering SPARQL Queries Via Subgraph Matching Lei Zou 1, Jinghui Mo 1, Lei Chen 2, M. Tamer Özsu 3, Dongyan Zhao Peking University, 2 Hong.

Mining Graph Patterns Efficiently via Randomized Summaries Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han VLDB’09.

1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.

Computer Science 1 Using Clustering Information for Sensor Network Localization Haowen Chan, Mark Luk, and Adrian Perrig Carnegie Mellon University

@ Carnegie Mellon Databases 1 Finding Frequent Items in Distributed Data Streams Amit Manjhi V. Shkapenyuk, K. Dhamdhere, C. Olston Carnegie Mellon University.

CLUSTERING HIGH-DIMENSIONAL DATA Elsayed Hemayed Data Mining Course.

VizTree Huyen Dao and Chris Ackermann. Introducing example

CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Link mining ( based on slides.

Zijian Wang, Eyuphan Bulut, and Boleslaw K. Szymanski Center for Pervasive Computing and Networking and Department of Computer Science Rensselaer Polytechnic.

Discovering and Ranking Semantic Associations over a Large RDF Metabase Chris Halaschek, Boanerges Aleman- Meza, I. Budak Arpinar, Amit P. Sheth 30th International.

Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.

1 Parallel Datacube Construction: Algorithms, Theoretical Analysis, and Experimental Evaluation Ruoming Jin Ge Yang Gagan Agrawal The Ohio State University.

Ning Jin, Wei Wang ICDE 2011 LTS: Discriminative Subgraph Mining by Learning from Search History.

Gspan: Graph-based Substructure Pattern Mining

Boosted Augmented Naive Bayes. Efficient discriminative learning of

A Black-Box Approach to Query Cardinality Estimation

Probabilistic Data Management

Issues in Decision-Tree Learning Avoiding overfitting through pruning

A Consensus-Based Clustering Method

Query-Friendly Compression of Graph Streams

The Importance of Communities for Learning to Influence

Gong Cheng, Yanan Zhang, and Yuzhong Qu

Structure and Content Scoring for XML

Jongik Kim1, Dong-Hoon Choi2, and Chen Li3

Structure and Content Scoring for XML

Decision Trees for Mining Data Streams

Topological Signatures For Fast Mobility Analysis

Presentation transcript:

Graph Summaries for Subgraph Frequency Estimation 1 Angela Maduko, 2 Kemafor Anyanwu, 3 Amit Sheth, 4 Paul Schliekelman 1 LSDIS Lab, University of Georgia 2 Computer Science Department, North Carolina State University 3 Kno.e.sis Center, Wright State University Kno.e.sis 4 Statistics Department, University of Georgia The European Semantic Web Conference, Tenerife, Spain. June 1 – 5, This work is funded by NSF-ITR-IDM Award # and #071444

2 Select ?university ?professor where { ?project project_director ?professor. ?project spans ?research_area. ?research_area name “Semantic Web”. ?university employs ?professor. ?university located_in ?location. ?location name “USA”. } ?uni?proj employs ?prof project_director ?proj SW ?prof project_director spans ?prof ?uni USA employs located_in ?uni?proj employs ?prof project_director SW spans ?uni?proj employs ?prof project_director USA located_in ?university ?location ?project ?research_area ?professor Semantic Web USA name project_director located_inspans employs Optimizing Graph Pattern Queries ?university ?location ?project ?research_area ?professor Semantic Web USA name project_director located_inspans employs

3 Proposition and Challenges Go beyond maintaining statistics of triple patterns to maintaining those of more complex graph patterns Number of all graph patterns may be exponential Consider graph patterns of up to a fixed length (maxL) Representation structure for patterns such that patterns can be –Pruned to fit a specified budget, while preserving accuracy of estimates as much as possible –Tuned such that certain patterns are favored over others

4 conference1 pcmember1 submittedTo author2 authorOf publication1 pcMemberOf author1 authorOf author3 authorOf Canonical Label – DFS Coding(Yan et al) (1, 2, 100, 10, 13)3 (1, 2, 101, 12, 11)1 (1, 2, 102, 13, 11)1 (1, 2, 100, 10, 13) (3, 2, 100, 10, 13)3 (1, 2, 100, 10, 13) (2, 3, 102, 13, 11)3 (1, 2, 101, 12, 11) (3, 2, 102, 13, 11)1 (1, 2, 100, 10, 13) (3, 2, 100, 10, 13) (4, 2, 100, 10, 13)1 (1, 2, 100, 10, 13) (3, 2, 100, 10, 13) (2, 4, 102, 13, 11)3 (1, 2, 100, 10, 13) (2, 3, 102, 13, 11) (4, 3, 101, 12, 11)3

5 (1, 2, 100, 10, 13) (1, 2, 101, 12, 11) (1, 2, 102, 13, 11) (3, 2, 100, 10, 13)(2, 3, 102, 13, 11)(3, 2, 102, 13, 11) (4, 2, 100, 10, 13)(2, 4, 102, 13, 11) (4, 3, 101, 12, 11) Pattern Tree (P-Tree) (1, 2, 100, 10, 13)3 (1, 2, 101, 12, 11)1 (1, 2, 102, 13, 11)1 (1, 2, 100, 10, 13) (3, 2, 100, 10, 13)3 (1, 2, 100, 10, 13) (2, 3, 102, 13, 11)3 (1, 2, 101, 12, 11) (3, 2, 102, 13, 11)1 (1, 2, 100, 10, 13) (3, 2, 100, 10, 13) (4, 2, 100, 10, 13)1 (1, 2, 100, 10, 13) (3, 2, 100, 10, 13) (2, 4, 102, 13, 11)3 (1, 2, 100, 10, 13) (2, 3, 102, 13, 11) (4, 3, 101, 12, 11)3

6 Estimation from the P-Tree Patterns of length at most maxL –Traverse the tree, matching labels on query pattern to node labels For a pattern P of length k, k > maxL –Partition into non-disjoint patterns of length maxL, P 1, P 1, …, P k- maxL+1 –P i intersects P i+1 in all but one edge –Combine frequency of partitions under the conditional independence assumption

7 Pruning the P-Tree Estimation value of patterns(nodes) in the P-Tree –Number of children that can be estimated within some error bound –Entropy of the frequency distribution of its children Prune children of nodes with larger estimation values

8 Tuning the P-Tree Observed value –Assume importance threshold is given, we measure as a function of the number of patterns that are less important Final value then combines estimation and observed values Combination is such that the final value of any important node always exceeds that of an unimportant one

9 Maximal Dependence Tree (MD-Tree) –Tree representation of a statistical model of patterns cardinalities Base MD-Tree – Independence assumption –Edge patterns occur independently on any position in patterns of a given length Refined MD-Tree – Single point of dependence assumption –For patterns of a given length, there exists a position that exerts the most influence on the occurrence of edge patterns on others Complete MD-Tree – Completely Refined MD-Tree

Estimation from the MD-Tree Patterns of length at most maxL, we combine statistics from the MD-Tree –Under the independence assumption –Under the single point of dependence assumption Patterns of length k, k > maxL –Partition into non-disjoint patterns of length maxL as before –Estimate using conditional independence 10

11 Pruning the MD-Tree Explore the space between the base and Complete MD-Tree Pick the MD-Tree that –best fits the budget –favors subtrees with wider deviations from the estimation assumptions

12 Tuning the MD-Tree Pick the MD-Tree that –best fits the budget –favors subtrees with wider deviations from the estimation assumption –favors subtrees created from a larger number of important patterns

13 Evaluation SwetoDBLPTOntoGen Number of nodes Number of edges Number of unique edge labels879 Size of patterns of length  3 (bytes) Size of unpruned P-Tree (bytes) (95% reduction)4916(55% reduction) Size of unpruned MD-Tree (bytes) (95% reduction)7554(31% reduction) SwetoDBLP – from LSDIS, part of the DBLP (RDF) enhanced to include more relationships amongst entities. Follows a Zipfian distribution TOntogen – from LSDIS, random node-degree distribution

14 SwetoDBLP 10KB summary – 4% of original P-Tree/MD-Tree with 20% of queries in workload estimated with 0 error and 40%  KB summary – 20% of original P-Tree/MD-Tree with 50% of queries in workload estimated with 0 error and 70%  32

15 TOntogen 1000 bytes summary – 20% of original P-Tree/MD-Tree –P-Tree: 20% of queries in workload are estimated with 0 error and 25%  bytes summary – 30% of original P-Tree/MD-Tree –p-Tree: 40% of queries in workload are estimated with 0 error and 45%  32

16 Summaries tuned for Frequent Patterns P-Tree is more amenable to tuning than MD-Tree, with 90% of all queries in workload estimated with 0 error for SwetoDBLP dataset and 50% estimated with 0 error for the Military dataset SwetoDBLP TOntoGen

Conclusion Frequency of graph patterns are useful for query optimization Two representation structures, P-Tree and MD-Tree –With pruning to fit a specified budget –Tuning to favor certain patterns Although P-Tree exhibits better performance in terms of accuracy of estimates, in more recent experiments, MD- Tree performed equally well for optimizing graph pattern queries in almost all tested cases Expensive discovery of patterns is done offline as a pre- processing step 17

18 A comprehensive evaluation of the effectiveness of our summaries for query processing More compact data structure to reduce the space overhead of the MD-Tree Estimating patterns in graphs whose nodes/edges may be arranged in subsumption hierarchies. Extend to gracefully accommodate updates to the data graph into the summaries. Future Work