Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions.

Presentation transcript:

Presented by Ozgur D. Sahin

Outline
- Introduction
- Neighborhood Functions
- ANF Algorithm
- Modifications
- Experimental Results
- Data Mining using ANF
- Conclusions

Introduction & Motivation
Graph-based data is becoming more important:
- Internet modeling, academic citations, phone records, movie databases, CAD circuits
Example questions:
- How robust is the Internet to failures?
- What are the most influential database papers?
- What is the best opening move in tic-tac-toe?
- Are phone call patterns in Asia similar to those in the U.S.?
Goal: Quickly answer questions on graph-represented data

Answering Questions
We can answer these questions if we can compute the following three properties related to connectivity and neighborhood structure:
- Graph Similarity: Decide whether two graphs have similar connectivity/neighborhood structure
- Subgraph Similarity: Compare how two subgraphs of a given graph are connected
- Vertex Importance: Assign an importance to each node based on its connectivity
This paper provides such a tool: ANF (Approximate Neighborhood Function)

Challenges
The following properties should be satisfied:
- Error guarantees: Accurate estimates
- Fast: Scale linearly with n (# of nodes) and m (# of edges)
- Low storage
- Adapts to available memory
- Parallelizable
- Sequential scan of the edge file
- Estimates per node

Definitions - Neighborhood Functions
dist(u,v): # of edges on the shortest path from u to v
Define the following neighborhood functions:

Definitions - Neighborhood Functions
Generalize these two definitions to deal with subgraphs:
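A minimal sketch of these definitions, assuming that IN(u,h) counts the nodes v with dist(u,v) <= h (u itself included), that N(h) is the sum of IN(u,h) over all u, and that the generalized forms restrict the start nodes to a set S and the counted nodes to a set C. The function and parameter names are illustrative, and the exact breadth-first computation is only a reference point, not the ANF method.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return {node: hop distance} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def individual_nf(adj, u, h, targets=None):
    """IN(u, h, C): nodes v in C with dist(u, v) <= h (C defaults to all nodes)."""
    dist = bfs_distances(adj, u)
    return sum(1 for v, d in dist.items()
               if d <= h and (targets is None or v in targets))

def neighborhood_nf(adj, h, sources=None, targets=None):
    """N(h, S, C): sum of IN(u, h, C) over u in S (S defaults to all nodes)."""
    starts = list(adj) if sources is None else sources
    return sum(individual_nf(adj, u, h, targets) for u in starts)

# Toy graph: an undirected ring 0-1-2-3-4-0.
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
in_2_1 = individual_nf(ring, 2, 1)   # nodes 1, 2 and 3
n_1 = neighborhood_nf(ring, 1)       # every node sees itself and two neighbors
```

On the ring, each node reaches three nodes within one hop, so the exact N(1) is 15. Running one traversal per node is what makes this approach too slow for large graphs.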

Basic ANF Algorithm
N(h) can be computed by a graph traversal:
- Graph traversal accesses edges in random order
- Running time is O(nm)
Access edges in sequential order:
- M(x,h) is the set of nodes within distance h of node x
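The sequential-scan idea can be sketched with exact sets before any approximation is introduced. This assumes the recurrence that M(x,h) starts as M(x,h-1) and absorbs M(y,h-1) for every edge (x,y); the function name and the edge-list representation are illustrative.

```python
def neighborhood_sets(nodes, edges, max_h):
    """Yield (h, M) where M[x] is the exact set of nodes within h hops of x."""
    prev = {x: {x} for x in nodes}                    # M(x, 0) = {x}
    for h in range(1, max_h + 1):
        cur = {x: set(s) for x, s in prev.items()}    # start from M(x, h-1)
        for x, y in edges:                            # one sequential pass over the edges
            cur[x] |= prev[y]                         # M(x, h) absorbs M(y, h-1)
        prev = cur
        yield h, cur

# Undirected ring on 5 nodes, stored as directed edges in both directions.
ring_edges = [(i, (i + 1) % 5) for i in range(5)] + [((i + 1) % 5, i) for i in range(5)]
results = dict(neighborhood_sets(range(5), ring_edges, 2))
n_1 = sum(len(s) for s in results[1].values())
```

Each iteration touches the edge list once, front to back, which is what makes the scheme friendly to data stored on disk. The cost moves into storing and merging the sets, which is the problem the following slides address.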

Basic ANF Algorithm
How to compute the number of distinct elements in the set M(x,h):
- A dictionary data structure: O(n^2 log n) time/space
- Use bits to mark membership: O(n^2) space
- Use a ‘probabilistic counting algorithm’: approximate set sizes using ‘log n+r’ bits

Probabilistic counting algorithm
Approximate set sizes using ‘log n+r’ bits
Instead of one bit per node, give half the nodes bit 0, a quarter of them bit 1, and so on (a node is given bit i with probability 1/2^(i+1))
The approximation of the size of a set is proportional to 2^b, where b is the least bit that has not been set in the bit representation of this set
Use k parallel approximations:
- M(x,h) is represented by k(log n+r) bits
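A minimal sketch of this counting scheme, under two assumptions that the slide does not state: the proportionality constant is taken to be the Flajolet-Martin value 0.77351, and bit 0 is stored as the least significant bit of a Python integer (the slides draw it as the leftmost bit). All names are illustrative.

```python
import random

FM_CONSTANT = 0.77351  # assumed correction factor (Flajolet-Martin counting)

def random_bit(rng, num_bits):
    """Pick bit i with probability 1/2^(i+1); the last bit absorbs the leftover mass."""
    i = 0
    while i < num_bits - 1 and rng.random() < 0.5:
        i += 1
    return i

def least_zero_bit(mask):
    """Index of the lowest bit that is still 0 in an integer bitmask."""
    b = 0
    while mask & (1 << b):
        b += 1
    return b

def estimate_size(masks):
    """Estimate a set size as 2^(average least zero bit) / constant over k masks."""
    avg = sum(least_zero_bit(m) for m in masks) / len(masks)
    return 2 ** avg / FM_CONSTANT

def sketch_of_set(elements, node_masks):
    """OR together the per-element masks, separately for each of the k trials."""
    k = len(next(iter(node_masks.values())))
    combined = [0] * k
    for e in elements:
        for j in range(k):
            combined[j] |= node_masks[e][j]
    return combined

rng = random.Random(0)
n, k, num_bits = 1000, 64, 16
node_masks = {v: [1 << random_bit(rng, num_bits) for _ in range(k)] for v in range(n)}
estimate = estimate_size(sketch_of_set(range(300), node_masks))  # true size is 300
```

Because a union of sets becomes a bitwise OR of their masks, duplicates are absorbed for free. That is the property the edge scan relies on.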

Basic ANF Algorithm
Consider a ring with 5 nodes:
- Example for k=3 and r=0
- Bit 0 is the leftmost bit in each 3-bit mask
M(2,1) is the union of M(2,0), M(1,0), and M(3,0):
- M(2,1) = M(2,0) OR M(1,0) OR M(3,0)
IN(2,1) is computed from the average of the least zero bit positions:
- Avg = (2+1+1)/3 = 4/3
- IN(2,1) = 2^(4/3) / 0.77351 ≈ 3.25
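The arithmetic of this example can be replayed in a few lines. The initial masks below are hypothetical, chosen only so that the least zero bit positions come out as 2, 1 and 1 as on the slide. The constant 0.77351 is likewise an assumption (the Flajolet-Martin correction factor).

```python
FM_CONSTANT = 0.77351  # assumed correction factor

def or_masks(*masks):
    """Bitwise OR of equal-length bit strings such as '100'."""
    return "".join("1" if any(m[i] == "1" for m in masks) else "0"
                   for i in range(len(masks[0])))

def least_zero_position(mask):
    """Position of the first '0', counting bit 0 as the leftmost character."""
    pos = mask.find("0")
    return len(mask) if pos == -1 else pos

# Hypothetical M(x,0) for nodes 1, 2 and 3: k = 3 masks of 3 bits each.
M0 = {
    1: ["010", "100", "100"],
    2: ["100", "001", "100"],
    3: ["100", "100", "100"],
}

# M(2,1) = M(2,0) OR M(1,0) OR M(3,0), applied mask by mask.
M_2_1 = [or_masks(M0[2][j], M0[1][j], M0[3][j]) for j in range(3)]
positions = [least_zero_position(m) for m in M_2_1]
average = sum(positions) / len(positions)
estimate = 2 ** average / FM_CONSTANT
```

With these masks the estimate comes out at roughly 3.26, in line with the slide's 3.25 and close to the true value of 3 (node 2 plus its two ring neighbors).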

Basic ANF Algorithm
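Putting the pieces together, the basic algorithm might look like the sketch below. It is an illustration under stated assumptions rather than the paper's exact pseudocode: masks are Python integers with bit 0 as the least significant bit, the constant 0.77351 is assumed, and the function name, parameters and defaults are invented for this example.

```python
import random

FM_CONSTANT = 0.77351  # assumed correction factor

def anf(nodes, edges, max_h, k=64, r=5, seed=0):
    """Return {h: estimated N(h)} for h = 0..max_h, using k bitmasks per node."""
    rng = random.Random(seed)
    nodes = list(nodes)
    num_bits = len(nodes).bit_length() + r        # roughly log n + r bits

    def random_mask():
        i = 0
        while i < num_bits - 1 and rng.random() < 0.5:
            i += 1
        return 1 << i                             # bit i set with probability 1/2^(i+1)

    def least_zero_bit(mask):
        b = 0
        while mask & (1 << b):
            b += 1
        return b

    def estimate(masks):
        avg = sum(least_zero_bit(m) for m in masks) / k
        return 2 ** avg / FM_CONSTANT

    prev = {x: [random_mask() for _ in range(k)] for x in nodes}   # M(x, 0)
    result = {0: sum(estimate(prev[x]) for x in nodes)}
    for h in range(1, max_h + 1):
        cur = {x: list(m) for x, m in prev.items()}                # M(x, h) = M(x, h-1)
        for x, y in edges:                                         # sequential edge scan
            cur[x] = [a | b for a, b in zip(cur[x], prev[y])]      # OR in M(y, h-1)
        result[h] = sum(estimate(cur[x]) for x in nodes)
        prev = cur
    return result

ring_edges = [(i, (i + 1) % 5) for i in range(5)] + [((i + 1) % 5, i) for i in range(5)]
result = anf(range(5), ring_edges, max_h=3, k=8)
```

The estimates never decrease with h, because OR can only set more bits. On the 5-node ring they stop changing after h=2, when every node's masks already cover the whole graph. Estimates for very small sets are known to be biased with this kind of counter, so a toy graph shows the mechanics rather than the accuracy.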

Modifications
M(x,h) uses M(y,h-1) but not M(y,h-2), so keep only the M(y,h-1) masks during iteration h
Include a mark bit to handle generalized neighborhood functions
Break bit masks into smaller pieces if they are larger than the available memory

Leading Ones Compression
As ANF runs, most bit masks will have many leading 1’s
Compress bit masks by including a counter of the leading ones
Bit shuffling of k parallel bit masks enables further compression:
- 11010, 11100
- Provides up to 23% speed-up
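A small sketch of the two ideas on this slide, using the slide's masks 11010 and 11100. It assumes that "bit shuffling" means interleaving the bits of the parallel masks so that their leading ones end up in one run. The helper names and the (counter, remainder) representation are illustrative.

```python
def compress_leading_ones(bits):
    """Split a bit string into (number of leading 1s, remaining bits)."""
    count = len(bits) - len(bits.lstrip("1"))
    return count, bits[count:]

def decompress(count, rest):
    """Rebuild the original bit string from the counter and the remainder."""
    return "1" * count + rest

def shuffle(masks):
    """Interleave equal-length masks: bit 0 of every mask, then bit 1 of every mask, ..."""
    return "".join("".join(column) for column in zip(*masks))

masks = ["11010", "11100"]
separate = [compress_leading_ones(m) for m in masks]   # one counter per mask
shuffled = shuffle(masks)
together = compress_leading_ones(shuffled)             # a single counter for all masks
```

Compressed separately, the two masks need two counters (2 and 3 leading ones). After shuffling, the combined string 1111011000 starts with a single run of four ones, so one counter covers both masks.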

Experiments
Data Sets: 3 real (Router, Cornell, Cora) and 4 synthetic
Evaluation Metric:

Experiments - Accuracy
k=64:
- ANF achieves less than 7% error
- ANF’s error is independent of the data set

Experiments - Time

Experiments - Scalability

Data Mining with ANF
The ANF tool can be used to answer graph mining problems:
- Best opening move for the Tic-Tac-Toe game
- Clustering movie classes
- Measuring the robustness of the Internet
Use summarized statistics derived from the neighborhood function:
- Many real graphs follow a power law: N(h) ∝ h^H, where H is defined as the ‘hop exponent’
Use the ‘individual hop exponent’ as a measure of importance
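One plausible way to obtain a hop exponent from a neighborhood function is a least-squares line fit in log-log space, sketched below. The fitting procedure, the function name and the synthetic input are assumptions for illustration; in practice the fit would presumably be restricted to the range of h before N(h) saturates.

```python
import math

def hop_exponent(nf_values):
    """Least-squares slope of log N(h) against log h, for h = 1, 2, ..."""
    xs = [math.log(h) for h in range(1, len(nf_values) + 1)]
    ys = [math.log(v) for v in nf_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic values that follow N(h) = 10 * h^2 exactly, so the fitted exponent is 2.
synthetic = [10 * h ** 2 for h in range(1, 8)]
H = hop_exponent(synthetic)
```

Applying the same fit to IN(u,h) for a single node u would give that node's individual hop exponent.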

Tic-Tac-Toe
Show: The best opening move is the center square
Each possible board configuration is a node, and there is an edge from board x to board y if a single move turns x into y
Compute individual neighborhood functions for each of the 9 possible first moves

Clustering Movies
Consider IMDB (Internet Movie Database), where each movie is identified as being in one or more classes (such as documentaries, dramas, comedies, etc.)
Construct a graph for each class and cluster similar ones

Internet Router Data
How robust is the Internet to router failures?
- Delete some number of routers and measure connectivity
  - Random failures do not disrupt the Internet
  - Targeted failures can dramatically disrupt it
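The delete-and-measure experiment can be illustrated on a toy topology. Everything here is an assumption made for the example: connectivity is measured as the number of ordered node pairs still joined by a path, the graph is a 7-node star rather than real router data, and the "targeted" failure simply removes the hub.

```python
from collections import deque

def connected_pairs(adj, removed=()):
    """Count ordered pairs (u, v), u != v, still joined by a path after deleting `removed`."""
    removed = set(removed)
    seen = set()
    total = 0
    for start in adj:
        if start in removed or start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0                                  # size of this connected component
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in removed and v not in seen:
                    seen.add(v)
                    queue.append(v)
        total += size * (size - 1)
    return total

# Toy topology: hub 0 connected to leaves 1..6.
star = {0: [1, 2, 3, 4, 5, 6], **{i: [0] for i in range(1, 7)}}
baseline = connected_pairs(star)
leaf_removed = connected_pairs(star, removed=[3])   # a "random" failure hitting a leaf
hub_removed = connected_pairs(star, removed=[0])    # a targeted failure
```

Losing a leaf reduces the count only from 42 to 30, while losing the hub drops it to 0. This mirrors, on a very small scale, the contrast between random and targeted failures.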

Conclusions
ANF uses an efficient and accurate approximation algorithm
The ANF tool provides several advantages, including the following:
- Accurate
- Fast
- Low storage requirements
- Parallelizable
ANF makes it possible to answer many interesting questions