An Algorithm for Enumerating SCCs in Web Graph
Jie Han, Yong Yu, Guowei Liu, and Guirong Xue
Speaker: Seo, Jong Hwa


Jong H. Seo - Realtime OS Lab
Table Of Contents
Abstract and Introduction
Related Work
The Split-Merge Algorithm
Experiments and Results (not covered)
Conclusions (not covered)

Abstract and Introduction
Web graph and its connectivity
Problem recognition and our goal

Web Graph And Its Connectivity
World Wide Web: pages / hyperlinks
Directed graph (web graph): nodes / edges
Connectivity analysis is an important part of research on the web graph.
SCC analysis (SCC enumeration) is the most common and important way to study the connectivity and compute the structure of the web graph.

Problem Recognition
When the graph contains hundreds of millions of nodes and billions of edges, traditional algorithms are hard to use because they become intractable in both time and space.
The web graph is so large that the full graph can hardly be loaded into main memory.

Our Goal
In this paper, we investigate some properties of the web graph and propose a feasible algorithm for enumerating its SCCs.
The algorithm finishes within a week, whereas the traditional algorithm can hardly be applied to this web graph because it may run for years.

The Following Sections
Section 2: We review some traditional algorithms for enumerating SCCs in a general directed graph.
Section 3: We describe some special properties of the web graph and propose an algorithm to enumerate SCCs in this graph.
Section 4: We discuss the detailed implementation of this algorithm on the web graph of China.

Related Work
Broder and Kumar’s bow-tie diagram of the web
Tarjan’s algorithm
Sharir’s algorithm
Lisa K. Fleischer’s parallel algorithm

Graph Structure In The Web (1/3)
CORE: the largest SCC of the graph
IN: pages from which at least one path exists to some node in CORE
OUT: pages that can be reached from some node in CORE
TENDRILS: pages that are reachable from IN, or that can reach OUT, without passing through CORE
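The region definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is a small in-memory adjacency dictionary, CORE is already known, and IN and OUT are taken to exclude CORE itself. The names `bfs` and `bow_tie` are hypothetical, and the third result simply collects every node outside CORE, IN and OUT (tendrils as well as disconnected pages).

```python
from collections import deque

def bfs(adj, sources):
    """Return every node reachable from any node in `sources`."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def bow_tie(graph, core):
    """Split the nodes outside `core` into IN, OUT and everything else."""
    core = set(core)
    nodes = set(graph)
    transpose = {}
    for u, targets in graph.items():
        for v in targets:
            nodes.add(v)
            transpose.setdefault(v, []).append(u)
    out_region = bfs(graph, core) - core      # reachable from CORE
    in_region = bfs(transpose, core) - core   # can reach CORE
    other = nodes - core - in_region - out_region
    return in_region, out_region, other

# Tiny made-up graph: i -> CORE -> o, and t -> o bypasses CORE.
example = {"i": ["c1"], "c1": ["c2"], "c2": ["c1", "o"], "t": ["o"], "x": []}
in_region, out_region, other = bow_tie(example, {"c1", "c2"})
```

Two breadth-first searches suffice here, one on the graph and one on its transpose.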

Graph Structure In The Web (2/3)
IN can be viewed as the set of new pages that link to pages they find interesting but have not yet been discovered by CORE.
OUT can be viewed as well-known pages whose links point to internal pages only.
TENDRILS can be viewed as pages that have not yet been discovered by the web.

Graph Structure In The Web (3/3)
A deeper analysis reveals the connectivity of the web graph.
If pages u and v are chosen at random, the probability that a path exists from u to v is only ¼.

Tarjan’s Algorithm
Tarjan presented an algorithm that decomposes a directed graph into strongly connected components in O(n+e) time, where n denotes the number of nodes and e the number of edges.
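For reference, a minimal recursive sketch of Tarjan’s algorithm in Python follows. It assumes the graph fits in memory as an adjacency dictionary and is small enough for Python’s recursion limit, which is exactly what does not hold for the web graph. The function name `tarjan_scc` is chosen here for illustration.

```python
def tarjan_scc(graph):
    """Return the SCCs of a directed graph given as {node: [successors]}."""
    index = {}        # discovery order of each visited node
    lowlink = {}      # smallest index reachable from the node's DFS subtree
    stack = []        # nodes whose component is not yet complete
    on_stack = set()
    components = []

    def visit(v):
        index[v] = lowlink[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of an SCC: pop the whole component off the stack.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in list(graph):
        if v not in index:
            visit(v)
    return components
```

Each node and each edge is handled once, which gives the O(n+e) bound quoted above.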

Sharir’s Algorithm
Sharir’s algorithm finds all SCCs in a directed graph in O(n+e) time.
It makes use of the transpose of the original graph.
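The transpose-based approach is commonly described as two depth-first passes (it is also widely known as the Kosaraju-Sharir algorithm). The sketch below is a small in-memory illustration of that two-pass scheme, not the code discussed in the paper; the name `sharir_scc` is hypothetical and the recursion assumes a small graph.

```python
def sharir_scc(graph):
    """Two-pass SCC enumeration using the transpose of the graph."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Pass 1: DFS on the original graph, recording nodes by finishing time.
    visited, order = set(), []

    def finish(v):
        visited.add(v)
        for w in graph.get(v, []):
            if w not in visited:
                finish(w)
        order.append(v)

    for v in nodes:
        if v not in visited:
            finish(v)

    # Build the transpose: every edge u -> v becomes v -> u.
    transpose = {v: [] for v in nodes}
    for u, targets in graph.items():
        for v in targets:
            transpose[v].append(u)

    # Pass 2: DFS on the transpose in decreasing finishing time.
    assigned, components = set(), []

    def collect(v, component):
        assigned.add(v)
        component.append(v)
        for w in transpose[v]:
            if w not in assigned:
                collect(w, component)

    for v in reversed(order):
        if v not in assigned:
            component = []
            collect(v, component)
            components.append(component)
    return components
```

Each tree found in the second pass is one strongly connected component.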

Fleischer’s Algorithm
Divide and conquer
Pred(G, v), Desc(G, v), Rem(G, v)
SCC(G, v) = Pred(G, v) ∩ Desc(G, v)
This algorithm works efficiently on multiprocessors and is based on both DFS and BFS.
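A sequential sketch of this divide-and-conquer idea is given below. It assumes Pred(G, v) is the set of nodes that can reach the pivot v, Desc(G, v) the set of nodes reachable from v, and Rem(G, v) the nodes in neither set. The names `reachable`, `dcsc` and `fleischer_scc` are hypothetical. The three recursive calls are independent of each other, which is what makes the scheme attractive on multiprocessors; here they simply run one after another.

```python
def reachable(adj, start, allowed):
    """Nodes of `allowed` reachable from `start` using only nodes in `allowed`."""
    seen = {start}
    frontier = [start]
    while frontier:
        v = frontier.pop()
        for w in adj.get(v, ()):
            if w in allowed and w not in seen:
                seen.add(w)
                frontier.append(w)
    return seen

def dcsc(graph, transpose, nodes):
    """Enumerate the SCCs of the sub-graph induced by `nodes`."""
    if not nodes:
        return []
    v = next(iter(nodes))                   # arbitrary pivot
    desc = reachable(graph, v, nodes)       # Desc(G, v)
    pred = reachable(transpose, v, nodes)   # Pred(G, v)
    scc = pred & desc                       # SCC(G, v)
    rem = nodes - pred - desc               # Rem(G, v)
    # The three remaining parts share no SCC, so they can be solved independently.
    return ([scc]
            + dcsc(graph, transpose, pred - scc)
            + dcsc(graph, transpose, desc - scc)
            + dcsc(graph, transpose, rem))

def fleischer_scc(graph):
    nodes = set(graph)
    transpose = {}
    for u, targets in graph.items():
        for w in targets:
            nodes.add(w)
            transpose.setdefault(w, []).append(u)
    return dcsc(graph, transpose, nodes)
```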

Introduction To The Split-Merge Algorithm (1/2)
The conventional algorithms are sometimes not applicable to the web graph.
The web graph consists of several hundred million nodes and several billion edges.

Introduction To The Split-Merge Algorithm (2/2)
Although machines with 8GB of main memory are common in organizations involved in web graph research, and powerful web graph compression algorithms are available, it is sometimes still impossible to load the entire graph into main memory.
Link information is then loaded from hard disk into main memory back and forth while traversing the graph, and the I/O time cost is unaffordable.
So it is infeasible to enumerate SCCs in the web graph in a straightforward way.

Basic Idea On Split-Merge Algorithm (1/2)
1. Classify the nodes of graph G into n groups. Build a sub-graph from each group of nodes and the links among them.
2. Decompose each sub-graph into SCCs. If the sub-graph is small enough, use any algorithm for enumerating SCCs. Otherwise, recursively apply the split-merge algorithm.
3. Treat each SCC in a sub-graph as a single node and eliminate the duplicated links between them. This yields the contracted graph G’, a graph composed of all the SCCs.

Basic Idea On Split-Merge Algorithm (2/2)
4. Decompose the contracted graph G’ into SCCs. If G’ is small enough, use any algorithm for enumerating SCCs. Otherwise, recursively apply the split-merge algorithm.
5. Merge the SCCs from the sub-graphs with the help of the decomposition of G’.
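The five steps can be sketched in Python as follows. This is a simplified in-memory illustration, not the authors’ implementation: it assumes every sub-graph and the contracted graph are small enough to decompose directly, so the recursive case of steps 2 and 4 is omitted. The function `group_of`, which assigns each node to a group, is a hypothetical parameter, as are the names `scc_in_memory` and `split_merge_scc`. Edges inside one SCC are dropped during contraction along with duplicates, and every edge endpoint is assumed to appear in `nodes`.

```python
from collections import defaultdict

def scc_in_memory(nodes, edges):
    """Stand-in for 'any algorithm': two-pass DFS with an explicit stack."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def dfs(start, adj, seen, out):
        # Appends nodes to `out` in post-order (finishing time).
        seen.add(start)
        stack = [(start, iter(adj[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)

    seen, order = set(), []
    for n in nodes:
        if n not in seen:
            dfs(n, succ, seen, order)
    seen, components = set(), []
    for n in reversed(order):
        if n not in seen:
            component = []
            dfs(n, pred, seen, component)
            components.append(component)
    return components

def split_merge_scc(nodes, edges, group_of):
    # Step 1: split the nodes into groups and keep the links inside each group.
    groups, inner = defaultdict(list), defaultdict(list)
    for n in nodes:
        groups[group_of(n)].append(n)
    for u, v in edges:
        if group_of(u) == group_of(v):
            inner[group_of(u)].append((u, v))

    # Step 2: decompose each sub-graph into SCCs.
    comp_of, members = {}, []   # node -> SCC id, SCC id -> its nodes
    for g, group_nodes in groups.items():
        for component in scc_in_memory(group_nodes, inner[g]):
            for n in component:
                comp_of[n] = len(members)
            members.append(component)

    # Step 3: contract each SCC to one node; the set removes duplicate links.
    contracted = {(comp_of[u], comp_of[v])
                  for u, v in edges if comp_of[u] != comp_of[v]}

    # Step 4: decompose the contracted graph G'.
    super_components = scc_in_memory(range(len(members)), contracted)

    # Step 5: merge the sub-graph SCCs that fall into the same SCC of G'.
    return [[n for c in group for n in members[c]] for group in super_components]
```

In this sketch the quality of `group_of` decides how much smaller the contracted graph is than the original.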

A Directed Graph G
[Figure: directed graph G with nodes A, B, C, D, E, F, G, H, I, J]
The directed graph G consists of 10 nodes and 15 edges. It will be split into three sub-graphs.

Three Sub-Graphs
[Figure: graph G split into three sub-graphs]
The largest sub-graph contains only 4 nodes and 5 edges.

The Contracted Graph G’ (1/2)
After each sub-graph is decomposed, we can contract the graph into G’.
G’ contains only 5 nodes and 6 edges and can be decomposed into ((A, B, C), (E, F), (G, H, J), (D)) and (I).
[Figure: contracted graph G’]

The Contracted Graph G’ (2/2)
By merging the results from the last two diagrams, we can enumerate all the SCCs in the original graph G: (A, B, C, E, F, G, H, J, D) and (I).
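The merge step for this example can be written out directly. The sketch below takes the five sub-graph SCCs and the decomposition of G’ as given on the slides (the edges of G are not reproduced in the transcript, so they are not used); the index-based encoding and the name `merge_sccs` are assumptions made for illustration.

```python
def merge_sccs(sub_graph_sccs, contracted_sccs):
    """Union the sub-graph SCCs that lie in the same SCC of the contracted graph."""
    return [sorted(node for i in group for node in sub_graph_sccs[i])
            for group in contracted_sccs]

# The five nodes of G', each one an SCC found inside a sub-graph.
sub_graph_sccs = [("A", "B", "C"), ("E", "F"), ("G", "H", "J"), ("D",), ("I",)]
# Decomposition of G': the first four nodes form one SCC, (I) stays alone.
contracted_sccs = [[0, 1, 2, 3], [4]]

merged = merge_sccs(sub_graph_sccs, contracted_sccs)
# merged == [['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J'], ['I']]
```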

Pros On Split-Merge Algorithm
Both the sub-graphs and the contracted graph G’ are much smaller than the original graph G.
If the web graph is split into sub-graphs, an entire sub-graph can be loaded into main memory while it is decomposed.
Thus, the extra cost of splitting and merging seems affordable compared with swapping edges back and forth between hard disk and main memory.

Cons On Split-Merge Algorithm (1/3)
Another way to split graph G
[Figure: graph G with a different split]

Cons On Split-Merge Algorithm (2/3)
[Figure: contracted graph G’ for this split]
The contracted graph G’ is only a bit smaller than G.

Cons On Split-Merge Algorithm (3/3)
The basic split-merge algorithm does not work well when the split is poor.
The contracted graph G’ is only a bit smaller than G, so G’ must be split again.

What Remains Now?
What remains is to find a way to split the web graph appropriately.
However, this seems difficult if only the link information is considered.
We therefore take advantage of special properties of the relationship between pages and sites in the web graph.