Building Web Spiders: Web-Based Information Architectures, MSEC 20-760, Mini II. Jaime Carbonell

Presentation transcript:

Building Web Spiders: Web-Based Information Architectures, MSEC 20-760, Mini II. Jaime Carbonell

General Topic: Spidering the Web
Motivation: Acquiring a Collection
Bare Essentials of Graph Theory
Web Spidering Algorithms
Web Spidering: Current Practice

Acquiring a Collection (1): Revising the Total IR Scheme
1. Acquire the collection, i.e. all the documents [Off-line process]
2. Create an inverted index (Homework 1) [Off-line process]
3. Match queries to documents (Homework 2) [On-line process, the actual retrieval]
4. Present the results to the user [On-line process: display, summarize, ...]

Acquiring a Collection (2): Document Collections and Sources
Fixed, pre-existing document collection, e.g., the classical philosophy works
Pre-existing collection with periodic updates, e.g., the MEDLINE biomedical collection
Streaming data with temporal decay, e.g., the Wall Street financial news feed
Distributed proprietary document collections (see Prof. Callan's methods)
Distributed, linked, publicly-accessible documents, e.g., the Web

Technical Detour: Properties of Graphs I (1)
Definitions:
Graph: a set of nodes n and a set of edges (binary links) v between the nodes.
Directed graph: a graph where every edge has a pre-specified direction.

Technical Detour: Properties of Graphs I (2)
Connected graph: a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other.
The web graph: the directed graph where n = {all web pages} and v = {all HTML-defined links from one web page to another}.
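
As an illustration beyond the original slide, a directed graph such as the web graph can be held in memory as an adjacency list: a mapping from each node to the nodes its edges point to. The page names below are invented for the example.

# Minimal sketch: a directed graph as a dict of adjacency lists (Python).
# All node and page names here are hypothetical.
web_graph = {
    "pageA.html": ["pageB.html", "pageC.html"],  # pageA links to B and C
    "pageB.html": ["pageC.html"],
    "pageC.html": ["pageA.html"],                # a link back to A (a loop)
    "pageD.html": [],                            # a page with no out-links
}

def out_links(graph, node):
    """Return the nodes reachable from `node` by a single directed edge."""
    return graph.get(node, [])

print(out_links(web_graph, "pageA.html"))  # ['pageB.html', 'pageC.html']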

Technical Detour: Properties of Graphs I (3)
Tree: a connected graph without any loops and with a unique path between any two nodes.
Spanning tree of graph G: a tree constructed by including all n in G, and a subset of v such that G remains connected, but all loops are eliminated.

Technical Detour: Properties of Graphs I (4)
Forest: a set of trees (without inter-tree links).
k-spanning forest: given a graph G with k connected subgraphs, the set of k trees each of which spans a different connected subgraph.

Example: Graph G (figure)

Directed Graph Example

Tree

Web Graph: HTML references are the links; web pages are the nodes.

Technical Detour: Properties of Graphs II (1)
Theorem 1: For every connected graph G, there exists a spanning tree.
Proof: Depth-first search starting at any node in G builds a spanning tree.
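
A minimal Python sketch of the proof idea (not from the slides): depth-first search from any node of a connected graph reaches every node, and the edges along which new nodes are first discovered form a spanning tree. The dict-of-lists graph representation is an assumption for the example.

def dfs_spanning_tree(graph, root):
    """Return the edges of a spanning tree of a connected graph,
    built by depth-first search from `root`.
    `graph` maps each node to a list of its neighbors."""
    visited = {root}
    tree_edges = []

    def visit(node):
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                tree_edges.append((node, neighbor))  # discovery edge = tree edge
                visit(neighbor)

    visit(root)
    return tree_edges

# Example on a small connected graph:
g = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(dfs_spanning_tree(g, "a"))  # 3 edges spanning the 4 nodes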

Technical Detour: Properties of Graphs II (2)
Theorem 2: For every G with k disjoint connected subgraphs, there exists a k-spanning forest.
Proof: Each connected subgraph has a spanning tree (Theorem 1), and the set of k spanning trees (being disjoint) defines a k-spanning forest.
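
The same construction extends to a forest, again as an illustrative sketch rather than anything from the slides: restart the search from every node not yet visited, producing one spanning tree per connected subgraph.

def spanning_forest(graph):
    """Return a list of spanning trees (edge lists), one per connected subgraph."""
    visited = set()
    forest = []
    for root in graph:
        if root in visited:
            continue
        # Grow a spanning tree of the component containing `root` (iterative DFS).
        visited.add(root)
        tree, stack = [], [root]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    tree.append((node, neighbor))
                    stack.append(neighbor)
        forest.append(tree)
    return forest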

Technical Detour: Properties of Graphs II (3)
Additional Observations:
The web graph at any instant of time contains k disjoint connected subgraphs (but we do not know the value of k, nor do we know a priori the structure of the web graph).
If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF."

Graph-Search Algorithms I
PROCEDURE SPIDER_1(G)
    Let ROOT := any URL from G
    Initialize STACK
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION
    While STACK is not empty,
        URL_curr := pop(STACK)
        PAGE := look-up(URL_curr)
        STORE(<URL_curr, PAGE>, COLLECTION)
        For every URL_i in PAGE, push(URL_i, STACK)
    Return COLLECTION
What is wrong with the above algorithm?
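
A rough Python rendering of SPIDER_1, useful for spotting the flaw the slide asks about. The helpers look_up(url) and extract_urls(page), which fetch a page and parse out its links, are hypothetical placeholders rather than a real crawling API.

def spider_1(root_url, look_up, extract_urls):
    """Naive spider: follows every link with no bookkeeping.
    `look_up(url)` fetches a page; `extract_urls(page)` returns its links.
    Both are hypothetical helpers supplied by the caller."""
    stack = [root_url]
    collection = []
    while stack:                        # never terminates if the web graph has a cycle
        url = stack.pop()
        page = look_up(url)
        collection.append((url, page))  # duplicates pile up on convergent DAG structures
        for link in extract_urls(page):
            stack.append(link)
    return collection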

Depth-first Search numbers = order in which nodes are visited

Graph-Search Algorithms II (1): SPIDER_1 is Incorrect
What about loops in the web graph? => Algorithm will not halt
What about convergent DAG structures? => Pages will be replicated in the collection => Inefficiently large index => Duplicates to annoy the user

Graph-Search Algorithms II (2): SPIDER_1 is Incomplete
The web graph has k disjoint connected subgraphs.
SPIDER_1 only reaches pages in the connected web subgraph where the ROOT page lives.

Graph-Search Algorithms III: A Correct Spidering Algorithm
PROCEDURE SPIDER_2(G)
    Let ROOT := any URL from G
    Initialize STACK
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION
    While STACK is not empty,
        | Do URL_curr := pop(STACK)
        | Until URL_curr is not in COLLECTION
        PAGE := look-up(URL_curr)
        STORE(<URL_curr, PAGE>, COLLECTION)
        For every URL_i in PAGE, push(URL_i, STACK)
    Return COLLECTION
(The lines marked "|" are the ones changed relative to SPIDER_1.)

Graph-Search Algorithms IV: A More Efficient Correct Algorithm
PROCEDURE SPIDER_3(G)
    Let ROOT := any URL from G
    Initialize STACK
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION
    | Initialize VISITED
    While STACK is not empty,
        | Do URL_curr := pop(STACK)
        | Until URL_curr is not in VISITED
        | insert-hash(URL_curr, VISITED)
        PAGE := look-up(URL_curr)
        STORE(<URL_curr, PAGE>, COLLECTION)
        For every URL_i in PAGE, push(URL_i, STACK)
    Return COLLECTION
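
A Python sketch of SPIDER_3 under the same assumptions: a set stands in for the VISITED hash table, so each URL is fetched at most once and the loop terminates on a finite web graph. look_up and extract_urls remain hypothetical helpers.

def spider_3(root_url, look_up, extract_urls):
    """Spider with a visited-URL check before every fetch."""
    stack = [root_url]
    visited = set()
    collection = []
    while stack:
        url = stack.pop()
        if url in visited:           # skip anything already fetched
            continue
        visited.add(url)
        page = look_up(url)
        collection.append((url, page))
        for link in extract_urls(page):
            if link not in visited:  # cheap pre-filter; the check above still decides
                stack.append(link)
    return collection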

Graph-Search Algorithms V: A More Complete Correct Algorithm
PROCEDURE SPIDER_4(G, {SEEDS})
    | Initialize COLLECTION
    | Initialize VISITED
    | For every ROOT in SEEDS
        | Initialize STACK
        | Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URL_curr := pop(STACK)
            Until URL_curr is not in VISITED
            insert-hash(URL_curr, VISITED)
            PAGE := look-up(URL_curr)
            STORE(<URL_curr, PAGE>, COLLECTION)
            For every URL_i in PAGE, push(URL_i, STACK)
    Return COLLECTION
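
SPIDER_4 changes only the outer structure: it iterates over several seed URLs while sharing one VISITED set and one COLLECTION. A compact sketch, with the same hypothetical look_up and extract_urls helpers:

def spider_4(seeds, look_up, extract_urls):
    """Multi-seed spider: one stack per seed, shared VISITED and COLLECTION."""
    visited, collection = set(), []
    for root in seeds:              # one crawl per seed URL
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:
                continue
            visited.add(url)
            page = look_up(url)
            collection.append((url, page))
            stack.extend(extract_urls(page))
    return collection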

Graph-Search Algorithms VI: Completeness Observations (1)
Completeness is not guaranteed:
In the k-connected web G, we do not know k
Impossible to guarantee each connected subgraph is sampled
Better: more seeds, more diverse seeds

Graph-Search Algorithms VI: Completeness Observations (2)
Search Engine Practice:
Wish to maximize the subset of the web indexed.
Maintain a (secret) set of diverse seeds (grow this set opportunistically, e.g. when X complains that his/her page is not indexed).
Register new web sites on demand; new registrations are seed candidates.

To Spider or not to Spider? (1): User Perceptions
Most annoying: Engine finds nothing (too small an index, but not an issue since 1998 or so).
Somewhat annoying: Obsolete links => Refresh collection by deleting dead links (OK if index is slightly smaller) => Done every 1-2 weeks in the best engines.
Mildly annoying: Failure to find a new site => Re-spider the entire web => Done every 2-4 weeks in the best engines.

To Spider or not to Spider? (2): Cost of Spidering
Semi-parallel algorithmic decomposition: the spider can (and does) run on hundreds of servers simultaneously.
Very high network connectivity (e.g. a T3 line).
Servers can migrate from spidering to query processing depending on time-of-day load.
Running a full web spider takes days even with hundreds of dedicated servers.

Current Status of Web Spiders (1): Historical Notes
WebCrawler: first documented spider
Lycos: first large-scale spider
Top honors for most web pages spidered: first Lycos, then Alta Vista, then Google...

Current Status of Web Spiders (2): Enhanced Spidering
In-link counts to pages can be established during spidering.
Hint: in SPIDER_4, store a <URL, in-link count> pair in the VISITED hash table.
In-link counts are the basis for Google's PageRank method.
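
One way to realize the hint, sketched in Python with the same hypothetical helpers as before: keep a dictionary from URL to in-link count and increment it every time a link to that URL is seen during the crawl. This is only an illustration, not Google's actual implementation.

from collections import defaultdict

def spider_with_inlink_counts(seeds, look_up, extract_urls):
    """Crawl as in SPIDER_4, but also count in-links per URL.
    Returns (collection, in_link_counts)."""
    in_links = defaultdict(int)     # URL -> number of links pointing to it
    fetched = set()
    collection = []
    for root in seeds:
        stack = [root]
        while stack:
            url = stack.pop()
            if url in fetched:
                continue
            fetched.add(url)
            page = look_up(url)
            collection.append((url, page))
            for link in extract_urls(page):
                in_links[link] += 1  # count every link occurrence on fetched pages
                stack.append(link)
    return collection, dict(in_links)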

Current Status of Web Spiders (3): Unsolved Problems
Most spidering re-traverses the stable web graph => on-demand re-spidering when changes occur.
Completeness or near-completeness is still a major issue.
Cannot spider Java-triggered or local-DB-stored information.