
1 Building Web Spiders Web-Based Information Architectures MSEC 20-760 Mini II Jaime Carbonell

2 General Topic: Spidering the Web
Motivation: Acquiring a Collection
Bare Essentials of Graph Theory
Web Spidering Algorithms
Web Spidering: Current Practice

3 Acquiring a Collection (1) Revising the Total IR Scheme
1. Acquire the collection, i.e., all the documents [off-line process]
2. Create an inverted index (Homework 1) [off-line process]
3. Match queries to documents (Homework 2) [on-line process, the actual retrieval]
4. Present the results to the user [on-line process: display, summarize, ...]

4 Acquiring a Collection (2) Document Collections and Sources
Fixed, pre-existing document collection, e.g., the classical philosophy works
Pre-existing collection with periodic updates, e.g., the MEDLINE biomedical collection
Streaming data with temporal decay, e.g., the Wall Street financial news feed
Distributed proprietary document collections (see Prof. Callan's methods)
Distributed, linked, publicly accessible documents, e.g., the Web

5 Technical Detour: Properties of Graphs I (1) Definitions
Graph: a set of nodes n and a set of edges (binary links) v between the nodes.
Directed graph: a graph where every edge has a pre-specified direction.

6 Technical Detour: Properties of Graphs I (2)
Connected graph: a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other.
The web graph: the directed graph where n = {all web pages} and v = {all HTML-defined links from one web page to another}.
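For concreteness, a minimal sketch (in Python, with made-up URLs) of one common way to represent such a directed graph in a program, as an adjacency list; nothing here is prescribed by the slides.

# A minimal sketch of the web graph as an adjacency list: each node (a page URL)
# maps to the list of URLs its page links to. All URLs are made up for illustration.
web_graph = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],   # a link back to a => a loop in the graph
    "http://d.example/": [],                      # a page in a separate connected subgraph
}

for page, out_links in web_graph.items():
    print(page, "->", out_links)                  # edges are directed: following a link is one-way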

7 Technical Detour: Properties of Graphs I (3)
Tree: a connected graph without any loops and with a unique path between any two nodes.
Spanning tree of graph G: a tree constructed by including all n in G and a subset of v such that G remains connected, but all loops are eliminated.

8 Technical Detour: Properties of Graphs I (4)
Forest: a set of trees (without inter-tree links).
k-Spanning forest: given a graph G with k connected subgraphs, the set of k trees each of which spans a different connected subgraph.

9 Graph G = [figure: an example graph drawn as nodes n and edges v]

10 Directed Graph Example

11 Tree

12 Web Graph: web pages are nodes; HTML references are links.

13 Technical Detour: Properties of Graphs II (1)
Theorem 1: For every connected graph G, there exists a spanning tree.
Proof: Depth-first search starting at any node in G builds a spanning tree.
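A minimal sketch of the proof idea in Python (function and graph names are illustrative, not from the slides): the edges used for first visits during a depth-first search of a connected graph form a spanning tree.

# Depth-first search from any root; only edges that reach a not-yet-visited node
# are kept, so every node is included exactly once and no loop can appear.
def spanning_tree(graph, root):
    """graph: dict mapping node -> list of neighbours; returns the set of tree edges."""
    visited = {root}
    tree_edges = set()

    def dfs(node):
        for neighbour in graph[node]:
            if neighbour not in visited:       # a first visit contributes a tree edge
                visited.add(neighbour)
                tree_edges.add((node, neighbour))
                dfs(neighbour)

    dfs(root)
    return tree_edges

example = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(spanning_tree(example, "a"))   # e.g. {("a", "b"), ("b", "c")}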

14 Technical Detour: Properties of Graphs II (2)
Theorem 2: For every G with k disjoint connected subgraphs, there exists a k-spanning forest.
Proof: Each connected subgraph has a spanning tree (Theorem 1), and the set of k spanning trees (being disjoint) defines a k-spanning forest.

15 Technical Detour: Properties of Graphs II (3) Additional Observations
The web graph at any instant of time contains k connected subgraphs (but we do not know the value of k, nor do we know a priori the structure of the web graph).
If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF."

16 Graph-Search Algorithms I
PROCEDURE SPIDER_1(G)
  Let ROOT := any URL from G
  Initialize STACK
  Let STACK := push(ROOT, STACK)
  Initialize COLLECTION
  While STACK is not empty,
    URL_curr := pop(STACK)
    PAGE := look-up(URL_curr)
    STORE(<URL_curr, PAGE>, COLLECTION)
    For every URL_i in PAGE,
      push(URL_i, STACK)
  Return COLLECTION
What is wrong with the above algorithm?
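For concreteness, a minimal runnable sketch of SPIDER_1 in Python; look_up and extract_urls are assumed stand-ins for page fetching and HTML link extraction, and all names are illustrative rather than from the course materials.

# Naive stack-based spider: fetch, store, push every out-link, repeat.
def spider_1(root_url, look_up, extract_urls):
    stack = [root_url]                 # STACK, initialised with ROOT
    collection = {}                    # COLLECTION of <URL, PAGE> pairs
    while stack:
        url = stack.pop()              # URL_curr := pop(STACK)
        page = look_up(url)            # fetch the page
        collection[url] = page         # STORE(<URL_curr, PAGE>, COLLECTION)
        for link in extract_urls(page):
            stack.append(link)         # push every out-link, even ones already seen
    return collection
# On any graph with a loop (page A links to B, B links back to A) this never
# terminates -- the flaw the slide is asking about.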

17 Depth-first Search [figure: example graph with nodes numbered 1-7 in the order depth-first search visits them]

18 Graph-Search Algorithms II (1) SPIDER_1 is Incorrect
What about loops in the web graph? => The algorithm will not halt.
What about convergent DAG structures? => Pages will be replicated in the collection => inefficiently large index => duplicates to annoy the user.

19 Graph-Search Algorithms II (2) SPIDER_1 is Incomplete
The web graph has k connected subgraphs. SPIDER_1 only reaches pages in the connected web subgraph where the ROOT page lives.

20 Graph-Search Algorithms III A Correct Spidering Algorithm
PROCEDURE SPIDER_2(G)
  Let ROOT := any URL from G
  Initialize STACK
  Let STACK := push(ROOT, STACK)
  Initialize COLLECTION
  While STACK is not empty,
    | Do URL_curr := pop(STACK)
    | Until URL_curr is not in COLLECTION
    PAGE := look-up(URL_curr)
    STORE(<URL_curr, PAGE>, COLLECTION)
    For every URL_i in PAGE,
      push(URL_i, STACK)
  Return COLLECTION
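A minimal Python sketch of the SPIDER_2 fix (same assumed stand-ins as the SPIDER_1 sketch above): keep popping until the URL is one that has not already been stored.

def spider_2(root_url, look_up, extract_urls):
    stack = [root_url]
    collection = {}
    while stack:
        url = stack.pop()
        if url in collection:          # "Do ... Until URL_curr is not in COLLECTION"
            continue                   # already fetched: skip, so loops cannot trap the crawl
        page = look_up(url)
        collection[url] = page
        for link in extract_urls(page):
            stack.append(link)
    return collection
# Correct, but the duplicate test is against COLLECTION, which holds every stored
# page; SPIDER_3 instead keeps a separate hash table (VISITED) of just the URLs.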

21 Graph-Search Algorithms IV A More Efficient Correct Algorithm
PROCEDURE SPIDER_3(G)
  Let ROOT := any URL from G
  Initialize STACK
  Let STACK := push(ROOT, STACK)
  Initialize COLLECTION
| Initialize VISITED
  While STACK is not empty,
    | Do URL_curr := pop(STACK)
    | Until URL_curr is not in VISITED
    | insert-hash(URL_curr, VISITED)
    PAGE := look-up(URL_curr)
    STORE(<URL_curr, PAGE>, COLLECTION)
    For every URL_i in PAGE,
      push(URL_i, STACK)
  Return COLLECTION
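A minimal Python sketch of SPIDER_3 (same assumed stand-ins): the duplicate test moves to a small hash-based set holding only URLs, so it never touches the stored pages.

def spider_3(root_url, look_up, extract_urls):
    stack = [root_url]
    visited = set()                    # VISITED hash table of URLs only
    collection = {}
    while stack:
        url = stack.pop()
        if url in visited:             # "Until URL_curr is not in VISITED"
            continue
        visited.add(url)               # insert-hash(URL_curr, VISITED)
        page = look_up(url)
        collection[url] = page
        for link in extract_urls(page):
            stack.append(link)
    return collection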

22 Graph-Search Algorithms V A More Complete Correct Algorithm
PROCEDURE SPIDER_4(G, {SEEDS})
| Initialize COLLECTION
| Initialize VISITED
| For every ROOT in SEEDS
|   Initialize STACK
|   Let STACK := push(ROOT, STACK)
    While STACK is not empty,
      Do URL_curr := pop(STACK)
      Until URL_curr is not in VISITED
      insert-hash(URL_curr, VISITED)
      PAGE := look-up(URL_curr)
      STORE(<URL_curr, PAGE>, COLLECTION)
      For every URL_i in PAGE,
        push(URL_i, STACK)
  Return COLLECTION
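A minimal Python sketch of SPIDER_4 (same assumed stand-ins): VISITED and COLLECTION are shared across all seeds, so different seeds can reach different connected subgraphs without re-fetching pages another seed already covered.

def spider_4(seeds, look_up, extract_urls):
    visited = set()
    collection = {}
    for root in seeds:                 # one crawl per seed URL
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:
                continue
            visited.add(url)
            page = look_up(url)
            collection[url] = page
            for link in extract_urls(page):
                stack.append(link)
    return collection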

23 Graph-Search Algorithms VI Completeness Observations (1)
Completeness is not guaranteed:
In a web graph G with k connected subgraphs, we do not know k.
It is impossible to guarantee that each connected subgraph is sampled.
Better: more seeds, and more diverse seeds.

24 Graph-Search Algorithms VI Completeness Observations (2) Search Engine Practice
Wish to maximize the subset of the web indexed.
Maintain a (secret) set of diverse seeds (grow this set opportunistically, e.g., when someone complains that their page is not indexed).
Register new web sites on demand; new registrations are seed candidates.

25 To Spider or not to Spider? (1) User Perceptions
Most annoying: the engine finds nothing (too small an index, but not an issue since 1998 or so).
Somewhat annoying: obsolete links => refresh the collection by deleting dead links (OK if the index is slightly smaller) => done every 1-2 weeks in the best engines.
Mildly annoying: failure to find a new site => re-spider the entire web => done every 2-4 weeks in the best engines.

26 To Spider or not to Spider? (2) Cost of Spidering
Semi-parallel algorithmic decomposition: the spider can (and does) run on hundreds of servers simultaneously.
Very high network connectivity (e.g., a T3 line).
Servers can migrate from spidering to query processing depending on time-of-day load.
Running a full web spider takes days even with hundreds of dedicated servers.

27 Current Status of Web Spiders (1) Historical Notes
WebCrawler: first documented spider.
Lycos: first large-scale spider.
Top honors for most web pages spidered: first Lycos, then Alta Vista, then Google...

28 Current Status of Web Spiders (2) Enhanced Spidering
In-link counts to pages can be established during spidering.
Hint: in SPIDER_4, store the <URL, in-link count> pair in the VISITED hash table.
In-link counts are the basis for Google's PageRank method.
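A minimal Python sketch of the hint above (illustrative names, not any engine's actual code): the crawl keeps a per-URL in-link counter alongside the collection, incremented every time a link pointing at that URL is seen.

from collections import defaultdict

def spider_with_inlinks(seeds, look_up, extract_urls):
    inlinks = defaultdict(int)          # URL -> number of in-links seen so far
    collection = {}
    for root in seeds:
        stack = [root]
        while stack:
            url = stack.pop()
            if url in collection:       # already fetched: skip re-fetching
                continue
            page = look_up(url)
            collection[url] = page
            for link in extract_urls(page):
                inlinks[link] += 1      # count the in-link whether or not the target is fetched again
                stack.append(link)
    return collection, inlinks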

29 Current Status of Web Spiders (3) Unsolved Problems
Most spidering re-traverses a stable web graph => need on-demand re-spidering when changes occur.
Completeness or near-completeness is still a major issue.
Cannot spider Java-triggered or local-DB-stored information.

