1
Web Search – Summer Term 2006 V. Web Search - Page Repository (c) Wolfgang Hürst, Albert-Ludwigs-University
2
General Web Search Engine Architecture
[Figure: block diagram with the components Client, Query Engine, Ranking, Crawl Control, Crawler(s), WWW, Collection Analysis Module, Indexer Module, Page Repository, and Indexes (Structure, Utility, Text); labeled flows: Queries, Results, Usage Feedback] (cf. [1], Fig. 1)
3
Page Repository
Page Repository = a scalable storage system for managing large collections of web pages
Why do we need one?
- To keep a local copy of a subset of the web from which the index is built
- To provide a cache function that shows the state of a web page at the time of indexing and supplies the text excerpts shown in the results list
4
Page Repository
Required functionality / interfaces:
- Interface to the Crawler to store new or updated pages
- Interface to the Indexer Module to create and update the index
- Interface to the Query Engine to present result pages (or parts of them)
5
Problems and Challenges
- Scalability: the size of the repository requires distributing it over a cluster of computers and disks
- Different access modes: the interfaces need random access (fast retrieval of a single page when presenting results) as well as streaming access (efficient retrieval of a larger subset for indexing)
- Large updates: the size and high rate of change of the web cause large volumes of updates, which must not conflict with ongoing accesses
- Obsolete pages need to be identified and removed
6
Architecture and resulting requirements
Because of the size of the repository: distribution over several storage nodes (or network disks)
A storage manager is responsible for
a) the distribution of the pages to the storage nodes
b) the physical organization of the pages within one storage node
c) the update mechanism and the strategy it uses
7
a) Distribution of pages to storage nodes
Different strategies exist, e.g.
Uniform distribution policy: all storage nodes are treated equally, i.e. pages are assigned to random nodes
Advantages:
- Adding new pages is easy
- Robust against failure of single nodes
Hash distribution policy: pages are assigned to nodes based on some hash strategy, e.g. by allocating certain intervals of a page identifier to specific nodes
Advantage: easy and fast access
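The two policies can be contrasted in a few lines of Python. This is only a minimal sketch: the function names, the use of SHA-1, and the 64-bit identifier space split into equal intervals are assumptions made for illustration and are not taken from the slides or from [1].

```python
import hashlib
import random


def uniform_node(num_nodes: int, rng: random.Random) -> int:
    """Uniform policy: every node is an equally valid target, so pick one at random."""
    return rng.randrange(num_nodes)


def hash_node(page_id: str, num_nodes: int) -> int:
    """Hash policy: map the page identifier into a 64-bit space and
    assign each node one contiguous interval of that space."""
    digest = hashlib.sha1(page_id.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")      # 0 <= h < 2**64
    return (h * num_nodes) // 2**64            # index of the interval containing h


rng = random.Random(0)
print(uniform_node(4, rng))                    # some node in 0..3
print(hash_node("http://example.org/", 4))     # always the same node for this id
```

With the hash policy the node can be recomputed from the identifier alone. With the uniform policy a reader would presumably need a directory or would have to ask the nodes in turn, which fits the "easy and fast access" advantage listed for hashing.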
8
b) Physical organization within one node
Operations supported by one node:
- Adding new pages
- Access to existing pages, a) via random access and b) via streaming access
Different strategies exist, e.g.
Hash-based organization, e.g. distribution of pages into individual buckets
Log-structured page organization with
- a log containing all pages
- a catalog containing information about the pages
- a B-tree index mapping the page identifiers to the respective physical positions (for random access)
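The log-structured organization can be sketched as follows. The class and method names are hypothetical, an in-memory `io.BytesIO` stands in for the append-only log on disk, and a plain `dict` stands in for the B-tree index; none of these choices come from the slides.

```python
import io


class LogStructuredNode:
    """Toy storage node: append-only log, catalog, and an id -> position index."""

    def __init__(self):
        self._log = io.BytesIO()   # stand-in for an append-only file on disk
        self._catalog = {}         # page_id -> metadata about the page
        self._index = {}           # page_id -> (offset, length); stand-in for a B-tree

    def add(self, page_id: int, url: str, content: bytes) -> None:
        """Append the page to the end of the log and record where it landed."""
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(content)
        self._catalog[page_id] = {"url": url, "length": len(content)}
        self._index[page_id] = (offset, len(content))

    def get(self, page_id: int) -> bytes:
        """Random access: look up the position in the index, then read once."""
        offset, length = self._index[page_id]
        self._log.seek(offset)
        return self._log.read(length)

    def stream(self):
        """Streaming access: yield the indexed pages in log order."""
        for page_id, (offset, length) in sorted(
            self._index.items(), key=lambda item: item[1][0]
        ):
            self._log.seek(offset)
            yield page_id, self._log.read(length)


node = LogStructuredNode()
node.add(1, "http://example.org/a", b"<html>A</html>")
node.add(2, "http://example.org/b", b"<html>B</html>")
print(node.get(2))
print([page_id for page_id, _ in node.stream()])
```

In this sketch, adding a page is always a sequential append, while the index is what makes random access possible without scanning the log.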
9
c) Update mechanism and strategy
Depends on the crawler:
- Incremental vs. periodic crawler
- Batch mode vs. steady crawler
Depending on the crawler implementation: update in place or via shadowing
Advantages of shadowing:
- Strict separation of update and access
- Better performance, easier implementation
Advantage of in-place updates:
- Better freshness because of the shorter delay between crawling and update
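The difference between the two update strategies can be illustrated with two toy repositories. The class names and the wholesale `swap()` step are assumptions for this sketch; replacing the entire live collection is the simplest variant and would suit a periodic crawler, whereas a real system might merge the shadow copy into the live data instead.

```python
class InPlaceRepository:
    """Updates are written directly into the collection that readers see."""

    def __init__(self):
        self._pages = {}

    def store(self, url: str, content: bytes) -> None:
        self._pages[url] = content          # visible to readers immediately

    def read(self, url: str):
        return self._pages.get(url)


class ShadowedRepository:
    """Updates go to a separate shadow copy; readers only see the live copy."""

    def __init__(self):
        self._live = {}
        self._shadow = {}

    def store(self, url: str, content: bytes) -> None:
        self._shadow[url] = content         # not yet visible to readers

    def read(self, url: str):
        return self._live.get(url)

    def swap(self) -> None:
        """Publish the crawl: the shadow copy becomes the live copy."""
        self._live, self._shadow = self._shadow, {}


repo = ShadowedRepository()
repo.store("http://example.org/", b"v2")
print(repo.read("http://example.org/"))     # None until swap() is called
repo.swap()
print(repo.read("http://example.org/"))     # b'v2'
```

The sketch shows the trade-off named on the slide: the shadowed variant never mixes writes with reads, but a freshly crawled page stays invisible until the next swap.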
10
General Web Search Engine Architecture
[Figure: the block diagram from slide 2 (Client, Query Engine, Ranking, Crawl Control, Crawler(s), WWW, Collection Analysis Module, Indexer Module, Page Repository, Indexes), now also showing the Storage Manager with the Page Repository] (cf. [1], Fig. 1)
11
Page Repository of the 1st Google Version
- Full HTML of every web page
- Compressed using zlib
cf. [2], Section 4.2.2
[Figure: repository layout]
Repository: sync | length | compressed packet | sync | length | compressed packet | ...
Packet (stored compressed in repository): docid | ecode | urllen | pagelen | url | page
Repository size: 53.5 GB compressed = 147.8 GB uncompressed
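The record layout above can be made concrete with Python's `struct` and `zlib` modules. This is a sketch only: the field order follows the figure, but the field widths, the byte order, and the value of the sync marker are assumptions, since neither the slide nor [2] specifies them.

```python
import struct
import zlib

SYNC = b"\xde\xad\xbe\xef"            # hypothetical 4-byte sync marker
HEADER = struct.Struct("<QBHI")       # docid, ecode, urllen, pagelen (assumed widths)
LENGTH = struct.Struct("<I")          # length of the compressed packet (assumed width)


def encode_record(docid: int, ecode: int, url: bytes, page: bytes) -> bytes:
    """Build one repository record: sync | length | compressed packet."""
    packet = HEADER.pack(docid, ecode, len(url), len(page)) + url + page
    compressed = zlib.compress(packet)
    return SYNC + LENGTH.pack(len(compressed)) + compressed


def decode_records(blob: bytes):
    """Walk the repository sequentially and yield (docid, ecode, url, page)."""
    pos = 0
    while pos < len(blob):
        if blob[pos:pos + len(SYNC)] != SYNC:
            raise ValueError(f"missing sync marker at offset {pos}")
        pos += len(SYNC)
        (size,) = LENGTH.unpack_from(blob, pos)
        pos += LENGTH.size
        packet = zlib.decompress(blob[pos:pos + size])
        pos += size
        docid, ecode, urllen, pagelen = HEADER.unpack_from(packet, 0)
        url_start = HEADER.size
        page_start = url_start + urllen
        yield docid, ecode, packet[url_start:page_start], packet[page_start:page_start + pagelen]


repository = encode_record(1, 0, b"http://example.org/", b"<html>hello</html>")
repository += encode_record(2, 0, b"http://example.org/b", b"<html>world</html>")
for record in decode_records(repository):
    print(record)
```

Because each record carries its own length, the whole repository can be read back with a single sequential pass and no other data structure.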
12
References - Page Repository
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001, Chapter 3 (Storage)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998, Chapter 4.2.2 (Repository)
[3] Hirai, Raghavan, Garcia-Molina, Paepcke: "WebBase: A Repository of Web Pages", WWW 2000