Slide 1: WebBase: A repository of web pages
Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, Andreas Paepcke
Computer Science Department, Stanford University
Presented by: Maria Fragouli, Athens 2002
Slide 2: Web repository
A web repository stores and manages large collections of web pages, and is used by applications that access, mine, or index up-to-date web content.
Basic implementation goals:
- Scalability: use network disks to hold the repository so that it can scale with the growth of the web
- Streams: support a streaming (ordered) access mode, in contrast to random access, for requests of pages in bulk rather than individual page requests
- Large updates: newly updated versions of pages must efficiently replace older ones
- Expunging pages: obsolete pages need to be detected and removed
Slide 3: Overview and design assumptions
We study:
- A repository architecture for the required functionality and performance
- Distribution policies of web pages across network disks
- Crawler-repository interaction
- Strategies for organizing web pages on system nodes
- Experimental results from simulations on a prototype
WebBase: a prototype repository built at Stanford University.
Design assumptions for the web repository:
- Incremental crawler: only new or changed web pages are visited on each run
- Only the latest version of each page is retained
- Only HTML pages are crawled and stored
- Snapshot index construction
Slide 4: WebBase Architecture I
Functional modules and their interaction (architecture diagram).
Slide 5: WebBase Architecture II
Functional modules and their interaction:
- Crawler module: retrieves new or updated copies of web pages
- Storage module: assigns pages to storage devices, handles page updates, schedules and services requests, etc.
- Metadata-indexing module: indexes pages and the metadata extracted from them
- Query engine and multicast module: handle web content according to the access mode used on pages
Slide 6: Access modes and page identifiers
Access modes:
- Random access: pages are retrieved by their URL
- Query-based access: pages are retrieved in response to queries on page metadata or textual content (handled by the query engine)
- Streaming access: pages are retrieved and delivered as a data stream to requesting applications (handled by the multicast module)
Streams are available not only locally but to remote applications as well. Streams are restartable: they can be paused and resumed at will.
Page identifier: the page URL is first normalized by
- removing the protocol prefix,
- removing the port number specification,
- converting the server name to lower case, and
- removing all trailing slashes ("/").
The resulting text string is hashed using a signature computation to yield a 64-bit page identifier (signature collisions are unlikely to occur).
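The normalization and hashing steps above can be sketched as follows. The slide does not name a specific hash function, so truncated MD5 stands in here as an assumed "signature computation":

```python
import hashlib

def normalize_url(url: str) -> str:
    """Normalize a URL per the four rules on the slide (illustrative sketch)."""
    # 1. Remove the protocol prefix (e.g. "http://")
    if "://" in url:
        url = url.split("://", 1)[1]
    # 2 and 3. Separate the server name from the path, drop any port
    # number, and lower-case the server name (path case is preserved)
    server, _, path = url.partition("/")
    server = server.split(":", 1)[0].lower()
    # 4. Remove all trailing slashes
    return (server + ("/" + path if path else "")).rstrip("/")

def page_id(url: str) -> int:
    """64-bit page identifier from a signature of the normalized URL.
    MD5 truncated to 64 bits is an assumption, not the paper's choice."""
    digest = hashlib.md5(normalize_url(url).encode()).digest()
    return int.from_bytes(digest[:8], "big")
```

With this scheme, syntactic variants of the same URL collapse to one identifier, e.g. `page_id("HTTP://WWW-DB.Stanford.EDU:80/")` equals `page_id("http://www-db.stanford.edu/")`.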
Slide 7: Storage Manager (SM)
- Stores only the latest versions of web pages and provides facilities for their access and update
- Consistency of indexes must be maintained
- Expunging of obsolete pages is assisted by the allowed-lifetime and lifetime-count values associated with each page
For scalability, the SM is distributed across a collection of storage nodes, coordinated by a central node management server. The latter keeps a table of parameters describing the current state of each storage node (node capacity, extent of node fragmentation, state, number of requests).
(Diagram: crawler, stream requests, random access requests, and the node management server connected to the storage nodes over a LAN.)
Slide 8: Design issues for SM - I. Page distribution across nodes
- Uniform distribution: all nodes are treated identically; any page may be stored on any node
- Hash distribution: pages are stored on the node whose range of identifiers includes the page identifier
Uniform vs hash distribution:
- Uniform: requires a global index (mapping pageID -> nodeID); simple node addition; more robust to failures
- Hash: sparse global index (fixed pageID-nodeID relationship); node addition needs "extensible hashing"; special recovery measures required
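The contrast between the two policies can be sketched as below. The round-robin rule for the uniform case is an illustrative assumption, not the paper's policy:

```python
def hash_node(page_id: int, num_nodes: int) -> int:
    """Hash distribution: each node owns a fixed slice of the 64-bit
    identifier space, so pageID -> nodeID is implicit and only a
    sparse global index is needed."""
    return min(page_id * num_nodes // 2**64, num_nodes - 1)

def uniform_assign(page_id: int, num_nodes: int, global_index: dict) -> int:
    """Uniform distribution: any node may hold any page, so every
    assignment must be recorded in a global index that is consulted
    on each access (round-robin here is an assumed placement rule)."""
    node = len(global_index) % num_nodes
    global_index[page_id] = node
    return node
```

Node addition illustrates the tradeoff: under uniform distribution, a new node simply starts receiving pages, while under hash distribution the identifier ranges must be re-split, which is why the slide mentions extensible hashing.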
Slide 9: Design issues for SM - II. Organization of pages on disk
i. Hash-based organization
- Each disk is treated as a collection of hash buckets
- Pages are stored in buckets according to the pageID range each bucket holds
- Bucket overflows are handled by allocating extra overflow buckets
We assume that buckets with successive ranges of pageIDs are physically contiguous on disk, and that pages are stored within buckets in increasing order of their IDs.
How the fundamental operations are performed:
- Random page access: identify the containing bucket -> read it into memory -> search main memory to locate the page
- Streaming: sequentially read buckets into memory -> transmit pages to the client
- Page addition: add pages to buckets in memory (in order or not) -> write the modified buckets back to disk
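The random-access steps above can be sketched with in-memory stand-ins for the disk structures (a sorted list of buckets plays the disk; names are illustrative):

```python
import bisect

def find_page(buckets, page_id):
    """Random access in the hash-based organization.
    buckets: sorted list of (low_id, pages), where pages is a list of
    (page_id, content) kept in increasing pageID order, matching the
    slide's layout assumptions."""
    # 1. Identify the containing bucket by its pageID range
    lows = [low for low, _ in buckets]
    i = bisect.bisect_right(lows, page_id) - 1
    if i < 0:
        return None
    # 2. "Read" that bucket into memory...
    _, pages = buckets[i]
    # 3. ...and search it in main memory
    j = bisect.bisect_left(pages, (page_id,))
    if j < len(pages) and pages[j][0] == page_id:
        return pages[j][1]
    return None
```

Streaming then falls out naturally: reading the buckets in order yields all pages sorted by pageID.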
Slide 10: ii. Log-based organization
New pages received are appended at the end of the log.
Basic objects on disk:
- Log: holds the pages allocated on disk
- Catalog: one entry per page in the log, with useful information (pageID, pointer to the page's physical location in the log, page size, page status, timestamp of page addition)
- B-tree index, to support the random access mode
How the fundamental operations are performed:
- Random page access: requires two disk accesses
- Streaming: read the log sequentially, returning the valid pages
- Page addition: pages are appended to the log; catalog and B-tree modifications are periodically flushed to disk
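A minimal sketch of this organization, with a Python list standing in for the on-disk log and a dict for the catalog/B-tree (an assumption for illustration):

```python
class LogNode:
    """Sketch of the log-based organization on one storage node."""
    def __init__(self):
        self.log = []       # append-only sequence of page contents
        self.catalog = {}   # pageID -> (log offset, page size)

    def add_page(self, page_id, content):
        # Append to the log; any older version left behind is now
        # unreferenced and will be skipped when streaming
        self.catalog[page_id] = (len(self.log), len(content))
        self.log.append(content)

    def random_access(self, page_id):
        # Two lookups, mirroring the slide's two disk accesses:
        # catalog/B-tree first, then the log itself
        entry = self.catalog.get(page_id)
        return None if entry is None else self.log[entry[0]]

    def stream(self):
        # Sequential scan of the log, emitting only valid (current) pages
        live = {off for off, _ in self.catalog.values()}
        return [p for off, p in enumerate(self.log) if off in live]
```

Note how superseded versions stay physically in the log until some cleanup pass, which is why expunging obsolete pages is a separate concern.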
Slide 11: Design issues for SM - III. Update schemes
Classification of pages in the repository:
- Class A: old versions of pages that will be replaced
- Class B: unchanged pages
- Class C: unseen pages, or new versions of pages that will replace class A pages
General update process:
1. Receive class C pages from the crawler and add them to the repository.
2. Rebuild all the indexes using the class B and C pages.
3. Delete the class A pages.
Suggested update strategies: i. batch update, ii. incremental update.
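The three-way classification can be expressed as a small partition function, given the repository's current pages and the incremental crawl's fresh pages (both as pageID -> content maps; a sketch, not the paper's implementation):

```python
def classify(repository, crawl):
    """Partition pages into the slide's classes A, B and C."""
    class_c = dict(crawl)                       # new or changed pages
    class_a = {pid: repository[pid]             # old versions to be replaced
               for pid in crawl if pid in repository}
    class_b = {pid: page for pid, page in repository.items()
               if pid not in crawl}             # unchanged pages
    return class_a, class_b, class_c
```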
Slide 12: i. Batch update scheme
Two sets of storage nodes: update nodes (hold class C pages) and read nodes (hold class A and B pages).
Steps followed:
1. System isolation
2. Page transfer
3. System restart
Slide 13: Examples of page transfer in the batch update scheme
1. Log-structured page organization and hash distribution policy on both sets of nodes
- Class C pages are transmitted as streams and distributed to the read nodes by their pageID
- Deletion of class A pages requires a separate step
(Diagram: crawler feeds 4 update nodes, which stream pages to 12 read nodes.)
Slide 14: 2. Hash-based page organization and hash distribution policy on both sets
- Deletion of class A pages occurs while class C pages are added
- The addition is performed using a merge sort
- Advantages: no conflicts occur, and the physical location of pages is not changed (compaction becomes part of the update)
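The merge-sort style addition can be sketched as a single sequential pass that drops class A pages and merges the sorted class C run into the node's sorted pages (an illustrative sketch over in-memory lists):

```python
import heapq

def merge_update(read_pages, class_c):
    """Merge sorted class C pages into a read node's sorted pages,
    deleting the class A versions in the same pass.
    Both inputs are lists of (pageID, content) sorted by pageID."""
    new_ids = {pid for pid, _ in class_c}
    # Drop class A pages (old versions about to be replaced)...
    kept = [(pid, c) for pid, c in read_pages if pid not in new_ids]
    # ...then merge the two sorted runs sequentially
    return list(heapq.merge(kept, class_c, key=lambda pc: pc[0]))
```

Because both runs are already sorted by pageID, the whole update is one sequential sweep over the buckets, which is what avoids conflicts with the bucket layout.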
Slide 15: ii. Incremental update scheme
All nodes are equally responsible for supporting both page update and access at the same time -> continuous service provision.
Drawbacks of continuous service:
- Performance penalty, due to conflicts between the various operations
- The local index must be maintained dynamically
- Restartable streams are more complicated:
  - in batch update systems, the pair (node-id, page-id) provides sufficient information about a stream's state
  - in incremental update systems, where the physical locations of pages may change, additional stream state information is required
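The batch-update case above admits a very small stream-state record; a sketch (field and function names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class StreamState:
    """Resume point for a paused stream. In a batch-update system this
    pair suffices, because page locations never change between updates;
    an incremental system would need extra state."""
    node_id: int
    page_id: int

def resume(node_pages, state):
    """Restart a stream after the saved page and continue to the end.
    node_pages maps nodeID -> list of (pageID, content) sorted by pageID."""
    return [content for pid, content in node_pages[state.node_id]
            if pid > state.page_id]
```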
Slide 16: Experiments
WebBase prototype SM configuration:
- Batch update strategy: Batch[U(hash, log), R(hash, log)]
- Hash page distribution on both update and read nodes
- Log-structured page organization on both sets of nodes
- Implemented on top of a standard Linux FS
- The SM is fed 50-100 pages/sec by an incremental crawler
- A cluster of PCs connected by a 100 Mbps Ethernet LAN
- A client module that requests access to the repository, and a crawler emulator that retrieves/transmits pages accordingly, are also implemented
Performance metrics:
- Page addition rate (pages/sec/node)
- Streaming rate (pages/sec/node)
- Random access rate (pages/sec/node)
- Batch update time (for batch update systems)
Slide 17: Choosing a hash bucket size (space-performance tradeoff)
- If, on average, 16 pages are kept per bucket, a hash bucket size of 64 KB should be chosen; the average random page access time is then 20.7 ms (the optimal point, plot A)
- As buckets grow, space utilization and streaming performance improve, but random access suffers
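The tradeoff can be made concrete with a rough cost model: one random access pays a seek plus the time to read the whole bucket. The disk parameters below are assumptions plausible for the era, not values taken from the paper:

```python
def random_access_ms(bucket_kb, seek_ms=15.0, transfer_mb_per_s=11.0):
    """Rough model of one random page access in the hash-based
    organization: one seek/rotation plus a full-bucket read.
    seek_ms and transfer_mb_per_s are assumed parameters."""
    transfer_ms = bucket_kb / 1024.0 / transfer_mb_per_s * 1000.0
    return seek_ms + transfer_ms
```

Under these assumed parameters a 64 KB bucket costs roughly 20-21 ms per access, in the neighborhood of the slide's 20.7 ms figure, and doubling the bucket size raises the cost: larger buckets amortize seeks for streaming but make each random read pay for more transferred data.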
Slide 18: Comparing different node organizations
Hashed-log hybrid organization: the disk contains a number of large logs (8-10 MB), each one associated with a range of hash values.

Performance metric (pages/sec)             | Log-structured | Hash-based  | Hashed-log
Streaming rate and ordering                | 6300 unsorted  | 3900 sorted | 6300 sorted
Random page access rate                    | 35             | 51          | 35
Page addition (random order, no buffering) | 6100           | 23          | 53
Page addition (random order, 10 MB buffer) | 6100           | 35          | 660
Page addition (sorted order, 10 MB buffer) | 6100           | 1300        |
Slide 19: Comparing different configurations
Assumption: 25% of the pages on the read nodes are replaced by newer versions during the update process (update ratio = 0.25).

System configuration                      | Page addition rate [pages/sec/node] | Batch update time (update ratio = 0.25)
Batch[U(hash, log), R(hash, hash)]        | 6100                                | 11700 secs
Batch[U(hash, hash), R(hash, hash)]       | 35                                  | 1260 secs
Batch[U(hash, hashed-log), R(hash, hash)] | 660                                 | 1260 secs
Slide 20: Performance of the prototype
Experiments on the overall system performance of the prototype:

Performance metric      | Observed value
Streaming rate          | 2800 pages/sec (per read node)
Page addition rate      | 3200 pages/sec (per update node)
Batch update time       | 2451 seconds (for update ratio = 0.25)
Random page access rate | 33 pages/sec (per read node)
Slide 21: Summary - Relative performance of different system configurations
Symbols are ordered from most to least favorable: ++, +, +-, -, --.

System configuration                       | Stream | Random access | Page addition | Update time
Incr[hash, log]                            | +      | -             | --            | inapplicable
Incr[uniform, log]                         | +      | --            | +             | inapplicable
Incr[hash, hash]                           | +      | +             | -             | inapplicable
Batch[U(hash, log), R(hash, log)]          | ++     | -             | +             | -
Batch[U(hash, log), R(hash, hash)]         | +      | +             | ++            | --
Batch[U(hash, hash), R(hash, hash)]        | +      | +             | -             | +
Batch[U(hash, hashed-log), R(hash, hash)]  | +      | +             | +-            | +
Slide 22: Conclusions
We provided an overview of:
- The WebBase prototype architecture
- Performance metrics based on simulation experiments
- WebBase as a research test-bed for various system configurations
Future enhancements to WebBase include:
- Implementation of advanced system configurations
- Development of advanced streaming facilities (e.g. delivering streams for subsets of the web pages in the repository)
- Integration of a history-maintaining service for old, replaced web pages