Web Search – Summer Term 2006 V. Web Search - Page Repository (c) Wolfgang Hürst, Albert-Ludwigs-University.


[Figure: General Web Search Engine Architecture (cf. [1], Fig. 1). Components: Client, Query Engine, Ranking, Crawl Control, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Structure, Utility, Text). Labeled flows: Queries, Results, Usage Feedback, WWW.]

Page Repository
Page Repository = a scalable storage system for managing large collections of web pages
Why do we need one?
- To keep a local copy of a subset of the web from which the index is created
- To provide a cache function that shows the state of a web page at the time of indexing and supplies textual excerpts for the results list
Required functionality / interfaces:
- Interface to the Crawler to store new or updated pages
- Interface to the Indexer Module to create and update the index
- Interface to the Query Engine to present result pages (or parts of them)

Problems and Challenges
- Scalability: the size of the repository requires distribution over a cluster of computers and disks
- Different access modes are required by the interfaces, e.g. random access for fast access to a particular page (result presentation) and streaming access for efficient access to a larger subset (indexing)
- Large updates are required due to the size and high rate of change of the web (conflicts must be avoided)
- Obsolete pages need to be identified and removed

Architecture and resulting requirements
Because of the size of the repository: distribution over several storage nodes (or network disks)
A storage manager is responsible for
a) the distribution of the pages to the storage nodes
b) the physical organization of the pages within one storage node
c) the update mechanism and the strategy used

a) Distribution of pages to storage nodes
Different strategies exist, e.g.
- Uniform distribution policy: all storage nodes are treated equally, i.e. pages are assigned to random nodes
  Advantages:
  - Adding new pages is easy
  - Robust against failure of single nodes
- Hash distribution policy: pages are assigned to nodes based on some hash strategy, e.g. by allocating certain intervals of a page identifier to specific nodes
  Advantage: easy and fast access
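The two policies can be contrasted in a few lines. The following is only a minimal sketch: the function names, the use of SHA-256 to derive a 64-bit page identifier, and the equal-sized identifier intervals are assumptions made for illustration, not details taken from the slides or from [1].

```python
import hashlib
import random


def page_identifier(url: str) -> int:
    """Derive a 64-bit page identifier from the URL (assumed scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def uniform_node(num_nodes: int, rng: random.Random) -> int:
    """Uniform policy: every node is equally eligible, so pick one at random."""
    return rng.randrange(num_nodes)


def hash_node(url: str, num_nodes: int) -> int:
    """Hash policy: split the identifier space into num_nodes equal intervals
    and return the index of the interval the page identifier falls into."""
    return (page_identifier(url) * num_nodes) >> 64
```

With the hash policy the node holding a page can be recomputed from its identifier alone, which matches the "easy and fast access" advantage listed above. With the uniform policy a lookup additionally needs some record of where each page was placed.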

b) Physical organization within one node
Operations supported by one node:
- Adding new pages
- Access to existing pages, a) via random access and b) via streaming access
Different strategies exist, e.g.
- Hash-based organization, e.g. distribution of pages into individual buckets
- Log-structured page organization with
  - a log containing all pages
  - a catalog containing information about the pages
  - a B-tree index mapping page identifiers to their physical positions (for random access)
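The log-structured organization can be sketched roughly as follows. This is a toy, in-memory illustration under several assumptions: the class name, the record header layout, and the catalog contents are hypothetical, an `io.BytesIO` buffer stands in for the on-disk log, and a plain dictionary stands in for the B-tree index.

```python
import io
import struct
from typing import Dict, Iterator, Tuple

HEADER = struct.Struct(">HI")  # assumed record header: id length, content length


class LogStructuredStore:
    """Toy single-node store: append-only log plus catalog and index."""

    def __init__(self) -> None:
        self._log = io.BytesIO()            # stands in for an append-only file
        self._index: Dict[str, int] = {}    # page id -> offset (dict, not a B-tree)
        self._catalog: Dict[str, int] = {}  # page id -> content length (metadata)

    def add(self, page_id: str, content: bytes) -> None:
        """Append a record to the end of the log and point the index at it."""
        key = page_id.encode("utf-8")
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(HEADER.pack(len(key), len(content)) + key + content)
        self._index[page_id] = offset
        self._catalog[page_id] = len(content)

    def _read_record(self, offset: int) -> Tuple[str, bytes, int]:
        """Read the record at offset; also return the offset of the next one."""
        self._log.seek(offset)
        key_len, content_len = HEADER.unpack(self._log.read(HEADER.size))
        key = self._log.read(key_len).decode("utf-8")
        content = self._log.read(content_len)
        return key, content, self._log.tell()

    def get(self, page_id: str) -> bytes:
        """Random access: one index lookup, then one read at that offset."""
        return self._read_record(self._index[page_id])[1]

    def stream(self) -> Iterator[Tuple[str, bytes]]:
        """Streaming access: scan the log front to back, skipping stale copies."""
        offset, end = 0, self._log.seek(0, io.SEEK_END)
        while offset < end:
            key, content, next_offset = self._read_record(offset)
            if self._index.get(key) == offset:  # only the newest copy is live
                yield key, content
            offset = next_offset
```

In this sketch adding a page is always a sequential append, random access costs one index lookup, and streaming access is a single pass over the log. Older copies of an updated page stay in the log and are skipped during the scan.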

c) Update mechanism and strategy
Depends on the crawler:
- Incremental vs. periodic crawler
- Batch-mode vs. steady crawler
Based on the crawler implementation: update in place or via shadowing
Advantages of shadowing:
- Strict separation of update and access
- Better performance, easier implementation
Advantages of in-place updates:
- Better freshness because of the lower delay between crawling and update
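The difference between the two update strategies can be illustrated with a small sketch. The class and method names are hypothetical, and the shadowing variant assumes a periodic crawler whose finished crawl replaces the previous collection as a whole.

```python
from typing import Dict, Optional


class InPlaceRepository:
    """In-place update: crawler writes go straight into the live collection."""

    def __init__(self) -> None:
        self._live: Dict[str, bytes] = {}

    def store(self, url: str, content: bytes) -> None:
        self._live[url] = content  # visible to readers immediately

    def read(self, url: str) -> Optional[bytes]:
        return self._live.get(url)


class ShadowingRepository:
    """Shadowing: crawler writes go to a separate shadow collection that
    replaces the live one only when the crawl is complete (assumption)."""

    def __init__(self) -> None:
        self._live: Dict[str, bytes] = {}
        self._shadow: Dict[str, bytes] = {}

    def store(self, url: str, content: bytes) -> None:
        self._shadow[url] = content  # readers never see the shadow copy

    def read(self, url: str) -> Optional[bytes]:
        return self._live.get(url)

    def switch(self) -> None:
        """Promote the finished crawl to live and start an empty shadow."""
        self._live, self._shadow = self._shadow, {}
```

Reads and writes never touch the same collection in the shadowing variant, which reflects the strict separation listed above. The price is that a freshly crawled page stays invisible until `switch()` is called, whereas the in-place variant exposes it at once.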

[Figure: General Web Search Engine Architecture (cf. [1], Fig. 1), the same diagram as before with a Storage Manager component added.]

Page Repository of the 1st Google Version
- Full HTML of every web page
- Compressed using zlib
(cf. [2], Repository section)
Repository: SYNC | LENGTH | COMPRESSED PACKET | SYNC | LENGTH | COMPRESSED PACKET | ...
Packet (stored compressed in repository): DOCID | ECODE | URLLEN | PAGELEN | URL | PAGE
Repository size: 53.5 GB compressed = 147.8 GB uncompressed
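A record in this layout could be written and read back as sketched below. Only the field order and the use of zlib come from the slide. The field widths, the sync marker value, and the function names are assumptions chosen for illustration.

```python
import struct
import zlib
from typing import Tuple

SYNC = b"\xde\xad\xbe\xef"              # assumed sync marker value
LENGTH = struct.Struct(">I")            # assumed width of the length field
PACKET_HEADER = struct.Struct(">QBHI")  # docid, ecode, urllen, pagelen (assumed widths)


def encode_record(docid: int, ecode: int, url: str, page: bytes) -> bytes:
    """Build one repository record: sync, length, zlib-compressed packet."""
    url_bytes = url.encode("utf-8")
    header = PACKET_HEADER.pack(docid, ecode, len(url_bytes), len(page))
    compressed = zlib.compress(header + url_bytes + page)
    return SYNC + LENGTH.pack(len(compressed)) + compressed


def decode_record(record: bytes) -> Tuple[int, int, str, bytes]:
    """Inverse of encode_record; returns (docid, ecode, url, page)."""
    if record[:len(SYNC)] != SYNC:
        raise ValueError("missing sync marker")
    (length,) = LENGTH.unpack_from(record, len(SYNC))
    start = len(SYNC) + LENGTH.size
    packet = zlib.decompress(record[start:start + length])
    docid, ecode, urllen, pagelen = PACKET_HEADER.unpack_from(packet)
    body = PACKET_HEADER.size
    url = packet[body:body + urllen].decode("utf-8")
    page = packet[body + urllen:body + urllen + pagelen]
    return docid, ecode, url, page
```

Because every record carries its own length, the repository can be scanned sequentially without an external index. For scale, the sizes quoted on the slide correspond to a compression ratio of about 147.8 / 53.5 ≈ 2.76.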

References - Page Repository
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug; Chapter 3 (Storage)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998; Chapter (Repository)
[3] Hirai, Raghavan, Garcia-Molina, Paepcke: "WebBase: A Repository of Web Pages", WWW 2000