Google and Scalable Query Services


Google and Scalable Query Services Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 29, 2008

Administrivia
- Next reading: Olston: Pig Latin
  - Please write a review of this paper
- Please send me an email update of your project status by Tuesday, 1 PM

Databases and Search
Eric Brewer, UC Berkeley and founder, Inktomi: database management systems are not the right tool for running a search engine. Why?
- Scale-up: they don’t focus on thousands of machines
- Overheads: SQL parsing, locking, …
- Most of the features aren’t used: transactions, joins, …

So How Would One Build a Search System from the Ground Up?
Fortunately, some database PhD students at Stanford tried to do that…

Google Architecture [Brin/Page 98]
- Focus was on scalability to the size of the Web
- First to really exploit link analysis
- Started as an academic project at Stanford; became a startup
- Our discussion will be on early Google; today they keep things secret!

Google: A Very Specialized DBMS
- Commodity, cheap hardware
  - Unreliable
  - Not very powerful
  - A fair amount of memory, reasonable hard disks
- Lots of racks
  - Special air conditioning, power systems, big net pipes
- Special queries, no need for locking
- Partitioning of service between different versions:
  - The version being crawled and fleshed out
  - The version being searched
  - (Really, different pieces can be crawled and updated at different times)

What Does Google Need to Do?
- Scalable crawling of documents
- Archival of documents (“cache”)
- Inverted indexing
- Duplicate removal
- Ranking, which requires iteration over the link structure
  - PageRank
  - TF/IDF
  - Heuristics
- Do the new Google services change any of that?
  - Some may not need the crawler, e.g., maps, perhaps Froogle

The Heart of Google Storage
- The main database: Repository
  - Basically, a warehouse of every HTML page (this is the cached page entry), compressed with zlib
  - Useful for doing additional processing and any necessary rebuilds
- Repository entry format: [DocID][ECode][UrlLen][PageLen][Url][Page]
- The repository is indexed (not inverted here)
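The entry layout above can be sketched as a length-prefixed binary record. This is only an illustration: the field widths (4-byte DocID, 1-byte ECode, 2-byte UrlLen, 4-byte PageLen) and the names `pack_entry` and `unpack_entry` are assumptions, since the slides give the field order but not the sizes.

```python
import struct
import zlib

# Assumed header layout (not given in the slides): little-endian
# DocID (uint32), ECode (uint8), UrlLen (uint16), PageLen (uint32).
HEADER = struct.Struct("<IBHI")


def pack_entry(doc_id: int, ecode: int, url: str, page_html: str) -> bytes:
    """Serialize one repository entry; the page body is zlib-compressed."""
    url_bytes = url.encode("utf-8")
    page_bytes = zlib.compress(page_html.encode("utf-8"))
    header = HEADER.pack(doc_id, ecode, len(url_bytes), len(page_bytes))
    return header + url_bytes + page_bytes


def unpack_entry(buf: bytes, offset: int = 0):
    """Read one entry at `offset`; return (fields, offset of the next entry)."""
    doc_id, ecode, url_len, page_len = HEADER.unpack_from(buf, offset)
    url_start = offset + HEADER.size
    page_start = url_start + url_len
    url = buf[url_start:page_start].decode("utf-8")
    page = zlib.decompress(buf[page_start:page_start + page_len]).decode("utf-8")
    return (doc_id, ecode, url, page), page_start + page_len
```

Because each entry carries its own lengths, entries can be appended back to back and scanned sequentially, which suits the "rebuild everything from the repository" use mentioned above.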

Repository Index
- One index for looking up documents by DocID
  - Done in ISAM (think of this as a B+ tree without smart re-balancing)
  - Index points to repository entries (or to the URL entry if not crawled)
- One index for mapping URL to DocID
  - Sorted by checksum of URL
  - Compute checksum of URL, then binary-search by checksum
  - Allows update by merge with another similar file
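A minimal sketch of the URL-to-DocID index, assuming CRC-32 as the checksum (the slides do not say which checksum was used) and an in-memory sorted list standing in for the on-disk file. The function names are hypothetical.

```python
import bisect
import heapq
import zlib


def url_checksum(url: str) -> int:
    # CRC-32 is a stand-in; the actual checksum function is not specified.
    return zlib.crc32(url.encode("utf-8"))


def build_url_index(url_to_docid: dict) -> list:
    """Return a list of (checksum, url, doc_id) tuples sorted by checksum."""
    return sorted((url_checksum(u), u, d) for u, d in url_to_docid.items())


def lookup(index: list, url: str):
    """Binary-search by checksum, then compare URLs to resolve collisions."""
    c = url_checksum(url)
    i = bisect.bisect_left(index, (c,))
    while i < len(index) and index[i][0] == c:
        if index[i][1] == url:
            return index[i][2]
        i += 1
    return None


def merge_indexes(a: list, b: list) -> list:
    """Update by merging two sorted index files in one sequential pass."""
    return list(heapq.merge(a, b))
```

Keeping the file sorted by checksum is what makes the batch update cheap: newly discovered URLs are sorted the same way and merged in a single pass instead of being inserted one at a time.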

Lexicon
- The list of searchable words
  - (Presumably, today it’s used to suggest alternative words as well)
- The “root” of the inverted index
- As of 1998, 14 million “words”
  - Kept in memory (was 256MB)
- Two parts:
  - Hash table of pointers to words and the “barrels” (partitions) they fall into
  - List of words (null-separated)
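The two-part layout can be sketched as one byte string of null-separated words plus a hash table that points into it. The barrel assignment by contiguous WordID range and the names `build_lexicon` and `word_at` are assumptions made for illustration.

```python
def build_lexicon(words, words_per_barrel=2):
    """Build (blob, table): a null-separated word list and a hash table
    mapping each word to (word_id, offset into blob, barrel number)."""
    blob = bytearray()
    table = {}
    for word_id, word in enumerate(words):
        offset = len(blob)
        blob += word.encode("utf-8") + b"\x00"
        # Assumption: barrels cover contiguous ranges of word IDs.
        table[word] = (word_id, offset, word_id // words_per_barrel)
    return bytes(blob), table


def word_at(blob: bytes, offset: int) -> str:
    """Recover the word that starts at `offset` by reading up to the null."""
    end = blob.index(b"\x00", offset)
    return blob[offset:end].decode("utf-8")
```

Storing the words in one contiguous block keeps per-word overhead low, which matters when the whole structure has to stay in memory.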

Indices: Inverted and “Forward”
- Inverted index divided into “barrels” / “shards” (partitions)
  - Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
- Forward index uses the same barrels
  - Used to find multi-word queries with words in the same barrel
  - Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs
- Two kinds of barrels: short (anchor and title); full (all text)
(original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm)
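The relationship between the two indices can be sketched as follows: the forward index is keyed by DocID, and the inverted index is the same data regrouped by WordID. Barrel partitioning is left out, hit lists are reduced to bare word positions, and the function names are hypothetical.

```python
from collections import defaultdict


def build_forward_index(docs: dict, lexicon: dict) -> dict:
    """docs: {doc_id: text}; lexicon: {word: word_id}.
    Returns {doc_id: {word_id: [positions]}}."""
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, token in enumerate(text.lower().split()):
            if token in lexicon:
                hits[lexicon[token]].append(position)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward: dict) -> dict:
    """Regroup the forward index by word ID.
    Returns {word_id: [(doc_id, [positions]), ...]} with doc IDs ascending."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, hit_list in forward[doc_id].items():
            inverted[word_id].append((doc_id, hit_list))
    return dict(inverted)
```

Emitting doc IDs in ascending order keeps every doclist sorted, which is what lets query processing scan several doclists in step.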

Hit Lists (Not Mafia-Related)
- Used in inverted and forward indices
- Goal was to minimize the size, since the bulk of the data is in hit entries
- For the 1998 version, made it down to 2 bytes per hit (though that’s likely climbed since then)
- Field widths in bits (each row adds up to 16):

  Hit type   cap   font size     type   hash   position
  Plain      1     3             -      -      12
  Fancy      1     3 (set to 7)  4      -      8
  Anchor     1     3 (set to 7)  4      4      4

- A font-size value of 7 flags a fancy hit; anchor hits are a special case of fancy hits
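A sketch of packing a plain hit into 16 bits. The bit order (capitalization in the top bit, then font size, then position) and the clamping of large positions to 4095 are assumptions for illustration; only the field widths come from the slide.

```python
import struct

FANCY_FLAG = 7  # font-size value reserved to mark a fancy hit


def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack cap (1 bit), font size (3 bits), position (12 bits) into 16 bits."""
    if not 0 <= font_size < FANCY_FLAG:
        raise ValueError("font size must be 0..6; 7 is reserved for fancy hits")
    position = min(position, 0xFFF)  # assumption: clamp positions past 4095
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit: int):
    """Inverse of pack_plain_hit: return (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


def to_bytes(hit: int) -> bytes:
    """Serialize a packed hit as exactly two bytes."""
    return struct.pack("<H", hit)
```

For example, a capitalized word with relative font size 3 at position 100 packs to `0xB064`, which `to_bytes` writes as two bytes.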

Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, go to step 4
8. Sort the documents by rank; return the top K
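The steps above can be sketched as a doclist intersection over the short barrel first, falling back to the full barrel only when too few documents matched. This is a simplified sketch: barrels are modeled as `{word_id: sorted list of doc IDs}`, `rank` is a hypothetical callable supplied by the caller, and the names `intersect` and `search` are not from the slides.

```python
def intersect(doclists):
    """Scan sorted doclists in step; return doc IDs present in all of them."""
    if not doclists or any(not d for d in doclists):
        return []
    cursors = [0] * len(doclists)
    matches = []
    while all(c < len(d) for c, d in zip(cursors, doclists)):
        current = [d[c] for c, d in zip(cursors, doclists)]
        target = max(current)
        if all(x == target for x in current):
            matches.append(target)
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the cursors that are behind the largest doc ID.
            cursors = [c + (d[c] < target) for c, d in zip(cursors, doclists)]
    return matches


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    words = query.lower().split()                      # step 1: parse
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]             # step 2: wordIDs
    results = {}
    for barrel in (short_barrel, full_barrel):         # steps 3 and 6
        doclists = [barrel.get(w, []) for w in word_ids]
        for doc_id in intersect(doclists):             # steps 4 and 7
            results.setdefault(doc_id, rank(doc_id, word_ids))  # step 5
        if len(results) >= k:                          # "unless we have enough"
            break
    return sorted(results, key=results.get, reverse=True)[:k]  # step 8
```

Checking the short barrel first means that documents matching in titles and anchor text are found before any full-text doclist is read.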

Ranking in Google
Considers many types of information:
- Position, font size, capitalization
- Anchor text
- PageRank
  - Done offline, in a non-query-sensitive way
- Count of occurrences (basically, TF) in a way that tapers off
- Multi-word queries consider proximity also
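Because PageRank depends only on the link graph, it can be computed offline by repeated iteration. Below is a minimal power-iteration sketch using the common normalized formulation; the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages with no out-links are assumptions of this sketch, not details taken from the slides.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """links: {page: [pages it links to]}. Returns {page: rank}, summing to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[p] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

Nothing here looks at a query, which is why the result can be precomputed and stored alongside each document, then combined with the query-dependent signals at search time.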

Why Isn’t Google Based on a DBMS?
- Transactional locking is not necessary anywhere in the system
  - Helps with partitioning and replication
- Main memory indexing on lexicon
- Unusual query model: what’s special here?
- Weird consistency model!
  - OK if different users see different views
  - As long as we route the same user to the same machine(s), we’re OK
- Updates are happening in a separate “instance”
  - Slipstream it in place
  - Can even extend this to change versions of software on the machines, as long as interfaces stay the same

Could We Change a DBMS?
- What would a DBMS for Google-like environments look like?
- What would it be useful for, other than Google?

Beyond Google
What if we wanted to:
- Add on-the-fly query capabilities to Google?
  - e.g., query over up-to-the-second stock market results
- Use WordNet or some thesaurus to supplement Google?
- Do PageRank in a topic-specific way?
- Supplement Google with “ontology” info?
- Do some sort of XML path matching along with keywords?
- Allow for OLAP-style analysis?
- Do a cooperative, e.g., P2P, Google?
  - Benefits of this?

Beyond Search…
… Can we think about programming custom, highly parallel apps on a Google-like cluster?
- “Cloud computing”
- MapReduce / Hadoop
- Pig Latin
- Dryad
- Sawzall
- EC2
- Windows Azure