Google and Scalable Query Services
Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 29, 2008
Administrivia
Next reading: Olston: Pig Latin
Please write a review of this paper
Please send me an update of your project status by Tuesday 1 PM
Databases and Search
Eric Brewer (UC Berkeley; founder of Inktomi): database management systems are not the right tool for running a search engine. Why?
  Scale-up – they don’t focus on thousands of machines
  Overheads – SQL parsing, locking, …
  Most of the features aren’t used – transactions, joins, …
So How Would One Build a Search System from the Ground Up?
Fortunately, some database PhD students at Stanford tried to do that…
Google Architecture [Brin/Page 98]
Focus was on scalability to the size of the Web
First to really exploit link analysis
Started as an academic project at Stanford; became a startup
Our discussion will be on early Google – today they keep things secret!
Google: A Very Specialized DBMS
Commodity, cheap hardware
  Unreliable
  Not very powerful
  A fair amount of memory, reasonable hard disks
Lots of racks
Special air conditioning, power systems, big net pipes
Special queries, no need for locking
Partitioning of service between different versions:
  The version being crawled and fleshed out
  The version being searched
(Really, different pieces can be crawled & updated at different times)
What Does Google Need to Do?
Scalable crawling of documents
Archival of documents (“cache”)
Inverted indexing
Duplicate removal
Ranking – requires iteration over link structure
  PageRank
  TF/IDF
  Heuristics
Do the new Google services change any of that? Some may not need the crawler, e.g., maps, perhaps Froogle
The Heart of Google Storage
The main database: the Repository
  Basically, a warehouse of every HTML page (this is the cached page entry), compressed with zlib
  Useful for doing additional processing, any necessary rebuilds
Repository entry format: [DocID][ECode][UrlLen][PageLen][Url][Page]
The repository is indexed (not inverted here)
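To make the entry layout concrete, here is a minimal Python sketch of packing and unpacking one repository record. The field widths (4-byte DocID, 1-byte ECode, 2-byte UrlLen, 4-byte PageLen) are assumptions for illustration; the slide gives only the field order and the fact that pages are zlib-compressed.

```python
import struct
import zlib

# Hypothetical field widths chosen for the sketch; the slide specifies only
# the order [DocID][ECode][UrlLen][PageLen][Url][Page].
HEADER = struct.Struct("<I B H I")   # docid, ecode, urllen, pagelen

def pack_entry(docid, ecode, url, page_html):
    """Serialize one repository entry, compressing the page with zlib."""
    page = zlib.compress(page_html)
    return HEADER.pack(docid, ecode, len(url), len(page)) + url + page

def unpack_entry(buf):
    """Inverse of pack_entry: returns (docid, ecode, url, decompressed page)."""
    docid, ecode, urllen, pagelen = HEADER.unpack_from(buf, 0)
    off = HEADER.size
    url = buf[off:off + urllen]
    page = zlib.decompress(buf[off + urllen:off + urllen + pagelen])
    return docid, ecode, url, page

entry = pack_entry(42, 0, b"http://example.com/", b"<html>...</html>")
print(unpack_entry(entry))
```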
Repository Index
One index for looking up documents by DocID
  Done in ISAM (think of this as a B+ tree without smart re-balancing)
  Index points to repository entries (or to the URL entry if not yet crawled)
One index for mapping URL to DocID
  Sorted by checksum of URL
  Compute the checksum of the URL, then binary-search by checksum
  Allows update by merging with another, similarly sorted file
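A toy sketch of the URL-to-DocID lookup just described: records sorted by a checksum of the URL, probed by binary search, and rebuilt by merging with a similarly sorted file from a later crawl. The checksum function (truncated MD5) and record layout here are assumptions; early Google used its own checksum.

```python
import hashlib
from bisect import bisect_left

def checksum(url):
    # Any stable hash works for the sketch; the real system used its own checksum.
    return int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")

class UrlIndex:
    def __init__(self, url_to_docid):
        # Sorted (checksum, docid) pairs; a sorted file like this can be updated
        # by merging it with another similarly sorted file, as the slide notes.
        self.records = sorted((checksum(u), d) for u, d in url_to_docid.items())
        self.keys = [c for c, _ in self.records]

    def lookup(self, url):
        c = checksum(url)
        i = bisect_left(self.keys, c)
        if i < len(self.keys) and self.keys[i] == c:
            return self.records[i][1]
        return None

idx = UrlIndex({"http://a.example/": 1, "http://b.example/": 2})
print(idx.lookup("http://b.example/"))   # -> 2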
Lexicon
The list of searchable words – as of 1998, 14 million “words”
(Presumably, today it’s used to suggest alternative words as well)
The “root” of the inverted index
Kept in memory (was 256 MB)
Two parts:
  Hash table of pointers to words and the “barrels” (partitions) they fall into
  List of the words themselves (null-separated)
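A minimal sketch of the two-part lexicon idea: an in-memory mapping from word to wordID, plus a rule for which barrel a word’s postings live in. Routing by wordID modulo a fixed barrel count is an assumption for illustration only; the 1998 paper describes barrels holding contiguous ranges of wordIDs.

```python
NUM_BARRELS = 64   # arbitrary count chosen for the sketch

class Lexicon:
    """In-memory word list: hash table from word to wordID, plus barrel routing."""

    def __init__(self):
        self.ids = {}

    def word_id(self, word):
        # Assign wordIDs in first-seen order, as if built up during a crawl.
        return self.ids.setdefault(word, len(self.ids))

    def barrel_of(self, word):
        # Modulo routing is an illustration; the real barrels held wordID ranges.
        return self.word_id(word) % NUM_BARRELS

lex = Lexicon()
print(lex.word_id("database"), lex.barrel_of("scalable"))
```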
Indices – Inverted and “Forward”
Inverted index divided into “barrels” / “shards” (partitions)
  Indexed by the lexicon; for each DocID, consists of a hit list of entries in the document
Forward index uses the same barrels
  Used to answer multi-word queries with words in the same barrel
  Indexed by DocID, then a list of WordIDs in this barrel and this document, then the hit lists corresponding to the WordIDs
Two barrels: short (anchor and title); full (all text)
(Original tables from [Brin/Page 98])
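Roughly, each barrel holds both index shapes over the same subset of words. A small sketch of that layout follows, with hit lists simplified to lists of word positions and the same barrel-routing assumptions as the lexicon sketch above.

```python
from collections import defaultdict

class Barrel:
    """One partition holding both the inverted and the forward view."""

    def __init__(self):
        # Inverted: wordID -> list of (docID, hit list) postings.
        self.inverted = defaultdict(list)
        # Forward: docID -> {wordID: hit list}, restricted to words in this barrel.
        self.forward = defaultdict(dict)

    def add_hits(self, doc_id, word_id, hits):
        self.inverted[word_id].append((doc_id, hits))
        self.forward[doc_id][word_id] = hits

b = Barrel()
b.add_hits(doc_id=7, word_id=3, hits=[0, 12, 40])
print(b.inverted[3], b.forward[7])
```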
Hit Lists (Not Mafia-Related)
Used in inverted and forward indices
Goal was to minimize the size – the bulk of the data is in the hit entries
For the 1998 version, made it down to 2 bytes per hit (though that’s likely climbed since then); field widths in bits:
  Plain hit:  cap: 1 | font: 3 | position: 12
  Fancy hit:  cap: 1 | font: 3 (set to 7 as a flag) | type: 4 | position: 8
  …special-cased for anchors to:
  Anchor hit: cap: 1 | font: 3 (= 7) | type: 4 | hash: 4 | pos: 4
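A sketch of how a plain hit fits into two bytes with the widths above (1 capitalization bit, 3 font bits, 12 position bits). The particular bit ordering is an arbitrary choice for the sketch; the slide only specifies the widths.

```python
def pack_plain_hit(cap, font, position):
    # cap: 1 bit | font: 3 bits | position: 12 bits -> one 16-bit value.
    assert 0 <= font < 8 and 0 <= position < 4096
    return (int(cap) << 15) | (font << 12) | position

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

h = pack_plain_hit(cap=True, font=3, position=1776)
print(h.to_bytes(2, "big"), unpack_plain_hit(h))   # exactly 2 bytes per hit
# Fancy/anchor hits reuse the same 2 bytes, with font = 7 acting as the flag.
```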
Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel (unless we already have enough) and continue from step 4
7. If not at the end of any doclist, go to step 4
8. Sort the documents by rank; return the top K
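A simplified rendering of steps 3–7: walk the sorted doclists for all query words in lockstep, rank each document that contains every term, and return the top k. The ranking function is a placeholder here; the actual scoring signals are described on the next slide.

```python
def intersect_and_rank(doclists, rank_fn, k=10):
    """doclists: one sorted list of docIDs per query word."""
    iters = [iter(sorted(dl)) for dl in doclists]
    current = [next(it, None) for it in iters]
    matches = []
    while all(d is not None for d in current):
        hi = max(current)
        if all(d == hi for d in current):
            matches.append((rank_fn(hi), hi))        # step 5: rank the match
            current = [next(it, None) for it in iters]
        else:
            # Advance every doclist that is behind the largest docID seen.
            current = [d if d == hi else next(it, None)
                       for d, it in zip(current, iters)]
    return [doc for _, doc in sorted(matches, reverse=True)[:k]]

print(intersect_and_rank([[1, 3, 5, 9], [3, 4, 9]], rank_fn=lambda d: 1.0 / d))
```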
Ranking in Google
Considers many types of information:
  Position, font size, capitalization
  Anchor text
  PageRank – done offline, in a non-query-sensitive way
  Count of occurrences (basically, TF) in a way that tapers off
Multi-word queries consider proximity also
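Because PageRank is computed offline and independently of the query, it fits in a few lines of power iteration. A minimal sketch, using the damping factor 0.85 from the original paper and a uniform (1 − d)/N teleport term (normalization conventions vary):

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            targets = outs or pages          # dangling pages spread rank evenly
            for q in targets:
                new[q] += damping * rank[p] / len(targets)
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```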
Why Isn’t Google Based on a DBMS?
Transactional locking is not necessary anywhere in the system
  Helps with partitioning and replication
Main-memory indexing on the lexicon
Unusual query model – what’s special here?
Weird consistency model! OK if different users see different views
  As long as we route the same user to the same machine(s), we’re OK
Updates happen in a separate “instance” and are slipstreamed into place
  Can even extend this to change versions of software on the machines – as long as interfaces stay the same
Could We Change a DBMS?
What would a DBMS for Google-like environments look like?
What would it be useful for, other than Google?
Beyond Google
What if we wanted to:
  Add on-the-fly query capabilities to Google? e.g., query over up-to-the-second stock market results
  Use WordNet or some thesaurus to supplement Google?
  Do PageRank in a topic-specific way?
  Supplement Google with “ontology” info?
  Do some sort of XML path matching along with keywords?
  Allow for OLAP-style analysis?
  Do a cooperative, e.g., P2P, Google? Benefits of this?
Beyond Search…
…Can we think about programming custom, highly parallel apps on a Google-like cluster? “Cloud computing”
  MapReduce / Hadoop
  Pig Latin
  Dryad
  Sawzall
  EC2
  Windows Azure
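As a taste of the programming model behind MapReduce/Hadoop and languages like Pig Latin and Sawzall, here is a toy, single-process word count: user code supplies only the map and reduce functions, and the real frameworks handle partitioning, shuffling, and fault tolerance across the cluster.

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc):
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(x) for x in inputs):
        groups[key].append(value)                 # the "shuffle" phase
    return dict(chain.from_iterable(
        reduce_fn(k, vs) for k, vs in groups.items()))

print(run_mapreduce(["to be or not to be"], map_fn, reduce_fn))
```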