Published by Beatrix Pitts. Modified over 8 years ago.
1
The Anatomy of a Large-Scale Hypertextual Web Search Engine Kevin Mauricio Apaza Huaranca K.ApazaH@gmail.com San Pablo Catholic University
2
ABSTRACT Google is a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext. It is designed to crawl and index the web efficiently. Technology and web proliferation. Anyone can publish anything they want.
4
INTRODUCTION Challenges for information retrieval. Google: a common spelling of googol (10^100). Web Search Engines – Scaling Up: – World Wide Web Worm (WWWW). Google: Scaling with the Web
5
DESIGN GOALS “The best navigation service should make it easy to find almost anything on the web.” “Junk results.” People are still only willing to look at the first few tens of results.
6
Push more development and understanding into the academic realm. Build systems that a reasonable number of people can use. Support novel research activities on large-scale web data.
7
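PAGERANK: Bringing Order to the Web The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1...Tn are the pages that link to A, C(T) is the number of links going out of page T, and d is a damping factor between 0 and 1 (usually set to 0.85). Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.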
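A minimal sketch of how the PageRank formula could be iterated, under some assumptions that are not in the slides: the `links` dictionary, the toy three-page graph, the starting value of 1.0 and the fixed iteration count are all hypothetical, and every page is assumed to have at least one outgoing link. With the formula applied exactly as written, the scores on such a graph add up to the number of pages; dividing each score by that count gives values that sum to one.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical input format).
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T) / C(T) over every page T that links to this page.
            incoming = sum(
                pr[src] / len(links[src])
                for src in pages
                if page in links[src]
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
normalized = {page: score / len(ranks) for page, score in ranks.items()}
```

In this toy graph, C collects links from both other pages and ends up with the highest score, while B, linked only from A, ends up with the lowest.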
8
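ANCHOR TEXT The text of links is treated in a special way in our search engine. – Anchors often describe a page more accurately than the page itself – Anchors may exist for documents which cannot be indexed by a text-based search engine (images, programs, databases).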
9
SYSTEM ANATOMY ARCHITECTURE OVERVIEW
10
MAJOR DATA STRUCTURES A disk seek requires about 10 ms to complete. Google avoids disk seeks whenever possible.
11
BIGFILES Virtual files spanning multiple file systems. Addressable by 64-bit integers. The BigFiles package handles allocation and deallocation of file descriptors.
12
REPOSITORY Contains the full HTML of every web page. Each page is compressed using zlib. We can rebuild all the other data structures from only the repository.
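As an illustration of storing zlib-compressed pages, here is a small sketch using Python's standard `zlib` and `struct` modules. The record layout (docID, URL length, compressed length, then URL and page bytes) and the names `pack_record` and `unpack_record` are assumptions made for this example, not the repository format itself.

```python
import struct
import zlib

# Assumed header layout: docID (4 bytes), URL length (2 bytes), compressed page length (4 bytes).
HEADER = struct.Struct("<IHI")


def pack_record(doc_id, url, html):
    """Build one repository record holding the zlib-compressed page."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(compressed)) + url_bytes + compressed


def unpack_record(record):
    """Recover (docID, URL, HTML) from a record produced by pack_record."""
    doc_id, url_len, page_len = HEADER.unpack_from(record)
    start = HEADER.size
    url = record[start:start + url_len].decode("utf-8")
    body = record[start + url_len:start + url_len + page_len]
    return doc_id, url, zlib.decompress(body).decode("utf-8")


page = "<html><body>hello</body></html>" * 20  # hypothetical, highly repetitive page
record = pack_record(7, "http://example.com/", page)
```

Because the full page text can be recovered from each record, anything derived from the pages can in principle be regenerated by re-reading the records.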
13
DOCUMENT INDEX Includes: – Document status – Pointer into the repository – Document checksum – Various statistics. Also used to convert URLs into docIDs.
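The original paper describes the URL-to-docID conversion as a binary search over a file of URL checksums sorted by checksum. A rough sketch of that idea follows. CRC-32 is used only as a stand-in because the checksum function is not specified here, the function names are hypothetical, and checksum collisions are ignored.

```python
import bisect
import zlib


def url_checksum(url):
    # CRC-32 stands in for the unspecified checksum function (assumption).
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_index(url_to_doc_id):
    """Return a list of (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(url), doc_id) for url, doc_id in url_to_doc_id.items())


def lookup_doc_id(index, url):
    """Binary-search the sorted checksum list; return the docID or None."""
    target = url_checksum(url)
    # (target,) sorts before every (target, docID) pair, so this finds the first match.
    pos = bisect.bisect_left(index, (target,))
    if pos < len(index) and index[pos][0] == target:
        return index[pos][1]
    return None


# Hypothetical URLs and docIDs.
index = build_checksum_index({
    "http://example.com/": 1,
    "http://example.org/about": 2,
    "http://example.net/news": 3,
})
```

Keeping the list sorted means a lookup needs only a logarithmic number of comparisons, and a batch of URLs can be converted in one merge pass instead of one seek per URL.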
14
LEXICON Fits in 256 MB of main memory. Two parts: – List of words – Hash table of pointers
15
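Hit List: – List of occurrences of a word in a document – Uses a hand-optimized compact encoding (two bytes per hit), chosen over Huffman coding. Forward Index: – Stored in 64 barrels – Sorted by docID. Inverted Index: – The same barrels, after processing by the sorter – Re-sorted from docID order into wordID order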
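The original paper packs each plain hit into two bytes: one capitalization bit, three bits of relative font size (the value 7 is reserved to flag a fancy hit) and twelve bits of word position. The sketch below shows one way such a hit could be packed. The bit order, the function names and the clamping of large positions to 4095 are assumptions made for this example.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into a 16-bit integer (assumed bit order: cap | font | position)."""
    if not 0 <= font_size <= 6:
        raise ValueError("plain hits use font sizes 0-6; 7 flags a fancy hit")
    position = min(position, 4095)  # positions that do not fit in 12 bits are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, font_size, position) from a packed 16-bit hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 42)
```

A fixed two-byte layout like this can be decoded with a couple of shifts and masks, which is the kind of low bit-manipulation cost that a variable-length code such as Huffman coding would not offer.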
17
CRAWLING THE WEB Involves interacting with hundreds of thousands of web servers and various name servers. A single URLserver serves lists of URLs to a number of crawlers. Both the URLserver and the crawlers are implemented in Python.
18
INDEXING THE WEB Parsing – Must handle a huge array of possible errors – Instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer. Indexing Documents into Barrels – Each word is converted into a wordID using the in-memory lexicon. Sorting – Generates the inverted index
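The sorting step can be pictured as regrouping a docID-ordered forward barrel by wordID. The sketch below does this in memory with a dictionary. The input format and the name `invert` are assumptions, and a real sorter working on barrels larger than memory would need an external sort instead.

```python
from collections import defaultdict


def invert(forward_barrel):
    """Regroup a forward barrel by wordID.

    forward_barrel: list of (docID, [(wordID, [positions...]), ...]) sorted by docID
    (hypothetical format). Returns {wordID: [(docID, [positions...]), ...]} with
    wordIDs in ascending order and each doclist in docID order.
    """
    inverted = defaultdict(list)
    for doc_id, words in forward_barrel:
        for word_id, hits in words:
            inverted[word_id].append((doc_id, hits))
    return {word_id: inverted[word_id] for word_id in sorted(inverted)}


# Hypothetical barrel: three documents, two distinct wordIDs.
forward = [
    (1, [(10, [0, 4]), (20, [1])]),
    (2, [(20, [3])]),
    (3, [(10, [2])]),
]
inverted = invert(forward)
```

Because the forward barrel is already in docID order, each resulting doclist comes out in docID order without any extra sorting.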
19
SEARCHING 1. Parse the query. 2. Convert words into wordIDs. 3. Seek to the start of the doclist in the short barrel for every word. 4. Scan through the doclists until there is a document that matches all the search terms. 5. Compute the rank of that document for the query. 6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4. 7. If we are not at the end of any doclist go to step 4. 8. Sort the documents that have matched by rank and return the top k.
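Steps 4 to 8 can be sketched as a merge-style scan over docID-sorted doclists followed by a top-k selection. This simplified version assumes a single set of barrels (it skips the short/full barrel switch in step 6), takes the doclists as plain Python lists and takes ranking as a caller-supplied function; the names `search`, `doclists` and `rank` are hypothetical.

```python
import heapq


def search(doclists, rank, k=10):
    """Return the k best docIDs that appear in every doclist.

    doclists: one docID-sorted list per query word (assumed input format).
    rank: function mapping a docID to a score for this query.
    """
    if not doclists:
        return []
    cursors = [0] * len(doclists)
    matches = []
    # Stop as soon as any doclist is exhausted (step 7).
    while all(c < len(dl) for c, dl in zip(cursors, doclists)):
        current = [dl[c] for c, dl in zip(cursors, doclists)]
        highest = max(current)
        if all(doc == highest for doc in current):
            matches.append(highest)  # document matches all terms (step 4)
            cursors = [c + 1 for c in cursors]
        else:
            # Advance every cursor that is behind the largest current docID.
            cursors = [c + 1 if dl[c] < highest else c
                       for c, dl in zip(cursors, doclists)]
    # Rank the matches and keep the top k (steps 5 and 8).
    return heapq.nlargest(k, matches, key=rank)


# Hypothetical doclists for a three-word query and made-up scores.
doclists = [[1, 3, 5, 8, 9], [3, 4, 8, 9], [2, 3, 8, 9, 11]]
scores = {3: 0.2, 8: 0.9, 9: 0.5}
top = search(doclists, rank=lambda doc_id: scores[doc_id], k=2)
```

Here documents 3, 8 and 9 contain all three words, and the two highest-scoring ones are returned in rank order.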
20
The Ranking System Factors: – Position – Font – Capitalization – PageRank – Proximity
21
RESULTS AND PERFORMANCE
23
STORAGE REQUIREMENTS
25
SEARCH TIMES
26
CONCLUSIONS Google is designed to be a scalable search engine. It aims to provide high-quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality. It is a complete architecture for gathering web pages, indexing them, and performing search queries.