1
Web Search – Summer Term 2006
III. Web Search - Introduction (Cont.)
- Jeff Dean, Google's Systems Lab: http://www.researchchannel.org/prog/displayevent.asp?rid=2459
(c) Wolfgang Hürst, Albert-Ludwigs-University
2
IR vs. Web Search
[Diagram labels: INFORMATION NEED, QUERY, DATA / DOCUMENTS, INFORMATION]
Initial problem is similar to traditional IR...
... but basic conditions & characteristics differ significantly:
- The web is huge. Very huge.
- The number of users is huge. Very huge.
- Big variety in data
- Big variety in users
- Doc. authors don't cooperate (spam,...)
- Users don't cooperate (short queries,...)
3
Classic IR vs. Web Search: Documents
Huge amount of data, continuous growth, high rate of change
Huge variability and heterogeneity
- Quality, credibility and reputation of the source
- Static vs. dynamic docs
- Different media types (text, pics, audio, video)
- Different formats (HTML, Flash, PDF,...)
- Miscellaneous topics
- Continuous text vs. note form / keywords
- Different languages, encodings
Spam and advertisements
Web-specific characteristics
- Hypertext, linking
- Broken links
- Unstructured, not always conforming to standards
Redundancy (syntactic and semantic)
Distributed (documents need to be collected automatically)
Different popularity and access frequency
4
Classic IR vs. Web Search: Users
Different needs and aims, e.g. users might want
- to learn something ("informational")
- to go to a particular site ("navigational")
- to do something, e.g. shopping, download,... ("transactional")
- to do other, miscellaneous things, e.g. finding hubs, "exploratory search",...
Different premises, qualifications, languages,...
Different network connections / bandwidths
Imprecise, unspecific queries
- Short, ambiguous, inexact, incorrect, no usage of operators or special syntax

Classic IR vs. Web Search: Bottom line
Different characteristics that cause lots of problems
But there's also good news: We can take advantage of some of these characteristics (e.g. links, statistics,...)
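To illustrate how links can be turned into a usable signal, here is a minimal sketch that counts in-links per page on a toy link graph. It is only an assumption-laden illustration, not the ranking method covered later in the course: the function name `inlink_counts`, the `toy_web` data and the `*.example` host names are all hypothetical.

```python
from collections import Counter

def inlink_counts(link_graph):
    """Count, for each page, how many distinct other pages link to it."""
    # Start every known page at zero so pages without in-links still appear.
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):      # count each source at most once
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts

# Hypothetical toy web: page -> list of pages it links to.
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "c.example"],
}

counts = inlink_counts(toy_web)
# Order pages by descending in-link count, breaking ties by name.
ranked = sorted(counts, key=lambda p: (-counts[p], p))
```

Counting each linking page only once is a deliberate choice here, so that a single page repeating the same link (as `d.example` does) cannot inflate the count.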
5
References
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001. Chapter 1 (Introduction, general architecture)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 1 (Introduction), Chapter 4.1 (Google Architecture Overview)
6
General Web Search Engine Architecture
[Figure, cf. [1], Fig. 1. Labels shown: Client, Queries, Results, Query Engine, Ranking, Usage Feedback, Crawl Control, Crawler(s), WWW, Page Repository, Collection Analysis Module, Indexer Module, Indexes (Structure, Utility, Text)]
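The following minimal sketch shows how a few of these components could fit together: an in-memory stand-in for the page repository, an indexer that builds a text index, and a query engine with a very simple ranking. Everything in it is an assumption made for illustration (the function names, the `repository` contents, AND semantics for queries, and term counts as the ranking score); no crawling is performed, and it does not reproduce the architecture in [1].

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_text_index(repository):
    """Indexer: map each term to {url: number of occurrences}."""
    index = defaultdict(dict)
    for url, text in repository.items():
        for term in tokenize(text):
            index[term][url] = index[term].get(url, 0) + 1
    return index

def search(index, query):
    """Query engine: pages containing ALL query terms, ranked by the
    summed term counts (an assumed, deliberately naive score)."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)

    def score(url):
        return sum(posting[url] for posting in postings)

    return sorted(matches, key=lambda url: (-score(url), url))

# Hypothetical page repository: URL -> page text (stands in for crawled pages).
repository = {
    "http://a.example/": "web search engines crawl the web",
    "http://b.example/": "classic information retrieval",
    "http://c.example/": "search the web",
}

index = build_text_index(repository)
results = search(index, "web search")
```

Here `results` lists `http://a.example/` before `http://c.example/`, because the first page contains "web" twice. A real engine would combine such text scores with structure and utility information, which this sketch leaves out.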
7
Recap: IR System & Tasks Involved
[Diagram labels: Information Need, Documents, User Interface, Query, Query Processing (Parsing & Term Processing), Logical View of the Inform. Need, Select Data for Indexing, Parsing & Term Processing, Index, Searching, Ranking, Results, Docs., Result Representation, Performance Evaluation]
8
The Google Search Engine
Founded 1998 (1996) by two Stanford students
Originally an academic / research project that later became a commercial tool
Distinguishing features (then!?):
- Special (and better) ranking
- Speed
- Size
9
Architecture of the 1st Google Search Engine
[Figure, cf. [2], Fig. 1. Labels shown: URL Server, Crawlers, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Doc Index, Lexicon, Barrels, Sorters, Dump Lexicon, PageRank, Searcher]
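The slide only names the components, so the sketch below is a heavily simplified guess at how three of them relate, based on the description in [2]: a lexicon that maps words to word IDs, a forward "barrel" of hits ordered by document, and a sorter that re-sorts the hits by word ID to obtain an inverted index. All names (`build_forward_barrel`, `sort_barrel`, `docs`) and the tuple layout are assumptions for illustration, not Google's actual data structures.

```python
def build_forward_barrel(docs):
    """Indexer step: assign word IDs via a lexicon and emit hits
    as (doc_id, word_id, position), ordered by document."""
    lexicon = {}   # word -> word ID, assigned in order of first appearance
    forward = []   # list of (doc_id, word_id, position) hits
    for doc_id, text in sorted(docs.items()):
        for position, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            forward.append((doc_id, word_id, position))
    return lexicon, forward

def sort_barrel(forward):
    """Sorter step: re-sort hits by word ID to get an inverted index
    of the form word_id -> [(doc_id, position), ...]."""
    inverted = {}
    for doc_id, word_id, position in sorted(
        forward, key=lambda hit: (hit[1], hit[0], hit[2])
    ):
        inverted.setdefault(word_id, []).append((doc_id, position))
    return inverted

# Hypothetical documents: doc ID -> text.
docs = {1: "web search engine", 2: "search the web"}

lexicon, forward = build_forward_barrel(docs)
inverted = sort_barrel(forward)
web_hits = inverted[lexicon["web"]]
```

Separating the forward pass from the sorting pass mirrors the split between indexer and sorters in the figure; a production system would additionally partition the hits across several barrels.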
10
Schedule
Web Search:
- Introduction
- Crawling
- Page Repository
- Indexing
- Ranking (PageRank, HITS)
- Exercises for web search basics
- Advanced / additional web search topics
In parallel:
- Programming project (Lucene)