Published by Sheena Robinson. Modified over 9 years ago.
1
Rensselaer Polytechnic Institute
CSCI-4220 – Network Programming
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
2
What is search? What are we searching for? How many searches are processed per day? What is the average number of words in text-based searches?
3
Applications and varieties of search:
  Web search
  Site search
  Vertical search
  Enterprise search
  Desktop search
  As-you-type search
  Proximity search
6
Relevance
  Search results contain information the searcher was looking for
  Problems with vocabulary mismatch
    ▪ Homonyms (e.g. “Jersey shore”)
User relevance
  Search results relevant to one user may be completely irrelevant to another user
7
Precision
  Proportion of retrieved documents that are relevant
  How precise were the results?
Recall (and coverage)
  Proportion of relevant documents that were actually retrieved
  Did we retrieve all of the relevant documents?
http://trec.nist.gov
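The two definitions above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `precision_recall` and the document IDs are made up for the example, and documents are assumed to be identified by unique strings.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the search engine returned
    relevant:  IDs of the documents judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved

    # Precision: proportion of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: proportion of relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 documents judged relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5", "d6", "d7"]

p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were retrieved (recall 0.4).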
8
Timeliness and freshness
  Search results contain information that is current and up-to-date
Performance
  Users expect subsecond response times
Media
  User devices are constantly changing (cellphones, mobile devices, tablets, etc.)
9
Scalability
  Designs that perform equally well as the system grows and expands
    ▪ Increased number of documents, number of users, etc.
Flexibility (or adaptability)
  Tune search engine components to keep up with the changing landscape
Spam-resistance
10
Gerard Salton (1927-1995)
  Pioneer in information retrieval
  Defined information retrieval as: “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”
  This was 1968 (before the Internet and Web!)
11
Structured information
  Often stored in a database
  Organized via predefined tables, columns, etc.
  Example query: select all accounts with balances less than $200

  account number   balance
  7004533711       $498.19
  7004533712       $781.05
  7004533713       $147.15
  7004533714       $195.75

Unstructured information
  Document text (headings, words, phrases)
  Images, audio, video (often relies on textual tags)
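To make the structured-information query concrete, here is a small sketch using Python's built-in `sqlite3` module with the four accounts from the slide. The table name `accounts` and the column names are assumptions chosen for the example; the slide does not name them.

```python
import sqlite3

# In-memory database holding the example accounts from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (account_number TEXT PRIMARY KEY, balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [
        ("7004533711", 498.19),
        ("7004533712", 781.05),
        ("7004533713", 147.15),
        ("7004533714", 195.75),
    ],
)

# "Select all accounts with balances less than $200"
low_balance = conn.execute(
    "SELECT account_number, balance FROM accounts "
    "WHERE balance < ? ORDER BY account_number",
    (200,),
).fetchall()
conn.close()

print(low_balance)  # [('7004533713', 147.15), ('7004533714', 195.75)]
```

The predefined columns are what make this query exact. No comparable one-line query exists for unstructured text, which is why search relies on the techniques covered in the following slides.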
12
Search and IR have largely focused on text processing and documents
Search typically uses the statistical properties of text
  Word counts
  Word frequencies
But it typically ignores linguistic features (noun, verb, etc.)
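As a rough illustration of the statistical properties mentioned above, the sketch below counts words and computes relative frequencies. The tokenizer (lowercase, then split on anything that is not a letter, digit, or apostrophe) and the sample sentence are simplifying assumptions for the example, not a description of how any particular search engine tokenizes text.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (counts, freqs): raw word counts and relative frequencies."""
    # Naive tokenizer: lowercase, keep runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    freqs = {word: count / total for word, count in counts.items()}
    return counts, freqs


text = "Search engines index documents. Users search documents with short queries."
counts, freqs = word_frequencies(text)

print(counts["search"])   # 2
print(freqs["search"])    # 0.2  (2 of 10 words)
```

Note that nothing here knows whether "search" is used as a noun or a verb. Only the counts matter.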
13
Web crawlers adhere to a politeness policy:
  GET requests to the same server are sent only every few seconds or minutes
A robots.txt file specifies what crawlers are allowed to crawl
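A polite crawler might check robots.txt before fetching each URL. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt contents, the user-agent name `ExampleBot`, and the example.com URLs are all hypothetical, and the file is parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "ExampleBot"  # assumed crawler name

# Which URLs may this crawler fetch?
can_fetch_index = parser.can_fetch(AGENT, "http://example.com/index.html")
can_fetch_private = parser.can_fetch(AGENT, "http://example.com/private/data.html")

# Requested delay between requests, in seconds (None if not specified).
delay = parser.crawl_delay(AGENT)

print(can_fetch_index, can_fetch_private, delay)  # True False 5
```

In a real crawler, `set_url()` and `read()` would fetch the live file instead, and the crawler would wait at least `delay` seconds (or its own default) between requests to the same server.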
14
Sitemaps
  Default priority is 0.5
  Some URLs might not be discovered by the crawler
15
What about checking for updated pages?
16
Freshness is essentially a Boolean value (a crawled copy is either up to date or it is not)
Age measures the degree to which a crawled page is out of date
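The difference between the two measures can be sketched as follows. This uses one common formulation, assumed here for illustration: a copy is fresh if the live page has not changed since it was crawled, and the age of a stale copy is the time elapsed since the live page first changed. The function names and the use of plain numeric timestamps are choices made for the example.

```python
def is_fresh(crawl_time, change_time):
    """Freshness is Boolean: True if the page has not changed since the crawl.

    crawl_time:  when our copy was fetched
    change_time: when the live page last changed before the crawl, or first
                 changed after it (assumed known for this sketch)
    """
    return change_time <= crawl_time


def age(crawl_time, change_time, now):
    """Age is 0 for a fresh copy; otherwise the time the copy has been stale."""
    if is_fresh(crawl_time, change_time):
        return 0
    return now - change_time


# Timestamps in arbitrary units (e.g. hours since some reference point).
print(is_fresh(100, 90), age(100, 90, now=200))    # True 0
print(is_fresh(100, 150), age(100, 150, now=200))  # False 50
print(is_fresh(100, 150), age(100, 150, now=300))  # False 150
```

The last two lines show why age carries more information: freshness is False in both cases, but age keeps growing the longer the stale copy goes uncrawled.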