Download presentation
Presentation is loading. Please wait.
1
Lecture 5: Search Engines
ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
2
Outline Search engines: key tools for ecommerce How do they work?
Buyers and sellers must find each other How do they work? How much do they index? How are hits ordered? Can the order be changed? ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
3
Search Engines Tools for finding information on the Web Directory
Problem: “hidden” databases, e.g. New York Times Directory A hand-constructed hierarchy of topics (e.g. Yahoo) Search engine A machine-constructed index (usually by keyword) So many search engines, we now need search engines to find them. Searchenginecollosus.com ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
4
Indexing Arrangement of data (data structure) to permit fast searching
Which list is easier to search? sow fox pig eel yak hen ant cat dog hog ant cat dog eel fox hen hog pig sow yak Sorting helps. Why? Permits binary search. About log2n probes into list log2(1 billion) ~ 30 Permits interpolation search. About log2(log2n) probes log2 log2(1 billion) ~ 5 ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
5
Inverted Files A file is a list of words by position
1 10 20 30 36 A file is a list of words by position First entry is the word in position 1 (first word) Entry 4562 is the word in position 4562 (4562nd word) Last entry is the last word An inverted file is a list of positions by word! a (1, 4, 40) entry (11, 20, 31) file (2, 38) list (5, 41) position (9, 16, 26) positions (44) word (14, 19, 24, 29, 35, 45) words (7) 4562 (21, 27) INVERTED FILE ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
6
Inverted Files for Multiple Documents
“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document DOCID OCCUR POS POS . . . LEXICON WORD INDEX ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
7
Search Engine Architecture
Spider Crawls the web to find pages. Follows hyperlinks. Never stops Indexer Produces data structures for fast searching of all words in the pages Retriever Query interface Database lookup to find hits 2 billion documents 4 TB RAM, many terabytes of disk Ranking ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
8
Crawlers (Spiders, Bots)
Retrieve web pages for indexing by search engines Start with an initial page P0. Find URLs on P0 and add them to a queue When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat Can be specialized (e.g. only look for addresses) Issues Which page to look at next? (Special subjects, recency) Avoid overloading a site How deep within a site to go (drill-down)? How frequently to visit pages? ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
9
Query Specification Boolean Question-answering (simulated)
AND , OR, NOT, PHRASE “ ”, NEAR ~ But keyword query is artificial Question-answering (simulated) “Who offers a master’s degree in ecommerce? Date range Relevance specification In Altavista, can specify terms by importance (separate from query specification) Content multimedia, MP3, .PPT files Stemming: eat, eats, eaten, eating, eater, (ate!) ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
10
“Advanced” Query Specification
Multimedia, e.g. Google Date range Relevance specification In Altavista, can specify terms by importance (separate from query specification) Content multimedia, MP3, .PPT files Stemming Language Search depth (from site’s front page) ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
11
Ranking (Scoring) Hits
Hits must be presented in some order What order? Relevance, recency, popularity, reliability? Some ranking methods Presence of keywords in title of document Closeness of keywords to start of document Frequency of keyword in document Link popularity (how many pages point to this one) Can the user control? Can the page owner control? Can you find out what order is used? Spamdexing: influencing retrieval ranking by altering a web page. (Puts “spam” in the index) ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
12
Google’s PageRank Algorithm
Assumption: A link in page A to page B is a recommendation of page B by the author of A (we say B is successor of A) The “quality” of a page is related to the number of links that point to it (its in-degree) Apply recursively: Quality of a page is related to its in-degree, and to the quality of pages linking to it PageRank Algorithm (Brinn & Page, 1998) SOURCE: GOOGLE ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
13
Definition of PageRank
Consider the following infinite random walk (surfing): Initially the surfer is at a random page At each step, the surfer proceeds to a randomly chosen web page with probability d to a randomly chosen successor of the current page with probability 1-d The PageRank of a page p is the fraction of steps the surfer spends at p as the number of steps approaches infinity SOURCE: GOOGLE ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
14
PageRank Formula where n is the total number of nodes in the graph
Google uses d 0.85 PageRank is a probability distribution over web pages The sum of all PageRanks of all Pages is 1 SOURCE: GOOGLE ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
15
(1-d)*[(PageRank of A)/4 + (PageRank of B)/3)] + d/n
PageRank Example B A d d P PageRank of P is (1-d)*[(PageRank of A)/4 + (PageRank of B)/3)] + d/n PAGERANK CALCULATOR SOURCE: GOOGLE ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
16
Link Popularity How many pages link to this page?
on the whole Web in our database? Link popularity is used for ranking Many measures Number of links in Weighted number of links in (by weight of referring page) ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
17
Search Engine Sizes (Sept. 2, 2003)
BILLIONS OF PAGES ATW AllTheWeb AV Altavista GG Google INK Inktomi TMA Teoma SEARCHES/DAY (MILLIONS) 2900 per second! SOURCE: SEARCHENGINEWATCH.COM ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
18
Search Engine Usage SHARE BY SEARCH SITE SHARE BY ENGINE
SOURCE: SEARCHENGINEWATCH.COM ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
19
Search Engines Disjointness
Four searches, 10 engines, total of 141 hits on March 6, 2002 SOURCE: SEARCHENGINESHOWDOWN ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
20
SOURCE: SEARCHENGINEWATCH.COM
Search Engine EKG Shows activity of the Lycos crawler at one sample site, calafia.com, by number of pages visited during each crawl SOURCE: SEARCHENGINEWATCH.COM ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
21
Search Engine EKG Comparison
SOURCE: SEARCHENGINEWATCH.COM ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
22
Search Engine Differences
Coverage (number of documents) Spidering algorithms (visit SpiderCatcher) Frequency, depth of visits Inexing policies Search interfaces Ranking One solution: use a metasearcher (search agent) ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
23
Metasearchers All the engines operate differently. Different
sizes query languages crawling algorithms storage policies (stop words, punctuation, fonts) freshness ranking Submit the same query to many engines and collect the results Metacrawler ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
24
Clustering Viewing large numbers of unstructured hits is not useful
Answer: cluster them Vivisimo Kartoo iBoogie SurfWax ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
25
Search Spying Peeking at queries as they are being submitted AllTheWeb
Metaspy. Spies on Metacrawler AskJeeves Epicurious (recipes) StockCharts.com Yahoo buzz index Kanoodle IQSeek ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
26
Time Spent Per Visitor (minutes) by Search Engine, Jan. 2003
Up 58% in ONE YEAR! AJ Ask Jeeves AOL America Online AV Altavista ELNK EarthLink GG Google ISP InfoSpace LS LookSmart LY Lycos MSN Microsoft NS Netscape OVR OVERTURE YH Yahoo SOURCE: SEARCHENGINEWATCH.COM ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
27
Audience Reach by Search Site, Jan, 2003
AJ Ask Jeeves AOL America Online AV Altavista ELNK EarthLink GG Google ISP InfoSpace LS LookSmart LY Lycos MSN Microsoft NS Netscape OVR OVERTURE YH Yahoo Audience Reach = % of active surfers visiting during month. Totals exceed 100% because of overlap SOURCE: SEARCHENGINEWATCH.COM ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
28
Robot Exclusion You may not want certain pages indexed but still viewable by browsers. Can’t protect directory. Some crawlers conform to the Robot Exclusion Protocol. Compliance is voluntary. One way to enforce: firewall They look for file robots.txt at highest directory level in domain. If domain is robots.txt goes in A specific document can be shielded from a crawler by adding the line: <META NAME="ROBOTS” CONTENT="NOINDEX"> ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
29
Robots Exclusion Protocol
Format of robots.txt Two fields. User-agent to specify a robot Disallow to tell the agent what to ignore To exclude all robots from a server: User-agent: * Disallow: / To exclude one robot from two directories: User-agent: WebCrawler Disallow: /news/ Disallow: /tmp/ View the robots.txt specification. ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
30
Key Takeaways Engines are a critical Web resource
Very sophisticated, high technology They don’t cover the Web completely Spamdexing is a problem New paradigms needed as Web grows What about images, music, video? Google images ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
31
Q A & ECOMMERCE TECHNOLOGY FALL COPYRIGHT © 2003 MICHAEL I. SHAMOS
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.