2
20-751 ECOMMERCE TECHNOLOGY FALL 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS Lecture 5: Search Engines
3
Outline
Search engines: key tools for ecommerce
– Buyers and sellers must find each other
How do they work?
How much do they index?
How are hits ordered?
Can the order be changed?
4
Search Engines
Tools for finding information on the Web
– Problem: “hidden” databases, e.g. New York Times
Directory
– A hand-constructed hierarchy of topics (e.g. Yahoo)
Search engine
– A machine-constructed index (usually by keyword)
So many search engines, we now need search engines to find them: Searchenginecollosus.com
5
Indexing
Arrangement of data (data structure) to permit fast searching
Which list is easier to search?
Unsorted: sow fox pig eel yak hen ant cat dog hog
Sorted: ant cat dog eel fox hen hog pig sow yak
Sorting helps. Why?
– Permits binary search: about log2 n probes into the list; log2(1 billion) ≈ 30
– Permits interpolation search: about log2(log2 n) probes on average when keys are evenly distributed; log2 log2(1 billion) ≈ 5
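The probe counts on this slide can be checked with a short sketch. The function name `binary_search` and the probe counter are illustrative additions, not part of the lecture; the animal list is the sorted list from the slide.

```python
import math

def binary_search(sorted_items, target):
    """Return (index, probes); index is -1 when target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1  # one comparison against the middle element
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1, probes

animals = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]
index, probes = binary_search(animals, "pig")

# Probe estimates for a list of one billion entries.
binary_probes = math.log2(1e9)                    # roughly 30
interpolation_probes = math.log2(math.log2(1e9))  # roughly 5
```

Each probe halves the remaining range, which is where the log2 n bound comes from; the unsorted list offers no such shortcut and needs up to n comparisons.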
6
Inverted Files
A file is a list of words by position
– First entry is the word in position 1 (first word)
– Entry 4562 is the word in position 4562 (4562nd word)
– Last entry is the last word
An inverted file is a list of positions by word!
Inverted file for the text of this slide (word: positions)
a: 1, 4, 40
entry: 11, 20, 31
file: 2, 38
list: 5, 41
position: 9, 16, 26
positions: 43
word: 14, 19, 24, 29, 35, 45
words: 7
4562: 21, 27
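A minimal sketch of building an inverted file from the slide's own text. The tokenizer (lowercase, letters and digits only) and the name `invert` are assumptions made for illustration; real indexers tokenize far more carefully.

```python
import re
from collections import defaultdict

def invert(text):
    """Map each word to the list of 1-based positions where it occurs."""
    positions = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for pos, word in enumerate(words, start=1):
        positions[word].append(pos)
    return dict(positions)

slide_text = (
    "A file is a list of words by position. "
    "First entry is the word in position 1 (first word). "
    "Entry 4562 is the word in position 4562 (4562nd word). "
    "Last entry is the last word. "
    "An inverted file is a list of positions by word!"
)
inverted = invert(slide_text)
```

Looking up a word is now a single dictionary access instead of a scan of the whole file.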
7
Inverted Files for Multiple Documents
Lexicon: each word points to its entry in the word index
Word index: for each word, a list of records (DOCID, OCCUR, POS 1, POS 2, ...)
Example: “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56, ...
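One way to picture the multi-document structure is a dictionary (standing in for the lexicon) whose values hold per-document position lists. The layout, the function names, and the toy documents below are assumptions for illustration, not the format of any particular engine.

```python
import re
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for pos, word in enumerate(words, start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def postings(index, word):
    """Return (DOCID, OCCUR, positions) records for a word, ordered by DOCID."""
    by_doc = index.get(word, {})
    return [(doc_id, len(pos), pos) for doc_id, pos in sorted(by_doc.items())]

# Hypothetical toy documents; the IDs echo the slide's example.
docs = {34: "jezebel met jezebel", 44: "nothing here", 56: "ask jezebel"}
index = build_index(docs)
```

Here `postings(index, "jezebel")` yields one record per document containing the word, with the occurrence count derived from the position list.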
8
Search Engine Architecture
Spider
– Crawls the web to find pages. Follows hyperlinks. Never stops
Indexer
– Produces data structures for fast searching of all words in the pages
Retriever
– Query interface
– Database lookup to find hits (2 billion documents; 4 TB RAM, many terabytes of disk)
– Ranking
9
Crawlers (Spiders, Bots)
Retrieve web pages for indexing by search engines
Start with an initial page P0. Find URLs on P0 and add them to a queue
When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat
Can be specialized (e.g. only look for email addresses)
Issues
– Which page to look at next? (Special subjects, recency)
– Avoid overloading a site
– How deep within a site to go (drill-down)?
– How frequently to visit pages?
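The queue-based loop described on this slide can be sketched without touching the network. Everything here is hypothetical scaffolding: `get_links` stands in for fetching a page and extracting its URLs, `index_page` stands in for the indexing program, and `toy_web` is an invented link graph. The sketch ignores the listed issues (politeness, depth, revisit frequency).

```python
from collections import deque

def crawl(start, get_links, index_page, max_pages=100):
    """Breadth-first crawl: returns pages in the order they were indexed."""
    queue = deque([start])
    seen = {start}          # URLs already queued, to avoid revisiting
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        for url in get_links(page):
            if url not in seen:
                seen.add(url)
                queue.append(url)
        index_page(page)    # hand the finished page to the indexer
        order.append(page)
    return order

# Hypothetical in-memory web: page -> URLs found on that page.
toy_web = {"P0": ["P1", "P2"], "P1": ["P2", "P0"], "P2": ["P3"], "P3": []}
indexed = []
order = crawl("P0", lambda p: toy_web.get(p, []), indexed.append)
```

A first-in, first-out queue gives breadth-first order; swapping in a priority queue is one way to address "which page to look at next".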
10
Query Specification
Boolean
– AND, OR, NOT, PHRASE “ ”, NEAR ~
– But keyword query is artificial
Question-answering (simulated)
– “Who offers a master’s degree in ecommerce?”
Date range
Relevance specification
– In Altavista, can specify terms by importance (separate from query specification)
Content
– multimedia, MP3, .PPT files
Stemming: eat, eats, eaten, eating, eater, (ate!)
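Boolean operators map naturally onto set operations over an inverted index. The sketch below assumes a toy index from word to the set of document IDs containing it; the function names and data are invented for illustration, and PHRASE and NEAR (which need word positions) are left out.

```python
def docs_with(index, term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())

def query_and(index, a, b):
    return docs_with(index, a) & docs_with(index, b)   # both terms

def query_or(index, a, b):
    return docs_with(index, a) | docs_with(index, b)   # either term

def query_not(all_docs, index, term):
    return all_docs - docs_with(index, term)           # term absent

# Hypothetical index: word -> set of document IDs.
index = {"ecommerce": {1, 2, 4}, "degree": {2, 3, 4}, "spam": {4}}
all_docs = {1, 2, 3, 4}

# ecommerce AND degree AND NOT spam
hits = query_and(index, "ecommerce", "degree") & query_not(all_docs, index, "spam")
```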
11
“Advanced” Query Specification
Multimedia, e.g. Google
Date range
Relevance specification
– In Altavista, can specify terms by importance (separate from query specification)
Content
– multimedia, MP3, .PPT files
Stemming
Language
Search depth (from site’s front page)
12
Ranking (Scoring) Hits
Hits must be presented in some order
What order?
– Relevance, recency, popularity, reliability?
Some ranking methods
– Presence of keywords in title of document
– Closeness of keywords to start of document
– Frequency of keyword in document
– Link popularity (how many pages point to this one)
Can the user control? Can the page owner control?
Can you find out what order is used?
Spamdexing: influencing retrieval ranking by altering a web page. (Puts “spam” in the index)
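To make the listed ranking signals concrete, here is a toy scoring function that combines them. The weights (3.0, 2.0, 5.0, 0.1) and the function names are arbitrary assumptions chosen for illustration; no real engine is claimed to use this formula.

```python
import re

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def score(title, body, keyword, inlinks=0):
    """Combine title presence, closeness to start, frequency, and link count."""
    kw = keyword.lower()
    words = tokens(body)
    in_title = 1.0 if kw in tokens(title) else 0.0
    # Closeness: 1.0 if the keyword is the first word, decaying with position.
    closeness = 1.0 / (words.index(kw) + 1) if kw in words else 0.0
    # Frequency: share of body words equal to the keyword.
    frequency = words.count(kw) / len(words) if words else 0.0
    return 3.0 * in_title + 2.0 * closeness + 5.0 * frequency + 0.1 * inlinks

strong = score("Cheap flights", "flights to Rome", "flights")
weak = score("Travel", "we sell flights", "flights")
```

Because every signal here is computed from the page itself (apart from the link count), a page owner can inflate the score by editing the page, which is exactly the opening that spamdexing exploits.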
13
Google’s PageRank Algorithm
Assumption: A link in page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
The “quality” of a page is related to the number of links that point to it (its in-degree)
Apply recursively: Quality of a page is related to
– its in-degree, and to
– the quality of pages linking to it
PageRank Algorithm (Brin & Page, 1998)
SOURCE: GOOGLE
14
Definition of PageRank
Consider the following infinite random walk (surfing):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
  to a randomly chosen web page with probability d
  to a randomly chosen successor of the current page with probability 1-d
The PageRank of a page p is the fraction of steps the surfer spends at p as the number of steps approaches infinity
SOURCE: GOOGLE
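The random-surfer definition can be approximated directly by simulation. This is only a sketch: the three-page graph is invented, d is the random-jump probability as defined on this slide (the default 0.15 is an assumption), and a page with no successors is handled by always jumping, which the slide does not specify.

```python
import random

def simulate_surfer(links, d=0.15, steps=200_000, seed=0):
    """Estimate PageRank as the fraction of steps spent on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)              # start at a random page
    for _ in range(steps):
        visits[current] += 1
        successors = links[current]
        if not successors or rng.random() < d:
            current = rng.choice(pages)      # jump to a random page
        else:
            current = rng.choice(successors) # follow a random link
    return {p: count / steps for p, count in visits.items()}

# Hypothetical graph: A and B link to P, and P links back to A.
toy_links = {"A": ["P"], "B": ["P"], "P": ["A"]}
estimate = simulate_surfer(toy_links)
```

Page B has no incoming links, so the surfer reaches it only by random jumps and its estimate settles near d/n.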
15
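PageRank Formula
PR(p) = \frac{d}{n} + (1 - d) \sum_{(q, p) \in E} \frac{PR(q)}{\mathrm{outdeg}(q)}
where n is the total number of nodes in the graph, E is the set of links, and outdeg(q) is the number of links leaving page q
Google's published damping factor is 0.85. That is the probability of following a link, so in this notation 1 - d = 0.85 and d = 0.15
PageRank is a probability distribution over web pages
The sum of the PageRanks of all pages is 1
SOURCE: GOOGLE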
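A minimal power-iteration sketch of the PageRank recurrence, PR(p) = d/n + (1-d) times the sum of PR(q)/outdeg(q) over pages q linking to p, using the slides' convention that d is the random-jump probability. The graph, the iteration count, and the uniform treatment of pages without outgoing links are assumptions for illustration; the default d = 0.15 corresponds to the commonly cited 0.85 probability of following a link.

```python
def pagerank(links, d=0.15, iterations=100):
    """links: {page: [successor pages]}. Returns {page: PageRank}."""
    pages = sorted(links)
    n = len(pages)
    rank = dict.fromkeys(pages, 1.0 / n)     # start from the uniform distribution
    for _ in range(iterations):
        new_rank = dict.fromkeys(pages, d / n)   # random-jump term
        for q in pages:
            successors = links[q]
            if successors:
                share = (1 - d) * rank[q] / len(successors)
                for p in successors:
                    new_rank[p] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += (1 - d) * rank[q] / n
        rank = new_rank
    return rank

# Hypothetical graph: A and B link to P, and P links back to A.
ranks = pagerank({"A": ["P"], "B": ["P"], "P": ["A"]})
```

For this graph the recurrence can be solved by hand: PR(B) = d/3 = 0.05, PR(P) = 0.135/0.2775, and PR(A) = 0.05 + 0.85 PR(P), which sum to 1.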
16
PageRank Example
[Diagram: pages A and B each link to page P; the divisors below correspond to A having 4 outgoing links and B having 3]
PageRank of P is (1-d) [(PageRank of A)/4 + (PageRank of B)/3] + d/n
PAGERANK CALCULATOR
SOURCE: GOOGLE
17
Link Popularity
How many pages link to this page?
– on the whole Web
– in our database?
www.linkpopularity.com
Link popularity is used for ranking
– Many measures
– Number of links in
– Weighted number of links in (by weight of referring page)
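Both measures on this slide, the plain and the weighted count of incoming links, fit in a few lines. The graph, the weights, and the function name are invented for illustration; how a real engine assigns weights to referring pages is not specified here.

```python
from collections import Counter

def link_popularity(links, weights=None):
    """links: {page: [pages it links to]}.
    Without weights, counts incoming links; with weights, sums the weight
    of each referring page."""
    scores = Counter()
    for source, targets in links.items():
        w = 1.0 if weights is None else weights.get(source, 0.0)
        for target in set(targets):   # count each referring page once
            scores[target] += w
    return dict(scores)

# Hypothetical link graph and referring-page weights.
toy_links = {"A": ["P"], "B": ["P", "A"], "P": []}
plain = link_popularity(toy_links)
weighted = link_popularity(toy_links, weights={"A": 0.5, "B": 2.0})
```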
18
Search Engine Sizes (Sept. 2, 2003)
[Chart: billions of pages indexed, and searches/day (millions), by engine]
Legend: ATW = AllTheWeb, AV = Altavista, GG = Google, INK = Inktomi, TMA = Teoma
Searches/day (millions), as labeled on the chart: 250, 80, 18
250 million searches/day is about 2900 per second!
SOURCE: SEARCHENGINEWATCH.COM
19
Search Engine Usage
[Charts: share by search site; share by engine]
SOURCE: SEARCHENGINEWATCH.COM
20
Search Engine Disjointness
[Chart: overlap of hits across engines]
Four searches, 10 engines, total of 141 hits on March 6, 2002
SOURCE: SEARCHENGINESHOWDOWN
21
Search Engine EKG
Shows activity of the Lycos crawler at one sample site, calafia.com, by number of pages visited during each crawl
SOURCE: SEARCHENGINEWATCH.COM
22
Search Engine EKG Comparison
SOURCE: SEARCHENGINEWATCH.COM
23
Search Engine Differences
Coverage (number of documents)
Spidering algorithms (visit SpiderCatcher)
– Frequency, depth of visits
Indexing policies
Search interfaces
Ranking
One solution: use a metasearcher (search agent)
24
Metasearchers
All the engines operate differently. Different
– sizes
– query languages
– crawling algorithms
– storage policies (stop words, punctuation, fonts)
– freshness
– ranking
Submit the same query to many engines and collect the results
Metacrawler
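A sketch of the "submit to many engines and collect the results" step. The engines here are hypothetical stand-in functions that return ranked URL lists, and the merge rule (sum of reciprocal ranks) is one simple choice picked for illustration; it is not claimed to be what Metacrawler does.

```python
from collections import defaultdict

def metasearch(query, engines):
    """Send the query to every engine and merge the ranked result lists."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank   # higher-ranked hits contribute more
    # Best score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical engines with fixed answers.
def engine_a(query):
    return ["x.com", "y.com"]

def engine_b(query):
    return ["y.com", "z.com"]

merged = metasearch("ecommerce degree", [engine_a, engine_b])
```

A URL returned by several engines accumulates score from each, so agreement between engines pushes it up the merged list.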
25
Clustering
Viewing large numbers of unstructured hits is not useful
Answer: cluster them
– Vivisimo
– Kartoo
– iBoogie
– SurfWax
26
Search Spying
Peeking at queries as they are being submitted
– AllTheWeb
– Metaspy (spies on Metacrawler)
– AskJeeves
– Epicurious (recipes)
– StockCharts.com
– Yahoo buzz index
– Kanoodle
– IQSeek
27
Time Spent Per Visitor (minutes) by Search Engine, Jan. 2003
[Chart: minutes per visitor by engine. Chart annotation: Up 58% in ONE YEAR!]
Legend: AJ = Ask Jeeves, AOL = America Online, AV = Altavista, ELNK = EarthLink, GG = Google, ISP = InfoSpace, LS = LookSmart, LY = Lycos, MSN = Microsoft, NS = Netscape, OVR = Overture, YH = Yahoo
SOURCE: SEARCHENGINEWATCH.COM
28
Audience Reach by Search Site, Jan. 2003
[Chart: audience reach by search site]
Legend: AJ = Ask Jeeves, AOL = America Online, AV = Altavista, ELNK = EarthLink, GG = Google, ISP = InfoSpace, LS = LookSmart, LY = Lycos, MSN = Microsoft, NS = Netscape, OVR = Overture, YH = Yahoo
Audience Reach = % of active surfers visiting during month. Totals exceed 100% because of overlap
SOURCE: SEARCHENGINEWATCH.COM
29
Robot Exclusion
You may not want certain pages indexed but still viewable by browsers. Can’t protect directory.
Some crawlers conform to the Robot Exclusion Protocol. Compliance is voluntary. One way to enforce: firewall
They look for file robots.txt at highest directory level in domain. If domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
A specific document can be shielded from a crawler by adding the line:
30
Robots Exclusion Protocol
Format of robots.txt
– Two fields. User-agent to specify a robot
– Disallow to tell the agent what to ignore
To exclude all robots from a server:
    User-agent: *
    Disallow: /
To exclude one robot from two directories:
    User-agent: WebCrawler
    Disallow: /news/
    Disallow: /tmp/
View the robots.txt specification.
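A compliant crawler has to evaluate these rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser` on the two rule sets from the slide, parsed from strings rather than fetched; the page URLs under www.ecom.cmu.edu and the agent name "OtherBot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Build a parser from robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# Exclude all robots from the whole server.
exclude_all = parse_rules("User-agent: *\nDisallow: /\n")

# Exclude one robot from two directories.
exclude_one = parse_rules(
    "User-agent: WebCrawler\n"
    "Disallow: /news/\n"
    "Disallow: /tmp/\n"
)

site = "http://www.ecom.cmu.edu"
blocked = exclude_one.can_fetch("WebCrawler", site + "/news/today.html")
allowed = exclude_one.can_fetch("WebCrawler", site + "/index.html")
other = exclude_one.can_fetch("OtherBot", site + "/news/today.html")
```

Nothing in the protocol stops a crawler from skipping this check, which is why the slides describe compliance as voluntary.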
31
Key Takeaways
Engines are a critical Web resource
Very sophisticated, high technology
They don’t cover the Web completely
Spamdexing is a problem
New paradigms needed as Web grows
What about images, music, video?
– www.corbis.com, Google images
32
Q & A