Lecture 5: Search Engines


Lecture 5: Search Engines 20-751 ECOMMERCE TECHNOLOGY FALL 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS

Outline
- Search engines: key tools for ecommerce
  - Buyers and sellers must find each other
- How do they work?
- How much do they index?
- How are hits ordered?
- Can the order be changed?

Search Engines
- Tools for finding information on the Web
  - Problem: “hidden” databases, e.g. New York Times
- Directory: a hand-constructed hierarchy of topics (e.g. Yahoo)
- Search engine: a machine-constructed index (usually by keyword)
- There are so many search engines that we now need search engines to find them: Searchenginecollosus.com

Indexing
- An index is an arrangement of data (a data structure) that permits fast searching
- Which list is easier to search?
  - Unsorted: sow fox pig eel yak hen ant cat dog hog
  - Sorted: ant cat dog eel fox hen hog pig sow yak
- Sorting helps. Why?
  - It permits binary search: about log2(n) probes into the list; log2(1 billion) ~ 30
  - It permits interpolation search: about log2(log2(n)) probes on average when the keys are evenly distributed; log2(log2(1 billion)) ~ 5
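
The probe count for binary search can be checked with a minimal sketch on the sorted animal list from the slide. The function name `binary_search` and the probe counter are illustrative assumptions, not part of the lecture.

```python
import math

def binary_search(sorted_items, target):
    """Return (index, probes); index is -1 if target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                      # one look into the list
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1                 # target can only be in the upper half
        else:
            hi = mid - 1                 # target can only be in the lower half
    return -1, probes

animals = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]
index, probes = binary_search(animals, "pig")
missing_index, missing_probes = binary_search(animals, "zebra")

# Worst-case probes for a billion sorted entries, matching the slide's ~30.
probes_for_billion = math.ceil(math.log2(10**9))
```

Each probe halves the remaining range, which is why the count grows with log2(n) and not with n.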

Inverted Files
- A file is a list of words by position
  - First entry is the word in position 1 (first word)
  - Entry 4562 is the word in position 4562 (4562nd word)
  - Last entry is the last word
- An inverted file is a list of positions by word!
- INVERTED FILE for the 45 words of the bullets above (selected words, each with the positions where it occurs):
    a          (1, 4, 40)
    entry      (11, 20, 31)
    file       (2, 38)
    list       (5, 41)
    position   (9, 16, 26)
    positions  (43)
    word       (14, 19, 24, 29, 35, 45)
    words      (7)
    4562       (21, 27)
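
A minimal sketch of the inversion step, assuming words are split on non-word characters and lowercased (the lecture does not specify a tokenizer). The function name `invert` is hypothetical.

```python
import re
from collections import defaultdict

def invert(text):
    """Map each lowercased word to the 1-based positions where it occurs."""
    positions = defaultdict(list)
    for pos, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
        positions[word].append(pos)
    return dict(positions)

inverted = invert("A file is a list of words by position")
```

Looking up a word is now a single dictionary access instead of a scan of the whole file.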

Inverted Files for Multiple Documents
- [Figure: a LEXICON of WORD entries alongside an INDEX whose records have the fields DOCID, OCCUR, POS 1, POS 2, . . .]
- Example: “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
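
The record layout named on the slide (DOCID, OCCUR, POS 1, POS 2, . . .) might be modelled as below. This is a sketch under assumptions: the dictionary stands in for the lexicon, each posting is a `(doc_id, occurrences, positions)` tuple, and the sample documents are invented for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {word: [(doc_id, count, [positions]), ...]}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        per_word = defaultdict(list)
        for pos, word in enumerate(docs[doc_id].lower().split(), start=1):
            per_word[word].append(pos)
        for word, positions in per_word.items():
            # One posting per (word, document): DOCID, OCCUR, POS 1, POS 2, ...
            index[word].append((doc_id, len(positions), positions))
    return dict(index)

# Hypothetical documents, numbered like the slide's example.
docs = {34: "jezebel met jezebel", 44: "no match here", 56: "ask jezebel"}
index = build_index(docs)
```

Here `index["jezebel"]` lists only the documents that contain the word, together with how often and where.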

Search Engine Architecture
- Spider
  - Crawls the web to find pages. Follows hyperlinks. Never stops
- Indexer
  - Produces data structures for fast searching of all words in the pages
- Retriever
  - Query interface
  - Database lookup to find hits
    - 2 billion documents
    - 4 TB RAM, many terabytes of disk
  - Ranking

Crawlers (Spiders, Bots)
- Retrieve web pages for indexing by search engines
- Start with an initial page P0. Find URLs on P0 and add them to a queue
- When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat
- Can be specialized (e.g. only look for email addresses)
- Issues
  - Which page to look at next? (Special subjects, recency)
  - Avoid overloading a site
  - How deep within a site to go (drill-down)?
  - How frequently to visit pages?
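
The queue-driven loop described above can be sketched as follows. To stay self-contained it crawls an in-memory toy web instead of the network; `crawl`, `fetch_links`, `index_page`, `max_pages` and `TOY_WEB` are all hypothetical names, and the `seen` set is an added assumption so that no page is queued twice.

```python
from collections import deque

def crawl(start_url, fetch_links, index_page, max_pages=100):
    """Breadth-first crawl: index each page, then queue its unseen links."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()          # get the next page from the queue
        links = fetch_links(url)       # find URLs on the page
        index_page(url)                # pass the page to the indexing program
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for the Web.
TOY_WEB = {"P0": ["P1", "P2"], "P1": ["P2", "P3"], "P2": ["P0"], "P3": []}
indexed = []
order = crawl("P0", lambda url: TOY_WEB.get(url, []), indexed.append)
```

A real spider would also need the politeness and revisit policies listed under Issues; this sketch only caps the number of pages.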

Query Specification
- Boolean: AND, OR, NOT, PHRASE “ ”, NEAR ~
  - But keyword query is artificial
- Question-answering (simulated)
  - “Who offers a master’s degree in ecommerce?”
- Date range
- Relevance specification
  - In Altavista, can specify terms by importance (separate from query specification)
- Content: multimedia, MP3, .PPT files
- Stemming: eat, eats, eaten, eating, eater, (ate!)
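
Boolean operators map naturally onto set operations over an inverted index. The sketch below assumes a tiny index from words to sets of document IDs; the data and the helper `docs_with` are invented for illustration, and PHRASE and NEAR are omitted because they would need word positions.

```python
def docs_with(index, word):
    """Return the set of document IDs containing the word (empty if unknown)."""
    return set(index.get(word, ()))

# Hypothetical inverted index: word -> IDs of documents containing it.
index = {
    "master": {1, 2},
    "degree": {1, 2, 3},
    "ecommerce": {2, 3},
    "law": {3},
}

and_hits = docs_with(index, "degree") & docs_with(index, "ecommerce")  # degree AND ecommerce
or_hits = docs_with(index, "master") | docs_with(index, "law")         # master OR law
not_hits = docs_with(index, "degree") - docs_with(index, "law")        # degree AND NOT law
```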

“Advanced” Query Specification
- Multimedia, e.g. Google
- Date range
- Relevance specification
  - In Altavista, can specify terms by importance (separate from query specification)
- Content: multimedia, MP3, .PPT files
- Stemming
- Language
- Search depth (from site’s front page)

Ranking (Scoring) Hits
- Hits must be presented in some order
- What order?
  - Relevance, recency, popularity, reliability?
- Some ranking methods
  - Presence of keywords in title of document
  - Closeness of keywords to start of document
  - Frequency of keyword in document
  - Link popularity (how many pages point to this one)
- Can the user control? Can the page owner control?
- Can you find out what order is used?
- Spamdexing: influencing retrieval ranking by altering a web page. (Puts “spam” in the index)
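
One way the four ranking signals listed above might be combined is a weighted sum. This is only a sketch: the weights, the document fields (`title`, `body`, `inlinks`) and the function name `score` are assumptions, and no real engine is claimed to use this formula.

```python
def score(doc, keyword, weights=(3.0, 2.0, 1.0, 0.5)):
    """Toy relevance score from title presence, earliness, frequency and in-links."""
    w_title, w_early, w_freq, w_links = weights
    kw = keyword.lower()
    title_words = doc["title"].lower().split()
    body_words = doc["body"].lower().split()

    in_title = 1.0 if kw in title_words else 0.0
    if kw in body_words:
        # 1.0 when the keyword is the first word, falling toward 0 near the end.
        early = 1.0 - body_words.index(kw) / len(body_words)
        freq = body_words.count(kw) / len(body_words)
    else:
        early = 0.0
        freq = 0.0
    return w_title * in_title + w_early * early + w_freq * freq + w_links * doc["inlinks"]

# Hypothetical documents.
doc_a = {"title": "Search engines", "body": "search engines index the web", "inlinks": 2}
doc_b = {"title": "Cooking", "body": "recipes rarely mention search", "inlinks": 2}
ranked = sorted([doc_b, doc_a], key=lambda d: score(d, "search"), reverse=True)
```

Because every signal here is computed from the page itself or from links to it, a page owner can inflate the score, which is exactly the opening that spamdexing exploits.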

Google’s PageRank Algorithm
- Assumption: A link in page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
  - The “quality” of a page is related to the number of links that point to it (its in-degree)
- Apply recursively: Quality of a page is related to
  - its in-degree, and to
  - the quality of pages linking to it
- PageRank Algorithm (Brin & Page, 1998)
SOURCE: GOOGLE

Definition of PageRank
- Consider the following infinite random walk (surfing):
  - Initially the surfer is at a random page
  - At each step, the surfer proceeds
    - to a randomly chosen web page with probability d
    - to a randomly chosen successor of the current page with probability 1-d
- The PageRank of a page p is the fraction of steps the surfer spends at p as the number of steps approaches infinity
SOURCE: GOOGLE

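PageRank Formula
$$\mathrm{PR}(p) = \frac{d}{n} + (1-d) \sum_{q \to p} \frac{\mathrm{PR}(q)}{\mathrm{outdegree}(q)}$$
- The sum runs over every page q that links to p; n is the total number of nodes in the graph
- In this notation d is the probability of a random jump. Google's published damping factor of 0.85 is the link-following probability, so it corresponds to 1 - d ≈ 0.85 (d ≈ 0.15)
- PageRank is a probability distribution over web pages
- The sum of all PageRanks of all pages is 1
SOURCE: GOOGLE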

PageRank Example
- [Figure: pages A and B both link to page P; A has 4 outgoing links and B has 3]
- PageRank of P is (1-d)*[(PageRank of A)/4 + (PageRank of B)/3] + d/n
PAGERANK CALCULATOR
SOURCE: GOOGLE
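
The example can be reproduced numerically by repeated substitution (power iteration). This is a sketch, not Google's implementation: `pagerank` and `TOY_LINKS` are hypothetical, d follows the slides' convention (probability of a random jump), the toy graph is invented so that A has 4 successors and B has 3, and pages without out-links are assumed to spread their rank evenly.

```python
def pagerank(links, d=0.15, iterations=200):
    """links: {page: [successors]}. d is the random-jump probability."""
    pages = sorted(set(links) | {s for succ in links.values() for s in succ})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}    # share received from random jumps
        for page in pages:
            successors = links.get(page, [])
            if successors:
                share = (1 - d) * rank[page] / len(successors)
                for s in successors:
                    new_rank[s] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += (1 - d) * rank[page] / n
        rank = new_rank
    return rank

# Toy graph: A has 4 successors, B has 3, and only A and B link to P.
TOY_LINKS = {
    "A": ["P", "X", "Y", "Z"],
    "B": ["P", "X", "Y"],
    "P": ["A", "B"],
    "X": ["A"],
    "Y": ["B"],
    "Z": ["A"],
}
ranks = pagerank(TOY_LINKS)
```

After enough iterations the values stop changing, and the rank of P satisfies the formula on the slide with n = 6.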

Link Popularity
- How many pages link to this page?
  - on the whole Web
  - in our database?
- www.linkpopularity.com
- Link popularity is used for ranking
- Many measures
  - Number of links in
  - Weighted number of links in (by weight of referring page)

Search Engine Sizes (Sept. 2, 2003)
- [Figure: bar chart of index size in BILLIONS OF PAGES for ATW (AllTheWeb), AV (Altavista), GG (Google), INK (Inktomi), TMA (Teoma); bar values not recoverable]
- SEARCHES/DAY (MILLIONS): the values 250, 80 and 18 survive without their engine labels; 250 million per day is about 2900 per second!
SOURCE: SEARCHENGINEWATCH.COM

Search Engine Usage
- [Figure: two charts, SHARE BY SEARCH SITE and SHARE BY ENGINE]
SOURCE: SEARCHENGINEWATCH.COM

Search Engine Disjointness
- [Figure: overlap of results across engines]
- Four searches, 10 engines, total of 141 hits on March 6, 2002
SOURCE: SEARCHENGINESHOWDOWN

Search Engine EKG
- [Figure] Shows activity of the Lycos crawler at one sample site, calafia.com, by number of pages visited during each crawl
SOURCE: SEARCHENGINEWATCH.COM

Search Engine EKG Comparison
- [Figure: crawler activity charts compared across search engines]
SOURCE: SEARCHENGINEWATCH.COM

Search Engine Differences
- Coverage (number of documents)
- Spidering algorithms (visit SpiderCatcher)
- Frequency, depth of visits
- Indexing policies
- Search interfaces
- Ranking
- One solution: use a metasearcher (search agent)

Metasearchers
- All the engines operate differently. Different
  - sizes
  - query languages
  - crawling algorithms
  - storage policies (stop words, punctuation, fonts)
  - freshness
  - ranking
- Submit the same query to many engines and collect the results
- Metacrawler
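
The "submit to many engines and collect the results" step could look like the sketch below. The lecture does not say how results are combined, so the reciprocal-rank merge is an assumption chosen for simplicity; `metasearch`, `engine_a`, `engine_b` and the URLs are all hypothetical stand-ins for real engines.

```python
from collections import defaultdict

def metasearch(query, engines):
    """Send one query to every engine and merge the ranked URL lists."""
    scores = defaultdict(float)
    for search in engines:
        for rank, url in enumerate(search(query), start=1):
            scores[url] += 1.0 / rank      # higher placement earns a larger share
    # Best combined score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Stand-in engines returning fixed rankings.
def engine_a(query):
    return ["a.example/1", "b.example/2", "c.example/3"]

def engine_b(query):
    return ["b.example/2", "d.example/4"]

merged = metasearch("ecommerce degree", [engine_a, engine_b])
```

A URL returned by several engines accumulates score from each, so agreement between engines pushes a hit upward.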

Clustering
- Viewing large numbers of unstructured hits is not useful
- Answer: cluster them
  - Vivisimo
  - Kartoo
  - iBoogie
  - SurfWax

Search Spying
- Peeking at queries as they are being submitted
  - AllTheWeb
  - Metaspy. Spies on Metacrawler
  - AskJeeves
  - Epicurious (recipes)
  - StockCharts.com
  - Yahoo buzz index
  - Kanoodle
  - IQSeek

Time Spent Per Visitor (minutes) by Search Engine, Jan. 2003
- [Figure: bar chart, with the annotation “Up 58% in ONE YEAR!”]
- Legend: AJ Ask Jeeves, AOL America Online, AV Altavista, ELNK EarthLink, GG Google, ISP InfoSpace, LS LookSmart, LY Lycos, MSN Microsoft, NS Netscape, OVR Overture, YH Yahoo
SOURCE: SEARCHENGINEWATCH.COM

Audience Reach by Search Site, Jan. 2003
- [Figure: bar chart]
- Legend: AJ Ask Jeeves, AOL America Online, AV Altavista, ELNK EarthLink, GG Google, ISP InfoSpace, LS LookSmart, LY Lycos, MSN Microsoft, NS Netscape, OVR Overture, YH Yahoo
- Audience Reach = % of active surfers visiting during month. Totals exceed 100% because of overlap
SOURCE: SEARCHENGINEWATCH.COM

Robot Exclusion
- You may want certain pages kept out of search indexes but still viewable by browsers, so you can’t simply protect the directory.
- Some crawlers conform to the Robot Exclusion Protocol. Compliance is voluntary. One way to enforce: firewall
- They look for the file robots.txt at the highest directory level in the domain. If the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
- A specific document can be shielded from a crawler by adding the line:
    <META NAME="ROBOTS" CONTENT="NOINDEX">

Robots Exclusion Protocol
- Format of robots.txt: two fields
  - User-agent to specify a robot
  - Disallow to tell the agent what to ignore
- To exclude all robots from a server:
    User-agent: *
    Disallow: /
- To exclude one robot from two directories:
    User-agent: WebCrawler
    Disallow: /news/
    Disallow: /tmp/
- View the robots.txt specification.
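
A well-behaved crawler can test a URL against these rules before fetching it. The sketch below uses `urllib.robotparser` from the Python standard library and parses the second example from the slide directly from a string, so no network access is needed. The page URLs and the agent name `OtherBot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The "exclude one robot from two directories" example from the slide.
ROBOTS_TXT = """\
User-agent: WebCrawler
Disallow: /news/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# WebCrawler is shut out of /news/ but may fetch other paths.
blocked = parser.can_fetch("WebCrawler", "http://www.ecom.cmu.edu/news/today.html")
allowed = parser.can_fetch("WebCrawler", "http://www.ecom.cmu.edu/index.html")
# No rule names OtherBot, so nothing is disallowed for it.
other = parser.can_fetch("OtherBot", "http://www.ecom.cmu.edu/news/today.html")
```

The check happens entirely on the crawler's side, which is why compliance is voluntary.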

Key Takeaways
- Engines are a critical Web resource
- Very sophisticated, high technology
- They don’t cover the Web completely
- Spamdexing is a problem
- New paradigms needed as Web grows
- What about images, music, video?
  - www.corbis.com, Google images

Q & A