Search Engines CS 186 Guest Lecture Prof. Marti Hearst SIMS

Web Search Questions
How do search engines differ from DBMSs?
What do people search for?
How do search engines work? Interfaces, ranking, architecture

Web Search vs DBMS?

A Comparison
Web Search: imprecise; ranked results; "satisficing" results; unedited content; keyword queries; mainly read-only; inverted index.
DBMS: precise; usually unordered; complete results; controlled content; SQL; reads and writes; B-trees.

What Do People Search for on the Web?

Genealogy/Public Figure: 12%
Computer related: 12%
Business: 12%
Entertainment: 8%
Medical: 8%
Politics & Government: 7%
News: 7%
Hobbies: 6%
General info/surfing: 6%
Science: 6%
Travel: 5%
Arts/education/shopping/images: 14%
Something is missing…
Study by Spink et al., Oct 98. Survey on Excite, 13 questions; data for 316 surveys. www.shef.ac.uk/~is/publications/infres/paper53.html

What Do People Search for on the Web?
50,000 queries from Excite, 1997. Most frequent terms:
4660 sex
3129 yahoo
2191 internal site admin check from kho
1520 chat
1498 porn
1315 horoscopes
1284 pokemon
1283 SiteScope test
1223 hotmail
1163 games
1151 mp
weather maps
1036 yahoo.com
983 ebay
980 recipes

Why do these differ?
Self-reporting survey.
The nature of language: only a few ways to say certain things, but many different ways to express most concepts (UFO, Flying Saucer, Space Ship, Satellite). How many ways are there to talk about history?

Intranet Queries (Aug 2000)
3351 bearfacts
3349 telebears
1909 extension
1874 schedule+of+classes
1780 bearlink
1737 bear+facts
1468 decal
1443 infobears
1227 calendar
989 career+center
974 campus+map
920 academic+calendar
840 map
773 bookstore
741 class+pass
738 housing
721 tele-bears
716 directory
667 schedule
627 recipes
602 transcripts
582 tuition
577 seti
563 registrar
550 info+bears
543 class+schedule
470 financial+aid

Intranet Queries
Summary of sample data from 3 weeks of UCB queries:
13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
6.7% Schedule of classes or final exams (6222)
5.4% Summer Session (5041)
3.2% Extension (2932)
3.1% Academic Calendar (2846)
2.4% Directories (2202)
1.7% Career Center (1588)
1.7% Housing (1583)
1.5% Map (1393)
Average query length over last 4 months: 1.8 words.
This suggests what is difficult to find from the home page.

Different kinds of users; different kinds of data
Legal and news collections: professional searchers paying (by the query or by the minute).
Online bibliographic catalogs (Melvyl): scholars searching scholarly literature.
Web: every type of person with every type of goal, and no "driving school" for searching.

Different kinds of information needs; different kinds of queries
Example: search on "Mazda". What does this mean on the web? What does this mean on a news collection?
Example: "Mazda transmissions"
Example: "Manufacture of Mazda transmissions in the post-cold war world"

Web Queries
Web queries are SHORT: ~2.4 words on average (Aug 2000). This has increased; it was 1.7 (~1997).
User expectations: many say "the first item shown should be what I want to see"! This works if the user has the most popular/common notion in mind.

Recent statistics from Inktomi, August 2000, for one client, one week
Total # queries:
Number of repeated queries:
Number of queries with repeated words:
Average words/query: 2.39
Query type: all words; any words; some words; Boolean (AND / OR / NOT); phrase searches; URL searches; URL searches w/ http; searches; wildcards ('?'s)
Fraction '?' at end of query; interrogatives when '?' at end
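Statistics like these are straightforward to compute from a raw query log. A minimal Python sketch over a made-up toy log (the actual Inktomi data is not available here); "repeated" counts every occurrence of a query string that appears more than once:

```python
from collections import Counter

def query_log_stats(queries):
    """Compute simple statistics over a list of raw query strings."""
    counts = Counter(q.strip().lower() for q in queries)
    total = len(queries)
    # Occurrences of any query string seen more than once in the log.
    repeated = sum(c for c in counts.values() if c > 1)
    avg_words = sum(len(q.split()) for q in queries) / total
    return {"total": total, "repeated": repeated, "avg_words": avg_words}

# Hypothetical toy log standing in for a week of real traffic.
log = ["web search", "mazda", "web search", "schedule of classes", "mazda"]
stats = query_log_stats(log)
```

On this toy log the average query length works out to 1.8 words, close to the intranet figure quoted above.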

How to Optimize for Short Queries?
Find good starting places: the user still has to search at the site itself.
Dialogues: build upon a series of short queries; not well understood how to do this for the general case.
Question answering: AskJeeves (hand edited); automated approaches are under development, but are very simple or domain-specific.

How to Find Good Starting Points?
Manually compiled lists: directories, e.g., Yahoo, Looksmart, Open Directory.
Page "popularity": frequently visited pages (in general); frequently visited pages as a result of a query.
Link "co-citation": which sites are linked to by other sites?
Number of pages in the site: not currently used (as far as I know).

Directories vs. Search Engines: an IMPORTANT distinction
Directories: hand-selected sites; search over the contents of the descriptions of the pages; organized in advance into categories.
Search engines: all pages in all sites; search over the contents of the pages themselves; organized after the query by relevance rankings or other scores.

Link Analysis for Starting Points Assumptions: If the pages pointing to this page are good, then this is also a good page. The words on the links pointing to this page are useful indicators of what this page is about. References: Page et al. 98, Kleinberg 98

Co-Citation Analysis
Has been around since the 50s (Small, Garfield, White & McCain). Used to identify core sets of authors, journals, and articles for particular fields; not for general search.
Main idea: find pairs of papers that cite third papers, and look for commonalities.
A nice demonstration by Eugene Garfield at: –

Link Analysis for Starting Points Why does this work? The official Toyota site will be linked to by lots of other official (or high-quality) sites The best Toyota fan-club site probably also has many links pointing to it Less high-quality sites do not have as many high- quality sites linking to them
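This intuition is what PageRank-style link analysis (Page et al. 98) formalizes: a page's score is the sum of shares of the scores of the pages linking to it. A minimal power-iteration sketch over a hypothetical toy link graph (every page must appear as a key):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline score...
        new = {p: (1 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                # ...plus an equal share of the rank of each page linking to it.
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical toy web: the "official" site is linked to by everyone.
web = {"official": ["fanclub"], "fanclub": ["official"],
       "blog1": ["official"], "blog2": ["official", "fanclub"]}
ranks = pagerank(web)
best = max(ranks, key=ranks.get)
```

The heavily linked-to "official" page ends up with the highest score, matching the Toyota example above.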

Co-citation analysis (From Garfield 98)

Link Analysis for Starting Points Does this really work? Actually, there have been no rigorous evaluations Seems to work for the primary sites; not clear if it works for the relevant secondary sites One (small) study suggests that sites with many pages are often the same as those with good link co-citation scores. (Terveen & Hill, SIGIR 2000)

What is Really Being Used?
Today's search engines combine these methods in various ways.
Integration of directories: today most web search engines integrate categories into the results listings (Lycos, MSN, Google).
Link analysis: Google uses it; others are using it or will soon. Words on the links seem to be especially useful.
Page popularity: many use DirectHit's popularity rankings.

Ranking Algorithms

The problem of ranking Cat cat cat Dog dog dog Fish fish fish Cat cat cat Orangutang Fish Query: cat dog fish orangutang Which is the best match?

Assigning Weights to Terms
Binary weights
Raw term frequency
tf x idf: recall the Zipf distribution; we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole
Automatically derived thesaurus terms

Binary Weights Only the presence (1) or absence (0) of a term is included in the vector

Raw Term Weights The frequency of occurrence for the term in each document is included in the vector

Assigning Weights Goal: give more weight to terms that are Common in THIS document Uncommon in the collection as a whole The tf x idf measure: term frequency (tf) inverse document frequency (idf)
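A minimal sketch of this weighting, using a log(N/df) form of idf and hypothetical toy documents (real systems use many variants of the formula):

```python
import math

def tf_idf(docs):
    """docs: {doc_id: list of tokens}. Returns {doc_id: {term: weight}}."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = {}
    for tokens in docs.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for doc_id, tokens in docs.items():
        # Term frequency within this document.
        tf = {}
        for term in tokens:
            tf[term] = tf.get(term, 0) + 1
        # A term occurring in every document gets idf log(1) = 0.
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

docs = {"d1": ["cat", "cat", "dog"],
        "d2": ["dog", "fish"],
        "d3": ["cat", "fish", "fish"]}
w = tf_idf(docs)
```

In `d1`, "cat" (frequent in the document) gets exactly twice the weight of "dog", since both occur in two of the three documents.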

Document Vectors
Documents are represented as "bags of words", and as vectors when used computationally.
A vector is like an array of floating-point numbers; each vector holds a place for every term in the collection. Therefore, most vectors are sparse.

Document Vectors
One location for each word: one row per document ID (A–I), with columns for the terms nova, galaxy, heat, h'wood, film, role, diet, fur.
"Nova" occurs 10 times in text A; "Galaxy" occurs 5 times in text A; "Heat" occurs 3 times in text A. (Blank means 0 occurrences.)
"Hollywood" occurs 7 times in text I; "Film" occurs 5 times in text I; "Diet" occurs 1 time in text I; "Fur" occurs 3 times in text I.

Vector Space Model Documents are represented as vectors in term space Terms are usually stems Documents represented by binary vectors of terms Queries represented the same as documents Query and Document weights are based on length and direction of their vector A vector distance measure between the query and documents is used to rank retrieved documents

Documents in 3D Space Assumption: Documents that are “close together” in space are similar in meaning.

tf x idf
w_ij = tf_ij x log(N / n_j), where tf_ij is the frequency of term j in document i, N is the number of documents in the collection, and n_j is the number of documents containing term j.

Computing Similarity Scores
sim(q, d) = (q · d) / (|q| |d|): the cosine of the angle between the query vector and the document vector.

The results of ranking Cat cat cat Dog dog dog Fish fish fish Cat cat cat Orangutang Fish Query: cat dog fish orangutang What does vector space ranking do?
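Vector space ranking answers this by scoring each document with a vector distance measure, most commonly cosine similarity. A sketch on the cat/dog/fish example (weights here are raw term frequencies; an illustrative toy, not an engine's actual code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = {"cat": 1.0, "dog": 1.0, "fish": 1.0, "orangutang": 1.0}
doc_a = {"cat": 3.0, "dog": 3.0, "fish": 3.0}         # cat cat cat dog dog dog fish fish fish
doc_b = {"cat": 3.0, "orangutang": 1.0, "fish": 1.0}  # cat cat cat orangutang fish
ranked = sorted([("a", cosine(query, doc_a)), ("b", cosine(query, doc_b))],
                key=lambda x: -x[1])
```

Cosine rewards documents whose direction in term space matches the query: document A covers three of the four query terms heavily and ranks first, even though B matches "orangutang".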

High-Precision Ranking
Proximity search can help get high-precision results if the query has more than one term.
Hearst '96 paper: combine Boolean and passage-level proximity; shows significant improvements when retrieving the top 5, 10, 20, 30 documents. Results reproduced by Mitra et al. 98. Google uses something similar.
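One simple passage-level proximity signal is the smallest window of tokens containing all query terms; documents with tighter windows rank higher. A sketch (an illustrative heuristic, not the exact algorithm from the Hearst '96 paper):

```python
def min_window(tokens, terms):
    """Smallest span (in tokens) containing all query terms, or None if some term is absent."""
    positions = {t: [i for i, tok in enumerate(tokens) if tok == t] for t in terms}
    if any(not p for p in positions.values()):
        return None
    best = None
    for i, tok in enumerate(tokens):
        if tok not in terms:
            continue
        # Candidate window starting at i: extend to the nearest later
        # occurrence of every term; the window ends at the farthest one.
        end = 0
        ok = True
        for t in terms:
            later = [p for p in positions[t] if p >= i]
            if not later:
                ok = False
                break
            end = max(end, later[0])
        if ok:
            span = end - i + 1
            if best is None or span < best:
                best = span
    return best

near = min_window("the cat sat near the dog while the fish swam".split(),
                  {"cat", "dog", "fish"})
```

A document where the terms are adjacent (window 3) would outrank this one (window 8) under a smaller-is-better proximity score.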

What is Really Being Used?
Lots of variation here; pretty messy in many cases; details usually proprietary and fluctuating.
Combining subsets of: term frequencies; term proximities; term position (title, top of page, etc.); term characteristics (boldface, capitalized, etc.); link analysis information; category information; popularity information.

Web Spam
Spam: undesired content.
Web spam: content disguised as something it is not, so that it will be retrieved more often than it otherwise would, or retrieved in contexts where it otherwise would not be.

Web Spam
What are the types of web spam?
Add extra terms to get a higher ranking: repeat "cars" thousands of times.
Add irrelevant terms to get more hits: put a dictionary in the comments field; put extra terms in the same color as the background of the web page.
Add irrelevant terms to get different types of hits: put "sex" in the title field of sites that are selling cars.
Add irrelevant links to boost your link-analysis ranking.
There is a constant "arms race" between web search companies and spammers.

Inverted Index
This is the primary data structure for text indexes.
Main idea: invert documents into a big index.
Basic steps: make a "dictionary" of all the tokens in the collection; for each token, list all the docs it occurs in; do a few things to reduce redundancy in the data structure.

Inverted indexes Permit fast search for individual terms For each term, you get a list consisting of: document ID frequency of term in doc (optional) position of term in doc (optional) These lists can be used to solve Boolean queries Also used for statistical ranking algorithms

Inverted Indexes An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

How Are Inverted Files Created?
Documents are parsed to extract tokens; these are saved with the document ID.
Doc 1: "Now is the time for all good men to come to the aid of their country"
Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"

How Inverted Files are Created After all documents have been parsed the inverted file is sorted alphabetically.

How Inverted Files are Created Multiple term entries for a single document are merged. Within-document term frequency information is compiled.

How Inverted Files are Created Dictionary Postings

Inverted indexes Permit fast search for individual terms For each term, you get a list consisting of: document ID frequency of term in doc (optional) position of term in doc (optional) These lists can be used to solve Boolean queries: country -> d1, d2 manor -> d2 country AND manor -> d2 Also used for statistical ranking algorithms

How Inverted Files Are Used
Dictionary and postings. Query on "time" AND "dark":
2 docs with "time" in the dictionary → IDs 1 and 2 from the postings file.
1 doc with "dark" in the dictionary → ID 2 from the postings file.
Therefore, only doc 2 satisfies the query.
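This dictionary-and-postings lookup can be sketched end to end in a few lines, using the two example documents from earlier (a toy in-memory illustration, not a production index):

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: within-doc frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def boolean_and(index, *terms):
    """Doc IDs containing every term: intersect the postings lists."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
index = build_inverted_index(docs)
hits = boolean_and(index, "time", "dark")
```

As on the slide, "time" posts to docs 1 and 2, "dark" only to doc 2, so the AND query returns only doc 2.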

Web Search Architecture

Preprocessing: collection gathering phase (web crawling) and collection indexing phase.
Online: query servers.

An Example Search System: Cha-Cha A system for searching complex intranets Places retrieval results in context Important design goals: Users at any level of computer expertise Browsers at any version level Computers of any speed

How Cha-Cha Works Crawl the Intranet Compute the shortest hyperlink path from a certain root page to every web page Index and compute metadata for the pages

Cha-Cha System Architecture crawl the web store the documents

Cha-Cha System Architecture crawl the web store the documents create files of metadata Cheshire II

Cha-Cha Metadata Information about web pages Title Length Inlinks Outlinks Shortest Paths from a root home page Used to provide innovative search interface


Cha-Cha System Architecture crawl the web create a keyword index store the documents create files of metadata Cheshire II

Creating a Keyword Index
For each document: tokenize the document (break it up into tokens: words, stems, punctuation; there are many variations on this), then record which tokens occurred in this document.
The result is called an inverted index:
Dictionary: a record of all the tokens in the collection and their overall frequency.
Postings file: a list recording, for each token, which documents it occurs in and how often.

Responding to the User Query User searches on “pam samuelson” Search Engine looks up documents indexed with one or both terms in its inverted index Search Engine looks up titles and shortest paths in the metadata index User Interface combines the information and presents the results as HTML

Cha-Cha System Architecture Cheshire II user query

Cha-Cha System Architecture Cheshire II server accesses the databases

Cha-Cha System Architecture Cheshire II results shown to user

Standard Web Search Engine Architecture
Offline: crawl the web; check for duplicates, store the documents; create an inverted index.
Online: user query → search engine servers → inverted index → DocIds → show results to user.

Inverted Indexes for Web Search Engines Inverted indexes for word lists Some systems partition the indexes across different machines; each machine handles different parts of the data Other systems duplicate the data across many machines; queries are distributed among the machines Most do a combination of these

From a description of the FAST search engine, by Knut Risvik. In this example, the data for the pages is partitioned across machines; additionally, each partition is allocated multiple machines to handle the queries. Each row can handle 120 queries per second, and each column can handle 7M pages. To handle more queries, add another row.

Cascading Allocation of CPUs A variation on this that produces a cost- savings: Put high-quality/common pages on many machines Put lower quality/less common pages on fewer machines Query goes to high quality machines first If no hits found there, go to other machines
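The cascade amounts to trying tiers of machines in quality order and stopping at the first tier that returns hits. A sketch with hypothetical in-memory tiers standing in for machine pools:

```python
def cascaded_search(query, tiers):
    """tiers: list of search functions, highest-quality/most-common pages first.
    Each returns a (possibly empty) list of hits; stop at the first tier with hits."""
    for search in tiers:
        hits = search(query)
        if hits:
            return hits
    return []

# Hypothetical tiers: a small index of popular pages, then the long tail.
popular = {"cars": ["toyota.com", "honda.com"]}
tail = {"cars": ["cars-blog.example"], "obscure": ["tiny.example"]}
tiers = [lambda q: popular.get(q, []), lambda q: tail.get(q, [])]

common_hits = cascaded_search("cars", tiers)    # served entirely by the first tier
rare_hits = cascaded_search("obscure", tiers)   # falls through to the second tier
```

The cost savings comes from most queries never reaching the (cheaper, smaller) lower tiers.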

Web Crawlers How do the web search engines get all of the items they index? Main idea: Start with known sites Record information for these sites Follow the links from each site Record information found at new sites Repeat

Web Crawlers
How do the web search engines get all of the items they index? More precisely:
Put a set of known sites on a queue.
Repeat the following until the queue is empty:
  Take the first page off of the queue.
  If this page has not yet been processed:
    Record the information found on this page (positions of words, links going out, etc.).
    Add each link on the current page to the queue.
    Record that this page has been processed.
In what order should the links be followed?
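The queue-based procedure above is a breadth-first traversal of the link graph. A sketch with a hypothetical in-memory link graph standing in for actual page fetches:

```python
from collections import deque

def crawl(seeds, get_links, limit=100):
    """BFS crawl: seeds is a list of start URLs; get_links(url) returns outlinks.
    Returns pages in the order they were processed."""
    queue = deque(seeds)
    seen = set(seeds)
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)            # record information found on this page
        for link in get_links(url):
            if link not in seen:     # not yet processed or queued
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical link graph standing in for fetched pages.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
pages = crawl(["a"], lambda u: graph.get(u, []))
```

Swapping the deque's `popleft` for `pop` would turn this into a depth-first crawl; the `seen` set is what prevents infinite loops on cyclic links like d → a.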

Page Visit Order
Animated examples of breadth-first vs depth-first search on trees (must be in presentation mode to see the animations): the structure to be traversed, then breadth-first search, then depth-first search.

Web Crawling Issues
"Keep-out" signs: a file called robots.txt tells the crawler which directories are off limits.
Freshness: figure out which pages change often, and recrawl these often.
Duplicates, virtual hosts, etc.: convert page contents with a hash function; compare new pages to the hash table.
Lots of problems: server unavailable; incorrect HTML; missing links; infinite loops.
Web crawling is difficult to do robustly!
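The hash-based duplicate check mentioned above can be sketched as follows; the normalization step (lowercasing, collapsing whitespace) is an illustrative choice, and real crawlers use more robust fingerprints such as shingling:

```python
import hashlib

def content_fingerprint(html):
    """Hash of normalized page content; identical fingerprints flag duplicates
    (e.g., the same page served under several virtual hosts)."""
    normalized = " ".join(html.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = {}  # fingerprint -> first URL seen with that content

def is_duplicate(url, html):
    fp = content_fingerprint(html)
    if fp in seen:
        return True
    seen[fp] = url
    return False
```

A mirror serving byte-identical (or trivially reformatted) content is caught; pages that differ in any substantive byte are not, which is why exact hashing is only a first line of defense.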

Commercial Issues General internet search is often commercially driven Commercial sector sometimes hides things – harder to track than research On the other hand, most CTOs for search engine companies used to be researchers, and so help us out Commercial search engine information changes monthly Sometimes motivations are commercial rather than technical

For More Information
IS213: Information Organization and Retrieval
Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley.
Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proceedings of WWW7 / Computer Networks 30(1-7), April.
Jurgen Koenemann and Nicholas J. Belkin, "A Case for Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness," Proceedings of ACM/CHI, Vancouver.
Marti Hearst, "Improving Full-Text Precision on Short Queries using Simple Constraints," Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval (SDAIR), Las Vegas, NV, April.