Information Retrieval and Web Search

Information Retrieval and Web Search Vasile Rus, PhD vrus@memphis.edu www.cs.memphis.edu/~vrus/teaching/ir-websearch/

Outline. More Web Search Issues; Web search engines; Technological background; Hardware; Distributed and positional inverted index; Advanced search capabilities.

Interfaces. Query interface: a simple box where you type a bag of words; advanced search adds Boolean operators, phrase search, wildcards, etc. Answer interface: a (ranked) list of documents, each shown with its URL, its size, the date the page was indexed, and a small fragment of the document.

Ranking. Based on the index, not on the real documents. Hard to compare across different search engines. Always improving. Recall is hard to measure.

Ranking. Yuwono and Lee proposed three ranking algorithms. Two are the standard Boolean and vector-model rankings, extended to pages that link to or are linked from pages in the answer set. The third, most-cited, ranks based on terms in pages that link to pages in the answer set.

Algorithms based on hyperlink structure. Based on the prestige principle: a page is popular if many other pages link to it. Query-based: WebQuery (the answer set is ranked by how well connected each page is) and HITS. Query-independent: PageRank.

HITS. Hyperlink-Induced Topic Search (Kleinberg 1999). Authorities: pages with many incoming links. Hubs: pages with many outgoing links. S: the set of pages that link to or are linked from pages in the answer set; V: the pages, i.e. the vertices, of the graph over this set.
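
The hub and authority idea can be sketched in a few lines of Python. This is a minimal illustration, not the slide's own formulation (its equations did not survive in the transcript): it assumes the usual mutual-reinforcement updates, where a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The function name `hits`, the graph format, and the tiny example graph are hypothetical.

```python
import math


def hits(graph, iterations=50):
    """Return (hub, authority) score dicts for a directed graph.

    graph maps each page to the list of pages it links to (assumed format).
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub update: sum the authority scores of all pages linked to.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {n: x / a_norm for n, x in auth.items()}
        hub = {n: x / h_norm for n, x in hub.items()}
    return hub, auth


# Hypothetical three-page graph: a and b both link to c.
example_graph = {"a": ["c"], "b": ["c"], "c": []}
hub_scores, auth_scores = hits(example_graph)
```

In this example, c ends up as the only authority, while a and b share the hub score.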

PageRank. Larry Page and Sergey Brin. “PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important".”

PageRank (cont’d)

Random walk algorithms. Usually applied to directed graphs. From a given vertex, the walker selects one of the out-edges at random. Let G = (V, E) be a directed graph with vertices V and edges E; In(Vi) = the predecessors of Vi; Out(Vi) = the successors of Vi; d = damping factor in [0, 1] (usually 0.85).
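
With these definitions, a PageRank-style score can be computed iteratively. The sketch below assumes the commonly cited form S(Vi) = (1 - d) + d * sum over Vj in In(Vi) of S(Vj) / |Out(Vj)|; the slide's own equation is not preserved in the transcript, so treat the exact formula, the function name `pagerank`, and the example graph as assumptions.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iteratively score the vertices of a directed graph.

    out_links maps each vertex to the list of its successors, Out(Vi).
    """
    nodes = set(out_links) | {v for ts in out_links.values() for v in ts}
    # Build In(Vi), the predecessors of each vertex.
    in_links = {n: [] for n in nodes}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Each predecessor shares its score equally among its out-edges.
        score = {
            n: (1 - d) + d * sum(score[u] / len(out_links[u]) for u in in_links[n])
            for n in nodes
        }
    return score


# Hypothetical graph: a and b link to c, and c links back to a.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Here b has no incoming links, so its score settles at 1 - d, while c, which collects two votes, scores highest.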

Crawling. Strategies: seed URLs; divide by country (.de; .it); depth-first; breadth-first.
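
The difference between breadth-first and depth-first crawling comes down to how the frontier of unvisited URLs is managed: a queue for breadth-first, a stack for depth-first. The sketch below works on a hypothetical in-memory link graph rather than fetching real pages; the function name `crawl_order` and the example URLs are illustrative only.

```python
from collections import deque


def crawl_order(links, seeds, strategy="breadth"):
    """Return the order in which pages are visited, starting from seed URLs.

    links maps a URL to the URLs found on that page (assumed, in-memory).
    """
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier:
        # Queue (FIFO) gives breadth-first; stack (LIFO) gives depth-first.
        url = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order


# Hypothetical link graph with a single seed.
web = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}
print(crawl_order(web, ["seed"], "breadth"))  # ['seed', 'a', 'b', 'c', 'd']
print(crawl_order(web, ["seed"], "depth"))    # ['seed', 'b', 'd', 'a', 'c']
```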

Indices. Variants of the inverted file. 50 GB needed to store descriptions of 100 million pages, at 500 bytes per URL + description (title + a few headings): 100 million × 500 bytes = 50 GB.

Browsing. Use web directories: Yahoo Directory; Google Directory (based on the Open Directory Project).

Meta-search engines. A search engine that passes the query to several other search engines and integrates the results: submit queries to host sites; parse the resulting HTML pages to extract search results; integrate the multiple rankings into a “consensus” ranking; present the integrated results to the user. Examples: Metacrawler, SavvySearch, Dogpile.
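
The slide does not say how the “consensus” ranking is computed. One simple possibility, shown here purely as an assumption, is a Borda count: each engine awards points to a result according to its position, and results are sorted by total points. The function name `borda_merge` and the example result lists are hypothetical.

```python
def borda_merge(rankings):
    """Merge several ranked result lists into one consensus ranking.

    rankings is a list of lists of URLs, best result first (assumed format).
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, url in enumerate(ranking):
            # The top result earns n points, the next n - 1, and so on.
            scores[url] = scores.get(url, 0) + (n - position)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical result lists from three engines.
merged = borda_merge([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]])
print(merged)  # ['b', 'a', 'c', 'd']
```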

WWW search engines. Challenges: huge document set; dynamic collection; very large number of users; different media types and formats; etc. Importance: gateway to the WWW; jobs for us.

Search engine usage. 7 billion searches in February 2007. Some search engines get around 3 billion searches per month.

Computational and storage considerations. The web is growing at an increasing rate. Indexing time for reported pages is growing. Considerable computational cost: Google uses approximately 450,000 servers to handle approximately 3 billion queries per month and to build and store the index. Storage is relatively easier: a 2003 estimate put the surface web at 170 TB. Search engines usually index most of the surface web and some of the deep web (e.g. phone books). Google is estimated to index about 8 billion pages. Most search engines cache all or some pages. Response has to be virtually immediate.

Distributed indexing technology. Individual machines are fault-prone: they can unpredictably slow down or fail. Maintain a master machine, considered “safe”, to direct the indexing job. Break up indexing into sets of (parallel) tasks. The master machine assigns each task to an idle machine from a pool.

Parallel tasks. Uses two sets of parallel tasks: parsers and inverters. Break the input document corpus into splits; each split is a subset of documents. The master assigns a split to an idle parser machine. The parser reads one document at a time and emits (term, doc) pairs.

Parallel tasks. The parser writes the pairs into j partitions, each for a range of terms’ first letters (e.g., a-f, g-p, q-z; here j = 3). Next, complete the index inversion.

Data flow. [Figure: the master assigns splits to the parsers and partitions to the inverters; the parsers write to the a-f, g-p, and q-z partitions; the inverters write the postings.]

Inverters. An inverter collects all (term, doc) pairs for a partition, sorts them, and writes them to postings lists. Each partition contains a set of postings.
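
The parser and inverter roles can be sketched in a single process. The code below is only an illustration of the data flow: a real system runs the parsers and inverters on separate machines under the master's control, which this sketch replaces with plain loops. The tokenizer (lowercase runs of the letters a-z), the function names, and the example documents are assumptions.

```python
import re
from collections import defaultdict

# Term partitions by first letter, as on the slide (j = 3).
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term):
    """Return the index of the partition covering the term's first letter."""
    for j, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return j
    raise ValueError(f"no partition for term {term!r}")


def parse(split):
    """Parser: read (doc_id, text) pairs and bucket (term, doc) pairs."""
    buckets = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in re.findall(r"[a-z]+", text.lower()):
            buckets[partition_of(term)].append((term, doc_id))
    return buckets


def invert(pairs):
    """Inverter: sort one partition's pairs and build its postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)


def build_index(splits):
    """Stand-in for the master: run every parser, then every inverter."""
    parsed = [parse(split) for split in splits]
    index = {}
    for j in range(len(PARTITIONS)):
        pairs = [pair for buckets in parsed for pair in buckets[j]]
        index.update(invert(pairs))
    return index


# Two hypothetical splits, one document each.
index = build_index([[(1, "to be or not to be")], [(2, "be quick")]])
print(index["be"])  # [1, 2]
```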

User interfaces. Principle of least astonishment: users expect to see their search terms on the page. How does this relate to the vector space model? What is the other option? Boolean (mixture). Simplicity: a single text box creates less confusion. Presenting the results: rank by relevance; provide snippets.

Advanced search features. In case you never noticed.

Advanced features. Semantically related words: e.g. in Google the “~” operator, as in “California ~hiking”, where “hiking” matches “outdoors”, “trail”, etc. Boolean operators (AND, OR, NOT). Search in specific document parts, e.g. title, keywords, URL. Site restrictions (search only specific sites). Phrase search. Proximity search.

Proximity and phrase search. Phrase search is one of the few advanced features frequently used by average users (some studies say 10%). Most search engines: double-quoted strings, e.g. “Natural Language Processing”. Proximity search: the NEAR keyword (AltaVista), as in Natural NEAR Processing; wildcard search (Google), as in “Natural * Processing” or “Pirates * Caribbean”, where the wildcard * matches multiple words.

Positional inverted index. Required for phrase search (e.g. “Information Retrieval”). Stores the position of each word occurrence in the document. Increases the index size to 2-4 times that of a non-positional index, or 30-50% of the size of the original text. Needs to index all stopwords. Standard in most search engines.

Positional inverted index. Store, for each term, entries of the form: <number of docs containing term; doc1: position1, position2 … ; doc2: position1, position2 … ; etc.>

Positional index example. <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1, 2, 4, 5 could contain “to be or not to be”? Position values/offsets can be compressed. Nevertheless, this expands postings storage substantially.

Processing a phrase query. Extract the inverted index entries for each distinct term: to, be, or, not. Merge their doc:position lists to enumerate all positions with “to be or not to be”. to: 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ... be: 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ... The same general method works for proximity searches.
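
The merge can be sketched as follows. The postings lists on the slide are truncated, so this example uses only the entries that are shown and looks for the two-word phrase “to be”; the dictionary layout and the function name `phrase_positions` are assumptions made for illustration.

```python
def phrase_positions(index, terms):
    """Return {doc_id: [start positions]} where terms occur consecutively.

    index maps term -> {doc_id: sorted list of positions} (assumed layout).
    """
    postings = [index[t] for t in terms]
    # Only documents that contain every term can contain the phrase.
    common_docs = set(postings[0])
    for p in postings[1:]:
        common_docs &= set(p)
    hits = {}
    for doc in sorted(common_docs):
        later = [set(p[doc]) for p in postings[1:]]
        # The k-th following term must sit exactly k + 1 positions later.
        starts = [
            pos
            for pos in postings[0][doc]
            if all(pos + k + 1 in later[k] for k in range(len(later)))
        ]
        if starts:
            hits[doc] = starts
    return hits


# Only the postings visible on the slide; the real lists continue.
index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_positions(index, ["to", "be"]))  # {4: [16, 190, 429, 433]}
```

For a proximity search, the exact-offset test would be relaxed to a check that the positions lie within some window of each other.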

Efficient merging with skip pointers. Top list: 2, 4, 8, 16, 32, 64, 128 (skip pointers to 16 and 128). Lower list: 1, 2, 3, 5, 8, 17, 21, 31 (skip pointers to 8 and 31). Suppose we’ve stepped through the lists until we process 8 on each list. When we get to 16 on the top list, we see that its successor is 32. But the skip successor of 8 on the lower list is 31, so we can skip ahead past the intervening postings.
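
A minimal sketch of intersecting two sorted postings lists with skip pointers follows. The slide does not say where skip pointers are placed, so this version assumes one roughly every sqrt(n) postings, a common heuristic; the placement therefore differs from the one drawn on the slide, though the result is the same. The function name `intersect_with_skips` is hypothetical.

```python
import math


def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, following skip pointers.

    Assumption: a skip pointer sits at every skip-th position of each
    list, with skip = floor(sqrt(length)).
    """
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result


top = [2, 4, 8, 16, 32, 64, 128]
lower = [1, 2, 3, 5, 8, 17, 21, 31]
print(intersect_with_skips(top, lower))  # [2, 8]
```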

Query processing. Some search engines use some form of lemmatization or stemming: plurals of nouns, morphological variations. In Google, this works not only for English. Case sensitivity: most engines are case-insensitive. Stopword removal.

Web search engine rankings. An unknown weighted combination of features. Link analysis: PageRank; Yahoo also uses weight information from their directory structure. Content analysis: think vector space model with Boolean constraints; special weights for different document parts (page title, keywords).

More special features. Hyperlink anchor text: higher rank if the search terms appear in anchor texts linking to the page. Google bombing: a large number of Web pages with links that point to a specific Web site so that the site will appear at the top. Term proximity: higher rank if the search terms appear in close proximity to each other in the text. Domain name and URL. And some features hidden by the secrecy of search engines…

Search engine features comparison. Source: www.searchengineshowdown.com

Summary. Web search technology.

Next. More on web search; text categorization.