Web Algorithmics Web Search Engines. Retrieve docs that are “relevant” for the user query Doc : file word or pdf, web page, email, blog, e-book,... Query.

Presentation transcript:

Web Algorithmics Web Search Engines

Goal of a Search Engine
Retrieve docs that are “relevant” for the user query
  Doc: Word or PDF file, web page, email, blog, e-book, ...
  Query: “bag of words” paradigm
  Relevant ?!?

Two main difficulties
The Web:
  Language and encodings: hundreds…
  Distributed authorship: spam, format-less, …
  Dynamic: in one year 35% survive, 20% untouched
The User:
  Query composition: short (2.5 terms avg) and imprecise
  Query results: 85% of users look at just one result page
  Several needs: Informational, Navigational, Transactional
Extracting “significant data” is difficult!
Matching “user needs” is difficult!

Evolution of Search Engines
First generation (AltaVista, Excite, Lycos, etc.): use only on-page, web-text data
  Word frequency and language
Second generation (1998: Google): use off-page, web-graph data
  Link (or connectivity) analysis
  Anchor-text (how people refer to a page)
Third generation (Google, Yahoo, MSN, ASK, …): answer “the need behind the query”
  Focus on “user need”, rather than on query
  Integrate multiple data-sources
  Click-through data
Fourth generation → Information Supply
[Andrei Broder, VP emerging search tech, Yahoo! Research]

This is a search engine!!!

Web Algorithmics The structure of a Search Engine

The structure
[Figure: components of a search engine: Web, Crawler, Page archive, Page Analyzer, Indexer (text, structure, and auxiliary indexes), Query, Query resolver, Ranker, Control]

Problem: Indexing
Consider Wikipedia En:
  Collection size ≈ 10 Gbytes
  # docs ≈ 4 × 10^6
  # terms in total > 1 billion (avg term len = 6 chars)
  # distinct terms = several millions
Which kind of data structure do we build to support word-based searches?

DB-based solution: Term-Doc matrix
Entry is 1 if the play (doc) contains the word, 0 otherwise
# terms > 1M, # docs ≈ 4M
Space ≈ 1M × 4M bits = 4 Tb!

Current solution: Inverted index
[Figure: posting lists for the terms Brutus, the, Calpurnia]
Currently they get 2-4% of the original text
A term like Calpurnia may use log2 N bits per occurrence
A term like the should take about 1 bit per occurrence
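A minimal sketch of the idea behind the inverted index, assuming a toy in-memory collection: each term maps to the sorted list of docIDs that contain it. The three documents and the function name `build_inverted_index` are made up for illustration; a real engine would also tokenize more carefully and compress the lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sorted posting lists are what gap-coding and merging rely on.
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection (not taken from the slides).
docs = {
    1: "Brutus killed Caesar",
    2: "Calpurnia warned Caesar",
    3: "the noble Brutus",
}
index = build_inverted_index(docs)
print(index["brutus"], index["caesar"], index["calpurnia"])
```

Only the (term, docID) pairs that actually occur are stored, which is why the space is so much smaller than the term-doc matrix.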

Gap-coding for postings
Sort the docIDs
Store gaps between consecutive docIDs:
  Brutus: 33, 47, 154, 159, 202 … → 33, 14, 107, 5, 43 …
Two advantages:
  Space: store smaller integers (clustering?)
  Speed: query requires just a scan
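The gap transformation can be sketched in a few lines. The function names `to_gaps` and `from_gaps` are illustrative; the example list is the Brutus posting list from the slide.

```python
def to_gaps(doc_ids):
    """Turn a sorted list of docIDs into the first docID followed by gaps."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_gaps(gaps):
    """Rebuild the docIDs with a single left-to-right scan (prefix sums)."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

brutus = [33, 47, 154, 159, 202]
print(to_gaps(brutus))             # [33, 14, 107, 5, 43]
print(from_gaps(to_gaps(brutus)))  # [33, 47, 154, 159, 202]
```

Decoding needs only a sequential scan, which matches the "Speed" point above.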

γ code for integer encoding
For x > 0 with Length = ⌊log2 x⌋ + 1: write Length-1 zeros, followed by x in binary
e.g., 9 is represented as 000 1001
γ code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
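A small sketch of γ encoding and decoding, using strings of '0'/'1' characters instead of packed bits to keep it readable. The function names are illustrative, and the decoder assumes a well-formed concatenation of codewords.

```python
def gamma_encode(x):
    """Gamma code of x > 0: (Length - 1) zeros followed by x in binary."""
    if x <= 0:
        raise ValueError("gamma code is defined only for x > 0")
    binary = bin(x)[2:]                    # x in binary, Length bits
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Decode a concatenation of gamma codewords into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":              # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

print(gamma_encode(9))                     # 0001001
stream = "".join(gamma_encode(x) for x in [9, 1, 3])
print(gamma_decode(stream))                # [9, 1, 3]
```

The codeword for 9 has 7 bits, in line with 2⌊log2 9⌋ + 1 = 7.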

Rice code (simplification of Golomb code)
It is a parametric code: depends on k
Quotient q = ⌊(v-1)/k⌋, and the rest is r = v - k*q - 1
Codeword: [q times 0s] 1 [r in log k bits]
Useful when integers are concentrated around k
How do we choose k? Usually k ≈ 0.69 * mean(v) [Bernoulli model]
Optimal for Pr(x) = p(1-p)^(x-1), where mean(x) = 1/p, and i.i.d. ints
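A sketch of the Rice code under the usual restriction that k is a power of two, k = 2^b, so the rest r fits in exactly b bits. The parameter name `b` and the function names are assumptions for illustration; bits are kept as a '0'/'1' string, and in practice b would be chosen so that 2^b is close to 0.69 * mean(v).

```python
def rice_encode(v, b):
    """Rice code of v >= 1 with k = 2**b: q zeros, a 1, then r in b bits."""
    if v < 1 or b < 0:
        raise ValueError("need v >= 1 and b >= 0")
    k = 1 << b
    q = (v - 1) // k                       # quotient, written in unary
    r = v - k * q - 1                      # rest, 0 <= r < k
    remainder = format(r, "b").zfill(b) if b > 0 else ""
    return "0" * q + "1" + remainder

def rice_decode(bits, b):
    """Decode a concatenation of Rice codewords (same b) into integers."""
    k = 1 << b
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "0":              # unary quotient
            q += 1
            i += 1
        i += 1                             # skip the terminating 1
        r = int(bits[i:i + b], 2) if b > 0 else 0
        i += b
        values.append(q * k + r + 1)
    return values

print(rice_encode(9, 2))                   # 00100
stream = "".join(rice_encode(v, 2) for v in [9, 1, 4, 5])
print(rice_decode(stream, 2))              # [9, 1, 4, 5]
```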

PForDelta coding
[Figure: bit layout 1011 … of a block of 128 numbers]
Use b (e.g. 2) bits to encode 128 numbers or create exceptions
Several approaches to encode exceptions
Choose b to encode 90% values, or trade-off: b↑ waste more bits, b↓ more exceptions
Translate data
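A deliberately simplified sketch of the PForDelta idea, not a faithful implementation: values that fit in b bits go into fixed-width slots, the others are recorded as exceptions. Storing exceptions as (position, value) pairs is just one possible choice among the "several approaches" mentioned above, no actual bit packing is done, and the names `choose_b`, `pfor_encode`, `pfor_decode` and the sample block are made up.

```python
import math

def choose_b(block, coverage=0.9):
    """Smallest bit width b such that at least `coverage` of the values fit."""
    ordered = sorted(block)
    pivot = ordered[math.ceil(coverage * len(ordered)) - 1]
    return max(1, pivot.bit_length())

def pfor_encode(block, b):
    """Split a block into b-bit slot values plus (position, value) exceptions."""
    limit = 1 << b
    slots, exceptions = [], []
    for pos, v in enumerate(block):
        if v < limit:
            slots.append(v)
        else:
            slots.append(0)                # placeholder slot for an exception
            exceptions.append((pos, v))
    return slots, exceptions

def pfor_decode(slots, exceptions):
    """Patch the exceptions back into the slot values."""
    values = list(slots)
    for pos, v in exceptions:
        values[pos] = v
    return values

# Hypothetical block of 128 small numbers with one outlier.
block = [1, 2, 3] * 42 + [1, 500]
b = choose_b(block)
slots, exceptions = pfor_encode(block, b)
print(b, exceptions)
```

A larger b leaves fewer exceptions but wastes bits on the many small values, which is the trade-off stated on the slide.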

Interpolative coding
[Figure: example sequence M of 12 sorted integers]
Recursive coding → preorder traversal of a balanced binary tree
At every step we know: lowest possible value, highest possible value, number of values, i.e. num = |M| = 12, low = 1, hi = 21
Take the middle element: h = 6 → M[6] = 9
It is 1 + 5 = 6 ≤ M[h] ≤ 21 - (12 - 6) = 15
The range [6, 15] has 10 possible values, so we can encode 9 in ⌈log2(15 - 6 + 1)⌉ = 4 bits
Recurse on the left half (lo = 1, hi = 8, num = 5) and on the right half (lo = 10, hi = 21, num = 6)
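A sketch of the recursion, listing for each element the range it must fall in and the number of bits needed, without producing the actual bit stream. The transcript does not preserve the example sequence, so the list `M` below is an assumption chosen to agree with the numbers on the slide (12 values, M[6] = 9, low = 1, hi = 21); the function name is illustrative too.

```python
import math

def interpolative_steps(values, lo, hi):
    """Yield (value, low_bound, high_bound, bits) in preorder."""
    num = len(values)
    if num == 0:
        return
    h = (num - 1) // 2                     # 0-based index of the middle element
    value = values[h]
    low_bound = lo + h                     # h smaller values must fit below it
    high_bound = hi - (num - 1 - h)        # the larger values must fit above it
    bits = math.ceil(math.log2(high_bound - low_bound + 1))
    yield value, low_bound, high_bound, bits
    yield from interpolative_steps(values[:h], lo, value - 1)
    yield from interpolative_steps(values[h + 1:], value + 1, hi)

# Assumed example sequence, consistent with the bounds quoted on the slide.
M = [1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21]
steps = list(interpolative_steps(M, 1, 21))
print(steps[0])                            # (9, 6, 15, 4)
```

When the range shrinks to a single value the element costs 0 bits, which is where interpolative coding gains on clustered docIDs.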

Query processing
1) Retrieve all pages matching the query
[Figure: posting lists for Brutus, the, Caesar]

Some optimization
Best order for query processing? Shorter lists first…
Query: Brutus AND Calpurnia AND Caesar
[Figure: posting lists for Brutus, Caesar, Calpurnia]
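A sketch of conjunctive query processing with the shorter-lists-first heuristic: posting lists are intersected by a merge-style scan, starting from the shortest so that intermediate results stay small. The posting lists in `index` are invented for the example, and the function names are illustrative.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """Evaluate term1 AND term2 AND ..., processing shorter lists first."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        if not result:                     # nothing can match any more
            break
        result = intersect(result, postings)
    return result

# Hypothetical posting lists (docIDs are made up).
index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [2, 13, 16],
}
print(and_query(index, ["Brutus", "Calpurnia", "Caesar"]))  # [2]
```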

Phrase queries
Expand the posting lists with word positions
  to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
  be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
Larger space occupancy, about ×10 on Web
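A sketch of how positional postings answer the phrase query "to be": a document matches when some position of "be" is exactly one past a position of "to". The dictionaries hold only the portions of the two lists shown on the slide (both are truncated there), and `phrase_matches` is an illustrative name.

```python
def phrase_matches(first, second):
    """Return {docID: [positions]} where `second` occurs right after `first`."""
    matches = {}
    for doc_id, positions in first.items():
        following = set(second.get(doc_id, []))
        hits = [p for p in positions if p + 1 in following]
        if hits:
            matches[doc_id] = hits
    return matches

# Positional postings as listed on the slide (docID -> word positions).
to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}

print(phrase_matches(to, be))              # {4: [16, 190, 429, 433]}
```

Among the listed postings, only document 4 contains both words, and "be" follows "to" there at four positions.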

Query processing
1) Retrieve all pages matching the query
2) Order pages according to various scores:
  Term position & freq (body, title, anchor, …)
  Link popularity
  User clicks or preferences
[Figure: posting lists for Brutus, the, Caesar]

Generating the snippets !

The big fight: find the best ranking...

Ranking: Google vs Google.cn