
1 Search Engines & Question Answering Giuseppe Attardi Università di Pisa

2 Topics
Web Search
–Search engines
–Architecture
–Crawling: parallel/distributed, focused
–Link analysis (Google PageRank)
–Scaling

3 Top Online Activities (Source: Jupiter Communications, 2000)

4 Pew Study (US users, July 2002)
Total Internet users = 111 M
Do a search on any given day = 33 M
Have used Internet to search = 85%
http://www.pewinternet.org/reports/toc.asp?Report=64

5 Search on the Web
Corpus: the publicly accessible Web: static + dynamic
Goal: retrieve high quality results relevant to the user’s need
–(not docs!)
Need
–Informational – want to learn about something (~40%)
–Navigational – want to go to that page (~25%)
–Transactional – want to do something (web-mediated) (~35%)
 Access a service
 Downloads
 Shop
–Gray areas
 Find a good hub
 Exploratory search “see what’s there”
Example queries: Low hemoglobin, United Airlines, Tampere weather, Mars surface images, Nikon CoolPix, Car rental Finland

6 Results
Static pages (documents)
–text, mp3, images, video, ...
Dynamic pages = generated on request
–data base access
–“the invisible web”
–proprietary content, etc.

7 Terminology
URL = Universal Resource Locator
Example: http://www.cism.it/cism/hotels_2001.htm
–Access method: http
–Host name: www.cism.it
–Page name: /cism/hotels_2001.htm
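These parts can be pulled out programmatically; below is a minimal sketch using Python's standard urllib.parse on the slide's example URL (the variable names are just for illustration).

```python
from urllib.parse import urlparse

# Example URL from the slide
url = "http://www.cism.it/cism/hotels_2001.htm"
parts = urlparse(url)

print(parts.scheme)   # access method: 'http'
print(parts.netloc)   # host name: 'www.cism.it'
print(parts.path)     # page name: '/cism/hotels_2001.htm'
```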

8 Scale
Immense amount of content
–2-10B static pages, doubling every 8-12 months
–Lexicon size: 10s-100s of millions of words
Authors galore (1 in 4 hosts run a web server)
http://www.netcraft.com/Survey

9 Diversity
Languages/Encodings
–Hundreds (thousands?) of languages; W3C encodings: 55 (Jul01) [W3C01]
–Home pages (1997): English 82%, next 15: 13% [Babe97]
–Google (mid 2001): English: 53%, JGCFSKRIP: 30%
Document & query topic
Popular query topics (from 1 million Google queries, Apr 2000):
–Top-level categories: Arts 14.6%, Computers 13.8%, Regional 10.3%, Society 8.7%, Adult 8%, Recreation 7.3%, Business 7.2%, …
–Subcategories: Arts: Music 6.1%, Regional: North America 5.3%, Adult: Image Galleries 4.4%, Computers: Software 3.4%, Computers: Internet 3.2%, Business: Industries 2.3%, Regional: Europe 1.8%, …

10 Rate of change [Cho00] 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999 Mathematically, what does this seem to be?

11 Web idiosyncrasies
Distributed authorship
–Millions of people creating pages with their own style, grammar, vocabulary, opinions, facts, falsehoods …
–Not all have the purest motives in providing high-quality information – commercial motives drive “spamming” – 100s of millions of pages.
–The open web is largely a marketing tool. IBM’s home page does not contain the word “computer”.

12 Other characteristics
Significant duplication
–Syntactic – 30%-40% (near) duplicates [Brod97, Shiv99b]
–Semantic – ???
High linkage
–~8 links/page on average
Complex graph topology
–Not a small world; bow-tie structure [Brod00]
More on these corpus characteristics later
–how do we measure them?

13 Web search users
Ill-defined queries
–Short (AV 2001: 2.54 terms avg, 80% < 3 words)
–Imprecise terms
–Sub-optimal syntax (80% of queries without operator)
–Low effort
Wide variance in
–Needs
–Expectations
–Knowledge
–Bandwidth
Specific behavior
–85% look over one result screen only (mostly above the fold)
–78% of queries are not modified (one query/session)
–Follow links – “the scent of information”...

14 Evolution of search engines
First generation – use only “on page”, text data (1995-1997: AV, Excite, Lycos, etc.)
–Word frequency, language
Second generation – use off-page, web-specific data (from 1998; made popular by Google, but everyone uses it now)
–Link (or connectivity) analysis
–Click-through data (what results people click on)
–Anchor-text (how people refer to this page)
Third generation – answer “the need behind the query” (still experimental)
–Semantic analysis – what is this about?
–Focus on user need, rather than on query
–Context determination
–Helping the user
–Integration of search and text analysis

15 Third generation search engine: answering “the need behind the query”
Query language determination
Different ranking
–(if the query is Japanese, do not return English)
Hard & soft matches
–Personalities (triggered on names)
–Cities (travel info, maps)
–Medical info (triggered on names and/or results)
–Stock quotes, news (triggered on stock symbol)
–Company info, …
Integration of Search and Text Analysis

16 Answering “the need behind the query”
Context determination
–spatial (user location/target location)
–query stream (previous queries)
–personal (user profile)
–explicit (vertical search, family friendly)
–implicit (use AltaVista from AltaVista France)
Context use
–Result restriction
–Ranking modulation

17 The spatial context – geo-search
Two aspects
–Geo-coding: encode geographic coordinates to make search effective
–Geo-parsing: the process of identifying geographic context
Geo-coding
–Geometrical hierarchy (squares)
–Natural hierarchy (country, state, county, city, zip-codes, etc.)
Geo-parsing
–Pages (infer from phone nos, zip, etc.). About 10% feasible.
–Queries (use dictionary of place names)
–Users: from IP data
–Mobile phones: in its infancy, many issues (display size, privacy, etc.)

18 AV barry bonds

19 Lycos palo alto

20 Geo-search example - Northern Light (Now Divine Inc)

21 Helping the user
UI
spell checking
query refinement
query suggestion
context transfer …

22 Context sensitive spell check

23 Search Engine Architecture
(Block diagram: Crawlers and Crawl Control feed a Page Repository; the Indexer and Link Analysis build Text and Structure indexes; a Document Store supports Snippet Extraction; the Query Engine and Ranking turn Queries into Results.)

24 Terms
Crawler
Crawler control
Indexes – text, structure, utility
Page repository
Indexer
Collection analysis module
Query engine
Ranking module

25 Repository “Hidden Treasures”

26 Storage
The page repository is a scalable storage system for web pages
Allows the Crawler to store pages
Allows the Indexer and Collection Analysis to retrieve them
Similar to other data storage systems – DB or file systems
Does not have to provide some of the other systems’ features: transactions, logging, directory

27 Storage Issues
Scalability and seamless load distribution
Dual access modes
–Random access (used by the query engine for cached pages)
–Streaming access (used by the Indexer and Collection Analysis)
Large bulk updates – reclaim old space, avoid access/update conflicts
Obsolete pages – remove pages no longer on the web

28 Designing a Distributed Web Repository
Repository designed to work over a cluster of interconnected nodes
Page distribution across nodes
Physical organization within a node
Update strategy

29 Page Distribution
How to choose a node to store a page
Uniform distribution – any page can be sent to any node
Hash distribution policy – hash the page ID space into the node ID space
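A minimal sketch of the hash distribution policy described above; the node count and the choice of MD5 are assumptions for illustration, not details from the slides.

```python
import hashlib

NUM_NODES = 16  # assumed cluster size

def node_for_page(page_id: str) -> int:
    """Hash the page ID (e.g. its normalized URL) into the node ID space."""
    digest = hashlib.md5(page_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

# Example: decide which node stores a crawled page
print(node_for_page("http://www.cism.it/cism/hotels_2001.htm"))
```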

30 Organization Within a Node
Several operations required
–Add / remove a page
–High speed streaming
–Random page access
Hashed organization
–Treat each disk as a hash bucket
–Assign according to a page’s ID
Log organization
–Treat the disk as one file, and add the page at the end
–Support random access using a B-tree
Hybrid
–Hash map a page to an extent and use log structure within an extent

31 Distribution Performance
                             Log   Hashed   Hashed Log
Streaming performance        ++     -        +
Random access performance    +-     ++       +-
Page addition                ++     -        +

32 Update Strategies
Updates are generated by the crawler
Several characteristics
–Time in which the crawl occurs and the repository receives information
–Whether the crawl’s information replaces the entire database or modifies parts of it

33 Batch vs. Steady
Batch mode
–Periodically executed
–Allocated a certain amount of time
Steady mode
–Run all the time
–Always send results back to the repository

34 Partial vs. Complete Crawls
A batch mode crawler can
–Do a complete crawl every run, and replace the entire collection
–Recrawl only a specific subset, and apply updates to the existing collection – partial crawl
The repository can implement
–In-place update: quickly refresh pages
–Shadowing: update as another stage, avoiding refresh-access conflicts

35 Partial vs. Complete Crawls
Shadowing resolves the conflicts between updates and reads for queries
Batch mode fits well with shadowing
A steady crawler fits well with in-place updates

36 Indexing

37 The Indexer Module
Creates two indexes:
Text (content) index: uses “traditional” indexing methods such as inverted indexing
Structure (links) index: uses a directed graph of pages and links; sometimes also creates an inverted graph

38 The Link Analysis Module
Uses the two basic indexes created by the indexer module in order to assemble “utility indexes”, e.g. a site index.

39 Inverted Index
A set of inverted lists, one per index term (word)
Inverted list of a term: a sorted list of locations in which the term appears
Posting: a pair (w, l) where w is a word and l is one of its locations
Lexicon: holds all the index’s terms, with statistics about each term (not the postings)
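A toy sketch tying these terms together (postings, inverted lists, lexicon); real engines keep compressed, disk-resident structures, so this is illustrative only.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping doc_id -> text. Returns (inverted_index, lexicon)."""
    inverted = defaultdict(list)   # term -> sorted list of locations
    for doc_id in sorted(docs):
        for pos, word in enumerate(docs[doc_id].lower().split()):
            # the location (doc_id, pos), together with the term key, forms a posting (w, l)
            inverted[word].append((doc_id, pos))
    # Lexicon: per-term statistics only (document frequency, total occurrences)
    lexicon = {term: {"df": len({d for d, _ in plist}), "count": len(plist)}
               for term, plist in inverted.items()}
    return dict(inverted), lexicon

docs = {1: "web search engines index the web",
        2: "question answering over web pages"}
index, lexicon = build_index(docs)
print(index["web"])     # inverted list for 'web': [(1, 0), (1, 5), (2, 3)]
print(lexicon["web"])   # statistics stored in the lexicon
```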

40 Challenges
Index build must be:
–Fast
–Economic (unlike traditional index building)
Incremental indexing must be supported
Storage: compression vs. speed

41 Index Partitioning
Distributed text indexing can be done by:
Local inverted file (IF_L)
–Each node contains a disjoint random subset of the pages
–The query is broadcast to all nodes
–The result joins the per-node answers
Global inverted file (IF_G)
–Each node is responsible only for a subset of the terms in the collection
–The query is sent only to the appropriate nodes
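A toy sketch contrasting the two schemes on an in-memory collection; the data and the one-node-per-term simplification for IF_G are assumptions for illustration.

```python
docs = {1: "web search engines", 2: "question answering systems",
        3: "web question answering", 4: "search and ranking"}

def postings(subset):
    """Build term -> set(doc_id) for a subset of the collection."""
    idx = {}
    for d, text in subset.items():
        for w in text.split():
            idx.setdefault(w, set()).add(d)
    return idx

# Local inverted file (IF_L): each node indexes a disjoint set of pages
local_nodes = [postings({1: docs[1], 2: docs[2]}),
               postings({3: docs[3], 4: docs[4]})]

def query_local(terms):
    # The query is broadcast to every node; per-node answers are joined
    result = set()
    for node in local_nodes:
        hits = [node.get(t, set()) for t in terms]
        result |= set.intersection(*hits) if hits else set()
    return result

# Global inverted file (IF_G): each node owns a subset of the terms
full = postings(docs)
global_nodes = {t: full[t] for t in full}   # simplified: one "node" per term

def query_global(terms):
    # The query goes only to the nodes responsible for its terms
    hits = [global_nodes.get(t, set()) for t in terms]
    return set.intersection(*hits) if hits else set()

print(query_local(["web", "question"]))    # {3}
print(query_global(["web", "question"]))   # {3}
```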

42 Indexing, Conclusion
Web page indexing is complicated by its scale (millions of pages, hundreds of gigabytes)
Challenges: incremental indexing and personalization

43 Scaling

44 Scaling
Google (Nov 2002):
–Number of pages: 3 billion
–Refresh interval: 1 month (1200 pages/sec)
–Queries/day: 150 million = 1700 q/s
Avg page size: 10 KB
Avg query size: 40 B
Avg result size: 5 KB
Avg links/page: 8

45 Size of Dataset
Total raw HTML data size: 3G x 10 KB = 30 TB
Inverted index ~= corpus = 30 TB
Using 3:1 compression: 20 TB of data on disk

46 Single copy of index
Index: (10 TB) / (100 GB per disk) = 100 disks
Documents: (10 TB) / (100 GB per disk) = 100 disks

47 Query Load
1700 queries/sec
Rule of thumb: 20 q/s per CPU
–85 clusters to answer queries
Cluster: 100 machines
Total = 85 x 100 = 8500
Document servers
–Snippet search: 1000 snippets/s
–(1700 * 10 / 1000) * 100 = 1700
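The back-of-envelope numbers on slides 45-47 can be reproduced directly; this is a sketch using only the constants quoted on the slides.

```python
# Figures quoted on the slides
pages          = 3e9         # 3 billion pages
page_size      = 10e3        # 10 KB average page
queries_per_s  = 1700
qps_per_cpu    = 20          # rule of thumb
snippets_per_s = 1000        # per document server
cluster_size   = 100         # machines holding one full index copy

corpus_tb = pages * page_size / 1e12        # ~30 TB raw HTML
index_tb  = corpus_tb                       # inverted index ~= corpus
on_disk   = (corpus_tb + index_tb) / 3      # 3:1 compression -> ~20 TB

clusters  = queries_per_s / qps_per_cpu     # ~85 index-cluster replicas
machines  = clusters * cluster_size         # ~8500 index servers
doc_srv   = queries_per_s * 10 / snippets_per_s * 100   # ~1700 document servers

print(corpus_tb, on_disk, clusters, machines, doc_srv)
```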

48 Limits
Redirector: 4000 req/sec
Bandwidth: 1100 req/sec
Server: 22 q/s each
Cluster: 50 nodes = 1100 q/s = 95 million q/day

49 Scaling the Index
(Diagram: queries reach a hardware-based load balancer in front of Google Web Servers, which fan out to index servers, document servers, the spell checker and the ad server.)

50 Pooled Shard Architecture
(Diagram: the web server sends each query through an index load balancer to index servers 1..K; each shard 1..N has its own pool of replicas behind an intermediate load balancer; pool network at 100 Mb/s, index server network at 1 Gb/s.)

51 Replicated Index Architecture
(Diagram: the web server sends each query through the index load balancer to one of M index servers, each holding a full copy of the index with all shards S_1..S_N; index server network at 1 Gb/s, 100 Mb/s links to the servers.)

52 Index Replication
100 Mb/s bandwidth
20 TB x 8 / 100
20 TB require one full day

53 Ranking

54 First generation ranking
Extended Boolean model
–Matches: exact, prefix, phrase, …
–Operators: AND, OR, AND NOT, NEAR, …
–Fields: TITLE:, URL:, HOST:, …
–AND is somewhat easier to implement, maybe preferable as default for short queries
Ranking
–TF-like factors: TF, explicit keywords, words in title, explicit emphasis (headers), etc.
–IDF factors: IDF, total word count in corpus, frequency in query log, frequency in language

55 Second generation search engine
Ranking – use off-page, web-specific data
–Link (or connectivity) analysis
–Click-through data (what results people click on)
–Anchor-text (how people refer to this page)
Crawling
–Algorithms to create the best possible corpus

56 Connectivity analysis
Idea: mine hyperlink information in the Web
Assumptions:
–Links often connect related pages
–A link between pages is a recommendation: “people vote with their links”

57 Citation Analysis
Citation frequency
Co-citation coupling frequency
–Cocitation with a given author measures “impact”
–Cocitation analysis [Mcca90]
Bibliographic coupling frequency
–Articles that co-cite the same articles are related
Citation indexing
–Who is a given author cited by? (Garfield [Garf72])
Pinsker and Narin

58 Query-independent ordering
First generation: using link counts as simple measures of popularity
Two basic suggestions:
–Undirected popularity: each page gets a score = the number of in-links plus the number of out-links (3+2=5)
–Directed popularity: score of a page = number of its in-links (3)
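A minimal sketch of both popularity scores on a made-up link graph.

```python
# Toy web graph: page -> pages it links to (made up for illustration)
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

in_links = {p: 0 for p in links}
for src, outs in links.items():
    for dst in outs:
        in_links[dst] = in_links.get(dst, 0) + 1

# Directed popularity: number of in-links
directed = in_links
# Undirected popularity: in-links plus out-links
undirected = {p: in_links.get(p, 0) + len(links.get(p, [])) for p in links}

print(directed)    # e.g. C has 3 in-links
print(undirected)
```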

59 Query processing
First retrieve all pages meeting the text query (say, venture capital)
Order these by their link popularity (either variant on the previous page)

60 Spamming simple popularity
Exercise: how do you spam each of the following heuristics so your page gets a high score?
–Each page gets a score = the number of in-links plus the number of out-links
–Score of a page = number of its in-links

61 PageRank scoring
Imagine a browser doing a random walk on web pages:
–Start at a random page
–At each step, go out of the current page along one of the links on that page, equiprobably (1/3 each in the figure’s three-link example)
“In the steady state” each page has a long-term visit rate – use this as the page’s score

62 Not quite enough
The web is full of dead-ends
–A random walk can get stuck in dead-ends
–Makes no sense to talk about long-term visit rates

63 Teleporting
At each step, with probability 10%, jump to a random web page
With remaining probability (90%), go out on a random link
–If no out-link, stay put in this case

64 Result of teleporting
Now we cannot get stuck locally
There is a long-term rate at which any page is visited (not obvious, will show this)
How do we compute this visit rate?

65 PageRank
Tries to capture the notion of “importance of a page”
Uses backlinks for ranking
Avoids trivial spamming: distributes a page’s “voting power” among the pages it links to
An “important” page linking to a page will raise its rank more than a “not important” one

66 Simple PageRank
Given by: r(i) = Σ_{j ∈ B(i)} r(j) / N(j)
where B(i): set of pages linking to i; N(j): number of outgoing links from j
Well defined if the link graph is strongly connected
Based on the “Random Surfer Model”: the rank of a page equals the probability of being at that page

67 Computation of PageRank (1)
Write the ranks as a vector r and define the matrix A with A_ij = 1/N(i) if page i links to page j, and 0 otherwise; then Simple PageRank satisfies r = A^T r

68 Computation of PageRank (2)
Given a matrix A, an eigenvalue c and the corresponding eigenvector v are defined by Av = cv
Hence r is an eigenvector of A^T for eigenvalue 1
If G is strongly connected then r is unique

69 Computation of PageRank (3)
Simple PageRank can be computed by iteration: start from any initial vector r^(0) and repeatedly apply r^(k+1) = A^T r^(k) until convergence

70 PageRank Example

71 Practical PageRank: Problem
The web is not a strongly connected graph. It contains:
–“Rank sinks”: clusters of pages with no links pointing out of the cluster; pages outside the cluster will be ranked 0
–“Rank leaks”: a page without outgoing links; all pages will be ranked 0

72 Practical PageRank: Solution
Remove all rank leaks
Add a decay factor d to Simple PageRank
Based on the “Bored Surfer Model”
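A short power-iteration sketch of PageRank with a decay factor. The slide's exact formula is not reproduced here; this follows the common damped formulation r(i) = (1-d)/n + d·Σ_{j∈B(i)} r(j)/N(j) (conventions for where d appears vary), and the toy graph and d = 0.85 are assumptions for illustration.

```python
def pagerank(links, d=0.85, iters=50):
    """links: page -> list of pages it links to. Returns page -> rank.
    Damped formulation: r(i) = (1-d)/n + d * sum_{j in B(i)} r(j)/N(j)."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for j, outs in links.items():
            if not outs:                      # rank leak: spread its rank evenly
                for p in pages:
                    new[p] += d * r[j] / n
            else:
                for i in outs:
                    new[i] += d * r[j] / len(outs)
        r = new
    return r

# Toy graph, made up for illustration
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```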

73 HITS: Hypertext Induced Topic Search
A query-dependent technique
Produces two scores:
–Authority: a page most likely to be relevant to a given query
–Hub: points to many authorities
Contains two parts:
–Identifying the focused subgraph
–Link analysis

74 HITS: Identifying the Focused Subgraph
Subgraph creation from a t-sized root set returned by a text search: expand it with pages that point to, or are pointed to by, root pages, adding at most d in-pointing pages per root page
(d reduces the influence of extremely popular pages like yahoo.com)

75 HITS: Link Analysis
Calculates authority & hub scores (a_i & h_i) for each page in S:
a_i = Σ_{j : j→i} h_j and h_i = Σ_{j : i→j} a_j, iterated until convergence

76 HITS: Link Analysis Computation
The scores can also be computed via eigenvectors: a = A^T h and h = A a, so a is the principal eigenvector of A^T A and h is the principal eigenvector of A A^T
where a: vector of authority scores; h: vector of hub scores; A: adjacency matrix in which A_ij = 1 if page i points to page j
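A sketch of the iterative computation, which is equivalent to power iteration on A^T A and A A^T; the toy graph and the normalization step are assumptions for illustration.

```python
def hits(links, iters=50):
    """links: page -> list of pages it points to. Returns (authority, hub) scores."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # a_i = sum of hub scores of pages pointing to i
        auth = {i: sum(hub[j] for j, outs in links.items() if i in outs) for i in pages}
        # h_i = sum of authority scores of pages i points to
        hub = {i: sum(auth[j] for j in links.get(i, [])) for i in pages}
        # normalize so the scores stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

a, h = hits({"A": ["B", "C"], "B": ["C"], "D": ["C", "B"]})
print(a, h)   # C gets the highest authority; A and D are the strongest hubs
```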

77 Markov chains
A Markov chain consists of n states, plus an n × n transition probability matrix P
At each step, we are in exactly one of the states
For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i
P_ii > 0 is OK

78 Markov chains
Clearly, for all i, Σ_j P_ij = 1 (each row of P sums to 1)
Markov chains are abstractions of random walks
Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain, for this case
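For the exercise, one way to build the teleporting walk's transition matrix from an adjacency matrix is sketched below (assuming NumPy; the 10% teleport probability and the stay-put rule for dead ends follow slide 63).

```python
import numpy as np

def transition_matrix(adj, teleport=0.10):
    """adj: n x n 0/1 adjacency matrix (adj[i, j] = 1 if i links to j).
    Returns the row-stochastic matrix P of the teleporting random walk."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    P = np.empty((n, n))
    for i in range(n):
        out = adj[i].sum()
        if out == 0:
            P[i] = teleport / n
            P[i, i] += 1 - teleport          # no out-link: stay put (slide 63)
        else:
            P[i] = teleport / n + (1 - teleport) * adj[i] / out
    return P

P = transition_matrix([[0, 1, 1],
                       [0, 0, 1],
                       [1, 0, 0]])
print(P.sum(axis=1))   # every row sums to 1
```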

79 Ergodic Markov chains
A Markov chain is ergodic if
–you have a path from any state to any other
–you can be in any state at every time step, with non-zero probability
Not ergodic (even/odd): e.g. a chain that strictly alternates between two states

80 Ergodic Markov chains
For any ergodic Markov chain, there is a unique long-term visit rate for each state
–Steady-state distribution
Over a long time-period, we visit each state in proportion to this rate
It doesn’t matter where we start

81 Probability vectors
A probability (row) vector x = (x_1, … x_n) tells us where the walk is at any point
E.g., (0 0 0 … 1 … 0 0 0) means we’re in state i (the single 1 is in position i of the n entries)
More generally, the vector x = (x_1, … x_n) means the walk is in state i with probability x_i

82 Change in probability vector
If the probability vector is x = (x_1, … x_n) at this step, what is it at the next step?
Recall that row i of the transition probability matrix P tells us where we go next from state i
So from x, our next state is distributed as xP

83 Computing the visit rate
The steady state looks like a vector of probabilities a = (a_1, … a_n):
–a_i is the probability that we are in state i
Example: two states 1 and 2 with P_11 = 1/4, P_12 = 3/4, P_21 = 1/4, P_22 = 3/4; for this example, a_1 = 1/4 and a_2 = 3/4

84 How do we compute this vector?
Let a = (a_1, … a_n) denote the row vector of steady-state probabilities
If our current position is described by a, then the next step is distributed as aP
But a is the steady state, so a = aP
Solving this matrix equation gives us a
–So a is the (left) eigenvector for P
–(Corresponds to the “principal” eigenvector of P with the largest eigenvalue)

85 One way of computing a
Recall that, regardless of where we start, we eventually reach the steady state a
Start with any distribution (say x = (1 0 … 0))
After one step, we’re at xP
After two steps at xP^2, then xP^3 and so on
“Eventually” means: for “large” k, xP^k = a
Algorithm: multiply x by increasing powers of P until the product looks stable
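The algorithm in a few lines, run on the two-state chain reconstructed on slide 83 (assuming NumPy).

```python
import numpy as np

P = np.array([[0.25, 0.75],     # two-state chain from slide 83
              [0.25, 0.75]])

x = np.array([1.0, 0.0])        # start in state 1
for _ in range(50):             # multiply by increasing powers of P
    nxt = x @ P
    if np.allclose(nxt, x):     # stop when the product looks stable
        break
    x = nxt

print(x)                        # -> [0.25, 0.75], i.e. a_1 = 1/4, a_2 = 3/4
```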

86 Lempel: SALSA
By applying the ergodic theorem, Lempel has proved that:
–a_i is proportional to the number of incoming links

87 PageRank summary
Preprocessing:
–Given the graph of links, build the matrix P
–From it compute a
–The entry a_i is a number between 0 and 1: the PageRank of page i
Query processing:
–Retrieve pages meeting the query
–Rank them by their PageRank
–Order is query-independent

88 The reality
PageRank is used in Google, but so are many other clever heuristics

