Web Algorithmics: Web Search Engines
Goal of a Search Engine
Retrieve docs that are "relevant" for the user query.
- Doc: file (Word or PDF), web page, email, blog, e-book, ...
- Query: "bag of words" paradigm
- Relevant ?!?
Two main difficulties
The Web:
- Language and encodings: hundreds...
- Distributed authorship: SPAM, format-less, ...
- Dynamic: in one year 35% of pages survive, 20% are untouched
Extracting "significant data" is difficult!!
The User:
- Query composition: short (2.5 terms on avg) and imprecise
- Query results: 85% of users look at just one result page
- Several needs: informational, navigational, transactional
Matching "user needs" is difficult!!
Evolution of Search Engines
- First generation (1995-1997: AltaVista, Excite, Lycos, ...) -- uses only on-page, web-text data: word frequency and language.
- Second generation (1998: Google) -- also uses off-page, web-graph data: link (or connectivity) analysis and anchor text (how people refer to a page).
- Third generation (Google, Yahoo, MSN, Ask, ...) -- answers "the need behind the query": focuses on the "user need" rather than on the query, integrates multiple data sources, exploits click-through data.
- Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research].
This is a search engine!!!
Wolfram Alpha
Clusty
Yahoo! Correlator
Web Algorithmics: The Structure of a Search Engine
The structure
[Diagram: a Web Crawler, steered by a Control module, feeds a Page archive; a Page Analyzer extracts text and structure for the Indexer, which builds the text, structure, and auxiliary indexes; at query time a Query resolver and a Ranker produce the answer.]
Generating the snippets!
The big fight: find the best ranking...
Ranking: Google vs Google.cn
Problem: Indexing
Consider the English Wikipedia:
- Collection size ≈ 10 GB
- #docs ≈ 4 × 10^6
- #terms in total > 1 billion (avg term length = 6 chars)
- #distinct terms = several millions
Which kind of data structure do we build to support word-based searches?
DB-based solution: the term-doc incidence matrix
Entry (t, d) is 1 if doc d contains term t, 0 otherwise.
With #terms > 1M and #docs ≈ 4M, space ≈ 4 Tb!
Current solution: the inverted index
Brutus -> 1, 2, 3, 5, 8, 13, 21, 34
the -> 2, 4, 6, 10, 32
Calpurnia -> 13, 16
Currently these indexes take about 13% of the original text. A rare term like Calpurnia may use log2 N bits per occurrence; a frequent term like the should take about 1 bit per occurrence.
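As a sketch of the idea (the toy documents and docIDs below are invented for illustration, not from the slides), an inverted index can be built in a few lines of Python:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = {1: "Brutus killed Caesar", 2: "Caesar the dictator", 13: "Calpurnia wife of Caesar"}
index = build_index(docs)
print(index["caesar"])     # -> [1, 2, 13]
print(index["calpurnia"])  # -> [13]
```

Real indexes, of course, store the posting lists compressed on disk rather than as Python lists.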
Gap-coding for postings
Sort the docIDs, then store the gaps between consecutive docIDs:
Brutus: 33, 47, 154, 159, 202, ... becomes 33, 14, 107, 5, 43, ...
Two advantages:
- Space: we store smaller integers (clustering?)
- Speed: a query requires just a scan
γ code for integer encoding
For v > 0, let Length = ⌊log2 v⌋ + 1. The γ code writes Length-1 in unary, followed by the Length-1 bits of v after its leading 1. E.g., v = 9 (binary 1001) is represented as 1110 001. The γ code of v takes 2⌊log2 v⌋ + 1 bits, i.e. a factor of 2 from optimal. It is optimal for Pr(v) = 1/(2v^2) and i.i.d. integers.
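A short encoder/decoder for the γ code just described (the unary prefix here is written as 1s terminated by a 0, one common convention):

```python
def gamma_encode(v):
    """Elias gamma: Length-1 in unary (1s, then a 0), then v minus its leading 1-bit."""
    assert v > 0
    n = v.bit_length() - 1            # Length - 1
    if n == 0:
        return "0"                    # v = 1 encodes as a single bit
    return "1" * n + "0" + format(v - (1 << n), "b").zfill(n)

def gamma_decode(bits):
    n = bits.index("0")               # number of leading 1s = Length - 1
    return (1 << n) + (int(bits[n + 1:], 2) if n else 0)

print(gamma_encode(9))  # -> 1110001  (2*floor(log2 9) + 1 = 7 bits)
```

Note the code is prefix-free: the unary part tells the decoder exactly how many binary digits follow.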
Rice code (a simplification of the Golomb code)
It is a parametric code that depends on k (a power of 2). Write the quotient q = ⌊(v-1)/k⌋ in unary (q zeros followed by a 1), and the remainder r = v - k*q - 1 in log2 k bits. Useful when the integers are concentrated around k.
How do we choose k? Usually k ≈ 0.69 * mean(v) [Bernoulli model]: the code is optimal for Pr(v) = p(1-p)^(v-1), where mean(v) = 1/p, and i.i.d. integers.
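A sketch of the Rice code with the slide's conventions (q zeros, a 1, then the remainder in log2 k bits; k restricted to powers of two ≥ 2):

```python
def rice_encode(v, k):
    """Rice code: quotient in unary (0s then a 1), remainder in log2(k) fixed bits."""
    assert v > 0 and k >= 2 and k & (k - 1) == 0
    q, r = (v - 1) // k, (v - 1) % k
    return "0" * q + "1" + format(r, "b").zfill(k.bit_length() - 1)

def rice_decode(bits, k):
    q = bits.index("1")               # unary part
    r = int(bits[q + 1:], 2)          # fixed-width remainder
    return q * k + r + 1

print(rice_encode(10, 4))  # q = 2, r = 1 -> "00101"
```

Because k is a power of two, the remainder has a fixed width and needs no extra machinery, which is exactly the simplification over the general Golomb code.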
PForDelta coding
Take a block of 128 numbers and use b bits (e.g., b = 2) to encode each of them; values that do not fit in b bits become exceptions, encoded separately via an escape value (ESC) or pointers. Choose b so that about 90% of the values fit, or by trade-off: a larger b wastes bits on small values, a smaller b produces more exceptions. Translate the data from [base, base + 2^b - 1] to [0, 2^b - 1].
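A much-simplified sketch of the idea (real PForDelta bit-packs the b-bit slots and chains exceptions through the patched slots; here exceptions are just kept in a side list, an assumption for clarity):

```python
def pfor_encode(block, b):
    """Store values < 2^b in the b-bit area; larger ones become exceptions."""
    limit = 1 << b
    slots, exceptions = [], []
    for i, v in enumerate(block):
        if v < limit:
            slots.append(v)
        else:
            slots.append(0)            # placeholder in the b-bit area
            exceptions.append((i, v))  # patched back on decode
    return slots, exceptions

def pfor_decode(slots, exceptions):
    out = list(slots)
    for i, v in exceptions:
        out[i] = v
    return out

block = [0, 1, 4, 2, 2, 3, 1, 1, 1, 0, 42]
slots, exc = pfor_encode(block, 2)    # with b = 2, the values 4 and 42 overflow
print(pfor_decode(slots, exc) == block)  # -> True
```

The decoder is branch-free over the slots and patches only the few exceptions, which is why PForDelta decompresses so fast.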
Interpolative coding
gaps = 1 1 1 2 2 2 2 4 3 1 1 1
M    = 1 2 3 5 7 9 11 15 18 19 20 21
Recursive coding: a preorder traversal of a balanced binary tree.
At every step we know (initially, they are encoded): num = |M| = 12, Lidx = 1, low = 1, Ridx = 12, hi = 21.
Take the middle element: h = (Lidx + Ridx)/2 = 6, so M[6] = 9, left_size = h - Lidx = 5, right_size = Ridx - h = 6.
Then low + left_size = 1 + 5 = 6 ≤ M[h] ≤ hi - right_size = 21 - 6 = 15, so we can encode 9 in ⌈log2(15 - 6 + 1)⌉ = 4 bits.
Recurse on the left half with lo = 1, hi = 9 - 1 = 8, num = 5, and on the right half with lo = 9 + 1 = 10, hi = 21, num = 6.
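The recursion above can be sketched by counting the bits it would emit (encoding each middle element as a fixed-width number within its provable bounds; 0-indexed, so the slide's h = 6 becomes index 5):

```python
from math import ceil, log2

def interp_bits(M, lo, hi):
    """Bits used by binary interpolative coding for the sorted list M within [lo, hi]."""
    if not M:
        return 0
    h = (len(M) - 1) // 2             # middle element (slide's 1-based h, shifted)
    left, mid, right = M[:h], M[h], M[h + 1:]
    low_b = lo + len(left)            # mid is at least this...
    high_b = hi - len(right)          # ...and at most this
    bits = ceil(log2(high_b - low_b + 1)) if high_b > low_b else 0
    return bits + interp_bits(left, lo, mid - 1) + interp_bits(right, mid + 1, hi)

M = [1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21]
print(interp_bits(M, 1, 21))  # -> 19 (the root element 9 alone costs 4 bits)
```

Note the pay-off of shrinking bounds: when a range is fully packed (high_b = low_b), the element is forced and costs zero bits, as happens for the run 18, 19, 20, 21.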
Query processing
1) Retrieve all pages matching the query, e.g. intersect the posting lists of its terms:
Brutus -> 1, 2, 3, 5, 8, 13, 21, 34
the -> 2, 4, 6, 13, 32
Caesar -> 4, 13, 17
Some optimization
Best order for query processing? Shorter lists first...
Query: Brutus AND Calpurnia AND The
Brutus -> 1, 2, 3, 5, 8, 13, 21, 34
The -> 2, 4, 6, 13, 32
Calpurnia -> 4, 13, 17
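The two slides above can be sketched together: a linear merge-scan intersection of two sorted lists, applied shortest-list-first (the posting values are the reconstructed toy lists from the slides):

```python
def intersect(a, b):
    """Merge-scan intersection of two sorted posting lists, O(|a| + |b|)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(postings):
    """Process shortest lists first, keeping intermediate results small."""
    lists = sorted(postings, key=len)
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
    return result

brutus = [1, 2, 3, 5, 8, 13, 21, 34]
the = [2, 4, 6, 13, 32]
calpurnia = [4, 13, 17]
print(and_query([brutus, the, calpurnia]))  # -> [13]
```

Starting from Calpurnia's 3-entry list means every later intersection scans at most 3 candidates, instead of Brutus's 8.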
Phrase queries
Expand the posting lists with word positions:
to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
Larger space occupancy, 5÷8% on the Web.
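With positional postings, the phrase "to be" is answered by intersecting on docIDs and then checking adjacent positions; a sketch on the slide's own lists:

```python
def phrase_hits(pos1, pos2):
    """Docs where some occurrence of term2 sits exactly one position after term1."""
    hits = []
    for doc in pos1.keys() & pos2.keys():   # docs containing both terms
        nxt = set(pos2[doc])
        if any(p + 1 in nxt for p in pos1[doc]):
            hits.append(doc)
    return sorted(hits)

to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_hits(to, be))  # doc 4: "to" at position 16, "be" at 17 -> [4]
```

The same positional check generalizes to proximity queries by testing |p1 - p2| ≤ w instead of p2 = p1 + 1.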
Query processing
1) Retrieve all pages matching the query.
2) Order the pages according to various scores:
- Term position & frequency (body, title, anchor, ...)
- Link popularity
- User clicks or preferences
The structure (recap)
[Diagram, as before: Web Crawler, Page archive, Control, Page Analyzer (text, structure), Indexer with auxiliary indexes, Query resolver, Ranker.]
Web Algorithmics: Text-based Ranking (1st generation)
A famous "weight": tf-idf (the Vector Space model)
- tf_{t,d} = frequency of term t in doc d = #occ of t in d / |d|
- idf_t = log(n / n_t), where n_t = #docs containing term t and n = #docs in the indexed collection
- weight of t in d = tf_{t,d} × idf_t
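A minimal sketch of these formulas (the two toy documents are invented for illustration):

```python
from math import log

def tf_idf(docs):
    """w[(t, d)] = tf(t, d) * idf(t), with tf = #occ/|d| and idf = log(n/n_t)."""
    n = len(docs)
    df = {}                                   # n_t: #docs containing term t
    for d, words in docs.items():
        for t in set(words):
            df[t] = df.get(t, 0) + 1
    w = {}
    for d, words in docs.items():
        for t in set(words):
            tf = words.count(t) / len(words)  # #occ of t in d / |d|
            w[(t, d)] = tf * log(n / df[t])
    return w

docs = {1: ["brutus", "killed", "caesar"], 2: ["caesar", "caesar", "died"]}
w = tf_idf(docs)
print(round(w[("brutus", 1)], 3))  # rare term: tf = 1/3, idf = log 2 -> 0.231
print(w[("caesar", 1)])            # in every doc: idf = log(2/2) = 0 -> 0.0
```

The idf factor is what kills terms like "the": present everywhere, they carry no discriminating power and get weight zero.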
A graphical example
[Figure: docs d1, ..., d5 as vectors in the space of the terms t1, t2, t3.]
Postulate: documents that are "close together" in the vector space talk about the same things. The Euclidean distance is sensitive to vector length, so we use the cosine instead: cos(α) = v · w / (||v|| × ||w||). The user query is just a very short doc. This measure is easy to spam, and sophisticated algorithms are needed to find the top-k docs for a query Q.
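Why cosine rather than Euclidean distance: a doc and its doubled copy point in the same direction, so their cosine is 1 even though their Euclidean distance is large (vectors below are made-up toy examples):

```python
from math import sqrt

def cosine(v, w):
    """cos(alpha) = v·w / (||v|| * ||w||): closeness independent of vector length."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w)))

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1, twice the length
print(round(cosine(d1, d2), 6))  # -> 1.0, though Euclidean distance separates them
```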
Approximate top-k results
Preprocess: assign to each term its m best documents.
Search: if Q has q terms, merge their preferred lists (≤ mq answers), compute the cosine between Q and these docs only, and choose the top k. Empirically, we need to pick m > k for this to work well.
Today search engines use tf-idf PLUS PageRank (PLUS other weights).
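A sketch of this "champion lists" idea (the per-term lists and the precomputed scores below are illustrative stand-ins for cos(Q, d)):

```python
def approx_top_k(champions, scores, query_terms, k):
    """Merge the m best docs of each query term, score only those, keep the top k."""
    candidates = set()
    for t in query_terms:
        candidates.update(champions.get(t, []))   # union of preferred lists
    ranked = sorted(candidates, key=lambda d: scores[d], reverse=True)
    return ranked[:k]

champions = {"brutus": [3, 8], "caesar": [3, 5]}  # m = 2 docs per term
scores = {3: 0.9, 5: 0.7, 8: 0.4}                 # stand-in for cos(Q, d)
print(approx_top_k(champions, scores, ["brutus", "caesar"], 2))  # -> [3, 5]
```

The answer is approximate: a doc outside every champion list can never appear, which is why m must exceed k in practice.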
Web Algorithmics: Link-based Ranking (2nd generation)
Query-independent ordering
First generation: use link counts as simple measures of popularity.
- Undirected popularity: each page gets a score given by the number of its in-links plus the number of its out-links (e.g. 3 + 2 = 5).
- Directed popularity: score of a page = number of its in-links (e.g. 3).
Easy to SPAM
Second generation: PageRank
Each link has its own importance!! PageRank is independent of the query, and has many interpretations...
Basic Intuition...
[Figure: a random surfer jumps to any node with probability 1-d, and follows a link to one of the current page's neighbors with probability d.]
Google's PageRank
r(i) = (1-d)/N + d × Σ_{j ∈ B(i)} r(j)/#out(j)
where B(i) is the set of pages linking to i, #out(j) is the number of outgoing links from j, and (1-d)/N is a fixed value. The vector r is the principal eigenvector of the (damped) transition matrix.
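The fixed-point of the formula above can be reached by power iteration; a minimal sketch (the 3-node toy graph and d = 0.85 are illustrative assumptions, and every node is assumed to have at least one out-link):

```python
def pagerank(links, d=0.85, iters=50):
    """r[i] = (1-d)/N + d * sum(r[j]/#out(j) for j in B(i)), by power iteration."""
    nodes = sorted(links)
    n = len(nodes)
    r = {i: 1.0 / n for i in nodes}           # start from the uniform vector
    for _ in range(iters):
        new = {i: (1 - d) / n for i in nodes} # the fixed (1-d)/N part
        for j, outs in links.items():
            share = d * r[j] / len(outs)      # j splits its rank among its out-links
            for i in outs:
                new[i] += share
        r = new
    return r

# toy graph: A -> B, A -> C, B -> C, C -> A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
r = pagerank(links)
print(max(r, key=r.get))  # "C" collects links from both A and B
```

Since the graph has no dangling nodes, the total rank stays 1 at every iteration, matching the steady-state-probability reading of the next slide.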
Three different interpretations
- Graph (intuitive interpretation): co-citation.
- Matrix (easy for computation): eigenvector computation, or the solution of a linear system.
- Markov Chain (useful to prove convergence): a sort of usage simulation, the random surfer who follows a neighbor with probability d and jumps to any node with probability 1-d. "In the steady state" each page has a long-term visit rate -- use this as the page's score.
PageRank: use in Search Engines
Preprocessing:
- Given the graph of links, build the matrix L.
- Compute its principal eigenvector r; r[i] is the PageRank of page i (we are interested in the relative order).
Query processing:
- Retrieve the pages containing the query terms.
- Rank them by their PageRank.
The final order is query-independent.