Introduction to Information Retrieval

Slides:



Advertisements
Similar presentations
Web Marketing for Dummies Written by Jan Zimmerman Reviewed by Paige Petersen.
Advertisements

Topics we will discuss tonight: 1.Introduction to Google Adwords platform 2.Understanding how to text ads are used. Display advertising will not be discussed.
CpSc 881: Information Retrieval. 2 Web search overview.
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Maximise Your Online Presence SEO & Social Media Strategies For Local Business Owners.
Google Is The Search King Google is the dominant search provider… for now! Microsoft’s BING! is on the rise. Mobile advertising is still young; will Google.
Hinrich Schütze and Christina Lioma Lecture 12: Language Models for IR
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Competition for Google AdWords A P EEP 142 4/13/06.
Link Analysis, PageRank and Search Engines on the Web
How to take advantage of search engines for your local business.. THE LAST FRONTIER LOCAL and MOBILE SEARCH Take advantage with…
SIMS Online advertising Hal Varian. SIMS Online advertising Banner ads (Doubleclick) –Standardized ad shapes with images –Normally not related to content.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 19: Web Search 1.
CLICK FRAUD Alexander Tuzhilin By Vinny Rey. Why was the study done? Google was getting sued by advertisers because of click fraud. Google agreed to have.
Google Online Marketing Challenge (GOMC)
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
Micro-Site and Business Directory Customer Sales Call Presentation Explaining and Selling PLATINUM & GOLD Micro-site Business Directory Listings to the.
AdWords Instructor: Dawn Rauscher. Quality Score in Action 0a2PVhPQhttp:// 0a2PVhPQ.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Introduction to SEO August 2011 NowSourcing, Inc..
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
The Online World ONLINE ADVERTISING BTEC IC&T – LEVEL 2.
Week 1 Introduction to Search Engine Optimization.
 SEO Terms A few additional terms Search site: This Web site lets you search through some kind of index or directory of Web sites, or perhaps both an.
Joseph Merrill, Alexander Plum, O’Bryne. * Google Translate is used to translate text from one language to another one of the 80 supported languages *
PPC Tutorial For Beginners. Content  PPC  Paid & Organic Advertisement  What is Search Engine?  How to set up account in Google Adwords? Target Your.
Understanding Credit Cards Learning about the little piece of plastic with a big responsibility Comparecards.com.
KiloBytes Technologies “New Face Of Technology” / Website: PPCwww.kilobytes.inPPC.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 19: Web Search 1.
Search Engine Optimization
SEARCH ENGINE OPTIMIZATION.
Personal Finance Credit.
Cybermalls Presentation
Introduction Adult website business is very big and it has loads of cash. You cannot imagine how much a single famous porn site makes a day. There are.
22C:145 Artificial Intelligence
Google Is The Search King
Do you know these international brands?
KiloBytes Technologies “New Face Of Technology”
بازاریابی دیجیتال در یک نگاه
SEARCH ENGINE OPTIMIZATION
CSC 102 Lecture 12 Nicholas R. Howe
Technology and Marketing The Awesomeness That Is Google Part II
Search Engines and Link Analysis on the Web
Online Advertising and Ad Auctions at Google
Website Building & E-Commerce for Your Pure Water Business
Digital Marketing Overview
Digital Marketing Overview
Search engine advertising
What is Digital Marketing? What is the use of Digital Marketing? Strategies of Digital Marketing Opportunities Search Engine Optimization.
Key points Content :- What is Digital Marketing? What is the use of Digital Marketing? Strategies of Digital Marketing Opportunities Search Engine Optimization.
1 SEO is short for search engine optimization. Search engine optimization is a methodology of strategies, techniques and tactics used to increase the amount.
Google Is The Search King
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
Interactive media.
Fred Dirkse CEO, OIC Group, Inc.
Training Deck – SEM & Facebook Ads
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Calculating Investment Returns
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Maximizing Exposure for Your Non-Profit
Data Mining Chapter 6 Search Engines
Text Analysis and Search Analytics
SEMINAR FOUR: The Secrets of the PPC and Internet Marketing Millionaires
SEMINAR FIVE: Mobile Advertising on Google – Plus the Power of Remarketing
DIGITAL MARKETING SERVICES FOR YOUR BUSINESS ftlmedia.com.
Text Analysis and Search Analytics
Personal Finance Unit Unit 3.
Presentation transcript:

Introduction to Information Retrieval Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Introduction to Information Retrieval http://informationretrieval.org IIR 19: Web Search Schu¨tze: Web search

Overview 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

Recap Big picture Ads Duplicate detection Indexing anchor text Spam Web IR Size of the web Anchor text is often a better description of a page’s content than the page itself. Anchor text can be weighted more highly than the text on the page. A Google bomb is a search with “bad” results due to maliciously manipulated anchor text. [dangerous cult] on Google, Bing, Yahoo Schu¨tze: Web search

Recap Big picture PageRank Ads Duplicate detection Spam Web IR Size of the web Model: a web surfer doing a random walk on the web Formalization: Markov chain PageRank is the long-term visit rate of the random surfer or the steady-state distribution. Need teleportation to ensure well-defined PageRank Power method to compute PageRank PageRank is the principal left eigenvector of the transition probability matrix. Schu¨tze: Web search

Computing PageRank: Power method Recap Big picture Ads Duplicate detection Spam Web IR Computing PageRank: Power method Size of the web x1 Pt (d1) x2 Pt (d2) P11 = 0.1 P21 = 0.3 P12 = 0.9 P22 = 0.7 t0 1 0.3 0.7 = xx P t1 0.24 0.76 = xx P 2 t2 0.252 0.748 = xx P 3 t3 0.2496 0.7504 = xx P 4 t∞ 0.25 0.75 . . . 0.25 0.75 = xx P ∞ PageRank vector = x (π1, π2) = (0.25, 0.75) Pt (d1) = Pt−1(d1) ∗ P11 + Pt−1(d2) ∗ P21 Pt (d2) = Pt−1(d1) ∗ P12 + Pt−1(d2) ∗ P22 Schu¨tze: Web search

HITS: Hubs and authorities Recap Big picture Ads Duplicate detection HITS: Hubs and authorities Spam Web IR Size of the web hubs authorities www.bestfares.com www.aa.com www.airlinesquality.com www.delta.com blogs.usatoday.com/sky www.united.com aviationblog.dallasnews.com Schu¨tze: Web search

HITS update rules A: link matrix h: vector of hub scores Recap Big picture Ads Duplicate detection HITS update rules Spam Web IR Size of the web A: link matrix h: vector of hub scores A: vector of authority scores HITS algorithm: Compute h= Aa Compute a = AT h Iterate until convergence Output (i) list of hubs ranked according to hub score and (ii) list of authorities ranked according to authority score Schu¨tze: Web search

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

B1g p1cture Web search overview The Web Indexes

--- Search is a top activity on the web -iProspect" 81g p1cture 1 " " Search is a top activity on the web How often do you use search engines on the Internet? Several times each week N. least once each week Several times each month Less frequently Never --- 100 200 300 400 500 Number of Responses 600 700 -iProspect" Schutze Web search 11 / 123

I Without search engines, the web wouldn 't work I'I ' 1 1 1u r.... I Without search engines, the web wouldn 't work

Without search engines, the web wouldn’t work Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Without search engines, the web wouldn’t work Without search, content is hard to find. Schu¨tze: Web search 12 / 123

Without search engines, the web wouldn’t work Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Without search engines, the web wouldn’t work Without search, content is hard to find. → Without search, there is no incentive to create content. Schu¨tze: Web search 12 / 123

Without search engines, the web wouldn’t work Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Without search engines, the web wouldn’t work Without search, content is hard to find. → Without search, there is no incentive to create content. Why publish something if nobody will read it? Schu¨tze: Web search 12 / 123

Without search engines, the web wouldn’t work Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Without search engines, the web wouldn’t work Without search, content is hard to find. → Without search, there is no incentive to create content. Why publish something if nobody will read it? Why publish something if I don’t get ad revenue from it? Schu¨tze: Web search 12 / 123

Without search engines, the web wouldn’t work Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Without search engines, the web wouldn’t work Without search, content is hard to find. → Without search, there is no incentive to create content. Why publish something if nobody will read it? Why publish something if I don’t get ad revenue from it? Somebody needs to pay for the web. Schu¨tze: Web search 12 / 123

Without search engines, the web wouldn’t work Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Without search engines, the web wouldn’t work Without search, content is hard to find. → Without search, there is no incentive to create content. Why publish something if nobody will read it? Why publish something if I don’t get ad revenue from it? Somebody needs to pay for the web. Servers, web infrastructure, content creation A large part today is paid by search ads. Schu¨tze: Web search 12 / 123

Without search engines, the web wouldn’t work Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Without search engines, the web wouldn’t work Without search, content is hard to find. → Without search, there is no incentive to create content. Why publish something if nobody will read it? Why publish something if I don’t get ad revenue from it? Somebody needs to pay for the web. Servers, web infrastructure, content creation A large part today is paid by search ads. Search pays for the web. Schu¨tze: Web search 12 / 123

Recap Big picture Ads Duplicate detection Interest aggregation Spam Web IR Size of the web Unique feature of the web: A small number of geographically dispersed people with similar interests can find each other. Schu¨tze: Web search 13 / 123

Recap Big picture Ads Duplicate detection Interest aggregation Spam Web IR Size of the web Unique feature of the web: A small number of geographically dispersed people with similar interests can find each other. Elementary school kids with hemophilia Schu¨tze: Web search 13 / 123

Recap Big picture Ads Duplicate detection Interest aggregation Spam Web IR Size of the web Unique feature of the web: A small number of geographically dispersed people with similar interests can find each other. Elementary school kids with hemophilia People interested in translating R5R5 Scheme into relatively portable C (open source project) Schu¨tze: Web search 13 / 123

Recap Big picture Ads Duplicate detection Interest aggregation Spam Web IR Size of the web Unique feature of the web: A small number of geographically dispersed people with similar interests can find each other. Elementary school kids with hemophilia People interested in translating R5R5 Scheme into relatively portable C (open source project) Search engines are a key enabler for interest aggregation. Schu¨tze: Web search 13 / 123

IR on the web vs. IR in general Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Schu¨tze: Web search 14 / 123

IR on the web vs. IR in general Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Search is a key enabler of the web: . . . Schu¨tze: Web search 14 / 123

On the web, search is not just a nice feature. Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Search is a key enabler of the web: . . . . . . financing, content creation, interest aggregation etc. Schu¨tze: Web search 14 / 123

IR on the web vs. IR in general Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Search is a key enabler of the web: . . . . . . financing, content creation, interest aggregation etc. The web is a chaotic und uncoordinated collection. Schu¨tze: Web search 14 / 123

IR on the web vs. IR in general Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Search is a key enabler of the web: . . . . . . financing, content creation, interest aggregation etc. The web is a chaotic und uncoordinated collection. No control / restrictions on who can author content Schu¨tze: Web search 14 / 123

IR on the web vs. IR in general Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Search is a key enabler of the web: . . . . . . financing, content creation, interest aggregation etc. The web is a chaotic und uncoordinated collection. No control / restrictions on who can author content The web is very large. Schu¨tze: Web search 14 / 123

IR on the web vs. IR in general Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Search is a key enabler of the web: . . . . . . financing, content creation, interest aggregation etc. → look at search ads The web is a chaotic und uncoordinated collection. No control / restrictions on who can author content The web is very large. Schu¨tze: Web search 14 / 123

IR on the web vs. IR in general Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Search is a key enabler of the web: . . . . . . financing, content creation, interest aggregation etc. → look at search ads The web is a chaotic und uncoordinated collection. → lots of duplicates – need to detect duplicates No control / restrictions on who can author content The web is very large. Schu¨tze: Web search 14 / 123

IR on the web vs. IR in general Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Search is a key enabler of the web: . . . . . . financing, content creation, interest aggregation etc. → look at search ads The web is a chaotic und uncoordinated collection. → lots of duplicates – need to detect duplicates No control / restrictions on who can author content → lots of spam – need to detect spam The web is very large. Schu¨tze: Web search 14 / 123

IR on the web vs. IR in general Recap Big picture Ads Duplicate detection Spam IR on the web vs. IR in general Web IR Size of the web On the web, search is not just a nice feature. Search is a key enabler of the web: . . . . . . financing, content creation, interest aggregation etc. → look at search ads The web is a chaotic und uncoordinated collection. → lots of duplicates – need to detect duplicates No control / restrictions on who can author content → lots of spam – need to detect spam The web is very large. → need to know how big it is Schu¨tze: Web search 14 / 123

Take-away today Big picture Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Big picture Schu¨tze: Web search 15 / 123

Take-away today Big picture Ads – they pay for the web Recap Big picture Ads Take-away today Duplicate detection Spam Web IR Size of the web Big picture Ads – they pay for the web Schu¨tze: Web search 15 / 123

Take-away today Big picture Ads – they pay for the web Recap Big picture Ads Take-away today Duplicate detection Spam Web IR Size of the web Big picture Ads – they pay for the web Duplicate detection – addresses one aspect of chaotic content creation Schu¨tze: Web search 15 / 123

Take-away today Big picture Ads – they pay for the web Recap Big picture Ads Take-away today Duplicate detection Spam Web IR Size of the web Big picture Ads – they pay for the web Duplicate detection – addresses one aspect of chaotic content creation Spam detection – addresses one aspect of lack of central access control Schu¨tze: Web search 15 / 123

Take-away today Big picture Ads – they pay for the web Recap Big picture Ads Take-away today Duplicate detection Spam Web IR Size of the web Big picture Ads – they pay for the web Duplicate detection – addresses one aspect of chaotic content creation Spam detection – addresses one aspect of lack of central access control Probably won’t get to today Web information retrieval Size of the web Schu¨tze: Web search 15 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search 16 / 123

.-..·. ..... ..._... ""' F1rst generat1on of search ads Goto (1996) r;.. •r: t .;c r:< --h_.-.. A D-1= I :::..+.. :>.-<.. i:J .- r:•r F1rst generat1on of search ads ''!J:: 1r; r.. tit,. '"-t Goto (1996) .-..·. ..... ""' 2. C:p!dwnll8pnkorSeq Coost RaoJiy Wilmington's numbet omt teoil estate comp&r y. WWW.Cbse«''aSt.COin (Coltt4 •d..rtlJUi ..._... 3. Wil mington. N C RopJ Eslot o SQcky Dullard Evetythi(IQ vou n*<f to know abOut bvvn· g ot selhn.g a hom• e Of! my w•b sit &l www.rwwc• .ni/U (C:ort to •d., un JJlo.ll_)

First generation of search ads: Goto (1996) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web First generation of search ads: Goto (1996) Buddy Blake bid the maximum ($0.38) for this search. Schu¨tze: Web search 18 / 123

First generation of search ads: Goto (1996) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web First generation of search ads: Goto (1996) Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link. Schu¨tze: Web search 18 / 123

First generation of search ads: Goto (1996) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web First generation of search ads: Goto (1996) Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link. Pages were simply ranked according to bid – revenue maximization for Goto. Schu¨tze: Web search 18 / 123

First generation of search ads: Goto (1996) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web First generation of search ads: Goto (1996) Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link. Pages were simply ranked according to bid – revenue maximization for Goto. No separation of ads/docs. Only one result list! Schu¨tze: Web search 18 / 123

First generation of search ads: Goto (1996) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web First generation of search ads: Goto (1996) Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link. Pages were simply ranked according to bid – revenue maximization for Goto. No separation of ads/docs. Only one result list! Upfront and honest. No relevance ranking, . . . Schu¨tze: Web search 18 / 123

First generation of search ads: Goto (1996) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web First generation of search ads: Goto (1996) Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link. Pages were simply ranked according to bid – revenue maximization for Goto. No separation of ads/docs. Only one result list! Upfront and honest. No relevance ranking, . . . . . . but Goto did not pretend there was any. Schu¨tze: Web search 18 / 123

Second generation of search ads: Google (2000/2001) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Second generation of search ads: Google (2000/2001) Strict separation of search results and search ads Schu¨tze: Web search 19 / 123

Two ranked lists: web pages (left) and ads (right) I q 1 I 1 111 1 1 1r Ads I 1111 I 1 11 I 1 1 1 11 11 1111 I I I 1 I 1 11 I Two ranked lists: web pages (left) and ads (right) Web .lrn.aQ..e..s_ M.aQ.s_ .s.b.QQQing .G.mai! ffiQ_[e_ Coogle"ld'i:s-c:o:::u:cn:t:-b.::ro:"k-e::-r:--------- Search ! Web Results 1-10 of about 807,000 for discount broker [.d..efi.nllion . (0.12 seconds) Sponsored Links Rated #1 Online Broker Discount Broker Reviews Information on online discount brokers emphasizing rates, charges, and customer comments and complaints www.broker-reviews .us/- 94k- .c..a.c.h.e..cl - Sl.rn.i!.ar..Q. No Minimums. No Inactivity Fee Transfer to FirstradeforFree! Discount Broker Rankings l2008 Broker Survey) at SmartMoney com Discount Brokers. Rank/ Brokerage/ Minimum to Open Account , Comments , Standard Commis- sion•, Reduced Commission .Account Fee Per Year (How to Avoid) , Avg. ... www.smartmoney.com/ brokerslindex.cfm?story =2004-discount -table -121k­ .c..a..c.b..e.. - Si.rn.i!.ar..Q. Discount Broker Commissionfreetradesfor30days No maintenance fees. Sign up now TDAMER ITRAD E.com Stock Brokers I Discount Brokers I Online Brokers Most Recommended. Top 5 Brokers headlines. 10. Don't Pay YourBrokerforFree Funds May 15 at 3:39PM. 5. Don't Discount the Discounters Apr 18 at 2:41 PM ... www.fool.com/investinglb rokerslindex.aspx- 44k- .c..a.c.b..e..d - Sl.rn.i!.ar..Q. TradeKing - Online Broker $4.95perTrade,MarketorLimit www.TradeKing.com SmartMo ney Top Discount Broker 200i Scottrade Brokerage $7 Trades , No Share Limit. In-Depth Research. Start Trad ing Online Now! www.Scottrade.com Ojscount Broker Discount Broker- Definition of Discount Broker on lnvestopedia- A stockbroker who carries out buy and sell orders at a reduced commission compared to a ... www.investopedia.com/terrns/d/ discountbrok er.asp- 31k- - Sl.rn.i!.ar..Q. Stock trades $1 50- $3 100freetrades,upto$100back fortransfercosts,$500minimum www.sogotrade.com Discount Brokerage and Onljne Trading for Smart Stock Market ... Online stock broker SogoTrade offers the best in discount brokerage investing. Get stock marketquotesfromthisintemetstocktradingcompany www.sogotrade.com/- 39k- .c..a.c.h.e..cl - Sl.rn.i!.ar..Q. $3 95 Onljne Stock Trades Market/Limit Orders, No Share Limit and No Inactivity Fees www.Marsco.com 15 questions to as k djscount brokers - MSN Money Jan 11, 2004 ... If you're not big on hand-holding when it comes to investing, a discount broker can be an economical way to go. Just be sure to ask these ... moneycentral.msn.corn/content/ lnvesting/Startinvesting/P66171.asp- 34k­ - Sl.rn.i!.ar..Q. Schutze Web search 20 123

Two ranked lists: web pages (left) and ads (right) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Two ranked lists: web pages (left) and ads (right) SogoTrade pears in ads. ap- Schu¨tze: Web search 20 / 123

Two ranked lists: web pages (left) and ads (right) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Two ranked lists: web pages (left) and ads (right) SogoTrade pears in results. ap- search SogoTrade pears in ads. ap- Schu¨tze: Web search 20 / 123

Two ranked lists: web pages (left) and ads (right) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Two ranked lists: web pages (left) and ads (right) SogoTrade pears in results. ap- search SogoTrade pears in ads. ap- Do search engines rank advertis- ers higher than non-advertisers? Schu¨tze: Web search 20 / 123

Two ranked lists: web pages (left) and ads (right) Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Two ranked lists: web pages (left) and ads (right) SogoTrade pears in results. ap- search SogoTrade pears in ads. ap- Do search engines rank advertis- ers higher than non-advertisers? All major search engines claim no. Schu¨tze: Web search 20 / 123

I Do ads influence editorial content?

Do ads influence editorial content? Recap Big picture Ads Duplicate detection Spam Web IR Do ads influence editorial content? Size of the web Similar problem at newspapers / TV channels Schu¨tze: Web search 21 / 123

Do ads influence editorial content? Recap Big picture Ads Duplicate detection Spam Web IR Do ads influence editorial content? Size of the web Similar problem at newspapers / TV channels A newspaper is reluctant to publish harsh criticism of its major advertisers. Schu¨tze: Web search 21 / 123

Do ads influence editorial content? Recap Big picture Ads Duplicate detection Spam Web IR Do ads influence editorial content? Size of the web Similar problem at newspapers / TV channels A newspaper is reluctant to publish harsh criticism of its major advertisers. The line often gets blurred at newspapers / on TV. Schu¨tze: Web search 21 / 123

Do ads influence editorial content? Recap Big picture Ads Duplicate detection Spam Web IR Do ads influence editorial content? Size of the web Similar problem at newspapers / TV channels A newspaper is reluctant to publish harsh criticism of its major advertisers. The line often gets blurred at newspapers / on TV. No known case of this happening with search engines yet? Schu¨tze: Web search 21 / 123

How are the ads on the right ranked? Web .lrn.aQ..e..s_ M.aQ.s_ .s.b.QQQing .G.mai! ffiQ_[e_ Coogle !disco unt broker Web Results 1- 10 of about 807,000 for discount broker [.d..efinllion] . (0.12 seconds) Sponsored Links Rated #1 Onljne Broker Discount Broker Reviews Information on online discount brokers emphasizing rates, charges , and customer comments and complaints www.broker-reviews .us/- 94k- .c..a.c.h.e..cl - S.imi!.ar..Q..a No Minimums. No Inactivity Fee Transfer to Firstrade for Free! www.firstrade.com Discount Broker Rankings l2008 Broker Survey) at SmartMoney com Discount Brokers . Rank/ Brokerage/ Minimum to Open Account , Comments , Standard Commis - sian•, Reduced Commission .Account Fee Per Year (How to Avoid), Avg. .. www.smartmoney.com/ brokerslindex.cfm?story =2004-discount -table- 121k­ _c a c_b_e c - S.imi!.ar..Q..a Discount Broker Commission free trades for 30 days No maintenance fees. Sign up now TDAMER ITRADE.com TradeKjng - Online Broker $4 .95 per Trade, Market or Limit SmartMoney Top Discount Broker 200i www.TradeKing.com Stock Brokers I Discount Brokers I Onljne Brokers Most Recommended. Top 5 Brokers headlines. 10. Don't Pay Your Broker for Free Funds May 15 at 3:39PM. 5. Don't Discount the Discounters Apr 18 at 2 :41PM .. www.fool.com/investinglb rokers/index.aspx- 44k- .c..a.c.b..e.d. - Slm.i!.ar..Q..a Scottrade Brokerage $7 Trades, No Share Limit. In-Depth Research. Start Trading Online Now ! www.Scottrade.com Ojscount Broker Discount Broker - Definition of Discount Broker on lnvestopedia - A stockbroker who carries out buy and sell orders at a reduced commission compared to a .. www.investopedia.com/terms/d/ discountbrok er.asp- 31k- - S.imi!.ar..Q..a Stock trades $1 50 - $3 100 free trades, up to $100 back for transfer costs, $500 minimum www.sogotrade.com Discount Brokerage and Online Tradjng for Smart Stock Market , Online stoc k broker SogoTrade offers the best in discount brokerage investing. Get stock market quotes from this internet stock trading company www.sogotrade.com/- 39k- .c..a..c.h.eQ - S.imi!.ar..Q..a $3 95 Online Stock Trades Market/Limit Orders, No Share Limit and No Inactivity Fees www .Marsco.com 15 questjons to ask djscount brokers- MSN Money Jan 11, 2004 ... If you're not big on hand-holding when it comes to investing, a discount broker can be an economica l way to go. Just be sure to ask these .. moneycentral.msn.com/content/lnvesting/Startinvesting/P66171.asp- 34k­ - S.imi!.ar..Q..a Schutze Web search 22 / 123

Ad< I How are ads ranked?

How are ads ranked? Advertisers bid for keywords – sale by auction. Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web Advertisers bid for keywords – sale by auction. Schu¨tze: Web search 23 / 123

How are ads ranked? Advertisers bid for keywords – sale by auction. Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web Advertisers bid for keywords – sale by auction. Open system: Anybody can participate and bid on keywords. Schu¨tze: Web search 23 / 123

How are ads ranked? Advertisers bid for keywords – sale by auction. Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web Advertisers bid for keywords – sale by auction. Open system: Anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on your ad. Schu¨tze: Web search 23 / 123

How are ads ranked? Advertisers bid for keywords – sale by auction. Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web Advertisers bid for keywords – sale by auction. Open system: Anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on your ad. How does the auction determine an ad’s rank and the price paid for the ad? Schu¨tze: Web search 23 / 123

How are ads ranked? Advertisers bid for keywords – sale by auction. Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web Advertisers bid for keywords – sale by auction. Open system: Anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on your ad. How does the auction determine an ad’s rank and the price paid for the ad? Basis is a second price auction, but with twists Schu¨tze: Web search 23 / 123

How are ads ranked? Advertisers bid for keywords – sale by auction. Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web Advertisers bid for keywords – sale by auction. Open system: Anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on your ad. How does the auction determine an ad’s rank and the price paid for the ad? Basis is a second price auction, but with twists For the bottom line, this is perhaps the most important research area for search engines – computational advertising. Schu¨tze: Web search 23 / 123

How are ads ranked? Advertisers bid for keywords – sale by auction. Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web Advertisers bid for keywords – sale by auction. Open system: Anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on your ad. How does the auction determine an ad’s rank and the price paid for the ad? Basis is a second price auction, but with twists For the bottom line, this is perhaps the most important research area for search engines – computational advertising. Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine. Schu¨tze: Web search 23 / 123

Ad< I How are ads ranked?

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will We don’t want to show nonrelevant or offensive ads. Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will We don’t want to show nonrelevant or offensive ads. Instead: rank based on bid price and relevance Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will We don’t want to show nonrelevant or offensive ads. Instead: rank based on bid price and relevance Key measure of ad relevance: clickthrough rate Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will We don’t want to show nonrelevant or offensive ads. Instead: rank based on bid price and relevance Key measure of ad relevance: clickthrough rate clickthrough rate = CTR = clicks per impressions Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will We don’t want to show nonrelevant or offensive ads. Instead: rank based on bid price and relevance Key measure of ad relevance: clickthrough rate clickthrough rate = CTR = clicks per impressions Result: A nonrelevant ad will be ranked low. Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will We don’t want to show nonrelevant or offensive ads. Instead: rank based on bid price and relevance Key measure of ad relevance: clickthrough rate clickthrough rate = CTR = clicks per impressions Result: A nonrelevant ad will be ranked low. Even if this decreases search engine revenue short-term Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will We don’t want to show nonrelevant or offensive ads. Instead: rank based on bid price and relevance Key measure of ad relevance: clickthrough rate clickthrough rate = CTR = clicks per impressions Result: A nonrelevant ad will be ranked low. Even if this decreases search engine revenue short-term Hope: Overall acceptance of the system and overall revenue is maximized if users get useful information. Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will We don’t want to show nonrelevant or offensive ads. Instead: rank based on bid price and relevance Key measure of ad relevance: clickthrough rate clickthrough rate = CTR = clicks per impressions Result: A nonrelevant ad will be ranked low. Even if this decreases search engine revenue short-term Hope: Overall acceptance of the system and overall revenue is maximized if users get useful information. Other ranking factors: location, time of day, quality and loading speed of landing page Schu¨tze: Web search 24 / 123

How are ads ranked? First cut: according to bid price `a la Goto Recap Big picture Ads Duplicate detection How are ads ranked? Spam Web IR Size of the web First cut: according to bid price `a la Goto Bad idea: open to abuse Example: query [treatment for cancer?] → how to write your last will We don’t want to show nonrelevant or offensive ads. Instead: rank based on bid price and relevance Key measure of ad relevance: clickthrough rate clickthrough rate = CTR = clicks per impressions Result: A nonrelevant ad will be ranked low. Even if this decreases search engine revenue short-term Hope: Overall acceptance of the system and overall revenue is maximized if users get useful information. Other ranking factors: location, time of day, quality and loading speed of landing page The main ranking factor: the query Schu¨tze: Web search 24 / 123

Google AdWords demo Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Google AdWords demo Schu¨tze: Web search

Google’s second price auction Recap Big picture Ads Duplicate detection Spam Google’s second price auction Web IR Size of the web advertiser bid CTR ad rank rank paid A $4.00 0.01 0.04 4 (minimum) B $3.00 0.03 0.09 2 $2.68 C $2.00 0.06 0.12 1 $1.51 D $1.00 0.08 3 $0.51 Schu¨tze: Web search 26 / 123

Google’s second price auction Recap Big picture Ads Duplicate detection Spam Google’s second price auction Web IR Size of the web advertiser bid CTR ad rank rank paid A $4.00 0.01 0.04 4 (minimum) B $3.00 0.03 0.09 2 $2.68 C $2.00 0.06 0.12 1 $1.51 D $1.00 0.08 3 $0.51 bid: maximum bid for a click by advertiser CTR: click-through rate: when an ad is displayed, what percentage of time do users click on it? CTR is a measure of relevance. ad rank: bid × CTR: this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is rank: rank in auction paid: second price auction price paid by advertiser Schu¨tze: Web search 26 / 123

Google’s second price auction Recap Big picture Ads Duplicate detection Spam Google’s second price auction Web IR Size of the web advertiser bid CTR ad rank rank paid A $4.00 0.01 0.04 4 (minimum) B $3.00 0.03 0.09 2 $2.68 C $2.00 0.06 0.12 1 $1.51 D $1.00 0.08 3 $0.51 Second price auction: The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent). price1 × CTR1 = bid2 × CTR2 (this will result in rank1=rank2) price1 = bid2 × CTR2 / CTR1 p1 = bid2 × CTR2/CTR1 = 3.00 × 0.03/0.06 = 1.50 p2 = bid3 × CTR3/CTR2 = 1.00 × 0.08/0.03 = 2.67 p3 = bid4 × CTR4/CTR3 = 4.00 × 0.01/0.08 = 0.50 Schu¨tze: Web search 26 / 123

I Keywords with high bids Ad< I Keywords with high bids

Keywords with high bids Recap Big picture Ads Duplicate detection Keywords with high bids Spam Web IR Size of the web According to http://www.cwire.org/highest-paying-search-terms/ $69.1 $65.9 $62.6 $61.4 $59.4 $46.4 $40.1 $39.8 $39.2 $38.7 $38.0 $37.0 $35.9 mesothelioma treatment options personal injury lawyer michigan student loans consolidation car accident attorney los angeles online car insurance quotes arizona dui lawyer asbestos cancer home equity line of credit life insurance quotes refinancing equity line of credit lasik eye surgery new york city 2nd mortgage free car insurance quote Schu¨tze: Web search

I Search ads: A win-win-win?

Search ads: A win-win-win? Recap Big picture Ads Duplicate detection Search ads: A win-win-win? Spam Web IR Size of the web The search engine company gets revenue every time somebody clicks on an ad. Schu¨tze: Web search 28 / 123

Search ads: A win-win-win? Recap Big picture Ads Duplicate detection Search ads: A win-win-win? Spam Web IR Size of the web The search engine company gets revenue every time somebody clicks on an ad. The user only clicks on an ad if they are interested in the ad. Schu¨tze: Web search 28 / 123

Search engines punish misleading and nonrelevant ads. Recap Big picture Ads Duplicate detection Search ads: A win-win-win? Spam Web IR Size of the web The search engine company gets revenue every time somebody clicks on an ad. The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads. Schu¨tze: Web search 28 / 123

Search ads: A win-win-win? Recap Big picture Ads Duplicate detection Search ads: A win-win-win? Spam Web IR Size of the web The search engine company gets revenue every time somebody clicks on an ad. The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads. As a result, users are often satisfied with what they find after clicking on an ad. Schu¨tze: Web search 28 / 123

Search ads: A win-win-win? Recap Big picture Ads Duplicate detection Search ads: A win-win-win? Spam Web IR Size of the web The search engine company gets revenue every time somebody clicks on an ad. The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads. As a result, users are often satisfied with what they find after clicking on an ad. The advertiser finds new customers in a cost-effective way. Schu¨tze: Web search 28 / 123

A d< I Exercise

Recap Big picture Exercise Ads Duplicate detection Spam Web IR Size of the web Why is web search potentially more attractive for advertisers than TV spots, newspaper ads or radio spots? Schu¨tze: Web search 29 / 123

Recap Big picture Exercise Ads Duplicate detection Spam Web IR Size of the web Why is web search potentially more attractive for advertisers than TV spots, newspaper ads or radio spots? The advertiser pays for all this. How can the advertiser be cheated? Schu¨tze: Web search 29 / 123

Recap Big picture Exercise Ads Duplicate detection Spam Web IR Size of the web Why is web search potentially more attractive for advertisers than TV spots, newspaper ads or radio spots? The advertiser pays for all this. How can the advertiser be cheated? Any way this could be bad for the user? Schu¨tze: Web search 29 / 123

Recap Big picture Exercise Ads Duplicate detection Spam Web IR Size of the web Why is web search potentially more attractive for advertisers than TV spots, newspaper ads or radio spots? The advertiser pays for all this. How can the advertiser be cheated? Any way this could be bad for the user? Any way this could be bad for the search engine? Schu¨tze: Web search 29 / 123

I Not a win-win-win: Keyword arbitrage Ad< I Not a win-win-win: Keyword arbitrage

Not a win-win-win: Keyword arbitrage Recap Big picture Ads Duplicate detection Spam Web IR Not a win-win-win: Keyword arbitrage Size of the web Buy a keyword on Google Schu¨tze: Web search 30 / 123

Not a win-win-win: Keyword arbitrage Recap Big picture Ads Duplicate detection Spam Web IR Not a win-win-win: Keyword arbitrage Size of the web Buy a keyword on Google Then redirect traffic to a third party that is paying much more than you are paying Google. Schu¨tze: Web search 30 / 123

Not a win-win-win: Keyword arbitrage Recap Big picture Ads Duplicate detection Spam Web IR Not a win-win-win: Keyword arbitrage Size of the web Buy a keyword on Google Then redirect traffic to a third party that is paying much more than you are paying Google. E.g., redirect to a page full of ads Schu¨tze: Web search 30 / 123

Not a win-win-win: Keyword arbitrage Recap Big picture Ads Duplicate detection Spam Web IR Not a win-win-win: Keyword arbitrage Size of the web Buy a keyword on Google Then redirect traffic to a third party that is paying much more than you are paying Google. E.g., redirect to a page full of ads This rarely makes sense for the user. Schu¨tze: Web search 30 / 123

Not a win-win-win: Keyword arbitrage Recap Big picture Ads Duplicate detection Spam Web IR Not a win-win-win: Keyword arbitrage Size of the web Buy a keyword on Google Then redirect traffic to a third party that is paying much more than you are paying Google. E.g., redirect to a page full of ads This rarely makes sense for the user. Ad spammers keep inventing new tricks. Schu¨tze: Web search 30 / 123

Not a win-win-win: Keyword arbitrage Recap Big picture Ads Duplicate detection Spam Web IR Not a win-win-win: Keyword arbitrage Size of the web Buy a keyword on Google Then redirect traffic to a third party that is paying much more than you are paying Google. E.g., redirect to a page full of ads This rarely makes sense for the user. Ad spammers keep inventing new tricks. The search engines need time to catch up with them. Schu¨tze: Web search 30 / 123

I Not a win-win-win: Violation of trademarks

Not a win-win-win: Violation of trademarks Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Not a win-win-win: Violation of trademarks Example: geico Schu¨tze: Web search 31 / 123

Not a win-win-win: Violation of trademarks Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Not a win-win-win: Violation of trademarks Example: geico During part of 2005: The search term “geico” on Google was bought by competitors. Schu¨tze: Web search 31 / 123

Not a win-win-win: Violation of trademarks Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Not a win-win-win: Violation of trademarks Example: geico During part of 2005: The search term “geico” on Google was bought by competitors. Geico lost this case in the United States. Schu¨tze: Web search 31 / 123

Not a win-win-win: Violation of trademarks Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Not a win-win-win: Violation of trademarks Example: geico During part of 2005: The search term “geico” on Google was bought by competitors. Geico lost this case in the United States. Louis Vuitton lost similar case in Europe. Schu¨tze: Web search 31 / 123

Not a win-win-win: Violation of trademarks Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Not a win-win-win: Violation of trademarks Example: geico During part of 2005: The search term “geico” on Google was bought by competitors. Geico lost this case in the United States. Louis Vuitton lost similar case in Europe. See http://google.com/tm complaint.html Schu¨tze: Web search 31 / 123

Not a win-win-win: Violation of trademarks Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Not a win-win-win: Violation of trademarks Example: geico During part of 2005: The search term “geico” on Google was bought by competitors. Geico lost this case in the United States. Louis Vuitton lost similar case in Europe. See http://google.com/tm complaint.html It’s potentially misleading to users to trigger an ad off of a trademark if the user can’t buy the product on the site. Schu¨tze: Web search 31 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

Duplicate detection The web is full of duplicated content. Recap Big picture Ads Duplicate detection Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Exact duplicates Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Exact duplicates Easy to eliminate Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Exact duplicates Easy to eliminate E.g., use hash/fingerprint Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Exact duplicates Easy to eliminate E.g., use hash/fingerprint Near-duplicates Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Exact duplicates Easy to eliminate E.g., use hash/fingerprint Near-duplicates Abundant on the web Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Exact duplicates Easy to eliminate E.g., use hash/fingerprint Near-duplicates Abundant on the web Difficult to eliminate Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Exact duplicates Easy to eliminate E.g., use hash/fingerprint Near-duplicates Abundant on the web Difficult to eliminate For the user, it’s annoying to get a search result with near-identical documents. Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Exact duplicates Easy to eliminate E.g., use hash/fingerprint Near-duplicates Abundant on the web Difficult to eliminate For the user, it’s annoying to get a search result with near-identical documents. Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate. Schu¨tze: Web search 33 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web The web is full of duplicated content. More so than many other collections Exact duplicates Easy to eliminate E.g., use hash/fingerprint Near-duplicates Abundant on the web Difficult to eliminate For the user, it’s annoying to get a search result with near-identical documents. Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate. We need to eliminate near-duplicates. Schu¨tze: Web search 33 / 123

I Near-duplicates: Example [,u oil • J l •• j •• I ••• II• •ll I Near-duplicates: Example

Near-duplicates: Example wapedia. .,. Wiki :Michael Jackson (1/6) Dupl1( at"" d Ptf'.( t1on Near-duplicates: Example wapedia. .,. Wiki :Michael Jackson (1/6) For other persons named Michael Jackson, see Michael Jackson (disambig uation) . Michael Joseph Jackson (August 29, 1958- June 25, 2009) was an American recording artist, entertainer and businessman. The seventh child of the Jackson family, he made his debut as an entertainer in 1968 as a member of The MlchNIJoseph JackiOn (August 29, t958- .Mle 25. 2009) was an American recording artist. entertainer and businessman. The seventhchildol the JackSon lamlfy.he made hlsdebot asan Main Contonts Fea!LKedc:ontent Curntntevents Random artide ....,h ® C'-•"" l interaction hutz... \.V.:-b ..ln h 34 123

[,u oil • J l •• j •• I ••• II• •ll I Exercise

Exercise How would you eliminate near-duplicates on the web? Recap Big picture Exercise Ads Duplicate detection Spam Web IR Size of the web How would you eliminate near-duplicates on the web? Schu¨tze: Web search

I Detecting near-duplicates [,u oil • J l •• j •• I ••• II• •ll I Detecting near-duplicates

Detecting near-duplicates Recap Big picture Ads Duplicate detection Detecting near-duplicates Spam Web IR Size of the web Compute similarity with an edit-distance measure Schu¨tze: Web search 36 / 123

Detecting near-duplicates Recap Big picture Ads Duplicate detection Detecting near-duplicates Spam Web IR Size of the web Compute similarity with an edit-distance measure We want “syntactic” (as opposed to semantic) similarity. Schu¨tze: Web search 36 / 123

Detecting near-duplicates Recap Big picture Ads Duplicate detection Detecting near-duplicates Spam Web IR Size of the web Compute similarity with an edit-distance measure We want “syntactic” (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult to compute. Schu¨tze: Web search 36 / 123

Detecting near-duplicates Recap Big picture Ads Duplicate detection Detecting near-duplicates Spam Web IR Size of the web Compute similarity with an edit-distance measure We want “syntactic” (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult to compute. We do not consider documents near-duplicates if they have the same content, but express it with different words. Schu¨tze: Web search 36 / 123

Detecting near-duplicates Recap Big picture Ads Duplicate detection Detecting near-duplicates Spam Web IR Size of the web Compute similarity with an edit-distance measure We want “syntactic” (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult to compute. We do not consider documents near-duplicates if they have the same content, but express it with different words. Use similarity threshold θ to make the call “is/isn’t a near-duplicate”. Schu¨tze: Web search 36 / 123

Detecting near-duplicates Recap Big picture Ads Duplicate detection Detecting near-duplicates Spam Web IR Size of the web Compute similarity with an edit-distance measure We want “syntactic” (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult to compute. We do not consider documents near-duplicates if they have the same content, but express it with different words. Use similarity threshold θ to make the call “is/isn’t a near-duplicate”. E.g., two documents are near-duplicates if similarity > θ = 80%. Schu¨tze: Web search 36 / 123

I Represent each document as set of shingles [,u oil • J l •• j •• I ••• II• •ll I Represent each document as set of shingles

Represent each document as set of shingles Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as set of shingles A shingle is simply a word n-gram. Schu¨tze: Web search 37 / 123

Represent each document as set of shingles Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as set of shingles A shingle is simply a word n-gram. Shingles are used as features to measure syntactic similarity of documents. Schu¨tze: Web search 37 / 123

Represent each document as set of shingles Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as set of shingles A shingle is simply a word n-gram. Shingles are used as features to measure syntactic similarity of documents. For example, for n = 3, “a rose is a rose is a rose” would be represented as this set of shingles: Schu¨tze: Web search 37 / 123

Represent each document as set of shingles Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as set of shingles A shingle is simply a word n-gram. Shingles are used as features to measure syntactic similarity of documents. For example, for n = 3, “a rose is a rose is a rose” would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose } Schu¨tze: Web search 37 / 123

Represent each document as set of shingles Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as set of shingles A shingle is simply a word n-gram. Shingles are used as features to measure syntactic similarity of documents. For example, for n = 3, “a rose is a rose is a rose” would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose } We can map shingles to 1..2m (e.g., m = 64) by fingerprinting. Schu¨tze: Web search 37 / 123

Represent each document as set of shingles Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as set of shingles A shingle is simply a word n-gram. Shingles are used as features to measure syntactic similarity of documents. For example, for n = 3, “a rose is a rose is a rose” would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose } We can map shingles to 1..2m (e.g., m = 64) by fingerprinting. From now on: sk refers to the shingle’s fingerprint in 1..2m . Schu¨tze: Web search 37 / 123

Represent each document as set of shingles Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as set of shingles A shingle is simply a word n-gram. Shingles are used as features to measure syntactic similarity of documents. For example, for n = 3, “a rose is a rose is a rose” would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose } We can map shingles to 1..2m (e.g., m = 64) by fingerprinting. From now on: sk refers to the shingle’s fingerprint in 1..2m . We define the similarity of two documents as the Jaccard coefficient of their shingle sets. Schu¨tze: Web search 37 / 123

I Recall: Jaccard coefficient [,u oil • J l •• j •• I ••• II• •ll I Recall: Jaccard coefficient

Recall: Jaccard coefficient Recap Big picture Ads Duplicate detection Recall: Jaccard coefficient Spam Web IR Size of the web A commonly used measure of overlap of two sets Schu¨tze: Web search 38 / 123

Recall: Jaccard coefficient Recap Big picture Ads Duplicate detection Recall: Jaccard coefficient Spam Web IR Size of the web A commonly used measure of overlap of two sets Let A and B be two sets Schu¨tze: Web search 38 / 123

Recall: Jaccard coefficient Recap Big picture Ads Duplicate detection Recall: Jaccard coefficient Spam Web IR Size of the web A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient: jaccard(A, B ) = |A ∩ B | |A ∪ B | (A I= ∅ or B I= ∅) Schu¨tze: Web search 38 / 123

Recall: Jaccard coefficient Recap Big picture Ads Duplicate detection Recall: Jaccard coefficient Spam Web IR Size of the web A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient: jaccard(A, B ) = |A ∩ B | |A ∪ B | (A I= ∅ or B I= ∅) jaccard(A, A) = 1 Schu¨tze: Web search 38 / 123

Recall: Jaccard coefficient Recap Big picture Ads Duplicate detection Recall: Jaccard coefficient Spam Web IR Size of the web A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient: jaccard(A, B ) = |A ∩ B | |A ∪ B | (A I= ∅ or B I= ∅) jaccard(A, A) = 1 jaccard(A, B ) = 0 if A ∩ B = 0 Schu¨tze: Web search 38 / 123

Recall: Jaccard coefficient Recap Big picture Ads Duplicate detection Recall: Jaccard coefficient Spam Web IR Size of the web A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient: jaccard(A, B ) = |A ∩ B | |A ∪ B | (A I= ∅ or B I= ∅) jaccard(A, A) = 1 jaccard(A, B ) = 0 if A ∩ B = 0 A and B don’t have to be the same size. Schu¨tze: Web search 38 / 123

Recall: Jaccard coefficient Recap Big picture Ads Duplicate detection Recall: Jaccard coefficient Spam Web IR Size of the web A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient: jaccard(A, B ) = |A ∩ B | |A ∪ B | (A I= ∅ or B I= ∅) jaccard(A, A) = 1 jaccard(A, B ) = 0 if A ∩ B = 0 A and B don’t have to be the same size. Always assigns a number between 0 and 1. Schu¨tze: Web search 38 / 123

I Jaccard coefficient: Example [,u oil • J l •• j •• I ••• II• •ll I Jaccard coefficient: Example

Jaccard coefficient: Example Recap Big picture Ads Duplicate detection Spam Jaccard coefficient: Example Web IR Size of the web Three documents: d1: “Jack London traveled to Oakland” d2: “Jack London traveled to the city of Oakland” d3: “Jack traveled from Oakland to London” Schu¨tze: Web search 39 / 123

Jaccard coefficient: Example Recap Big picture Ads Duplicate detection Spam Jaccard coefficient: Example Web IR Size of the web Three documents: d1: “Jack London traveled to Oakland” d2: “Jack London traveled to the city of Oakland” d3: “Jack traveled from Oakland to London” Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)? Schu¨tze: Web search 39 / 123

Jaccard coefficient: Example Recap Big picture Ads Duplicate detection Spam Jaccard coefficient: Example Web IR Size of the web Three documents: d1: “Jack London traveled to Oakland” d2: “Jack London traveled to the city of Oakland” d3: “Jack traveled from Oakland to London” Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)? J(d1, d2) = 3/8 = 0.375 Schu¨tze: Web search 39 / 123

Jaccard coefficient: Example Recap Big picture Ads Duplicate detection Spam Jaccard coefficient: Example Web IR Size of the web Three documents: d1: “Jack London traveled to Oakland” d2: “Jack London traveled to the city of Oakland” d3: “Jack traveled from Oakland to London” Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)? J(d1, d2) = 3/8 = 0.375 J(d1, d3) = 0 Schu¨tze: Web search 39 / 123

Jaccard coefficient: Example Recap Big picture Ads Duplicate detection Spam Jaccard coefficient: Example Web IR Size of the web Three documents: d1: “Jack London traveled to Oakland” d2: “Jack London traveled to the city of Oakland” d3: “Jack traveled from Oakland to London” Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)? J(d1, d2) = 3/8 = 0.375 J(d1, d3) = 0 Note: very sensitive to dissimilarity Schu¨tze: Web search 39 / 123

I Represent each document as a sketch [,u oil • J l •• j •• I ••• II• •ll I Represent each document as a sketch

Represent each document as a sketch Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as a sketch The number of shingles per document is large. Schu¨tze: Web search 40 / 123

Represent each document as a sketch Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as a sketch The number of shingles per document is large. To increase efficiency, we will use a sketch, a cleverly chosen subset of the shingles of a document. Schu¨tze: Web search 40 / 123

Represent each document as a sketch Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as a sketch The number of shingles per document is large. To increase efficiency, we will use a sketch, a cleverly chosen subset of the shingles of a document. The size of a sketch is, say, n = 200 . . . Schu¨tze: Web search 40 / 123

Represent each document as a sketch Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as a sketch The number of shingles per document is large. To increase efficiency, we will use a sketch, a cleverly chosen subset of the shingles of a document. The size of a sketch is, say, n = 200 . . . . . . and is defined by a set of permutations π1 . . . π200. Schu¨tze: Web search 40 / 123

Represent each document as a sketch Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as a sketch The number of shingles per document is large. To increase efficiency, we will use a sketch, a cleverly chosen subset of the shingles of a document. The size of a sketch is, say, n = 200 . . . . . . and is defined by a set of permutations π1 . . . π200. Each πi is a random permutation on 1..2m Schu¨tze: Web search 40 / 123

Represent each document as a sketch Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Represent each document as a sketch The number of shingles per document is large. To increase efficiency, we will use a sketch, a cleverly chosen subset of the shingles of a document. The size of a sketch is, say, n = 200 . . . . . . and is defined by a set of permutations π1 . . . π200. Each πi is a random permutation on 1..2m The sketch of d is defined as: < mins∈d π1(s), mins∈d π2(s), . . . , mins∈d π200(s) > (a vector of 200 numbers). Schu¨tze: Web search 40 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Permutation and minimum: Example Size of the web document 1: {sk } document 2: {sk } 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m Schu¨tze: Web search 41 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Permutation and minimum: Example Size of the web document 1: {sk } document 2: {sk } • • • • • • • • 1 ✲ 2m 1 ✲ 2m s1 s2 s3 s4 s1 s5 s3 s4 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m Schu¨tze: Web search 41 / 123

✲ 2m ✲ 2m ✲ 2m ✲ 2m ✲ 2m ✲ 2m ✲ 2m Permutation and minimum: Example Recap Big picture Ads Duplicate detection Spam Web IR Permutation and minimum: Example Size of the web document 1: {sk } document 2: {sk } • • • • s1 s2 s3 s4 • • • • 1 ✲ 2m 1 ✲ 2m s1 s5 s3 s4 xk = π(sk ) xk = π(sk ) ❝ • • ❝ ❝ • • ❝ x3 x1 x4 x2 1 ✲ 2m ❝ • ❝ • ❝ • • ❝ ✲ 2m x3 x1 x4 1 x5 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m 1 ✲ 2m Schu¨tze: Web search 41 / 123

✲ 2m ✲ 2m ✲ 2m ✲ 2m ✲ 2m ✲ 2m Permutation and minimum: Example Recap Big picture Ads Duplicate detection Spam Web IR Permutation and minimum: Example Size of the web document 1: {sk } document 2: {sk } • • • • s1 s2 s3 s4 xk = π(sk ) ✲ 2m • • • • 1 1 ✲ 2m s1 s5 s3 s4 xk = π(sk ) ❝ • • ❝ ❝ • • ❝ x3 x1 x4 x2 ❝ • ❝ • ❝ • • x3 x1 x4 ❝ ✲ 2m 1 ✲ 2m 1 x5 xk xk ❝ ❝ x1 x5 1 ❝ x3 ❝ ❝ x1 x4 ❝ x2 ✲ 2m 1 ❝ x3 ❝ ✲ 2m x2 1 ✲ 2m 1 ✲ 2m Schu¨tze: Web search 41 / 123

✲ 2m ✲ 2m ✲ 2m ✲ 2m ✲ 2m ✲ 2m Permutation and minimum: Example Recap Big picture Ads Duplicate detection Spam Web IR Permutation and minimum: Example Size of the web document 1: {sk } document 2: {sk } • • • • s1 s2 s3 s4 ✲ 2m • • • • 1 1 ✲ 2m s1 s5 s3 s4 xk = π(sk ) xk = π(sk ) ❝ • • ❝ ❝ • • ❝ x3 x1 x4 x2 ✲ 2m ❝ • ❝ • ❝ • • x3 x1 x4 ❝ ✲ 2m 1 1 x5 xk ❝ ❝ x1 x4 xk ❝ ❝ x1 x5 1 ❝ x3 ❝ x2 ✲ 2m 1 ❝ x3 ❝ ✲ 2m x2 minsk π(sk ) minsk π(sk ) ❝ x3 1 ✲ 2m ❝ x3 1 ✲ 2m Schu¨tze: Web search 41 / 123

✲ 2m ✲ 2m ✲ 2m ✲ 2m Permutation and minimum: Example Recap Big picture Ads Duplicate detection Spam Web IR Permutation and minimum: Example Size of the web document 1: {sk } document 2: {sk } • • • • s1 s2 s3 s4 xk = π(sk ) ✲ 2m • • • • 1 1 ✲ 2m s1 s5 s3 s4 xk = π(sk ) ❝ • • ❝ ❝ • • ❝ x3 x1 x4 x2 ✲ 2m ❝ • ❝ • ❝ • • x3 x1 x4 ❝ ✲ 2m 1 1 x5 xk ❝ ❝ x1 x4 xk ❝ ❝ x1 x5 1 ❝ x3 ❝ x2 ✲ 2m ❝ x3 ❝ ✲ 2m 1 x2 minsk π(sk ) minsk π(sk ) ❝ x3 ✲ 2m ✲ 2m ❝ x3 1 1 We use mins∈d1 π(s) = mins∈d2 π(s) as a test for: are d1 and d2 near-duplicates? Schu¨tze: Web search 41 / 123

✲ 2m ✲ 2m ✲ 2m ✲ 2m Permutation and minimum: Example Recap Big picture Ads Duplicate detection Spam Web IR Permutation and minimum: Example Size of the web document 1: {sk } document 2: {sk } • • • • s1 s2 s3 s4 ✲ 2m • • • • 1 1 ✲ 2m s1 s5 s3 s4 xk = π(sk ) xk = π(sk ) ❝ • • ❝ ❝ • • ❝ x3 x1 x4 x2 ✲ 2m ❝ • ❝ • ❝ • • x3 x1 x4 ❝ ✲ 2m 1 1 x5 xk ❝ ❝ x1 x4 xk ❝ ❝ x1 x5 ❝ x3 ❝ x2 ✲ 2m ❝ x3 ❝ ✲ 2m 1 1 x2 minsk π(sk ) minsk π(sk ) ❝ x3 ✲ 2m ✲ 2m ❝ x3 1 1 We use mins∈ d2 π(s) = mins∈ d2π(s) as a test for: are d1 and d2 near-duplicates? In this case: permutation π says: d1 ≈ d2 Schu¨tze: Web search

I Computing Jaccard for sketches [,u oil • J l •• j •• I ••• II• •ll I Computing Jaccard for sketches

Computing Jaccard for sketches Recap Big picture Ads Duplicate detection Spam Computing Jaccard for sketches Web IR Size of the web Sketches: Each document is now a vector of n = 200 numbers. Schu¨tze: Web search 42 / 123

Computing Jaccard for sketches Recap Big picture Ads Duplicate detection Spam Computing Jaccard for sketches Web IR Size of the web Sketches: Each document is now a vector of n = 200 numbers. Much easier to deal with than the very high-dimensional space of shingles Schu¨tze: Web search 42 / 123

Computing Jaccard for sketches Recap Big picture Ads Duplicate detection Spam Computing Jaccard for sketches Web IR Size of the web Sketches: Each document is now a vector of n = 200 numbers. Much easier to deal with than the very high-dimensional space of shingles But how do we compute Jaccard? Schu¨tze: Web search 42 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. There are |U|! permutations on U. Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. There are |U|! permutations on U. For s′ ∈ I , for how many permutations π do we have arg mins∈d1 π(s) = s = arg mins∈d2 π(s)? ′ Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. There are |U|! permutations on U. For s′ ∈ I , for how many permutations π do we have arg mins∈ d1 π(s) = s = arg mins∈ d2 π(s)? Answer: (|U| − 1)! ′ Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. There are |U|! permutations on U. For s′ ∈ I , for how many permutations π do we have arg mins∈ d1 π(s) = s = arg mins∈ d2 π(s)? Answer: (|U| − 1)! There is a set of (|U| − 1)! different permutations for each s in I . ⇒ |I |(|U| − 1)! permutations make arg mins∈ d1 π(s) = arg mins∈ d2 π(s) true ′ Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. There are |U|! permutations on U. For s′ ∈ I , for how many permutations π do we have arg mins∈ d1 π(s) = s = arg mins∈ d2 π(s)? Answer: (|U| − 1)! There is a set of (|U| − 1)! different permutations for each s in I . ⇒ |I |(|U| − 1)! permutations make arg mins∈ d1 π(s) = arg mins∈ d2 π(s) true Thus, the proportion of permutations that make mins∈ d1 π(s) = mins∈ d2 π(s) true is: |I |(|U| − 1)! ′ |U|! Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. There are |U|! permutations on U. For s′ ∈ I , for how many permutations π do we have arg mins∈ d1 π(s) = s = arg mins∈ d2 π(s)? Answer: (|U| − 1)! There is a set of (|U| − 1)! different permutations for each s in I . ⇒ |I |(|U| − 1)! permutations make arg mins∈ d1 π(s) = arg mins∈ d2 π(s) true Thus, the proportion of permutations that make mins∈ d1 π(s) = mins∈ d2 π(s) true is: |I |(|U| − 1)! ′ |U|! Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. There are |U|! permutations on U. For s′ ∈ I , for how many permutations π do we have arg mins∈ d1 π(s) = s = arg mins∈ d2 π(s)? Answer: (|U| − 1)! There is a set of (|U| − 1)! different permutations for each s in I . ⇒ |I |(|U| − 1)! permutations make arg mins∈ d1 π(s) = arg mins∈ d2 π(s) true Thus, the proportion of permutations that make mins∈ d1 π(s) = mins∈ d2π(s) true is: |I |(|U| − 1)! ′ |U|! Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. There are |U|! permutations on U. For s′ ∈ I , for how many permutations π do we have arg mins∈ d1 π(s) = s = arg mins∈ d2 π(s)? Answer: (|U| − 1)! There is a set of (|U| − 1)! different permutations for each s in I . ⇒ |I |(|U| − 1)! permutations make arg mins∈ d1 π(s) = arg mins∈ d2 π(s) true Thus, the proportion of permutations that make mins∈ d1 π(s) = mins∈ d2 π(s) true is: ′ |I |(|U| − 1)! |I | |U| = |U|! Schu¨tze: Web search 43 / 123

Computing Jaccard for sketches (2) Recap Big picture Ads Duplicate detection Spam Web IR Computing Jaccard for sketches (2) Size of the web How do we compute Jaccard? Let U be the union of the set of shingles of d1 and d2 and I the intersection. There are |U|! permutations on U. For s′ ∈ I , for how many permutations π do we have arg mins∈ d1 π(s) = s = arg mins∈ d2 π(s)? Answer: (|U| − 1)! There is a set of (|U| − 1)! different permutations for each s in I . ⇒ |I |(|U| − 1)! permutations make arg mins∈ d1 π(s) = arg mins∈ d2 π(s) true Thus, the proportion of permutations that make mins∈ d1 π(s) = mins∈ d2 π(s) true is: ′ |I |(|U| − 1)! |I | = = J(d1, d2) |U|! |U| Schu¨tze: Web search 43 / 123

Recap Big picture Ads Duplicate detection Estimating Jaccard Spam Web IR Size of the web Thus, the proportion of successful permutations is the Jaccard coefficient. Schu¨tze: Web search 44 / 123

Recap Big picture Ads Duplicate detection Estimating Jaccard Spam Web IR Size of the web Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Schu¨tze: Web search 44 / 123

Recap Big picture Ads Duplicate detection Estimating Jaccard Spam Web IR Size of the web Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. Schu¨tze: Web search 44 / 123

Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Recap Big picture Ads Duplicate detection Estimating Jaccard Spam Web IR Size of the web Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. Estimator of probability of success: proportion of successes in n Bernoulli trials. (n = 200) Schu¨tze: Web search 44 / 123

Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Recap Big picture Ads Duplicate detection Estimating Jaccard Spam Web IR Size of the web Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. Estimator of probability of success: proportion of successes in n Bernoulli trials. (n = 200) Our sketch is based on a random selection of permutations. Schu¨tze: Web search 44 / 123

Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Recap Big picture Ads Duplicate detection Estimating Jaccard Spam Web IR Size of the web Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. Estimator of probability of success: proportion of successes in n Bernoulli trials. (n = 200) Our sketch is based on a random selection of permutations. Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200. Schu¨tze: Web search 44 / 123

Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Recap Big picture Ads Duplicate detection Estimating Jaccard Spam Web IR Size of the web Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff mins∈ d1 π(s) = mins∈ d2 π(s) Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. Estimator of probability of success: proportion of successes in n Bernoulli trials. (n = 200) Our sketch is based on a random selection of permutations. Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200. k/n = k/200 estimates J(d1, d2). Schu¨tze: Web search 44 / 123

[,u oil • J l •• j •• I ••• II• •ll I Implementation

Recap Big picture Ads Implementation Duplicate detection Spam Web IR Size of the web We use hash functions as an efficient type of permutation: hi : {1..2m } → {1..2m } Schu¨tze: Web search 45 / 123

Recap Big picture Ads Implementation Duplicate detection Spam Web IR Size of the web We use hash functions as an efficient type of permutation: hi : {1..2m } → {1..2m } Scan all shingles sk in union of two sets in arbitrary order Schu¨tze: Web search 45 / 123

Recap Big picture Ads Implementation Duplicate detection Spam Web IR Size of the web We use hash functions as an efficient type of permutation: hi : {1..2m } → {1..2m } Scan all shingles sk in union of two sets in arbitrary order For each hash function hi and documents d1, d2, . . .: keep slot for minimum value found so far Schu¨tze: Web search 45 / 123

Recap Big picture Ads Implementation Duplicate detection Spam Web IR Size of the web We use hash functions as an efficient type of permutation: hi : {1..2m } → {1..2m } Scan all shingles sk in union of two sets in arbitrary order For each hash function hi and documents d1, d2, . . .: keep slot for minimum value found so far If hi (sk ) is lower than minimum found so far: update slot Schu¨tze: Web search 45 / 123

[,u oil• Jl •• j • • I ••• II• •ll I Example 4t I 1 ---'

Example d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g h(1) = 1 g (1) = 3 h(2) = 2 g (2) = 0 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g h(1) = 1 g (1) = 3 h(2) = 2 g (2) = 0 h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 h(2) = 2 g (2) = 0 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 h(2) = 2 g (2) = 0 h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 3 h(2) = 2 g (2) = 0 h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 3 – h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 3 – h(2) = 2 g (2) = 0 h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 – h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 – h(2) = 2 g (2) = 0 h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 – h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 – 2 h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 2 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 – h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 – h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 2 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 2 h(5) = 0 g (5) = 1 – d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 2 h(5) = 0 g (5) = 1 – d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 2 h(5) = 0 g (5) = 1 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 2 h(5) = 0 g (5) = 1 1 0 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 2 h(5) = 0 g (5) = 1 1 0 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 final sketches Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 2 h(5) = 0 g (5) = 1 1 0 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 min(h(d1)) = 1 != 0 = min(h(d2)) final sketches Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 2 h(5) = 0 g (5) = 1 1 0 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 min(h(d1)) = 1 != 0 = min(h(d2)) min(g (d1)) = 2 != 0 = min(g (d2)) final sketches Schu¨tze: Web search 46 / 123

Example d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 Recap Big picture Example Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot h g ∞ h(1) = 1 g (1) = 3 1 1 3 3 h(2) = 2 g (2) = 0 1 3 2 2 0 0 h(3) = 3 g (3) = 2 3 1 3 2 2 0 h(4) = 4 g (4) = 4 4 1 4 2 2 h(5) = 0 g (5) = 1 1 0 d1 d2 s1 1 0 s2 0 1 s3 1 1 s4 1 0 s5 0 1 h(x ) = x mod 5 g (x ) = (2x + 1) mod 5 min(h(d1)) = 1 != 0 = min(h(d2)) min(g (d1)) = 2 != 0 = min(g (d2)) Jˆ(d1, d2) = 0+0 = 0 2 final sketches Schu¨tze: Web search 46 / 123

[,u oil • J l •• j •• I ••• II• •ll I Exercise

Exercise d1 d2 d3 s1 1 s2 s3 s4 h(x ) = 5x + 5 mod 4 Recap Big picture Exercise Ads Duplicate detection Spam Web IR Size of the web d1 d2 d3 s1 1 s2 s3 s4 h(x ) = 5x + 5 mod 4 g (x ) = (3x + 1) mod 4 Estimate Jˆ(d1, d2), Jˆ(d1, d3), Jˆ(d2, d3) Schu¨tze: Web search 47 / 123

[,u oil • J l •• j •• I ••• II• •ll I Solution (1)

Solution (1) d1 slot d2 slot d3 slot ∞ h(1) = 2 g (1) = 0 2 2 0 0 Recap Big picture Solution (1) Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot d3 slot ∞ h(1) = 2 g (1) = 0 2 2 0 0 h(2) = 3 g (2) = 3 3 3 2 3 2 3 0 h(3) = 0 g (3) = 2 3 2 0 h(4) = 1 g (4) = 1 1 1 d1 d2 d3 s1 1 s2 s3 s4 h(x ) = 5x + 5 mod 4 g (x ) = (3x + 1) mod 4 Schu¨tze: Web search 48 / 123

Solution (1) d1 slot d2 slot d3 slot ∞ h(1) = 2 g (1) = 0 2 2 0 0 Recap Big picture Solution (1) Ads Duplicate detection Spam Web IR Size of the web d1 slot d2 slot d3 slot ∞ h(1) = 2 g (1) = 0 2 2 0 0 h(2) = 3 g (2) = 3 3 3 2 3 2 3 0 h(3) = 0 g (3) = 2 3 2 0 h(4) = 1 g (4) = 1 1 1 d1 d2 d3 s1 1 s2 s3 s4 h(x ) = 5x + 5 mod 4 g (x ) = (3x + 1) mod 4 final sketches Schu¨tze: Web search 48 / 123

[,u oil • J l •• j •• I ••• II• •ll I Solution (2)

Solution (2) Jˆ(d1, d2) = Jˆ(d1, d3) = Jˆ(d2, d3) = 0 + 0 = 0 2 0 + 0 Recap Big picture Solution (2) Ads Duplicate detection Spam Web IR Size of the web Jˆ(d1, d2) = Jˆ(d1, d3) = Jˆ(d2, d3) = 0 + 0 = 0 2 0 + 0 = 0 2 0 + 1 = 1/2 2 Schu¨tze: Web search

[,u oil • J l •• j •• I ••• II• •ll I Shingling: Summary

Shingling: Summary Input: N documents Recap Big picture Ads Duplicate detection Shingling: Summary Spam Web IR Size of the web Input: N documents Schu¨tze: Web search 50 / 123

Shingling: Summary Input: N documents Recap Big picture Ads Duplicate detection Shingling: Summary Spam Web IR Size of the web Input: N documents Choose n-gram size for shingling, e.g., n = 5 Schu¨tze: Web search 50 / 123

Shingling: Summary Input: N documents Recap Big picture Ads Duplicate detection Shingling: Summary Spam Web IR Size of the web Input: N documents Choose n-gram size for shingling, e.g., n = 5 Pick 200 random permutations, represented as hash functions Schu¨tze: Web search 50 / 123

Shingling: Summary Input: N documents Recap Big picture Ads Duplicate detection Shingling: Summary Spam Web IR Size of the web Input: N documents Choose n-gram size for shingling, e.g., n = 5 Pick 200 random permutations, represented as hash functions Compute N sketches: 200 × N matrix shown on previous slide, one row per permutation, one column per document Schu¨tze: Web search 50 / 123

Shingling: Summary Input: N documents Recap Big picture Ads Duplicate detection Shingling: Summary Spam Web IR Size of the web Input: N documents Choose n-gram size for shingling, e.g., n = 5 Pick 200 random permutations, represented as hash functions Compute N sketches: 200 × N matrix shown on previous slide, one row per permutation, one column per document N·(N−1) Compute pairwise similarities 2 Schu¨tze: Web search 50 / 123

Shingling: Summary Input: N documents Recap Big picture Ads Duplicate detection Shingling: Summary Spam Web IR Size of the web Input: N documents Choose n-gram size for shingling, e.g., n = 5 Pick 200 random permutations, represented as hash functions Compute N sketches: 200 × N matrix shown on previous slide, one row per permutation, one column per document N·(N−1) Compute pairwise similarities 2 Transitive closure of documents with similarity > θ Schu¨tze: Web search 50 / 123

Shingling: Summary Input: N documents Recap Big picture Ads Duplicate detection Shingling: Summary Spam Web IR Size of the web Input: N documents Choose n-gram size for shingling, e.g., n = 5 Pick 200 random permutations, represented as hash functions Compute N sketches: 200 × N matrix shown on previous slide, one row per permutation, one column per document N·(N−1) Compute pairwise similarities 2 Transitive closure of documents with similarity > θ Index only one document from each equivalence class Schu¨tze: Web search 50 / 123

I Efficient near-duplicate detection [,u oil • J l •• j •• I ••• II• •ll I Efficient near-duplicate detection

Efficient near-duplicate detection Recap Big picture Ads Duplicate detection Spam Efficient near-duplicate detection Web IR Size of the web Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of two documents. Schu¨tze: Web search 51 / 123

Efficient near-duplicate detection Recap Big picture Ads Duplicate detection Spam Efficient near-duplicate detection Web IR Size of the web Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of two documents. But we still have to estimate O(N2) coefficients where N is the number of web pages. Schu¨tze: Web search 51 / 123

Efficient near-duplicate detection Recap Big picture Ads Duplicate detection Spam Efficient near-duplicate detection Web IR Size of the web Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of two documents. But we still have to estimate O(N2) coefficients where N is the number of web pages. Still intractable Schu¨tze: Web search 51 / 123

Efficient near-duplicate detection Recap Big picture Ads Duplicate detection Spam Efficient near-duplicate detection Web IR Size of the web Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of two documents. But we still have to estimate O(N2) coefficients where N is the number of web pages. Still intractable One solution: locality sensitive hashing (LSH) Schu¨tze: Web search 51 / 123

Efficient near-duplicate detection Recap Big picture Ads Duplicate detection Spam Efficient near-duplicate detection Web IR Size of the web Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of two documents. But we still have to estimate O(N2) coefficients where N is the number of web pages. Still intractable One solution: locality sensitive hashing (LSH) Another solution: sorting (Henzinger 2006) Schu¨tze: Web search 51 / 123

Take-away today Big picture Ads – they pay for the web Recap Big picture Ads Take-away today Duplicate detection Spam Web IR Size of the web Big picture Ads – they pay for the web Duplicate detection – addresses one aspect of chaotic content creation Spam detection – addresses one aspect of lack of central access control Probably won’t get to today Web information retrieval Size of the web Schu¨tze: Web search

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

I The goal of spamming on the web Spdrn I The goal of spamming on the web '4 1 1 ---'

The goal of spamming on the web Recap Big picture Ads Duplicate detection Spam Web IR The goal of spamming on the web Size of the web You have a page that will generate lots of revenue for you if people visit it. Schu¨tze: Web search 54 / 123

The goal of spamming on the web Recap Big picture Ads Duplicate detection Spam Web IR The goal of spamming on the web Size of the web You have a page that will generate lots of revenue for you if people visit it. Therefore, you would like to direct visitors to this page. Schu¨tze: Web search 54 / 123

The goal of spamming on the web Recap Big picture Ads Duplicate detection Spam Web IR The goal of spamming on the web Size of the web You have a page that will generate lots of revenue for you if people visit it. Therefore, you would like to direct visitors to this page. One way of doing this: get your page ranked highly in search results. Schu¨tze: Web search 54 / 123

The goal of spamming on the web Recap Big picture Ads Duplicate detection Spam Web IR The goal of spamming on the web Size of the web You have a page that will generate lots of revenue for you if people visit it. Therefore, you would like to direct visitors to this page. One way of doing this: get your page ranked highly in search results. Exercise: How can I get my page ranked highly? Schu¨tze: Web search 54 / 123

I Spam technique: Keyword stuffing / Hidden text Spd rn I Spam technique: Keyword stuffing / Hidden text

Spam technique: Keyword stuffing / Hidden text Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Spam technique: Keyword stuffing / Hidden text Misleading meta-tags, excessive repetition Schu¨tze: Web search 55 / 123

Spam technique: Keyword stuffing / Hidden text Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Spam technique: Keyword stuffing / Hidden text Misleading meta-tags, excessive repetition Hidden text with colors, style sheet tricks etc. Schu¨tze: Web search 55 / 123

Spam technique: Keyword stuffing / Hidden text Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Spam technique: Keyword stuffing / Hidden text Misleading meta-tags, excessive repetition Hidden text with colors, style sheet tricks etc. Used to be very effective, most search engines now catch these Schu¨tze: Web search 55 / 123

Spd rn I Keyword stuffing

Sp.un Keyword stuffing Wlo_.-d_..,. ,...,.. ,....dt•••..._...... lll- sk.oodoa •ou olofw.o•••banbn6ckkyooc:c-pb;>eo=Yo ""'*'--­ alllloeoc:l tu•..r.,.,, ,.......,........,.,-.,_,. ,.,..,.ft•,.,•H .-5dekt)o lleoe.-.htc......,c.,.,.ecoddrmt<lbo<l-.o t""oif.f..,.,-,.._,._......_.. -doe.epo"oct<becwbole ......... b..,.!llk lo., ouold•t fo•ol •u "-f•••f•• ........_,... .....,.._ .._._cwboloode<llllhbo>< lil -_,<lruk toxol•f• •·•• booUr-C-*otod. .,delft-t:&de •uoh4...,,-H .....,. decro-...-z tuMf••n •• t.K a...-.,..;cw...a-of toxMf•h••• ................,....,."o*uo'rl•ft·••·••••.r..w..oomibcntfilf..,......,<lp-oh,pDOOtbwt>Cen ,_,wftloftn' '""....,..,J,••bot>lun fide ,_...., www <etnal ...............,... ,..,.._...., canc.oi<l- on olof,.,,t'•ol•oct-•...,..,..lemiiU- M -···&<lu<>n......,.,., lle .......-. •..r,. -..,,.,..,.,.,..,,.,....,u<Mof on3•d ,_ !idclily tu•..r•on3d muou.l bcn<!llo •..r...ttool weotcoa&!heta ou ol•f•n·d••.......,_.&do:W)'........,.,...,..,. ,u •fl••••••....,.......,..oo<lmcdoc:Of c dtoabMy crnp g<:=ala<><'itroRif .......,al tu CIOI!Otnel1 ¢l""""'•b•ot•u •fl•nolod p.b:_.. ,IOrJ"""""'., hotllaCO •• _0111t<l1.).1.o.,u ••f•n"·••ot.b<.tya ,...,.. tu dcfc•••,.·d •roc m""&le< ""'do•fxcoo ud ••fo,uoltu defOJ oocol. <>l<l\ow tud..fou ••k•,._,n,cnotol ho..th"'"""woceUtcokt •of•onM poc:ilidifo rnefropolol"" _,..ow>eooP""& ltlekernpet tu d..r...,... , pcopl.. n o >onallle tu dofoo t oolf o .... olofon odo-c........,.........,.,caemi'""""'"""""'UIUIUII to ltc-nroachonfotd ••x <l•f•"··fd M,...t:>en<fio:iol.,.,,.,«nlnll.><o s..-coal group""dlootd .................... ......opck... b<onefil uaY)'IIUIO<IIood urton li.dclloy oud•f•n ud paclftt l{e carn>en tu d•fPfl·o.t\. ooulhem """"'"al ldo .,...,...,.,,<rontamcn< o <t=liturtl<>n <<""ol Somolla,JI'f'6"-••l.....-ol ....:.l d.t<toln'll.,.._<""''""""" det......,..,ltt>MI..,._, St ,dou•• .l,.....,old.cno.,,.mo .............,.few>'..n.bi<_..olli&_... &w ·-_,.•.,..,•.....-..n.....cu.o Vr.Sbould to\ ....t.dh..........ll.J.a pMtlcomp-•"'"1'..,..? b..,. cO<hom<n<...a<n<niVr.ot>doutdclll.....,....._< Qnwnl-aowc ...............,., •..a......,..det lidclloyditc'"""'-o Awoe<:t>blo_,.,_,........,.M .HomobeDcie,.._olcnlt_.... Y.oltttnttp&.,a..--.. ..--..,mroou:cot.conu p<O --b"""'..-. ......... .._R....._endoflbo.....dtroo&w-blll. _....---­ ltit"--Nr -.sollftthrR_-.jp<t><ralfew _.,...,..-.lli&-.dooc: - •- Somc"-c u<hdt<"f>ID-.-....,e ""' ._c.-loo - )"<a- """ -..I.S..fl)' ... "tea-o"'""llo Thnollo ...........,o...,......-no de<llllhb011C&C...,lll oo<_ llc_Oknt_.....,....vo ---•b•..•-•1_,-ao: .ooo&<utr...u.td ........_dt<:...,lborty-..llle .......... ...-,........YI..lld'!n!U .....,..... A.ootmoc- ·-.k..t ............ brno&lle -...c- pa.,.Tr--•-c<l""'c--peop brno&olcloorpo.t.loc

I Spam technique: Doorway and lander pages Spd rn I Spam technique: Doorway and lander pages

Spam technique: Doorway and lander pages Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Spam technique: Doorway and lander pages Doorway page: optimized for a single keyword, redirects to the real target page Schu¨tze: Web search 57 / 123

Spam technique: Doorway and lander pages Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Spam technique: Doorway and lander pages Doorway page: optimized for a single keyword, redirects to the real target page Lander page: optimized for a single keyword or a misspelled domain name, designed to attract surfers who will then click on ads Schu¨tze: Web search 57 / 123

Spd rn I Lander page

Lander page Number one hit on Google for the search “composita” Recap Big picture Lander page Ads Duplicate detection Spam Web IR Size of the web Number one hit on Google for the search “composita” Schu¨tze: Web search 58 / 123

Lander page Number one hit on Google for the search “composita” Recap Big picture Lander page Ads Duplicate detection Spam Web IR Size of the web Number one hit on Google for the search “composita” The only purpose of this page: get people to click on the ads and make money for the page owner Schu¨tze: Web search 58 / 123

I Spam technique: Duplication Spd rn I Spam technique: Duplication

Spam technique: Duplication Recap Big picture Ads Duplicate detection Spam Spam technique: Duplication Web IR Size of the web Get good content from somewhere (steal it or produce it yourself) Schu¨tze: Web search 59 / 123

Spam technique: Duplication Recap Big picture Ads Duplicate detection Spam Spam technique: Duplication Web IR Size of the web Get good content from somewhere (steal it or produce it yourself) Publish a large number of slight variations of it Schu¨tze: Web search 59 / 123

Spam technique: Duplication Recap Big picture Ads Duplicate detection Spam Spam technique: Duplication Web IR Size of the web Get good content from somewhere (steal it or produce it yourself) Publish a large number of slight variations of it For example, publish the answer to a tax question with the spelling variations of “tax deferred” on the previous slide Schu¨tze: Web search 59 / 123

I Spam technique: Cloaking Spd rn I Spam technique: Cloaking

Spam technique: Cloaking Recap Big picture Ads Duplicate detection Spam technique: Cloaking Spam Web IR Size of the web Serve fake content to search engine spider Schu¨tze: Web search 60 / 123

Spam technique: Cloaking Recap Big picture Ads Duplicate detection Spam technique: Cloaking Spam Web IR Size of the web Serve fake content to search engine spider So do we just penalize this always? Schu¨tze: Web search 60 / 123

Spam technique: Cloaking Recap Big picture Ads Duplicate detection Spam technique: Cloaking Spam Web IR Size of the web Serve fake content to search engine spider So do we just penalize this always? No: legitimate uses (e.g., different content to US vs. European users) Schu¨tze: Web search 60 / 123

I Spam technique: Link spam Spd rn I Spam technique: Link spam

Spam technique: Link spam Recap Big picture Ads Duplicate detection Spam technique: Link spam Spam Web IR Size of the web Create lots of links pointing to the page you want to promote Schu¨tze: Web search 61 / 123

Spam technique: Link spam Recap Big picture Ads Duplicate detection Spam technique: Link spam Spam Web IR Size of the web Create lots of links pointing to the page you want to promote Put these links on pages with high (or at least non-zero) PageRank Schu¨tze: Web search 61 / 123

Spam technique: Link spam Recap Big picture Ads Duplicate detection Spam technique: Link spam Spam Web IR Size of the web Create lots of links pointing to the page you want to promote Put these links on pages with high (or at least non-zero) PageRank Newly registered domains (domain flooding) Schu¨tze: Web search 61 / 123

Spam technique: Link spam Recap Big picture Ads Duplicate detection Spam technique: Link spam Spam Web IR Size of the web Create lots of links pointing to the page you want to promote Put these links on pages with high (or at least non-zero) PageRank Newly registered domains (domain flooding) A set of pages that all point to each other to boost each other’s PageRank (mutual admiration society) Schu¨tze: Web search 61 / 123

Spam technique: Link spam Recap Big picture Ads Duplicate detection Spam technique: Link spam Spam Web IR Size of the web Create lots of links pointing to the page you want to promote Put these links on pages with high (or at least non-zero) PageRank Newly registered domains (domain flooding) A set of pages that all point to each other to boost each other’s PageRank (mutual admiration society) Pay somebody to put your link on their highly ranked page (“schuetze horoskop” example) Schu¨tze: Web search 61 / 123

Spam technique: Link spam Recap Big picture Ads Duplicate detection Spam technique: Link spam Spam Web IR Size of the web Create lots of links pointing to the page you want to promote Put these links on pages with high (or at least non-zero) PageRank Newly registered domains (domain flooding) A set of pages that all point to each other to boost each other’s PageRank (mutual admiration society) Pay somebody to put your link on their highly ranked page (“schuetze horoskop” example) Leave comments that include the link on blogs Schu¨tze: Web search 61 / 123

I SEO: Search engine optimization Spd rn I SEO: Search engine optimization

SEO: Search engine optimization Recap Big picture Ads Duplicate detection Spam SEO: Search engine optimization Web IR Size of the web Promoting a page in the search rankings is not necessarily spam. Schu¨tze: Web search 62 / 123

SEO: Search engine optimization Recap Big picture Ads Duplicate detection Spam SEO: Search engine optimization Web IR Size of the web Promoting a page in the search rankings is not necessarily spam. It can also be a legitimate business – which is called SEO. Schu¨tze: Web search 62 / 123

SEO: Search engine optimization Recap Big picture Ads Duplicate detection Spam SEO: Search engine optimization Web IR Size of the web Promoting a page in the search rankings is not necessarily spam. It can also be a legitimate business – which is called SEO. You can hire an SEO firm to get your page highly ranked. Schu¨tze: Web search 62 / 123

SEO: Search engine optimization Recap Big picture Ads Duplicate detection Spam SEO: Search engine optimization Web IR Size of the web Promoting a page in the search rankings is not necessarily spam. It can also be a legitimate business – which is called SEO. You can hire an SEO firm to get your page highly ranked. There are many legitimate reasons for doing this. Schu¨tze: Web search 62 / 123

SEO: Search engine optimization Recap Big picture Ads Duplicate detection Spam SEO: Search engine optimization Web IR Size of the web Promoting a page in the search rankings is not necessarily spam. It can also be a legitimate business – which is called SEO. You can hire an SEO firm to get your page highly ranked. There are many legitimate reasons for doing this. For example, Google bombs like Who is a failure? Schu¨tze: Web search 62 / 123

For example, Google bombs like Who is a failure? Recap Big picture Ads Duplicate detection Spam SEO: Search engine optimization Web IR Size of the web Promoting a page in the search rankings is not necessarily spam. It can also be a legitimate business – which is called SEO. You can hire an SEO firm to get your page highly ranked. There are many legitimate reasons for doing this. For example, Google bombs like Who is a failure? And there are many legitimate ways of achieving this: Schu¨tze: Web search 62 / 123

SEO: Search engine optimization Recap Big picture Ads Duplicate detection Spam SEO: Search engine optimization Web IR Size of the web Promoting a page in the search rankings is not necessarily spam. It can also be a legitimate business – which is called SEO. You can hire an SEO firm to get your page highly ranked. There are many legitimate reasons for doing this. For example, Google bombs like Who is a failure? And there are many legitimate ways of achieving this: Restructure your content in a way that makes it easy to index Schu¨tze: Web search 62 / 123

SEO: Search engine optimization Recap Big picture Ads Duplicate detection Spam SEO: Search engine optimization Web IR Size of the web Promoting a page in the search rankings is not necessarily spam. It can also be a legitimate business – which is called SEO. You can hire an SEO firm to get your page highly ranked. There are many legitimate reasons for doing this. For example, Google bombs like Who is a failure? And there are many legitimate ways of achieving this: Restructure your content in a way that makes it easy to index Talk with influential bloggers and have them link to your site Schu¨tze: Web search 62 / 123

SEO: Search engine optimization Recap Big picture Ads Duplicate detection Spam SEO: Search engine optimization Web IR Size of the web Promoting a page in the search rankings is not necessarily spam. It can also be a legitimate business – which is called SEO. You can hire an SEO firm to get your page highly ranked. There are many legitimate reasons for doing this. For example, Google bombs like Who is a failure? And there are many legitimate ways of achieving this: Restructure your content in a way that makes it easy to index Talk with influential bloggers and have them link to your site Add more interesting and original content Schu¨tze: Web search 62 / 123

Spd rn I The war against spam

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Usage (users visiting a page) Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Usage (users visiting a page) No adult content (e.g., no pictures with flesh-tone) Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Usage (users visiting a page) No adult content (e.g., no pictures with flesh-tone) Distribution and structure of text (e.g., no keyword stuffing) Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Usage (users visiting a page) No adult content (e.g., no pictures with flesh-tone) Distribution and structure of text (e.g., no keyword stuffing) Combine all of these indicators and use machine learning Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Usage (users visiting a page) No adult content (e.g., no pictures with flesh-tone) Distribution and structure of text (e.g., no keyword stuffing) Combine all of these indicators and use machine learning Editorial intervention Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Usage (users visiting a page) No adult content (e.g., no pictures with flesh-tone) Distribution and structure of text (e.g., no keyword stuffing) Combine all of these indicators and use machine learning Editorial intervention Blacklists Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Usage (users visiting a page) No adult content (e.g., no pictures with flesh-tone) Distribution and structure of text (e.g., no keyword stuffing) Combine all of these indicators and use machine learning Editorial intervention Blacklists Top queries audited Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Usage (users visiting a page) No adult content (e.g., no pictures with flesh-tone) Distribution and structure of text (e.g., no keyword stuffing) Combine all of these indicators and use machine learning Editorial intervention Blacklists Top queries audited Complaints addressed Schu¨tze: Web search 63 / 123

The war against spam Quality indicators Recap Big picture Ads Duplicate detection The war against spam Spam Web IR Size of the web Quality indicators Links, statistically analyzed (PageRank etc) Usage (users visiting a page) No adult content (e.g., no pictures with flesh-tone) Distribution and structure of text (e.g., no keyword stuffing) Combine all of these indicators and use machine learning Editorial intervention Blacklists Top queries audited Complaints addressed Suspect patterns detected Schu¨tze: Web search 63 / 123

I Webmaster guidelines Spd rn I Webmaster guidelines

Recap Big picture Ads Duplicate detection Webmaster guidelines Spam Web IR Size of the web Major search engines have guidelines for webmasters. Schu¨tze: Web search 64 / 123

Recap Big picture Ads Duplicate detection Webmaster guidelines Spam Web IR Size of the web Major search engines have guidelines for webmasters. These guidelines tell you what is legitimate SEO and what is spamming. Schu¨tze: Web search 64 / 123

Recap Big picture Ads Duplicate detection Webmaster guidelines Spam Web IR Size of the web Major search engines have guidelines for webmasters. These guidelines tell you what is legitimate SEO and what is spamming. Ignore these guidelines at your own risk Schu¨tze: Web search 64 / 123

Recap Big picture Ads Duplicate detection Webmaster guidelines Spam Web IR Size of the web Major search engines have guidelines for webmasters. These guidelines tell you what is legitimate SEO and what is spamming. Ignore these guidelines at your own risk Once a search engine identifies you as a spammer, all pages on your site may get low ranks (or disappear from the index entirely). Schu¨tze: Web search 64 / 123

Recap Big picture Ads Duplicate detection Webmaster guidelines Spam Web IR Size of the web Major search engines have guidelines for webmasters. These guidelines tell you what is legitimate SEO and what is spamming. Ignore these guidelines at your own risk Once a search engine identifies you as a spammer, all pages on your site may get low ranks (or disappear from the index entirely). There is often a fine line between spam and legitimate SEO. Schu¨tze: Web search 64 / 123

Recap Big picture Ads Duplicate detection Webmaster guidelines Spam Web IR Size of the web Major search engines have guidelines for webmasters. These guidelines tell you what is legitimate SEO and what is spamming. Ignore these guidelines at your own risk Once a search engine identifies you as a spammer, all pages on your site may get low ranks (or disappear from the index entirely). There is often a fine line between spam and legitimate SEO. Scientific study of fighting spam on the web: adversarial information retrieval Schu¨tze: Web search 64 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

I Web IR: Differences from traditional IR /V ,..t If · I Web IR: Differences from traditional IR

Web IR: Differences from traditional IR Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web IR: Differences from traditional IR Links: The web is a hyperlinked document collection. Schu¨tze: Web search 66 / 123

Web IR: Differences from traditional IR Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web IR: Differences from traditional IR Links: The web is a hyperlinked document collection. Queries: Web queries are different, more varied and there are a lot of them. Schu¨tze: Web search 66 / 123

Web IR: Differences from traditional IR Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web IR: Differences from traditional IR Links: The web is a hyperlinked document collection. Queries: Web queries are different, more varied and there are a lot of them. Users: Users are different, more varied and there are a lot of them. Schu¨tze: Web search 66 / 123

Web IR: Differences from traditional IR Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web IR: Differences from traditional IR Links: The web is a hyperlinked document collection. Queries: Web queries are different, more varied and there are a lot of them. Users: Users are different, more varied and there are a lot of them. Documents: Documents are different, more varied and there are a lot of them. Schu¨tze: Web search 66 / 123

Web IR: Differences from traditional IR Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web IR: Differences from traditional IR Links: The web is a hyperlinked document collection. Queries: Web queries are different, more varied and there are a lot of them. Users: Users are different, more varied and there are a lot of them. Documents: Documents are different, more varied and there are a lot of them. Context: Context is more important on the web than in many other IR applications. Schu¨tze: Web search 66 / 123

Web IR: Differences from traditional IR Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web IR: Differences from traditional IR Links: The web is a hyperlinked document collection. Queries: Web queries are different, more varied and there are a lot of them. How many? Users: Users are different, more varied and there are a lot of them. How many? Documents: Documents are different, more varied and there are a lot of them. How many? Context: Context is more important on the web than in many other IR applications. Ads and spam Schu¨tze: Web search 66 / 123

Web IR: Differences from traditional IR Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web IR: Differences from traditional IR Links: The web is a hyperlinked document collection. Queries: Web queries are different, more varied and there are a lot of them. How many? ≈ 109 Users: Users are different, more varied and there are a lot of them. How many? ≈ 109 Documents: Documents are different, more varied and there are a lot of them. How many? ≈ 1011 Context: Context is more important on the web than in many other IR applications. Ads and spam Schu¨tze: Web search 66 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

I Query distribution (1) /V ,..t If · I Query distribution (1)

Recap Big picture Ads Duplicate detection Query distribution (1) Spam Web IR Size of the web Most frequent queries on a large search engine on 2002.10.26. 1 sex 16 crack 31 juegos 46 Caramail 2 (artifact) 17 games 32 nude 47 msn 3 18 pussy 33 music 48 jennifer lopez 4 porno 19 cracks 34 musica 49 tits 5 mp3 20 lolita 35 anal 50 free porn 6 Halloween 21 britney spears 36 free6 51 cheats 7 sexo 22 ebay 37 avril lavigne 52 yahoo.com 8 chat 23 sexe 38 hotmail.com 53 eminem 9 porn 24 Pamela Anderson 39 winzip 54 Christina Aguilera 10 yahoo 25 warez 40 fuck 55 incest 11 KaZaA 26 divx 41 wallpaper 56 letras de canciones 12 xxx 27 gay 42 57 hardcore 13 Hentai 28 harry potter 43 postales 58 weather 14 lyrics 29 playboy 44 shakira 59 wallpapers 15 hotmail 30 lolitas 45 traductor 60 lingerie More than 1/3 of these are queries for adult content. Schu¨tze: Web search 68 / 123

Recap Big picture Ads Duplicate detection Query distribution (1) Spam Web IR Size of the web Most frequent queries on a large search engine on 2002.10.26. 1 sex 16 crack 31 juegos 46 Caramail 2 (artifact) 17 games 32 nude 47 msn 3 18 pussy 33 music 48 jennifer lopez 4 porno 19 cracks 34 musica 49 tits 5 mp3 20 lolita 35 anal 50 free porn 6 Halloween 21 britney spears 36 free6 51 cheats 7 sexo 22 ebay 37 avril lavigne 52 yahoo.com 8 chat 23 sexe 38 hotmail.com 53 eminem 9 porn 24 Pamela Anderson 39 winzip 54 Christina Aguilera 10 yahoo 25 warez 40 fuck 55 incest 11 KaZaA 26 divx 41 wallpaper 56 letras de canciones 12 xxx 27 gay 42 57 hardcore 13 Hentai 28 harry potter 43 postales 58 weather 14 lyrics 29 playboy 44 shakira 59 wallpapers 15 hotmail 30 lolitas 45 traductor 60 lingerie More than 1/3 of these are queries for adult content. Exercise: Does this mean that most people are looking for adult content? Schu¨tze: Web search 68 / 123

I Query distribution (2) /V ,..t If · I Query distribution (2)

Query distribution (2) Queries have a power law distribution. Recap Big picture Ads Duplicate detection Query distribution (2) Spam Web IR Size of the web Queries have a power law distribution. Schu¨tze: Web search 69 / 123

Query distribution (2) Queries have a power law distribution. Recap Big picture Ads Duplicate detection Query distribution (2) Spam Web IR Size of the web Queries have a power law distribution. Recall Zipf’s law: a few very frequent words, a large number of very rare words Schu¨tze: Web search 69 / 123

Query distribution (2) Queries have a power law distribution. Recap Big picture Ads Duplicate detection Query distribution (2) Spam Web IR Size of the web Queries have a power law distribution. Recall Zipf’s law: a few very frequent words, a large number of very rare words Same here: a few very frequent queries, a large number of very rare queries Schu¨tze: Web search 69 / 123

Query distribution (2) Queries have a power law distribution. Recap Big picture Ads Duplicate detection Query distribution (2) Spam Web IR Size of the web Queries have a power law distribution. Recall Zipf’s law: a few very frequent words, a large number of very rare words Same here: a few very frequent queries, a large number of very rare queries Examples of rare queries: search for names, towns, books etc Schu¨tze: Web search 69 / 123

Query distribution (2) Queries have a power law distribution. Recap Big picture Ads Duplicate detection Query distribution (2) Spam Web IR Size of the web Queries have a power law distribution. Recall Zipf’s law: a few very frequent words, a large number of very rare words Same here: a few very frequent queries, a large number of very rare queries Examples of rare queries: search for names, towns, books etc The proportion of adult queries is much lower than 1/3 Schu¨tze: Web search 69 / 123

I Types of queries / user needs in web sea rch /V ,..t If · I Types of queries / user needs in web sea rch

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” Schu¨tze: Web search 70 / 123

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” We called this “information need” earlier in the class. Schu¨tze: Web search 70 / 123

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” We called this “information need” earlier in the class. On the web, information needs proper are only a subclass of user needs. Schu¨tze: Web search 70 / 123

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” We called this “information need” earlier in the class. On the web, information needs proper are only a subclass of user needs. Other user needs: Navigational and transactional Schu¨tze: Web search 70 / 123

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” We called this “information need” earlier in the class. On the web, information needs proper are only a subclass of user needs. Other user needs: Navigational and transactional Navigational user needs: I want to go to this web site. “hotmail”, “myspace”, “United Airlines” Schu¨tze: Web search 70 / 123

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” We called this “information need” earlier in the class. On the web, information needs proper are only a subclass of user needs. Other user needs: Navigational and transactional Navigational user needs: I want to go to this web site. “hotmail”, “myspace”, “United Airlines” Transactional user needs: I want to make a transaction. Schu¨tze: Web search 70 / 123

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” We called this “information need” earlier in the class. On the web, information needs proper are only a subclass of user needs. Other user needs: Navigational and transactional Navigational user needs: I want to go to this web site. “hotmail”, “myspace”, “United Airlines” Transactional user needs: I want to make a transaction. Buy something: “MacBook Air” Schu¨tze: Web search 70 / 123

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” We called this “information need” earlier in the class. On the web, information needs proper are only a subclass of user needs. Other user needs: Navigational and transactional Navigational user needs: I want to go to this web site. “hotmail”, “myspace”, “United Airlines” Transactional user needs: I want to make a transaction. Buy something: “MacBook Air” Download something: “Acrobat Reader” Schu¨tze: Web search 70 / 123

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” We called this “information need” earlier in the class. On the web, information needs proper are only a subclass of user needs. Other user needs: Navigational and transactional Navigational user needs: I want to go to this web site. “hotmail”, “myspace”, “United Airlines” Transactional user needs: I want to make a transaction. Buy something: “MacBook Air” Download something: “Acrobat Reader” Chat with someone: “live soccer chat” Schu¨tze: Web search 70 / 123

Types of queries / user needs in web search Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Types of queries / user needs in web search Informational user needs: I need information on something. “low hemoglobin” We called this “information need” earlier in the class. On the web, information needs proper are only a subclass of user needs. Other user needs: Navigational and transactional Navigational user needs: I want to go to this web site. “hotmail”, “myspace”, “United Airlines” Transactional user needs: I want to make a transaction. Buy something: “MacBook Air” Download something: “Acrobat Reader” Chat with someone: “live soccer chat” Difficult problem: How can the search engine tell what the user need or intent for a particular query is? Schu¨tze: Web search 70 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

I Search in a hyperlinked collection /V ,..t If · I Search in a hyperlinked collection

Search in a hyperlinked collection Recap Big picture Ads Duplicate detection Spam Web IR Search in a hyperlinked collection Size of the web Web search in most cases is interleaved with navigation . . . Schu¨tze: Web search 72 / 123

Search in a hyperlinked collection Recap Big picture Ads Duplicate detection Spam Web IR Search in a hyperlinked collection Size of the web Web search in most cases is interleaved with navigation . . . . . . i.e., with following links. Schu¨tze: Web search 72 / 123

Search in a hyperlinked collection Recap Big picture Ads Duplicate detection Spam Web IR Search in a hyperlinked collection Size of the web Web search in most cases is interleaved with navigation . . . . . . i.e., with following links. Different from most other IR collections Schu¨tze: Web search 72 / 123

I ·us··k2 L----------- --- . Go Ie- Kinds of behaviors we see in the data ·us··k2 Short/Nav ---------------- Topic exploration MuHitasking ---.- --- - - ------... : . i : Topic switch -------l_ ·-·..··-·..············-·····-····..-- i'LJ 1 New topic Methodical results exploration L----------- --- I Stacking behavior Query reform Go Ie-

I Bowtie structure of the web /V ,..t If · I Bowtie structure of the web

Bowtie structure of the web Recap Big picture Ads Duplicate detection Bowtie structure of the web Spam Web IR Size of the web Strongly connected component (SCC) in the center Schu¨tze: Web search 74 / 123

Bowtie structure of the web Recap Big picture Ads Duplicate detection Bowtie structure of the web Spam Web IR Size of the web Strongly connected component (SCC) in the center Lots of pages that get linked to, but don’t link (OUT) Schu¨tze: Web search 74 / 123

Bowtie structure of the web Recap Big picture Ads Duplicate detection Bowtie structure of the web Spam Web IR Size of the web Strongly connected component (SCC) in the center Lots of pages that get linked to, but don’t link (OUT) Lots of pages that link to other pages, but don’t get linked to (IN) Schu¨tze: Web search 74 / 123

Bowtie structure of the web Recap Big picture Ads Duplicate detection Bowtie structure of the web Spam Web IR Size of the web Strongly connected component (SCC) in the center Lots of pages that get linked to, but don’t link (OUT) Lots of pages that link to other pages, but don’t get linked to (IN) Tendrils, tubes, islands Schu¨tze: Web search 74 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

I User intent: Answering the need behind the query /V ,..t If · I User intent: Answering the need behind the query

User intent: Answering the need behind the query Recap Big picture Ads Duplicate detection Spam Web IR Size of the web User intent: Answering the need behind the query What can we do to guess user intent? Schu¨tze: Web search 76 / 123

User intent: Answering the need behind the query Recap Big picture Ads Duplicate detection Spam Web IR Size of the web User intent: Answering the need behind the query What can we do to guess user intent? Guess user intent independent of context: Schu¨tze: Web search 76 / 123

User intent: Answering the need behind the query Recap Big picture Ads Duplicate detection Spam Web IR Size of the web User intent: Answering the need behind the query What can we do to guess user intent? Guess user intent independent of context: Spell correction Schu¨tze: Web search 76 / 123

User intent: Answering the need behind the query Recap Big picture Ads Duplicate detection Spam Web IR Size of the web User intent: Answering the need behind the query What can we do to guess user intent? Guess user intent independent of context: Spell correction Precomputed “typing” of queries (next slide) Schu¨tze: Web search 76 / 123

User intent: Answering the need behind the query Recap Big picture Ads Duplicate detection Spam Web IR Size of the web User intent: Answering the need behind the query What can we do to guess user intent? Guess user intent independent of context: Spell correction Precomputed “typing” of queries (next slide) Better: Guess user intent based on context: Schu¨tze: Web search 76 / 123

User intent: Answering the need behind the query Recap Big picture Ads Duplicate detection Spam Web IR Size of the web User intent: Answering the need behind the query What can we do to guess user intent? Guess user intent independent of context: Spell correction Precomputed “typing” of queries (next slide) Better: Guess user intent based on context: Geographic context (slide after next) Schu¨tze: Web search 76 / 123

User intent: Answering the need behind the query Recap Big picture Ads Duplicate detection Spam Web IR Size of the web User intent: Answering the need behind the query What can we do to guess user intent? Guess user intent independent of context: Spell correction Precomputed “typing” of queries (next slide) Better: Guess user intent based on context: Geographic context (slide after next) Context of user in this session (e.g., previous query) Schu¨tze: Web search 76 / 123

User intent: Answering the need behind the query Recap Big picture Ads Duplicate detection Spam Web IR Size of the web User intent: Answering the need behind the query What can we do to guess user intent? Guess user intent independent of context: Spell correction Precomputed “typing” of queries (next slide) Better: Guess user intent based on context: Geographic context (slide after next) Context of user in this session (e.g., previous query) Context provided by personal profile (Yahoo/MSN do this, Google claims it doesn’t) Schu¨tze: Web search 76 / 123

I Guessing of user intent by "typing" queries /V ,..t If · I Guessing of user intent by "typing" queries

Guessing of user intent by “typing” queries Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Guessing of user intent by “typing” queries Calculation: 5+4 Schu¨tze: Web search 77 / 123

Guessing of user intent by “typing” queries Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Guessing of user intent by “typing” queries Calculation: 5+4 Unit conversion: 1 kg in pounds Schu¨tze: Web search 77 / 123

Guessing of user intent by “typing” queries Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Guessing of user intent by “typing” queries Calculation: 5+4 Unit conversion: 1 kg in pounds Currency conversion: 1 euro in kronor Schu¨tze: Web search 77 / 123

Guessing of user intent by “typing” queries Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Guessing of user intent by “typing” queries Calculation: 5+4 Unit conversion: 1 kg in pounds Currency conversion: 1 euro in kronor Tracking number: 8167 2278 6764 Schu¨tze: Web search 77 / 123

Guessing of user intent by “typing” queries Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Guessing of user intent by “typing” queries Calculation: 5+4 Unit conversion: 1 kg in pounds Currency conversion: 1 euro in kronor Tracking number: 8167 2278 6764 Flight info: LH 454 Schu¨tze: Web search 77 / 123

Guessing of user intent by “typing” queries Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Guessing of user intent by “typing” queries Calculation: 5+4 Unit conversion: 1 kg in pounds Currency conversion: 1 euro in kronor Tracking number: 8167 2278 6764 Flight info: LH 454 Area code: 650 Schu¨tze: Web search 77 / 123

Guessing of user intent by “typing” queries Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Guessing of user intent by “typing” queries Calculation: 5+4 Unit conversion: 1 kg in pounds Currency conversion: 1 euro in kronor Tracking number: 8167 2278 6764 Flight info: LH 454 Area code: 650 Map: columbus oh Schu¨tze: Web search 77 / 123

Guessing of user intent by “typing” queries Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Guessing of user intent by “typing” queries Calculation: 5+4 Unit conversion: 1 kg in pounds Currency conversion: 1 euro in kronor Tracking number: 8167 2278 6764 Flight info: LH 454 Area code: 650 Map: columbus oh Stock price: msft Schu¨tze: Web search 77 / 123

Guessing of user intent by “typing” queries Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Guessing of user intent by “typing” queries Calculation: 5+4 Unit conversion: 1 kg in pounds Currency conversion: 1 euro in kronor Tracking number: 8167 2278 6764 Flight info: LH 454 Area code: 650 Map: columbus oh Stock price: msft Albums/movies etc: coldplay Schu¨tze: Web search 77 / 123

I The spatial context: Geo-search /V ,..t If · I The spatial context: Geo-search

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Web page (nytimes.com article about Albania) Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Web page (nytimes.com article about Albania) User (located in Palo Alto) Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Web page (nytimes.com article about Albania) User (located in Palo Alto) Locating the user Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Web page (nytimes.com article about Albania) User (located in Palo Alto) Locating the user IP address Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Web page (nytimes.com article about Albania) User (located in Palo Alto) Locating the user IP address Information provided by user (e.g., in user profile) Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Web page (nytimes.com article about Albania) User (located in Palo Alto) Locating the user IP address Information provided by user (e.g., in user profile) Mobile phone Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Web page (nytimes.com article about Albania) User (located in Palo Alto) Locating the user IP address Information provided by user (e.g., in user profile) Mobile phone Geo-tagging: Parse text and identify the coordinates of the geographic entities Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Web page (nytimes.com article about Albania) User (located in Palo Alto) Locating the user IP address Information provided by user (e.g., in user profile) Mobile phone Geo-tagging: Parse text and identify the coordinates of the geographic entities Example: East Palo Alto CA → Latitude: 37.47 N, Longitude: 122.14 W Schu¨tze: Web search 78 / 123

The spatial context: Geo-search Recap Big picture Ads Duplicate detection Spam Web IR The spatial context: Geo-search Size of the web Three relevant locations Server (nytimes.com → New York) Web page (nytimes.com article about Albania) User (located in Palo Alto) Locating the user IP address Information provided by user (e.g., in user profile) Mobile phone Geo-tagging: Parse text and identify the coordinates of the geographic entities Example: East Palo Alto CA → Latitude: 37.47 N, Longitude: 122.14 W Important NLP problem Schu¨tze: Web search 78 / 123

I How do we use context to modify query results? /V ,..t If · I How do we use context to modify query results?

How do we use context to modify query results? Recap Big picture Ads Duplicate detection Spam Web IR Size of the web How do we use context to modify query results? Result restriction: Don’t consider inappropriate results Schu¨tze: Web search 79 / 123

How do we use context to modify query results? Recap Big picture Ads Duplicate detection Spam Web IR Size of the web How do we use context to modify query results? Result restriction: Don’t consider inappropriate results For user on google.fr . . . Schu¨tze: Web search 79 / 123

How do we use context to modify query results? Recap Big picture Ads Duplicate detection Spam Web IR Size of the web How do we use context to modify query results? Result restriction: Don’t consider inappropriate results For user on google.fr . . . . . . only show .fr results Schu¨tze: Web search 79 / 123

How do we use context to modify query results? Recap Big picture Ads Duplicate detection Spam Web IR Size of the web How do we use context to modify query results? Result restriction: Don’t consider inappropriate results For user on google.fr . . . . . . only show .fr results Ranking modulation: use a rough generic ranking, rerank based on personal context Schu¨tze: Web search 79 / 123

How do we use context to modify query results? Recap Big picture Ads Duplicate detection Spam Web IR Size of the web How do we use context to modify query results? Result restriction: Don’t consider inappropriate results For user on google.fr . . . . . . only show .fr results Ranking modulation: use a rough generic ranking, rerank based on personal context Contextualization / personalization is an area of search with a lot of potential for improvement. Schu¨tze: Web search 79 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

/V ,..t If · I Users of web search

Users of web search Use short queries (average < 3) Recap Big picture Ads Duplicate detection Users of web search Spam Web IR Size of the web Use short queries (average < 3) Schu¨tze: Web search 81 / 123

Recap Big picture Ads Duplicate detection Users of web search Spam Web IR Size of the web Use short queries (average < 3) Rarely use operators Schu¨tze: Web search 81 / 123

Recap Big picture Ads Duplicate detection Users of web search Spam Web IR Size of the web Use short queries (average < 3) Rarely use operators Don’t want to spend a lot of time on composing a query Schu¨tze: Web search 81 / 123

Recap Big picture Ads Duplicate detection Users of web search Spam Web IR Size of the web Use short queries (average < 3) Rarely use operators Don’t want to spend a lot of time on composing a query Only look at the first couple of results Schu¨tze: Web search 81 / 123

Recap Big picture Ads Duplicate detection Users of web search Spam Web IR Size of the web Use short queries (average < 3) Rarely use operators Don’t want to spend a lot of time on composing a query Only look at the first couple of results Want a simple UI, not a search engine start page overloaded with graphics Schu¨tze: Web search 81 / 123

Recap Big picture Ads Duplicate detection Users of web search Spam Web IR Size of the web Use short queries (average < 3) Rarely use operators Don’t want to spend a lot of time on composing a query Only look at the first couple of results Want a simple UI, not a search engine start page overloaded with graphics Extreme variability in terms of user needs, user expectations, experience, knowledge, . . . Schu¨tze: Web search 81 / 123

Recap Big picture Ads Duplicate detection Users of web search Spam Web IR Size of the web Use short queries (average < 3) Rarely use operators Don’t want to spend a lot of time on composing a query Only look at the first couple of results Want a simple UI, not a search engine start page overloaded with graphics Extreme variability in terms of user needs, user expectations, experience, knowledge, . . . Industrial/developing world, English/Estonian, old/young, rich/poor, differences in culture and class Schu¨tze: Web search 81 / 123

Recap Big picture Ads Duplicate detection Users of web search Spam Web IR Size of the web Use short queries (average < 3) Rarely use operators Don’t want to spend a lot of time on composing a query Only look at the first couple of results Want a simple UI, not a search engine start page overloaded with graphics Extreme variability in terms of user needs, user expectations, experience, knowledge, . . . Industrial/developing world, English/Estonian, old/young, rich/poor, differences in culture and class One interface for hugely divergent needs Schu¨tze: Web search 81 / 123

I How do users evaluate search engines? /V ,..t If · I How do users evaluate search engines?

How do users evaluate search engines? Recap Big picture Ads Duplicate detection Spam Web IR How do users evaluate search engines? Size of the web Classic IR relevance (as measured by F ) can also be used for web IR. Schu¨tze: Web search 82 / 123

How do users evaluate search engines? Recap Big picture Ads Duplicate detection Spam Web IR How do users evaluate search engines? Size of the web Classic IR relevance (as measured by F ) can also be used for web IR. Equally important: Trust, duplicate elimination, readability, loads fast, no pop-ups Schu¨tze: Web search 82 / 123

How do users evaluate search engines? Recap Big picture Ads Duplicate detection Spam Web IR How do users evaluate search engines? Size of the web Classic IR relevance (as measured by F ) can also be used for web IR. Equally important: Trust, duplicate elimination, readability, loads fast, no pop-ups On the web, precision is more important than recall. Schu¨tze: Web search 82 / 123

How do users evaluate search engines? Recap Big picture Ads Duplicate detection Spam Web IR How do users evaluate search engines? Size of the web Classic IR relevance (as measured by F ) can also be used for web IR. Equally important: Trust, duplicate elimination, readability, loads fast, no pop-ups On the web, precision is more important than recall. Precision at 1, precision at 10, precision on the first 2-3 pages Schu¨tze: Web search 82 / 123

How do users evaluate search engines? Recap Big picture Ads Duplicate detection Spam Web IR How do users evaluate search engines? Size of the web Classic IR relevance (as measured by F ) can also be used for web IR. Equally important: Trust, duplicate elimination, readability, loads fast, no pop-ups On the web, precision is more important than recall. Precision at 1, precision at 10, precision on the first 2-3 pages But there is a subset of queries where recall matters. Schu¨tze: Web search 82 / 123

I Web information needs that require high reca ll /V ,..t If · I Web information needs that require high reca ll

Web information needs that require high recall Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web information needs that require high recall Has this idea been patented? Schu¨tze: Web search 83 / 123

Web information needs that require high recall Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web information needs that require high recall Has this idea been patented? Searching for info on a prospective financial advisor Schu¨tze: Web search 83 / 123

Web information needs that require high recall Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web information needs that require high recall Has this idea been patented? Searching for info on a prospective financial advisor Searching for info on a prospective employee Schu¨tze: Web search 83 / 123

Web information needs that require high recall Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web information needs that require high recall Has this idea been patented? Searching for info on a prospective financial advisor Searching for info on a prospective employee Searching for info on a date Schu¨tze: Web search 83 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

I Web doc uments: different from other IR collections /V ,..t If · I Web doc uments: different from other IR collections

Web documents: different from other IR collections Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web documents: different from other IR collections Distributed content creation: no design, no coordination Schu¨tze: Web search 85 / 123

Web documents: different from other IR collections Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web documents: different from other IR collections Distributed content creation: no design, no coordination “Democratization of publishing” Schu¨tze: Web search 85 / 123

Web documents: different from other IR collections Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web documents: different from other IR collections Distributed content creation: no design, no coordination “Democratization of publishing” Result: extreme heterogeneity of documents on the web Schu¨tze: Web search 85 / 123

Web documents: different from other IR collections Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web documents: different from other IR collections Distributed content creation: no design, no coordination “Democratization of publishing” Result: extreme heterogeneity of documents on the web Unstructured (text, html), semistructured (html, xml), structured/relational (databases) Schu¨tze: Web search 85 / 123

Web documents: different from other IR collections Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Web documents: different from other IR collections Distributed content creation: no design, no coordination “Democratization of publishing” Result: extreme heterogeneity of documents on the web Unstructured (text, html), semistructured (html, xml), structured/relational (databases) Dynamically generated content Schu¨tze: Web search 85 / 123

/V ,..t If · I Dynamic content

Recap Big picture Ads Duplicate detection Dynamic content Spam Web IR Size of the web Dynamic pages are generated from scratch when the user requests them – usually from underlying data in a database. Schu¨tze: Web search 86 / 123

Recap Big picture Ads Duplicate detection Dynamic content Spam Web IR Size of the web Dynamic pages are generated from scratch when the user requests them – usually from underlying data in a database. Example: current status of flight LH 454 Schu¨tze: Web search 86 / 123

/V ,..t If · I Dyn ami c content (2)

Recap Big picture Ads Duplicate detection Dynamic content (2) Spam Web IR Size of the web Most (truly) dynamic content is ignored by web spiders. Schu¨tze: Web search 87 / 123

Recap Big picture Ads Duplicate detection Dynamic content (2) Spam Web IR Size of the web Most (truly) dynamic content is ignored by web spiders. It’s too much to index it all. Schu¨tze: Web search 87 / 123

Recap Big picture Ads Duplicate detection Dynamic content (2) Spam Web IR Size of the web Most (truly) dynamic content is ignored by web spiders. It’s too much to index it all. Actually, a lot of “static” content is also assembled on the fly (asp, php etc.: headers, date, ads etc) Schu¨tze: Web search 87 / 123

I Web pages change frequ ently (Fetterl y 1997) /V ,..t If · I Web pages change frequ ently (Fetterl y 1997)

Web pages change frequently (Fetterly 1997) >o..,;, t' ,+u H< ,;+u<u+u+ ,; Webi Lu++>u\u> Web pages change frequently (Fetterly 1997) 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% all .com .de .net .uk .jp .org .en .gov .·edu Schotze Web sean:h as 1t23

/V ,..t If · I Multilinguality

Multilinguality Documents in a large number of languages Recap Big picture Ads Multilinguality Duplicate detection Spam Web IR Size of the web Documents in a large number of languages Schu¨tze: Web search 89 / 123

Recap Big picture Ads Multilinguality Duplicate detection Spam Web IR Size of the web Documents in a large number of languages Queries in a large number of languages Schu¨tze: Web search 89 / 123

Recap Big picture Ads Multilinguality Duplicate detection Spam Web IR Size of the web Documents in a large number of languages Queries in a large number of languages First cut: Don’t return English results for a Japanese query Schu¨tze: Web search 89 / 123

Recap Big picture Ads Multilinguality Duplicate detection Spam Web IR Size of the web Documents in a large number of languages Queries in a large number of languages First cut: Don’t return English results for a Japanese query However: Frequent mismatches query/document languages Schu¨tze: Web search 89 / 123

Recap Big picture Ads Multilinguality Duplicate detection Spam Web IR Size of the web Documents in a large number of languages Queries in a large number of languages First cut: Don’t return English results for a Japanese query However: Frequent mismatches query/document languages Many people can understand, but not query in a language Schu¨tze: Web search 89 / 123

Recap Big picture Ads Multilinguality Duplicate detection Spam Web IR Size of the web Documents in a large number of languages Queries in a large number of languages First cut: Don’t return English results for a Japanese query However: Frequent mismatches query/document languages Many people can understand, but not query in a language Translation is important. Schu¨tze: Web search 89 / 123

Recap Big picture Ads Multilinguality Duplicate detection Spam Web IR Size of the web Documents in a large number of languages Queries in a large number of languages First cut: Don’t return English results for a Japanese query However: Frequent mismatches query/document languages Many people can understand, but not query in a language Translation is important. Google example: “Beaujolais Nouveau -wine” Schu¨tze: Web search 89 / 123

/V ,..t If · I Duplicate documents

Recap Big picture Ads Duplicate detection Duplicate documents Spam Web IR Size of the web Significant duplication – 30%–40% duplicates in some studies Schu¨tze: Web search 90 / 123

Recap Big picture Ads Duplicate detection Duplicate documents Spam Web IR Size of the web Significant duplication – 30%–40% duplicates in some studies Duplicates in the search results were common in the early days of the web. Schu¨tze: Web search 90 / 123

Recap Big picture Ads Duplicate detection Duplicate documents Spam Web IR Size of the web Significant duplication – 30%–40% duplicates in some studies Duplicates in the search results were common in the early days of the web. Today’s search engines eliminate duplicates very effectively. Schu¨tze: Web search 90 / 123

Recap Big picture Ads Duplicate detection Duplicate documents Spam Web IR Size of the web Significant duplication – 30%–40% duplicates in some studies Duplicates in the search results were common in the early days of the web. Today’s search engines eliminate duplicates very effectively. Key for high user satisfaction Schu¨tze: Web search 90 / 123

/V ,..t If · I Trust

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Trust For many collections, it is easy to assess the trustworthiness of a document. Schu¨tze: Web search 91 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Trust For many collections, it is easy to assess the trustworthiness of a document. A collection of Reuters newswire articles Schu¨tze: Web search 91 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Trust For many collections, it is easy to assess the trustworthiness of a document. A collection of Reuters newswire articles A collection of TASS (Telegraph Agency of the Soviet Union) newswire articles from the 1980s Schu¨tze: Web search 91 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Trust For many collections, it is easy to assess the trustworthiness of a document. A collection of Reuters newswire articles A collection of TASS (Telegraph Agency of the Soviet Union) newswire articles from the 1980s Your Outlook email from the last three years Schu¨tze: Web search 91 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Trust For many collections, it is easy to assess the trustworthiness of a document. A collection of Reuters newswire articles A collection of TASS (Telegraph Agency of the Soviet Union) newswire articles from the 1980s Your Outlook email from the last three years Web documents are different: In many cases, we don’t know how to evaluate the information. Schu¨tze: Web search 91 / 123

Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Trust For many collections, it is easy to assess the trustworthiness of a document. A collection of Reuters newswire articles A collection of TASS (Telegraph Agency of the Soviet Union) newswire articles from the 1980s Your Outlook email from the last three years Web documents are different: In many cases, we don’t know how to evaluate the information. Hoaxes abound. Schu¨tze: Web search 91 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

/V ,..t If · I Growth of the web

Growth of the web The web keeps growing. Recap Big picture Ads Duplicate detection Growth of the web Spam Web IR Size of the web The web keeps growing. Schu¨tze: Web search 93 / 123

Growth of the web The web keeps growing. Recap Big picture Ads Duplicate detection Growth of the web Spam Web IR Size of the web The web keeps growing. But growth is no longer exponential? Schu¨tze: Web search 93 / 123

I Size of the web: Issues /V ,..t If · I Size of the web: Issues

Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web What is size? Number of web servers? Number of pages? Terabytes of data available? Schu¨tze: Web search 94 / 123

Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web What is size? Number of web servers? Number of pages? Terabytes of data available? Some servers are seldom connected. Schu¨tze: Web search 94 / 123

Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web What is size? Number of web servers? Number of pages? Terabytes of data available? Some servers are seldom connected. Example: Your laptop running a web server Schu¨tze: Web search 94 / 123

Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web What is size? Number of web servers? Number of pages? Terabytes of data available? Some servers are seldom connected. Example: Your laptop running a web server Is it part of the web? Schu¨tze: Web search 94 / 123

Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web What is size? Number of web servers? Number of pages? Terabytes of data available? Some servers are seldom connected. Example: Your laptop running a web server Is it part of the web? The “dynamic” web is infinite. Schu¨tze: Web search 94 / 123

Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web What is size? Number of web servers? Number of pages? Terabytes of data available? Some servers are seldom connected. Example: Your laptop running a web server Is it part of the web? The “dynamic” web is infinite. Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”) Schu¨tze: Web search 94 / 123

/V ,..t If · I "Sea rch engine index contains N pages" : Issues

“Search engine index contains N pages”: Issues Recap Big picture Ads Duplicate detection Spam Web IR Size of the web “Search engine index contains N pages”: Issues Can I claim a page is in the index if I only index the first 4000 bytes? Schu¨tze: Web search 95 / 123

“Search engine index contains N pages”: Issues Recap Big picture Ads Duplicate detection Spam Web IR Size of the web “Search engine index contains N pages”: Issues Can I claim a page is in the index if I only index the first 4000 bytes? Can I claim a page is in the index if I only index anchor text pointing to the page? Schu¨tze: Web search 95 / 123

“Search engine index contains N pages”: Issues Recap Big picture Ads Duplicate detection Spam Web IR Size of the web “Search engine index contains N pages”: Issues Can I claim a page is in the index if I only index the first 4000 bytes? Can I claim a page is in the index if I only index anchor text pointing to the page? There used to be (and still are?) billions of pages that are only indexed by anchor text. Schu¨tze: Web search 95 / 123

I Simple method for determining a lower bound /V ,..t If · I Simple method for determining a lower bound

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages Schu¨tze: Web search 96 / 123

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages Schu¨tze: Web search 96 / 123

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages According to this query: Size of web ≥ 21,450,000,000 on 2007.07.07 and ≥ 25,350,000,000 on 2008.07.03 Schu¨tze: Web search 96 / 123

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages According to this query: Size of web ≥ 21,450,000,000 on 2007.07.07 and ≥ 25,350,000,000 on 2008.07.03 But page counts of google search results are only rough estimates. Schu¨tze: Web search 96 / 123

Outline 7 Size of the web Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Outline 1 Recap Big picture 2 Ads Duplicate detection Spam 6 Web IR Queries Links Context Users Documents Size 7 Size of the web Schu¨tze: Web search

I Size of the web: Wh o cares? - 1;::.... f tht> V\. E' b I Size of the web: Wh o cares?

Size of the web: Who cares? Recap Big picture Ads Duplicate detection Spam Size of the web: Who cares? Web IR Size of the web Media Schu¨tze: Web search 98 / 123

Size of the web: Who cares? Recap Big picture Ads Duplicate detection Spam Size of the web: Who cares? Web IR Size of the web Media Users Schu¨tze: Web search 98 / 123

Size of the web: Who cares? Recap Big picture Ads Duplicate detection Spam Size of the web: Who cares? Web IR Size of the web Media Users They may switch to the search engine that has the best coverage of the web. Schu¨tze: Web search 98 / 123

Size of the web: Who cares? Recap Big picture Ads Duplicate detection Spam Size of the web: Who cares? Web IR Size of the web Media Users They may switch to the search engine that has the best coverage of the web. Users (sometimes) care about recall. If we underestimate the size of the web, search engine results may have low recall. Schu¨tze: Web search 98 / 123

Size of the web: Who cares? Recap Big picture Ads Duplicate detection Spam Size of the web: Who cares? Web IR Size of the web Media Users They may switch to the search engine that has the best coverage of the web. Users (sometimes) care about recall. If we underestimate the size of the web, search engine results may have low recall. Search engine designers (how many pages do I need to be able to handle?) Schu¨tze: Web search 98 / 123

Size of the web: Who cares? Recap Big picture Ads Duplicate detection Spam Size of the web: Who cares? Web IR Size of the web Media Users They may switch to the search engine that has the best coverage of the web. Users (sometimes) care about recall. If we underestimate the size of the web, search engine results may have low recall. Search engine designers (how many pages do I need to be able to handle?) Crawler designers (which policy will crawl close to N pages?) Schu¨tze: Web search 98 / 123

What is the size of the web? Any guesses? Recap Big picture Ads Duplicate detection Spam Web IR Size of the web What is the size of the web? Any guesses? Schu¨tze: Web search

I Simple method for determining a lower bound - 1;::.... f tht> V\. E' b I Simple method for determining a lower bound

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages Schu¨tze: Web search 100 / 123

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages Schu¨tze: Web search 100 / 123

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages According to this query: Size of web ≥ 21,450,000,000 on 2007.07.07 Schu¨tze: Web search 100 / 123

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages According to this query: Size of web ≥ 21,450,000,000 on 2007.07.07 Big if: Page counts of google search results are correct. (Generally, they are just rough estimates.) Schu¨tze: Web search 100 / 123

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages According to this query: Size of web ≥ 21,450,000,000 on 2007.07.07 Big if: Page counts of google search results are correct. (Generally, they are just rough estimates.) But this is just a lower bound, based on one search engine. Schu¨tze: Web search 100 / 123

Simple method for determining a lower bound Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Simple method for determining a lower bound OR-query of frequent words in a number of languages According to this query: Size of web ≥ 21,450,000,000 on 2007.07.07 Big if: Page counts of google search results are correct. (Generally, they are just rough estimates.) But this is just a lower bound, based on one search engine. How can we do better? Schu¨tze: Web search 100 / 123

I Size of the web: Issu es - 1;::.... f tht> V\. E' b I Size of the web: Issu es

Size of the web: Issues The “dynamic” web is infinite. Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web The “dynamic” web is infinite. Schu¨tze: Web search 101 / 123

Size of the web: Issues The “dynamic” web is infinite. Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web The “dynamic” web is infinite. Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”) Schu¨tze: Web search 101 / 123

Size of the web: Issues The “dynamic” web is infinite. Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web The “dynamic” web is infinite. Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”) Many other dynamic sites generating infinite number of pages Schu¨tze: Web search 101 / 123

Size of the web: Issues The “dynamic” web is infinite. Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web The “dynamic” web is infinite. Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”) Many other dynamic sites generating infinite number of pages The static web contains duplicates – each “equivalence class” should only be counted once. Schu¨tze: Web search 101 / 123

Size of the web: Issues The “dynamic” web is infinite. Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web The “dynamic” web is infinite. Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”) Many other dynamic sites generating infinite number of pages The static web contains duplicates – each “equivalence class” should only be counted once. Some servers are seldom connected. Schu¨tze: Web search 101 / 123

Size of the web: Issues The “dynamic” web is infinite. Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web The “dynamic” web is infinite. Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”) Many other dynamic sites generating infinite number of pages The static web contains duplicates – each “equivalence class” should only be counted once. Some servers are seldom connected. Example: Your laptop Schu¨tze: Web search 101 / 123

Size of the web: Issues The “dynamic” web is infinite. Recap Big picture Ads Duplicate detection Size of the web: Issues Spam Web IR Size of the web The “dynamic” web is infinite. Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”) Many other dynamic sites generating infinite number of pages The static web contains duplicates – each “equivalence class” should only be counted once. Some servers are seldom connected. Example: Your laptop Is it part of the web? Schu¨tze: Web search 101 / 123

pages" : Issues - 1;::.... f tht> V\. E' b I "Sea rch engine index contains N pages" : Issues

“Search engine index contains N pages”: Issues Recap Big picture Ads Duplicate detection Spam Web IR Size of the web “Search engine index contains N pages”: Issues Can I claim a page is in the index if I only index the first 4000 bytes? Schu¨tze: Web search 102 / 123

“Search engine index contains N pages”: Issues Recap Big picture Ads Duplicate detection Spam Web IR Size of the web “Search engine index contains N pages”: Issues Can I claim a page is in the index if I only index the first 4000 bytes? Can I claim a page is in the index if I only index anchor text pointing to the page? Schu¨tze: Web search 102 / 123

“Search engine index contains N pages”: Issues Recap Big picture Ads Duplicate detection Spam Web IR Size of the web “Search engine index contains N pages”: Issues Can I claim a page is in the index if I only index the first 4000 bytes? Can I claim a page is in the index if I only index anchor text pointing to the page? There used to be (and still are?) billions of pages that are only indexed by anchor text. Schu¨tze: Web search 102 / 123

How can we estimate the size of the web? Recap Big picture Ads Duplicate detection Spam Web IR Size of the web How can we estimate the size of the web? Schu¨tze: Web search

- 1;::.... f tht> V\. E' b I Sampling methods

Sampling methods Random queries Recap Big picture Ads Duplicate detection Sampling methods Spam Web IR Size of the web Random queries Schu¨tze: Web search 104 / 123

Sampling methods Random queries Random searches Recap Big picture Ads Duplicate detection Sampling methods Spam Web IR Size of the web Random queries Random searches Schu¨tze: Web search 104 / 123

Sampling methods Random queries Random searches Random IP addresses Recap Big picture Ads Duplicate detection Sampling methods Spam Web IR Size of the web Random queries Random searches Random IP addresses Schu¨tze: Web search 104 / 123

Recap Big picture Ads Duplicate detection Sampling methods Spam Web IR Size of the web Random queries Random searches Random IP addresses Random walks Schu¨tze: Web search 104 / 123

I Vari ant : Estimate rel ative sizes of indexes - 1;::.... f tht> V\. E' b I Vari ant : Estimate rel ative sizes of indexes

Variant: Estimate relative sizes of indexes Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Variant: Estimate relative sizes of indexes There are significant differences between indexes of different search engines. Schu¨tze: Web search 105 / 123

Variant: Estimate relative sizes of indexes Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Variant: Estimate relative sizes of indexes There are significant differences between indexes of different search engines. Different engines have different preferences. Schu¨tze: Web search 105 / 123

Variant: Estimate relative sizes of indexes Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Variant: Estimate relative sizes of indexes There are significant differences between indexes of different search engines. Different engines have different preferences. max url depth, max count/host, anti-spam rules, priority rules etc. Schu¨tze: Web search 105 / 123

Variant: Estimate relative sizes of indexes Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Variant: Estimate relative sizes of indexes There are significant differences between indexes of different search engines. Different engines have different preferences. max url depth, max count/host, anti-spam rules, priority rules etc. Different engines index different things under the same URL. Schu¨tze: Web search 105 / 123

Variant: Estimate relative sizes of indexes Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Variant: Estimate relative sizes of indexes There are significant differences between indexes of different search engines. Different engines have different preferences. max url depth, max count/host, anti-spam rules, priority rules etc. Different engines index different things under the same URL. anchor text, frames, meta-keywords, size of prefix etc. Schu¨tze: Web search 105 / 123

.............l .i. Relative Size from Overlap [Bharat & Broder, 98] !- Sample URLs randomly from A Check if contained in B and vice versa !- A t1 B = At1B = {1I2) {116) * Size A * Size B ····· ........ r.l!f-..··-·........ ..... .............l .i. {112)*Size A= {li6)*Size B :. Size A I Size B = {116)1{112) = 113 -!----"! Each test involves: (i) Sampling (ii) Checking

- 1;::.... f tht> V\. E' b I Sampling URLs

Sampling URLs Ideal strategy: Generate a random URL Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Sampling URLs Ideal strategy: Generate a random URL Schu¨tze: Web search 107 / 123

Sampling URLs Ideal strategy: Generate a random URL Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Sampling URLs Ideal strategy: Generate a random URL Problem: Random URLs are hard to find (and sampling distribution should reflect “user interest”) Schu¨tze: Web search 107 / 123

Sampling URLs Ideal strategy: Generate a random URL Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Sampling URLs Ideal strategy: Generate a random URL Problem: Random URLs are hard to find (and sampling distribution should reflect “user interest”) Approach 1: Random walks / IP addresses Schu¨tze: Web search 107 / 123

Sampling URLs Ideal strategy: Generate a random URL Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Sampling URLs Ideal strategy: Generate a random URL Problem: Random URLs are hard to find (and sampling distribution should reflect “user interest”) Approach 1: Random walks / IP addresses In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexex) Schu¨tze: Web search 107 / 123

Sampling URLs Ideal strategy: Generate a random URL Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Sampling URLs Ideal strategy: Generate a random URL Problem: Random URLs are hard to find (and sampling distribution should reflect “user interest”) Approach 1: Random walks / IP addresses In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexex) Approach 2: Generate a random URL contained in a given engine Schu¨tze: Web search 107 / 123

Sampling URLs Ideal strategy: Generate a random URL Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Sampling URLs Ideal strategy: Generate a random URL Problem: Random URLs are hard to find (and sampling distribution should reflect “user interest”) Approach 1: Random walks / IP addresses In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexex) Approach 2: Generate a random URL contained in a given engine Suffices for accurate estimation of relative size Schu¨tze: Web search 107 / 123

I Random URLs from random queries - 1;::.... f tht> V\. E' b I Random URLs from random queries

Random URLs from random queries Recap Big picture Ads Duplicate detection Spam Web IR Random URLs from random queries Size of the web Idea: Use vocabulary of the web for query generation Schu¨tze: Web search 108 / 123

Random URLs from random queries Recap Big picture Ads Duplicate detection Spam Web IR Random URLs from random queries Size of the web Idea: Use vocabulary of the web for query generation Vocabulary can be generated from web crawl Schu¨tze: Web search 108 / 123

Random URLs from random queries Recap Big picture Ads Duplicate detection Spam Web IR Random URLs from random queries Size of the web Idea: Use vocabulary of the web for query generation Vocabulary can be generated from web crawl Use conjunctive queries w1 AND w2 Schu¨tze: Web search 108 / 123

Example: vocalists AND rsi Recap Big picture Ads Duplicate detection Spam Web IR Random URLs from random queries Size of the web Idea: Use vocabulary of the web for query generation Vocabulary can be generated from web crawl Use conjunctive queries w1 AND w2 Example: vocalists AND rsi Schu¨tze: Web search 108 / 123

Example: vocalists AND rsi Recap Big picture Ads Duplicate detection Spam Web IR Random URLs from random queries Size of the web Idea: Use vocabulary of the web for query generation Vocabulary can be generated from web crawl Use conjunctive queries w1 AND w2 Example: vocalists AND rsi Get result set of one hundred URLs from the source engine Schu¨tze: Web search 108 / 123

Example: vocalists AND rsi Recap Big picture Ads Duplicate detection Spam Web IR Random URLs from random queries Size of the web Idea: Use vocabulary of the web for query generation Vocabulary can be generated from web crawl Use conjunctive queries w1 AND w2 Example: vocalists AND rsi Get result set of one hundred URLs from the source engine Choose a random URL from the result set Schu¨tze: Web search 108 / 123

Example: vocalists AND rsi Recap Big picture Ads Duplicate detection Spam Web IR Random URLs from random queries Size of the web Idea: Use vocabulary of the web for query generation Vocabulary can be generated from web crawl Use conjunctive queries w1 AND w2 Example: vocalists AND rsi Get result set of one hundred URLs from the source engine Choose a random URL from the result set This sampling method induces a weight W (p) for each page p. Schu¨tze: Web search 108 / 123

Example: vocalists AND rsi Recap Big picture Ads Duplicate detection Spam Web IR Random URLs from random queries Size of the web Idea: Use vocabulary of the web for query generation Vocabulary can be generated from web crawl Use conjunctive queries w1 AND w2 Example: vocalists AND rsi Get result set of one hundred URLs from the source engine Choose a random URL from the result set This sampling method induces a weight W (p) for each page p. Method was used by Bharat and Broder (1998). Schu¨tze: Web search 108 / 123

I Checking if a page is in the index - 1;::.... f tht> V\. E' b I Checking if a page is in the index

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Download doc, extract words Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Download doc, extract words Use 8 low frequency word as AND query Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Download doc, extract words Use 8 low frequency word as AND query Call this a strong query for d Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Download doc, extract words Use 8 low frequency word as AND query Call this a strong query for d Run query Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Download doc, extract words Use 8 low frequency word as AND query Call this a strong query for d Run query Check if d is in result set Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Download doc, extract words Use 8 low frequency word as AND query Call this a strong query for d Run query Check if d is in result set Problems Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Download doc, extract words Use 8 low frequency word as AND query Call this a strong query for d Run query Check if d is in result set Problems Near duplicates Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Download doc, extract words Use 8 low frequency word as AND query Call this a strong query for d Run query Check if d is in result set Problems Near duplicates Redirects Schu¨tze: Web search 109 / 123

Checking if a page is in the index Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Checking if a page is in the index Either: Search for URL if the engine supports this Or: Create a query that will find doc d with high probability Download doc, extract words Use 8 low frequency word as AND query Call this a strong query for d Run query Check if d is in result set Problems Near duplicates Redirects Engine time-outs Schu¨tze: Web search 109 / 123

Computing Relative Sizes and Total Coverage [BB98] Six pair-wise overlaps fah* a - fha * h = E 1 fai* a - fia * i= E2 fae* a - fea * e = E3 fhi* h - fih* i= E4 f he* h - feh* e = Es fei* e - fie* i= E6 Arbitrarily, let a = AltaVista , e = Excite, h = HotBot, i= lnfoseek fxy =fraction of x in y We have 6 equations and 3 unknowns. Solve for e, h and i to minimize E/ Compute engine overlaps. Re-normalize so that the total joint coverage is 100%

Advantages & disadvantages Statistically sound under the induced weight. Biases induced by random query Query Bias: Favors content-rich pages in the language(s) of the lexicon Ranking Bias: Solution: Use conjunctive queries & fetch all Checking Bias: Duplicates, impoverished pages omitted Document or query restriction bias: engine might not deal properly with 8 words conjunctive query Malicious Bias: Sabotage by engine Operational Problems: Time-outs, failures, engine inconsistencies, index modification.

- 1;::.... f tht> V\. E' b I Random searches

Recap Big picture Ads Duplicate detection Random searches Spam Web IR Size of the web Choose random searches extracted from a search engine log (Lawrence & Giles 97) Schu¨tze: Web search 112 / 123

Recap Big picture Ads Duplicate detection Random searches Spam Web IR Size of the web Choose random searches extracted from a search engine log (Lawrence & Giles 97) Use only queries with small result sets Schu¨tze: Web search 112 / 123

Recap Big picture Ads Duplicate detection Random searches Spam Web IR Size of the web Choose random searches extracted from a search engine log (Lawrence & Giles 97) Use only queries with small result sets For each random query: compute ratio size(r1)/size(r2) of the two result sets Schu¨tze: Web search 112 / 123

Recap Big picture Ads Duplicate detection Random searches Spam Web IR Size of the web Choose random searches extracted from a search engine log (Lawrence & Giles 97) Use only queries with small result sets For each random query: compute ratio size(r1)/size(r2) of the two result sets Average over random searches Schu¨tze: Web search 112 / 123

I Advantages & disadvantages - 1;::.... f t h t> V\. E' b I Advantages & disadvantages II '> I L ,

Advantages & disadvantages Recap Big picture Ads Duplicate detection Spam Advantages & disadvantages Web IR Size of the web Advantage Schu¨tze: Web search 113 / 123

Advantages & disadvantages Recap Big picture Ads Duplicate detection Spam Advantages & disadvantages Web IR Size of the web Advantage Might be a better reflection of the human perception of coverage Schu¨tze: Web search 113 / 123

Advantages & disadvantages Recap Big picture Ads Duplicate detection Spam Advantages & disadvantages Web IR Size of the web Advantage Might be a better reflection of the human perception of coverage Issues Schu¨tze: Web search 113 / 123

Advantages & disadvantages Recap Big picture Ads Duplicate detection Spam Advantages & disadvantages Web IR Size of the web Advantage Might be a better reflection of the human perception of coverage Issues Samples are correlated with source of log (unfair advantage for originating search engine) Schu¨tze: Web search 113 / 123

Advantages & disadvantages Recap Big picture Ads Duplicate detection Spam Advantages & disadvantages Web IR Size of the web Advantage Might be a better reflection of the human perception of coverage Issues Samples are correlated with source of log (unfair advantage for originating search engine) Duplicates Schu¨tze: Web search 113 / 123

Advantages & disadvantages Recap Big picture Ads Duplicate detection Spam Advantages & disadvantages Web IR Size of the web Advantage Might be a better reflection of the human perception of coverage Issues Samples are correlated with source of log (unfair advantage for originating search engine) Duplicates Technical statistical problems (must have non-zero results, ratio average not statistically sound) Schu¨tze: Web search 113 / 123

Random searches [Lawr98, Lawr99] 575 & 1050 queries from the NEC Rl employee logs 6 Engines in 1998, 11 in 1999 Implementation: Restricted to queries with < 600 results in total Counted URLs from each engine after verifying query match Computed size ratio & overlap for individual queries Estimated index size ratio & overlap by averaging over all queries

Queries from Lawrence and Giles study adaptive access control neighborhood preservation topographi c hamiltonian structures right linear grammar pulse width modulation neural unbalanced prior probabilities ranked assignment method internet explorer favourites importing karvel thornber zili liu softma x activation function bose multidimensional system theory gamma mlp dvi2pdf john oliensis rieke spikes exploring neural video watermarking counterpropagation network fat shattering dimension abelson amorphous computing

Random IP addresses [Lawrence & Giles '99] Generate random IP addresses Find a web server at the given address If there's one Collect all pages from server. Method first used by O'Neill, McClain, & Lavoie, "A Methodology for Sampling the World Wide Web", 1997.

I Random IP addresses [0Nei97 ,Lawr99] - 1;::.... f tht> V\. E' b I Random IP addresses [0Nei97 ,Lawr99]

Random IP addresses [ONei97,Lawr99] Recap Big picture Ads Duplicate detection Spam Web IR Random IP addresses [ONei97,Lawr99] Size of the web [Lawr99] exhaustively crawled 2500 servers and extrapolated Schu¨tze: Web search 117 / 123

Random IP addresses [ONei97,Lawr99] Recap Big picture Ads Duplicate detection Spam Web IR Random IP addresses [ONei97,Lawr99] Size of the web [Lawr99] exhaustively crawled 2500 servers and extrapolated Estimated size of the web to be 800 million Schu¨tze: Web search 117 / 123

I Advantages and disadvantages - 1;::.... f t h t> V\. E' b I Advantages and disadvantages II ,-; I L ,

Advantages and disadvantages Recap Big picture Ads Duplicate detection Spam Advantages and disadvantages Web IR Size of the web Advantages Schu¨tze: Web search 118 / 123

Advantages and disadvantages Recap Big picture Ads Duplicate detection Spam Advantages and disadvantages Web IR Size of the web Advantages Can, in theory, estimate the size of the accessible web (as opposed to the (relative) size of an index) Schu¨tze: Web search 118 / 123

Advantages and disadvantages Recap Big picture Ads Duplicate detection Spam Advantages and disadvantages Web IR Size of the web Advantages Can, in theory, estimate the size of the accessible web (as opposed to the (relative) size of an index) Clean statistics Schu¨tze: Web search 118 / 123

Advantages and disadvantages Recap Big picture Ads Duplicate detection Spam Advantages and disadvantages Web IR Size of the web Advantages Can, in theory, estimate the size of the accessible web (as opposed to the (relative) size of an index) Clean statistics Independent of crawling strategies Schu¨tze: Web search 118 / 123

Advantages and disadvantages Recap Big picture Ads Duplicate detection Spam Advantages and disadvantages Web IR Size of the web Advantages Can, in theory, estimate the size of the accessible web (as opposed to the (relative) size of an index) Clean statistics Independent of crawling strategies Disadvantages Schu¨tze: Web search 118 / 123

Advantages and disadvantages Recap Big picture Ads Duplicate detection Spam Advantages and disadvantages Web IR Size of the web Advantages Can, in theory, estimate the size of the accessible web (as opposed to the (relative) size of an index) Clean statistics Independent of crawling strategies Disadvantages Many hosts share one IP (→ oversampling) Schu¨tze: Web search 118 / 123

Advantages and disadvantages Recap Big picture Ads Duplicate detection Spam Advantages and disadvantages Web IR Size of the web Advantages Can, in theory, estimate the size of the accessible web (as opposed to the (relative) size of an index) Clean statistics Independent of crawling strategies Disadvantages Many hosts share one IP (→ oversampling) Hosts with large web sites don’t get more weight than hosts with small web sites (→ possible undersampling) Schu¨tze: Web search 118 / 123

Advantages and disadvantages Recap Big picture Ads Duplicate detection Spam Advantages and disadvantages Web IR Size of the web Advantages Can, in theory, estimate the size of the accessible web (as opposed to the (relative) size of an index) Clean statistics Independent of crawling strategies Disadvantages Many hosts share one IP (→ oversampling) Hosts with large web sites don’t get more weight than hosts with small web sites (→ possible undersampling) Sensitive to spam (multiple IPs for same spam server) Schu¨tze: Web search 118 / 123

Advantages and disadvantages Recap Big picture Ads Duplicate detection Spam Advantages and disadvantages Web IR Size of the web Advantages Can, in theory, estimate the size of the accessible web (as opposed to the (relative) size of an index) Clean statistics Independent of crawling strategies Disadvantages Many hosts share one IP (→ oversampling) Hosts with large web sites don’t get more weight than hosts with small web sites (→ possible undersampling) Sensitive to spam (multiple IPs for same spam server) Again, duplicates Schu¨tze: Web search 118 / 123

Random walks View the Web as a directed graph Build a random walk on this graph Includes va rious "jump" rules back to visited sites Does not get stuck in spider traps! Can follow all links! Converges to a stationary distribution Must assume graph is finite and independent of the walk. Conditions are not satisfied (cookie crumbs, flooding) Time to convergence not really known Sample from stationary distribution of walk Use the "strong query" method to check coverage bySE

Dependence on seed list How well connected is the graph? [Broder et al., WWW9] •

Subject to link sparnming Advantages & disadvantages Advantages "Statistically clean" method at least in theory! Could work even for infinite web (assuming convergence) under certain metrics. Disadvantages List of seeds is a problem. Practical approximation might not be valid. Non-uniform distribution Subject to link sparnming

- 1;::.... f tht> V\. E' b I Conclusion

Conclusion Many different approaches to web size estimation. Recap Big picture Conclusion Ads Duplicate detection Spam Web IR Size of the web Many different approaches to web size estimation. Schu¨tze: Web search 122 / 123

Recap Big picture Conclusion Ads Duplicate detection Spam Web IR Size of the web Many different approaches to web size estimation. None is perfect. Schu¨tze: Web search 122 / 123

Recap Big picture Conclusion Ads Duplicate detection Spam Web IR Size of the web Many different approaches to web size estimation. None is perfect. The problem has gotten much harder. Schu¨tze: Web search 122 / 123

Recap Big picture Conclusion Ads Duplicate detection Spam Web IR Size of the web Many different approaches to web size estimation. None is perfect. The problem has gotten much harder. There hasn’t been a good study for a couple of years. Schu¨tze: Web search 122 / 123

Recap Big picture Conclusion Ads Duplicate detection Spam Web IR Size of the web Many different approaches to web size estimation. None is perfect. The problem has gotten much harder. There hasn’t been a good study for a couple of years. Great topic for a thesis! Schu¨tze: Web search 122 / 123

- 1;::.... f tht> V\. E' b I Reso urces

Resources Chapter 19 of IIR Recap Big picture Ads Duplicate detection Spam Web IR Size of the web Chapter 19 of IIR Schu¨tze: Web search 123 / 123

Resources Chapter 19 of IIR Resources at http://cislmu.org Recap Big picture Resources Ads Duplicate detection Spam Web IR Size of the web Chapter 19 of IIR Resources at http://cislmu.org Schu¨tze: Web search 123 / 123

Resources Chapter 19 of IIR Resources at http://cislmu.org Recap Big picture Resources Ads Duplicate detection Spam Web IR Size of the web Chapter 19 of IIR Resources at http://cislmu.org Schu¨tze: Web search 123 / 123

Resources Chapter 19 of IIR Resources at http://cislmu.org Recap Big picture Resources Ads Duplicate detection Spam Web IR Size of the web Chapter 19 of IIR Resources at http://cislmu.org Hal Varian explains Google second price auction: http://www.youtube.com/watch?v=K7l0a2PVhPQ Size of the web queries Schu¨tze: Web search 123 / 123

Resources Chapter 19 of IIR Resources at http://cislmu.org Recap Big picture Resources Ads Duplicate detection Spam Web IR Size of the web Chapter 19 of IIR Resources at http://cislmu.org Hal Varian explains Google second price auction: http://www.youtube.com/watch?v=K7l0a2PVhPQ Size of the web queries Trademark issues (Geico and Vuitton cases) Schu¨tze: Web search 123 / 123

Resources Chapter 19 of IIR Resources at http://cislmu.org Recap Big picture Resources Ads Duplicate detection Spam Web IR Size of the web Chapter 19 of IIR Resources at http://cislmu.org Hal Varian explains Google second price auction: http://www.youtube.com/watch?v=K7l0a2PVhPQ Size of the web queries Trademark issues (Geico and Vuitton cases) How ads are priced Schu¨tze: Web search 123 / 123

Resources Chapter 19 of IIR Resources at http://cislmu.org Recap Big picture Resources Ads Duplicate detection Spam Web IR Size of the web Chapter 19 of IIR Resources at http://cislmu.org Hal Varian explains Google second price auction: http://www.youtube.com/watch?v=K7l0a2PVhPQ Size of the web queries Trademark issues (Geico and Vuitton cases) How ads are priced Henzinger, Finding near-duplicate web pages: A large-scale evaluation of algorithms, ACM SIGIR 2006. Schu¨tze: Web search 123 / 123