Searching the Web From Chris Manning’s slides for Lecture 8: IIR Chapter 19.


Basic assumptions of Classic Information Retrieval Corpus: Fixed document collection Goal: Retrieve documents with content that is relevant to user’s information need

Classic IR Goal Classic relevance For each query Q and stored document D in a given corpus assume there exists relevance Score(Q, D) Score is average over users U and contexts C Optimize Score(Q, D) as opposed to Score(Q, D, U, C) That is, usually: Context ignored Individuals ignored Corpus predetermined Bad assumptions in the web context

[Figure: search results page. Algorithmic results = Audience; Advertisements = Monetization]

Brief (non-technical) history. Early keyword-based engines: Altavista, Excite, Infoseek, Inktomi, Lycos, ca. Sponsored search ranking: Goto.com (morphed into Overture.com → Yahoo!). Your search ranking depended on how much you paid. Auction for keywords: casino was expensive!

Brief (non-technical) history 1998+: Link-based ranking pioneered by Google Blew away all early engines except Inktomi Great user experience in search of a business model Meanwhile Goto/Overture’s annual revenues were nearing $1 billion Result: Google added paid-placement “ads” to the side, independent of search results 2003: Yahoo follows suit, acquiring Overture (for paid placement) and Inktomi (for search)

Web search basics [Figure: architecture diagram with the Web, web spider, indexer, indexes, ad indexes, search, and user]

Web search engine pieces Spider (a.k.a. crawler/robot) – builds corpus Collects web pages recursively For each known URL, fetch the page, parse it, and extract new URLs Repeat Additional pages from direct submissions & other sources The indexer – creates inverted indexes Various policies wrt which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc. Query processor – serves query results Front end – query reformulation, word stemming, capitalization, optimization of Booleans, etc. Back end – finds matching documents and ranks them
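The spider loop described above (fetch a known URL, parse it, extract new URLs, repeat) can be sketched in a few lines of Python. This is a minimal illustration, not how any production crawler works: the `crawl` function, the `LinkExtractor` class, and the three-page `TOY_WEB` dictionary are all hypothetical names introduced here, and the `fetch` argument stands in for a real HTTP client so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: fetch each known URL, parse it, queue new URLs."""
    frontier = deque(seeds)
    seen = set(seeds)
    corpus = {}
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:  # unreachable or missing page
            continue
        corpus[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return corpus


# Hypothetical three-page web, used instead of real HTTP requests.
TOY_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": '<a href="/a">A</a> <a href="/missing">?</a>',
}

pages = crawl(["http://example.test/"], TOY_WEB.get)
print(sorted(pages))
```

A real spider would add politeness delays, robots.txt handling, and URL normalization, none of which are shown here.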

The Web: No design/co-ordination. Distributed content creation, linking. Content includes truth, lies, obsolete information, contradictions … Structured (databases), semi-structured … Scale larger than previous text corpora … (now, corporate records). Growth: slowed down from initial “volume doubling every few months”. Content can be dynamically generated.

The user Diverse in access methodology Increasingly, high bandwidth connectivity Growing segment of mobile users: limitations of form factor – keyboard, display Diverse in search methodology Search, search + browse, filter by attribute … Average query length ~ 2.3 terms Has to do with what they’re searching for Poor comprehension of syntax Early engines surfaced rich syntax – Boolean, phrase, etc. Current engines hide these

The user Diverse in background/training Although this is improving Few try using the CD ROM drive as a cupholder Increasingly, can tell a search bar from the URL bar Although this matters less now Increasingly, comprehend UI elements such as the vertical slider But browser real estate “above the fold” is still at a premium

The user: information needs
Informational: want to learn about something (~25%). Example: Low hemoglobin
Navigational: want to go to that page (~40%). Example: United Airlines
Transactional: want to do something (web-mediated) (~35%)
  Access a service. Example: Mendocino weather
  Downloads. Example: Mars surface images
  Shop. Example: Nikon CoolPix
Gray areas. Example: Car rental Finland
  Find a good hub
  Exploratory search “see what’s there”

Users’ empirical evaluation of results Quality of pages varies widely Relevance is not enough Other desirable qualities (non IR!!) Content: Trustworthy, new info, non-duplicates, well maintained, Web readability: display correctly & fast No annoyances: pop-ups, etc Precision vs. recall On the web, recall seldom matters What matters Precision at 1? Precision above the fold? Comprehensiveness – must be able to deal with obscure queries Recall matters when the number of matches is very small User perceptions may be unscientific, but are significant over a large aggregate

Algorithmic results Already, more than a list of docs Moving towards identifying a user’s task (As expressed in the query box) Instead: enabling means for task completion

Sponsored search

A sponsored search ad

Ads go in slots like these

Higher slots get more clicks

Engine: Three sub-problems 1. Match ads to query/context 2. Order the ads 3. Pricing on a click-through [Slide annotations: IR, Econ]

Spam Search Engine Optimization

The trouble with search ads … It costs money. What’s the alternative? Search Engine Optimization: “Tuning” your web page to rank highly in the search results for select keywords Alternative to paying for placement Thus, intrinsically a marketing function Performed by companies, webmasters and consultants (“Search engine optimizers”) for their clients Some perfectly legitimate, some very shady

The spam industry

Simplest forms. First generation engines relied heavily on tf/idf. The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s. SEOs responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort. Often, the repetitions would be in the same color as the background of the web page: repeated terms got indexed by crawlers, but were not visible to humans on browsers. Term density is not a trusted IR signal.
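To see why dense repetition worked against a naive scorer, here is a small sketch assuming the simplest possible tf/idf variant: raw term count times log(N/df), with no length normalization or dampening. The scoring function, the three toy documents, and their names are illustrative assumptions, not the formula of any actual engine.

```python
import math
from collections import Counter


def tf_idf_score(query_terms, doc_tokens, doc_freq, n_docs):
    """Score = sum over query terms of raw term count times idf."""
    counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df:
            score += counts[term] * math.log(n_docs / df)
    return score


# Hypothetical three-document corpus.
docs = {
    "honest": "our maui resort offers ocean views and a quiet beach".split(),
    "stuffed": ("maui resort " * 20).split(),
    "other": "discount hotel booking in london".split(),
}
# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
query = ["maui", "resort"]
scores = {name: tf_idf_score(query, tokens, doc_freq, len(docs))
          for name, tokens in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))
```

Under these assumptions the stuffed page outranks the honest one purely because of repetition, which is the weakness the slide points at.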

Variants of keyword stuffing Misleading meta-tags, excessive repetition Hidden text with colors, style sheet tricks, etc. Meta-Tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”

Cloaking: Serve fake content to search engine spider. DNS cloaking: Switch IP address. Impersonate. [Figure: decision diagram. Is this a search engine spider? Y: serve SPAM. N: serve real doc.]

More spam techniques. Doorway pages: Pages optimized for a single keyword that redirect to the real target page. Link spamming: Mutual admiration societies, hidden links, awards (more on these later). Domain flooding: numerous domains that point or redirect to a target page. Robots: Fake query stream (rank checking programs); “curve-fit” ranking programs of search engines; millions of submissions via Add-Url.

The war against spam Quality signals - Prefer authoritative pages based on: Votes from authors (linkage signals) Votes from users (usage signals) Policing of URL submissions Anti robot test Limits on meta-keywords Robust link analysis Ignore statistically implausible linkage (or text) Use link analysis to detect spammers (guilt by association) Spam recognition by machine learning Training set based on known spam Family friendly filters Linguistic analysis, general classification techniques, etc. For images: flesh tone detectors, source text analysis, etc. Editorial intervention Blacklists Top queries audited Complaints addressed Suspect pattern detection

More on spam Web search engines have policies on SEO practices they tolerate/block Adversarial IR: the unending (technical) battle between SEOs and web search engines Research

Size of the web

What is the size of the web ? Issues The web is really infinite Dynamic content, e.g., calendar Soft 404: is a valid page Static web contains syntactic duplication, mostly due to mirroring (~20-30%) Some servers are seldom connected Who cares? Media, and consequently the user Engine design Engine crawl policy. Impact on recall

What can we attempt to measure? The relative size of search engines. The notion of a page being indexed is still reasonably well defined. Already there are problems. Document extension: e.g. engines index pages not yet crawled, by indexing anchor text. Document restriction: Some engines restrict what is indexed (first n words, only relevant words, etc.) The coverage of a search engine relative to another particular crawling process.

New definition ? IQ is whatever the IQ tests measure. The statically indexable web is whatever search engines index. Different engines have different preferences max url depth, max count/host, anti-spam rules, priority rules, etc. Different engines index different things under the same URL: frames, meta-keywords, document restrictions, document extensions,...

Sampling URLs Ideal strategy: Generate a random URL and check for containment in each index. Problem: Random URLs are hard to find! Enough to generate a random URL contained in a given Engine. Key lesson.

Sampling methods Random queries Random searches Random IP addresses Random walks

Relative Size from Overlap [Bharat & Broder, 98]
Sample URLs randomly from A. Check if contained in B, and vice versa. Each test involves: (i) Sampling (ii) Checking.
Example: A ∩ B = (1/2) * Size A and A ∩ B = (1/6) * Size B, so (1/2) * Size A = (1/6) * Size B ⇒ Size A / Size B = (1/6)/(1/2) = 1/3
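The overlap argument can be played out on toy data. The sketch below assumes we can sample uniformly from each index, which is exactly what is hard in practice; the function `estimate_size_ratio` and the two synthetic "indexes" are hypothetical, sized so that half of A and one sixth of B lie in the intersection, as in the slide's example.

```python
import random


def estimate_size_ratio(index_a, index_b, sample_size, rng):
    """Estimate |A| / |B| from two sample-and-check experiments."""
    sample_a = rng.sample(sorted(index_a), sample_size)
    sample_b = rng.sample(sorted(index_b), sample_size)
    # Fraction of A's sample found in B estimates |A ∩ B| / |A|.
    frac_a_in_b = sum(url in index_b for url in sample_a) / sample_size
    # Fraction of B's sample found in A estimates |A ∩ B| / |B|.
    frac_b_in_a = sum(url in index_a for url in sample_b) / sample_size
    # |A ∩ B| = frac_a_in_b * |A| = frac_b_in_a * |B|
    return frac_b_in_a / frac_a_in_b


# Synthetic indexes: |A| = 2000, |B| = 6000, |A ∩ B| = 1000.
index_a = set(range(0, 2000))
index_b = set(range(1000, 7000))

ratio = estimate_size_ratio(index_a, index_b, 1000, random.Random(0))
print(f"estimated |A|/|B| = {ratio:.2f} (true value 1/3)")
```

The estimate is only as good as the uniformity of the samples, which motivates the sampling methods discussed on the following slides.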

Random URLs from random queries [Bharat & Broder, 98]. Generate random query: Lexicon: 400,000+ words from a crawl of Yahoo! Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi. Get 100 result URLs from the source engine. Choose a random URL as the candidate to check for presence in other engines. This distribution induces a probability weight W(p) for each page. Conjecture: W(SE1) / W(SE2) ~= |SE1| / |SE2|

Query Based Checking. “Strong Query” for sample document: Download document. Get list of words. Use 8 low frequency words as conjunctive query. Check if sample is present in result set. Problems: Near duplicates; Frames; Redirects; Engine time-outs (might be better to use e.g. 5 distinct conjunctive queries of 6 words each); Document extensions; Document restrictions … Sequence of improvements to Bar-Yossef & Gurevich 2006
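A strong query can be built mechanically once word frequencies are available. The sketch below is one plausible reading of "use 8 low frequency words as conjunctive query"; the function name `strong_query`, the `corpus_freq` table, and the tie-breaking rule (alphabetical) are assumptions made for illustration.

```python
def strong_query(doc_tokens, corpus_freq, k=8):
    """Return the k rarest distinct words of a document as an AND query."""
    distinct = set(doc_tokens)
    # Words missing from the frequency table count as frequency 0 (rarest).
    rarest = sorted(distinct, key=lambda w: (corpus_freq.get(w, 0), w))[:k]
    return " AND ".join(rarest)


# Hypothetical corpus-wide frequencies.
corpus_freq = {
    "the": 1000, "of": 900, "network": 50,
    "dimension": 40, "fat": 30, "shattering": 2,
}
tokens = "the fat shattering dimension of the network".split()
print(strong_query(tokens, corpus_freq, k=3))
```

With k=3 on this toy input the query is `shattering AND fat AND dimension`; the document is then considered present in an engine if it appears in that query's result set.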

Random searches Choose random searches extracted from a local log [Lawrence & Giles 97] or build “random searches” [Notess] Use only queries with small result sets. Count normalized URLs in result sets. Use ratio statistics Some particularly flawed implementations also exist

Random searches [Lawr98, Lawr99]. 575 & 1050 queries from the NEC RI employee logs. 6 engines in 1998, 11 in 1999. Implementation: Restricted to queries with < 600 results in total. Counted URLs from each engine after verifying query match. Computed size ratio & overlap for individual queries. Estimated index size ratio & overlap by averaging over all queries. Issues: Samples are correlated with source of log. Duplicates. Technical statistical problems (must have non-zero results, ratio average, use harmonic mean?)

Queries from Lawrence and Giles study: adaptive access control; neighborhood preservation topographic; hamiltonian structures; right linear grammar; pulse width modulation neural; unbalanced prior probabilities; ranked assignment method; internet explorer favourites importing; karvel thornber; zili liu; softmax activation function; bose multidimensional system theory; gamma mlp; dvi2pdf; john oliensis; rieke spikes exploring neural; video watermarking; counterpropagation network; fat shattering dimension; abelson amorphous computing

Random IP addresses [Lawrence & Giles ‘99] Generate random IP addresses Find, if possible, a web server at the given address Collect all pages from server. Method first used by O’Neill, McClain, & Lavoie, “A Methodology for Sampling the World Wide Web”,

Random IP addresses [ONei97, Lawr99] HTTP requests to random IP addresses Ignored: empty or authorization required or excluded [Lawr99] Estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers. OCLC using IP sampling found 8.7 M hosts in 2001 Netcraft [Netc02] accessed 37.2 million hosts in July 2002 [Lawr99] exhaustively crawled 2500 servers and extrapolated Estimated size of the web to be 800 million Estimated use of metadata descriptors: Meta tags (keywords, description) in 34% of home pages, Dublin core metadata in 0.3%

Advantages & disadvantages Advantages Clean statistics Independent of crawling strategies Disadvantages Doesn’t deal with duplication Many hosts might share one IP or not accept No guarantee all pages are linked to root page. Power law for # pages/host generates bias towards sites with few pages. But bias can be accurately quantified IF underlying distribution understood Potentially influenced by spamming (multiple IP’s for same server to avoid IP block)

Random walks [Henzinger & al.,‘99 & WWW9] View the Web as a directed graph from a given list of seeds. Build a random walk on this graph Includes various “jump” rules back to visited sites Does not get stuck in spider traps! Can follow all links! Converges to a stationary distribution Must assume graph is finite and independent of the walk. Conditions are not satisfied (cookie crumbs, flooding) Time to convergence not really known Sample from stationary distribution of walk Use the “strong query” method to check coverage by SE Related papers: [Bar Yossef & al, VLDB 2000], [Rusmevichientong & al, 2001], [Bar Yossef & al, 2003]
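A toy version of such a walk is sketched below. It is not the procedure from the cited papers: the jump rule here (with a fixed probability, or at a dead end, jump to a uniformly chosen previously visited page) is a simplifying assumption, and the four-node graph, including a self-linking "trap" page, is hypothetical.

```python
import random
from collections import Counter


def random_walk(graph, seeds, steps, jump_prob, rng):
    """Walk the link graph, occasionally jumping back to a visited page."""
    current = rng.choice(seeds)
    visited = [current]  # pages seen so far, used as jump targets
    seen = {current}
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        out_links = graph.get(current, [])
        if not out_links or rng.random() < jump_prob:
            current = rng.choice(visited)    # jump rule
        else:
            current = rng.choice(out_links)  # follow a random link
        if current not in seen:
            seen.add(current)
            visited.append(current)
    return visits


# Hypothetical graph; "trap" links only to itself (a spider trap).
graph = {
    "a": ["b", "c"],
    "b": ["c", "trap"],
    "c": ["a"],
    "trap": ["trap"],
}
visits = random_walk(graph, ["a"], 5000, 0.15, random.Random(1))
for page, count in visits.most_common():
    print(page, count)
```

The visit counts approximate the walk's stationary distribution, which is non-uniform; that is the bias listed among the disadvantages of the method.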

Advantages & disadvantages Advantages “Statistically clean” method at least in theory! Could work even for infinite web (assuming convergence) under certain metrics. Disadvantages List of seeds is a problem. Practical approximation might not be valid. Non-uniform distribution Subject to link spamming Still has all the problems associated with “strong queries”

Conclusions No sampling solution is perfect. Lots of new ideas but the problem is getting harder Quantitative studies are fascinating and a good research problem

Duplicate detection

Duplicate/Near-Duplicate Detection. Duplication: Exact match with fingerprints. Near-Duplication: Approximate match. Overview: Compute syntactic similarity with an edit-distance measure. Use similarity threshold to detect near-duplicates, e.g., Similarity > 80% => Documents are “near duplicates”. Not transitive, though sometimes used transitively.

Computing Similarity. Segments of a document (natural or artificial breakpoints) [Brin95]. Shingles (Word k-Grams) [Brin95, Brod98]: “a rose is a rose is a rose” => a_rose_is_a, rose_is_a_rose, is_a_rose_is. Similarity measure between two docs (= sets of shingles): Set intersection [Brod98] (specifically, Size_of_Intersection / Size_of_Union), i.e., the Jaccard measure.
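The shingling example and the Jaccard measure translate directly into code. This sketch assumes whitespace tokenization and k = 4 (which reproduces the three shingles in the rose example); the function names and the second sentence are illustrative.

```python
def shingles(text, k=4):
    """Return the set of word k-grams of a text, joined with underscores."""
    words = text.split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(set_a, set_b):
    """Size of intersection divided by size of union."""
    if not set_a and not set_b:
        return 1.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(set_a | set_b)


doc1 = shingles("a rose is a rose is a rose")
doc2 = shingles("a rose is a rose is a flower")
print(sorted(doc1))
print(jaccard(doc1, doc2))
```

Because shingles form a set, the repeated 4-grams in the rose sentence collapse to three distinct shingles, as on the slide.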

Shingles + Set Intersection. Computing exact set intersection of shingles between all pairs of documents is expensive. Approximate using a cleverly chosen subset of shingles from each (a sketch). Estimate Jaccard from a short sketch. Create a “sketch vector” (e.g., of size 200) for each document. Documents which share more than t (say 80%) corresponding vector elements are similar. For doc d, sketch_d[i] is computed as follows: Let f map all shingles in the universe to 0..2^m. Let π_i be a specific random permutation on 0..2^m. Pick MIN π_i(f(s)) over all shingles s in d.
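A sketch vector along these lines can be written as follows. Two substitutions are assumptions of this example rather than part of the slides: f is taken to be a truncated SHA-1 fingerprint, and each random permutation π_i is approximated by a random linear map x -> (a*x + b) mod p over a prime p, which is a bijection but not a uniformly random permutation. The synthetic shingle sets have a true Jaccard similarity of 2/3.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus


def fingerprint(shingle):
    """f: map a shingle to an integer in 0..PRIME-1 (assumed hash choice)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PRIME


def make_permutations(n, seed=0):
    """Stand-ins for pi_1..pi_n: random (a, b) pairs with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def sketch(shingle_set, perms):
    """sketch[i] = min over shingles s of pi_i(f(s)); needs a non-empty set."""
    prints = [fingerprint(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in prints) for a, b in perms]


def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


perms = make_permutations(200)
doc1 = {f"s{i}" for i in range(100)}      # synthetic shingles
doc2 = {f"s{i}" for i in range(20, 120)}  # 80 shared, 120 in the union
estimate = estimated_jaccard(sketch(doc1, perms), sketch(doc2, perms))
print(f"estimated Jaccard = {estimate:.2f} (true value 2/3)")
```

Comparing two 200-element sketches replaces a full set intersection, at the cost of a sampling error that shrinks as the sketch grows.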

Computing Sketch[i] for Doc1: Start with 64 bit shingles. Permute on the number line with π_i. Pick the min value. [Figure: number-line diagram for the document]

Test if Doc1.Sketch[i] = Doc2.Sketch[i]. [Figure: Document 1 and Document 2 with their min values, labeled A and B] Are these equal? Test for 200 random permutations: π_1, π_2, … π_200

However… A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection). This happens with probability Size_of_intersection / Size_of_union. Why?

Set Similarity (Jaccard measure). View sets as columns of a matrix; one row for each element in the universe. a_ij = 1 indicates presence of item i in set j. Example: [columns C1 and C2 of the example matrix were lost in extraction] sim_J(C1, C2) = 2/5 = 0.4

Key Observation
For columns Ci, Cj, four types of rows:

     Ci  Cj
A    1   1
B    1   0
C    0   1
D    0   0

Overload notation: A = # of rows of type A
Claim: sim_J(Ci, Cj) = A / (A + B + C)

Min Hashing. Randomly permute rows. h(Ci) = index of first row with 1 in column Ci. Surprising Property: P[h(Ci) = h(Cj)] = sim_J(Ci, Cj). Why? Both are A/(A+B+C). Look down columns Ci, Cj until first non-Type-D row. h(Ci) = h(Cj) ⟺ that row is a type A row.
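The property can be checked exactly on a small matrix by enumerating every row permutation. The two six-row columns below are hypothetical, chosen so that A = 2, B = 1, C = 2, D = 1 and the Jaccard similarity is 2/5; the function names are likewise introduced only for this sketch.

```python
from fractions import Fraction
from itertools import permutations


def min_hash(column, row_order):
    """Position, in the permuted order, of the first row holding a 1."""
    for position, row in enumerate(row_order):
        if column[row] == 1:
            return position
    return None  # column is all zeros


def collision_probability(col_i, col_j):
    """Exact P[h(Ci) = h(Cj)] over all row permutations."""
    total = agree = 0
    for order in permutations(range(len(col_i))):
        total += 1
        if min_hash(col_i, order) == min_hash(col_j, order):
            agree += 1
    return Fraction(agree, total)


def jaccard_columns(col_i, col_j):
    """A / (A + B + C) using the row-type counts."""
    type_a = sum(1 for x, y in zip(col_i, col_j) if x == 1 and y == 1)
    type_b_or_c = sum(1 for x, y in zip(col_i, col_j) if x != y)
    return Fraction(type_a, type_a + type_b_or_c)


c1 = [0, 1, 1, 0, 1, 0]  # hypothetical column
c2 = [1, 0, 1, 0, 1, 1]  # hypothetical column
print(collision_probability(c1, c2), jaccard_columns(c1, c2))
```

Enumeration over all 720 permutations is feasible only for tiny matrices; in practice the probability is estimated from a few hundred permutations, as in the sketch vectors.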