Search Engines Special Topics. Results summaries.


Search Engines Special Topics

Results summaries

Summaries
Having ranked the documents matching a query, we wish to present a results list. Most commonly this is the document title plus a short summary. The title is typically extracted automatically from document metadata. What about the summaries?

Summaries
Two basic kinds: static and dynamic. A static summary of a document is always the same, regardless of the query that hit the doc. Dynamic summaries are query-dependent: they attempt to explain why the document was retrieved for the query at hand.

Static summaries
In typical systems, the static summary is a subset of the document.
- Simplest heuristic: the first 50 (or so; this can be varied) words of the document, cached at indexing time.
- More sophisticated: extract from each document a set of "key" sentences. Simple NLP heuristics score each sentence; the summary is made up of the top-scoring sentences.
- Most sophisticated: NLP is used to synthesize a summary. Seldom used in IR; cf. text summarization work.
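A minimal sketch of the middle option, extracting top-scoring "key" sentences. The scoring features here (a lead-sentence bonus and a mid-length preference) are illustrative assumptions, not the exact heuristics of any particular system:

```python
import re

def static_summary(doc: str, max_sentences: int = 2) -> str:
    """Pick the top-scoring sentences and return them in document order."""
    sentences = re.split(r'(?<=[.!?])\s+', doc.strip())

    def score(idx: int, sent: str) -> float:
        s = 0.0
        if idx == 0:                       # lead sentences often summarize
            s += 2.0
        if 5 <= len(sent.split()) <= 25:   # prefer mid-length sentences
            s += 1.0
        return s

    ranked = sorted(enumerate(sentences), key=lambda p: -score(*p))
    chosen = sorted(idx for idx, _ in ranked[:max_sentences])
    return ' '.join(sentences[i] for i in chosen)
```

Since a static summary is query-independent, it can be computed once and cached at indexing time, exactly as the slide suggests.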

Dynamic summaries
Present one or more "windows" within the document that contain several of the query terms ("KWIC" snippets: KeyWord In Context presentation). Generated in conjunction with scoring:
- If the query is found as a phrase, show the/some occurrences of the phrase in the doc.
- If not, show windows within the doc that contain multiple query terms.
The summary itself gives the entire content of the window (all terms, not only the query terms). How?

Generating dynamic summaries
If we have only a positional index, we cannot (easily) reconstruct the context surrounding hits. If we cache the documents at index time, we can run the window through the cached copy, cueing to hits found in the positional index. E.g., the positional index says "the query is a phrase in position 4378", so we go to this position in the cached document and stream out the content. Most often, a fixed-size prefix of the doc is cached. Note: the cached copy can be outdated.
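Assuming documents are cached at index time, a toy version of the windowing step might look like this (the function name and the fixed word window are illustrative assumptions):

```python
def kwic_snippet(cached_doc: str, query_terms: list[str], window: int = 10) -> str:
    """Slide a fixed-size word window over the cached copy and return
    the window containing the most query terms (keyword in context)."""
    words = cached_doc.split()
    terms = {t.lower() for t in query_terms}
    best_start, best_hits = 0, 0
    for start in range(max(1, len(words) - window + 1)):
        hits = sum(1 for w in words[start:start + window]
                   if w.lower().strip('.,;:') in terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return '... ' + ' '.join(words[best_start:best_start + window]) + ' ...'
```

A production system would additionally snap the window to sentence or phrase boundaries, per the well-formedness goals on the next slide.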

Dynamic summaries
Producing good dynamic summaries is a tricky optimization problem:
- The real estate for the summary is normally small and fixed.
- Want a short item, so as to show as many KWIC matches as possible, plus perhaps other things like the title.
- Want snippets long enough to be useful.
- Want linguistically well-formed snippets: users prefer snippets that contain complete phrases.
- Want snippets maximally informative about the doc.
But users really like snippets, even if they complicate IR system design.

Anti-spamming

Adversarial IR (Spam)
Motives: commercial, political, religious, lobbies; promotion funded by advertising budget.
Operators: contractors (Search Engine Optimizers) for lobbies and companies; web masters; hosting services.
Forum: Webmaster World ( ): search-engine-specific tricks, discussions about academic papers.

Search Engine Optimization I
Adversarial IR ("search engine wars")

Can you trust words on the page?
Examples from July 2002: auctions.hitsoffice.com/ served pornographic content behind an innocuous-looking URL.

Simplest forms
Early engines relied on the density of terms: the top-ranked pages for the query maui resort were the ones containing the most occurrences of maui and resort. SEOs responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort. Often the repetitions would be in the same color as the background of the web page: the repeated terms got indexed by crawlers but were not visible to humans in browsers. So you can't trust the words on a web page for ranking.

A few spam technologies
- Cloaking: serve fake content to the search engine robot. DNS cloaking: switch IP address; impersonate.
- Doorway pages: pages optimized for a single keyword that redirect to the real target page.
- Keyword spam: misleading meta-keywords, excessive repetition of a term, fake "anchor text", hidden text via colors, CSS tricks, etc.
- Link spamming: mutual admiration societies, hidden links, awards. Domain flooding: numerous domains that point or redirect to a target page.
- Robots: fake click streams, fake query streams, millions of submissions via Add-URL.

More spam techniques
Cloaking: serve fake content to the search engine spider. DNS cloaking: switch IP address; impersonate. (Diagram: "Is this a search engine spider?" If yes, serve the spam page; if no, serve the real doc.)

Tutorial on Cloaking & Stealth Technology

Variants of keyword stuffing
Misleading meta-tags, excessive repetition; hidden text with colors, style-sheet tricks, etc.
Meta-tags = "... London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, ..."

More spam techniques
- Doorway pages: pages optimized for a single keyword that redirect to the real target page.
- Link spamming: mutual admiration societies, hidden links, awards (more on these later). Domain flooding: numerous domains that point or redirect to a target page.
- Robots: fake query streams, i.e., rank-checking programs that "curve-fit" the ranking functions of search engines; millions of submissions via Add-URL.

The war against spam
- Quality signals: prefer authoritative pages based on votes from authors (linkage signals) and votes from users (usage signals).
- Policing of URL submissions: anti-robot tests.
- Limits on meta-keywords.
- Robust link analysis: ignore statistically implausible linkage (or text); use link analysis to detect spammers (guilt by association).

The war against spam
- Spam recognition by machine learning: training sets based on known spam.
- Family-friendly filters: linguistic analysis, general classification techniques, etc.; for images, flesh-tone detectors, source-text analysis, etc.
- Editorial intervention: blacklists, auditing of top queries, complaints addressed.

Acid test
Which SEOs rank highly on the query seo? Web search engines have policies on which SEO practices they tolerate or block; see pointers in Resources. Adversarial IR is the unending (technical) battle between SEOs and web search engines.

Duplicate detection

Duplicate/Near-Duplicate Detection
Duplication: exact match, detected with fingerprints. Near-duplication: approximate match. Overview: compute syntactic similarity with an edit-distance measure, and use a similarity threshold to detect near-duplicates (e.g., similarity > 80% => documents are "near duplicates"). Not transitive, though sometimes used transitively.

Computing Similarity
Features:
- Segments of a document (natural or artificial breakpoints) [Brin95]
- Shingles (word k-grams) [Brin95, Brod98]: "a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is
Similarity measure between two docs (= sets of shingles): set intersection [Brod98], specifically Size_of_Intersection / Size_of_Union, i.e., the Jaccard measure.
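The shingling and Jaccard computation above can be stated directly in code (a brute-force sketch over full shingle sets; the scalable version follows on later slides):

```python
def shingles(text: str, k: int = 4) -> set[str]:
    """Word k-grams of a document, joined with '_' as on the slide."""
    words = text.split()
    return {'_'.join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1: set, s2: set) -> float:
    """Size_of_Intersection / Size_of_Union."""
    return len(s1 & s2) / len(s1 | s2)
```

For "a rose is a rose is a rose" with k = 4 this produces exactly the three shingles listed on the slide, since duplicates collapse in the set.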

Shingles + Set Intersection
Computing the exact set intersection of shingles between all pairs of documents is expensive. Instead, approximate using a cleverly chosen subset of shingles from each document (a sketch), and estimate Jaccard from this short sketch:
- Create a "sketch vector" (e.g., of size 200) for each document.
- Documents that share more than t (say 80%) of corresponding vector elements are deemed similar.
- For doc d, sketch_d[i] is computed as follows: let f map all shingles in the universe to 0..2^m; let π_i be a specific random permutation on 0..2^m; pick MIN π_i(f(s)) over all shingles s in d.

Shingling with sampling minima
Given two documents A1, A2, let S1 and S2 be their shingle sets. Resemblance = |intersection of S1 and S2| / |union of S1 and S2|. Let Alpha = min(π(S1)) and Beta = min(π(S2)), for a random permutation π. Then Probability(Alpha = Beta) = Resemblance.

Computing Sketch[i] for Doc1
Start with 64-bit shingles; permute on the number line with π_i; pick the min value.

Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Are the two minima A and B equal? Test for 200 random permutations: π_1, π_2, ..., π_200.

However...
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection). This happens with probability Size_of_intersection / Size_of_union. Why?

Set Similarity
Set similarity (Jaccard measure): view sets as columns of a matrix, one row for each element in the universe; a_ij = 1 indicates the presence of item i in set j. In the slide's example, columns C1 and C2 both have a 1 in 2 rows, while 5 rows have a 1 in at least one column, so sim_J(C1, C2) = 2/5 = 0.4.

Key Observation
For columns C_i, C_j, there are four types of rows:
      C_i  C_j
  A    1    1
  B    1    0
  C    0    1
  D    0    0
Overloading notation, let A = # of rows of type A (similarly B, C, D). Claim: sim_J(C_i, C_j) = A / (A + B + C).
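The claim can be checked directly on boolean columns (a quick sketch; the function name is my own):

```python
def jaccard_cols(c1: list[int], c2: list[int]) -> float:
    """Jaccard similarity of two sets given as 0/1 columns."""
    both = sum(1 for x, y in zip(c1, c2) if x and y)    # type-A rows
    either = sum(1 for x, y in zip(c1, c2) if x or y)   # type A, B, or C rows
    return both / either                                # A / (A + B + C)
```

Type-D rows (0, 0) contribute to neither count, which is why they can be ignored, and why permuting rows does not change the ratio.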

Min Hashing
Randomly permute the rows; define h(C_i) = index of the first row with a 1 in column C_i. Surprising property: P[h(C_i) = h(C_j)] = sim_J(C_i, C_j). Why? Look down columns C_i, C_j until the first non-type-D row; h(C_i) = h(C_j) exactly when that row is of type A. So both quantities equal A/(A+B+C).
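A small simulation of min hashing. Rather than materializing row permutations, it uses random linear hash functions as stand-in permutations (a common approximation, and an assumption here); the fraction of agreeing sketch positions then estimates the Jaccard measure:

```python
import random

PRIME = 2**31 - 1  # modulus for the pseudo-permutations

def make_hash_fns(n: int, seed: int = 0) -> list[tuple[int, int]]:
    """n pseudo-permutations h(x) = (a*x + b) mod PRIME, as (a, b) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]

def minhash_sketch(elements: set[int], fns: list[tuple[int, int]]) -> list[int]:
    """sketch[i] = min over elements of h_i(element)."""
    return [min((a * x + b) % PRIME for x in elements) for a, b in fns]

def estimate_jaccard(sk1: list[int], sk2: list[int]) -> float:
    """Fraction of sketch positions on which the two sketches agree."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)
```

With 200 hash functions, as on the earlier slide, two sets whose true Jaccard similarity is 1/3 should yield an estimate close to 0.33, with sampling error on the order of a few percent.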

Question
Document D1 = D2 iff size_of_intersection = size_of_union?

Mirror Detection
Mirroring is the systematic replication of web pages across hosts; it is the single largest cause of duplication on the web. Host1/α and Host2/β are mirrors iff for all (or most) paths p, whenever α/p exists, β/p exists as well with identical (or near-identical) content, and vice versa.

Mirror Detection example
(Slide shows the Structural Classification of Proteins site served from multiple mirror hosts.)

Repackaged Mirrors
(Screenshots of Auctions.msn.com and Auctions.lycos.com, Aug 2001: the same auction content repackaged under two different portals.)

Motivation
Why detect mirrors?
- Smart crawling: fetch from the fastest or freshest server; avoid duplication.
- Better connectivity analysis: combine inlinks; avoid double-counting outlinks.
- Redundancy in result listings: "If that fails you can try: /samepath".
- Proxy caching.

Bottom-Up Mirror Detection [Cho00]
Maintain clusters of subgraphs. Initialize clusters of trivial subgraphs by grouping near-duplicate single documents into a cluster. In subsequent passes, merge clusters of the same cardinality and corresponding linkage, avoiding decreases in cluster cardinality. To detect mirrors we need adequate path overlap, and contents of corresponding pages within a small time range.

Can we use URLs to find mirrors?
(Slide shows the URL trees of two hosts side by side: a long listing of paths under synthesis.stanford.edu, e.g. Docs/ProjAbs/mech/mech-enhanced..., Docs/annual.report96.final.html, Docs/myr5/assessment/neato-ucb.html, Docs/myr5/cicee/bridge-gap.html, replicated on a mirror host.)

Top-Down Mirror Detection [Bhar99, Bhar00c]
E.g., synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html. What features could indicate mirroring?
- Hostname similarity: word unigrams and bigrams, e.g. { www, synthesis, ... }
- Directory similarity: positional path bigrams, e.g. { 0:Docs/ProjAbs, 1:ProjAbs/synsys, ... }
- IP address similarity: 3- or 4-octet overlap (many hosts sharing an IP address => virtual hosting by an ISP)
- Host outlink overlap
- Path overlap; potentially, path + sketch overlap
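The first two feature families are simple to extract; a sketch (function names are my own):

```python
import re

def host_terms(host: str) -> set[str]:
    """Word unigrams and bigrams from a hostname, split on '.' and '-'."""
    words = re.split(r'[.\-]', host)
    return set(words) | {f'{a} {b}' for a, b in zip(words, words[1:])}

def positional_path_bigrams(path: str) -> set[str]:
    """Positional bigrams over path segments, e.g. '0:Docs/ProjAbs'."""
    segs = [s for s in path.split('/') if s]
    return {f'{i}:{a}/{b}' for i, (a, b) in enumerate(zip(segs, segs[1:]))}
```

Two hosts whose URL sets share many such features become a candidate mirror pair, to be validated by content comparison in the next slide's Phase II.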

Implementation
Phase I, candidate pair detection: find features that pairs of hosts have in common, and compute a list of host pairs that might be mirrors.
Phase II, host pair validation: test each host pair and determine the extent of mirroring. Check whether 20 paths sampled from Host1 have near-duplicates on Host2, and vice versa. Use transitive inferences:
- IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
- IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)
Evaluation: 140 million URLs on 230,000 hosts (1999). The best approach combined 5 sets of features; the top 100,000 host pairs had precision = 0.57 and recall = 0.86.

WebIR Infrastructure
- Connectivity Server: fast access to links, to support link analysis.
- Term Vector Database: fast access to document vectors, to augment link analysis.

Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]
Fast web-graph access to support connectivity analysis. Stores mappings in memory from URL to outlinks and from URL to inlinks. Applications: HITS and PageRank computations, crawl simulation, graph algorithms (web connectivity, diameter, etc.; more on this later), visualizations.

Usage
Input: graph algorithm + URLs + values. URLs are translated to fingerprints (FPs) and then to IDs; the graph algorithm runs in memory; IDs are translated back to URLs. Output: URLs + values.
Translation tables on disk:
- URL text: 9 bytes/URL (compressed from ~80 bytes)
- FP (64b) -> ID (32b): 5 bytes
- ID (32b) -> FP (64b): 8 bytes
- ID (32b) -> URLs: 0.5 bytes

ID assignment
Partition URLs into 3 sets, sorted lexicographically:
- High: max degree > 254
- Medium: 254 > max degree > 24
- Low: the remainder (75%)
IDs are assigned in sequence (densely); e.g., HIGH IDs are those with max(indegree, outdegree) > 254. Adjacency lists: in-memory tables for outlinks and inlinks; a list index maps from a source ID to the start of its adjacency list.

Adjacency List Compression - I
Store a list index plus a sequence of adjacency lists, with the lists delta-encoded. Smaller delta values are exponentially more frequent (80% of links point to the same host), so compress the deltas with a variable-length encoding (e.g., Huffman). List index pointers: 32b for high, base+16b for medium, base+8b for low; avg = 12b per pointer.
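A sketch of the delta step, paired with a simple variable-byte code; variable-byte is my simplification standing in for the Huffman-style coding the slide mentions, not the system's exact scheme:

```python
def delta_encode(ids: list[int]) -> list[int]:
    """Sorted destination IDs -> gaps; the first ID is kept as-is."""
    out, prev = [], 0
    for x in ids:
        out.append(x - prev)
        prev = x
    return out

def vbyte(n: int) -> bytes:
    """Variable-byte code: 7 data bits per byte, high bit set on the last byte."""
    chunks = []
    while True:
        chunks.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    chunks[-1] |= 0x80
    return bytes(chunks)

def compress_adjacency(ids: list[int]) -> bytes:
    """Small gaps (e.g., links within the same host) cost a single byte each."""
    return b''.join(vbyte(g) for g in delta_encode(sorted(ids)))
```

Because most links stay on the same host, most gaps are small, so most encoded links fit in one byte, which is the whole point of delta-encoding the lists.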

Adjacency List Compression - II
Inter-list compression. Basis: similar URLs may share links, so lists for IDs close in ID space may overlap. Approach: define a representative adjacency list for a block of IDs (the adjacency list of a reference ID, or the union of the adjacency lists in the block), and represent each adjacency list in terms of deletions and additions from it when that is cheaper. Measurements: intra-list + starts: 8-11 bits per link (580M pages / 16GB RAM); inter-list: bits per link (870M pages / 16GB RAM).

Term Vector Database [Stat00]
Fast access to 50-word term vectors for web pages.
- Term selection: restricted to the middle third of the lexicon by document frequency; top 50 words in the document by TF.IDF.
- Term weighting: deferred till run time (can be based on term freq, doc freq, doc length).
- Applications: content + connectivity analysis (e.g., topic distillation), topic-specific crawls, document classification.
- Performance: storage, 33GB for 272M term vectors; speed, 17 ms/vector on an AlphaServer 4100 (the latency to read a disk block).

Architecture
(Diagram: URLid-to-term-vector lookup. A base pointer (4 bytes) and a bit vector covering 480 URLids give an offset, computed as URLid * 64 / 480, into 128-byte term-vector records, each holding URL info plus the terms and their frequencies.)