Information Retrieval Web Search

Goal of a Search Engine. Retrieve docs that are "relevant" for the user query. Doc: Word or PDF file, web page, email, blog, e-book, ... Query: the "bag of words" paradigm. Relevant?!?

Two main difficulties.
The Web:
- Size: more than tens of billions of pages
- Language and encodings: hundreds
- Distributed authorship: SPAM, format-less pages, ...
- Dynamic: in one year, 35% of pages survive and 20% are untouched
Extracting "significant data" is difficult!!
The User:
- Query composition: short (2.5 terms on average) and imprecise
- Query results: 85% of users look at just one result page
- Several needs: informational, navigational, transactional
Matching "user needs" is difficult!!

Evolution of Search Engines.
- First generation: use only on-page, web-text data (word frequency and language). AltaVista, Excite, Lycos, etc.
- Second generation: use off-page, web-graph data: link (or connectivity) analysis, anchor text (how people refer to a page). 1998: Google.
- Third generation: answer "the need behind the query": focus on the "user need" rather than on the query; integrate multiple data sources; click-through data. Google, Yahoo, MSN, ASK, ...
- Fourth generation: Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]

This is a search engine!!!

Two new approaches:
- Sponsored search (AdWords): ads driven by the search keywords (and the profile of the user issuing them).
- Context match (AdSense): ads driven by the content of a web page (and the profile of the user reaching that page).

Information Retrieval The structure of a Search Engine

The structure. [Architecture diagram: Web → Crawler → Page archive → Page Analyzer → Indexer, which builds the text, structure, and auxiliary indexes; at query time the Query resolver and the Ranker use these indexes, under a Control module.]

Information Retrieval The Web Graph

The Web's Characteristics.
- Size: 1 trillion pages available (Google, 7/2008); at 5-40 KB per page, that is hundreds of terabytes; and the size grows every day!!
- Change: 8% new pages and 25% new links per week; pages have a life time of about 10 days.

The Bow Tie

Some definitions.
- Weakly connected component (WCC): a set of nodes such that any node can reach any other node via an undirected path.
- Strongly connected component (SCC): a set of nodes such that any node can reach any other node via a directed path.

On observing the Web graph. We do not know which percentage of it we know. The only way to discover the graph structure of the Web as hypertext is via large-scale crawls. Warning: the picture might be distorted by
- size limitations of the crawl,
- crawling rules,
- perturbations of the "natural" process of birth and death of nodes and links.

Why is it interesting? The Web is the largest artifact ever conceived by humans. Exploiting its structure helps with:
- crawl strategies
- search
- spam detection
- discovering communities on the Web
- classification/organization
- predicting the evolution of the Web
- sociological understanding

Many other large graphs...
- Internet graph: V = routers, E = communication links.
- "Cosine" graph (undirected, weighted): V = static web pages, E weighted by the tf-idf (cosine) distance between pages.
- Query-log graph (bipartite, weighted): V = queries and URLs, E = (q,u) where u is a result for q that has been clicked by some user who issued q.
- Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (Facebook, address book, email, ...).

Definition. Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs (no in-links and no out-links) are ignored. Three key properties, the first being the skewed distribution: Prob(a node has x links) ≈ 1/x^α, with α ≈ 2.1.

The in-degree distribution. [Plots: AltaVista crawl, 1999; WebBase crawl, 2001.] In-degree follows a power-law distribution.

Definition. Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs (no in-links and no out-links) are ignored. Three key properties:
- Skewed distribution: Prob(a node has x links) ≈ 1/x^α, with α ≈ 2.1.
- Locality: most hyperlinks (about 80%) point to other URLs on the same host.
- Similarity: pages close to each other in lexicographic order tend to share many outgoing links.

A picture of the Web graph: 21 million pages, 150 million links. [Adjacency-matrix plot with axes i, j.]

URL-sorting. [Example: after sorting, URLs from the same hosts (e.g., Stanford, Berkeley) cluster together.]

Information Retrieval Crawling

Spidering: 24h, 7 days, "walking" over a graph. Recall that the Web graph is a directed graph G = (N, E):
- N changes (inserts, deletes): >> 50 × 10^9 nodes
- E changes (inserts, deletes): > 10 links per node
and it has the BowTie structure seen above.

Crawling issues.
- How to crawl? Quality: "best" pages first. Efficiency: avoid duplication (or near-duplication). Etiquette: robots.txt, server-load concerns (minimize load).
- How much to crawl? How much to index? Coverage: how big is the Web, and how much do we cover? Relative coverage: how much do competitors have?
- How often to crawl? Freshness: how much has changed?
- How to parallelize the process?

Crawling picture. [Diagram: seed pages → URLs crawled and parsed → URL frontier → unseen Web.] (Sec. 20.2)

Updated crawling picture. [Diagram: multiple crawling threads work on the URL frontier, expanding the set of URLs crawled and parsed from the seed pages into the unseen Web.] (Sec. 20.2)

Robots.txt: a protocol for giving spiders ("robots") limited access to a website, originally from 1994. The website announces its requests about what can(not) be crawled by placing a file of restrictions at URL/robots.txt. (Sec. 20.2)

Robots.txt example: no robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

(Sec. 20.2)
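These rules can be sanity-checked with Python's standard-library robots.txt parser; a minimal sketch (the checked paths and the "somebot" agent name are made-up examples):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.parse([                     # parse the rules above, no network access needed
        "User-agent: *",
        "Disallow: /yoursite/temp/",
        "",
        "User-agent: searchengine",
        "Disallow:",               # empty Disallow = everything is allowed
    ])
    print(rp.can_fetch("somebot", "/yoursite/temp/page.html"))       # False
    print(rp.can_fetch("searchengine", "/yoursite/temp/page.html"))  # True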

Processing steps in crawling:
1. Pick a URL from the frontier (which one? this is the page-selection policy).
2. Fetch the document at that URL.
3. Check whether the document's content has already been seen (duplicate-content elimination).
4. Parse the document and extract the links it contains to other docs (URLs).
5. For each extracted URL: ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.), and check that it is not already in the frontier (duplicate-URL elimination). (Sec. 20.2)
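A minimal single-threaded sketch of this loop, under simplifying assumptions (the tiny in-memory "web", the hash-based content fingerprint, and all function names are illustrative, not a production design):

    from collections import deque

    def crawl(seeds, fetch_page, parse_links, url_filter, max_pages=1000):
        frontier = deque(seeds)            # URL frontier (plain FIFO => BFS order)
        seen_urls = set(seeds)             # duplicate-URL elimination
        seen_content = set()               # duplicate-content elimination
        while frontier and len(seen_content) < max_pages:
            url = frontier.popleft()       # 1. pick a URL from the frontier
            doc = fetch_page(url)          # 2. fetch the document
            if doc is None:
                continue
            fp = hash(doc)                 # cheap stand-in for a content fingerprint
            if fp in seen_content:         # 3. content already seen? skip it
                continue
            seen_content.add(fp)
            for link in parse_links(doc):  # 4. extract links
                if url_filter(link) and link not in seen_urls:  # 5. filter + dup-URL elim
                    seen_urls.add(link)
                    frontier.append(link)
        return seen_urls

    # Tiny in-memory "web" to make the sketch runnable end to end.
    web = {"/a": "see /b and /c", "/b": "see /c", "/c": "see /a"}
    extract = lambda doc: [w for w in doc.split() if w.startswith("/")]
    print(crawl(["/a"], web.get, extract, lambda u: True))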

Basic crawl architecture. [Diagram: the URL frontier feeds the fetch module (DNS resolution, WWW); fetched docs are parsed; a "content seen?" test checks the doc fingerprints (Doc FPs); extracted links pass through the URL filter (with robots filters) and duplicate-URL elimination against the URL set, then re-enter the URL frontier.] (Sec. 20.2)

Page selection. Given a page P, define how "good" P is. Several metrics:
- BFS, DFS, random
- popularity-driven (PageRank, full vs. partial)
- topic-driven or focused crawling
- combined

BFS: "...BFS-order discovers the highest quality pages during the early stages of the crawl" — 328 million URLs in the testbed [Najork 01].

Is this page a new one? Check whether the file has been parsed or downloaded before. After 20 million pages, we have "seen" over 200 million URLs; each URL is at least 100 bytes on average, so overall we have about 20 GB of URLs. Options: compress the URLs in main memory, or use disk:
- Bloom filter (Archive)
- disk access with caching (Mercator, AltaVista)
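A minimal Bloom-filter sketch for this seen-URL test (the sizes and the MD5-based hashing are illustrative choices, not the Archive's actual parameters): a False answer is always correct, while a True answer may occasionally be a false positive.

    import hashlib

    class BloomFilter:
        def __init__(self, n_bits=8 * 1024 * 1024, n_hashes=5):
            self.n_bits = n_bits
            self.n_hashes = n_hashes
            self.bits = bytearray(n_bits // 8)     # 1 MB bit array

        def _positions(self, url):
            # derive k pseudo-independent bit positions from keyed digests
            for i in range(self.n_hashes):
                d = hashlib.md5(f"{i}:{url}".encode()).digest()
                yield int.from_bytes(d[:8], "big") % self.n_bits

        def add(self, url):
            for p in self._positions(url):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, url):
            # all k bits set => probably seen; any bit clear => definitely new
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(url))

    seen = BloomFilter()
    seen.add("http://example.com/")
    print("http://example.com/" in seen)    # True
    print("http://example.com/x" in seen)   # almost surely False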

Parallel crawlers. The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication.
- Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers; discovered links are sent to the central coordinator.
- Static assignment: the Web is statically partitioned and assigned to crawlers; each crawler only crawls its own part of the Web.

Two problems with static assignment. Let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D-1}, and downloader x fetches the URLs U such that hash(U) = x.
- Load balancing the number of URLs assigned to downloaders: static schemes based on hosts may fail, and dynamic "relocation" schemes may be complicated.
- Managing fault tolerance: what about the death of a downloader? D → D-1 means a new hash function!!! What about a new downloader? D → D+1, again a new hash function!!!

A nice technique: consistent hashing. A tool used for spidering, web caches, P2P systems, routers' load balancing, and distributed file systems. Items and servers are mapped to the unit circle; item K is assigned to the first server N such that ID(N) ≥ ID(K), and each server is replicated log S times on the circle. What if a downloader goes down? What if a new downloader appears? Properties:
- [monotone] adding a new server moves items only from old servers to the new one;
- [balance] the probability that an item goes to a given server is ≤ cost/S;
- [load] any server gets ≤ (I/S) log S items w.h.p.;
- [scale] you can replicate each server more times to smooth the load.
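A sketch of the ring, with the 64-bit hash space standing in for the unit circle and MD5 as an arbitrary hash (the replica count and downloader names are illustrative):

    import bisect, hashlib

    def h(key):
        # map a string to a point on the "circle" (the 64-bit hash space)
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    class ConsistentHash:
        def __init__(self, downloaders, replicas=16):
            # each downloader appears `replicas` times on the ring
            self.ring = sorted((h(f"{d}#{r}"), d)
                               for d in downloaders for r in range(replicas))
            self.keys = [k for k, _ in self.ring]

        def assign(self, url):
            # first downloader at or after h(url), wrapping around the circle
            i = bisect.bisect_left(self.keys, h(url)) % len(self.ring)
            return self.ring[i][1]

    ch = ConsistentHash(["d0", "d1", "d2"])
    print(ch.assign("http://example.com/page"))
    # Adding "d3" moves only the URLs that now fall on d3's points (monotonicity);
    # every other URL keeps its old downloader.
    ch2 = ConsistentHash(["d0", "d1", "d2", "d3"])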

Examples. Open source: Nutch (also used by WikiSearch); Heritrix (used by Archive.org). Consistent hashing: Amazon's Dynamo.

Connectivity server. Support for fast queries on the web graph: Which URLs point to a given URL? Which URLs does a given URL point to? It stores the mappings from URL to out-links and from URL to in-links in memory. (Sec. 20.4)

Currently the best: WebGraph, a set of algorithms and a Java implementation. The fundamental goal is to maintain the node adjacency lists in memory; for this, compressing the adjacency lists is the critical component. (Sec. 20.4)

Adjacency lists. The set of neighbors of a node. Assume each URL is represented by an integer: with 4 billion pages, that is 32 bits per node, and naively 64 bits per hyperlink. (Sec. 20.4)

Adjacency-list compression. Properties exploited in compression:
- similarity (between lists);
- locality (many links from a page go to "lexicographically nearby" pages);
- gap encodings in sorted lists, exploiting the distribution of gap values; the result is only a few bits/link. (Sec. 20.4)
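A tiny illustration of the gap-encoding idea on one sorted adjacency list (the node IDs are made up): thanks to locality the gaps are small integers, which variable-length codes then compress well.

    def gaps(adj):
        # adj is sorted; keep the first ID absolute, turn the rest into deltas
        return [adj[0]] + [b - a for a, b in zip(adj, adj[1:])]

    def ungaps(gs):
        out = [gs[0]]
        for g in gs[1:]:
            out.append(out[-1] + g)
        return out

    adj = [1001, 1004, 1009, 1016, 1025]   # neighbors of some node, sorted
    print(gaps(adj))                       # [1001, 3, 5, 7, 9] -- small gaps
    assert ungaps(gaps(adj)) == adj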

Main idea: consider the lexicographically ordered list of all URLs (the example list was shown on the slide). (Sec. 20.4)

Copy lists. Each of these URLs has an adjacency list. Main idea: due to templates, the adjacency list of a node is often similar to that of one of the 7 preceding URLs in the lexicographic ordering, so we express an adjacency list in terms of one of them. E.g., consider these adjacency lists:
1, 2, 4, 8, 16, 32, 64
1, 4, 9, 16, 25, 36, 49, 64
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
1, 4, 8, 16, 25, 36, 49, 64
The last list can be encoded relative to the second one as: reference (-2), remove 9, add 8. Why 7? (So that the backward reference fits in 3 bits.) (Sec. 20.4)
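A sketch of the copy-list idea, reproducing the "(-2), remove 9, add 8" example above: a list is encoded as a backward offset to a reference list, a copy bitmask over the reference, and the extra nodes. This is a simplification of the real WebGraph format, for illustration only.

    def encode(ref_offset, ref_list, new_list):
        new_set, ref_set = set(new_list), set(ref_list)
        copied = [1 if x in new_set else 0 for x in ref_list]  # copy bitmask over ref
        extras = [x for x in new_list if x not in ref_set]     # nodes absent from ref
        return (ref_offset, copied, extras)

    def decode(ref_list, encoding):
        _, copied, extras = encoding
        kept = [x for x, c in zip(ref_list, copied) if c]
        return sorted(kept + extras)

    ref = [1, 4, 9, 16, 25, 36, 49, 64]   # the list two positions back
    new = [1, 4, 8, 16, 25, 36, 49, 64]
    enc = encode(-2, ref, new)
    print(enc)   # (-2, [1, 1, 0, 1, 1, 1, 1, 1], [8]) -- drop 9, add 8
    assert decode(ref, enc) == new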

Extra nodes and binary arrays. Several tricks:
- use RLE over the binary arrays;
- use a succinct encoding for the intervals created by the extra nodes;
- use special integer codes for the remaining integers (ζ codes: good for integers drawn from a power law). (Sec. 20.4)

Main advantages.
- Adjacency queries can be answered very efficiently: to fetch the out-neighbors, trace back the chain of prototypes; this chain is typically short in practice (since similarity is mostly intra-host), and one can also explicitly limit its length during encoding.
- Easy to implement as a one-pass algorithm. (Sec. 20.4)

Duplicate documents. The Web is full of duplicated content.
- Strict duplicate detection = exact match; not that common.
- Many cases are near duplicates, e.g., the last-modified date is the only difference between two copies of a page. (Sec. 19.6)

Duplicate/near-duplicate detection.
- Duplication: an exact match can be detected with fingerprints.
- Near-duplication: approximate match. Overview: compute syntactic similarity with an edit-distance-like measure, and use a similarity threshold to detect near-duplicates; e.g., similarity > 80% => the documents are "near duplicates". (Sec. 19.6)

Computing similarity. Approach: shingles (word n-grams). E.g., with q = 4, "a rose is a rose is a rose" → a_rose_is_a, rose_is_a_rose, is_a_rose_is, a_rose_is_a. Similarity measure between two docs: represent each as its set of shingles, then compare via set intersection. (Sec. 19.6)
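A direct implementation of word q-shingles and the Jaccard measure used below (note that the fourth shingle of the "rose" phrase duplicates the first, so the set has three elements):

    def shingles(text, q=4):
        words = text.split()
        return {"_".join(words[i:i + q]) for i in range(len(words) - q + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    s = shingles("a rose is a rose is a rose")
    print(s)  # {'a_rose_is_a', 'rose_is_a_rose', 'is_a_rose_is'}
    t = shingles("a rose is a flower which is a rose")
    print(jaccard(s, t))  # 1/8: only 'a_rose_is_a' is shared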

Multiset of fingerprints. Pipeline: Doc → (shingling) → multiset of shingles → (fingerprinting) → multiset of fingerprints; i.e., documents become sets of 64-bit fingerprints. For efficient shingle management, use 64-bit Karp-Rabin fingerprints, so that Prob[collision] << 1.

Similarity of documents. [Diagram: Doc A → S_A, Doc B → S_B.] Use the Jaccard measure on S_A and S_B, which are sets of integers: sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|. Claim: A and B are near-duplicates if sim(S_A, S_B) is close to 1.

Remarks.
- Multiplicities of q-grams: one can retain or ignore them, trading efficiency for precision.
- Shingle size: q ∈ [4 ... 10]. Short shingles increase the similarity of unrelated documents: with q = 1, sim(S_A, S_B) = 1 iff A is a permutation of B, so larger q is needed to be sensitive to permutation changes. With long shingles, small random changes have a larger impact.
- Similarity measure: similarity is non-transitive and non-metric, but the dissimilarity 1 - sim(S_A, S_B) is a metric. [Ukkonen 92] relates q-grams and edit distance.

Example. A = "a rose is a rose is a rose", B = "a rose is a flower which is a rose".
Preserving multiplicity:
- q=1 → sim(S_A, S_B) = 0.7, with S_A = {a, a, a, is, is, rose, rose, rose} and S_B = {a, a, a, is, is, rose, rose, flower, which}
- q=2 → sim(S_A, S_B) = 0.5
- q=3 → sim(S_A, S_B) = 0.3
Disregarding multiplicity:
- q=1 → sim(S_A, S_B) = 0.6
- q=2 → sim(S_A, S_B) = 0.5
- q=3 → sim(S_A, S_B) = 3/7 ≈ 0.43

Efficiency: sketches. Create a "sketch vector" (of size ~200) for each document; docs that share ≥ t (say 80%) of the elements in their sketches are deemed near-duplicates. For doc D, sketch_D[i] is computed as follows:
- let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting);
- let π_i be a random permutation on 0..2^m;
- pick MIN { π_i(f(s)) } over all shingles s in D. (Sec. 19.6)
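A minimal min-hash sketch along these lines, where keyed hashing stands in for the random permutations π_i (a common practical substitute); the estimated similarity is the fraction of sketch coordinates on which two documents agree:

    import hashlib

    def f(shingle):
        # 64-bit fingerprint of a shingle
        return int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")

    def sketch(shingle_set, n_perm=200):
        sk = []
        for i in range(n_perm):
            # hashing (i, fingerprint) plays the role of permutation pi_i
            sk.append(min(
                int.from_bytes(hashlib.md5(f"{i}:{f(s)}".encode()).digest()[:8], "big")
                for s in shingle_set))
        return sk

    def estimate_sim(sk1, sk2):
        return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

    A = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
    B = {"a_rose_is_a", "rose_is_a_flower", "which_is_a_rose"}
    print(estimate_sim(sketch(A), sketch(B)))  # close to the true Jaccard, 1/5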

Computing Sketch[i] for Doc1. [Diagram: start with the 64-bit fingerprints f(shingles); permute them on the number line with π_i; pick the min value.] (Sec. 19.6)

Test if Doc1.Sketch[i] = Doc2.Sketch[i]. [Diagram: the same permutation π_i is applied to the fingerprints of both documents; compare MIN(π_i(f(A))) with MIN(π_i(f(B))) — are they equal?] Test this for 200 random permutations π_1, π_2, ..., π_200. (Sec. 19.6)

Notice that... the two minima are equal iff the shingle with the MIN value in the union of Doc1 and Doc2 lies in their intersection. Claim: this happens with probability size_of_intersection / size_of_union, i.e., exactly the Jaccard similarity. (Sec. 19.6)

All signature pairs. This is an efficient method for estimating the similarity (Jaccard coefficient) of one pair of documents, but we would have to estimate N^2 similarities, where N is the number of web pages: still slow. One solution: locality-sensitive hashing (LSH). Another solution: sorting. (Sec. 19.6)

Information Retrieval Link-based Ranking (2nd generation)

Query-independent ordering. First generation: use link counts as simple measures of popularity.
- Undirected popularity: each page gets a score equal to the number of its in-links plus the number of its out-links (e.g., 3+2=5).
- Directed popularity: the score of a page = the number of its in-links (e.g., 3).
Both are easy to SPAM.

Second generation: PageRank. Each link has its own importance!! PageRank is independent of the query, and it admits many interpretations...

Basic Intuition… What about nodes with no in/out links?

Google's PageRank. Let B(i) be the set of pages linking to i, and #out(j) the number of outgoing links from j; let e be the vector with all components equal to 1/sqrt(N) (the random-jump term). Then the PageRank vector r is the principal eigenvector:

r = [ α T + (1-α) e e^T ] × r

where T is the transition matrix with T_{i,j} = 1/#out(j) if j ∈ B(i), and 0 otherwise.
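A power-iteration sketch of this formula on a toy graph (plain dicts, no libraries; dangling nodes simply leak score here, a simplification that real implementations fix by redistributing it):

    def pagerank(out_links, alpha=0.85, iters=50):
        nodes = list(out_links)
        N = len(nodes)
        r = {u: 1.0 / N for u in nodes}                # start from the uniform vector
        for _ in range(iters):
            nxt = {u: (1 - alpha) / N for u in nodes}  # random-jump term (1-alpha) e e^T r
            for j in nodes:
                if out_links[j]:
                    share = alpha * r[j] / len(out_links[j])  # alpha * T[i][j] * r[j]
                    for i in out_links[j]:
                        nxt[i] += share
            r = nxt
        return r

    # all link targets are assumed to appear as keys of the dict
    g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(g))   # "c" gets the highest score in this tiny graph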

Three different interpretations.
- Graph: the intuitive interpretation (co-citation).
- Matrix: easy for computation (an eigenvector computation or the solution of a linear system).
- Markov chain: useful to prove convergence; a sort of usage simulation, where a random surfer moves from any node to one of its neighbors or jumps anywhere at random. "In the steady state" each page has a long-term visit rate: use this as the page's score.

PageRank: use in search engines.
- Preprocessing: given the graph, build the matrix α T + (1-α) e e^T and compute its principal eigenvector r; r[i] is the PageRank of page i (we are interested in the relative order).
- Query processing: retrieve the pages containing the query terms and rank them by their PageRank.
The final order is query-independent.

HITS: Hypertext Induced Topic Search

Calculating HITS. It is query-dependent and produces two scores per page:
- Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.
- Hub score: a good hub page for a topic points to many authoritative pages for that topic.

Authority and hub scores. [Example graph:] a(1) = h(2) + h(3) + h(4); h(1) = a(5) + a(6) + a(7).

HITS: link-analysis computation. Iterate h = A a and a = A^T h, where
- a is the vector of authority scores,
- h is the vector of hub scores,
- A is the adjacency matrix, with A_{i,j} = 1 if i → j.
Thus h is an eigenvector of A A^T and a is an eigenvector of A^T A, which are symmetric matrices.
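An iterative sketch of exactly this computation on a toy graph (edge-list representation, illustrative node names), normalizing after each round so the vectors converge to the principal eigenvectors of A A^T and A^T A:

    def hits(edges, nodes, iters=50):
        a = {u: 1.0 for u in nodes}
        h = {u: 1.0 for u in nodes}
        for _ in range(iters):
            a = {u: sum(h[v] for v, w in edges if w == u) for u in nodes}  # a = A^T h
            h = {u: sum(a[w] for v, w in edges if v == u) for u in nodes}  # h = A a
            na = sum(x * x for x in a.values()) ** 0.5 or 1.0  # L2-normalize each round
            nh = sum(x * x for x in h.values()) ** 0.5 or 1.0
            a = {u: x / na for u, x in a.items()}
            h = {u: x / nh for u, x in h.items()}
        return a, h

    nodes = ["hub1", "hub2", "auth1", "auth2"]
    edges = [("hub1", "auth1"), ("hub1", "auth2"), ("hub2", "auth1")]
    a, h = hits(edges, nodes)
    print(a)  # auth1 scores highest: it is pointed to by both hubs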

Weighting links: weight a link more if the query terms occur in the neighborhood of the link (e.g., in its anchor text).