
1 Information Retrieval: Web Search

2 Goal of a Search Engine
Retrieve docs that are “relevant” for the user query.
- Doc: a Word or PDF file, web page, email, blog, e-book, ...
- Query: the “bag of words” paradigm
- Relevant ?!?

3 Two main difficulties
The Web:
- size: more than tens of billions of pages
- language and encodings: hundreds...
- distributed authorship: SPAM, format-less pages, ...
- dynamic: in one year 35% of the pages survive, 20% remain untouched
Extracting “significant data” is difficult!!
The User:
- query composition: short (2.5 terms on average) and imprecise
- query results: 85% of users look at just one result page
- several needs: informational, navigational, transactional
Matching “user needs” is difficult!!

4 Evolution of Search Engines
- First generation (1995-1997: AltaVista, Excite, Lycos, ...) -- uses only on-page, web-text data: word frequency and language
- Second generation (1998: Google) -- uses off-page, web-graph data: link (or connectivity) analysis, and anchor text (how people refer to a page)
- Third generation (Google, Yahoo, MSN, ASK, ...) -- answers “the need behind the query”: focus on the “user need” rather than on the query, integrating multiple data sources and click-through data
- Fourth generation: Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]


11 This is a search engine!!!

12 [Diagram: money flowing out (−$) and in (+$) of a search engine]

13 Two new approaches
- Sponsored search (AdWords): ads driven by the search keywords (and the profile of the user issuing them)
- Context match (AdSense): ads driven by the content of a web page (and the profile of the user reaching that page)


15 Information Retrieval: The structure of a Search Engine

16 The structure
[Diagram: the Web is fetched by the Crawler (driven by a Control module) into a Page archive; a Page Analyzer extracts text and structure, which the Indexer turns into text, structure, and auxiliary indexes; at query time a Query resolver and a Ranker use these indexes to answer.]


18 Information Retrieval: The Web Graph

19 The Web’s Characteristics
Size:
- 1 trillion pages available (Google, 7/2008)
- 5-40 KB per page => hundreds of terabytes
- the size grows every day!!
Change:
- 8% new pages, 25% new links every week
- average page lifetime of about 10 days

20 The Bow Tie

21 Some definitions
- Weakly connected component (WCC): a set of nodes such that from any node one can reach any other via an undirected path.
- Strongly connected component (SCC): a set of nodes such that from any node one can reach any other via a directed path.
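
To make the two notions concrete, here is a minimal sketch (not from the slides) that extracts the WCCs and SCCs of a tiny made-up graph using the networkx library:

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("c", "a"),   # a directed cycle: one SCC
    ("c", "d"),                           # d is reachable, but cannot get back
])

# WCC: connectivity ignoring edge directions.
print(list(nx.weakly_connected_components(G)))    # e.g. [{'a', 'b', 'c', 'd'}]
# SCC: mutual reachability along directed paths.
print(list(nx.strongly_connected_components(G)))  # e.g. [{'d'}, {'a', 'b', 'c'}]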

22 On observing the Web graph
- We do not know which percentage of it we know
- The only way to discover the graph structure of the Web as hypertext is via large-scale crawls
- Warning: the picture might be distorted by:
  - size limitations of the crawl
  - crawling rules
  - perturbations of the “natural” process of birth and death of nodes and links

23 Why is it interesting?
- The largest artifact ever conceived by humankind
- Exploit the structure of the Web for:
  - crawl strategies
  - search
  - spam detection
  - discovering communities on the web
  - classification/organization
- Predict the evolution of the Web
  - sociological understanding

24 Many other large graphs...
- Internet graph: V = routers, E = communication links
- “Cosine” graph (undirected, weighted): V = static web pages, E = tf-idf distance between pages
- Query-log graph (bipartite, weighted): V = queries and URLs, E = (q,u) where u is a result for q that has been clicked by some user who issued q
- Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (Facebook, address books, email, ...)

25 Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs are ignored (no IN and no OUT links).
Three key properties:
- Skewed distribution: the probability that a node has x links is proportional to 1/x^γ, with γ ≈ 2.1

26 The In-degree distribution
[Plots: Altavista crawl (1999) and WebBase crawl (2001)]
In-degree follows a power-law distribution.

27 Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs are ignored (no IN, no OUT links).
Three key properties:
- Skewed distribution: the probability that a node has x links is proportional to 1/x^γ, with γ ≈ 2.1 (an estimation sketch follows)
- Locality: most of the hyperlinks (about 80%) point to other URLs on the same host
- Similarity: pages that are close in lexicographic order tend to share many outgoing links
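
Not on the slides: a quick way to sanity-check an exponent like γ ≈ 2.1 is the standard maximum-likelihood estimate γ ≈ 1 + n / Σ ln(x_i / x_min), shown here on synthetic Pareto-distributed degrees:

import math
import random

random.seed(0)
# Made-up sample: Pareto(alpha=1.1) has density ~ x^-(alpha+1) = x^-2.1.
degrees = [random.paretovariate(1.1) for _ in range(100_000)]

x_min = 1.0   # smallest degree considered part of the power-law tail
tail = [x for x in degrees if x >= x_min]
gamma = 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)
print(f"estimated gamma = {gamma:.2f}")   # close to 2.1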

28 A Picture of the Web Graph
[Plot of the adjacency matrix (i vs. j): 21 million pages, 150 million links]

29 URL-sorting
[Plot: adjacency structure after lexicographic URL sorting, with visible blocks for the Stanford and Berkeley hosts]

30 Information Retrieval: Crawling

31 Spidering
- 24h/7days “walking” over a graph
- Recall that the Web graph is:
  - a directed graph G = (N, E)
  - N changes (inserts, deletes): >> 50 × 10^9 nodes
  - E changes (inserts, deletes): > 10 links per node
- ... with the BowTie structure seen above

32 Crawling Issues
- How to crawl?
  - quality: “best” pages first
  - efficiency: avoid duplication (or near duplication)
  - etiquette: robots.txt, server load concerns (minimize the load)
- How much to crawl? How much to index?
  - coverage: how big is the Web? How much do we cover?
  - relative coverage: how much do competitors have?
- How often to crawl?
  - freshness: how much has changed?
- How to parallelize the process?

33 Crawling picture (Sec. 20.2)
[Diagram: seed pages initialize the URL frontier; URLs get crawled and parsed, feeding new URLs back into the frontier; everything else is the unseen Web.]

34 Updated crawling picture (Sec. 20.1.1)
[Diagram: as above, but with multiple crawling threads consuming the URL frontier in parallel.]

35 Robots.txt (Sec. 20.2.1)
- Protocol for giving spiders (“robots”) limited access to a website, originally from 1994: www.robotstxt.org/wc/norobots.html
- A website announces its requests about what can(not) be crawled
- The file of restrictions for a site is placed at URL/robots.txt

36 Robots.txt example (Sec. 20.2.1)
No robot should visit any URL starting with “/yoursite/temp/”, except the robot called “searchengine”:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
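
The same policy can be checked with Python’s standard-library parser; a minimal sketch (the URLs are placeholders, not real sites):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the rules directly, line by line.
rp.parse("""\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines())

print(rp.can_fetch("anybot", "http://example.com/yoursite/temp/x.html"))        # False
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/x.html"))  # True
print(rp.can_fetch("anybot", "http://example.com/other/page.html"))             # True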

37 Processing steps in crawling (Sec. 20.2.1)
- Pick a URL from the frontier (which one?)
- Fetch the document at that URL
- Parse the document and extract links from it to other docs (URLs)
- For each extracted URL:
  - ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  - check if it is already in the frontier (duplicate URL elimination)
  - check if its content has already been seen (duplicate content elimination)
(A sketch of this loop follows.)
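
A minimal single-threaded sketch of the loop above; fetch_page, extract_links, and url_filter are hypothetical stand-ins for the components just named:

from collections import deque

def crawl(seeds, fetch_page, extract_links, url_filter, max_pages=1000):
    frontier = deque(seeds)        # URLs waiting to be fetched
    seen_urls = set(seeds)         # duplicate-URL elimination
    seen_content = set()           # duplicate-content elimination (fingerprints)
    while frontier and len(seen_content) < max_pages:
        url = frontier.popleft()   # pick a URL (here plain FIFO, i.e. BFS)
        html = fetch_page(url)
        fp = hash(html)            # toy content fingerprint
        if fp in seen_content:
            continue               # content already seen under another URL
        seen_content.add(fp)
        for link in extract_links(html, base=url):
            if url_filter(link) and link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)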

38 Basic crawl architecture (Sec. 20.2.1)
[Diagram: URL frontier -> Fetch (WWW, with DNS resolution) -> Parse -> content-seen test (doc fingerprints) -> URL filter (robots filters) -> duplicate-URL elimination (URL set) -> back into the URL frontier.]

39 Page selection
Given a page P, define how “good” P is. Several metrics:
- BFS, DFS, random
- popularity-driven (PageRank, full vs. partial)
- topic-driven or focused crawling
- combinations of the above

40 BFS
“...BFS-order discovers the highest quality pages during the early stages of the crawl”
328 million URLs in the testbed [Najork 01]

41 Is this page a new one?
- Check whether the page has been parsed or downloaded before:
  - after 20 million pages, we have “seen” over 200 million URLs
  - each URL is at least 100 bytes on average
  - overall, about 20 GB of URLs
- Options: compress the URLs in main memory, or use disk
  - Bloom filter (Archive); a sketch follows
  - disk access with caching (Mercator, Altavista)
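
A minimal Bloom-filter sketch for the URL-seen test (an illustration of the idea, not the Archive’s actual code): small memory, no false negatives, tunable false positives.

import hashlib

class BloomFilter:
    def __init__(self, m_bits=8_000_000, k_hashes=5):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, url):
        # Derive k bit positions from two digests (Kirsch-Mitzenmacher trick).
        d = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add("http://example.com/")
print("http://example.com/" in seen)   # True
print("http://example.org/" in seen)   # False (with high probability)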

42 Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication.
- Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers; extracted links are sent to the central coordinator
- Static assignment: the Web is statically partitioned and assigned to crawlers; each crawler crawls only its part of the Web

43 Two problems
Let D be the number of downloaders; hash(URL) maps a URL to {0,...,D-1}, and downloader x fetches the URLs U such that hash(U) = x.
- Load balancing the #URLs assigned to downloaders:
  - static schemes based on hosts may fail (www.geocities.com/... vs. www.di.unipi.it/)
  - dynamic “relocation” schemes may be complicated
- Managing fault tolerance:
  - What about the death of a downloader? D -> D-1 forces a new hash function!!!
  - What about a new downloader? D -> D+1 forces a new hash function!!!

44 A nice technique: Consistent Hashing
A tool for: spidering, web caches, P2P, routers load balancing, distributed FS.
- Items and servers are mapped to the unit circle; item K is assigned to the first server N such that ID(N) ≥ ID(K)
- What if a downloader goes down? What if a new downloader appears?
- Each server gets replicated log S times, giving:
  - [monotone] adding a new server moves points only from one old server to the new one
  - [balance] the probability that an item goes to a given server is ≤ const/S
  - [load] any server gets ≤ (I/S) log S items w.h.p.
  - [scale] you can replicate each server more times to smooth the load
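
A minimal sketch of such a ring (illustrative only): downloaders and URLs hash to points on a circle, each downloader gets several virtual replicas, and removing a downloader moves only the URLs it owned.

import bisect
import hashlib

def point(key):
    # Map a string to a point on the "unit circle" (here: 0 .. 2^32 - 1).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class Ring:
    def __init__(self, replicas=32):
        self.replicas = replicas
        self.points = []          # sorted points on the circle
        self.owner = {}           # point -> downloader name

    def add(self, name):
        for i in range(self.replicas):
            p = point(f"{name}#{i}")
            bisect.insort(self.points, p)
            self.owner[p] = name

    def remove(self, name):
        for i in range(self.replicas):
            p = point(f"{name}#{i}")
            self.points.remove(p)
            del self.owner[p]

    def lookup(self, url):
        # First replica point clockwise from the URL's point owns it.
        i = bisect.bisect_right(self.points, point(url)) % len(self.points)
        return self.owner[self.points[i]]

ring = Ring()
for d in ("d0", "d1", "d2"):
    ring.add(d)
print(ring.lookup("http://example.com/page"))
ring.remove("d1")   # only URLs owned by d1 move; d0's and d2's stay put
print(ring.lookup("http://example.com/page"))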

45 Examples: Open Source
- Nutch (also used by WikiSearch): http://www.nutch.org
- Heritrix (used by Archive.org): http://archive-crawler.sourceforge.net/index.html
- Consistent hashing: Amazon’s Dynamo

46 Connectivity Server (Sec. 20.4)
Support for fast queries on the web graph:
- Which URLs point to a given URL?
- Which URLs does a given URL point to?
It stores the mappings URL -> outlinks and URL -> inlinks in memory.

47 Currently the best: WebGraph (Sec. 20.4)
- WebGraph: a set of algorithms and a Java implementation
- Fundamental goal: maintain node adjacency lists in memory
- For this, compressing the adjacency lists is the critical component

48 Adjacency lists (Sec. 20.4)
- The adjacency list of a node is the set of its neighbors
- Assume each URL is represented by an integer: for 4 billion pages, 32 bits per node
- ... and 64 bits per hyperlink (32 for the source plus 32 for the destination)

49 Adjacency list compression (Sec. 20.4)
Properties exploited in compression:
- similarity (between lists)
- locality (many links from a page go to “lexicographically nearby” pages)
- gap encodings in sorted lists, exploiting the skewed distribution of gap values (down to about 3 bits/link); a small sketch follows
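
Not on the slides: a small illustration of gap plus variable-byte encoding for one sorted adjacency list (WebGraph itself uses bit-level ζ codes, so this is only indicative):

def gaps(sorted_list):
    return [sorted_list[0]] + [b - a for a, b in zip(sorted_list, sorted_list[1:])]

def vbyte_encode(n):
    # 7 data bits per byte; the high bit marks "more bytes follow".
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

adj = [100007, 100009, 100012, 100100]     # neighbors, sorted by URL-id
g = gaps(adj)                              # [100007, 2, 3, 88]: small numbers
encoded = b"".join(vbyte_encode(x) for x in g)
print(g, len(encoded), "bytes vs", 4 * len(adj), "bytes uncompressed")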

50 Main ideas (Sec. 20.4)
Consider the lexicographically ordered list of all URLs, e.g.:
www.stanford.edu/alchemy
www.stanford.edu/biology
www.stanford.edu/biology/plant
www.stanford.edu/biology/plant/copyright
www.stanford.edu/biology/plant/people
www.stanford.edu/chemistry

51 Copy lists (Sec. 20.4)
Each of these URLs has an adjacency list. Main idea: due to templates, the adjacency list of a node is similar to that of one of the 7 preceding URLs in the lexicographic ordering, so we express an adjacency list in terms of one of them. (Why 7? A small fixed window bounds both the encoder’s memory and the reference chains to be followed at decoding time.)
E.g., consider these adjacency lists:
1, 2, 4, 8, 16, 32, 64
1, 4, 9, 16, 25, 36, 49, 64
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
1, 4, 8, 16, 25, 36, 49, 64
The fourth list can be encoded with respect to the second: reference (-2), copy list 11011111, plus the extra node 8 (a code sketch follows).
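
A minimal sketch of the copy-list idea on exactly the example above (encode a list as reference + copy bits + extra nodes):

def encode(current, reference):
    cur = set(current)
    copy_bits = "".join("1" if x in cur else "0" for x in reference)
    extras = sorted(cur - set(reference))
    return copy_bits, extras

def decode(reference, copy_bits, extras):
    kept = [x for x, b in zip(reference, copy_bits) if b == "1"]
    return sorted(kept + extras)

ref = [1, 4, 9, 16, 25, 36, 49, 64]      # the list two rows earlier
cur = [1, 4, 8, 16, 25, 36, 49, 64]
bits, extras = encode(cur, ref)
print(bits, extras)                       # 11011111 [8]
assert decode(ref, bits, extras) == cur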

52 Extra nodes and binary arrays (Sec. 20.4)
Several tricks:
- use RLE over the binary arrays
- use succinct encoding for the intervals created by the extra nodes
- use special integer codes for the remaining integers (ζ codes: good for integers drawn from a power law)

53 Main advantages (Sec. 20.4)
- Adjacency queries can be answered very efficiently:
  - to fetch the out-neighbors, trace back the chain of prototypes
  - this chain is typically short in practice (since similarity is mostly intra-host)
  - the length of the chain can also be explicitly limited during encoding
- Easy to implement as a one-pass algorithm

54 Duplicate documents (Sec. 19.6)
- The web is full of duplicated content
- Strict duplicate detection = exact match; not that common
- Near duplicates are far more frequent, e.g., two copies of a page differing only in the last-modified date

55 Duplicate/Near-Duplicate Detection (Sec. 19.6)
- Duplication: exact match, detectable with fingerprints
- Near-duplication: approximate match. Overview:
  - compute syntactic similarity with an edit-distance-like measure
  - use a similarity threshold to detect near-duplicates (e.g., similarity > 80% => the documents are “near duplicates”)

56 Computing Similarity (Sec. 19.6)
Approach: shingles (word n-grams), e.g.:
a rose is a rose is a rose → a_rose_is_a  rose_is_a_rose  is_a_rose_is  a_rose_is_a
Similarity measure between two docs: build the set of shingles of each, then compare the sets by intersection (Jaccard; a sketch follows).
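
A minimal sketch of q=4 shingling and the Jaccard similarity, on the slide’s own example:

def shingles(text, q=4):
    words = text.split()
    return {"_".join(words[i:i + q]) for i in range(len(words) - q + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

A = shingles("a rose is a rose is a rose")
B = shingles("a rose is a flower which is a rose")
print(sorted(A))        # ['a_rose_is_a', 'is_a_rose_is', 'rose_is_a_rose']
print(jaccard(A, B))    # 1 shared shingle out of 8 distinct ones: 0.125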

57 Multiset of Fingerprints
[Diagram: Doc -> shingling -> multiset of shingles -> fingerprinting -> multiset of fingerprints]
Documents => sets of 64-bit fingerprints. For efficient shingle management:
- use Karp-Rabin fingerprints
- use 64-bit fingerprints, so Prob[collision] << 1

58 Similarity of Documents
[Diagram: Doc A yields the set S_A, Doc B yields the set S_B]
Jaccard measure of the similarity of S_A and S_B (sets of integers):
sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
Claim: A and B are near-duplicates if sim(S_A, S_B) is close to 1.

59 Remarks
- Multiplicities of q-grams: could be retained or ignored, trading efficiency for precision
- Shingle size q ∈ [4 .. 10]
  - short shingles increase the similarity of unrelated documents: with q=1, sim(S_A, S_B) = 1 iff A is a permutation of B, so a larger q is needed to be sensitive to permutation changes
  - with long shingles, small random changes have a larger impact
- Similarity measure:
  - similarity is non-transitive and non-metric
  - but the dissimilarity 1 - sim(S_A, S_B) is a metric
  - [Ukkonen 92] relates q-grams and edit distance

60 Example
A = “a rose is a rose is a rose”
B = “a rose is a flower which is a rose”
Preserving multiplicity:
- q=1: sim(S_A, S_B) = 0.7
  S_A = {a, a, a, is, is, rose, rose, rose}
  S_B = {a, a, a, is, is, rose, rose, flower, which}
- q=2: sim(S_A, S_B) = 0.5
- q=3: sim(S_A, S_B) = 0.3
Disregarding multiplicity:
- q=1: sim(S_A, S_B) = 0.6
- q=2: sim(S_A, S_B) = 0.5
- q=3: sim(S_A, S_B) = 0.4285

61 Efficiency: Sketches (Sec. 19.6)
- Create a “sketch vector” (of size ~200) for each document
- Docs that share ≥ t (say 80%) of the elements in their sketches are near duplicates
- For doc D, sketch_D[i] is computed as follows:
  - let f map all shingles in the universe to 0 .. 2^m (e.g., f = fingerprinting)
  - let π_i be a random permutation on 0 .. 2^m
  - pick MIN { π_i(f(s)) } over all shingles s in D
(A min-hash sketch in code follows.)
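
A minimal min-hash sketch of this construction; random linear maps modulo a prime stand in for the true random permutations π_i (an assumption of this sketch, and a common one in practice):

import hashlib
import random

PRIME = (1 << 61) - 1          # Mersenne prime; the maps below act like permutations

def f(s):                      # fingerprint a shingle into 0 .. 2^60
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big") >> 4

def make_pi(rng):              # random linear map standing in for pi_i
    a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
    return lambda x: (a * x + b) % PRIME

def sketch(shingle_set, pis):
    fps = [f(s) for s in shingle_set]
    return [min(pi(x) for x in fps) for pi in pis]

def estimated_sim(sk1, sk2):   # fraction of coordinates where the minima agree
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

rng = random.Random(42)
pis = [make_pi(rng) for _ in range(200)]
A = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
B = {"a_rose_is_a", "rose_is_a_flower", "is_a_flower_which",
     "a_flower_which_is", "flower_which_is_a", "which_is_a_rose"}
print(estimated_sim(sketch(A, pis), sketch(B, pis)))   # close to Jaccard = 0.125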

62 Computing Sketch[i] for Doc1 (Sec. 19.6)
[Diagram over the number line 0 .. 2^64:]
- start with the 64-bit values f(shingles)
- permute them on the number line with π_i
- pick the min value

63 Test if Doc1.Sketch[i] = Doc2.Sketch[i] (Sec. 19.6)
[Diagram: the same permutation π(f(·)) applied to both documents on the line 0 .. 2^64]
Are MIN(π(f(A))) and MIN(π(f(B))) equal?
Test this for 200 random permutations: π_1, π_2, ..., π_200.

64 Notice that... (Sec. 19.6)
[Diagram: Doc1 and Doc2 fingerprints on the line 0 .. 2^64]
The two minima are equal iff the shingle with the MIN value in the union of Doc1 and Doc2 lies in their intersection.
Claim: this happens with probability |intersection| / |union| = sim(S_A, S_B).

65 All signature pairs (Sec. 19.6)
- This is an efficient method for estimating the similarity (Jaccard coefficient) of one pair of documents
- But we have to estimate N^2 similarities, where N is the number of web pages: still slow
- One solution: locality-sensitive hashing (LSH)
- Another solution: sorting

66 Information Retrieval: Link-based Ranking (2nd generation)

67 Query-independent ordering
First generation: using link counts as simple measures of popularity.
- Undirected popularity: each page gets a score equal to the number of its in-links plus the number of its out-links (e.g., 3+2=5)
- Directed popularity: score of a page = number of its in-links (e.g., 3)
Easy to SPAM!

68 Second generation: PageRank
- Each link has its own importance!!
- PageRank is independent of the query
- Many interpretations...

69 Basic Intuition… What about nodes with no in/out links?

70 Google’s Pagerank
Let B(i) be the set of pages linking to i, #out(j) the number of links outgoing from j, and e the vector with all components equal to 1/sqrt{N}. Writing M for the matrix with M_{j,i} = 1/#out(j) if j ∈ B(i) (and 0 otherwise), the rank vector r satisfies:
r = [ α M^T + (1-α) e e^T ] × r
The e e^T term is the random jump; r is the principal eigenvector of the matrix in brackets.

71 Three different interpretations
- Graph (intuitive interpretation): a form of co-citation
- Matrix (easy for computation): an eigenvector computation or a linear system solution
- Markov chain (useful to prove convergence): a sort of usage simulation
  - from any node, jump to one of its neighbors
  - “in the steady state” each page has a long-term visit rate; use this as the page’s score

72 Pagerank: use in Search Engines
Preprocessing:
- given the graph, build the matrix α M^T + (1-α) e e^T
- compute its principal eigenvector r; r[i] is the pagerank of page i
- we are interested in the relative order (a power-iteration sketch follows)
Query processing:
- retrieve the pages containing the query terms
- rank them by their pagerank
The final order is query-independent.
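
A minimal power-iteration sketch of this preprocessing step; treating dangling nodes as jumping uniformly is a common convention, not something stated on the slides:

def pagerank(out_links, alpha=0.85, iters=50):
    nodes = list(out_links)
    n = len(nodes)
    r = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - alpha) / n for u in nodes}    # random-jump term
        for u in nodes:
            targets = out_links[u]
            if targets:                               # spread alpha * r[u] over out-links
                share = alpha * r[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:                                     # dangling node: uniform spread
                for v in nodes:
                    nxt[v] += alpha * r[u] / n
        r = nxt
    return r

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(sorted(pagerank(g).items(), key=lambda kv: -kv[1]))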

73 HITS: Hypertext Induced Topic Search

74 Calculating HITS
- It is query-dependent
- It produces two scores per page:
  - Authority score: a good authority page for a topic is pointed to by many good hubs for that topic
  - Hub score: a good hub page for a topic points to many authoritative pages for that topic

75 Authority and Hub scores
[Diagram: pages 2, 3, 4 point to page 1; page 1 points to pages 5, 6, 7]
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)

76 HITS: Link Analysis Computation
a = A^T h        h = A a
where:
- a: vector of authority scores
- h: vector of hub scores
- A: adjacency matrix, with A_{i,j} = 1 if i → j
Thus h = A A^T h and a = A^T A a, i.e.:
- h is an eigenvector of A A^T
- a is an eigenvector of A^T A
and A A^T, A^T A are symmetric matrices.
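
A minimal sketch of the resulting iteration, with normalization at each step; the toy graph echoes slide 75 (pages 2, 3, 4 point to 1, and 1 points to 5, 6, 7):

def hits(out_links, iters=50):
    nodes = list(out_links)
    in_links = {u: [] for u in nodes}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)
    a = {u: 1.0 for u in nodes}
    h = {u: 1.0 for u in nodes}
    for _ in range(iters):
        a = {u: sum(h[w] for w in in_links[u]) for u in nodes}   # a = A^T h
        h = {u: sum(a[v] for v in out_links[u]) for u in nodes}  # h = A a
        na = sum(x * x for x in a.values()) ** 0.5 or 1.0        # normalize
        nh = sum(x * x for x in h.values()) ** 0.5 or 1.0
        a = {u: x / na for u, x in a.items()}
        h = {u: x / nh for u, x in h.items()}
    return a, h

g = {"1": ["5", "6", "7"], "2": ["1"], "3": ["1"], "4": ["1"],
     "5": [], "6": [], "7": []}
a, h = hits(g)
print("top authority:", max(a, key=a.get), "top hub:", max(h, key=h.get))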

77 Weighting links
Weight a link more if the query occurs in the neighborhood of the link (e.g., in the anchor text).

