15-853 Page 1: Algorithms in the Real World
Indexing and Searching III (well actually II)
– Link Analysis
– Near duplicate removal

15-853 Page 2: Indexing and Searching Outline
– Introduction: model, query types
– Inverted Indices: compression, lexicon, merging
– Vector Models: weights, cosine distance
– Latent Semantic Indexing
– Link Analysis: PageRank (Google), HITS
– Near Duplicate Removal

15-853 Page 3: Link Analysis
The goal is to rank pages. We want to take advantage of the link structure to do this. Two main approaches:
– Static: use the links to calculate a ranking of the pages offline (Google)
– Dynamic: use the links in the results of a search to dynamically determine a ranking (Clever – hubs and authorities)

15-853 Page 4: The Link Graph
View documents as graph nodes and the hyperlinks between documents as directed edges. Can give weights on edges (links) based on:
1. position in the document
2. weight of anchor term
3. number of occurrences of the link
4. …
[Figure: example link graph on documents d1–d9]

15-853 Page 5: The Link Matrix
An adjacency matrix of the link graph. Each row corresponds to the out-links of a vertex; each column corresponds to the in-links of a vertex. Often we normalize columns to sum to 1 so the dot-product with self is 1.
[Figure: example link graph on documents d1–d9]

15-853 Page 6: The Link Matrix
[Figure: the adjacency matrix for the example link graph on documents d1–d9]

15-853 Page 7: Google: Page Rank (1)
Goal: to give a total order (rank) on the “importance” of all pages they index.
Possible Method: count the number of in-links. Problems?
[Figure: example link graph on documents d1–d9]

15-853 Page 8: Google: Page Rank (2)
Refined Method: weigh each source by its importance. This seems like a self-recursive definition. Imagine the following: Jane Surfer has a lot of time to waste, and randomly selects an outgoing link each time she gets to a page. She counts how often she visits each page. Problem?
[Figure: example link graph on documents d1–d9]

15-853 Page 9: Side Topic: Markov Chains
A discrete-time stochastic process is a sequence of random variables {X_0, X_1, …, X_n, …} where the 0, 1, …, n, … are discrete points in time.
A Markov chain is a discrete-time stochastic process defined over a finite (or countably infinite) set of states S in terms of a matrix P of transition probabilities.
Memorylessness property: for a Markov chain,
Pr[X_{t+1} = j | X_0 = i_0, X_1 = i_1, …, X_t = i] = Pr[X_{t+1} = j | X_t = i]

15-853 Page 10: Side Topic: Markov Chains
Let π_i(t) be the probability of being in state i at time step t, and let π(t) = [π_0(t), π_1(t), …] be the vector of probabilities at time t. For an initial probability distribution π(0), the probabilities at time n are
π(n) = π(0) P^n
A probability distribution π is stationary if π = πP (i.e. π is a left eigenvector of P with eigenvalue 1).
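The iteration π(n) = π(0) P^n can be sketched directly. This is a minimal illustration, not from the slides; the 2-state transition matrix below is a made-up example.

```python
# Iterate pi <- pi P (row vector times matrix) until it converges to the
# stationary distribution pi = pi P.

def stationary(P, pi, steps=1000):
    """Apply pi <- pi P repeatedly, starting from distribution pi."""
    n = len(P)
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

P = [[0.9, 0.1],   # rows sum to 1 (row-stochastic convention here)
     [0.5, 0.5]]
pi = stationary(P, [1.0, 0.0])
# For this chain the stationary distribution is [5/6, 1/6]: solving
# pi = pi P gives pi_1 = 0.2 * pi_0 with pi_0 + pi_1 = 1.
```

Any starting distribution converges to the same π here, as the fundamental theorem on the next slide guarantees for irreducible aperiodic chains.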

15-853 Page 11: Side Topic: Markov Chains
A Markov chain is irreducible if the underlying graph (of P) consists of a single strong component (i.e. all states are reachable from all other states).
The period of a state i is the largest integer t such that P^n_{ii} = 0 for all n not in {t, 2t, 3t, …} (i.e. the chain can only return to i at multiples of t steps). A state is aperiodic if it has period 1. A Markov chain is aperiodic if all its states are aperiodic.
A state is recurrent if, upon entering this state, the process will definitely return to it.

15-853 Page 12: Side Topic: Markov Chains
Fundamental Theorem of Markov Chains: any irreducible, finite, and aperiodic Markov chain has the following properties:
1. All states are ergodic (aperiodic and recurrent).
2. There is a unique stationary distribution π, with π_i > 0 for all states i.
3. Let N(i,t) be the number of times the Markov chain visits state i in t steps. Then
lim_{t→∞} N(i,t)/t = π_i

15-853 Page 13: Google: Page Rank (3)
Remaining problem: this gives too high a rank to the Yahoos of the world, and too low a rank to clusters with low connectivity into the cluster.
Solution: with some probability ε, jump to a random page instead of a random outgoing link. The rank is then the stationary probability of a Markov chain whose transition probability matrix is
T = εU + (1 − ε)A
where U is the uniform matrix and A is the normalized link matrix (each column sums to 1, with equal probability of leaving on each link).

15-853 Page 14: Google: Page Rank (4)
We want to find π such that Tπ = π. This corresponds to finding the principal eigenvector of T. Methods:
– Power method: pick p_0 randomly and iterate p_{i+1} = T p_i until it settles. Decomposing p_0 into eigenvectors as p_0 = a_1 e_1 + a_2 e_2 + …, we have p_i = λ_1^i a_1 e_1 + λ_2^i a_2 e_2 + …
– Multilevel method: solve on a contracted graph, then use the result as the initial vector for the power method on the expanded graph.
– Lanczos method: diagonalize, then iterate.
All methods are quite expensive.
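The power method above can be sketched in a few lines. This is a minimal illustration under the slides' setup T = εU + (1−ε)A; the jump probability ε = 0.15, the iteration count, and the 3-page graph are made-up example values, and dangling pages (no out-links) are not handled.

```python
# Power method for PageRank: repeatedly apply p <- T p where
# T = eps*U + (1 - eps)*A, without materializing the matrices.

def pagerank(out_links, eps=0.15, iters=100):
    n = len(out_links)
    p = [1.0 / n] * n                      # start from the uniform vector
    for _ in range(iters):
        q = [eps / n] * n                  # random-jump part: eps * U p
        for i, outs in out_links.items():
            for j in outs:                 # link part: (1 - eps) * A p
                q[j] += (1 - eps) * p[i] / len(outs)
        p = q
    return p

# Tiny 3-page example: 0 -> 1 -> 2 -> 0, a cycle, so all ranks are equal.
ranks = pagerank({0: [1], 1: [2], 2: [0]})
```

Each iteration costs time proportional to the number of links, which is why even the "cheap" power method is expensive at web scale.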

15-853 Page 15: Hubs and Authorities (1)
Goal: to get a ranking for a particular query (instead of the whole web).
Assume we have a search engine that can give a set of pages P that match a query. Find all pages that point to P (the back set) and all pages that P points to (the forward set). Include the back set, P, and the forward set in a document set D.
[Figure: search results P (d1–d3) with back set b1–b4 and forward set f1–f4]

15-853 Page 16: Hubs and Authorities (2)
For a particular query we might want to find:
1. The “authorities” on the topic (where the actual information is kept)
2. The “hub” sites that point to the best information on the topic (typically some kind of link site)
Important authorities will have many pages that point to them; important hubs will point to many pages. But, as with PageRank, just having a lot of pointers does not mean much.

15-853 Page 17: Hubs and Authorities (3)
The vector h holds a weight for each document in D for how good a hub it is; the vector a holds a weight for each document in D for how good an authority it is. T is the adjacency matrix. Now repeat:
h = T^T a
a = T h
until it settles.
[Figure: hubs h1–h4 pointing to authorities f1–f4]

15-853 Page 18: Hubs and Authorities (4)
What is this process doing?
a_{i+1} = (T T^T) a_i
h_{i+1} = (T^T T) h_i
This is the power method for the eigenvectors of T T^T and T^T T, i.e. SVD(T): the first column of U and the first row of V.
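The alternating update can be sketched as below. This is an illustration, not the slides' exact code: it takes T[i][j] = 1 to mean page j links to page i (so a = T h sums the hub scores of the pages pointing at i, matching the update rules above), adds a normalization step each round, and uses a made-up 4-node graph.

```python
# HITS-style iteration: a = T h, h = T^T a, normalized each round.

def normalize(v):
    s = sum(x * x for x in v) ** 0.5
    return [x / s for x in v]

def hubs_authorities(T, iters=100):
    n = len(T)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        # a = T h: authority score = sum of hub scores of pages linking in
        a = normalize([sum(T[i][j] * h[j] for j in range(n)) for i in range(n)])
        # h = T^T a: hub score = sum of authority scores of pages linked to
        h = normalize([sum(T[i][j] * a[i] for i in range(n)) for j in range(n)])
    return a, h

# Pages 1, 2, 3 all link to page 0: page 0 becomes the sole authority,
# pages 1-3 become equal-weight hubs.
T = [[0, 1, 1, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hubs_authorities(T)
```

Since each round is one step of the power method on T T^T (for a) and T^T T (for h), the iteration converges to the principal singular vectors of T.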

15-853 Page 19: Hubs and Authorities (5)
Problem (topic drift): hubs and authorities will tend to drift to the general instead of the specific. E.g. you search for “Indexing Algorithms”, which identifies all pages with those terms in them, plus the back and forward pages. Now you use the SVD to find hubs and authorities. These will most likely be pages on algorithms in general rather than indexing algorithms.
Solution: weight the documents with “indexing” prominently in them more heavily:
a = TWh, h = T^T Wa
where W is a diagonal matrix with the weights on the diagonal. This is the same as finding the SVD of T′ = TW.

15-853 Page 20: Extensions
What about second eigenvectors? Mix terms and links.

15-853 Page 21: Indexing and Searching Outline
– Introduction: model, query types
– Inverted Indices: compression, lexicon, merging
– Vector Models: weights, cosine distance
– Latent Semantic Indexing
– Link Analysis: PageRank (Google), HITS
– Near Duplicate Removal

15-853 Page 22: Syntactic Clustering of the Web
Problem: many pages on the web are almost the same but not exactly the same:
– Mirror sites with local identifiers and/or local links
– Spam sites that try to improve their “pagerank”
– Plagiarized sites
– Revisions of the same document
– Program documentation converted to HTML
– Legal documents
These near duplicates cause web indexing sites a huge headache.

15-853 Page 23: Syntactic Clustering
Goal: find pages for which either
1. most of the content is the same, or
2. one is contained in the other (possibly with changes).
Constraints:
– Need to do it with only a small amount of information per document: a sketch
– Comparison of documents should take time linear in the size of the sketch
– Should be reasonably robust against an adversary

15-853 Page 24: Syntactic Clustering
Any ideas?

15-853 Page 25: w-Shingling
A w-shingle of a document is a contiguous subsequence of w tokens (e.g. words) of the document. View each shingle as a value. S_w(A) is the set of w-shingles for document A. As shorthand, I will often drop the w subscript.
[Figure: a window of width w sliding over the document]
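Computing S_w(A) is a sliding window over the token sequence. A minimal sketch, assuming whitespace tokenization (the slides do not fix a tokenizer):

```python
# S_w(A): the set of all contiguous runs of w tokens in document A.

def shingles(text, w):
    toks = text.split()
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}

S = shingles("a rose is a rose is a rose", 4)
# 8 tokens give 5 windows, but repeats collapse in the set:
# only 3 distinct 4-shingles remain.
```

Using a set (rather than a multiset) is what makes the resemblance and containment measures on the next slide well defined as plain set operations.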

15-853 Page 26: Resemblance and Containment
Resemblance: R(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
Containment: C(A,B) = |S(A) ∩ S(B)| / |S(A)|
[Figure: Venn diagrams of S(A) and S(B) illustrating the two measures]
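Both measures are direct set operations on the shingle sets. A minimal sketch with made-up shingle sets standing in for S(A) and S(B):

```python
# R(A,B) = |S(A) & S(B)| / |S(A) | S(B)|   (Jaccard similarity)
# C(A,B) = |S(A) & S(B)| / |S(A)|          (how much of A is inside B)

def resemblance(SA, SB):
    return len(SA & SB) / len(SA | SB)

def containment(SA, SB):
    return len(SA & SB) / len(SA)

SA = {1, 2, 3, 4}            # made-up shingle sets
SB = {3, 4, 5, 6, 7, 8}
# resemblance = 2/8 = 0.25, containment of SA in SB = 2/4 = 0.5
```

Note the asymmetry: C(A,B) can be 1 even when R(A,B) is small, which is exactly the "one is contained in the other" case from the goal slide.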

15-853 Page 27: Calculating Resemblance/Containment
Method 1: keep all shingles and calculate the union and intersection of the shingles directly. Problem: this can take |D|·w space per document (larger than the original document).
Other ideas?
– The first 100 shingles?
– Shingles from preselected random positions? E.g. 23, 786, 1121, 1456, …
– Shingles from every k-th position?

15-853 Page 28: Using Min-Valued Shingles
How about viewing each shingle as a number (e.g. by appending the ASCII codes) and selecting the 100 minimum-valued shingles? Possible problem?

15-853 Page 29: Hash
How about a random hash? I.e. for each shingle x take h(x), then pick the minimum k values:
S_k(A) = min_k { h(x) : x ∈ S(A) }
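The sketch S_k(A) is just the k smallest hash values over the shingle set. A minimal illustration; Python's built-in `hash` stands in for the random hash function here (an assumption for brevity, not a universal family like the one defined on the next slide):

```python
import heapq

# S_k(A) = the k minimum values of h(x) over all shingles x in S(A).
def min_sketch(shingle_set, k):
    return sorted(heapq.nsmallest(k, (hash(s) for s in shingle_set)))

S = min_sketch({"a b c", "b c d", "c d e", "d e f", "e f g"}, 3)
```

Because the same hash maps the same shingle to the same value in every document, two documents' sketches can be compared directly, in time linear in k, which is exactly the constraint from the syntactic-clustering slide.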

15-853 Page 30: Some Theory: Pairwise Independence
Universal hash functions (pairwise independent):
– H: a finite collection (family) of hash functions mapping U → {0, …, m−1}
– H is universal if, for h ∈ H picked uniformly at random, and for all x_1, x_2 ∈ U with x_1 ≠ x_2,
Pr(h(x_1) = h(x_2)) ≤ 1/m
The class of hash functions h_{ab}(x) = ((ax + b) mod p) mod m is universal, where p ≥ m is a prime, a ∈ {1, …, p−1}, and b ∈ {0, …, p−1}.
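The h_ab family above is a few lines of code. A minimal sketch; the choice of the Mersenne prime 2^31 − 1 for p is an assumption (any prime ≥ m works):

```python
import random

P = 2**31 - 1  # a prime p >= m; 2^31 - 1 is one convenient choice

def make_hash(m, rng=random):
    """Draw h_ab(x) = ((a*x + b) mod p) mod m with random a, b."""
    a = rng.randrange(1, P)     # a in {1, ..., p-1}
    b = rng.randrange(0, P)     # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % P) % m

h = make_hash(100)
# h maps any integer key into {0, ..., 99}, with collision probability
# at most about 1/100 over the random choice of (a, b).
```

Drawing fresh (a, b) pairs is how one gets the many independent hash functions used for the sketches on the next slides.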

15-853 Page 31: Some Theory: Min-Wise Independence
Min-wise independent permutations:
– S_n: a finite collection (family) of permutations mapping {1, …, n} → {1, …, n}
– The family S_n is min-wise independent if, for σ ∈ S_n picked uniformly at random, for every X ⊆ {1, …, n} and all x ∈ X,
Pr(min{σ(X)} = σ(x)) = 1/|X|
It is actually hard to find a “compact” collection of hash functions that is min-wise independent, but we can use an approximation.

15-853 Page 32: Back to Similarity
If σ ∈ S_n and S_n is min-wise independent, then:
Pr(min{σ(S(A))} = min{σ(S(B))}) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| = R(A,B)
This suggests we could just keep one minimum value as our “sketch”, but our confidence would be low. What we want for a sketch of size k is either:
– use k different σ’s and keep the minimum under each, or
– keep the k minimum values for one σ.
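The first variant (k independent hash functions, one minimum each) can be sketched end to end. This is an illustration only: it reuses the (ax + b) mod p family from the pairwise-independence slide, which, as the slides note, is merely an approximation to a truly min-wise independent family, and the shingle sets and k = 200 are made-up example values.

```python
import random

P = 2**31 - 1  # prime modulus for the (a*x + b) mod p family

def make_hashes(k, rng):
    """k random (a, b) pairs, i.e. k approximately min-wise hashes."""
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]

def min_sketch(S, hashes):
    """One minimum per hash function over the shingle set S."""
    return [min((a * (hash(x) % P) + b) % P for x in S) for a, b in hashes]

def estimate_resemblance(sk_a, sk_b):
    """Fraction of hash functions whose minima agree ~ R(A, B)."""
    return sum(1 for x, y in zip(sk_a, sk_b) if x == y) / len(sk_a)

rng = random.Random(0)
hs = make_hashes(200, rng)
A = set(range(0, 80))        # made-up shingle sets with true
B = set(range(20, 100))      # resemblance 60/100 = 0.6
est = estimate_resemblance(min_sketch(A, hs), min_sketch(B, hs))
```

Each agreement is (approximately) an independent coin flip with success probability R(A,B), so averaging over k functions gives an estimate whose standard deviation shrinks like 1/sqrt(k), which is why one minimum alone gives low confidence.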