Algorithms for Large Data Sets - Ziv Bar-Yossef - Lecture 5, April 23, 2006


1 Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 5, April 23, 2006

2 Ranking Algorithms

3 PageRank [Page, Brin, Motwani, Winograd 1998]
Motivating principles:
- The rank of p should be proportional to the ranks of the pages that point to p.
  (Recommendations from Bill Gates & Steve Jobs count more than ones from Moishale and Ahuva.)
- The rank of p should depend on the number of pages "co-cited" with p.
  (Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth.)

4 PageRank, Attempt #1
B = the normalized adjacency matrix: B_{p,q} = 1/outdeg(p) if p links to q, and 0 otherwise.
Rank equation: r = rB, i.e., r(q) = Σ_p r(p) · B_{p,q}.
Additional conditions:
- r is non-negative: r ≥ 0
- r is normalized: ||r||_1 = 1
Then: r is a non-negative normalized left eigenvector of B with eigenvalue 1.
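A minimal numpy sketch of this definition (toy graph of my own, not from the lecture): each row of the adjacency matrix is divided by its out-degree, and rows of pages with no outlinks stay all-zero.

```python
import numpy as np

def normalized_adjacency(adj):
    """Row-normalize a 0/1 adjacency matrix: B[p, q] = 1/outdeg(p) if p links to q.
    Rows of pages with no outlinks remain all-zero (the problem discussed on the next slide)."""
    adj = np.asarray(adj, dtype=float)
    outdeg = adj.sum(axis=1, keepdims=True)
    # avoid division by zero for sink pages; their rows stay zero
    return np.divide(adj, outdeg, out=np.zeros_like(adj), where=outdeg > 0)

# toy graph: page 0 -> 1, 2;  page 1 -> 2;  page 2 -> 0
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]])
B = normalized_adjacency(adj)
print(B)  # every row sums to 1 (or to 0 for sinks)
```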

5 PageRank, Attempt #1
A solution exists only if B has eigenvalue 1.
Problem: B may not have 1 as an eigenvalue, because some of its rows are all zero.
Example: if page p has no outlinks (a sink), the row of B corresponding to p is all zeros.
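A quick numeric illustration of the problem (my own toy example): for a two-page graph where page 0 links to page 1 and page 1 is a sink, B has a zero row and no eigenvalue equal to 1.

```python
import numpy as np

# page 0 -> page 1; page 1 has no outlinks (a sink), so its row in B is all zeros
B = np.array([[0.0, 1.0],
              [0.0, 0.0]])
print(np.linalg.eigvals(B))  # both eigenvalues are 0, so no left eigenvector with eigenvalue 1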

6 PageRank, Attempt #2
Rank equation: r = γ · rB, where γ = a normalization constant.
Then: r is a non-negative normalized left eigenvector of B with eigenvalue 1/γ.

7 PageRank, Attempt #2
Any nonzero eigenvalue λ of B may give a solution:
- γ = 1/λ
- r = any non-negative normalized left eigenvector of B with eigenvalue λ
Which solution to pick? Pick a "principal eigenvector" (i.e., one corresponding to the maximal λ).
How to find a solution? Power iterations.

8 PageRank, Attempt #2
Problem #1: The maximal eigenvalue may have multiplicity > 1.
- Several possible solutions.
- Happens, for example, when the graph is disconnected.
Problem #2: Rank accumulates at sinks.
- Only sinks, or nodes from which a sink cannot be reached, can have nonzero rank mass.

9 PageRank, Final Definition
e = the "rank source" vector. Standard setting: e(p) = ε/n for all p (ε < 1). 1 = the all-1's vector.
Rank equation: r = γ · r(B + 1e^T).
Then: r is a non-negative normalized left eigenvector of (B + 1e^T) with eigenvalue 1/γ.

10 PageRank, Final Definition
Any nonzero eigenvalue of (B + 1e^T) may give a solution. Pick r to be a principal left eigenvector of (B + 1e^T).
Will show:
- The principal eigenvalue has multiplicity 1, for any graph.
- There exists a non-negative left eigenvector.
Hence, PageRank always exists and is uniquely defined.
Due to the rank source vector, rank no longer accumulates at sinks.

11 An Alternative View of PageRank: The Random Surfer Model
When visiting a page p, a "random surfer":
- With probability 1 - d, selects a random outlink p → q and goes to visit q ("focused browsing").
- With probability d, jumps to a random web page q ("loss of interest").
- If p has no outlinks, assume it has a self loop.
P = the probability transition matrix: P_{p,q} = (1 - d) · B_{p,q} + d/n, where B is the normalized adjacency matrix after adding self loops to sinks.
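A sketch of the surfer's transition matrix and its stationary distribution (the toy graph and d = 0.15 are my own choices; the formula follows the slide's description):

```python
import numpy as np

def surfer_transition_matrix(adj, d=0.15):
    """P[p, q] = (1 - d) * B[p, q] + d / n, where B is the normalized adjacency
    matrix after giving every sink page a self loop, as the slide assumes."""
    adj = np.array(adj, dtype=float)
    n = adj.shape[0]
    sinks = adj.sum(axis=1) == 0
    adj[sinks] = np.eye(n)[sinks]             # self loop for pages with no outlinks
    B = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic "focused browsing" part
    return (1 - d) * B + d / n                # mix in the uniform random jump

def stationary_distribution(P, iters=100):
    """Power-iterate r <- r P from the uniform distribution (the PageRank vector)."""
    n = P.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = r @ P
    return r

adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 0, 0]])   # page 3 is a sink
P = surfer_transition_matrix(adj)
r = stationary_distribution(P)
print(r, r.sum())  # non-negative, sums to 1
```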

12 PageRank & Random Surfer Model
Suppose the rank source vector is uniform, with e(p) = d / ((1 - d) · n) for all p.
Then: P = (1 - d) · (B + 1e^T).
Therefore, r is a principal left eigenvector of (B + 1e^T) if and only if it is a principal left eigenvector of P (scaling a matrix does not change its eigenvectors).

13 PageRank & Markov Chains
The PageRank vector is the normalized principal left eigenvector of (B + 1e^T). Hence, the PageRank vector is also a principal left eigenvector of P.
Conclusion: PageRank is the unique stationary distribution of the random surfer Markov chain.
PageRank(p) = r(p) = the probability that the random surfer visits page p, in the limit.
Note: the "random jump" guarantees that the Markov chain is ergodic.

14 HITS: Hubs and Authorities [Kleinberg, 1997]
HITS: Hyperlink Induced Topic Search.
Main principle: every page p is associated with two scores:
- Authority score: how "authoritative" the page is about the query's topic.
  Ex: query "IR": authorities are scientific IR papers. Query "automobile manufacturers": authorities are the Mazda, Toyota, and GM web sites.
- Hub score: how good the page is as a "resource list" about the query's topic.
  Ex: query "IR": hubs are surveys and books about IR. Query "automobile manufacturers": hubs are KBB and car link lists.

15 Mutual Reinforcement
HITS principles:
- p is a good authority if it is linked to by many good hubs.
- p is a good hub if it points to many good authorities.

16 HITS: Algebraic Form
a: authority vector; h: hub vector; A: adjacency matrix.
Then (up to normalization): a = A^T h and h = A a.
Therefore: a is a principal eigenvector of A^T A, and h is a principal eigenvector of AA^T.
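A minimal sketch of these mutually reinforcing updates in numpy (toy graph of my own, not the lecture's code): repeatedly apply a ← A^T h, h ← A a and normalize, which converges to the principal eigenvectors of A^T A and AA^T.

```python
import numpy as np

def hits(A, iters=50):
    """A[p, q] = 1 if page p links to page q. Returns (authority, hub) score vectors."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h            # a page is a good authority if good hubs point to it
        h = A @ a              # a page is a good hub if it points to good authorities
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return a, h

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])      # toy graph: 0 -> 1, 0 -> 2, 1 -> 2
a, h = hits(A)
print("authorities:", a)       # page 2 gets the highest authority score
print("hubs:", h)              # page 0 gets the highest hub score
```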

17 Co-Citation and Bibliographic Coupling
A^T A: the co-citation matrix.
- (A^T A)_{p,q} = # of pages that link to both p and q.
- Thus: authority scores propagate through co-citation.
AA^T: the bibliographic coupling matrix.
- (AA^T)_{p,q} = # of pages that both p and q link to.
- Thus: hub scores propagate through bibliographic coupling.
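A two-line numpy check of these identities on the same toy graph (illustration only):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])   # same toy graph: 0 -> 1, 0 -> 2, 1 -> 2

print(A.T @ A)   # entry (p, q) = number of pages linking to both p and q (co-citation)
print(A @ A.T)   # entry (p, q) = number of pages that both p and q link to (coupling)
```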

18 Principal Eigenvector Computation
E: an n × n matrix.
λ_1, …, λ_n: eigenvalues of E, with |λ_1| > |λ_2| ≥ |λ_3| ≥ … ≥ |λ_n|. Suppose λ_1 > 0.
v_1, …, v_n: the corresponding eigenvectors; the eigenvectors form an orthonormal basis.
Input: the matrix E, and a unit vector u that is not orthogonal to v_1.
Goal: compute λ_1 and v_1.

19 The Power Method
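A minimal sketch of standard power iteration under the stated assumptions (toy matrix of my own): repeatedly multiply by E and normalize; the Rayleigh quotient w^T E w estimates λ_1.

```python
import numpy as np

def power_method(E, u, iters=100):
    """Estimate the principal eigenvalue and eigenvector of E.
    u must be a unit vector that is not orthogonal to v_1."""
    w = u / np.linalg.norm(u)
    for _ in range(iters):
        w = E @ w
        w /= np.linalg.norm(w)     # keep the iterate a unit vector
    lam = w @ (E @ w)              # Rayleigh quotient: estimate of lambda_1
    return lam, w

E = np.array([[2.0, 1.0],
              [1.0, 2.0]])         # eigenvalues 3 and 1
lam, v1 = power_method(E, np.array([1.0, 0.0]))
print(lam, v1)                     # ~3.0 and ~[0.707, 0.707]
```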

20 Why Does It Work?
Theorem: As t → ∞, w_t → c · v_1 (c is a constant).
Convergence rate: the error shrinks proportionally to (λ_2 / λ_1)^t.
The larger the "spectral gap" λ_1 - λ_2, the faster the convergence.

21 Spectral Methods in Information Retrieval

22 Outline
- Motivation: synonymy and polysemy
- Latent Semantic Indexing (LSI)
- Singular Value Decomposition (SVD)
- LSI via SVD
- Why LSI works
- HITS and SVD

23 Synonymy and Polysemy
Synonymy: multiple terms with (almost) the same meaning.
- Ex: cars, autos, vehicles.
- Harms recall.
Polysemy: a term with multiple meanings.
- Ex: java (programming language, coffee, island).
- Harms precision.

24 Traditional Solutions
Query expansion:
- Synonymy: OR over all synonyms (manual/automatic use of thesauri).
  Too few synonyms: recall stays low. Too many synonyms: precision suffers.
- Polysemy: AND on the term and additional specializing terms.
  Ex: +java +"programming language".
  Too-broad terms: precision stays low. Too-narrow terms: recall suffers.

25 Syntactic Space
D: document collection, |D| = n.
T: term space, |T| = m.
A: an m × n term-document matrix; A_{t,d} = the "weight" of term t in document d (e.g., TF-IDF).
A^T A: pairwise document similarities.
AA^T: pairwise term similarities.

26 Syntactic Indexing
Index keys: terms.
Limitations:
- Synonymy: (near-)identical rows.
- Polysemy.
- Space inefficiency: the matrix usually is not full rank.
- Gap between syntax and semantics: the information need is semantic, but the index and query are syntactic.

27 Semantic Space
C: concept space, |C| = r.
B: an r × n concept-document matrix; B_{c,d} = the "weight" of concept c in document d.
A change of basis; compare to wavelet and Fourier transforms.

28 Latent Semantic Indexing (LSI) [Deerwester et al. 1990]
Index keys: concepts. Documents & queries: mixtures of concepts.
Given a query, find the most similar documents.
Bridges the syntax-semantics gap.
Space-efficient:
- Concepts are orthogonal.
- The matrix is full rank.
Questions:
- What is the concept space?
- What is the transformation from the syntax space to the semantic space?
- How to filter out "noise concepts"?

29 Singular Values
A: an m×n real matrix.
Definition: σ ≥ 0 is a singular value of A if there exists a pair of vectors u, v s.t. Av = σu and A^T u = σv. u and v are called singular vectors.
Ex: σ = ||A||_2 = max_{||x||_2 = 1} ||Ax||_2. Corresponding singular vectors: the x that maximizes ||Ax||_2, and y = Ax / ||A||_2.
Note: A^T Av = σ^2 v and AA^T u = σ^2 u, so σ^2 is an eigenvalue of both A^T A and AA^T, v is an eigenvector of A^T A, and u is an eigenvector of AA^T.

30 Singular Value Decomposition (SVD)
Theorem: Every m×n real matrix A has a singular value decomposition A = U Σ V^T, where:
- σ_1 ≥ … ≥ σ_r > 0 (r = rank(A)): the singular values of A
- Σ = Diag(σ_1, …, σ_r)
- U: a column-orthonormal m×r matrix (U^T U = I)
- V: a column-orthonormal n×r matrix (V^T V = I)
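A small numpy illustration of the theorem (toy matrix of my own): numpy returns the thin factorization, so we keep only the r = rank(A) leading columns and values.

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 1.0]])              # a 2x3 matrix of rank 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-10)                        # numerical rank
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

print(s)                                     # singular values, sigma_1 >= ... >= sigma_r > 0
print(np.allclose(A, U @ np.diag(s) @ Vt))   # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(r)))       # U is column-orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(r)))     # V is column-orthonormal
```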

31 Singular Values vs. Eigenvalues
A = U Σ V^T.
σ_1, …, σ_r: the singular values of A.
σ_1^2, …, σ_r^2: the non-zero eigenvalues of A^T A and of AA^T.
u_1, …, u_r: the columns of U.
- Orthonormal basis for span(columns of A).
- Left singular vectors of A.
- Eigenvectors of AA^T.
v_1, …, v_r: the columns of V.
- Orthonormal basis for span(rows of A).
- Right singular vectors of A.
- Eigenvectors of A^T A.

32 LSI as SVD
A = U Σ V^T, hence U^T A = Σ V^T.
u_1, …, u_r: the concept basis.
B = Σ V^T: the LSI matrix.
A_d: the d-th column of A; B_d: the d-th column of B.
B_d = U^T A_d, so B_d[c] = u_c^T A_d.

33 Noisy Concepts
B = U^T A = Σ V^T, so B_d[c] = σ_c · v_d[c] (where v_d is the d-th row of V).
If σ_c is small, then B_d[c] is small for all d.
k = the largest i s.t. σ_i is "large". For all c = k+1, …, r and for all d, c is a low-weight concept in d.
Main idea: filter out all concepts c = k+1, …, r.
- Space efficient: # of index terms = k (vs. r or m).
- Better retrieval: noisy concepts are filtered out across the board.

34 Low-rank SVD
B = U^T A = Σ V^T.
U_k = (u_1, …, u_k), V_k = (v_1, …, v_k), Σ_k = the upper-left k×k sub-matrix of Σ.
A_k = U_k Σ_k V_k^T and B_k = Σ_k V_k^T.
rank(A_k) = rank(B_k) = k.
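A sketch of indexing and querying with the truncated factorization (the toy term-document matrix and the cut-off k = 2 are my own choices, not from the lecture): a query is folded into the concept space the same way as a document, q_k = U_k^T q, and documents are ranked by cosine similarity in that space.

```python
import numpy as np

# toy term-document matrix A (m terms x n documents), e.g. raw TF counts
A = np.array([[2, 1, 0, 0],    # "car"
              [1, 2, 0, 0],    # "auto"
              [0, 0, 1, 2],    # "java"
              [0, 0, 2, 1]],   # "coffee"
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                    # keep only the k strongest concepts
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

Bk = Sk @ Vtk                            # LSI matrix: documents as k-dim concept mixtures
Ak = Uk @ Bk                             # rank-k approximation of A

q = np.array([1.0, 0.0, 0.0, 0.0])       # query containing only the term "car"
qk = Uk.T @ q                            # fold the query into concept space
sims = (Bk.T @ qk) / (np.linalg.norm(Bk, axis=0) * np.linalg.norm(qk))
print(sims)   # documents 0 and 1 both score high, although doc 1 mostly uses the synonym "auto"
```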

35 Low Dimensional Embedding
Frobenius norm: ||M||_F = (Σ_{i,j} M_{i,j}^2)^{1/2}.
Fact: ||A - A_k||_F^2 = σ_{k+1}^2 + … + σ_r^2.
Therefore, if σ_{k+1}^2 + … + σ_r^2 is small, then for "most" pairs d, d', A_d · A_d' ≈ (A_k)_d · (A_k)_d'.
A_k preserves pairwise similarities among documents, so it is at least as good as A for retrieval.
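A quick numerical check of this fact (random matrix, my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.norm(A - Ak, 'fro') ** 2)   # equals sigma_{k+1}^2 + ... + sigma_r^2
print(np.sum(s[k:] ** 2))
```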

36 Computing SVD
Compute the singular values of A by computing the eigenvalues of A^T A.
Compute U and V by computing the eigenvectors of AA^T and of A^T A.
Running time is not great: O(m^2 n + m n^2), not practical for huge corpora.
Sub-linear time algorithms exist for estimating A_k [Frieze, Kannan, Vempala 1998].

37 HITS and SVD
A: the adjacency matrix of a web (sub-)graph G.
a: authority vector; h: hub vector.
a is the principal eigenvector of A^T A; h is the principal eigenvector of AA^T.
Therefore: a and h give A_1, the rank-1 SVD of A.
Generalization: using A_k, we can get k authority and hub vectors, corresponding to other topics in G.
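A short sketch of this connection (illustration only): the principal left and right singular vectors of A recover the hub and authority vectors, up to sign, and together with σ_1 they give A_1.

```python
import numpy as np

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)     # same toy web graph as before

U, s, Vt = np.linalg.svd(A)
h = np.abs(U[:, 0])    # principal left singular vector  = hub vector (up to sign)
a = np.abs(Vt[0, :])   # principal right singular vector = authority vector (up to sign)
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])    # rank-1 SVD of A, determined by a and h

print("hub scores:", h)
print("authority scores:", a)
print("rank-1 approximation:\n", A1)
```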

38 Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]
LSI summary:
- Documents are embedded in a low-dimensional space (m → k dimensions).
- Pairwise similarities are preserved.
- More space-efficient.
But why is retrieval better? Synonymy? Polysemy?

39 Generative Model
A corpus model M = (T, C, W, D):
- T: term space, |T| = m.
- C: concept space, |C| = k. A concept is a distribution over terms.
- W: topic space. A topic is a distribution over concepts.
- D: document distribution, a distribution over W × N.
A document d is generated as follows:
- Sample a topic w and a length n according to D.
- Repeat n times: sample a concept c from C according to w, then sample a term t from T according to c.

40 Simplifying Assumptions
- Every document has a single topic (W = C).
- For every two concepts c, c': ||c - c'|| ≥ 1 - ε.
- The probability of every term under a concept c is at most some constant τ.

41 LSI Works
A: an m×n term-document matrix, representing n documents generated according to the model.
Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents d, d':
- If topic(d) = topic(d'), then A_d^k and A_d'^k are (nearly) parallel.
- If topic(d) ≠ topic(d'), then A_d^k and A_d'^k are (nearly) orthogonal.

42 Proof
For simplicity, assume ε = 0.
Want to show:
- If topic(d) = topic(d'), then A_d^k || A_d'^k.
- If topic(d) ≠ topic(d'), then A_d^k ⊥ A_d'^k.
D_c: the documents whose topic is the concept c. T_c: the terms in supp(c).
Since ||c - c'|| = 1, T_c ∩ T_c' = Ø.
A has non-zeroes only in blocks B_1, …, B_k, where B_c is the sub-matrix of A with rows in T_c and columns in D_c.
A^T A is a block-diagonal matrix with blocks B_1^T B_1, …, B_k^T B_k.
The (i,j)-th entry of B_c^T B_c is the term similarity between the i-th and j-th documents whose topic is the concept c.
B_c^T B_c: the adjacency matrix of a (multi-)graph G_c on D_c.

43 Proof (cont.)
G_c is a "random" graph, so the first and second eigenvalues of B_c^T B_c are well separated.
For all c, c', the second eigenvalue of B_c^T B_c is smaller than the first eigenvalue of B_c'^T B_c'.
Hence the top k eigenvalues of A^T A are the principal eigenvalues of B_c^T B_c for c = 1, …, k. Let u_1, …, u_k be the corresponding eigenvectors.
For every document d on topic c, A_d is orthogonal to all of u_1, …, u_k except u_c, so A_d^k is a scalar multiple of u_c.

44 Extensions [Azar et al. 2001]
A more general generative model.
Also explains the improved treatment of polysemy.

45 End of Lecture 5