10-603/15-826A: Multimedia Databases and Data Mining SVD - part II (more case studies) C. Faloutsos.

Slides:



Advertisements
Similar presentations
CMU SCS PageRank Brin, Page description: C. Faloutsos, CMU.
Advertisements

CMU SCS : Multimedia Databases and Data Mining Lecture #20: SVD - part III (more case studies) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.
Markov Models.
Information Networks Link Analysis Ranking Lecture 8.
Graphs, Node importance, Link Analysis Ranking, Random walks
Dimensionality Reduction PCA -- SVD
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Link Analysis: PageRank
15-826: Multimedia Databases and Data Mining
10/11/2001Random walks and spectral segmentation1 CSE 291 Fall 2001 Marina Meila and Jianbo Shi: Learning Segmentation by Random Walks/A Random Walks View.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1cs542g-term High Dimensional Data  So far we’ve considered scalar data values f i (or interpolated/approximated each component of vector values.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Link Analysis Ranking. How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would.
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 4 March 30, 2005
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
15-853Page :Algorithms in the Real World Indexing and Searching III (well actually II) – Link Analysis – Near duplicate removal.
Multimedia Databases SVD II. SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies SVD properties More case.
Link Analysis, PageRank and Search Engines on the Web
The Terms that You Have to Know! Basis, Linear independent, Orthogonal Column space, Row space, Rank Linear combination Linear transformation Inner product.
Clustering In Large Graphs And Matrices Petros Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, V. Vinay Presented by Eric Anderson.
Text and Web Search. Text Databases and IR Text databases (document databases) Large collections of documents from various sources: news articles, research.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 5 April 23, 2006
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 6 May 7, 2006
10-603/15-826A: Multimedia Databases and Data Mining SVD - part I (definitions) C. Faloutsos.
Link Structure and Web Mining Shuying Wang
Markov Models. Markov Chain A sequence of states: X 1, X 2, X 3, … Usually over time The transition from X t-1 to X t depends only on X t-1 (Markov Property).
Multimedia Databases LSI and SVD. Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and.
Singular Value Decomposition and Data Management
E.G.M. PetrakisDimensionality Reduction1  Given N vectors in n dims, find the k most important axes to project them  k is user defined (k < n)  Applications:
Multimedia Indexing and Dimensionality Reduction.
1cs542g-term Notes  Extra class next week (Oct 12, not this Friday)  To submit your assignment: me the URL of a page containing (links to)
Link Analysis HITS Algorithm PageRank Algorithm.
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
CMU SCS : Multimedia Databases and Data Mining Lecture #20: SVD - part III (more case studies) C. Faloutsos.
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
Using Adaptive Methods for Updating/Downdating PageRank Gene H. Golub Stanford University SCCM Joint Work With Sep Kamvar, Taher Haveliwala.
CSE554AlignmentSlide 1 CSE 554 Lecture 5: Alignment Fall 2011.
Random Walks and Semi-Supervised Learning Longin Jan Latecki Based on : Xiaojin Zhu. Semi-Supervised Learning with Graphs. PhD thesis. CMU-LTI ,
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Carnegie Mellon Powerful Tools for Data Mining Fractals, power laws, SVD C. Faloutsos Carnegie Mellon University.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
Linear Subspace Transforms PCA, Karhunen- Loeve, Hotelling C306, 2000.
Introduction to Linear Algebra Mark Goldman Emily Mackevicius.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
CMU SCS : Multimedia Databases and Data Mining Lecture #18: SVD - part I (definitions) C. Faloutsos.
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
Chapter 13 Discrete Image Transforms
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Search Engines and Link Analysis on the Web
15-826: Multimedia Databases and Data Mining
Part 2: Graph Mining Tools - SVD and ranking
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Lecture 22 SVD, Eigenvector, and Web Search
15-826: Multimedia Databases and Data Mining
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Next Generation Data Mining Tools: SVD and Fractals
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

10-603/15-826A: Multimedia Databases and Data Mining SVD - part II (more case studies) C. Faloutsos

Multi DB and D.M. Copyright: C. Faloutsos (2001)

Outline
Goal: 'Find similar / interesting things'
- Intro to DB
- Indexing - similarity search
- Data Mining

Indexing - Detailed outline
- primary key indexing
- secondary key / multi-key indexing
- spatial access methods
- fractals
- text
- Singular Value Decomposition (SVD)
- multimedia
- ...

SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- SVD properties
- More case studies
- Conclusions

SVD - detailed outline
- ...
- Case studies
- SVD properties
- more case studies
  – google/Kleinberg algorithms
  – query feedbacks
- Conclusions

SVD - Other properties - summary
- can produce an orthogonal basis (obvious) (who cares?)
- can solve over- and under-determined linear problems (see property C(1))
- can compute 'fixed points' (= 'steady-state probabilities' in Markov chains) (see property C(4))

SVD - outline of properties
- (A): obvious
- (B): less obvious
- (C): least obvious (and most powerful!)

Properties - by definition:
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
A(1): U^T [r x n] U [n x r] = I [r x r] (identity matrix)
A(2): V^T [r x m] V [m x r] = I [r x r]
A(3): Σ^k = diag(λ_1^k, λ_2^k, ..., λ_r^k) (k: ANY real number)
A(4): A^T = V Σ U^T

Less obvious properties
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
B(1): A [n x m] (A^T) [m x n] = ??

Less obvious properties
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
B(1): A [n x m] (A^T) [m x n] = U Σ^2 U^T (symmetric; intuition?)

Less obvious properties
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
B(1): A [n x m] (A^T) [m x n] = U Σ^2 U^T (symmetric; intuition? the 'document-to-document' similarity matrix)
B(2): symmetrically, for 'V': (A^T) [m x n] A [n x m] = V Σ^2 V^T (intuition?)
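A quick numerical check of B(1) and B(2) - a sketch using a made-up 3x3 document-term matrix (NumPy assumed):

```python
import numpy as np

# Made-up document-term counts: rows = documents, columns = terms.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A)           # A = U diag(s) V^T

# B(1): A A^T = U Sigma^2 U^T -- the document-to-document similarity matrix
dd = U @ np.diag(s**2) @ U.T
# B(2): A^T A = V Sigma^2 V^T -- the term-to-term similarity matrix
tt = Vt.T @ np.diag(s**2) @ Vt

assert np.allclose(A @ A.T, dd)
assert np.allclose(A.T @ A, tt)
```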

Less obvious properties
A^T A: the term-to-term similarity matrix
B(3): ((A^T) [m x n] A [n x m])^k = V Σ^(2k) V^T
and
B(4): (A^T A)^k ~ v_1 λ_1^(2k) v_1^T for k >> 1
where v_1: the [m x 1] first column (eigenvector) of V, and λ_1: the strongest eigenvalue

Less obvious properties
B(4): (A^T A)^k ~ v_1 λ_1^(2k) v_1^T for k >> 1
B(5): (A^T A)^k v' ~ (constant) v_1
i.e., for (almost) any v', it converges to a vector parallel to v_1
Thus, useful to compute the first eigenvector/value (as well as the next ones, too...)
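Property B(5) is exactly why power iteration works; a minimal sketch with a made-up matrix (NumPy assumed):

```python
import numpy as np

def strongest_right_vector(A, n_iter=100):
    """Repeatedly apply A^T A to an (almost) arbitrary start vector;
    by property B(5) the result converges to a vector parallel to v_1."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[1])
    for _ in range(n_iter):
        v = A.T @ (A @ v)              # one application of A^T A
        v = v / np.linalg.norm(v)      # renormalize to avoid overflow
    return v

A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
v = strongest_right_vector(A)

# Compare with the first right singular vector from a full SVD
# (the two can differ by sign, hence the min over both).
_, _, Vt = np.linalg.svd(A)
assert min(np.linalg.norm(v - Vt[0]), np.linalg.norm(v + Vt[0])) < 1e-6
```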

Less obvious properties - repeated:
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
B(1): A [n x m] (A^T) [m x n] = U Σ^2 U^T
B(2): (A^T) [m x n] A [n x m] = V Σ^2 V^T
B(3): ((A^T) [m x n] A [n x m])^k = V Σ^(2k) V^T
B(4): (A^T A)^k ~ v_1 λ_1^(2k) v_1^T
B(5): (A^T A)^k v' ~ (constant) v_1

Least obvious properties
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
C(1): A [n x m] x [m x 1] = b [n x 1]: let x_0 = V Σ^(-1) U^T b
- if under-specified, x_0 gives the 'shortest' solution
- if over-specified, it gives the 'solution' with the smallest least-squares error
(see Num. Recipes, p. 62)

Least obvious properties
Illustration: under-specified, e.g., [1 2] [w z]^T = 4 (i.e., 1w + 2z = 4)
(figure: in the (w, z) plane, the line of all possible solutions; x_0 is the shortest-length solution)

Least obvious properties
Illustration: over-specified, e.g., [3 2]^T [w] = [1 2]^T (i.e., 3w = 1; 2w = 2)
(figure: the line of reachable points (3w, 2w) versus the desirable point b)
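Property C(1) can be checked directly on the over-specified example above (a NumPy sketch; np.linalg.lstsq is used only for comparison):

```python
import numpy as np

# Over-specified system from the slide: 3w = 1 and 2w = 2.
A = np.array([[3.0],
              [2.0]])
b = np.array([1.0, 2.0])

# C(1): x_0 = V Sigma^(-1) U^T b
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x0 = Vt.T @ np.diag(1.0 / s) @ U.T @ b

# It matches the least-squares solution
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x0, x_ls)
```

Here the normal-equations answer is w = (3*1 + 2*2) / (3^2 + 2^2) = 7/13, which is what both routes return.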

Least obvious properties - cont'd
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
C(2): A [n x m] v_1 [m x 1] = λ_1 u_1 [n x 1], where v_1, u_1 are the first (column) vectors of V, U (v_1 == right-eigenvector)
C(3): symmetrically: u_1^T A = λ_1 v_1^T (u_1 == left-eigenvector)
Therefore:

Least obvious properties - cont'd
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
C(4): A^T A v_1 = λ_1^2 v_1
(fixed point - the usual definition of eigenvector for a symmetric matrix)

Least obvious properties - altogether
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
C(1): if A [n x m] x [m x 1] = b [n x 1], then x_0 = V Σ^(-1) U^T b: shortest, actual or least-squares solution
C(2): A [n x m] v_1 [m x 1] = λ_1 u_1 [n x 1]
C(3): u_1^T A = λ_1 v_1^T
C(4): A^T A v_1 = λ_1^2 v_1

Properties - conclusions
A(0): A [n x m] = U [n x r] Σ [r x r] V^T [r x m]
B(5): (A^T A)^k v' ~ (constant) v_1
C(1): if A [n x m] x [m x 1] = b [n x 1], then x_0 = V Σ^(-1) U^T b: shortest, actual or least-squares solution
C(4): A^T A v_1 = λ_1^2 v_1

SVD - detailed outline
- ...
- Case studies
- SVD properties
- more case studies
  – Kleinberg/google algorithms
  – query feedbacks
- Conclusions

Kleinberg's algorithm
Problem definition: given the web and a query, find the most 'authoritative' web pages for this query
- Step 0: find all pages containing the query terms
- Step 1: expand by one move forward and backward

Kleinberg's algorithm
Step 1: expand by one move forward and backward

Kleinberg's algorithm
On the resulting graph:
- give a high score (= 'authority') to nodes that many important nodes point to
- give a high importance score ('hub') to nodes that point to good 'authorities'
(figure: hubs on one side, authorities on the other)

Kleinberg's algorithm - observations
- recursive definition!
- each node (say, the i-th node) has both an authoritativeness score a_i and a hubness score h_i

Kleinberg's algorithm
Let E be the set of edges and A be the adjacency matrix: entry (i,j) is 1 if the edge from i to j exists.
Let h and a be [n x 1] vectors with the 'hubness' and 'authoritativeness' scores. Then:

Kleinberg's algorithm
Then: a_i = h_k + h_l + h_m
that is: a_i = Sum(h_j) over all j such that edge (j,i) exists
or: a = A^T h
(figure: nodes k, l, m pointing to node i)

Kleinberg's algorithm
Symmetrically, for the 'hubness': h_i = a_n + a_p + a_q
that is: h_i = Sum(a_j) over all j such that edge (i,j) exists
or: h = A a
(figure: node i pointing to nodes n, p, q)

Kleinberg's algorithm
In conclusion, we want vectors h and a such that:
h = A a
a = A^T h
Recall properties:
C(2): A [n x m] v_1 [m x 1] = λ_1 u_1 [n x 1]
C(3): u_1^T A = λ_1 v_1^T

Kleinberg's algorithm
In short, the solutions to h = A a and a = A^T h are the left- and right-eigenvectors of the adjacency matrix A.
Starting from a random a' and iterating, we'll eventually converge.
(Q: to which of all the eigenvectors? why?)

Kleinberg's algorithm
(Q: to which of all the eigenvectors? why?)
A: to the ones of the strongest eigenvalue, because of property B(5):
B(5): (A^T A)^k v' ~ (constant) v_1
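The whole iteration fits in a few lines; a sketch on a made-up 4-node graph (NumPy assumed), where nodes 0 and 1 point to nodes 2 and 3:

```python
import numpy as np

def hits(A, n_iter=100):
    """Iterate h = A a, a = A^T h; by property B(5) this converges to
    the strongest left/right eigenvector pair of A (hubs/authorities)."""
    a = np.ones(A.shape[0])
    for _ in range(n_iter):
        h = A @ a                      # hub scores from authority scores
        a = A.T @ h                    # authority scores from hub scores
        a = a / np.linalg.norm(a)      # renormalize each round
    h = A @ a
    return h / np.linalg.norm(h), a

# Adjacency matrix: entry (i, j) = 1 iff edge i -> j exists.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
h, a = hits(A)
# nodes 0, 1 come out as the hubs; nodes 2, 3 as the authorities
```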

Kleinberg's algorithm - results
E.g., for the query 'java': java.sun.com ("the java developer")

Kleinberg's algorithm - discussion
- the 'authority' score can be used to find 'similar pages' (how?)
- closely related to 'citation analysis', social networks / 'small world' phenomena

google/page-rank algorithm
Closely related: imagine a particle randomly moving along the edges (*); compute its steady-state probabilities.
(*) with occasional random jumps

google/page-rank algorithm
~identical problem: given a Markov chain, compute the steady-state probabilities p_1, ..., p_n

google/page-rank algorithm
Let A be the transition matrix (= adjacency matrix, row-normalized: the sum of each row = 1)

google/page-rank algorithm
A p = p

google/page-rank algorithm
A p = p: thus, p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is row-normalized)
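A power-iteration sketch of the steady state on a made-up 3-node chain (NumPy assumed; no random jumps). Note that with p stored as a column vector and A row-normalized, the fixed point is computed as p = A^T p:

```python
import numpy as np

# Row-normalized transition matrix of a made-up 3-node chain
# (each row sums to 1; the self-loops keep the chain aperiodic).
A = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.6, 0.4]])

p = np.ones(3) / 3                 # start from the uniform distribution
for _ in range(500):
    p = A.T @ p                    # one random-walk step

# p is now (numerically) the steady-state probability vector
assert np.allclose(A.T @ p, p)
assert abs(p.sum() - 1.0) < 1e-9
```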

Kleinberg/google - conclusions
SVD helps in graph analysis:
- hub/authority scores: strongest left- and right-eigenvectors of the adjacency matrix
- random walk on a graph: steady-state probabilities are given by the strongest eigenvector of the transition matrix

SVD - detailed outline
- ...
- Case studies
- SVD properties
- more case studies
  – google/Kleinberg algorithms
  – query feedbacks
- Conclusions

Query feedbacks [Chen & Roussopoulos, SIGMOD 94]
Sample problem: estimate selectivities (e.g., 'how many movies were made between 1940 and 1945?') for query optimization, LEARNING from the query results so far!

Query feedbacks
Idea #1: consider a function for the CDF (cumulative distribution function), e.g., a 6th-degree polynomial (or splines, or whatever)
(figure: count-so-far versus year)

Query feedbacks
GREAT Idea #2: adapt your model as you see the actual counts of the actual queries
(figure: original estimate versus actual counts of count-so-far over year)

Query feedbacks
(figures: a query arrives; its actual count differs from the original estimate; the model adapts to a new estimate)

Query feedbacks
Eventually, the problem becomes:
- estimate the parameters a_1, ..., a_7 of the model
- to minimize the least-squares error from the real answers so far.
Formally:

Query feedbacks
Formally, with n queries and 6th-degree polynomials:

Query feedbacks
where the x_{i,j} are such that Sum(x_{i,j} * a_i) = our estimate for the # of movies, and b_j is the actual count

Query feedbacks
In matrix form: X a = b
(rows of X: 1st query ... n-th query)

Query feedbacks
In matrix form: X a = b, and the least-squares estimate for a is
a = V Σ^(-1) U^T b
according to property C(1) (let X = U Σ V^T)
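A sketch of the fit itself (NumPy assumed; made-up counts, and degree 2 instead of 6 to keep it short):

```python
import numpy as np

# Made-up feedback: b[j] = actual count of movies up to years[j].
years = np.array([1940.0, 1942.0, 1945.0, 1950.0])
b = np.array([10.0, 14.0, 21.0, 30.0])

t = (years - 1940.0) / 10.0            # rescale for numerical conditioning
X = np.vander(t, 3, increasing=True)   # row j = [1, t_j, t_j^2]

# a = V Sigma^(-1) U^T b, per property C(1)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
a = Vt.T @ np.diag(1.0 / s) @ U.T @ b

# same answer as NumPy's least-squares routine
a_ls, *_ = np.linalg.lstsq(X, b, rcond=None)
assert np.allclose(a, a_ls)
```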

Query feedbacks - enhancements
The solution a = V Σ^(-1) U^T b works, but needs an expensive SVD each time a new query arrives.
GREAT Idea #3: use 'Recursive Least Squares' (RLS) to adapt a incrementally. Details: in the paper - intuition:

Query feedbacks - enhancements
Intuition:
(figures: a least-squares line a_1 x + a_2 fit through the (x, b) points; a new query point arrives; the fit updates to a'_1 x + a'_2)

Query feedbacks - enhancements
The new coefficients can be quickly computed from the old ones, plus statistics kept in a (7x7) matrix (no need to know the details, although RLS is a brilliant method).
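A minimal RLS sketch (NumPy assumed; a made-up 2-coefficient model rather than the paper's 7). Each update costs only a few small (d x d) matrix operations, and setting lam below 1 down-weights older queries:

```python
import numpy as np

def rls_update(a, P, x, b, lam=1.0):
    """One Recursive Least Squares step: fold the new pair (x, b)
    into the coefficients a, keeping only the small matrix P as state.
    lam < 1 down-weights older observations ('forgetting' factor)."""
    Px = P @ x
    k = Px / (lam + x @ Px)        # gain vector
    a = a + k * (b - x @ a)        # correct a toward the new answer
    P = (P - np.outer(k, Px)) / lam
    return a, P

d = 2
a = np.zeros(d)
P = np.eye(d) * 1e6                # large P ~ 'no prior knowledge'

# Made-up feedback pairs consistent with b = 1 + 2 * x
for x, b in [(np.array([1.0, 0.0]), 1.0),
             (np.array([1.0, 1.0]), 3.0),
             (np.array([1.0, 2.0]), 5.0)]:
    a, P = rls_update(a, P, x, b)

# a approaches the least-squares fit (intercept 1, slope 2)
assert np.allclose(a, [1.0, 2.0], atol=1e-2)
```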

Query feedbacks - enhancements
GREAT Idea #4: a 'forgetting' factor - we can even down-play the weight of older queries, since the data distribution might have changed (comes for 'free' with RLS...)

Query feedbacks - conclusions
SVD helps find the least-squares solution, to adapt to query feedbacks.
(RLS = Recursive Least Squares is a great method to incrementally update least-squares fits)

SVD - detailed outline
- ...
- Case studies
- SVD properties
- more case studies
  – google/Kleinberg algorithms
  – query feedbacks
- Conclusions

Conclusions
SVD: a valuable tool
- given a document-term matrix, it finds 'concepts' (LSI)
- ... and can reduce dimensionality (KL)
- ... and can find rules (PCA; RatioRules)

Conclusions - cont'd
- ... and can find fixed points or steady-state probabilities (google / Kleinberg / Markov chains)
- ... and can solve optimally over- and under-constrained linear systems (least squares / query feedbacks)

References
Brin, S. and L. Page (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl. World Wide Web Conf.
Chen, C. M. and N. Roussopoulos (May 1994). Adaptive Selectivity Estimation Using Query Feedback. Proc. of ACM SIGMOD, Minneapolis, MN.

References - cont'd
Kleinberg, J. (1998). Authoritative Sources in a Hyperlinked Environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms.
Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C. Cambridge University Press.