Multilinear Algebra for Analyzing Data with Multiple Linkages
Tamara G. Kolda, with Brett Bader, Danny Dunlavy, and Philip Kegelmeyer
Sandia National Laboratories
TRICAP 2006, Chania, Greece, June 4-9, 2006

Slide 2: Linear Algebra for Data with Linkages
[Figure: a circle-square link matrix A, the circle-circle co-link matrix A Aᵀ, the square-square co-link matrix Aᵀ A, and the SVD rank-k approximation (k=2).]
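To make the slide concrete, here is a minimal numpy sketch (not from the talk) of forming the co-link matrices and a rank-2 SVD approximation for a small made-up link matrix:

```python
# A minimal sketch, assuming a hypothetical circles x squares link matrix A;
# the matrix values below are illustrative stand-ins.
import numpy as np

A = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 1., 1.]])        # circles x squares link matrix

coC = A @ A.T                            # circle-circle co-link matrix
coS = A.T @ A                            # square-square co-link matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-2 approximation
print(np.round(A_k, 2))
```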

Slide 3: Latent Semantic Indexing (LSI) for Text Retrieval
[Figure: a term-document matrix over terms {car, repair, service, military} and documents d1, d2, d3, with a "car service" query. Lineage: SMART Retrieval System, G. Salton (1971); LSI, S. Dumais et al. (1988).]
References:
- S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using latent semantic analysis to improve access to textual information. In CHI '88, pp. 281–285, 1988.
- S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci., 41(6):391–407, 1990.
- M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Rev., 37(4):573–595, 1995.

Slide 4: Applications of LSI
[Figure: the same term-document example (terms car, repair, service, military; documents d1, d2, d3), with the results graphed using U₂ and V₂, plus the derived term-document, term-term, and document-document similarity matrices.]

Slide 5: Caveats for LSI
How to use it: the term-document matrix weighting is critical! Each entry combines three pieces:
- Local weight: logarithmic in the raw frequency f_ij of term i in document j (commonly log(1 + f_ij)).
- Global term weight: inverse document frequency, log(N / n_i), where N = total number of documents and n_i = number of documents containing term i.
- Normalization factor: "cosine" normalization, scaling each document vector to unit length.
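As a hedged illustration (the slide's exact formulas were images; the common log, IDF, and cosine variants are assumed), a numpy sketch of the weighting:

```python
# A minimal weighting sketch on a made-up frequency matrix; the
# log(1+f) and log(N/n_i) forms are assumptions, not the talk's exact ones.
import numpy as np

F = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 1., 4.]])            # f_ij: frequency of term i in doc j

N = F.shape[1]                           # total number of documents
n = (F > 0).sum(axis=1)                  # n_i: number of docs containing term i

L = np.log1p(F)                          # local log weight
G = np.log(N / n)                        # global IDF weight
W = L * G[:, None]                       # weighted term-document matrix
W /= np.linalg.norm(W, axis=0)           # cosine-normalize each document
print(np.round(W, 2))
```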

Slide 6: Citation/Link Analysis (Same Nodes)
From a link matrix L (examples: citation data, web links), form the co-citation matrix Lᵀ L and the co-reference matrix L Lᵀ; the dominant singular vectors of L give the hub and authority scores. In the example, Doc 3 is the most important authority and Doc 1 is the most important hub!
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
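A minimal sketch of hub/authority scoring via the dominant singular vectors, on a made-up link matrix (the example data is not from the talk):

```python
# Hub/authority scores as in Kleinberg (1999), computed from the SVD
# of an illustrative link matrix.
import numpy as np

L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # L[i, j] = 1 if doc i links to doc j

U, s, Vt = np.linalg.svd(L)
hubs = np.abs(U[:, 0])                       # hub scores: dominant left singular vector
auth = np.abs(Vt[0, :])                      # authority scores: dominant right singular vector
print("top hub:", hubs.argmax(), "top authority:", auth.argmax())
```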

Slide 7: Multiple Links?
Suppose the connections between nodes are "labeled" in some fashion; in other words, we have meta-data on the connections. Can we somehow use multilinear algebra for link analysis?

Slide 8: PARAFAC
PARAFAC = Parallel Factors, also known as CANDECOMP = Canonical Decomposition; it is the higher-order analogue of the SVD. An I x J x K tensor X is written as a sum of R rank-one terms,
  X = Σ_{r=1}^{R} a_r ∘ b_r ∘ c_r,
with factor matrices A (I x R), B (J x R), and C (K x R). The columns of A, B, and C are not orthonormal. If R is minimal, then R is called the rank of the tensor (Kruskal 1977), and it is possible to have rank(X) > min{I, J, K}. The decomposition is often guaranteed to be a unique rank decomposition!
References:
- R. A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
- J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35:283–319, 1970.
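Since later slides discuss PARAFAC-ALS, here is a minimal dense numpy sketch of CP-ALS; the random tensor is a toy stand-in, and no sparse-data handling is attempted:

```python
# Rank-R PARAFAC fitted by alternating least squares (CP-ALS); a sketch,
# not the talk's implementation.
import numpy as np

def khatri_rao(U, V):
    # Columnwise Kronecker product: row i*J + j holds U[i, :] * V[j, :].
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

def unfold(X, mode):
    # Mode-n unfolding with C ordering (consistent with khatri_rao above).
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, R, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((n, R)) for n in X.shape)
    for _ in range(iters):
        A = unfold(X, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(X, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(X, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

X = np.random.default_rng(1).standard_normal((4, 5, 6))
A, B, C = cp_als(X, R=3)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)   # reassemble [[A, B, C]]
print("relative fit error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```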

Slide 9: Many Ways to Write PARAFAC
The "Kruskal operator" gives compact notation, X = [[A, B, C]], which makes the N-way case easy to write: X = [[A⁽¹⁾, A⁽²⁾, ..., A⁽ᴺ⁾]]. In "Tucker operator" notation the same model is X = [[I; A, B, C]], a Tucker decomposition with a superdiagonal identity core.
J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl., 18(2):95–138, 1977.

Slide 10: Properties of the Kruskal Operator
- PARAFAC core for a Tucker decomposition: [[ [[A, B, C]]; U, V, W ]] = [[UA, VB, WC]].
- Matricization (an arbitrary map of indices to rows and columns combines the factors via Khatri-Rao products); in particular, the mode-1 matricization is [[A, B, C]]₍₁₎ = A (C ⊙ B)ᵀ, and similarly for the other modes.
- Norm of a PARAFAC decomposition: ||[[A, B, C]]||² = 1ᵀ (AᵀA * BᵀB * CᵀC) 1, where * is the elementwise (Hadamard) product.
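A quick numerical check (numpy) of the matricization and norm identities above, on random factors with toy dimensions; conventions follow the Kolda-Bader mode-n unfolding:

```python
# Verifying [[A,B,C]]_(1) = A (C ⊙ B)^T and the Gram-matrix norm formula.
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

X = np.einsum('ir,jr,kr->ijk', A, B, C)           # X = [[A, B, C]]

# Mode-1 matricization: column index j + J*k (Fortran-order unfolding).
kr_CB = np.einsum('kr,jr->jkr', C, B).reshape(-1, R, order='F')   # C ⊙ B
X1 = X.reshape(I, -1, order='F')
print(np.allclose(X1, A @ kr_CB.T))                # True

# Norm: ||X||^2 = 1^T (A^T A * B^T B * C^T C) 1
norm_sq = ((A.T @ A) * (B.T @ B) * (C.T @ C)).sum()
print(np.isclose(np.linalg.norm(X)**2, norm_sq))   # True
```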

Slide 11: PARAFAC for Sparse Data & Approximations
Our interest in these mathematical operations is motivated on two fronts: (1) sparse computations, and (2) using tensor decompositions for approximation. For example, we are considering how to efficiently implement PARAFAC-ALS for sparse data. Can PARAFAC be used for the best rank-k approximation, rather than for finding an exact decomposition (excepting noise)? What does "best rank-k approximation" even mean in this case?

Slide 12: Multilink Analysis Using PARAFAC
Quick review: tensors for web link analysis, i.e., page x page x anchor text (TOPHITS). New work: tensors for publication data analysis. Case 1: doc x doc x similarity. Case 2: term x doc x author (HO-LSA?).

Slide 13: TOPHITS: PARAFAC for Web Link Analysis
[Figure: a set of four hyperlinked web pages; the graph representation shows basic connectivity, and labeled edges capture context.]

Slide 14: Analyzing Publication Data: Doc x Doc x Similarity Representation

Slide 15: Computing Different Doc-Doc Similarities
Term-based similarities are computed for slices k = 1, 2, 3 and author similarity for slice k = 4; thresholding enforces sparseness! Data: 5022 papers; unique terms (ignoring stop words, words with length less than 3 or greater than 30 characters, and words that appear less than 2 times) per field: Titles: 5164; Abstracts: (missing); Keywords: (missing); authors: (missing); 2659 citations.
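As a hedged sketch of how such similarity slices might be assembled (the feature matrices and threshold below are illustrative stand-ins, not the paper's actual pipeline):

```python
# Build a doc x doc x similarity tensor: each frontal slice is a
# thresholded cosine-similarity matrix from one feature set (e.g.,
# title terms, abstract terms, keywords, authors).
import numpy as np

def cosine_slice(F, tau=0.2):
    # F: docs x features. Thresholded doc-doc cosine similarities.
    G = F / np.maximum(np.linalg.norm(F, axis=1, keepdims=True), 1e-12)
    S = G @ G.T
    S[S < tau] = 0.0                     # thresholding enforces sparseness
    return S

rng = np.random.default_rng(0)
n_docs = 100
feature_sets = [rng.random((n_docs, m)) for m in (40, 60, 30, 20)]
X = np.stack([cosine_slice(F) for F in feature_sets], axis=2)
print(X.shape)                           # (100, 100, 4): doc x doc x similarity
```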

Slide 16: PARAFAC for Doc x Doc x Similarity
Decompose the tensor as [[H, A, C]], where H = "hubs", A = "authorities", and C = "connections"; we use a rank-30 decomposition. Central idea: each triplet of factor columns (h_r, a_r, c_r) provides a core "grouping" of the data, i.e., a specific topic.

Slide 17: Sample: Grouping 1

Slide 18: Sample: Grouping 10

Slide 19: Applications of the [[H, A, C]] Decomposition
- Latent document similarities: calculate S = ½ H Hᵀ + ½ A Aᵀ.
- Analyzing a body of work: let c_h = hub centroid and c_a = authority centroid; then score s = ½ H c_h + ½ A c_a.
- Disambiguation (example): calculate centroids using A (could also use H or A + H), then calculate similarities of the centroids.
- Journal prediction: use matrix A as features for input to a decision-tree ensemble classifier.
A sketch of the first two formulas appears below.
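A minimal numpy sketch of those two formulas; H and A are random stand-ins for the rank-30 PARAFAC factors, and the paper indices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, R = 200, 30
H = rng.random((n_docs, R))               # stand-in hub factor
A = rng.random((n_docs, R))               # stand-in authority factor

# Latent document similarities: S = 1/2 H H^T + 1/2 A A^T
S = 0.5 * (H @ H.T) + 0.5 * (A @ A.T)

# Scoring against a body of work (e.g., one author's papers):
papers = [3, 17, 42]                      # hypothetical row indices
c_h = H[papers].mean(axis=0)              # hub centroid
c_a = A[papers].mean(axis=0)              # authority centroid
s = 0.5 * (H @ c_h) + 0.5 * (A @ c_a)     # similarity of every doc to that body of work
print(S.shape, s.shape)                   # (200, 200) (200,)
```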

Slide 20: Example of Disambiguation Results
Two authors with missing middle initials; 3 possible matches. [Figure: matrix of similarities.]

Slide 21: Analyzing Publication Data: Term x Doc x Author Representation
Form the tensor X (term x doc x author) so that element (i, j, k) is nonzero only if author k wrote document j using term i. Data: 767 documents, 2251 terms, 1072 authors, (nonzero count missing). Terms must appear in at least 3 documents and in no more than 10% of all documents; moreover, each term must have at least 2 characters and no more than 30.
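A hedged sketch of assembling such a tensor in coordinate (COO) form; the triples and values below are illustrative stand-ins, and the dense tensor is deliberately never materialized at these sizes:

```python
import numpy as np

# Dimensions from the slide (metadata only; a dense 2251 x 767 x 1072
# array would be far too large to materialize).
shape = (2251, 767, 1072)                 # term x doc x author

# Hypothetical (term, doc, author) triples and values (e.g., frequencies):
coords = np.array([[10, 10, 42,  99],     # term indices i
                   [ 5,  5,  5, 200],     # doc indices j
                   [ 3,  7,  3, 500]])    # author indices k
values = np.array([1.0, 1.0, 2.0, 1.0])

# Example sparse access: the (term, doc) pairs used by author k = 3.
k = 3
mask = coords[2] == k
print("tensor shape:", shape, "nnz:", values.size)
print(list(zip(coords[0][mask], coords[1][mask], values[mask])))
```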

Slide 22: Different Graph Interpretations for Term x Doc x Author
The same tensor can be viewed as: term-doc with author links; term-author with doc links; author-doc with term links; or a tripartite term-doc-author graph with links. [Figure: term-doc graph in which different author links are represented by different colors.]

Slide 23: Author Data Is Too Sparse
Result: the term x doc x author tensor has just a few nonzero columns in each lateral slice. Experimentally, PARAFAC seems to overfit such data and does not do a good job of "mixing" different authors.

Slide 24: Idea: Use a Tucker Transformation to Compress
We transform the tensor to a smaller tensor by multiplying each mode by the transpose of an orthonormal basis matrix, Y = X ×₁ Uᵀ ×₂ Vᵀ ×₃ Wᵀ, or, equivalently, X ≈ [[Y; U, V, W]] (the slide notes ranks 75 and 50 for the compression). This transformation forces the authors to be mixed and produces a dense result. We then compute a rank-25 PARAFAC on the compressed tensor and transform back. Main problem: how do we transform a sparse tensor without creating dense intermediate results?

Slide 25: Tucker & PARAFAC
We want a PARAFAC decomposition of X in term x doc x author space. First, apply dimensionality reduction to X to obtain Y, which lives in a "conceptual" space. Next, compute the PARAFAC decomposition of Y. Finally, reassemble the results to yield a PARAFAC decomposition of X, as in the sketch below.
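A hedged numpy sketch of this compress-then-PARAFAC pipeline, assuming the unfold, khatri_rao, and cp_als helpers from the CP-ALS sketch after Slide 8 are in scope; the ranks here are toy values, not the talk's 75/50/25:

```python
import numpy as np

def leading_basis(X, mode, r):
    # Orthonormal basis for the dominant mode-n subspace of X (HOSVD-style).
    U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
    return U[:, :r]

X = np.random.default_rng(0).standard_normal((30, 20, 25))
U, V, W = (leading_basis(X, m, 8) for m in range(3))

# Compress: Y = X x_1 U^T x_2 V^T x_3 W^T lives in "conceptual" space.
Y = np.einsum('ijk,ia,jb,kc->abc', X, U, V, W)

Ay, By, Cy = cp_als(Y, R=4)            # PARAFAC on the small dense core
A, B, C = U @ Ay, V @ By, W @ Cy       # reassemble a PARAFAC for X
```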

Slide 26: Three-Way Fingerprints
Each of the terms, docs, and authors has a rank-k (k=25) fingerprint from the PARAFAC approximation, so all items can be directly compared in "concept space". Thus, we can compare any of the following: term-term, doc-doc, term-doc, author-author, author-term, author-doc. The fingerprints can also be used as inputs for clustering, classification, etc.
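A small sketch of cross-mode comparison in concept space: rows of the PARAFAC factor matrices serve as fingerprints, compared here by cosine similarity (the factors are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
R = 25
terms = rng.random((2251, R))            # stand-in term fingerprints
docs = rng.random((767, R))              # stand-in doc fingerprints
authors = rng.random((1072, R))          # stand-in author fingerprints

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(terms[10], terms[11]))         # term-term
print(cos(authors[5], docs[3]))          # author-doc: directly comparable
```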

Slide 27: Sample Results: Term

Slide 28: Sample Results: Term

Slide 29: Sample Results: Author

Slide 30: Summary & Future Work
PARAFAC provides a technique for analyzing semantic graphs: the third dimension captures different connection types, or may be considered as the interconnection of 3 different node types. We analyzed journal articles using different tensor representations:
- Doc x Doc x Connection: we still need to make a definitive case for why 3D is better than 2D.
- Term x Doc x Author: too sparse?
We are still working towards large-scale, sparse problems (~5M nonzeros); this needs implicit compression for PARAFAC. Other decompositions? Other hybrids; symmetry.

Slide 31: Acknowledgments & More Information
Thanks to:
- Brett Bader, Danny Dunlavy, Philip Kegelmeyer
- Web data: Joe Kenny, Travis Bauer et al., Ken Kolda
- Journal data: Kevin Boyack
- Graph visualization: Ann Yoshimura
Related papers:
- Algorithm xxx: MATLAB tensor classes for fast algorithm prototyping (with B. W. Bader), ACM TOMS, to appear.
- Multilinear algebra for analyzing data with multiple linkages (with D. Dunlavy and W. P. Kegelmeyer), Technical Report SAND, Apr. 2006.
- Temporal analysis of social networks using three-way DEDICOM (with B. W. Bader and R. Harshman), Technical Report SAND, Apr. 2006.
- Multilinear operators for higher-order decompositions, Technical Report SAND, Apr. 2006.
- The TOPHITS model for higher-order web link analysis (with B. Bader), in Proc. Workshop on Link Analysis, Counterterrorism and Security, SDM06, Apr. 2006.
- Higher-order web link analysis using multilinear algebra (with B. W. Bader), ICDM 2005, pp. 242–249, Nov. 2005.
Contact Info:
Thank You!