CMU SCS, U Kang (CMU), KDD 2012
GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries
  U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos
  School of Computer Science, Carnegie Mellon University
Outline
  Problem Definition
  Algorithm
  Discoveries
  Conclusions
Background: Tensor
  Tensors (= multi-dimensional arrays) are everywhere.
  Example: hyperlinks and anchor texts in Web graphs.
  [Figure: tensor with modes URL 1, URL 2, and Anchor Text; figure labels: Java, C++, C#]
Background: Tensor
  Tensors (= multi-dimensional arrays) are everywhere.
  Sensor stream: (time, location, type)
  Predicates in a knowledge base: (subject, verb, object)
    "Barack Obama is the president of U.S."
    "Eric Clapton plays guitar"
  Data: NELL (Never Ending Language Learner); modes labeled 26M and 48M in the figure; 144M nonzeros
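
The (subject, verb, object) view above can be made concrete with a small sketch. It stores a knowledge base as a sparse 3-mode tensor, keeping only nonzero entries. The function name `build_sparse_tensor`, the way the example sentences are split into triples, and the repeated third triple are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

def build_sparse_tensor(triples):
    """Map (subject, verb, object) strings to a sparse 3-mode tensor.

    Returns one string -> id dictionary per mode and a dict
    {(i, j, k): count} that holds only the nonzero entries.
    """
    index = [dict(), dict(), dict()]
    nonzeros = defaultdict(int)
    for triple in triples:
        # Assign the next free id to a term the first time it is seen.
        coords = tuple(index[mode].setdefault(term, len(index[mode]))
                       for mode, term in enumerate(triple))
        nonzeros[coords] += 1
    return index, dict(nonzeros)

# How the sentences split into triples is an assumption made for illustration.
triples = [
    ("Barack Obama", "is the president of", "U.S."),
    ("Eric Clapton", "plays", "guitar"),
    ("Eric Clapton", "plays", "guitar"),  # hypothetical repeat: raises the count
]
index, nonzeros = build_sparse_tensor(triples)
shape = tuple(len(mode_index) for mode_index in index)
```

Only the observed triples are stored, which is what makes a tensor with tens of millions of entries per mode and 144M nonzeros representable at all.
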
Problem Definition
  Q1: How to decompose a billion-scale tensor?
      (Tensor decomposition corresponds to SVD in the 2D case.)
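
To illustrate the SVD analogy, here is a minimal NumPy sketch. It assumes the decomposition in question is PARAFAC (CP), where a 3-mode tensor is a sum of R rank-one terms; the slide itself does not name the model. The toy sizes and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3          # toy sizes; R is the number of components

A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# 3-mode case: X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
X = np.einsum("ir,jr,kr->ijk", A, B, C)

# 2D analogue: M = U diag(s) V^T = sum_r s[r] * outer(U[:, r], Vt[r, :])
M = rng.standard_normal((I, J))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_rebuilt = np.einsum("ir,r,rj->ij", U, s, Vt)
```

In both cases the data is expressed as a sum of rank-one pieces; the tensor case simply has one factor matrix per mode.
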
Problem Definition
  Q2: What are the important concepts and synonyms in a KB tensor?
    Q2.1: What are the dominant concepts in the knowledge base tensor?
    Q2.2: What are the synonyms of a given noun phrase?
  Data: NELL (Never Ending Language Learner); modes labeled 26M and 48M in the figure; 144M nonzeros
Algorithm: Problem Definition
  Q1: How to decompose a billion-scale tensor?
      (Tensor decomposition corresponds to SVD in the 2D case.)
Challenge
  Alternating Least Squares (ALS) algorithm
  [Equation not recovered in extraction. Surviving legend: pseudo-inverse, Hadamard product, Khatri-Rao product]
  How to design a fast MapReduce algorithm for ALS?
  NELL tensor dimensions: I = 26M, J = 26M, K = 48M
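
The legend on this slide (pseudo-inverse, Hadamard, Khatri-Rao) matches the standard PARAFAC-ALS update for the first factor matrix, A = X_(1) (C ⊙ B) (CᵀC * BᵀB)†. The sketch below assumes that is the equation the slide showed. It is a small dense, in-memory illustration in NumPy, not the MapReduce algorithm; the function names are hypothetical.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: row k*J + j holds C[k, :] * B[j, :]."""
    K, R = C.shape
    J, _ = B.shape
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

def als_update_A(X, B, C):
    """One ALS step for A: X_(1) @ khatri_rao(C, B) @ pinv(C^T C * B^T B)."""
    I, J, K = X.shape
    # Mode-1 unfolding whose column index is k*J + j, matching khatri_rao.
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)
    # Hadamard (elementwise) product of the two R x R Gram matrices.
    gram = (C.T @ C) * (B.T @ B)
    return X1 @ khatri_rao(C, B) @ np.linalg.pinv(gram)

# Toy check: build an exact rank-R tensor and recover A from the true B and C.
rng = np.random.default_rng(0)
I, J, K, R = 6, 5, 4, 2
A_true = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
X = np.einsum("ir,jr,kr->ijk", A_true, B, C)

A_est = als_update_A(X, B, C)
```

A full ALS loop would cycle the same update over A, B, and C until the fit stops improving. At NELL scale the `khatri_rao` product has J x K rows, which is the bottleneck the following slides address.
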
Main Idea 1: Ordering of Computation
  [Figure: FLOPS on the NELL data for alternative computation orderings, with "Our choice" marked]
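
Why ordering matters can be seen with rough operation counts. The sketch below is an assumption about which orderings are being compared, since the figure did not survive. It contrasts (X1 · KR) · P with X1 · (KR · P), where X1 is the sparse unfolded tensor with m nonzeros, KR is the (J·K) x R Khatri-Rao product, and P is an R x R matrix. The rank R = 10 and the simple 2-flops-per-multiply-add model are also assumptions.

```python
def flops_multiply_first(m, I, R):
    """(X1 @ KR) @ P: sparse-times-dense, then an (I x R) @ (R x R) product."""
    return 2 * m * R + 2 * I * R * R

def flops_multiply_last(m, J, K, R):
    """X1 @ (KR @ P): a (J*K x R) @ (R x R) product first, then sparse-times-dense."""
    return 2 * J * K * R * R + 2 * m * R

I, J, K = 26_000_000, 26_000_000, 48_000_000   # mode sizes from the slides
m = 144_000_000                                 # nonzeros, from the slides
R = 10                                          # assumed rank, not from the slides

first = flops_multiply_first(m, I, R)
last = flops_multiply_last(m, J, K, R)
ratio = last / first
```

Under these assumptions the second ordering is dominated by the J·K·R² term, so touching the sparse tensor first keeps the cost proportional to the number of nonzeros.
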
Main Idea 2: Avoiding Intermediate Data Explosion
  Size of intermediate data (NELL), naïve: 100 PB
  Tensor dimensions: I = 26M, J = 26M, K = 48M
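
A back-of-the-envelope calculation shows where a figure of this magnitude can come from. The sketch assumes the naïve intermediate result is a dense (J·K) x R matrix, with R = 10 components and 8-byte values. Neither R nor the value width is stated on the slide, so treat the match with the 100 PB figure as suggestive only.

```python
J, K = 26_000_000, 48_000_000   # mode sizes from the slides
R = 10                          # assumed number of components
BYTES_PER_VALUE = 8             # assumed double-precision storage

rows = J * K                    # one row per (j, k) pair
naive_bytes = rows * R * BYTES_PER_VALUE
naive_petabytes = naive_bytes / 1e15
```
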
Main Idea 2: Avoiding Intermediate Data Explosion
  Size of intermediate data (NELL):
    Before (naïve): 100 PB
    After (proposed): 1.5 GB
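
One way to avoid the explosion is to never build the Khatri-Rao product and instead accumulate one small contribution per nonzero of the tensor. The NumPy sketch below shows that idea in memory and checks it against the dense computation on a toy tensor. It illustrates the principle only; it is not the MapReduce formulation from the talk, and `mttkrp_mode1` is a hypothetical name.

```python
import numpy as np

def mttkrp_mode1(coords, values, B, C, I):
    """Compute X_(1) times the Khatri-Rao product of C and B from nonzeros only."""
    i, j, k = coords.T
    # One length-R row per nonzero: x_ijk * B[j, :] * C[k, :].
    contrib = values[:, None] * B[j] * C[k]
    M = np.zeros((I, B.shape[1]))
    np.add.at(M, i, contrib)   # scatter-add rows that share the same i
    return M

rng = np.random.default_rng(0)
I, J, K, R, nnz = 8, 7, 6, 3, 20
flat = rng.choice(I * J * K, size=nnz, replace=False)      # distinct positions
coords = np.stack(np.unravel_index(flat, (I, J, K)), axis=1)
values = rng.standard_normal(nnz)
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

M_sparse = mttkrp_mode1(coords, values, B, C, I)

# Dense reference that does materialize the (J*K) x R Khatri-Rao product.
X = np.zeros((I, J, K))
X[tuple(coords.T)] = values
KR = (C[:, None, :] * B[None, :, :]).reshape(K * J, R)
M_dense = X.transpose(0, 2, 1).reshape(I, K * J) @ KR
```

The working set of the sparse version grows with the number of nonzeros rather than with J·K, which is the property that matters at NELL scale.
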
Experiments
  GigaTensor solves a 100x larger problem.
  Setup: tensor with modes I, J, K; number of nonzeros = I / 50.
  [Figure: GigaTensor vs. Tensor Toolbox; Tensor Toolbox runs out of memory, while GigaTensor handles a 100x larger tensor]
Discoveries: Problem Definition
  Q2: What are the important concepts and synonyms in a KB tensor?
    Q2.1: What are the dominant concepts in the knowledge base tensor?
    Q2.2: What are the synonyms of a given noun phrase?
  Data: NELL (Never Ending Language Learner); modes labeled 26M and 48M in the figure; 144M nonzeros
A2.1: Concept Discovery
  Concept discovery in the knowledge base
A2.1: Concept Discovery
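
The slide content for concept discovery did not survive extraction. One plausible reading, stated here as an assumption, is that each component of the decomposition is treated as a concept and summarized by its highest-scoring entries in a factor matrix. The sketch below shows that reading on made-up toy data; the labels, values, and function name are hypothetical.

```python
def top_k_per_component(factor, labels, k):
    """For each column r of the factor matrix, return the k top-scoring labels."""
    n_components = len(factor[0])
    concepts = []
    for r in range(n_components):
        ranked = sorted(range(len(factor)),
                        key=lambda row: factor[row][r],
                        reverse=True)
        concepts.append([labels[row] for row in ranked[:k]])
    return concepts

# Hypothetical factor matrix: one row per noun phrase, one column per component.
labels = ["guitar", "piano", "U.S.", "Canada"]
A = [
    [0.9, 0.0],
    [0.8, 0.1],
    [0.0, 0.7],
    [0.1, 0.6],
]
concepts = top_k_per_component(A, labels, k=2)
```
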
A2.2: Synonym Discovery
  Synonym discovery in the knowledge base
  [Figure: factor vectors a1, a2, …, aR, with labels "(Given) noun phrase", "(Discovered) synonym 1", "(Discovered) synonym 2"]
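
The figure suggests that each noun phrase is described by its values across the factor vectors a1, …, aR, and that synonyms are noun phrases with similar descriptions. The sketch below assumes cosine similarity between rows as the similarity measure, which the slide text does not state. The toy factor values and labels are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_rows(factor, labels, query, k):
    """Return the k labels whose rows are most similar to the query's row."""
    q = factor[labels.index(query)]
    scored = [(cosine(q, row), label)
              for row, label in zip(factor, labels) if label != query]
    scored.sort(reverse=True)
    return [label for _, label in scored[:k]]

# Hypothetical rows of A: one row per noun phrase, R = 2 components.
labels = ["guitar", "electric guitar", "president", "piano"]
A = [
    [1.0, 0.10],
    [0.9, 0.12],
    [0.0, 1.00],
    [0.7, 0.50],
]
synonyms = nearest_rows(A, labels, "guitar", k=2)
```
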
A2.2: Synonym Discovery
Conclusion
  GigaTensor: a scalable tensor decomposition algorithm for tensors with billion-length modes
  Algorithm: avoids intermediate data explosion
  Discoveries: concept discovery and contextual synonym detection on a KB tensor
Thank you!