Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang

Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang Internet Services Research Center (ISRC) Microsoft Research Redmond

Internet Services Research Center (ISRC)
Advancing the state of the art in online services
Dedicated to accelerating innovations in search and ad technologies
Representing a new model for moving technologies quickly from research projects to improved products and services
Thursday, 04/29/2010; Friday, 04/30/
:30~12:00pm: Data Analysis & Efficiency
– Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce
11:00~12:30pm: Query Analysis
– Exploring Web Scale Language Models for Search Query Processing
– Building Taxonomy of Web Search Intents for Name Entity Queries
– Optimal Rare Query Suggestion With Implicit User Feedback
1:30~3:00pm: Information Extraction
– Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries
1:30~3:00pm: Infrastructure
– Large-scale Bot Detection for Search Engines

Dyadic Data on the Web
The Web abounds with dyadic data
– Web search: term by document, query by clicked URL, web linkage, …
– Advertising: query by ad, bid term by ad, user by ad, …
– Social media: tag by image, user by community, friendship graph, …
Common characteristics
– Good source for discovering latent relationships
– High-dimensional, sparse, nonnegative, dynamic

Nonnegative Matrix Factorization (NMF)
An effective tool for uncovering latent relationships in nonnegative matrices, with many applications [Berry et al., 2007, Sra & Dhillon, 2006]
– Interpretable dimensionality reduction [Lee & Seung, 1999]
– Document clustering [Shahnaz et al., 2006, Xu et al., 2006]
Challenge: Can we scale NMF to million-by-million matrices?

NMF Algorithm [Lee & Seung, 2000]
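The slide's formulas did not survive in this transcript. As a reminder of what the cited algorithm does, here is a minimal sketch of the Lee & Seung multiplicative updates for A ≈ WH under squared error: H ← H ∗ (WᵀA) / (WᵀWH), then W ← W ∗ (AHᵀ) / (WHHᵀ), all element-wise. The function names, the small `eps` guard against division by zero, and the random initialization are assumptions made for illustration. They are not taken from the talk.

```python
import random

def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_error(A, W, H):
    """Squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))

def nmf_step(A, W, H, eps=1e-9):
    """One round of multiplicative updates: H first, then W."""
    Wt = transpose(W)
    num = matmul(Wt, A)                # W^T A, shape k x n
    den = matmul(matmul(Wt, W), H)     # (W^T W) H, shape k x n
    H = [[h * x / (y + eps) for h, x, y in zip(hr, xr, yr)]
         for hr, xr, yr in zip(H, num, den)]
    Ht = transpose(H)
    num = matmul(A, Ht)                # A H^T, shape m x k
    den = matmul(W, matmul(H, Ht))     # W (H H^T), shape m x k
    W = [[w * x / (y + eps) for w, x, y in zip(wr, xr, yr)]
         for wr, xr, yr in zip(W, num, den)]
    return W, H

def nmf(A, k, iterations=50, seed=0):
    """Factorize nonnegative A (m x n) into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    errors = [frobenius_error(A, W, H)]
    for _ in range(iterations):
        W, H = nmf_step(A, W, H)
        errors.append(frobenius_error(A, W, H))
    return W, H, errors
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, W and H stay nonnegative without any projection step.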

Parallel NMF [Robila & Maciak, 2006]
Parallelism on multi-core machines
– Partitions along the long dimension for parallelism
– Assumes all matrices can be held in shared memory

Distributed NMF
Data Partition: A, W and H across machines
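The slide does not spell out the partitioning scheme, so the following is only a sketch under stated assumptions: A is stored as sparse cells keyed by row index i, W as rows keyed by i, and H as columns keyed by column index j, with each record sent to a machine by key modulo the machine count. The helper `partition_by_key` and the toy records are hypothetical.

```python
from collections import defaultdict

def partition_by_key(records, num_machines):
    """Group (key, payload) records into shards, one shard per machine."""
    shards = defaultdict(list)
    for key, payload in records:
        shards[key % num_machines].append((key, payload))
    return dict(shards)

# Toy data (assumed layout): A as (i, (j, A_ij)) cells, W as (i, w_i), H as (j, h_j).
A_cells = [(0, (0, 1.0)), (0, (2, 3.0)), (1, (1, 2.0)), (2, (0, 4.0))]
W_rows = [(0, [0.5, 0.1]), (1, [0.2, 0.9]), (2, [0.7, 0.3])]
H_cols = [(0, [0.4, 0.6]), (1, [0.8, 0.2]), (2, [0.1, 0.5])]

A_shards = partition_by_key(A_cells, 2)
W_shards = partition_by_key(W_rows, 2)
H_shards = partition_by_key(H_cols, 2)
```

Keying A and W by the same row index places every cell A_ij on the machine that also holds w_i, so no single machine ever needs a whole matrix in memory.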

Computing DNMF: The Big Picture

[Figure: overall DNMF pipeline with phases Map-I, Reduce-I, Map-II, Reduce-II, Map-III, Reduce-III, Map-IV, Map-V, Reduce-V]

[Figure: detail of Map-I, Reduce-I, Map-II, Reduce-II]

[Figure: detail of Map-III, Reduce-III, Map-IV]

[Figure: detail of Map-V, Reduce-V]
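The diagrams name the phases but the transcript does not say what each computes. The sketch below shows one plausible reading of an H update, emulated in-process: jobs I and II compute X = WᵀA by joining each cell A_ij with row w_i and summing per column, job III computes C = WᵀW as a sum of outer products, the map-only job IV computes y_j = C h_j, and job V applies h_j ← h_j ∗ x_j / y_j. The mapping of phases to computations, the helper `run_job`, and all record layouts are assumptions. This is not the Hadoop or SCOPE implementation.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Tiny in-process stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out

def vec_add(vectors):
    return [sum(c) for c in zip(*vectors)]

def update_H(A_cells, W, H, eps=1e-9):
    """A_cells: [(i, j, a_ij)]; W: {i: w_i}; H: {j: h_j}; vectors have length k."""
    k = len(next(iter(W.values())))

    # Job I (assumed): join each cell A_ij with w_i on key i, emit (j, a_ij * w_i).
    tagged = [("A", c) for c in A_cells] + [("W", iw) for iw in W.items()]
    def map1(rec):
        tag, payload = rec
        if tag == "A":
            i, j, a = payload
            yield i, ("A", j, a)
        else:
            i, w = payload
            yield i, ("W", w)
    def reduce1(i, values):
        w = next(v[1] for v in values if v[0] == "W")
        for v in values:
            if v[0] == "A":
                yield v[1], [v[2] * x for x in w]
    partial = run_job(tagged, map1, reduce1)

    # Job II (assumed): sum the partial vectors per column j, giving x_j.
    X = dict(run_job(partial, lambda r: [r], lambda j, vs: [(j, vec_add(vs))]))

    # Job III (assumed): C = W^T W as the sum over i of outer products w_i^T w_i.
    def map3(rec):
        _, w = rec
        yield 0, [[a * b for b in w] for a in w]
    def reduce3(_, mats):
        yield [[sum(m[r][c] for m in mats) for c in range(k)] for r in range(k)]
    (C,) = run_job(list(W.items()), map3, reduce3)

    # Job IV (assumed, map only): y_j = C h_j; C is only k x k, so it fits on every node.
    Y = {j: [sum(C[r][c] * h[c] for c in range(k)) for r in range(k)]
         for j, h in H.items()}

    # Job V (assumed): h_j <- h_j * x_j / y_j, element-wise.
    zero = [0.0] * k
    return {j: [hv * xv / (yv + eps) for hv, xv, yv in zip(h, X.get(j, zero), Y[j])]
            for j, h in H.items()}
```

Under this reading, only job I touches the nonzero cells of A; the other jobs work on k-dimensional vectors or a k-by-k matrix.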

Experimental Evaluation
Synthesized data on a sandbox cluster
– No interference from other jobs
– Performance with various parameters
Real-world data on a commercial cluster
– Real-world scalability

Synthesized Data on Sandbox Cluster
A Hadoop cluster with 8 workers in total
– Worker: Pentium-IV CPU, 1 or 2 cores, 1~2 GB memory, 150 GB hard drive
– V: number of workers in the cluster
Matrix simulator
– Generates an m-by-n matrix with sparsity δ
– k: factorization dimensionality
– Defaults:
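The transcript does not describe how the matrix simulator works. A minimal sketch, assuming δ is the fraction of cells that are nonzero (consistent with the later "m×n×δ" cost term) and that values are drawn uniformly at random, could look like this. The function name, the value distribution, and the dictionary layout are all illustrative assumptions.

```python
import random

def simulate_sparse_matrix(m, n, delta, seed=0):
    """Return {(i, j): value} with about m * n * delta strictly positive cells."""
    rng = random.Random(seed)
    cells = {}
    for i in range(m):
        for j in range(n):
            if rng.random() < delta:                 # keep each cell with probability delta
                cells[(i, j)] = rng.random() + 1e-6  # shift to keep the value above zero
    return cells

cells = simulate_sparse_matrix(m=200, n=200, delta=0.05)
```

This loop visits every cell, which is fine for a toy example. At web scale one would sample the nonzero positions directly instead of scanning all m×n cells.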

Computation Breakdown
– [expression missing] dominates the computation
– [expression missing] is lightweight
– The sparser, the faster

Performance w.r.t. Parameters
– Running time is linear in m×n×δ
– Running time is linear in the factorization dimension k
– Sub-ideal speedup w.r.t. cluster size V

Scalability on Real-world Data
User-by-Website matrix
– Browsed URLs of opt-in users, represented by UID
– URLs trimmed to site level
Experiments on Microsoft SCOPE
– SCOPE: Structured Computations Optimized for Parallel Execution [Chaiken et al., VLDB’08]

Executions w.r.t. Iterations
Observations
– Longer total elapsed time
– Shorter time per iteration
Reason
– Overlapped computation across iterations
[Figure: normalized elapsed time vs. iterations]

Scalability w.r.t. Matrix Size
– 3 hours per iteration, 20 iterations take around 20*3*0.72 ≈ 43 hours
– Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values
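The arithmetic on this slide can be checked directly. The three inputs below come from the slide. Reading 0.72 as the normalized per-iteration time when iterations overlap is an assumption based on the previous slide. The density calculation is added only to show how sparse the quoted matrix is.

```python
hours_per_iteration = 3   # from the slide
iterations = 20           # from the slide
overlap_factor = 0.72     # from the slide; its interpretation is an assumption

total_hours = iterations * hours_per_iteration * overlap_factor

# Density of the quoted 43.9M-by-769M matrix with 4.38 billion nonzero values.
rows, cols, nonzeros = 43.9e6, 769e6, 4.38e9
density = nonzeros / (rows * cols)

print(f"total: {total_hours:.1f} hours, density: {density:.2e}")
```

The product is 43.2 hours, which the slide rounds to 43.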

Conclusion
– NMF is an effective tool to uncover latent structures in dyadic data, which is abundant on the Web
– NMF is amenable to MapReduce
– Distributed NMF solves the scalability challenge
– Applications down the road

Q&A Thank You!