1
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang Internet Services Research Center (ISRC) Microsoft Research Redmond
2
Internet Services Research Center (ISRC)
– Advancing the state of the art in online services
– Dedicated to accelerating innovations in search and ad technologies
– Representing a new model for moving technologies quickly from research projects to improved products and services
Thursday, 04/29/2010 / Friday, 04/30/2010
10:30~12:00pm: Data Analysis & Efficiency
– Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce
11:00~12:30pm: Query Analysis
– Exploring Web Scale Language Models for Search Query Processing
– Building Taxonomy of Web Search Intents for Name Entity Queries
– Optimal Rare Query Suggestion With Implicit User Feedback
1:30~3:00pm: Information Extraction
– Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries
1:30~3:00pm: Infrastructure 2
– Large-scale Bot Detection for Search Engines
3
Dyadic Data on the Web Web abounds with dyadic data – Web search: term by document, query by clickedURL, web linkage, … – Advertising: query by ad, bid term by ad, user by ad, … – Social media: tag by image, user by community, friendship graph, … Common characteristics – Good source for discovering latent relationships – High dimensionality, sparse, nonnegative, dynamic
4
Nonnegative Matrix Factorization (NMF) Effective tool to uncover latent relationships in nonnegative matrices with many applications [Berry et al., 2007, Sra & Dhillon, 2006] – Interpretable dimensionality reduction [Lee & Seung, 1999] – Document clustering [Shahnaz et al., 2006, Xu et al., 2006] Challenge: Can we scale NMF to million-by-million matrices?
5
NMF Algorithm [Lee & Seung, 2000]
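The formulas on this slide did not survive in the transcript. As a reminder, the multiplicative update rules usually attributed to Lee & Seung for the squared-error objective are H ← H .* (WᵀA) ./ (WᵀWH) and W ← W .* (AHᵀ) ./ (WHHᵀ). A minimal single-machine sketch in pure Python follows; the function names, the random initialization, and the small `eps` guard against division by zero are illustrative assumptions, not taken from the slides.

```python
import random

def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def nmf(A, k, iterations=200, eps=1e-9, seed=0):
    """Factor a nonnegative m-by-n matrix A into W (m-by-k) and H (k-by-n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Strictly positive random start (assumption: any positive init works).
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H .* (W^T A) ./ (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(k)]
        # W <- W .* (A H^T) ./ (W H H^T)
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(m)]
    return W, H

def frobenius_error(A, W, H):
    """Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5
```

Because every factor in the update is nonnegative, W and H stay nonnegative without any explicit projection step.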
6
Parallel NMF [Robila & Maciak, 2006] Parallelism on multi-core machines – Partition along the long dimension for parallelism – Assuming all matrices can be held in shared memory
7
Distributed NMF Data Partition: A, W and H across machines
8
Computing DNMF: The Big Picture
9
[Diagram: Map-I, Reduce-I, Map-II, Reduce-II, Map-III, Reduce-III, Map-IV, Map-V, Reduce-V]
10
[Diagram: Map-I, Reduce-I, Map-II, Reduce-II]
11
[Diagram: Map-III, Reduce-III, Map-IV]
12
[Diagram: Map-V, Reduce-V]
13
[Diagram: Map-I, Reduce-I, Map-II, Reduce-II, Map-III, Reduce-III, Map-IV, Map-V, Reduce-V]
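The slides name five stages (Map-I through Reduce-V), but the transcript does not preserve what each one computes. As a rough sketch only, suppose the stages implement the multiplicative update of H by computing X = WᵀA, C = WᵀW, Y = CH, and finally H ← H .* X ./ Y. A single-process Python simulation of that decomposition might look like the following. The stage-to-step mapping in the comments, the name `update_H`, the tuple and dictionary layouts, and the `eps` guard are all assumptions made for illustration.

```python
def update_H(A_tuples, W_rows, H_cols, eps=1e-9):
    """One multiplicative update of H, written as map/reduce-style passes.

    A_tuples: iterable of (i, j, a_ij) for the nonzero entries of A
    W_rows:   dict i -> row i of W (list of k floats)
    H_cols:   dict j -> column j of H (list of k floats)
    """
    k = len(next(iter(W_rows.values())))

    # Assumed stages I-II: join each nonzero a_ij with row w_i, then sum
    # a_ij * w_i per column j to obtain x_j, column j of X = W^T A.
    X = {}
    for i, j, a in A_tuples:
        w = W_rows[i]
        acc = X.get(j, [0.0] * k)
        X[j] = [x + a * wv for x, wv in zip(acc, w)]

    # Assumed stage III: C = W^T W as a sum of per-row outer products.
    C = [[0.0] * k for _ in range(k)]
    for w in W_rows.values():
        for p in range(k):
            for q in range(k):
                C[p][q] += w[p] * w[q]

    # Assumed stages IV-V: y_j = C h_j, then h_j <- h_j .* x_j ./ y_j.
    new_H = {}
    for j, h in H_cols.items():
        y = [sum(C[p][q] * h[q] for q in range(k)) for p in range(k)]
        x = X.get(j, [0.0] * k)
        new_H[j] = [h[p] * x[p] / (y[p] + eps) for p in range(k)]
    return new_H
```

In this reading, only the first pass touches the large sparse matrix A; C is a small k-by-k matrix, which is why it could be shared with every worker rather than partitioned.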
14
Experimental Evaluation Synthesized data on a sandbox cluster – No interference from other jobs – Performance with various parameters Real-world data on a commercial cluster – Real-world scalability
15
Synthesized Data on Sandbox Cluster A Hadoop cluster with 8 workers in total – Worker: Pentium-IV CPU, 1 or 2 cores, 1~2 GB memory, 150G hard drive – V: Number of workers in cluster Matrix simulator – Generate m-by-n matrix with sparsity δ – k: factorization dimensionality – Defaults:
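The slide does not describe how the matrix simulator works. A minimal sketch, assuming δ is the fraction of nonzero cells (which is what running time linear in m×n×δ suggests) and that each cell is filled independently, could be written as below. The function name, the uniform value distribution, and the (i, j, value) output format are hypothetical.

```python
import random

def simulate_sparse_matrix(m, n, delta, seed=0):
    """Return (i, j, value) tuples for a random nonnegative m-by-n matrix.

    Each cell is nonzero independently with probability delta, so the
    expected number of tuples is m * n * delta.
    """
    rng = random.Random(seed)
    return [(i, j, 1.0 - rng.random())  # value lies in (0, 1], never zero
            for i in range(m) for j in range(n)
            if rng.random() < delta]
```

Scanning all m×n cells is only practical for small test matrices; at the sizes discussed in this talk a generator would need to sample nonzero positions directly.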
16
Computation Breakdown – […] dominates the computation – […] is lightweight – The sparser, the faster
17
Performance w.r.t. Parameters – Linear in m×n×δ – Linear in factorization dimension k – Sub-ideal speedup w.r.t. cluster size V
18
Scalability on Real-world Data User-by-Website matrix – Browsed URLs of opt-in users, represented by UID – URLs trimmed to site level: http://www.cnn.com/breakingnews --> www.cnn.com Experiments on Microsoft SCOPE – SCOPE: Structured Computations Optimized for Parallel Execution [Chaiken et al., VLDB’08]
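Trimming a URL to site level, as in the cnn.com example, can be sketched with Python's standard `urllib.parse`. The helper name `trim_to_site` is hypothetical, and the slides do not say how the authors handled ports, letter case, or subdomains; this version simply keeps the lowercased host name.

```python
from urllib.parse import urlsplit

def trim_to_site(url):
    """Reduce a URL to its host name (lowercased, without port or path)."""
    return urlsplit(url).hostname

print(trim_to_site("http://www.cnn.com/breakingnews"))  # www.cnn.com
```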
19
Executions w.r.t. Iterations Observations – Longer total elapsed time – Shorter time per iteration Reason – Overlapped computation across iterations [Plot: normalized elapsed time vs. iterations]
20
Scalability w.r.t. Matrix Size – 3 hours per iteration; 20 iterations take around 20 × 3 × 0.72 ≈ 43 hours – Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values
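The arithmetic on this slide can be checked in a few lines. The sketch below assumes the 0.72 factor is the normalized per-iteration elapsed time from overlapping iterations (the slide does not say so explicitly), and it also derives the density of the 43.9M-by-769M matrix from the stated nonzero count. Variable names are illustrative.

```python
iterations = 20
hours_per_iteration = 3
overlap_factor = 0.72  # assumed: normalized elapsed time per iteration

# Estimated wall-clock time for the full run, about 43 hours.
total_hours = iterations * hours_per_iteration * overlap_factor

# Fraction of nonzero cells in the large matrix, roughly 1.3e-7.
rows, cols, nonzeros = 43.9e6, 769e6, 4.38e9
density = nonzeros / (rows * cols)

print(f"{total_hours:.1f} hours, density {density:.2e}")
```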
21
Conclusion – NMF is an effective tool to uncover latent structures in dyadic data, which is abundant on the Web – NMF is amenable to MapReduce – Distributed NMF solves the scalability challenge – Applications down the road
22
Q&A Thank You!