Distributed Nonnegative Matrix Factorization for Web- Scale Dyadic Data Analysis on MapReduce Challenge : the scalability of available tools Definition : In general, dyadic data are the measurements on dyads, which are pairs of two elements coming from two sets. Feature : High-dimensional, Sparse, Nonnegative, Dynamic.
What we do In this paper, by observing a variety of Web dyadic data that conforms to different probabilistic distributions, we put forward a probabilistic NMF framework, which not only encompasses the two classic GNMF and PNMF, but also presents the Exponential NMF (ENMF) for modeling Web lifetime dyadic data. Furthermore, we scale up the NMF to arbitrarily large matrices on MapReduce clusters.
GNMF maximizing the likelihood of observing A w.r.t. W and H under the i.i.d. assumption is equivalent to minimizing
PNMF then maximizing the likelihood of observing A becomes to minimizing
ENMF maximizing the likelihood of observing A w.r.t. W and H becomes to minimizing
( a ) just like the cuda programming model, and (b) is map-reduce programming. (a) puts the full row or column in the shared memory, and goes on computing with multi-core GPU. So if we have a very long column that can’t put in the shared memory, (a) can’t work efficiently, because the access overhead becomes large.
The updating formula for H (Eqn. 1) is composed of three components:
Then computing Y=CH
Experiment The experiment is based on GNMF. Sandbox Experiment This is designed for a clear understanding the algorithm but not for showcasing the scalability. And all the reported time is the time taken just by one iteration. Experiment on real Web data The experiment is for investigating the effectiveness of this approach, with a particular focus on its scalability, because a method that does not scale well will be of limited utility in practice.
Sandbox Experiment We generate the sample matrices which has m rows and n columns. And the m = pow(2,17), n=pow(2, 16), sparsity=pow(2, -10) and sparsity = pow(2, -7).
Performance The time taken for one iteration needed more and more as the number of nonzero cells in A increase.
Performance The time taken increases as the value of k is added. And the slope for sparsity= pow(2,-10) is much smaller than that for sparsity= pow(2,-7). Both is smaller than 1.
Performance The ideal speedup is along the diagonal, which upper-bounds the practical speedup because of the Amdahl’s Law. On the matrix with = pow(2, -7), the speedup is nearly 5 when 8 workers are enabled. The dominant factor becomes extra overhead such as shuffle, as the sparsity decreases.
Experiment on real Web data We record a UID-by-website count matrix involving 17.8 million distinct UIDs and 7.24 million distinct websites, which is training set. For effective test, we take the top-1000 UIDs that have the largest number of associated websites, and this gives us distinct (UID, website) pairs. To measure the effectiveness, we randomly hold out a visited website for each UID, and mix it with another 99 un-visited sites. The 100 websites then constitute the test case for that UID. The goal is to check the rank of the holdout visited website among the 100 websites for each UID, and the overall effectiveness s measured by the average rank across the 1000 test cases, the smaller the better.
Scalability On the one hand, it shows that the elapse time increases linearly with the number of iterations, but on the other hand, we note that the average time per iteration becomes smaller when more iterations are executed in one job.
Scalability It reveals the linearity between the elapse time and the dimensionality k.
Scalability It lists how the algorithm scales with increasingly larger data sampled from increasingly longer time periods.
Conclusion