Factorbird: a Parameter Server Approach to Distributed Matrix Factorization
Sebastian Schelter, Venu Satuluri, Reza Zadeh
Distributed Machine Learning and Matrix Computations workshop, in conjunction with NIPS 2014
Latent Factor Models
Given a sparse n × m matrix M, return rank-k factor matrices U and V whose product approximates M: each row and column entity gets a k-dimensional latent vector, and entry (i, j) is approximated by the dot product of the corresponding vectors.
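A minimal NumPy sketch (not the paper's Scala stack) of what a rank-k factorization produces; the matrix sizes and the use of truncated SVD to build the factors are illustrative assumptions:

```python
import numpy as np

# Factor a tiny 4 x 3 matrix M into U (4 x k) and V (k x 3) with k = 2.
rng = np.random.default_rng(0)
M = rng.random((4, 3))           # dense stand-in for a sparse n x m matrix

k = 2
W, s, Vt = np.linalg.svd(M, full_matrices=False)
U = W[:, :k] * s[:k]             # 4 x 2: one k-dim latent vector per row entity
V = Vt[:k, :]                    # 2 x 3: one k-dim latent vector per column entity

approx = U @ V                   # entry (i, j) is approximated by u_i . v_j
```

The rank-k product cannot be worse than predicting zero everywhere, which is what the assertion below checks.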
Applications: dimensionality reduction, recommendation, inference
Seem familiar? SVD! So why not just use SVD?
Problems with SVD (Feb 24, 2015 edition)
The resulting SVD basis vectors are *dense*, which is not terribly helpful when our parameters are sparse. We are not just assuming sparse parameters, we are RELYING on them.
Revamped loss function
g – global bias term
bUi – user-specific bias term for user i
bVj – item-specific bias term for item j
Prediction function: p(i, j) = g + bUi + bVj + uiᵀvj
a(i, j) – the ground truth, analogous to SVD's mij
New loss function: weighted squared error over the observed entries plus L2 regularization, i.e. minimize Σ over observed (i, j) of w(i, j) · (p(i, j) − a(i, j))² + λ · (regularization of the factors and biases)
g captures the average interaction strength between a user and an item in the dataset.
The user- and item-specific biases model how strongly the interactions of certain users and items tend to deviate from the global average.
The dot product from SVD is replaced with the prediction function p(i, j), which takes the bias terms into account when predicting the strength of the interaction between a user and an item.
a(i, j) is different enough from mij to allow for imputation of missing values (it transforms the observed entries).
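The prediction function and loss above can be sketched as follows. The exact regularization term is an assumption (the slide's formula is not reproduced here), as are `lam` and the item-major layout of `V`:

```python
import numpy as np

def predict(g, b_u, b_v, U, V, i, j):
    """Biased prediction p(i, j) = g + b^U_i + b^V_j + u_i . v_j."""
    return g + b_u[i] + b_v[j] + U[i] @ V[j]

def loss(observed, g, b_u, b_v, U, V, lam=0.1):
    """Weighted squared error over observed (i, j, a, w) tuples,
    plus an (assumed) L2 penalty on the factor matrices."""
    err = sum(w * (predict(g, b_u, b_v, U, V, i, j) - a) ** 2
              for (i, j, a, w) in observed)
    return err + lam * (np.sum(U * U) + np.sum(V * V))
```

With all factors and biases at zero and a global bias matching the data, the unregularized loss is zero, which the assertions below verify.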
Algorithm
Problems
The resulting U and V, for graphs with millions of vertices, still amount to hundreds of gigabytes of floating-point values.
SGD is inherently sequential; synchronizing it requires either locking or multiple passes.
Problem 1: size of parameters
Solution: Parameter Server architecture
Partition the factor matrices U and V over machines called parameter servers.
Learner machines partition the graph (the data itself) among themselves and perform SGD.
For each observation, a learner fetches the corresponding factor vectors from the parameter servers, updates them, and writes them back.
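The fetch / update / write-back step can be sketched in a single process, with plain dicts standing in for the parameter servers. `ETA`, `LAM`, and the toy data are illustrative assumptions, and the bias terms are omitted for brevity:

```python
import numpy as np

ETA, LAM, K = 0.05, 0.01, 2      # learning rate, regularization, rank (assumed)

def sgd_step(user_server, item_server, i, j, a):
    u = user_server[i]                               # fetch u_i from its server
    v = item_server[j]                               # fetch v_j
    err = u @ v - a                                  # prediction error
    user_server[i] = u - ETA * (err * v + LAM * u)   # update, write back
    item_server[j] = v - ETA * (err * u + LAM * v)

user_server = {0: np.full(K, 0.1)}
item_server = {7: np.full(K, 0.1)}
for _ in range(200):             # repeatedly train on one observation a(0, 7) = 1
    sgd_step(user_server, item_server, 0, 7, a=1.0)
```

After a few hundred steps the prediction for the single observed entry is close to its target.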
Problem 2: simultaneous writes
Solution: …so what? (tolerate the conflicting writes, Hogwild!-style)
Lock-free concurrent updates?
Assumptions
f is Lipschitz continuously differentiable
f is strongly convex
Ω (size of the hypergraph) is small
Δ (fraction of edges that intersect any given variable) is small
ρ (sparsity of the hypergraph) is small
The hypergraph is induced by the non-zero elements of each vector. If the hypergraph is small, there is a small chance of collision between concurrent updates, and "stale" versions of x will not have a large impact on concurrent updates elsewhere.
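A toy Hogwild!-style run: several threads perform SGD on one shared vector with no locks. Each step touches a single coordinate (a sparse gradient), so collisions are rare and stale reads do little harm. The quadratic objective f(x) = Σi (xi − ti)² and the step size are illustrative assumptions:

```python
import threading
import numpy as np

target = np.array([1.0, -2.0, 3.0, 0.5])
x = np.zeros_like(target)                 # shared parameters, updated lock-free

def worker(seed, steps=2000, eta=0.1):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(len(x))          # sparse update: one coordinate only
        x[i] -= eta * 2.0 * (x[i] - target[i])   # SGD step on (x_i - t_i)^2

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even if a race occasionally loses an update, every applied step contracts toward the target, so the threads converge without any synchronization.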
Factorbird Architecture
Still use intelligent partitioning to reduce network latency between learners and parameter servers, and to further reduce collisions.
Partition M by either rows or columns across learners.
U or V can be partitioned identically to M (U by rows, V by columns) on the same machines, making those updates local with respect to M.
The authors co-located V with M: the in-degree distribution (V) has much higher skew than the out-degree distribution (U), so localizing V gives a larger reduction in potential conflicts than localizing U.
Parameter server architecture
Open source!
Factorbird Machinery
memcached – distributed in-memory object caching system
finagle – Twitter's RPC system
HDFS – persistent file store for the data
Scalding – Scala front end for Hadoop MapReduce jobs
Mesos – resource manager for the learner machines
Factorbird stubs
Model assessment: the matrix factorization is evaluated with RMSE
RMSE: root-mean-square error over a held-out set T, i.e. sqrt( (1/|T|) · Σ(i,j)∈T (p(i, j) − a(i, j))² )
SGD performance is often a function of the hyperparameters:
λ: regularization constant
η: learning rate
k: number of latent factors
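RMSE over (prediction, ground-truth) pairs is a one-liner:

```python
import math

def rmse(pairs):
    """Root-mean-square error: sqrt of the mean squared prediction error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))
```

A perfect prediction gives 0; symmetric errors of magnitude 2 give exactly 2.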
[Hyper]Parameter grid search
aka "parameter scans": finding the optimal combination of hyperparameters.
Parallelize! For a factorization of rank k, train all c grid-search combinations at once by packing c factor vectors side by side:
U becomes m × (c · k)
V becomes (c · k) × n
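A sketch of the packing trick: c rank-k models live side by side in one widened pair of matrices, so a single pass over M can update all of them. The sizes and helper name below are illustrative:

```python
import numpy as np

m, n, k, c = 5, 4, 2, 3          # illustrative dimensions and model count
rng = np.random.default_rng(1)
U = rng.random((m, c * k))       # m x (c*k): c stacked user-factor blocks
V = rng.random((c * k, n))       # (c*k) x n: c stacked item-factor blocks

def factors_for_model(t):
    """Slice out the rank-k factors belonging to grid-search combination t."""
    cols = slice(t * k, (t + 1) * k)
    return U[:, cols], V[cols, :]
```

Each slice is an independent rank-k model that shares only the traversal of M with its siblings.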
Experiments “RealGraph”
Not a dataset but a framework for building a graph of user-user interactions on Twitter. RealGraph produces a directed, weighted graph in which the nodes are Twitter users and the edges are labeled with the interactions between a directed pair of users. Each directed edge also carries a weight: the probability of any interaction going from the edge's source to its destination in the future. The framework learns a logistic-regression model from historical data, then scores each edge's features with that model to produce the weight.
Kamath, Krishna, et al. "RealGraph: User Interaction Prediction at Twitter." User Engagement Optimization Workshop, KDD.
Experiments
Data: binarized adjacency matrix of a subset of the Twitter follower graph
a(i, j) = 1 if user i interacted with user j, 0 otherwise
All prediction errors weighted equally (w(i, j) = 1)
100 million interactions among 440,000 [popular] users
Experiments: 80% training, 10% validation, 10% testing
Training: for fitting the biased factorization models
Validation: for choosing the hyperparameters (grid search)
Testing: for the plotted curves
Avg = using only the global bias (average edge strength)
Avg + biases = using the global and vertex-specific bias terms
Avg + biases + factorization = additionally incorporating the latent factors
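The 80/10/10 split can be sketched as a shuffle-and-slice over the observed entries; the entry count and seed are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(42)
observations = list(range(1000))          # stand-in for observed (i, j) pairs
rng.shuffle(observations)                 # randomize before slicing

n = len(observations)
train = observations[: int(0.8 * n)]                    # 80%
validation = observations[int(0.8 * n): int(0.9 * n)]   # 10%
test = observations[int(0.9 * n):]                      # 10%
```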
Experiments: k = 2, homophily
Two users will be "close" in this two-dimensional space if their follower vectors m_i and m_j are roughly linear combinations of each other: homophily (aka "birds of a feather").
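One way to make "close" concrete is cosine similarity between hypothetical k = 2 factor vectors (the vectors below are invented for illustration):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: near 1 when two vectors point the same way."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([0.9, 0.1])     # users a and b: "birds of a feather"
b = np.array([0.8, 0.2])
c = np.array([-0.1, 1.0])    # user c points elsewhere in the latent space
```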
Experiments: scalability of Factorbird on a large RealGraph subset
229M × 195M matrix (44.6 quadrillion possible entries)
38.5 billion non-zero entries
A single SGD pass through the training set takes ~2.5 hours
~40 billion parameters
50 learner machine instances (16 cores and 16 GB memory each)
Two passes of hyperparameter search
Important to note As with most (if not all) distributed platforms:
Future work
Support streaming updates (e.g., new user follows)
Simultaneous factorization of multiple matrices: incorporate different types of interactions in a single factorization model
Fault tolerance: asynchronous checkpointing of the partitions holding the U and V matrices
Reduce network traffic: compression, factor-vector caching, and biased edge sampling so that retrieved vectors serve more than a single update
Replace memcached with a customized application to achieve higher throughput and conduct true Hogwild!-style updates on the parameter servers
Dynamic load balancing to mitigate the effect of stragglers on overall runtime
Strengths
Excellent extension of prior work (Hogwild!, RealGraph)
Built on current and [mostly] open technology: Hadoop, Scalding, Mesos, memcached
Clear problem, clear solution, clear validation
Weaknesses
Lack of detail, lack of detail, lack of detail:
How does the number of machines affect runtime?
What were the performance metrics on the large RealGraph subset?
What were the properties of the dataset (when was it collected, how were edges determined, what does "popular" mean, etc.)?
How did other factorization methods perform by comparison?
The lack of detail is likely a result of this being an internal Twitter project.
Questions?