Machine Learning in DryadLINQ
Kannan Achan, Mihai Budiu
MSR-SVC, 1/30/2008
Goal
The Software Stack
[Figure: layered stack. Windows Server; Cluster Services; Distributed Filesystem (Cosmos); Dryad; DryadLINQ; Large Vector; Machine learning; Data analysis]
Dryad
Dryad Jobs
[Figure: a job graph. Vertices (processes) labeled M, X, and R are grouped into stages and connected by channels, leading from input files to output files]
LINQ and C#
LINQ
    Collection<T> collection;
    bool IsLegal(Key k);
    string Hash(Key k);

    var results = from c in collection
                  where IsLegal(c.key)
                  select new { hash = Hash(c.key), c.value };
DryadLINQ = LINQ + Dryad
[Figure: the C# query over collection (IsLegal, Hash, results) is translated into a query plan (a Dryad job) and vertex code; the job runs over the data and returns results to C#]
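To make the query shape concrete, here is a minimal sketch that runs the same filter-and-project pattern locally with LINQ to Objects. It does not use DryadLINQ. The `Entry` type and the bodies of `IsLegal` and `Hash` are assumptions made for illustration; the slides give only their signatures.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type standing in for the slide's collection entries.
public sealed class Entry
{
    public string Key { get; }
    public int Value { get; }
    public Entry(string key, int value) { Key = key; Value = value; }
}

public static class LinqSketch
{
    // Assumed predicate: a key is "legal" when it is non-empty.
    public static bool IsLegal(string key) => !string.IsNullOrEmpty(key);

    // Assumed transform: upper-casing stands in for a real hash function.
    public static string Hash(string key) => key.ToUpperInvariant();

    // Same shape as the slide's query: filter with where, project with select.
    public static List<KeyValuePair<string, int>> Run(IEnumerable<Entry> collection)
    {
        var results = from c in collection
                      where IsLegal(c.Key)
                      select new { Hashed = Hash(c.Key), c.Value };
        return results
            .Select(r => new KeyValuePair<string, int>(r.Hashed, r.Value))
            .ToList();
    }

    public static void Main()
    {
        var collection = new[]
        {
            new Entry("a", 1), new Entry("", 2), new Entry("b", 3)
        };
        // The entry with the empty key is filtered out.
        foreach (var r in Run(collection))
            Console.WriteLine(r.Key + " " + r.Value);
    }
}
```

The anonymous type needs a name for the computed member (`Hashed` here), because C# can infer member names only from simple property or field accesses such as `c.Value`.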
Recall: The Software Stack
[Figure: layered stack. Windows Server; Cluster Services; Distributed Filesystem (Cosmos); Dryad; DryadLINQ; Large Vector; Machine learning; Data analysis]
Very Large Vector Library
PartitionedVector<T> [Figure: a vector of T elements split across partitions]
Scalar<T> [Figure: a single T element]
Operations on Large Vectors: Map 1
[Figure: f is applied to each T element of a vector, producing a vector of U elements]
Preserves partitioning.
Map 2 (Pairwise)
[Figure: f combines corresponding elements of a vector of T and a vector of U, producing a vector of V]
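Map 3 (Vector-Scalar)
[Figure: f combines each element of a vector with the same scalar, producing a vector of V; the element types are T, U, and V]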
Reduce (Fold)
[Figure: f repeatedly combines the U elements of a vector until a single U remains]
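The four operations above can be sketched on a single machine as follows. `LocalVector<T>` and its method names (`Map`, `MapPairwise`, `MapScalar`, `Reduce`) are hypothetical stand-ins chosen for this illustration, not the Large Vector library's API; partitions are plain in-memory arrays.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for a partitioned vector:
// a list of partitions, each an array of elements.
public sealed class LocalVector<T>
{
    private readonly List<T[]> partitions;

    public LocalVector(IEnumerable<T[]> parts) { partitions = parts.ToList(); }

    public int PartitionCount => partitions.Count;
    public IEnumerable<T> Elements => partitions.SelectMany(p => p);

    // Map 1: apply f to every element; partition boundaries are preserved.
    public LocalVector<U> Map<U>(Func<T, U> f) =>
        new LocalVector<U>(partitions.Select(p => p.Select(f).ToArray()));

    // Map 2 (pairwise): combine corresponding elements of two vectors.
    // Assumes both vectors are partitioned identically; Zip silently
    // truncates to the shorter side if they are not.
    public LocalVector<V> MapPairwise<U, V>(LocalVector<U> other, Func<T, U, V> f) =>
        new LocalVector<V>(partitions.Zip(other.partitions,
            (p, q) => p.Zip(q, f).ToArray()));

    // Map 3 (vector-scalar): combine every element with the same scalar.
    public LocalVector<V> MapScalar<U, V>(U scalar, Func<T, U, V> f) =>
        new LocalVector<V>(partitions.Select(
            p => p.Select(x => f(x, scalar)).ToArray()));

    // Reduce (fold): fold each non-empty partition, then fold the partial
    // results. f should be associative; throws if the vector is empty.
    public T Reduce(Func<T, T, T> f) =>
        partitions.Where(p => p.Length > 0)
                  .Select(p => p.Aggregate(f))
                  .Aggregate(f);
}

public static class VectorSketch
{
    public static void Main()
    {
        var v = new LocalVector<int>(new[] { new[] { 1, 2 }, new[] { 3 } });
        var squares = v.Map(x => x * x);
        var sums = v.MapPairwise(squares, (a, b) => a + b);
        var scaled = v.MapScalar(10, (a, s) => a * s);
        Console.WriteLine(string.Join(",", sums.Elements));
        Console.WriteLine(string.Join(",", scaled.Elements));
        Console.WriteLine(squares.Reduce((a, b) => a + b));
    }
}
```

Folding within each partition before combining the partial results mirrors how a distributed reduce can limit data movement, which is why the sketch asks for an associative `f`.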
Linear Algebra
[Figure: the element types T, U, V instantiated with linear-algebra objects]
Linear Regression
Data: vectors $x_i$, $y_i$
Find: a matrix $A$
S.t. $A x_i \approx y_i$
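Analytic Solution
$A = \left(\sum_i y_i x_i^{T}\right)\left(\sum_i x_i x_i^{T}\right)^{-1}$, with the sums taken over all data pairs $(x_i, y_i)$
[Figure: dataflow over inputs X[0], X[1], X[2] and Y[0], Y[1], Y[2]. A Map stage computes X×X^T and Y×X^T; a Reduce stage sums (Σ) each; the X×X^T sum is inverted ([ ]^-1) and multiplied (*) with the Y×X^T sum to give A]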
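The closed form can be recovered in a few steps, assuming the fit is an ordinary least-squares one (the slides do not state the objective, so this is an assumption) and that $\sum_i x_i x_i^{T}$ is invertible:

```latex
\begin{aligned}
E(A) &= \sum_i \lVert A x_i - y_i \rVert^2 \\
\frac{\partial E}{\partial A} &= 2 \sum_i \left( A x_i - y_i \right) x_i^{T} = 0 \\
A \sum_i x_i x_i^{T} &= \sum_i y_i x_i^{T} \\
A &= \left( \sum_i y_i x_i^{T} \right) \left( \sum_i x_i x_i^{T} \right)^{-1}
\end{aligned}
```

Each outer product depends on a single data pair, so it fits a Map; each sum is a Reduce.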
Linear Regression Code
    // Map: outer products x * x^T; Reduce: sum them.
    Matrices xx = x.PairwiseOuterProduct(x);
    OneMatrix xxs = xx.Sum();
    // Same for y * x^T.
    Matrices yx = y.PairwiseOuterProduct(x);
    OneMatrix yxs = yx.Sum();
    // Invert the x sum, then multiply: A = yxs * xxinv.
    OneMatrix xxinv = xxs.Map(a => a.Inverse());
    OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
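For readers without access to the library, the same computation can be sketched on plain arrays. This is a single-machine illustration, not the Large Vector implementation: `RegressionSketch` and its helpers (`Outer`, `Add`, `Multiply`, `Inverse`, `Fit`) are names invented here, and the Gauss-Jordan inverse is one simple choice among several.

```csharp
using System;
using System.Linq;

public static class RegressionSketch
{
    // Outer product a * b^T of two vectors.
    public static double[,] Outer(double[] a, double[] b)
    {
        var m = new double[a.Length, b.Length];
        for (int i = 0; i < a.Length; i++)
            for (int j = 0; j < b.Length; j++)
                m[i, j] = a[i] * b[j];
        return m;
    }

    // Element-wise sum of two equally sized matrices.
    public static double[,] Add(double[,] a, double[,] b)
    {
        int rows = a.GetLength(0), cols = a.GetLength(1);
        var m = new double[rows, cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                m[i, j] = a[i, j] + b[i, j];
        return m;
    }

    public static double[,] Multiply(double[,] a, double[,] b)
    {
        int rows = a.GetLength(0), inner = a.GetLength(1), cols = b.GetLength(1);
        var m = new double[rows, cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                for (int k = 0; k < inner; k++)
                    m[i, j] += a[i, k] * b[k, j];
        return m;
    }

    // Gauss-Jordan inverse with partial pivoting; throws if singular.
    public static double[,] Inverse(double[,] a)
    {
        int n = a.GetLength(0);
        var w = new double[n, 2 * n];
        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < n; j++) w[i, j] = a[i, j];
            w[i, n + i] = 1.0;
        }
        for (int col = 0; col < n; col++)
        {
            int pivot = col;
            for (int r = col + 1; r < n; r++)
                if (Math.Abs(w[r, col]) > Math.Abs(w[pivot, col])) pivot = r;
            if (Math.Abs(w[pivot, col]) < 1e-12)
                throw new InvalidOperationException("matrix is singular");
            for (int j = 0; j < 2 * n; j++)
            {
                double t = w[col, j]; w[col, j] = w[pivot, j]; w[pivot, j] = t;
            }
            double d = w[col, col];
            for (int j = 0; j < 2 * n; j++) w[col, j] /= d;
            for (int r = 0; r < n; r++)
            {
                if (r == col) continue;
                double f = w[r, col];
                for (int j = 0; j < 2 * n; j++) w[r, j] -= f * w[col, j];
            }
        }
        var inv = new double[n, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                inv[i, j] = w[i, n + j];
        return inv;
    }

    // A = (sum of y * x^T) * (sum of x * x^T)^-1, as in the slide's code.
    public static double[,] Fit(double[][] x, double[][] y)
    {
        var xxs = x.Select(v => Outer(v, v)).Aggregate((a, b) => Add(a, b));
        var yxs = y.Zip(x, (yi, xi) => Outer(yi, xi)).Aggregate((a, b) => Add(a, b));
        return Multiply(yxs, Inverse(xxs));
    }

    public static void Main()
    {
        // Data generated from y = 2*x[0] + 1*x[1], so A should be close to [2, 1].
        var x = new[] { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 1.0 } };
        var y = new[] { new[] { 2.0 }, new[] { 1.0 }, new[] { 3.0 } };
        var a = Fit(x, y);
        Console.WriteLine(a[0, 0] + " " + a[0, 1]);
    }
}
```

The `Select` and `Zip` calls play the role of the Map steps and `Aggregate` the role of the Reduce steps; only the final inverse and multiply operate on a single small matrix.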
Expectation Maximization lines 3 iterations shown
Understanding Botnet Traffic using EM
3 GB data; 15 clusters; 60 computers; 50 iterations; 9000 processes; 50 minutes
Conclusions
Dryad simplifies programming large clusters.
DryadLINQ = declarative programming for Dryad jobs.
The Large Vector library provides simple mathematical primitives on top of DryadLINQ.
Matlab-style coding for writing distributed numeric computations.
Backup Slides
Chaining
[Figure: the linear-regression dataflow over X[0], X[1], X[2] and Y[0], Y[1], Y[2] (X×X^T, Y×X^T, Σ, [ ]^-1, *, A), drawn with six additional Σ nodes]
EM Structure
[Figure: E stage; labels: input size; parameters π, σ, μ; all parameters]
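As a reminder of what the parameters π, σ, and μ typically denote, here are the textbook EM updates for a mixture of $K$ Gaussians on one-dimensional data $x_1, \dots, x_n$. That the deck uses a Gaussian mixture is an inference from the parameter names, and the one-dimensional form is chosen only for brevity; neither is stated on the slide.

```latex
\begin{aligned}
\text{E step:}\quad
\gamma_{ik} &= \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)} \\
\text{M step:}\quad
N_k &= \sum_{i=1}^{n} \gamma_{ik}, \qquad
\pi_k = \frac{N_k}{n}, \\
\mu_k &= \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \, x_i, \qquad
\sigma_k^2 = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \, (x_i - \mu_k)^2
\end{aligned}
```

In this form the responsibilities $\gamma_{ik}$ are computed independently per data point, which suits a Map, while the M-step sums are reductions over the data.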