RIOT: I/O-Efficient Numerical Computing in R
Yi Zhang, Herodotos Herodotou, Jun Yang
What is R?
R: an open-source language/environment
– Statistical computing, graphics
– Comprehensive R Archive Network (CRAN): 1639 packages as of Dec 08
– Interpretive execution
– High-level constructs: arrays, matrices (common to languages for numerical/statistical computing)
Code example:
    a <- 1:100
    …
    d <- a+b^2+c
Big-Data Challenge
R assumes all data in main memory
– If not, the virtual memory (VM) system starts swapping data from/to disk
– Excessive I/O, poor performance
Example:
    # n points with coordinates stored in x[1:n], y[1:n]
    (1) d <- sqrt((x-xs)^2+(y-ys)^2)+sqrt((x-xe)^2+(y-ye)^2)
    (2) s <- sample(n, 100)  # draw 100 samples from 1:n
    (3) z <- d[s]            # extract elements of d whose indices are in s
[Figure: points (x, y) with start S(xs,ys) and end E(xe,ye); evaluating (1) creates intermediates such as x-xs, (x-xs)^2, y-ye, (x-xe)^2 and the first sqrt, which spill from memory into the swap/paging file along with x and y]
Opportunities
Avoiding intermediate results
– Multiple large intermediate results are generated
– Can we avoid them without hand-coding loops?
    for (i in 1:n) { d[i] <- sqrt((x[i]-xs)^2+…)+… }
Deferred and selective evaluation
– Each expression is evaluated in full immediately
– Can we defer evaluation until really necessary?
  Just compute the 100 elements from d picked by s
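The two opportunities above can be illustrated with a minimal Python sketch (Python is used here only as a stand-in for R; the function names `path_length` and `selective_d` are hypothetical and not part of RIOT). It computes the S-to-point-to-E distance only at the sampled indices, so no full-length intermediate vector is ever materialized.

```python
import math
import random

def path_length(x, y, i, xs, ys, xe, ye):
    """Distance from S to point i plus distance from point i to E."""
    return (math.hypot(x[i] - xs, y[i] - ys)
            + math.hypot(x[i] - xe, y[i] - ye))

def selective_d(x, y, s, xs, ys, xe, ye):
    """Evaluate d only at the indices in s; no full-length temporaries."""
    return [path_length(x, y, i, xs, ys, xe, ye) for i in s]

n = 1000
rng = random.Random(0)                      # fixed seed for repeatability
x = [rng.random() for _ in range(n)]
y = [rng.random() for _ in range(n)]
xs, ys, xe, ye = 0.0, 0.0, 1.0, 1.0
s = rng.sample(range(n), 100)               # 0-based analogue of sample(n, 100)
z = selective_d(x, y, s, xs, ys, xe, ye)    # analogue of z <- d[s]
```

Here the work is proportional to the 100 sampled points rather than to n, which is the effect the slide asks for without hand-coding the loop in user code.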
Existing Solutions
Rewrite and hand-optimize code
– Tedious, not quite reusable
Use I/O-efficient libraries
– SOLAR [Toledo’96], DRA [Nieplocha’96], etc.
– But efficient individual operations are not enough
Build/extend a DB
– RasDaMan [Baumann’99], AML [Marathe’02], ASAP [Stonebraker’07], …
– Must rewrite using a new language (often SQL)
– Explicit boundary between DB and host language
R with I/O Transparency
– Attain I/O efficiency without explicit user intervention
– Run legacy code with no or minimal modification
– No need to learn new languages/libraries
– No boundary between host language and backend processing
RIOT
Implemented as an R package
– New types, same interfaces: dbvector, dbmatrix, …
– Uses R’s generics mechanism for transparency
1. New class definition:
    setClass("dbvector", representation(size="numeric",…))
2. Method overloading:
    setMethod("+", signature(e1="dbvector",e2="dbvector"),
              function(e1,e2) { .Call("add_dbvectors",e1,e2) })
3. Implementation:
    SEXP add_dbvectors(SEXP e1, SEXP e2){ … }
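The idea of "new types, same interfaces" can be sketched in Python with operator overloading, as a rough analogue of R's generics. This is an illustration only: the classes `Expr`, `Leaf`, `Const`, `BinOp` and the helper `wrap` are hypothetical names, not RIOT's API, and the sketch records an expression tree instead of calling into a C backend.

```python
class Expr:
    """Node of a deferred element-wise expression (hypothetical, not RIOT's API)."""
    def __add__(self, other):
        return BinOp(lambda a, b: a + b, self, wrap(other))
    def __sub__(self, other):
        return BinOp(lambda a, b: a - b, self, wrap(other))
    def __pow__(self, other):
        return BinOp(lambda a, b: a ** b, self, wrap(other))
    def __getitem__(self, indices):
        # Evaluation happens only here, and only for the requested indices.
        return [self.at(i) for i in indices]

class Leaf(Expr):
    """A stored vector."""
    def __init__(self, data):
        self.data = data
    def at(self, i):
        return self.data[i]

class Const(Expr):
    """A scalar broadcast to every position."""
    def __init__(self, value):
        self.value = value
    def at(self, i):
        return self.value

class BinOp(Expr):
    """A binary operation whose operands are themselves expressions."""
    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right
    def at(self, i):
        return self.fn(self.left.at(i), self.right.at(i))

def wrap(value):
    """Turn plain numbers into Const nodes; leave expressions untouched."""
    return value if isinstance(value, Expr) else Const(value)

a, b, c = Leaf([1, 2, 3]), Leaf([10, 20, 30]), Leaf([100, 200, 300])
d = a + b ** 2 + c          # same syntax as ordinary arithmetic; nothing computed yet
picked = d[[0, 2]]          # computes only two elements: [201, 1203]
```

User code keeps its ordinary arithmetic syntax, while the overloaded operators decide what actually happens behind it.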
RIOT-DB: Hidden DB Backend
A strawman solution: map large arrays to DB tables
– e.g. vector: V(i,v); matrix: M(i,j,v)
– Computation → query: a+b becomes
    SELECT A.I,A.V+B.V FROM A,B WHERE A.I=B.I
– Leverages power of DB only at intra-operation level!
Key: translate operations to view definitions
– Build up larger and larger views a step at a time
– Evaluate only when needed → deferred evaluation
– Query optimization → selective evaluation + more
– Iterator-style execution → no intermediate results
Example: d<-sqrt((x-xs)^2+(y-ys)^2)+…
    CREATE VIEW T1(I,V) AS SELECT X.I,X.V-xs FROM X;
    CREATE VIEW T2(I,V) AS SELECT T1.I, POW(T1.V,2) FROM T1;
    …
    CREATE VIEW D(I,V) AS SELECT T6.I, T6.V+T12.V FROM T6,T12 WHERE T6.I=T12.I;
    …
Example: z <- d[s]
    CREATE VIEW Z(I,V) AS SELECT S.I, D.V FROM D,S WHERE D.I=S.V;
After view expansion, Z is equivalent to the single query:
    SELECT S.I, SQRT(POW(X.V-xs,2)+POW(Y.V-ys,2)) + SQRT(POW(X.V-xe,2)+POW(Y.V-ye,2))
    FROM X,Y,S WHERE X.I=Y.I AND X.I=S.V
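The view-based approach can be tried out with a small Python sketch on an in-memory SQLite database. This is an assumption-laden stand-in, not the RIOT-DB implementation: it uses SQLite rather than the slides' backend, collapses the chain T1…T12 into a single view D, registers POW and SQRT by hand because not every SQLite build ships math functions, and uses TEMP views so that the registered functions are accepted inside view definitions. The sample data are made up for illustration.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register POW/SQRT explicitly (assumption: the SQLite build may lack them).
conn.create_function("POW", 2, lambda a, b: a ** b)
conn.create_function("SQRT", 1, math.sqrt)

xs, ys, xe, ye = 0.0, 0.0, 3.0, 4.0
points = [(1, 0.0, 0.0), (2, 3.0, 4.0), (3, 3.0, 0.0), (4, 0.0, 4.0)]  # (i, x, y)

conn.executescript("""
    CREATE TABLE X(I INTEGER PRIMARY KEY, V REAL);
    CREATE TABLE Y(I INTEGER PRIMARY KEY, V REAL);
    CREATE TABLE S(I INTEGER PRIMARY KEY, V INTEGER);
""")
conn.executemany("INSERT INTO X VALUES (?, ?)", [(i, x) for i, x, _ in points])
conn.executemany("INSERT INTO Y VALUES (?, ?)", [(i, y) for i, _, y in points])
conn.executemany("INSERT INTO S VALUES (?, ?)", [(1, 3), (2, 1)])  # s = (3, 1)

# Views cannot take bound parameters, so the constants are inlined as literals.
conn.executescript(f"""
    CREATE TEMP VIEW D(I, V) AS
        SELECT X.I,
               SQRT(POW(X.V - {xs}, 2) + POW(Y.V - {ys}, 2)) +
               SQRT(POW(X.V - {xe}, 2) + POW(Y.V - {ye}, 2))
        FROM X, Y WHERE X.I = Y.I;
    CREATE TEMP VIEW Z(I, V) AS
        SELECT S.I, D.V FROM D, S WHERE D.I = S.V;
""")

# Nothing has been computed so far; the query below triggers evaluation.
z = [v for _, v in conn.execute("SELECT I, V FROM Z ORDER BY I")]
print(z)  # [7.0, 5.0]
```

Defining D and Z costs nothing; only the final SELECT does work, and the database is free to restrict that work to the rows selected by S.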
RIOT-DB Demo
RIOT-DB built using MySQL with the MyISAM engine
Performance of RIOT-DB
Plain R
RIOT-DB variants
– RIOT-DB/Strawman: use DB to store arrays and execute individual ops; no use of views to defer evaluation
– RIOT-DB/MatNamed: use views, but compute/materialize every named object
– RIOT-DB: full version; defer/optimize across statements
Lessons Learned
DB-style inter-operation optimization is really the key!
Can we do better?
– DB arrays carry too much overhead (ASAP [Stonebraker’07])
  Extra columns in V(i, v), M(i, j, v), …; more for higher dims
– SQL & relational algebra may not be the right abstraction
  Advanced data layouts and complex ops are awkward
RIOT: The Next Generation
– A new expression algebra closer to numerical computation
– Flexible array storage/layout options
– Optimizations better tailored for numerical computation
– … and more
RIOT Expression Algebra
Analogous to the view mechanism, but more flexible
Operators
– +, -, *, /, [, …
– A[idxRange]<-newVals: turn updates into functional ops
  Instead of in-place updates, log them & define A_new over (A_old, log)
– X%*%Y (matrix multiply) etc.: built-in, for high-level opt.
  E.g. matrix chain multiplication: (XY)Z or X(YZ)?
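Why the order of a matrix chain matters can be seen from a short Python sketch. It assumes the textbook cost model of counting scalar multiplications (an assumption for illustration; an I/O-aware optimizer would presumably weigh disk transfers instead), and the function names are hypothetical.

```python
def multiply_cost(rows, inner, cols):
    """Scalar multiplications for a (rows x inner) by (inner x cols) product."""
    return rows * inner * cols

def chain_costs(x_shape, y_shape, z_shape):
    """Return the cost of (XY)Z and of X(YZ) for compatible shapes."""
    (a, b), (b2, c), (c2, d) = x_shape, y_shape, z_shape
    if b != b2 or c != c2:
        raise ValueError("incompatible matrix shapes")
    left_first = multiply_cost(a, b, c) + multiply_cost(a, c, d)    # (XY)Z
    right_first = multiply_cost(b, c, d) + multiply_cost(a, b, d)   # X(YZ)
    return left_first, right_first

left, right = chain_costs((1000, 10), (10, 1000), (1000, 10))
print(left, right)  # 20000000 200000
```

For these shapes X(YZ) is 100 times cheaper than (XY)Z, a choice that is only available to an optimizer if matrix multiplication is a first-class operator in the algebra.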
Processing/Layout Optimization
Matrix multiplication $T = A_{n_1 \times n_2} B_{n_2 \times n_3}$, with fixed memory size $M$
R: plain algorithm
    For each row i of A:
      For each column j of B:
        T[i,j] <- sum(A[i,] * B[,j])
BNLJ-inspired algorithm (BNLJ: block nested-loop join)
    Read as many rows of A as possible:
      Use one block to scan B in column-major order:
        Update elements in T
Blocked algorithm
    Divide memory into 3 equal parts
    Divide each matrix into square blocks
    For each chunk (i,j) in T:
      For k=1…p:
        Read chunk (i,k) from A and chunk (k,j) from B
        chunk T(i,j) += A(i,k) %*% B(k,j)
      Write chunk T(i,j)
RIOT-DB: hash join, sort, aggregate
Optimal I/O cost: $n_1 n_2 n_3 / (B M^{1/2})$
[Figure: three diagrams of T = A x B]
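The blocked algorithm can be sketched in pure Python as follows. This is an in-memory illustration only: it shows the loop structure over square blocks but performs no disk I/O, and the function name `blocked_matmul` and the `block` parameter are hypothetical. In an out-of-core setting the block side would be chosen so that one block each of A, B and T fits in memory, matching the "3 equal parts" rule above.

```python
def blocked_matmul(A, B, block):
    """Multiply A (n1 x n2) by B (n2 x n3) one square block at a time."""
    n1, n2, n3 = len(A), len(B), len(B[0])
    T = [[0.0] * n3 for _ in range(n1)]
    for i0 in range(0, n1, block):              # block row of T
        for j0 in range(0, n3, block):          # block column of T
            for k0 in range(0, n2, block):      # matching blocks of A and B
                for i in range(i0, min(i0 + block, n1)):
                    for k in range(k0, min(k0 + block, n2)):
                        a_ik = A[i][k]
                        for j in range(j0, min(j0 + block, n3)):
                            T[i][j] += a_ik * B[k][j]
    return T

A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
T = blocked_matmul(A, B, block=2)
print(T)  # [[58.0, 64.0], [139.0, 154.0]]
```

Each block of T is accumulated completely before the loops move on to the next one, which is what lets a disk-based version write every output block exactly once.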
Conclusion
I/O efficiency can be added transparently
– Ditch SQL at user level for broader impact!
DB-style inter-operation optimization is critical
– Need to go beyond developing I/O-efficient algorithms and libraries
Integration of DB and programming languages
– Lots of interesting analogies and new opportunities
Q&A
RIOT photos by Zack Gold (