Keshav Pingali The University of Texas at Austin Parallel Program = Operator + Schedule + Parallel data structure SAMOS XV Keynote.


1 Keshav Pingali The University of Texas at Austin Parallel Program = Operator + Schedule + Parallel data structure SAMOS XV Keynote

2 Parallel computing is changing
Old World versus New World:
– Platforms: dedicated clusters versus cloud, mobile
– Data: structured (vector, matrix) versus unstructured (graphs)
– People: small number of highly trained scientists versus large number of self-trained parallel programmers

3 The Search for “Scalable” Parallel Programming Models
Tension between productivity and performance:
– support a large number of application programmers (Joes) with a small number of expert parallel programmers (Stephanies)
– performance comparable to hand-optimized codes
Galois project:
– data-centric abstractions for parallelism and locality: the operator formulation
– scalable shared-memory system

4 What we have learned
Abstractions for parallelism:
– Yesterday: computation-centric abstractions (loops or procedure calls that can be executed in parallel)
– Today: data-centric abstractions (operator formulation of algorithms)
Parallelization strategies:
– Yesterday: static parallelization is the norm; inspector-executor, optimistic parallelization, etc. are needed only when you lack information about the algorithm or data structure
– Today: optimistic parallelization is the baseline; inspector-executor, static parallelization, etc. are possible only when the algorithm has enough structure
Applications:
– Yesterday: programs are monoliths; whole-program analysis is essential
– Today: programs must be layered; data abstraction is essential not just for software engineering but for parallelism

5 Parallelism: Yesterday (computation-centric view of parallelism)
What does the program do?
– It does not matter.
Where is the parallelism in the program?
– Loop: do static analysis to find the dependence graph
– Static analysis fails to find parallelism: maybe there is no parallelism in the program?
Thread-level speculation:
– Misspeculation and overheads limit performance
– Misspeculation costs power and energy
Delaunay mesh refinement (pseudocode):
    Mesh m = /* read in mesh */
    WorkList wl;
    wl.add(m.badTriangles());
    while (!wl.empty()) {
      Element e = wl.get();
      if (e no longer in mesh) continue;
      Cavity c = new Cavity(e);
      c.expand();
      c.retriangulate();
      m.update(c);              // update mesh
      wl.add(c.badTriangles());
    }

6 Parallelism: Today
Delaunay mesh refinement (red triangle: badly shaped triangle; blue triangles: cavity of the bad triangle)
Parallelism:
– Bad triangles whose cavities do not overlap can be processed in parallel
– Parallelism must be found at runtime
Data-centric view of the algorithm:
– Active elements: bad triangles
– Local view: operator applied to a bad triangle: find the cavity of the bad triangle (blue); remove the triangles in the cavity; retriangulate the cavity and update the mesh
– Global view: schedule
– Algorithm = Operator + Schedule
Parallel data structures:
– Graph
– Worklist of bad triangles

7 Example: Graph analytics
Single-source shortest-path (SSSP) problem
Many algorithms:
– Dijkstra (1959)
– Bellman-Ford (1957)
– Chaotic relaxation (1969)
– Delta-stepping (1998)
Common structure:
– Each node has a distance label d
– Operator: relax-edge(u,v): if d[v] > d[u] + length(u,v) then d[v] ← d[u] + length(u,v)
– Active node: unprocessed node whose distance field has been lowered
– Different algorithms use different schedules
– Schedules differ in parallelism, locality, and work efficiency
(Figure: example weighted graph; the source node has distance label 0, all other nodes ∞.)
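The relax-edge operator and worklist schedule above can be written out as sequential C++. This is a minimal chaotic-relaxation sketch, not Galois code; the adjacency-list types and names are assumptions for illustration:

```cpp
#include <vector>
#include <deque>
#include <limits>

// Hypothetical adjacency-list graph: g[u] lists outgoing edges of node u.
struct Edge { int dst; int len; };
using Graph = std::vector<std::vector<Edge>>;

const int INF = std::numeric_limits<int>::max();

std::vector<int> sssp_chaotic(const Graph& g, int source) {
    std::vector<int> d(g.size(), INF);
    d[source] = 0;
    std::deque<int> worklist{source};        // active nodes
    while (!worklist.empty()) {
        int u = worklist.front(); worklist.pop_front();
        for (const Edge& e : g[u]) {         // operator: relax-edge(u,v)
            if (d[u] != INF && d[e.dst] > d[u] + e.len) {
                d[e.dst] = d[u] + e.len;     // label lowered ...
                worklist.push_back(e.dst);   // ... so the node becomes active
            }
        }
    }
    return d;
}
```

Replacing the FIFO deque with a priority queue ordered by distance label turns this same operator into Dijkstra's algorithm: the schedule, not the operator, distinguishes the algorithms.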

8 Example: Stencil computation
Finite-difference computation: Jacobi iteration, 5-point stencil (A_t → A_{t+1})
    // Jacobi iteration with 5-point stencil
    // initialize array A
    for time = 1, nsteps:
      for <i,j> in [2,n-1] x [2,n-1]:
        temp(i,j) = 0.25*(A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
      for <i,j> in [2,n-1] x [2,n-1]:
        A(i,j) = temp(i,j)
Algorithm:
– Active nodes: nodes in A_{t+1}
– Operator: five-point stencil
– Different schedules have different locality
Regular application:
– Grid structure and active nodes known statically
– Application can be parallelized at compile time
“Data-centric multilevel blocking”, Kodukula et al., PLDI 1997
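The pseudocode above can be transcribed directly into plain sequential C++ (a sketch only; the grid type and fixed-boundary convention are assumptions). Every interior point is active in every time step, and each update reads only the previous grid, so any schedule over the active points is legal:

```cpp
#include <vector>
#include <cstddef>
#include <utility>

using Grid = std::vector<std::vector<double>>;

// Jacobi iteration, 5-point stencil; boundary rows/columns are fixed.
void jacobi(Grid& A, int nsteps) {
    std::size_t n = A.size();
    Grid temp = A;                       // same boundary values as A
    for (int t = 0; t < nsteps; ++t) {
        for (std::size_t i = 1; i + 1 < n; ++i)
            for (std::size_t j = 1; j + 1 < n; ++j)
                temp[i][j] = 0.25 * (A[i-1][j] + A[i+1][j]
                                   + A[i][j-1] + A[i][j+1]);
        std::swap(A, temp);              // A_{t+1} becomes the new A_t
    }
}
```

The double-buffered temp array is what makes the active nodes independent; an in-place update would instead impose a schedule-dependent (Gauss-Seidel) semantics.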

9 Operator formulation of algorithms
Active element:
– Node/edge where computation is needed
Operator:
– Computation at the active element
– Activity: application of the operator to an active element
Neighborhood:
– Set of nodes/edges read/written by an activity
– Usually distinct from the neighbors in the graph
Ordering: scheduling constraints on the execution order of activities
– Unordered algorithms: no semantic constraints, but performance may depend on the schedule
– Ordered algorithms: problem-dependent order
Amorphous data-parallelism:
– Multiple active nodes can be processed in parallel, subject to neighborhood and ordering constraints
Parallel program = Operator + Schedule + Parallel data structure
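The neighborhood constraint above reduces to a disjointness test: two activities of an unordered algorithm may run in parallel exactly when their neighborhoods do not overlap. A minimal illustrative sketch (not Galois code; integer element ids and the set representation are assumptions):

```cpp
#include <set>
#include <algorithm>
#include <iterator>

// Neighborhood = the node/edge ids an activity reads or writes.
using Neighborhood = std::set<int>;

// Two activities conflict iff their neighborhoods intersect.
bool conflict(const Neighborhood& a, const Neighborhood& b) {
    Neighborhood overlap;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(overlap, overlap.begin()));
    return !overlap.empty();
}
```

A runtime that checks this predicate optimistically (rolling back one activity on conflict) is one way to exploit amorphous data-parallelism; locking the neighborhood elements up front is another.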

10 Locality
Temporal locality:
– Activities with overlapping neighborhoods should be scheduled close in time
– Example: activities i1 and i2
Spatial locality:
– The abstract view of the graph can be misleading
– Depends on the concrete representation of the data structure
Inter-package locality:
– Partition the graph between packages and partition the concrete data structure correspondingly
– An active node is processed by the package that owns that node
Concrete representation (coordinate storage):
    src: 1   1   2   3
    dst: 2   1   3   2
    val: 3.4 3.6 0.9 2.1
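The coordinate-storage layout in the table above can be sketched as three parallel arrays (an illustrative sketch; the struct and method names are assumptions). Spatial locality of an activity depends on how edges are ordered in these arrays, not on the abstract graph:

```cpp
#include <vector>

// Coordinate (COO) storage: edge k is the triple (src[k], dst[k], val[k]).
struct CooGraph {
    std::vector<int>    src, dst;
    std::vector<double> val;
    void addEdge(int s, int d, double v) {
        src.push_back(s);
        dst.push_back(d);
        val.push_back(v);
    }
};
```

Sorting the arrays by src (or partitioning them by the package that owns each source node) changes spatial and inter-package locality without changing the abstract graph at all.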

11 Parallelization strategies: Binding Time
When do you know the active nodes and neighborhoods?
1. Compile-time: static parallelization (stencil codes, FFT, dense linear algebra)
2. After input is given: inspector-executor (Bellman-Ford)
3. During program execution: interference graph (DMR, chaotic SSSP)
4. After program is finished: optimistic parallelization (time-warp)
“The TAO of parallelism in algorithms”, Pingali et al., PLDI 2011

12 Galois system
Ubiquitous parallelism:
– a small number of expert programmers (Stephanies) must support a large number of application programmers (Joes)
– cf. SQL
Galois system:
– Stephanie: library of concurrent data structures and runtime system
– Joe: application code in sequential C++; the Galois set iterator highlights opportunities for exploiting amorphous data-parallelism (ADP)
Parallel program = Operator + Schedule + Parallel data structures
(Joe: Operator + Schedule; Stephanie: Parallel data structures)

13 Implementation of the data-centric approach
Application (Joe) program:
– Sequential C++
– Galois set iterator: for each
  – new elements can be added to the set during iteration
  – optional scheduling specification (cf. OpenMP)
  – highlights opportunities in the program for exploiting amorphous data-parallelism
Runtime system:
– Ensures serializability of iterations
– Execution strategies: speculation, interference graphs
(Figure: master thread running the application program over concurrent data structures.)

14 Hello graph: Galois program
    #include "Galois/Galois.h"
    #include "Galois/Graphs/LCGraph.h"

    // Data structure declarations
    struct Data { int value; float f; };
    typedef Galois::Graph::LC_CSR_Graph<Data, void> Graph;
    typedef Graph::GraphNode Node;
    Graph graph;

    // Operator
    struct P {
      void operator()(Node n, Galois::UserContext<Node>& ctx) {
        graph.getData(n).value += 1;
      }
    };

    int main(int argc, char** argv) {
      graph.structureFromGraph(argv[1]);
      // Galois iterator
      Galois::for_each(graph.begin(), graph.end(), P());
      return 0;
    }

15 Galois vs. other graph frameworks
Intel study: “Navigating the maze of graph analytics frameworks”, Nadathur et al., SIGMOD 2014

16 Galois: Graph analytics
– Galois lets you code more effective algorithms for graph analytics than DSLs like PowerGraph (left figure)
– It is easy to implement the APIs of graph DSLs on top of Galois and exploit its better infrastructure: a few hundred lines of code each for PowerGraph and Ligra (right figure)
“A lightweight infrastructure for graph analytics”, Nguyen, Lenharth, Pingali, SOSP 2013

17 PageRank
Topology-driven versions:
– top: topology-driven (classic “power iteration” approach)
– GraphLab: topology-driven (with filtering)
Data-driven versions:
– dd-push: residual-push based, no pull at all
– dd-pp: pull for PageRank, push residual to decide what to schedule
– dd-basic: data-driven pull
Scheduling:
– none: FIFO
– prs: set-favoring priority scheduler
– prt: OBIM

18 Recommender system
Given a database of movie/item ratings by a set of users, predict which movies/items a user may want to buy.
(Figure: sparse ratings matrix A over users and items, factored as W × H.)

19 Evaluation
Parameters:
– Same SGD parameters and initial latent vectors between implementations
– Step size: ε(n) = alpha / (1 + beta * n^1.5)
Implementations:
– Galois: 2 weeks to implement
– GraphLab: standard GraphLab only supports GD; we implement 1D SGD with unsynchronized updates
– Nomad: MPI code, ~1 year to implement
Datasets:
    Dataset  Items  Users  Ratings  Sparsity
    netflix  17K    480K   99M      1e-2
    yahoo    624K   1M     253M     4e-4
Yun, Yu, Hsieh, Vishwanathan, Dhillon. “NOMAD: Non-locking stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion”, VLDB 2013.
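The step-size schedule quoted above is straightforward to write down; a sketch, where alpha and beta are the (unspecified) tuning constants and n is the update count:

```cpp
#include <cmath>

// SGD step-size schedule: epsilon(n) = alpha / (1 + beta * n^1.5).
double step_size(double alpha, double beta, long n) {
    return alpha / (1.0 + beta * std::pow(static_cast<double>(n), 1.5));
}
```

Because the exponent 1.5 exceeds 1, the step size decays fast enough for the stochastic updates to settle while still moving quickly in early iterations.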

20 Throughput
Theoretical peak: 363.2 GFLOPS

21 Convergence (20 threads)

22 Galois: Performance on SGI Ultraviolet

23 FPGA Tools
Moctar & Brisk, “Parallel FPGA Routing based on the Operator Formulation”, DAC 2014

24 Elixir: DSL for graph algorithms
Inputs: graph, operators, schedules

25 SSSP: synthesized vs. handwritten
Input graph: Florida road network (1M nodes, 2.7M edges)

26 Relation to other parallel programming models
Galois:
– Parallel program = Operator + Schedule + Parallel data structure
– The operator can be expressed as a graph rewrite rule on the data structure
Functional languages:
– Semantics specified in terms of rewrite rules like β-reduction
– Rules rewrite the program, not data structures
Logic programming:
– (Kowalski) Algorithm = Logic + Control
– Control ~ Schedule
Transactions:
– An activity in Galois has transactional semantics (atomicity, consistency, isolation)
– But transactions are synchronization constructs for explicitly parallel languages, whereas the Joe programming model in Galois is sequential

27 Conclusions
Yesterday:
– Computation-centric view of parallelism
Today:
– Data-centric view of parallelism: the operator formulation of algorithms
– Permits a unified view of parallelism and locality in algorithms
– Joe/Stephanie programming model; the Galois system is an implementation
Tomorrow:
– DSLs for different applications, layered on top of Galois
Parallel program = Operator + Schedule + Parallel data structure
(Joe: Operator + Schedule; Stephanie: Parallel data structures)

28 Intelligent Software Systems (ISS) group
Faculty: Keshav Pingali (CS/ECE/ICES)
Research associates: Andrew Lenharth, Sree Pai
PhD students: Amber Hassaan, Rashid Kaleem, Donald Nguyen, Dimitris Prountzos, Xin Sui, Gurbinder Singh
Visitors from China, France, India, Italy, Poland, Portugal
Home page: http://iss.ices.utexas.edu
Funding: NSF, DOE, Qualcomm, Intel, NEC, NVIDIA, …

29 More information
Website: http://iss.ices.utexas.edu
Downloads:
– Galois system for multicores
– Lonestar benchmarks
– All our papers

30 Nested ADP
Two levels of parallelism:
– Activities can be performed in parallel if their neighborhoods are disjoint: inter-operator parallelism
– Activities may also have internal parallelism, since they may update many nodes and edges in their neighborhood: intra-operator parallelism
Densely connected graphs (cliques):
– A single neighborhood may cover the entire graph
– Little inter-operator parallelism, lots of intra-operator parallelism
– The dominant parallelism in dense matrix algorithms
Sparse matrix factorization:
– A lot of inter-operator parallelism initially
– Towards the end, the graph becomes dense, so we need to switch to exploiting intra-operator parallelism

31 Galois system
Ubiquitous parallelism:
– a small number of expert programmers (Stephanies) must support a large number of application programmers (Joes)
– cf. SQL
Galois system:
– Library of concurrent data structures and runtime system, written by expert programmers (Stephanies)
– Application programmers (Joes) code in sequential C++
– All concurrency control is in the data structure library and the runtime system
– A wide variety of scheduling policies is supported, including deterministic schedules
Parallel program = Operator + Schedule + Parallel data structures
(Joe: Operator + Schedule; Stephanie: Parallel data structures)

32 Application program and execution model
(Figure: master thread running the Joe program; iterations i1 … i5 execute over concurrent data structures.)

33 Application program and execution model
Sequential, object-oriented model, implemented in C++
Galois set iterators:
– for each e in Set S do B(e):
  – evaluate B(e) for each element in set S
  – no a priori order on iterations
  – set S may get new elements during execution
– for each e in OrderedSet S do B(e):
  – evaluate B(e) for each element in set S
  – perform iterations in the order specified by the OrderedSet
  – set S may get new elements during execution
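The semantics of the unordered set iterator above can be modeled sequentially in a few lines (a sketch, not the Galois implementation; the names and the FIFO policy are assumptions). The key property is that the operator may push new elements into the workset while iteration is in progress, and no iteration order is promised:

```cpp
#include <deque>

// Sequential model of the unordered Galois set iterator: apply op to every
// element of the workset; op may push new elements, which are also processed.
// FIFO order here is just one legal schedule among many.
template <typename T, typename Op>
void for_each_model(std::deque<T> workset, Op op) {
    while (!workset.empty()) {
        T e = workset.front();
        workset.pop_front();
        op(e, [&workset](T x) { workset.push_back(x); });  // second arg models ctx.push
    }
}
```

A parallel implementation must additionally guarantee that the result is equivalent to some sequential interleaving of this loop, which is exactly the serializability obligation of the runtime system.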

34 Evaluation of throughput
Note: Nomad with 40 threads on bgg does not converge.

35 Recommender system
Implementations:
– Galois: 2D scheduling, synchronized updates
– GraphLab: standard GraphLab only supports GD; we implement 1D SGD with unsynchronized updates
– Nomad: from-scratch distributed code; 2D scheduling (tile size is a function of the number of threads); synchronized updates
Machine:
– 4 x 10-core Xeon E7-4860 (“Westmere”), 2.27 GHz
Parameters:
– Same SGD parameters and initial latent vectors between implementations
– ε(n) = alpha / (1 + beta * n^1.5)
– Hand-tuned tile size
Datasets:
    Dataset  Items  Users  Ratings  Sparsity
    bgg      47K    109K   6M       1e-3
    netflix  17K    480K   99M      1e-2
    yahoo    624K   1M     253M     4e-4

36 Convergence
(Figure: error vs. runtime in seconds.)

37 TAO analysis: algorithm abstraction
– Dijkstra SSSP: general graph, data-driven, ordered, local computation
– Chaotic relaxation SSSP: general graph, data-driven, unordered, local computation
– Delaunay mesh refinement: general graph, data-driven, unordered, morph
– Jacobi: grid, topology-driven, unordered, local computation


39 Example: Friend recommendations
Facebook, LinkedIn, …
Facebook graph:
– nodes: people (~1 billion)
– edges: friendships
Friend recommendations:
– Candidates: friends of your friends who are not already your friends
– Ranking candidates: number of mutual friends, shared interests (attributes on nodes), …
Take-away:
– the key data structure is a large sparse graph
– algorithms may involve path computations in graphs
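The candidate-generation and ranking steps described above can be sketched over a toy adjacency-set graph (an illustration only; the graph representation and function names are assumptions, and a real billion-node graph would need a very different data structure):

```cpp
#include <map>
#include <set>

// Toy undirected friendship graph: node id -> set of friend ids.
using SocialGraph = std::map<int, std::set<int>>;

// Candidates for u: friends of friends of u who are not u and not already
// friends of u. Returns candidate -> number of mutual friends (the rank).
std::map<int, int> recommend(const SocialGraph& g, int u) {
    std::map<int, int> score;
    const std::set<int>& friends = g.at(u);
    for (int f : friends)                       // each friend f of u
        for (int c : g.at(f))                   // each friend c of f
            if (c != u && friends.count(c) == 0)
                ++score[c];                     // f is a mutual friend of u and c
    return score;
}
```

This is a length-2 path computation from u, which is why the slide's take-away holds: recommendation quality hinges on traversing a large sparse graph efficiently.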

