Dataflow for High Performance Computing

Name: Dataflow for High Performance Computing
Uploaded: 2017-11-26T16:36:30+00:00
Duration: PTM10S10
Channel: Jody Franklin
Description: Dataflow for High Performance Computing

Dataflow for High Performance Computing
Leandro Marzulo Universidade do Estado do Rio de Janeiro

No more free lunch…
- Can't buy a new processor and expect performance to improve automatically.
- Parallel programming is a must!
  - Average programmers don't know how to do it
  - Parallel implementation may not scale
  - Synchronization
- Heterogeneous Systems
  - So many devices – CPU, GPU, Xeon Phi, FPGA …
  - So many libraries/languages – CUDA, OpenCL, TBB, OpenMP, MPI, Pthreads, VHDL…
- TOO MUCH TO LEARN!

Sweet times ahead...
- Time to think outside the box
- To experiment with different stuff
- To revisit old concepts
- To rethink the way we teach programming
- To connect to different fields and research groups
- The industry is investing!!!

Why Dataflow? Just because it feels natural!

Dataflow x Von Neumann

Characteristic | Dataflow | Von Neumann
Register File, Program Counter | ✖ | ✔
Control Flow | Steer (one per operand) | Branches and Jumps
Parallelism | Natural (Parallelism Explosion) | Pipeline, Branch Prediction, Tomasulo, ROB …
Language requirements | Functional (no side effects) * | Nonrestrictive
Compilation | Difficulties (especially loops and functions) | Several architecture-specific optimizations

* WaveScalar and its wave-ordering annotation scheme
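
The control-flow row of the comparison can be made concrete with a small sketch. It assumes a toy single-shot interpreter (the `Instr` class and the `steer`, `run`, and `abs_diff` functions are hypothetical names, not part of any system mentioned in these slides). An instruction fires as soon as all of its operands have arrived, with no program counter, and a steer forwards its data operand to one of two outputs depending on a boolean operand.

```python
from collections import deque

class Instr:
    """Hypothetical dataflow instruction: fires when every input port holds an operand."""
    def __init__(self, name, op, n_inputs):
        self.name = name
        self.op = op
        self.inputs = [None] * n_inputs
        self.filled = [False] * n_inputs
        self.targets = []  # (destination instruction, destination port, output index)

    def connect(self, dest, port, output=0):
        self.targets.append((dest, port, output))

    def receive(self, port, value):
        """Store an operand; return True once all ports are filled."""
        self.inputs[port] = value
        self.filled[port] = True
        return all(self.filled)

def steer(data, cond):
    """Forward data on output 0 if cond is true, otherwise on output 1."""
    return (data, None) if cond else (None, data)

def run(initial_operands):
    """Fire instructions as operands arrive; return firing order and results."""
    ready, fired, results = deque(), [], {}

    def deliver(instr, port, value):
        if instr.receive(port, value):
            ready.append(instr)

    for instr, port, value in initial_operands:
        deliver(instr, port, value)
    while ready:
        instr = ready.popleft()
        out = instr.op(*instr.inputs)
        if not isinstance(out, tuple):
            out = (out,)
        fired.append(instr.name)
        results[instr.name] = out
        for dest, port, idx in instr.targets:
            if out[idx] is not None:  # None marks the untaken steer output
                deliver(dest, port, out[idx])
    return fired, results

def abs_diff(a, b):
    """Build and run a tiny graph computing |a - b| with a steer instead of a branch."""
    sub = Instr("sub", lambda x, y: x - y, 2)
    is_neg = Instr("is_neg", lambda d: d < 0, 1)
    st = Instr("steer", steer, 2)
    neg = Instr("neg", lambda d: -d, 1)
    keep = Instr("keep", lambda d: d, 1)
    sub.connect(is_neg, 0)
    sub.connect(st, 0)
    is_neg.connect(st, 1)
    st.connect(neg, 0, output=0)   # taken when the difference is negative
    st.connect(keep, 0, output=1)  # taken otherwise
    return run([(sub, 0, a), (sub, 1, b)])
```

Calling `abs_diff(3, 10)` fires `sub`, `is_neg`, `steer`, and `neg`; the `keep` instruction never receives an operand, so it never runs. That is the dataflow counterpart of a branch not taken.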

Dataflow Revives!
- TERAFLUX (Unisi, BSC, Microsoft, HP, …)
  - Language
  - Compiler
  - Simulator (no actual HW yet)
- OmpSS (BSC)
  - Heterogeneous
- TBB Flowgraph (Intel)
  - Create and connect nodes
  - Associate them to Lambda Functions
  - Inject starter operands

Maxeler
- Static Dataflow – DAGs (mostly)
- FPGA based – DFE (DataFlow Engine)
- Michael Flynn – MPP / SBAC-PAD 2014 Keynote
  - More performance requires more effort (Flynn's words)
- Compiler – Dataflow Graph in FPGA
- Galava DFE – Academic version (USD 4999)
  - 500 multipliers
  - 12 GB RAM
  - PCI-E

Maxeler - Products
- CPUs plus DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
- DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
- Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
- MaxWorkstation: desktop development system
- MaxCloud: on-demand scalable accelerated compute resource, hosted in London

Maxeler - RTM 3U System
- 1U traditional CPU node
- 2 x MPC-X 2000 (16 DFEs)
- Less than 2.5 kW power usage
- Performance = 80 x 16 core Intel nodes!
  - 27x space reduction
  - 15x power consumption reduction
  - 5x improvement on total cost of ownership
- There are other similar examples

TALM
TALM is an Architecture and Language for Multithreading
- Hybrid Dataflow/Von Neumann (coarse-grained)
- Trebuchet Virtual Machine
- THLL (Annotations – C)
- Couillard Compiler

TALM
[Figure: TALM toolchain diagram. Labels recovered from the figure:]
- Files: .c C Source; .df.c Annotated Source; .fl Dataflow ASM Code; .flb Dataflow Binary; .lib.c Super-instructions Source; .so Super-instruction Library; Placement File
- Steps: Blocks Definition (THLL); Couillard (Super-Instruction Code Extraction, Dataflow Compilation); Assembler; Library Compilation (gcc); Placement File Creation; Dataflow Binary Code Generation; Loader
- Runtime: Trebuchet, with PE 1 … PE N connected by a Network, each PE holding some instructions (Inst 3, Inst 50, Inst 52 on PE 1; Inst 19, Inst 39, Inst 43 on PE N)

TALM – NW Code

TALM – Results - Blackscholes

TALM – Results - NW

TALM Extra Features
- Static Scheduler – Can use profiler information
- Selective Workstealing – Custom heuristic
- Memory Speculation
  - Transactional Memories
  - Distributed Control – Commit Graph
  - Avoid manual synchronization (dummy edges)
  - No Compiler Support yet
- Error Detection and Recovery
  - Redundant execution
  - Distributed Control – in the graph
- Can have super-instructions in CUDA
  - Compiler support needed (data movements)

Sucuri
A minimalistic Dataflow Programming Library for Python
- Transparent Execution on Clusters
  - Mpi_enable = TRUE
  - Need to obey DF principles – All data treated as operands
  - Python serializes objects – easy implementation
- Main Classes
  - Scheduler – Pool of tasks
  - Graph – Container
  - Nodes – Related to functions

Sucuri - Architecture

Sucuri - Pipeline
1. Create a Graph
2. Create a Scheduler
3. Create Nodes
4. Add nodes to Graph
5. Connect Nodes
6. Start Scheduler
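
The six pipeline steps can be sketched with a self-contained toy. This is not Sucuri's actual API: `ToyGraph`, `ToyNode`, and `ToyScheduler` are hypothetical stand-ins for the Graph, Nodes, and Scheduler classes named in the slides, and the scheduler here is sequential, with no cluster or MPI support.

```python
class ToyNode:
    """Hypothetical node: wraps a function and fires once all input ports are filled."""
    def __init__(self, func, n_inputs):
        self.func = func
        self.n_inputs = n_inputs
        self.operands = {}  # port -> value
        self.edges = []     # (destination node, destination port)

    def add_edge(self, dest, port):
        self.edges.append((dest, port))

class ToyGraph:
    """Hypothetical container that only keeps track of its nodes."""
    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)

class ToyScheduler:
    """Hypothetical sequential scheduler: a pool of ready tasks drained in FIFO order."""
    def __init__(self, graph):
        self.graph = graph
        self.outputs = []

    def start(self):
        # Nodes without inputs are the starters of the graph.
        ready = [n for n in self.graph.nodes if n.n_inputs == 0]
        while ready:
            node = ready.pop(0)
            args = [node.operands[p] for p in range(node.n_inputs)]
            result = node.func(*args)
            if not node.edges:
                self.outputs.append(result)  # sink node: keep its result
            for dest, port in node.edges:
                dest.operands[port] = result  # results travel as operands
                if len(dest.operands) == dest.n_inputs:
                    ready.append(dest)
        return self.outputs

# 1. Create a graph.  2. Create a scheduler.
graph = ToyGraph()
sched = ToyScheduler(graph)
# 3. Create nodes.
src_a = ToyNode(lambda: 6, 0)
src_b = ToyNode(lambda: 7, 0)
mul = ToyNode(lambda x, y: x * y, 2)
show = ToyNode(lambda v: f"result={v}", 1)
# 4. Add nodes to the graph.
for node in (src_a, src_b, mul, show):
    graph.add(node)
# 5. Connect nodes.
src_a.add_edge(mul, 0)
src_b.add_edge(mul, 1)
mul.add_edge(show, 0)
# 6. Start the scheduler.
outputs = sched.start()
```

The functions never touch shared state; every value a node needs arrives as an operand. That is the "all data treated as operands" principle that allows the same graph to run on a cluster.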

Sucuri – Results - LCS

Ongoing Work
- TALM
  - Compiler Improvements
  - Cluster Version
  - Placement Improvements
- Sucuri
  - Node Gallery (e.g. Image Filter Node)
  - Graph Templates (e.g. Fork/Join Graph, Wavefront Graph)
  - Better scheduler
- Both
  - Full GPU Support
  - FPGA Support
  - Multiple implementations for the same task!
  - Applications and users!

Our Dataflow Research Group
- Leandro Marzulo (UERJ)
- Tiago Alves
- Felipe França (UFRJ)
- Sandip Kundu (UMASS)
- Vítor Santos Costa (UPorto)
- Master Students (6 ongoing, 1 finished):
  - Brunno Goldstein – UFRJ
  - Leandro Santiago – UFRJ
  - Marcos Paulo Rocha – UFRJ
  - Leandro Rouberte – UFRJ
  - Alexandre Machado – UERJ
  - Julio Ho – UERJ
  - Alexandre Sardinha – Finished his Master – Petrobras
- Undergrad students (UERJ)
  - 6 finished – 3 are Master students now
  - 11 ongoing

Questions?

TALM – Results - RT

Sucuri – Hierarchical reduction

Sucuri - Wavefront
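
A wavefront pattern can be illustrated with LCS, the benchmark named in the Sucuri results slide. The sketch below is not Sucuri code; it assumes the usual dynamic-programming recurrence for LCS (the function name `lcs_wavefront` is hypothetical). It only shows the dependency structure: cells on the same anti-diagonal read values from the two previous diagonals, so all cells of one diagonal could be computed by independent tasks.

```python
def lcs_wavefront(a, b):
    """Length of the longest common subsequence, computed one anti-diagonal at a time."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    diagonals = []
    for d in range(2, n + m + 1):  # d = i + j for cells with i, j >= 1
        cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
        # Cells in `cells` depend only on diagonals d-1 and d-2,
        # so they are mutually independent and could run in parallel.
        for i, j in cells:
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
        diagonals.append(cells)
    return table[n][m], diagonals

length, diagonals = lcs_wavefront("ABCBDAB", "BDCABA")
```

The available parallelism grows from one cell on the first diagonal up to the length of the shorter string, then shrinks again, which is why the pattern is called a wavefront.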
