Lattice Boltzmann for Blood Flow: A Software Engineering Approach for a DataFlow SuperComputer
Nenad Korolija, Tijana Djukic, Nenad Filipovic, Veljko Milutinovic

Presentation transcript:

1/21 Lattice Boltzmann for Blood Flow: A Software Engineering Approach for a DataFlow SuperComputer
Nenad Korolija, Tijana Djukic, Nenad Filipovic, Veljko Milutinovic

2/21 My Work in a Nutshell
1. Introduction: Synergy of Physics and Logics
2. Problem: Moving LB to Maxeler
3. Existing Solutions: None :)
4. Essence: Map + Opt (PACT)
5. Details: My PhD
6. Analysis: BaU
7. Conclusions: 1000 (SPC)

3/21 Cooperation between BioIRC, UniKG, and the School of Electrical Engineering, UniBG

4/21 Lattice Boltzmann for Blood Flow: A Software Engineering Approach
Expensive, Quiet, Fast, Electrical, 20 m cord, Environment-friendly, Big pack, Wide track, Easy handling, Repair manual, Repair kit, 5-year warranty, Service in your town
New-technology, high-quality, non-rusting, heavy-duty, precise-cutting, recyclable blades; streams grass only to the bag...

5/21 Lattice Boltzmann for Blood Flow: A Software Engineering Approach
Expensive, Quiet, Electrical, 20 m cord, Environment-friendly, Big pack, Wide track, Easy handling, Repair manual, Repair kit, 5-year warranty, Service in your town
New-technology, high-quality, non-rusting, heavy-duty, precise-cutting, recyclable blades; streams grass only to the bag...

6/21 Lattice Boltzmann for Blood Flow: A Software Engineering Approach

7/21 Structure of the Existing C-Code for a MultiCore Computer
LS1 LS2 LS3 LS4 LS5
Statically: P / T = 100 / 400 = 25% => only 100 lines to "kernelize"
Dynamically: P / T = 99% => potential speed-up factor is at most 100
Legend: LS – looping structure; LS1 and LS5 – nested loops; LS2, LS3, and LS4 – simple loops; P – lines to parallelize; T – total number of lines
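To make the LS1..LS5 layout concrete, here is a minimal sketch of the shape of such a time-step routine (written in Java for consistency with the kernel code later in the deck; the original host code is C, and all names, bounds, and loop bodies here are hypothetical, not the project's code):

class TimeStepSketch {
    // Hypothetical shape of the five looping structures; ni and nj are
    // the lattice dimensions, f the distribution functions.
    static void timeStep(float[][] f, int ni, int nj) {
        for (int i = 0; i < ni; i++)                              // LS1: nested loop (streaming)
            for (int j = 0; j < nj; j++) { /* stream distributions */ }
        for (int i = 0; i < ni; i++) { /* boundary condition */ } // LS2: simple loop
        for (int j = 0; j < nj; j++) { /* boundary condition */ } // LS3: simple loop
        for (int j = 0; j < nj; j++) { /* boundary condition */ } // LS4: simple loop
        for (int i = 0; i < ni; i++)                              // LS5: nested loop (collision)
            for (int j = 0; j < nj; j++) { /* compute ro, vx, vy, feq */ }
    }
}

The two nested structures, LS1 and LS5, dominate the dynamic instruction count, which is where the 99% figure comes from.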

8/21 What Looping Structures to "Kernelize"?
All, because we want all data to reside on MAX3 prior to the execution start.
[Figure: execution alternating between MAX (DFE) and CPU phases]

9/21 What Looping Structures Bring What Benefits?
LS1 – moderate; LS2, LS3, LS4 – negligible, but must be "kernelized"; LS5 – major
[Figure: FOR i = 1 … n DO over operations OP1 … OPk; the CPU produces 1 result per k clocks, completing iterations at times T_k, T_2k, T_3k, …, while the MAX DFE produces 1 result per clock once the pipeline is filled – the DFE keeps all k operations in flight, the CPU executes only one at a time]
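The throughput claim on this slide can be written out explicitly (a standard pipelining bound, using the slide's k operations per iteration and n iterations):

\[
T_{\mathrm{CPU}} \approx n\,k \ \text{clocks}, \qquad
T_{\mathrm{DFE}} \approx n + k \ \text{clocks}, \qquad
\lim_{n\to\infty} \frac{T_{\mathrm{CPU}}}{T_{\mathrm{DFE}}} = k
\]

So a deep pipeline over a long stream approaches one result per clock, a factor of k over the sequential CPU.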

10/21 Why "Kernelizing" the Looping Structures? Conditions for "Kernelizing" Revisited

Why?                                LS1    LS2/3/4    LS5
1. BigData, O(n²)
2. WORM                              +        +        +
3. Tolerance to latency              +        +        +
4. Over 95% of run time in loops     +                 +
5. Reusability of the data           +                 +
6. Skills                          + + + +

11/21 Programming: Iteration #1 – What to Do with LS1..LS5?
Direct MultiCore Data Choreography: 1, 2, 3, 4, ...
Direct MultiCore Algorithm Execution: ∑∑ + ∑ + ∑ + ∑ + ∑∑
Direct MultiCore Computational Precision: Double-Precision Floating Point (64 bits)

12/21 Programming: Iteration #1 – Potentials of Direct "Kernelization"
Amdahl's Law: lim (DFE potential → ∞) = 100
Reality estimate: lim (work → ∞) = N
[Figure: pie charts – 99% parallelizable code vs. 1% serial; as the work grows, the serial share shrinks from 1% toward 0% (x%)]
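The "100" bound is the standard Amdahl's-Law limit for the 99% parallelizable fraction quoted on slide 7 (a worked check, not from the slides themselves):

\[
S(s) = \frac{1}{(1-P) + P/s}, \qquad P = 0.99
\;\Rightarrow\;
\lim_{s\to\infty} S(s) = \frac{1}{1-P} = 100
\]

The "reality estimate" is, in effect, the observation that as the problem size grows, the achievable speedup scales with the machine size N rather than saturating at 100.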

13/21 Pipelining the Inner Loops
[Figure: the (i, j) lattice is streamed from the inputs through a chain of kernels – Stream kernel(s), Middle Functions kernels, Collide kernel(s) – to the output, with data movement coordinated by the Manager]

14/21 The Kernel for LS1: Direct Migration

public class LS1Kernel extends Kernel {
    public LS1Kernel(KernelParameters parameters) {
        super(parameters);
        // Inputs
        HWVar f1new = io.scalarInput("f1new", hwFloat(8, 24));
        HWVar f5new = io.scalarInput("f5new", hwFloat(8, 24));
        HWVar f8new = io.scalarInput("f8new", hwFloat(8, 24));
        HWVar f1  = io.input("f1",  hwFloat(8, 24)); // j
        HWVar f2m = io.input("f2m", hwFloat(8, 24)); // j-1
        HWVar f3  = io.input("f3",  hwFloat(8, 24)); // j
        HWVar f4p = io.input("f4p", hwFloat(8, 24)); // j+1
        HWVar f5m = io.input("f5m", hwFloat(8, 24)); // j-1
        HWVar f6m = io.input("f6m", hwFloat(8, 24)); // j-1
        HWVar f7p = io.input("f7p", hwFloat(8, 24)); // j+1
        HWVar f8p = io.input("f8p", hwFloat(8, 24)); // j+1
        // ... the streaming computation and outputs continue beyond this slide
    }
}
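The j-1/j+1 neighbour values arrive here as separately prepared input streams (f2m, f4p, ...). An alternative worth noting (my suggestion, not on the slide) is MaxCompiler's stream.offset, which reads a neighbouring element of a single stream directly inside the kernel:

// Hypothetical alternative: one stream per distribution, with neighbour
// access via stream offsets instead of pre-shifted copies from the CPU.
HWVar f2  = io.input("f2", hwFloat(8, 24));
HWVar f2m = stream.offset(f2, -1);  // the value one position earlier (j-1)
HWVar f4  = io.input("f4", hwFloat(8, 24));
HWVar f4p = stream.offset(f4, +1);  // the value one position later (j+1)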

15/21 The Kernel for LS5: Direct Migration

// f0..f8 are the distribution-function input streams; rtau and faceq1..faceq3
// are scalar constants (declared as in the LS1 kernel on the previous slide).
// Do the summations needed to evaluate the density and components of velocity
HWVar ro   = f0 + f1 + f2 + f3 + f4 + f5 + f6 + f7 + f8;
HWVar rovx = f1 - f3 + f5 - f6 - f7 + f8;
HWVar rovy = f2 - f4 + f5 + f6 - f7 - f8;
HWVar vx = rovx / ro;
HWVar vy = rovy / ro;
// Also load the velocity magnitude into plotvar - this is what we will
// display using OpenGL later
HWVar v2x = vx * vx;
HWVar v2y = vy * vy;
HWVar plotvar = KernelMath.sqrt(v2x + v2y);
HWVar v_sq_term = 1.5f * (v2x + v2y);
// Evaluate the local equilibrium f values in all directions
HWVar vxmvy = vx - vy;
HWVar vxpvy = vx + vy;
HWVar rortau = ro * rtau;
HWVar rortaufaceq2 = rortau * faceq2;
HWVar rortaufaceq3 = rortau * faceq3;
HWVar vxpvyp3 = 3.f * vxpvy;
HWVar vxmvyp3 = 3.f * vxmvy;
HWVar vxp3 = 3.f * vx;
HWVar vyp3 = 3.f * vy;
HWVar v2xp45 = 4.5f * v2x;
HWVar v2yp45 = 4.5f * v2y;
HWVar mv_sq_term = 1.f - v_sq_term;
HWVar mv_sq_termpv2xp45 = mv_sq_term + v2xp45;
HWVar mv_sq_termpv2yp45 = mv_sq_term + v2yp45;
HWVar vxpvyp45vxpvy = 4.5f * vxpvy * vxpvy;
HWVar vxmvyp45vxmvy = 4.5f * vxmvy * vxmvy;
HWVar mv_sq_termpvxpvyp45vxpvy = mv_sq_term + vxpvyp45vxpvy;
HWVar mv_sq_termpvxmvyp45vxmvy = mv_sq_term - vxmvyp45vxmvy;
HWVar f0eq = rortau * faceq1 * mv_sq_term;
HWVar f1eq = rortaufaceq2 * (mv_sq_termpv2xp45 + vxp3);
HWVar f2eq = rortaufaceq2 * (mv_sq_termpv2yp45 + vyp3);
HWVar f3eq = rortaufaceq2 * (mv_sq_termpv2xp45 - vxp3);
HWVar f4eq = rortaufaceq2 * (mv_sq_termpv2yp45 - vyp3);
HWVar f5eq = rortaufaceq3 * (mv_sq_termpvxpvyp45vxpvy + vxpvyp3);
HWVar f6eq = rortaufaceq3 * (mv_sq_termpvxmvyp45vxmvy - vxmvyp3);
HWVar f7eq = rortaufaceq3 * (mv_sq_termpvxpvyp45vxpvy - vxpvyp3);
HWVar f8eq = rortaufaceq3 * (mv_sq_termpvxmvyp45vxmvy + vxmvyp3);
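The slide stops at the equilibrium values. A minimal sketch of how the collision step might be completed (the relaxation update and output calls below are my reconstruction of the standard BGK step, assuming rtau is available as a scalar input; none of this is taken from the slides):

// Hypothetical completion: BGK relaxation toward equilibrium, then output.
// Note that f0eq..f8eq above already carry the ro*rtau factor, so the
// update takes the form f_new = (1 - rtau)*f + feq.
HWVar rtau1 = 1.f - rtau;
HWVar f0new = rtau1 * f0 + f0eq;
HWVar f1new = rtau1 * f1 + f1eq;
// ... likewise for f2new..f8new ...
io.output("f0new", f0new, hwFloat(8, 24));
io.output("f1new", f1new, hwFloat(8, 24));
io.output("plotvar", plotvar, hwFloat(8, 24)); // velocity magnitude for display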

16/21 Programming: Iteration #2 – Ideas for Additional Speedup (a) Better Data Choreography
[Figure: data choreography rearranged into 5x blocks]
Estimate: 1.2x speed-up (as seen from the drawing above)

17/21 Programming: Iteration #3 – Ideas for Additional Speedup (b) Algorithmic Changes
∑∑ + ∑ + ∑ + ∑ + ∑∑ → ∑∑ + ∑ + ∑∑
Explanation: as seen from the previous drawing, LS2 and LS3 can be integrated with LS1 (see the loop-fusion sketch below).
Estimate: 1.6x speed-up
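A minimal loop-fusion sketch of what "integrating LS2 and LS3 with LS1" means (plain Java with hypothetical names and bodies, not the project's code): the two simple boundary loops are absorbed into the nested streaming loop, so the grid is traversed once instead of three times.

class LoopFusionSketch {
    // Before: LS1, LS2, and LS3 each traverse (part of) the grid separately.
    static void separate(float[][] f, int ni, int nj) {
        for (int i = 0; i < ni; i++)
            for (int j = 0; j < nj; j++) stream(f, i, j);  // LS1
        for (int i = 0; i < ni; i++) boundaryWest(f, i);   // LS2
        for (int i = 0; i < ni; i++) boundaryEast(f, i);   // LS3
    }
    // After: the boundary work rides along with each row of the nested loop.
    static void fused(float[][] f, int ni, int nj) {
        for (int i = 0; i < ni; i++) {
            for (int j = 0; j < nj; j++) stream(f, i, j);  // LS1
            boundaryWest(f, i);                            // was LS2
            boundaryEast(f, i);                            // was LS3
        }
    }
    static void stream(float[][] f, int i, int j) { /* streaming step */ }
    static void boundaryWest(float[][] f, int i) { /* boundary condition */ }
    static void boundaryEast(float[][] f, int i) { /* boundary condition */ }
}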

18/21 Programming: Iteration #4 – Ideas for Additional Speedup (c) Precision Changes
LUT (double-precision floating point, 64 bits) = 500
LUT (Maxeler-precision floating point, 24 bits) = 24
Explanation: with less precision, hardware complexity can be reduced by a factor of about 20. Increasing the number of iterations 4 times brings approximately the same precision, much faster.
Estimate: factor = (500 / 24) / 4 ≈ 5
This is the only action before which a domain expert has to be consulted!
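In MaxJ the precision change is a one-line type swap in the kernels above; the exact format below (8-bit exponent plus 16-bit mantissa, 24 bits total) is my assumption of what "Maxeler-precision, 24" denotes:

// Hypothetical 24-bit floating-point type: 8-bit exponent + 16-bit mantissa,
// replacing the single-precision hwFloat(8, 24) used so far.
HWVar f1 = io.input("f1", hwFloat(8, 16));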

19/21 Lattice Boltzmann

20/21 Results: SPTC (Speedup, Power, Transistors, Cost) ≈ 1000x
"Maxeler's technology enables organizations to speed up processing times by 20-50x, with over 90% reduction in energy usage and over 95% reduction in data centre space."
Speedup factor: 1.2 × 1.6 × 5 × N ≈ 10N (precisely)
Power reduction factor (i7/MAX3) = 17.6 / (MAX2/MAX3) ≈ 10 (precisely: the WallCord method)
Transistor count reduction factor = i7/MAX3 (precisely: about 20)
Cost reduction factor: x (precisely: depends on production volumes)
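The speedup product checks out arithmetically (my verification of the slide's figure, multiplying the estimates from iterations #2, #3, and #4):

\[
1.2 \times 1.6 \times 5 = 9.6 \approx 10
\;\Rightarrow\;
1.2 \times 1.6 \times 5 \times N \approx 10\,N
\]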

Q&A: Hawaii Tahiti 10km/h ! 30km/h !!! 21/21