Presentation transcript:

1/18 Lattice Boltzmann for Blood Flow: A Software Engineering Approach for a DataFlow SuperComputer. Nenad Korolija, Tijana Djukic, Nenad Filipovic, Veljko Milutinovic

2/18 Lattice Boltzmann for Blood Flow: A Software Engineering Approach
(A lawn-mower advertisement, quoted as an analogy:) Expensive, quiet, fast, electrical, 20 m cord, environment-friendly, big pack, wide track, easy handling, repair manual, repair kit, 5-year warranty, service in your town; new-technology, high-quality, non-rusting, heavy-duty, precise-cutting, recyclable blades; streams grass only to the bag...

3/18 Lattice Boltzmann for Blood Flow: A Software Engineering Approach
(The same advertisement, now without "fast":) Expensive, quiet, electrical, 20 m cord, environment-friendly, big pack, wide track, easy handling, repair manual, repair kit, 5-year warranty, service in your town; new-technology, high-quality, non-rusting, heavy-duty, precise-cutting, recyclable blades; streams grass only to the bag...

4/18 Structure of the Existing C-Code for a MultiCore Computer
The code contains five looping structures: LS1, LS2, LS3, LS4, and LS5.
Statically: P / T = 100 / 400 = 25% => only 100 lines to "kernelize".
Dynamically: P / T = 99% => the potential speed-up factor is at most 1 / (1 - 0.99) = 100.
Legend: LS – looping structure; LS1 and LS5 – nested loops; LS2, LS3, and LS4 – simple loops; P – lines to parallelize; T – total number of lines.
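The transcript preserves only these statistics, not the C code itself. Below is a minimal, purely hypothetical C skeleton of such a structure - two nested loop nests (LS1, LS5) and three simple loops (LS2-LS4); all names, sizes, and loop bodies are invented for illustration.

```c
#include <stddef.h>

#define NX 128
#define NY 128

/* Hypothetical skeleton of the five looping structures; the real code
 * operates on lattice-Boltzmann data and is not reproduced here. */
void simulate_step(double f[NX][NY], double rho[NX], double u[NX])
{
    /* LS1: nested loop (e.g., preparing the lattice) */
    for (size_t i = 0; i < NX; i++)
        for (size_t j = 0; j < NY; j++)
            f[i][j] = 0.0;

    /* LS2, LS3, LS4: simple loops (e.g., per-row quantities) */
    for (size_t i = 0; i < NX; i++) rho[i] = 1.0;
    for (size_t i = 0; i < NX; i++) u[i] = 0.0;
    for (size_t i = 0; i < NX; i++) rho[i] += u[i];

    /* LS5: nested loop (e.g., the collide step that dominates the run time) */
    for (size_t i = 0; i < NX; i++)
        for (size_t j = 0; j < NY; j++)
            f[i][j] = 0.5 * (f[i][j] + 1.0);
}
```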

5/18 What Looping Structures to "Kernelize"?
All of them, because we want all the data to reside on the MAX3 card before the execution starts.
[Figure: execution alternating between MAX (DFE) and CPU phases.]
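On the host side, "all data resides on MAX3 prior to the execution start" corresponds to a copy-in / run / copy-out pattern. The sketch below only illustrates that pattern; dfe_write(), dfe_run(), and dfe_read() are hypothetical placeholders, not the actual Maxeler host (SLiC) API.

```c
#include <stddef.h>

/* Hypothetical placeholders for the vendor host API (assumptions, not real calls). */
void dfe_write(const double *data, size_t n);  /* copy data to the card's memory */
void dfe_run(void);                            /* execute the kernelized loops   */
void dfe_read(double *data, size_t n);         /* copy results back to the host  */

void run_on_dfe(const double *input, double *output, size_t n, int iterations)
{
    dfe_write(input, n);               /* all data moves to the MAX3 card up front */
    for (int it = 0; it < iterations; it++)
        dfe_run();                     /* LS1..LS5 execute on the card; CPU waits  */
    dfe_read(output, n);               /* results return only after the last run   */
}
```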

6/18 What Looping Structures Bring What Benefits?
LS1 – moderate; LS2, LS3, LS4 – negligible, but must be "kernelized" anyway; LS5 – major.
[Figure: timing diagrams for a loop FOR i = 1 … n DO with k operations (OP1 … OPk) per iteration: the Maxeler FPGA performs the k operations as a pipeline and delivers 1 result per clock (T0, T1, T2, …), while the CPU performs one operation at a time and delivers 1 result per k clocks (T0, Tk, T2k, …).]
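A back-of-the-envelope C model of the throughput claim in the figure; n, k, and the pipeline depth are illustrative values, not measurements.

```c
#include <stdio.h>

/* With k operations per iteration, a pipelined FPGA kernel emits one result
 * per clock once the pipeline is full, while a CPU executing one operation
 * at a time needs roughly k clocks per result. */
int main(void)
{
    const long n = 1000000;   /* loop iterations (illustrative)   */
    const long k = 20;        /* operations per iteration         */
    const long depth = 100;   /* pipeline fill latency, in clocks */

    long cpu_clocks = n * k;        /* 1 result every k clocks          */
    long dfe_clocks = depth + n;    /* 1 result per clock after filling */

    printf("CPU: %ld clocks, DFE: %ld clocks, ratio ~%.1fx\n",
           cpu_clocks, dfe_clocks, (double)cpu_clocks / dfe_clocks);
    return 0;
}
```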

7/18 Why "Kernelize" the Looping Structures? Conditions for "Kernelizing" Revisited

Why?                               LS1    LS2/3/4    LS5
1. BigData O(n²)
2. WORM                             +        +        +
3. Tolerance to latency             +        +        +
4. Over 95% of run time in loops    +                 +
5. Reusability of the data          +                 +
6. Skills                           ++       +        +

8/18 Programming: Iteration #1 - What to Do with LS1..5?
- Direct migration of the MultiCore data choreography: 1, 2, 3, 4, ...
- Direct migration of the MultiCore algorithm execution: ∑∑ + ∑ + ∑ + ∑ + ∑∑ (LS1 through LS5)
- Direct migration of the MultiCore computational precision: double-precision floating point (64 bits)

9/18 Programming: Iteration #1 - Potentials of Direct "Kernelization"
Amdahl's Law: lim (FPGA potential → ∞) speed-up = 1 / (1 - 0.99) = 100
Reality estimate: lim (x → ) = N
[Figure: run-time breakdown - 95% / 5% split, with 0%, 5%, and x% regions.]
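A short C program evaluating Amdahl's Law with the 99% figure from slide 4/18; it shows why the bound stays at 100 no matter how much FPGA potential is added.

```c
#include <stdio.h>

int main(void)
{
    const double p = 0.99;   /* fraction of run time in the kernelized loops */

    /* Overall speed-up = 1 / ((1 - p) + p / s), where s is the speed-up of
     * the accelerated part; as s grows, the result approaches 1/(1-p) = 100. */
    for (double s = 10.0; s <= 1.0e6; s *= 10.0) {
        double speedup = 1.0 / ((1.0 - p) + p / s);
        printf("FPGA potential %9.0fx -> overall speed-up %6.2fx\n", s, speedup);
    }
    return 0;
}
```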

10/18 Pipelining the Inner Loops
[Figure: the (i, j) loop nest mapped onto streams - inputs flow through the middle-function kernel(s) and the collide kernel(s), connected by the manager, to the output stream.]

11/18 The Kernel for LS1: Direct Migration

12/18 The Kernel for LS5: Direct Migration
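The kernel source itself is not preserved in this transcript. For orientation only, here is a textbook D2Q9 BGK collision loop in plain C, showing the kind of per-node arithmetic an LS5-style "collide" structure performs; this is generic lattice-Boltzmann code, not the authors' kernel.

```c
#include <stddef.h>

#define Q 9  /* D2Q9: nine discrete velocities per lattice node */

static const double w[Q]  = { 4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                              1.0/36, 1.0/36, 1.0/36, 1.0/36 };
static const int    cx[Q] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
static const int    cy[Q] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

/* BGK collision: relax each distribution toward its local equilibrium. */
void collide(double *f, size_t nodes, double tau)
{
    for (size_t n = 0; n < nodes; n++) {
        double *fn = f + n * Q;
        double rho = 0.0, ux = 0.0, uy = 0.0;
        for (int q = 0; q < Q; q++) {      /* macroscopic density and velocity */
            rho += fn[q];
            ux  += fn[q] * cx[q];
            uy  += fn[q] * cy[q];
        }
        ux /= rho;
        uy /= rho;
        for (int q = 0; q < Q; q++) {      /* relaxation toward equilibrium */
            double cu  = cx[q] * ux + cy[q] * uy;
            double feq = w[q] * rho *
                         (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * (ux * ux + uy * uy));
            fn[q] -= (fn[q] - feq) / tau;
        }
    }
}
```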

13/18 Programming: Iteration #2 - Ideas for Additional Speedup (a) Better Data Choreography
[Figure: the revised data choreography.]
Estimation: 1.2x speed-up (as seen from the figure)

14/18 Programming: Iteration #3 - Ideas for Additional Speedup (b) Algorithmic Changes
∑∑ + ∑ + ∑ + ∑ + ∑∑ → ∑∑ + ∑ + ∑∑
Explanation: as seen from the previous figure, LS2 and LS3 can be integrated into LS1.
Estimation: 1.6x speed-up (follows from the formulae above)
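An illustrative C sketch of this kind of loop fusion, with invented array names and bodies: the work of the simple loops LS2 and LS3 moves into the nested loop LS1, so the data is traversed once instead of three times.

```c
#include <stddef.h>

/* Before: LS1 (nested), then LS2 and LS3 (simple) re-reading the same data.
 * After: one fused nested loop; LS4 and LS5 stay as they were. */
void fused_ls1(double **f, double *rowsum, size_t nx, size_t ny)
{
    for (size_t i = 0; i < nx; i++) {
        rowsum[i] = 0.0;                 /* work formerly done in LS2 */
        for (size_t j = 0; j < ny; j++) {
            f[i][j] *= 0.5;              /* original LS1 body (illustrative) */
            rowsum[i] += f[i][j];        /* work formerly done in LS3 */
        }
    }
}
```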

15/18 Programming: Iteration #4 - Ideas for Additional Speedup (c) Precision Changes
LUT count (double-precision floating point, 64 bits) = 500
LUT count (Maxeler-precision floating point, 24 bits) = 24
Explanation: with less precision, hardware complexity can be reduced by a factor of about 20, while increasing the iteration count 4 times restores approximately the same accuracy, much faster.
Estimation: factor = (500 / 24) / 4 ≈ 5
This is the only action before which a domain expert has to be consulted!
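The estimate, spelled out as a short calculation; it uses only the two LUT counts and the 4x-iterations assumption stated on the slide.

```c
#include <stdio.h>

int main(void)
{
    const double lut_double  = 500.0;  /* LUTs for 64-bit floating point (slide) */
    const double lut_maxeler = 24.0;   /* LUTs for 24-bit floating point (slide) */
    const double extra_iter  = 4.0;    /* iterations added to recover accuracy   */

    double area_ratio = lut_double / lut_maxeler;   /* ~20.8x less hardware    */
    double net_factor = area_ratio / extra_iter;    /* ~5.2x net, i.e. about 5 */

    printf("hardware reduction ~%.1fx, net speed-up factor ~%.1f\n",
           area_ratio, net_factor);
    return 0;
}
```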

16/18 Lattice Boltzmann

17/18 Results: SPT ≈ 1000
"Maxeler's technology enables organizations to speed up processing times by 20-50x, with over 90% reduction in energy usage and over 95% reduction in data centre space."
Speedup factor: 1.2 x 1.6 x 5 x N ≈ 10N - precisely: 9.6 N
Power reduction factor (i7/MAX3) = 17.6 / (MAX2/MAX3) ≈ 10 - precisely: the wall-cord method
Transistor count reduction factor = i7 / MAX3 - precisely: about 20
Cost reduction factor - precisely: depends on the production volumes

Q&A: Hawaii – Tahiti: 10 km/h! 30 km/h!!!