Accelerating an N-Body Simulation Anuj Kalia Maxeler Technologies

1. The CPU loads particle data into DRAM once per iteration (i.e., every N*N cycles).
2. A new set of 4 values is read from DRAM in every cycle.
3. 16 force computations are done per cycle, based on 16 scalar inputs and the 4 values read from DRAM (a sketch of one such pair interaction follows this list). The pipeline and accumulator are described on a later slide.
4. Every pipeline outputs 12 partial sums after N cycles.
5. The CPU adds the 12 partial sums together (for every particle), updates velocities, updates positions, and writes the updated data back into DRAM.
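Each of the 16 per-cycle force computations evaluates one particle-pair interaction. The slides do not show the arithmetic; below is a minimal C sketch assuming the standard softened gravitational form, with the softening term EPS taken from the host code and with G and the particle masses folded out (an assumption, since the slides do not show how mass is handled):

    #include <math.h>

    /* One pair interaction: acceleration contribution of particle j on
       particle i. EPS is the softening term set as a kernel scalar input
       in the host code; G and m_j are omitted here (our assumption). */
    static void pair_accel(float xi, float yi, float zi,
                           float xj, float yj, float zj,
                           float eps,
                           float *ax, float *ay, float *az)
    {
        float dx = xj - xi, dy = yj - yi, dz = zj - zi;
        float r2 = dx*dx + dy*dy + dz*dz + eps*eps; /* softened squared distance */
        float inv_r = 1.0f / sqrtf(r2);
        float inv_r3 = inv_r * inv_r * inv_r;       /* 1 / r^3 */
        *ax += dx * inv_r3;                         /* accumulate acceleration */
        *ay += dy * inv_r3;
        *az += dz * inv_r3;
    }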

for(int j=0;j<N/PAR;j++) { max_set_scalar_input(device,"RowSumKernel.N",N,FPGA_A);//set scalar inputs max_set_scalar_input_f(device,"RowSumKernel.EPS",EPS,FPGA_A); for(int p=0;p<PAR;p++) { max_set_scalar_input_f(device,pi_x[p],px[j*PAR+p],FPGA_A); max_set_scalar_input_f(device,pi_y[p],py[j*PAR+p],FPGA_A); max_set_scalar_input_f(device,pi_z[p],pz[j*PAR+p],FPGA_A); } max_run//run the kernel ( device, max_output("ax",outputX,12*PAR*sizeof(float)), max_output("ay",outputY,12*PAR*sizeof(float)), max_output("az",outputZ,12*PAR*sizeof(float)), max_runfor("RowSumKernel",N), max_end() ); for(int i=0;i<12*PAR;i++)//sum up the partial sums { ax[j*PAR+(i/12)]+=outputX[i]; ay[j*PAR+(i/12)]+=outputY[i]; az[j*PAR+(i/12)]+=outputZ[i]; } //update velocity //update position //load memory N Cycles N/PAR times Host C code
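The update-velocity and update-position steps are only placeholder comments in the listing above. A minimal sketch of what they might look like, assuming a simple explicit Euler step with a hypothetical timestep dt (the slides do not specify the integrator):

    /* Hypothetical integration step; dt and the velocity/position arrays
       are assumptions -- the slides leave these steps as comments. */
    static void euler_step(int n, float dt,
                           float *px, float *py, float *pz,
                           float *vx, float *vy, float *vz,
                           const float *ax, const float *ay, const float *az)
    {
        for (int i = 0; i < n; i++) {
            vx[i] += ax[i] * dt;  /* update velocity */
            vy[i] += ay[i] * dt;
            vz[i] += az[i] * dt;
            px[i] += vx[i] * dt;  /* update position */
            py[i] += vy[i] * dt;
            pz[i] += vz[i] * dt;
        }
    }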

Pipeline and Accumulator:
1. One input per cycle: P_j data from DRAM.
2. Acceleration: accumulated as 12 partial sums.
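A plausible reading (our assumption, not stated on the slides) is that the 12 partial sums exist to keep a pipelined floating-point adder busy: each running sum is touched only once every 12 cycles, hiding the adder's latency, and the 12 sums are combined on the CPU afterwards. A CPU-side sketch of that interleaving:

    #include <stddef.h>

    #define NBINS 12  /* matches the 12 partial sums per pipeline; the link
                         to adder latency is our assumption */

    /* Interleaved accumulation: NBINS independent running sums,
       combined in a final reduction (done by the CPU in the host code). */
    static float accumulate(const float *contrib, size_t n)
    {
        float partial[NBINS] = {0.0f};
        for (size_t i = 0; i < n; i++)
            partial[i % NBINS] += contrib[i]; /* each bin written once per 12 steps */
        float total = 0.0f;
        for (int k = 0; k < NBINS; k++)
            total += partial[k];
        return total;
    }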

Resource Usage (16-fold parallelism, 150 MHz):
  LUTs:   / (52.43%)
  FFs:    / (27.98%)
  BRAMs: 433 / 1064 (40.70%)
  DSPs:  288 / 2016 (14.29%)

Performance: Comparison (chart: runtime in seconds vs. number of particles)

Performance: Speedup (chart: speedup vs. number of particles)

(Chart label: 38,400 particles)