QCAdesigner – CUDA HPPS project


QCAdesigner – CUDA HPPS project Giovanni Paolo Gibilisco, Francesco Marconi, Marco Miglierina

Outline Introduction to QCA Why CUDA? Scope & Motivation Implementation Results Conclusions

QCA QCA (Quantum-dot Cellular Automata) is a technology based on quantum dots. Four quantum dots are arranged to form a cell, into which a few electrons are introduced. The cell has two possible stable configurations. The behaviour of each cell in a circuit is controlled by a clock signal; the circuit uses four clock signals, each shifted by a quarter of a period.

QCA Basic blocks can be created by aligning cells in different layouts. Clock phases define the flow of information through the circuit. QCADesigner is a tool developed at the University of British Columbia to design and simulate this kind of circuit. (Slide figure: a majority gate with inputs A, B, C and output Out, and a NOT gate.)

QCADesigner QCADesigner offers a GUI in which circuits can be built. It also provides two simulation engines: bistable and coherence vector. We focus only on the bistable engine.

Scope & Motivation The aim of this project is to speed up the execution of the bistable simulation engine in the QCADesigner tool by parallelizing it. Bistable-engine verification is used frequently during the development of a circuit as a form of debugging, so this simulation engine needs to be very fast to effectively support designers. The current execution time for a circuit with 6 inputs and 7000 cells is more than one hour.

Bistable simulation Characteristics of the bistable simulation engine: it is a fast simulator that simplifies the physics behind QCA. The parameters that most affect simulation time are the number of samples and the number of cells. For an exhaustive verification, the number of samples grows exponentially with the number of inputs, as the short sketch below illustrates.
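A minimal sketch of that growth (the helper name and the per-vector sample count are illustrative assumptions, not taken from QCADesigner):

    /* Exhaustive verification must exercise every input combination,
       so the total sample count scales as 2^num_inputs. */
    unsigned long long exhaustive_samples(int num_inputs,
                                          unsigned long long samples_per_vector)
    {
        return samples_per_vector << num_inputs;   /* samples_per_vector * 2^num_inputs */
    }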

Bistable Simulation (Slide figure: the simulation algorithm.) Profiling has shown that 99% of the time is spent in the cell-update loop.
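As context for that loop, a hedged C sketch of the per-sample cell update in a bistable-style engine; the data layout, the function name, and the use of the standard bistable approximation P = x / sqrt(1 + x^2), with x = (sum of kink energy * neighbor polarization) / (2 * clock), are assumptions, not the literal QCADesigner code:

    #include <math.h>

    /* Hedged sketch of the hot loop: recompute every cell's polarization from its
       neighbors' polarizations, the kink energies and the current clock value. */
    void update_cells(int num_cells, const int num_neighbors[], const int *neighbors[],
                      const double *kink_energy[], const double polarization[],
                      double new_polarization[], double clock_value)
    {
        for (int i = 0; i < num_cells; i++) {
            double sum = 0.0;
            for (int j = 0; j < num_neighbors[i]; j++)
                sum += kink_energy[i][j] * polarization[neighbors[i][j]];
            double x = sum / (2.0 * clock_value);        /* clock acts as the tunnelling energy */
            new_polarization[i] = x / sqrt(1.0 + x * x); /* bistable saturation function */
        }
    }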

CUDA To speed up the simulation we need to parallelize the computation of the new state of each cell. Since the code that updates the polarization is the same for every cell, a SIMD-style architecture can execute it efficiently. The great number of cores offered by CUDA architectures makes the GPU a good candidate for this kind of parallelization.

CUDA CUDA (Compute Unified Device Architecture) is a programming model developed by NVIDIA that allows the GPU to be used to execute parallel code. CUDA abstracts hardware parallelism with the concepts of blocks and threads. The main construct needed to exploit CUDA parallelism is the kernel, a parallel function invoked by the host and executed on the device.
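A minimal, generic illustration of the kernel concept (an element-wise scaling example, unrelated to the simulator's own kernels):

    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        if (i < n)
            data[i] *= factor;                           /* each thread handles one element */
    }

    void launch_scale(float *d_data, float factor, int n)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;       /* enough blocks to cover all n elements */
        scale<<<blocks, threads>>>(d_data, factor, n);   /* invoked by the host, executed on the device */
    }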

Implementation The main challenge in implementing the simulator has been preserving the correctness of the outputs. The parallel simulation code follows these macro-steps:

Implementation Parallel simulation algorithm: load the data file, find neighbors, and calculate the kink energies. The original structures are mapped to arrays, and indexes are saved to access the new structures in the kernel. A coloring algorithm is applied to the graph representing the circuit in order to find non-neighboring cells that can evolve simultaneously; a sketch of such a coloring step follows below.
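A hedged sketch of a greedy coloring step of that kind (the data layout and the greedy strategy are assumptions; the project may use a different coloring algorithm):

    /* Greedy coloring: neighboring cells never share a color, so all cells of one
       color can be updated in parallel without reading a value that is being written. */
    int color_cells(int num_cells, const int num_neighbors[],
                    const int *neighbors[], int color[])
    {
        int num_colors = 0;
        for (int i = 0; i < num_cells; i++) {
            int c = 0, conflict = 1;
            while (conflict) {                        /* find the lowest conflict-free color */
                conflict = 0;
                for (int j = 0; j < num_neighbors[i]; j++) {
                    int nb = neighbors[i][j];
                    if (nb < i && color[nb] == c) {   /* an already-colored neighbor uses c */
                        conflict = 1;
                        c++;
                        break;
                    }
                }
            }
            color[i] = c;
            if (c + 1 > num_colors)
                num_colors = c + 1;
        }
        return num_colors;                            /* number of colors used */
    }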

Implementation The cells that represent the inputs of the circuit are updated first; this is done on the device with an ad-hoc kernel. The color order is randomized to minimize numerical errors. The main simulation kernel is called once for each color, and each run of the kernel updates only the cells of the currently evolving color. At the end of this phase all cells have been updated with their new polarization. This process is repeated until the circuit converges. When the circuit has stabilized, the values of the output cells are stored and the loop starts again for the next sample. When all samples have been executed, the simulation output is stored in the form of a binary outcome, the polarization values of the output cells, and an image with the shape of the output waveform. A host-side sketch of this control flow follows below.
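A hedged host-side sketch of that control flow; the kernel names, parameter lists, and convergence flag are illustrative assumptions (kernel bodies omitted), not the actual project symbols:

    #include <cuda_runtime.h>

    #define THREADS 256

    /* Hypothetical kernels, declarations only. */
    __global__ void update_inputs_kernel(double *pol, const int *input_idx,
                                         int num_inputs, long sample);
    __global__ void update_color_kernel(double *pol, const double *kink,
                                        const int *neighbors, const int *color,
                                        int active_color, double clock_value,
                                        int *unstable);

    void run_simulation(long num_samples, int num_colors, int num_cells, int num_inputs,
                        double *d_pol, const double *d_kink, const int *d_neighbors,
                        const int *d_color, const int *d_input_idx, int *d_unstable,
                        double clock_value)
    {
        int blocks    = (num_cells + THREADS - 1) / THREADS;
        int in_blocks = (num_inputs + THREADS - 1) / THREADS;

        for (long s = 0; s < num_samples; s++) {
            /* Drive the input cells for this sample. */
            update_inputs_kernel<<<in_blocks, THREADS>>>(d_pol, d_input_idx, num_inputs, s);

            int unstable = 1;
            while (unstable) {                        /* iterate until the circuit converges */
                cudaMemset(d_unstable, 0, sizeof(int));
                for (int c = 0; c < num_colors; c++)  /* one launch per color; same-color cells
                                                         are never neighbors, so they can evolve
                                                         simultaneously */
                    update_color_kernel<<<blocks, THREADS>>>(d_pol, d_kink, d_neighbors,
                                                             d_color, c, clock_value,
                                                             d_unstable);
                cudaMemcpy(&unstable, d_unstable, sizeof(int), cudaMemcpyDeviceToHost);
            }
            /* Here the output cells' polarizations would be recorded for sample s. */
        }
    }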

Memory Exploitation Small structures that are frequently accessed are stored in shared memory. Data that does not change during the simulation is stored in constant memory: the array with the indexes of input and output cells, the clock value, the number of cells, the number of samples, and the stability tolerance. Large data accessed by all threads is stored in global memory: the array with all cell polarizations and the array with all kink energies.
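A hedged CUDA sketch of how that split might look; the names, array sizes, and the staging pattern are assumptions:

    /* Constant memory: read-only parameters that never change during the simulation. */
    __constant__ int    c_input_idx[64];    /* indexes of input cells (size is an assumption) */
    __constant__ int    c_output_idx[64];   /* indexes of output cells                        */
    __constant__ int    c_num_cells;
    __constant__ long   c_num_samples;
    __constant__ double c_clock_value;
    __constant__ double c_tolerance;        /* stability tolerance                            */

    __global__ void memory_sketch_kernel(const double *g_polarization,  /* global: every cell's polarization */
                                         const double *g_kink_energy,   /* global: every kink energy         */
                                         double *g_new_polarization)
    {
        __shared__ double s_pol[256];       /* shared: small, frequently re-read block of data */
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        s_pol[threadIdx.x] = (i < c_num_cells) ? g_polarization[i] : 0.0;  /* stage global data */
        __syncthreads();

        if (i < c_num_cells) {
            /* The neighbor sum using g_kink_energy and c_clock_value would go here. */
            g_new_polarization[i] = s_pol[threadIdx.x];
        }
    }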

Results Comparison between CPU and GPU simulation time.

Results Speedup.

Conclusions The new implementation scales very well with the number of cells. The maximum speedup reached is over 8x. We gained a better knowledge of both the QCA and CUDA technologies. Future improvements: better exploitation of the memory hierarchy, fixing oscillations in order to avoid the coloring step, and implementing a non-exhaustive verification.

Questions?