
Genetic Programming on General Purpose Graphics Processing Units (GPGPGPU) Muhammad Iqbal Evolutionary Computation Research Group School of Engineering and Computer Sciences

Overview
Graphics Processing Units (GPUs) are no longer limited to graphics:
- High degree of programmability
- Fast floating-point operations
GPUs are now GPGPUs. Genetic programming is a computationally intensive methodology, so it is a prime candidate for GPUs.

Outline
- Genetic Programming
- Genetic Programming Resource Demands
- GPU Programming
- Genetic Programming on GPU
- Automatically Defined Functions

Genetic Programming (GP)
- An evolutionary-algorithm-based methodology that optimizes a population of computer programs
- Tree-based representation
- Example: [slide shows an example tree mapping input X to an output]
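The tree representation can be made concrete with a small interpreter. This is an illustrative reconstruction, not code from the slides: it assumes a Boolean function set {AND, OR} and terminals that index into an input vector, and all names are ours.

```c
/* Minimal sketch of a tree-based GP individual (illustrative, not from
 * the slides): internal nodes are Boolean functions, leaves read inputs. */
typedef enum { OP_AND, OP_OR, OP_INPUT } NodeType;

typedef struct Node {
    NodeType type;
    int input_index;            /* used when type == OP_INPUT */
    struct Node *left, *right;  /* children, used for function nodes */
} Node;

/* Recursively interpret the tree against one fitness case (input vector). */
int eval(const Node *n, const int *inputs) {
    switch (n->type) {
    case OP_INPUT: return inputs[n->input_index];
    case OP_AND:   return eval(n->left, inputs) & eval(n->right, inputs);
    case OP_OR:    return eval(n->left, inputs) | eval(n->right, inputs);
    }
    return 0;
}
```

Fitness evaluation then amounts to running `eval` once per fitness case and comparing against the target output, which is exactly the work later slides off-load to the GPU.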

GP Resource Demands
GP is notoriously resource-consuming (CPU cycles and memory).
Standard GP system, 1 µs per node:
- Binary trees of depth 17: 131 ms per tree
- Fitness cases: 1,000; population size: 1,000; generations: 1,000; number of runs: 100
- Runtime: 10 Gs ≈ 317 years
Standard GP system, 1 ns per node:
- Runtime: 116 days
This limits what we can approach with GP. [Banzhaf and Harding – GECCO 2009]

Sources of Speed-up
- Fast machines
- Vector processors
- Parallel machines (MIMD/SIMD)
- Clusters
- Loose networks
- Multi-core
- Graphics Processing Units (GPUs)

General Purpose Computation on GPU
GPUs are not just for graphics operations:
- High degree of programmability
- Fast floating-point operations
- Useful for many numeric calculations
Examples: physical simulations (e.g. fluids and gases), protein folding, image processing

Why is the GPU faster than the CPU?
The GPU devotes more transistors to data processing. [CUDA C Programming Guide, Version 3.2]

GPU Programming APIs
There are a number of toolkits available for programming GPUs:
- CUDA
- MS Accelerator
- RapidMind
- Shader programming
So far, researchers in GP have not converged on one platform.

CUDA Programming
A massive number (>10,000) of lightweight threads.

CUDA Memory Model
CUDA exposes all the different types of memory on the GPU [CUDA C Programming Guide, Version 3.2]:
- Per thread: registers and local memory
- Per block: shared memory
- Device-wide, accessible to all threads in the grid and to the host: global, constant, and texture memory

CUDA Programming Model
- The GPU is viewed as a computing device operating as a coprocessor to the main CPU (host).
- Data-parallel, computationally intensive functions should be off-loaded to the device.
- Functions that are executed many times, but independently on different data, are prime candidates, e.g. the body of a for-loop.
- A function compiled for the device is called a kernel.


Stop Thinking About What to Do and Start Doing It!
- Memory transfer time is expensive; computation is cheap.
- Rather than calculating a value once and storing it in memory, simply recalculate it when needed.
- Built-in variables: threadIdx, blockIdx, gridDim, blockDim
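The built-in variables above combine into a unique global index for each thread. A plain-C sketch of that arithmetic (the function names are ours; in a real kernel the values come from CUDA's threadIdx.x, blockIdx.x, blockDim.x and gridDim.x):

```c
/* The standard CUDA indexing pattern for a 1-D launch, emulated in
 * plain C: each thread derives the global element it is responsible
 * for from its block index, block size, and thread index. */
int global_index(int blockIdx_x, int blockDim_x, int threadIdx_x) {
    return blockIdx_x * blockDim_x + threadIdx_x;
}

/* Total number of threads launched = gridDim.x * blockDim.x. */
int total_threads(int gridDim_x, int blockDim_x) {
    return gridDim_x * blockDim_x;
}
```

For example, thread 5 of block 2 in a launch with 256-thread blocks handles global element 2 × 256 + 5 = 517.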

Example: Increment Array Elements
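The slide's code is an image that did not survive the transcript. A typical version of the increment kernel, given as a hedged sketch (function names, the 256-thread block size, and the host wrapper are illustrative reconstructions, not the slide's original code):

```cuda
// Sketch of an element-wise increment kernel: one thread per element,
// with a bounds check because the grid may be larger than the array.
__global__ void increment(float *a, float b, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] = a[idx] + b;
}

// Host-side launch: enough 256-thread blocks to cover all n elements.
// d_a is assumed to already be device memory (e.g. from cudaMalloc).
void increment_on_device(float *d_a, float b, int n)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up
    increment<<<blocks, threads>>>(d_a, b, n);
}
```

The rounded-up block count is why the `idx < n` guard is needed: the last block may contain threads past the end of the array.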

Example: Matrix Addition

Example: Matrix Addition (continued)
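As with the previous example, the slide's code is not in the transcript. A common formulation of 2-D matrix addition with one thread per element, offered as a sketch (names, row-major storage, and the 16×16 block size are assumptions):

```cuda
// Sketch of 2-D matrix addition: thread (x, y) of each block maps to one
// (col, row) element. Matrices are stored row-major with `cols` columns.
__global__ void matAdd(const float *A, const float *B, float *C,
                       int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)
        C[row * cols + col] = A[row * cols + col] + B[row * cols + col];
}

// Host launch: a 2-D grid of 16x16 thread blocks covering the matrix.
// dA, dB, dC are assumed to be device pointers.
void matAdd_on_device(const float *dA, const float *dB, float *dC,
                      int rows, int cols)
{
    dim3 threads(16, 16);
    dim3 blocks((cols + threads.x - 1) / threads.x,
                (rows + threads.y - 1) / threads.y);
    matAdd<<<blocks, threads>>>(dA, dB, dC, rows, cols);
}
```

This is the pattern the earlier CUDA Grids, Blocks, and Threads model supports directly: blockIdx and threadIdx each have x and y components, so the 2-D problem maps onto the hardware without manual index linearization for the grid.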

Parallel Genetic Programming
While most GP work is conducted on sequential computers, the following computationally intensive features make it well suited to parallel hardware:
- Individuals are run on multiple independent training examples.
- The fitness of each individual could be calculated on independent hardware in parallel.
- Multiple independent runs of the GP are needed for statistical confidence, owing to the stochastic element of the result.
[Langdon and Banzhaf, EuroGP-2008]

A Many-Threaded CUDA Interpreter for Genetic Programming
Running tree GP on the GPU: 8692 times faster than a PC without a GPU.
Solved the 20-bit multiplexor:
- 2^20 = 1,048,576 fitness cases
- Never solved by tree GP before; previously estimated to take more than 4 years
- The GPU has consistently done it in less than an hour
Solved the 37-bit multiplexor:
- 2^37 ≈ 137 billion fitness cases
- Never attempted before; the GPU solves it in under a day
[W.B. Langdon, EuroGP-2010]

Boolean Multiplexor
With a address lines there are d = 2^a data lines, giving n = a + d inputs in total.
Number of test cases = 2^n:
- 20-mux (a = 4): 1 million test cases
- 37-mux (a = 5): 137 billion test cases
[W.B. Langdon, EuroGP-2010]
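The definitions above can be checked with a short C sketch. The helper names and the bit-packing convention (low a bits are the address, the next 2^a bits are the data lines) are our own illustration, not from the slides:

```c
#include <stdint.h>

/* Boolean multiplexor with `a` address bits and d = 2^a data bits.
 * `inputs` packs all n = a + d input bits: bits 0..a-1 are the address,
 * bits a..n-1 are the data lines. Returns the selected data bit. */
int multiplexor(uint64_t inputs, int a)
{
    uint64_t address = inputs & ((1ULL << a) - 1);   /* low a bits     */
    return (int)((inputs >> (a + address)) & 1ULL);  /* addressed line */
}

/* Number of fitness cases for the multiplexor with `a` address bits:
 * n = a + 2^a inputs, hence 2^n cases. */
uint64_t num_cases(int a)
{
    int n = a + (1 << a);
    return 1ULL << n;
}
```

Plugging in the slide's values: a = 4 gives the 20-mux with 2^20 = 1,048,576 cases, and a = 5 gives the 37-mux with 2^37 = 137,438,953,472 cases.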

Genetic Programming Parameters for Solving the 20- and 37-Multiplexors
Terminals: 20 or 37 Boolean inputs, D0–D19 or D0–D36 respectively
Functions: AND, OR, NAND, NOR
Fitness: pseudo-random sample of 2048 of the 2^20 (20-Mux) or 8192 of the 2^37 (37-Mux) fitness cases
Tournament: 4 members run on the same random sample; new samples for each tournament and each generation
Initial population: ramped half-and-half, depths 4:5 (20-Mux) or 5:7 (37-Mux)
Parameters: 50% subtree crossover, 5% subtree mutation, 45% point mutation; max depth 15, max size 511 (20-Mux) or 1023 (37-Mux)
Termination: 5000 generations
Solutions are found in generations 423 (20-Mux) and 2866 (37-Mux).
[W.B. Langdon, EuroGP-2010]

AND, OR, NAND, NOR
The symbols used for the four functions in the evolved trees: AND is &, OR is |, NAND is d, NOR is r.
[slide shows the truth tables for X & Y, X d Y, X r Y, and X | Y]

Evolution of 20-Mux and 37-Mux [W.B. Langdon, EuroGP-2010]

6-Mux Tree I [W.B. Langdon, EuroGP-2010]

6-Mux Tree II [W.B. Langdon, EuroGP-2010]

6-Mux Tree III [W.B. Langdon, EuroGP-2010]

Ideal 6-Mux Tree

Automatically Defined Functions (ADFs)
- Genetic programming trees often have repeated patterns.
- Repeated subtrees can be treated as subroutines.
- ADFs are a methodology to automatically select and implement modularity in GP.
- This modularity can reduce the size of the GP tree and improve readability.
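To make the idea concrete, a hand-written C sketch of what an ADF buys. The names adf0 and main_branch echo Koza's terminology (ADF0, result-producing branch) but the example itself is illustrative, not from the slides:

```c
/* Without ADFs, an evolved tree may repeat the same subtree, e.g.
 *     (x0 AND x1) OR ((x0 AND x1) AND x2)
 * An ADF factors the repeated pattern into one evolved subroutine. */
static int adf0(int x, int y) {
    return x & y;               /* the shared, evolved subroutine */
}

/* The result-producing branch calls the ADF twice instead of
 * duplicating the subtree, shrinking the tree and aiding readability. */
int main_branch(int x0, int x1, int x2) {
    return adf0(x0, x1) | (adf0(x0, x1) & x2);
}
```

The factored form computes the same Boolean function as the duplicated tree while storing the repeated subtree only once.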

Langdon's CUDA Interpreter with ADFs
ADFs slow down the interpreter:
- 20-Mux takes 9 hours instead of less than an hour
- 37-Mux takes more than 3 days instead of less than a day
Improved ADF implementation:
- Previously one thread per GP program; now one thread block per GP program
- Increased level of parallelism, reduced divergence
- 20-Mux takes 8 to 15 minutes; 37-Mux takes 7 to 10 hours

ThreadGP Scheme
- Every GP program is interpreted by its own thread.
- All fitness cases for a program evaluation are computed on the same stream processor.
- As several threads interpreting different programs run on each multiprocessor, a higher level of divergence may be expected than with the BlockGP scheme.
[Denis Robilliard, Genetic Programming and Evolvable Machines, 2009]

BlockGP Scheme
- Every GP program is interpreted by all threads running on a given multiprocessor.
- No divergence due to differences between GP programs, since multiprocessors are independent.
- However, divergence can still occur between stream processors on the same multiprocessor, when an if structure resolves into the execution of different branches within the set of fitness cases that are processed in parallel.
[Denis Robilliard, Genetic Programming and Evolvable Machines, 2009]

6-Mux with ADF (I)

6-Mux with ADF (II)

6-Mux with ADF (III)

Conclusion 1: GP
- A powerful machine learning algorithm, capable of searching through trillions of states to find a solution
- Evolved trees often have repeated patterns and can be compacted by ADFs
- But computationally expensive

Conclusion 2: GPU
- Computationally fast at relatively low cost
- Requires a new, but practical, programming paradigm
- Accelerates processing by up to 3000 times for computationally intensive problems
- But not well suited to memory-intensive problems

Acknowledgements
- Dr Will Browne and Dr Mengjie Zhang for supervision
- Kevin Buckley for technical support
- Eric for helping with CUDA compilation
- Victoria University of Wellington for awarding the "Victoria PhD Scholarship"
- All of you for coming

Thank You. Questions?