Team Members: Tyler Drake, Robert Wrisley, Kyle Von Koepping, Justin Walsh
Faculty Advisors: Computer Science – Prof. Sanjay Rajopadhye; Electrical & Computer Engineering – Prof. Olivera Notaros
Project Goals: To develop parallel versions of applications that run on a graphics card and to measure their performance.
– Started with a simple Matrix Multiply program.
– We intend to develop at least one or two additional applications and to pursue an analysis of hardware optimizations.
– Develop a process for tuning applications and hardware that other developers can follow more easily.
Tyler Drake – Computer Science major
Robert Wrisley – Computer Science/Computer Engineering dual major
Kyle Von Koepping – Electrical Engineering major
Justin Walsh – Computer Science/Computer Engineering dual major
Shared coding responsibilities
– Enables comparison and greater understanding for all team members
– Possibly divide responsibilities for the second half of the project
Transistor densities on single-core processors were doubling approximately every 18 months. This trend has held since it was first observed in 1965 and is expected to hold for several more years. This natural trend became the standard goal for hardware companies.
There is an ultimate limit to Moore's law: transistors will soon approach atomic scale. Moore's law also does not apply to Random Access Memory (RAM) speeds or hard drive seek times (the so-called Memory Wall). The redesign of processor architecture isn't driven directly by Moore's Law, but by the fact that these and other factors have not kept up with its growth rate.
The CPU (or multiple CPUs) is not the only processor found in a personal computer. The graphics card has a graphics processing unit (GPU). The GPU is specifically designed to render 3D models onto a 2D display, and it is built for floating-point computation with a highly parallel architecture.
Engineers have begun to exploit the highly parallel architecture of the GPU for general applications. Graphics companies encourage general-purpose computing on the GPU (GPGPU). Nvidia has developed CUDA (Compute Unified Device Architecture). Because CUDA is based on the C language, programmers can easily shift to developing on the GPU.
What We Have Done So Far
Learning about CUDA
– NVIDIA CUDA guides
– Lecture slides from the University of Illinois at Urbana-Champaign
– Papers from various academic groups: University of Illinois at Urbana-Champaign, Tokyo Institute of Technology, University of California at Berkeley
Learning to write parallel programs in CS475 using MPI & OpenMP
Writing simple programs using CUDA and observing performance
– Matrix Multiply
Results
– Achieved 131 GFLOPS on a GTX 280 with N = 1024; the GTX 280's peak is 933 GFLOPS.
Optimizations
– Tiling the result matrix into smaller sub-matrices and having each thread block compute one sub-matrix reduces the amount of data each thread block must load. This helps reduce memory latency.
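The tiling idea above can be sketched in plain C. This is a minimal CPU sketch, not the team's CUDA kernel: the sizes `N` and `TILE` are illustrative, and in the CUDA version each `(ti, tj)` tile would be one thread block staging its strips of A and B in shared memory.

```c
#include <stdio.h>

#define N 4      /* matrix dimension (small, for illustration only) */
#define TILE 2   /* tile edge; in CUDA this would match the thread-block size */

/* Reference: plain triple loop, C = A * B. */
void matmul_naive(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Tiled version: each (ti, tj) tile of C is computed from TILE-wide strips
 * of A and B -- the data a CUDA thread block would load into shared memory. */
void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0f;
    for (int ti = 0; ti < N; ti += TILE)
        for (int tj = 0; tj < N; tj += TILE)
            for (int tk = 0; tk < N; tk += TILE)      /* loop over tiles of k */
                for (int i = ti; i < ti + TILE; i++)
                    for (int j = tj; j < tj + TILE; j++)
                        for (int k = tk; k < tk + TILE; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Both versions do the same arithmetic; the tiled loop order is what lets each block of work reuse a small, cache- (or shared-memory-) sized chunk of A and B.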
Memory
– Must allocate memory on the graphics card from the main program running on the CPU
– Memory on the graphics card is explicitly managed by the programmer
An "extension" to C, not a separate language
– Similar to MPI, OpenMP, etc.
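The host-side pattern described above might look like the following CUDA fragment. This is only a sketch (the kernel `scale` and the size `n` are hypothetical), and it needs the CUDA toolkit and an Nvidia GPU to compile and run:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d_a, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_a[i] *= s;                /* one thread per element */
}

int main(void) {
    const int n = 1024;
    float h_a[1024];                       /* host (CPU) buffer */
    for (int i = 0; i < n; i++) h_a[i] = (float)i;

    float *d_a;                            /* device (GPU) buffer */
    cudaMalloc((void **)&d_a, n * sizeof(float));                     /* allocate on the card */
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);  /* CPU -> GPU */

    scale<<<n / 256, 256>>>(d_a, 2.0f, n); /* launch kernel */

    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);  /* GPU -> CPU */
    cudaFree(d_a);                         /* the programmer frees device memory explicitly */
    return 0;
}
```

Every allocation, copy, and free on the card is an explicit call from the CPU program, which is what "explicitly managed by the programmer" means in practice.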
Increasing problem complexity
– Some are no longer "pleasantly parallel"
– Higher degree of kernel analysis
– Moving to more dynamic programs
Additional programs being written for the GPU include:
– Scan: a prefix-sum computation where the ith entry of the result is the sum of the previous i-1 entries
– Knapsack: profit maximization given a capacity and a list of items with their weights & profits
– Matrix Multiply for still larger matrices
– Triangular Matrix Multiplication
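Sequential reference versions of the first two problems above can be sketched in a few lines of C. These are not the team's GPU kernels, just the baseline computations (function names and the capacity bound are ours); the GPU versions parallelize exactly these loops.

```c
#include <stdio.h>

/* Exclusive scan (prefix sum): out[i] = sum of in[0..i-1], so out[0] = 0. */
void scan_exclusive(const int *in, int *out, int n) {
    int running = 0;
    for (int i = 0; i < n; i++) {
        out[i] = running;
        running += in[i];
    }
}

/* 0/1 knapsack: maximize total profit subject to a weight capacity.
 * best[w] = best profit achievable with capacity w. */
int knapsack(const int *weight, const int *profit, int n, int cap) {
    int best[256] = {0};                     /* assumes cap < 256 in this sketch */
    for (int i = 0; i < n; i++)
        for (int w = cap; w >= weight[i]; w--)   /* descending: each item used once */
            if (best[w - weight[i]] + profit[i] > best[w])
                best[w] = best[w - weight[i]] + profit[i];
    return best[cap];
}
```

Scan looks inherently serial because of the running sum, which is why its parallel formulation is a classic GPU exercise; the knapsack table has a row-to-row dependence but each row can be filled in parallel.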
Mandelbrot Set
– Pleasantly parallel, familiar
– Easily scalable
Ray Tracing
– Very computationally intensive
– Feasible for non-realtime computations
– Very dynamic, due to recursion
– High degree of realism
Examples of images generated by Ray Tracing
Hidden Markov Models
– Clear parallelism
– Wide range of applications
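The standard HMM computation is the forward algorithm; a small C sketch (toy sizes and a two-symbol alphabet are our assumptions) makes the parallelism visible: at each time step, every state's new probability can be computed independently.

```c
#define S 2   /* number of hidden states (toy size) */
#define T 3   /* length of the observation sequence (toy size) */

/* Forward algorithm: probability of observing obs[] under an HMM with
 * initial probabilities pi, transition matrix A, and emission matrix B
 * (B[s][o] = probability state s emits symbol o, with 2 symbols here). */
double hmm_forward(const double pi[S], const double A[S][S],
                   const double B[S][2], const int obs[T]) {
    double alpha[S], next[S];
    for (int s = 0; s < S; s++)
        alpha[s] = pi[s] * B[s][obs[0]];
    for (int t = 1; t < T; t++) {
        for (int j = 0; j < S; j++) {            /* each j is independent work */
            double sum = 0.0;
            for (int i = 0; i < S; i++)
                sum += alpha[i] * A[i][j];       /* reach state j from any i */
            next[j] = sum * B[j][obs[t]];
        }
        for (int s = 0; s < S; s++) alpha[s] = next[s];
    }
    double p = 0.0;
    for (int s = 0; s < S; s++) p += alpha[s];
    return p;
}
```

The inner loop over states is a matrix-vector product, so a GPU version maps each state (or block of states) to a thread at every time step.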
Uses of Hidden Markov Models
To develop a more complex application for the GPU and optimize its performance
To analyze hardware optimizations and evaluate the performance gains
To develop a process for future programmers that gives them the best performance increases with minimum development effort
Please note: these goals are tentative and subject to change.
Moore's Law is now being applied to cores per processor instead of transistors per processor. Multi-core machines offer the next generation of performance enhancements, and they are already here. GPUs provide massively parallel architectures that programmers can take advantage of to see phenomenal performance gains.
Learning to use the CUDA library and some of its nuances.
Have gotten good performance on Matrix Multiply attempts.
Also completing CUDA versions of the Scan and Knapsack problems.
Move on to a more complex application.
Researching hardware optimizations that can further enhance performance on GPUs.
Develop a combined approach for future applications programmers to follow.
$50 spent on a CUDA-compatible graphics card.
We'd like to thank Prof. Dan Connors for the use of his machines with Nvidia GTX 280 graphics cards.
– This gave us free access to a consistent build on which all of us could run our code and sample code.
We don't project any major costs next semester, except perhaps some materials for our E-Days presentation.