Dan Iannuzzi Kevin Pine CS 680

Outline
- The Problem
- Recap of CS676 Project
- Goals of This Research
- Approach
- Parallelization Attempts
- Results
- Difficulties Encountered
- Lessons Learned
- Future Work Opportunities
- Conclusions

The Problem
- Every day, millions of trades are recorded per stock
- Traders want to test a given trading strategy on some combination of stock indicators
- We must get hold of all of this data for each stock
- Run all desired stock analyses
- Simulate buy/sell actions based on the analyses
- Display the results

Recap of CS676 Project
- Stock data is stored in many CSV files, each stock having many data points
- Read and store the stock data
- Loop over each stock:
  - Run calculations for the 3 chosen stock market analysis indicators
  - Track the buy/sell signals from each indicator
  - Buy/sell stock as appropriate, recording whether each sell is a gain or a loss
- Print the number of trades, number of gains, number of losses, average gain, and average loss
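A minimal C sketch of what one pass of that sequential simulation might look like, assuming the three indicator signals have already been merged into one signal per data point; StockData, Signal, and simulateStock are illustrative names, not the project's actual ones:

```cuda
#include <stdio.h>

/* Illustrative names only; the actual CS676 code used STL containers. */
typedef struct { float close; } StockData;
typedef enum { HOLD, BUY, SELL } Signal;

void simulateStock(const StockData *pts, const Signal *sig, int n) {
    int trades = 0, gains = 0, losses = 0, holding = 0;
    float buyPrice = 0.0f, gainSum = 0.0f, lossSum = 0.0f;
    for (int i = 0; i < n; i++) {
        if (sig[i] == BUY && !holding) {
            /* open a position at the current closing price */
            buyPrice = pts[i].close;
            holding = 1;
        } else if (sig[i] == SELL && holding) {
            /* close the position and record gain or loss */
            float diff = pts[i].close - buyPrice;
            trades++;
            if (diff > 0.0f) { gains++;  gainSum += diff; }
            else             { losses++; lossSum -= diff; }
            holding = 0;
        }
    }
    printf("trades=%d gains=%d losses=%d avgGain=%.2f avgLoss=%.2f\n",
           trades, gains, losses,
           gains  ? gainSum / gains  : 0.0f,
           losses ? lossSum / losses : 0.0f);
}
```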

Parallelization Done in CS676
- Two main types of parallelization were performed:
  - File I/O, parallelized using an OpenMP loop
  - Calculation of the 3 indicators for each stock, done using OpenMP:
    - Stock data is stored in a map from stock name to list of data
    - Each thread moves a private map iterator forward by the number of threads (see the striding sketch below)
    - Each thread processes the full list of stock data at its iterator
- Further performance refinements were made based on the initial results observed
- Results:
  - The focus was on parallelizing the simulation
  - Reached a simulation speedup of about 9
  - Simulation efficiency stayed above 0.9 up to 10 threads
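A sketch of that striding scheme; the real project strode an STL map iterator, while this hypothetical version (StockSeries, processStock, processAllStocks are all illustrative names) deals an array of per-stock series out to threads round-robin:

```cuda
#include <omp.h>

typedef struct {
    const char *name;   /* stock symbol */
    int         n;      /* number of data points for this stock */
    /* ...price series... */
} StockSeries;

void processStock(StockSeries *stock);   /* indicators + simulation, defined elsewhere */

void processAllStocks(StockSeries *stocks, int numStocks) {
    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        /* Start at this thread's offset and stride by the thread
         * count, so stocks are dealt out round-robin. */
        for (int s = tid; s < numStocks; s += nthreads)
            processStock(&stocks[s]);
    }
}
```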

Goals of This Research
- Analyze a CUDA implementation to determine its speedup over a sequential C implementation
- Analyze different types of CUDA programming strategies:
  - Work split across multiple GPUs
  - Different types of GPU memory (e.g., pinned vs. shared vs. constant)
- Time various aspects of the implementation:
  - Copy time to and from the device (we found most of our time was spent here)
  - Computation time needed for the buy/sell simulation

Approach
- Convert the C++ implementation to C
  - Simplified the data read by condensing the data into 1 file
  - Replaced C++ facilities with C equivalents (e.g., STL maps with C structs)
- Compile with the nvcc compiler and verify that functionality matches the C++ version by comparing outputs on the same data set
- Convert CPU methods to device methods
  - Launch a thread per set of stock data points
  - Each thread is responsible for fully processing all indicators for one of the stock's data points (see the kernel sketch below)
- Experiment with different implementations and approaches to parallelizing the computations on the GPU using various CUDA mechanisms
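A hedged sketch of that thread-per-data-point layout; the grid-stride loop, the field names, and the launch wrapper are assumptions (the real StockData struct was 112 bytes with more fields), but the 1024-block x 128-thread launch matches the configuration reported later:

```cuda
#include <cuda_runtime.h>

/* Stand-ins for the real structs (112 and 24 bytes respectively). */
typedef struct { float close; /* ...more fields... */ } StockData;
typedef struct { int signal;  /* ...more fields... */ } ResultData;

__global__ void indicatorKernel(const StockData *data,
                                ResultData *results, int n) {
    /* Grid-stride loop: with a fixed 1024-block x 128-thread launch
     * there are 131,072 threads for millions of points, so each
     * thread fully processes several data points. */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        /* ...compute all three indicators for point i and record
         *    the combined buy/sell signal... */
        results[i].signal = 0;   /* placeholder */
    }
}

void launchIndicators(const StockData *dData, ResultData *dResults, int n) {
    indicatorKernel<<<1024, 128>>>(dData, dResults, n);
    cudaDeviceSynchronize();
}
```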

Parallelization Attempts
- Each thread handles a set of stock data elements from the original data set, and the 3 technical indicator calculations are done in parallel
  - Achieved approximately 2.2x speedup
  - Concluded we spent too much time copying memory
- Attempted to use zero-copy pinned memory to remove the copying cost (sketched below)
  - Saw very poor performance; concluded we simply had too many reads and too large a penalty per read
  - We also believe this would have been much more successful with an integrated GPU
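A sketch of what the zero-copy attempt might have looked like, reusing the hypothetical StockData/ResultData/indicatorKernel names from the earlier sketch; the mapped pinned buffer makes every kernel read cross the PCIe bus, which is consistent with the poor performance described above:

```cuda
#include <cuda_runtime.h>

void runZeroCopy(int n) {
    StockData  *hData, *dData;
    ResultData *dResults;

    cudaSetDeviceFlags(cudaDeviceMapHost);   /* enable mapped pinned memory */
    cudaHostAlloc((void **)&hData, n * sizeof(StockData),
                  cudaHostAllocMapped);
    /* ...fill hData from the condensed data file... */
    cudaHostGetDevicePointer((void **)&dData, hData, 0);
    cudaMalloc((void **)&dResults, n * sizeof(ResultData));

    /* No explicit copy, but every read of dData inside the kernel
     * crosses the PCIe bus; many small reads make this expensive. */
    indicatorKernel<<<1024, 128>>>(dData, dResults, n);
    cudaDeviceSynchronize();

    cudaFree(dResults);
    cudaFreeHost(hData);
}
```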

Attempts (Cont'd)
- Attempted to increase the data set size, but hit memory limitations on the GPU, so we tried blocking the GPU calls (see the chunking sketch below)
  - Allowed us to increase the data to 8, 16, and 32 times the original data set
  - Saw only a 2.4x speedup; concluded we simply did not have enough computation per data point and were spending all our time copying memory
- Reduced the size of the data structure being copied
  - This took much less of a performance hit from the memory copying, and we saw a speedup of around 3.55x
  - We felt that without reworking the structure of the program we were losing data, so we abandoned this approach, but it did show how strong the memory-copying penalty was
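A sketch of the blocking scheme under the same assumed names; the chunk size is left as a parameter rather than tied to the 8x/16x/32x data sets above:

```cuda
#include <cuda_runtime.h>

void runChunked(const StockData *hData, ResultData *hResults,
                long total, long chunk) {
    StockData  *dData;
    ResultData *dResults;
    cudaMalloc((void **)&dData,    chunk * sizeof(StockData));
    cudaMalloc((void **)&dResults, chunk * sizeof(ResultData));

    for (long off = 0; off < total; off += chunk) {
        long n = (total - off < chunk) ? (total - off) : chunk;
        /* copy in, compute, copy out; one chunk at a time */
        cudaMemcpy(dData, hData + off, n * sizeof(StockData),
                   cudaMemcpyHostToDevice);
        indicatorKernel<<<1024, 128>>>(dData, dResults, (int)n);
        cudaMemcpy(hResults + off, dResults, n * sizeof(ResultData),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dData);
    cudaFree(dResults);
}
```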

More Attempts
- Use two GPUs, which in theory should decrease the time spent copying data since the copies are done in parallel (sketched below)
  - With the original data set this yielded slightly better results than 1 GPU
  - Again concluded our problem was not enough computation per data point transferred to the GPU
- Increased the computation per data point by running 2 of the 3 indicators x times over
  - Combined with multiple GPUs, this is the final project result, discussed shortly
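A sketch of one way the two-GPU split might be structured, again with the assumed names from the earlier sketches; each device receives half the data, and async copies (which require pinned host memory to actually overlap) let the two devices' transfers proceed in parallel:

```cuda
#include <cuda_runtime.h>

void runTwoGPUs(const StockData *hData, ResultData *hResults, long total) {
    StockData  *dData[2];
    ResultData *dResults[2];
    long off[2] = { 0,         total / 2 };
    long cnt[2] = { total / 2, total - total / 2 };

    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaMalloc((void **)&dData[dev],    cnt[dev] * sizeof(StockData));
        cudaMalloc((void **)&dResults[dev], cnt[dev] * sizeof(ResultData));
        /* hData must be pinned (cudaHostAlloc) for these async
         * copies to overlap across the two devices. */
        cudaMemcpyAsync(dData[dev], hData + off[dev],
                        cnt[dev] * sizeof(StockData),
                        cudaMemcpyHostToDevice);
        indicatorKernel<<<1024, 128>>>(dData[dev], dResults[dev],
                                       (int)cnt[dev]);
        cudaMemcpyAsync(hResults + off[dev], dResults[dev],
                        cnt[dev] * sizeof(ResultData),
                        cudaMemcpyDeviceToHost);
    }
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(dData[dev]);
        cudaFree(dResults[dev]);
    }
}
```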

Partial Attempts
- Shared memory
  - Attempted to put stock data that all threads in a block would need into shared memory
  - Realized what we were doing really didn't make sense for shared memory (no relation between the threads' work)
- Use constant memory for the stock data, since only read access is needed (sketched below)
  - Constant memory is only 64 KB and each stock data struct is 112 bytes, so only 585 stock data points fit in constant memory at a time; this would require lots of blocking (over 6 million data points in our data set, and it can easily be billions!)
  - Tests on a small data set showed no increase in performance, but perhaps the data set was being cached in the sequential version; no further work was done
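A sketch of the constant-memory variant described above; 64 KB / 112 bytes = 585, which is where the 585-point block size comes from (the tiny StockData used in these sketches stands in for the real 112-byte struct, and constantKernel is a hypothetical name):

```cuda
#include <cuda_runtime.h>

#define CONST_PTS 585   /* 64 KB / 112-byte struct = 585 points */

__constant__ StockData cData[CONST_PTS];

__global__ void constantKernel(ResultData *results, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x) {
        /* reads of cData[] go through the constant cache */
        results[i].signal = (cData[i].close > 0.0f);   /* placeholder */
    }
}

void runConstantBlocked(const StockData *hData, ResultData *hResults,
                        long total) {
    ResultData *dResults;
    cudaMalloc((void **)&dResults, CONST_PTS * sizeof(ResultData));
    /* Only 585 points fit at once, so the whole data set has to be
     * fed through in small blocks: the "lots of blocking" above. */
    for (long off = 0; off < total; off += CONST_PTS) {
        int n = (int)((total - off < CONST_PTS) ? (total - off) : CONST_PTS);
        cudaMemcpyToSymbol(cData, hData + off, n * sizeof(StockData));
        constantKernel<<<1024, 128>>>(dResults, n);
        cudaMemcpy(hResults + off, dResults, n * sizeof(ResultData),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dResults);
}
```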

Experimental Machine
- Timings were conducted on float.cs.drexel.edu
  - Float has 8 cores at 2.13 GHz, a 12 MB cache, and 28 GB of RAM
  - Float has 2 GeForce GTX 580 cards, each with a maximum of 1024 threads per block and 65,535 blocks per grid dimension
- Testing was done by manually comparing answers against the known-correct sequential program from CS676
- All graphed numbers were generated by taking 3 samples; the other numbers mentioned were not produced through a formal timing process
- We used 1024 blocks and 128 threads for all tests, as that seemed to yield the best results in spot testing
- The implementation benchmarked uses 1 and 2 GPUs, varying the number of indicators calculated

We were unable to calculate computations per second due to the large number of operations involved in the various indicators; the runtimes are provided for general reference.

Memory Copying Analysis
- StockData struct size = 112 bytes
- ResultData struct size = 24 bytes
- Size of int = 4 bytes
- Num stocks: 2797
- Stock data size: approx. 837 MB (at 112 bytes per struct, roughly 7.8 million data points)
- Result data size: approx. 179 MB
- Min index size: approx. 11 KB
- Total memory: approx. 1 GB
- This was split over 2 devices, so about 500 MB is copied per device

Computeprof Results
- With 100x indicators, 3.51% of the time was spent on memory copying
- With 1x indicators, 64.2% of the time was spent on memory copying
- These results match our expectation: without enough computation, the memory-copying penalty is too steep to see much performance gain
- We also conclude that with a large number of indicators streams will not be helpful, but with a smaller number we can make use of them, along with many GPUs, to increase overall performance

Difficulties Encountered
- Converting a C++ program to a C program is difficult
  - The hardest part was all the manual memory handling needed for our C structs, compared with the STL
- There are lots of options to weigh when parallelizing with CUDA

Lessons Learned
- CUDA is very application specific
- Lots of different tradeoffs must be weighed to find the best approach to parallelization on the GPU:
  - Number of blocks and number of threads per block
  - Multiple streams vs. a single stream
  - The best way to implement across multiple devices
- Need to invest time to understand the tools available to a developer using CUDA:
  - Debugger
  - Profiler (computeprof)

Future Work Opportunities
- Implement more complex indicators
- Implement indicators whose computations can be split over the threads, instead of having one thread do all the computations for each stock data point; in this scenario shared memory becomes much more useful!
- Use multiple streams to avoid the long upfront delay of copying stock data (see the sketch after this list)
- Implement on an integrated GPU to avoid the penalty of copying across PCI Express
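As referenced in the list above, a hypothetical multi-stream sketch: chunks are dealt round-robin across a few streams so a later chunk's upload can overlap an earlier chunk's kernel; the stream count is illustrative, and hData/hResults must be pinned (cudaHostAlloc) for the async copies to overlap:

```cuda
#include <cuda_runtime.h>

#define NSTREAMS 4   /* illustrative stream count */

void runStreamed(const StockData *hData, ResultData *hResults,
                 long total, long chunk) {
    cudaStream_t streams[NSTREAMS];
    StockData   *dData[NSTREAMS];
    ResultData  *dResults[NSTREAMS];

    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&dData[i],    chunk * sizeof(StockData));
        cudaMalloc((void **)&dResults[i], chunk * sizeof(ResultData));
    }

    int s = 0;
    for (long off = 0; off < total; off += chunk) {
        long n = (total - off < chunk) ? (total - off) : chunk;
        /* Work within stream s is ordered, so its buffers are safe to
         * reuse; different streams' chunks overlap copy and compute. */
        cudaMemcpyAsync(dData[s], hData + off, n * sizeof(StockData),
                        cudaMemcpyHostToDevice, streams[s]);
        indicatorKernel<<<1024, 128, 0, streams[s]>>>(dData[s],
                                                      dResults[s], (int)n);
        cudaMemcpyAsync(hResults + off, dResults[s],
                        n * sizeof(ResultData),
                        cudaMemcpyDeviceToHost, streams[s]);
        s = (s + 1) % NSTREAMS;
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamDestroy(streams[i]);
        cudaFree(dData[i]);
        cudaFree(dResults[i]);
    }
}
```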

Conclusions
- In scenarios where there is a large amount of data that the GPU will need, you need more GPUs: speedup rose from 4.2 to 8.4 by using 2 GPUs at 2001 indicators
- There must be enough computation to offset the cost of copying to the GPU
- This application is much more data intensive than computation intensive per data point, which may not be a good fit for the GPU without considerable redesign of the problem (or different, more complex indicators)
- Speedup was not as great as we had hoped
- There are lots of opportunities to make this research better
- We learned a lot about CUDA in a short amount of time

Questions/Comments?

Technical Indicators Used
- Moving Average Convergence/Divergence (MACD) (sketched after this list)
  - Measures the momentum of a stock
  - Calculated as the difference between two exponential moving averages over the last n days
  - A shorter exponential moving average of the MACD value is used as the signal line
  - Movement of the MACD relative to the signal line indicates the start and end of trends
- Relative Strength Index (RSI)
  - A momentum oscillator indicating the velocity and magnitude of price movement
  - Measured from 0 to 100
  - Above 70 suggests overbought; below 30 suggests oversold
- Stochastic Oscillator
  - A momentum indicator comparing the closing price to the price range over a period of time
  - Ranges from 0 to 100
  - Above 80 suggests overbought; below 20 suggests oversold
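A hedged C sketch of the MACD calculation described above; the slide does not state which periods were used, so the common 12/26-day EMAs and 9-day signal line are assumptions:

```cuda
#include <stdlib.h>

/* Exponential moving average with the standard 2/(period+1) smoothing. */
static void ema(const float *in, float *out, int n, int period) {
    float k = 2.0f / (period + 1);
    out[0] = in[0];
    for (int i = 1; i < n; i++)
        out[i] = in[i] * k + out[i - 1] * (1.0f - k);
}

/* Fills macd[] and signal[]; the MACD line crossing above its signal
 * line is read as a buy, crossing below as a sell. */
void computeMACD(const float *close, float *macd, float *signal, int n) {
    float *fast = (float *)malloc(n * sizeof(float));
    float *slow = (float *)malloc(n * sizeof(float));
    ema(close, fast, n, 12);   /* shorter EMA (assumed period) */
    ema(close, slow, n, 26);   /* longer EMA (assumed period) */
    for (int i = 0; i < n; i++)
        macd[i] = fast[i] - slow[i];
    ema(macd, signal, n, 9);   /* signal line (assumed period) */
    free(fast);
    free(slow);
}
```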