Published by Brett Hines. Modified over 9 years ago.
1
John Curreri, Seth Koehler, Rafael Garcia. Project F2: Application Performance Analysis
2
Outline. Introduction: application mappers, historical background, performance analysis today. HLL runtime performance analysis tool: motivation, instrumentation, framework, visualization. Case study: molecular dynamics. Conclusions & References.
3
Application Mappers Translates C code to HDL Higher level of abstraction Usually a subset of ANSI C No pointers No standard C libraries for FPGA HDL is generated as a project file for Xilinx or Altera tools Built-in communication Separate C source files are made for the CPU & FPGA Similar communication function calls between CPU & FPGA
4
Application Mappers (continued) Computational parallelism Pipelining of loops for(), while(), etc. Use of library functions HDL coded functions called at HLL FFT, Floating point operations Replication of functions defined in hardware Types of communication DMA transfers Efficient transfer of large chunks of data Stream transfers Steady flow of data Buffered for transfer rate changes
5
Introduction to the F2 project Goals for performance analysis in RC Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs) Motivations Complex systems are difficult to analyze by hand Manual instrumentation is unwieldy Difficult to make sense of large volume of raw data Tools can help quickly locate performance problems Collect and view performance data with little effort Analyze performance data to indicate potential bottlenecks Staple in HPC, limited in HPEC, and virtually non-existent in RC Challenges How do we expand notion of software performance analysis into software-hardware realm of RC? What are common bottlenecks for dual-paradigm applications? What techniques are necessary to detect performance bottlenecks? How do we analyze and present these bottlenecks to a user?
6
Historical Background. gettimeofday and printf: very cumbersome, repetitive, manual, not optimized for speed. Profilers date back to the 1970s with "prof" (gprof, 1982): provide the user with information about application behavior, such as the percentage of time spent in a function and how often a function calls another function. Simulators / emulators: too slow or too inaccurate; require significant development time. PAPI (Performance Application Programming Interface): portable interface to hardware performance counters on modern CPUs; provides information about caches, CPU functional units, main memory, and more.

Processor       HW counters
UltraSparc II   2
Pentium 3       2
AMD Athlon      4
IA-64           4
POWER4          8
Pentium 4       18

*Source: Wikipedia
7
Performance Analysis Today What does performance analysis look like today? Goals Low impact on application behavior High-fidelity performance data Flexible Portable Automated Concise Visualization Techniques Event-based, sample-based Profile, Trace Above all, we want to understand application behavior in order to locate performance problems!
8
Related Research and Tools: Parallel Performance Wizard (PPW) Open-source tool developed by UPC Group at University of Florida Performance analysis and optimization (PGAS* systems and MPI support) Performance data can be analyzed for bottlenecks Offers several ways of exploring performance data Graphs and charts to quickly view high-level performance information at a glance [right, top] In-depth execution statistics for identifying communication and computational bottlenecks Interacts with popular trace viewers (e.g. Jumpshot [right, bottom]) for detailed analysis of trace data Comprehensive support for correlating performance back to original source code *Partitioned Global Address Space languages allow partitioned memory to be treated as global shared memory by software.
9
Motivation for RC Performance Analysis Dual-paradigm applications gaining more traction in HPC and HPEC Design flexibility allows best use of FPGAs and traditional processors Drawback: More challenging to design applications for dual-paradigm systems Parallel application tuning and FPGA core debugging are hard enough! [Figure: difficulty of debugging and performance tuning, from less to more: sequential, parallel, dual-paradigm] No existing holistic solutions for analyzing dual-paradigm applications Software-only views leave out low-level details Hardware-only views provide incomplete performance information Need complete system view for effective tuning of entire application
10
Motivation for RC Performance Analysis Q: Is my runtime load-balancing strategy working? A: ??? [Figure: ChipScope waveform]
11
Motivation for RC Performance Analysis Q: How well is my core's pipelining strategy working? A: ??? gprof output (×N, one for each node!)

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 51.52      2.55     2.55        5   510.04   510.04  USURP_Reg_poll
 29.41      4.01     1.46       34    42.82    42.82  USURP_DMA_write
 11.97      4.60     0.59       14    42.31    42.31  USURP_DMA_read
  4.06      4.80     0.20        1   200.80   200.80  USURP_Finalize
  2.23      4.91     0.11        5    22.09    22.09  localp
  1.22      4.97     0.06        5    12.05    12.05  USURP_Load
  0.00      4.97     0.00       10     0.00     0.00  USURP_Reg_write
  0.00      4.97     0.00        5     0.00     0.00  USURP_Set_clk
  0.00      4.97     0.00        5     0.00   931.73  rcwork
  0.00      4.97     0.00        1     0.00     0.00  USURP_Init
12
Instrumentation Level. High-level language (HLL): requires HLL timing functions; application mapping is disturbed by instrumentation. Hardware description language (HDL): portable across HLLs and types of FPGA families; selected level for instrumentation. FPGA bitstream: requires targeting a specific FPGA family; instrument in minutes.
13
Instrumentation Selection Automated - Computation State machines Used for preserving execution order in C functions Used to control state of pipelines Control and status signals Used by library function Automated - Communication Control and status signals Used for streaming communication Used for DMA transfers Application specific Monitoring variables for meaningful values
14
Measurement Techniques. Profiling: counters record the number of occurrences of an event; low overhead; normally uses registers; block RAM can be used for state machines. Tracing: timestamps indicate when an event occurred; data is associated with each event; greater overhead; uses memory to store timestamps and data; greater fidelity; allows reconstruction of the sequence of events.* [Figure: event timeline for CPU-0; axis: Time]
* Zaki, O., Lusk, E., Gropp, W., and Swider, D. 1999. Toward Scalable Performance Visualization with Jumpshot. Int. J. High Perform. Comput. Appl. 13, 3 (Aug. 1999), 277-288.
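The contrast between the two techniques can be sketched in software. The following is a minimal illustrative model, not the tool's hardware measurement module: the class names, event names, and the drop-when-full policy are assumptions made for this example.

```python
from collections import Counter


class ProfileCounters:
    """Profiling: one counter per event type; timing information is discarded."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event, cycle, data=None):
        # Only the number of occurrences is kept (cheap, register-like storage).
        self.counts[event] += 1


class TraceBuffer:
    """Tracing: timestamped records stored in a bounded memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.dropped = 0

    def record(self, event, cycle, data=None):
        # Each record keeps when the event happened and its associated data.
        if len(self.records) < self.capacity:
            self.records.append((cycle, event, data))
        else:
            # Assumed policy for this sketch: drop events once memory is full.
            self.dropped += 1


# Hypothetical event stream: (cycle, event name, associated data).
events = [
    (3, "dma_start", 4096),
    (10, "pipeline_stall", None),
    (12, "dma_done", 4096),
    (20, "pipeline_stall", None),
]

profile = ProfileCounters()
trace = TraceBuffer(capacity=3)
for cycle, event, data in events:
    profile.record(event, cycle, data)
    trace.record(event, cycle, data)
```

The profile answers "how often did a stall happen?" with a single counter, while the trace can reconstruct the order of events but needs memory for every record and loses events once that memory is exhausted.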
15
Hardware Measurement Module
16
Adding Instrumentation & Measurement
[Figure: HLL tool flow with instrumentation. Flow stages: uninstrumented project; instrumentation added to C source; C source for FPGA mapped to HDL; instrumentation added to HDL; implement hardware. HLL tool flow: C source, software-hardware mapping, compile software, implement hardware, finished design. CPU(s): application (C source), HLL API wrapper, measurement extraction process/thread, instrumentation, loopback (C source). FPGA(s): HLL hardware wrapper, application (HDL), loopback (HDL), instrumented signals, hardware measurement module.]
17
Reverse Mapping & Analysis Mapping of HDL data back to HLL Variable name-matching Observing scope and other patterns Bottleneck detection Load-balancing of replicated functions Monitoring for pipeline stalls Detecting streaming communication stalls Finding shared-memory contention
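Two of the checks listed above (load-balancing of replicated functions and pipeline stall monitoring) could be expressed over profile counters roughly as follows. This is a hedged sketch: the function names, the 50% threshold, and the sample counter values are hypothetical and are not taken from the tool.

```python
def underused_replicas(busy_cycles, threshold=0.5):
    """Return indices of replicas whose busy-cycle count is far below the mean.

    busy_cycles: one busy-cycle counter value per replicated function.
    threshold:   assumed cutoff; a replica is flagged if it is busy for less
                 than threshold * mean busy cycles.
    """
    mean = sum(busy_cycles) / len(busy_cycles)
    return [i for i, busy in enumerate(busy_cycles) if busy < threshold * mean]


def stall_fraction(stall_cycles, total_cycles):
    """Fraction of cycles a pipeline or stream spent stalled."""
    return stall_cycles / total_cycles if total_cycles else 0.0


# Hypothetical counter readings for four replicated kernels.
flagged = underused_replicas([900, 880, 300, 910])
fraction = stall_fraction(250, 1000)
```

With these sample values, replica 2 is flagged as poorly balanced and the pipeline is reported as stalled for a quarter of its cycles.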
18
Example RC Visualization Need unified visualizations that accentuate important statistics Must be scalable to many nodes
19
Molecular Dynamics Simulation. Simulates interactions between atoms and molecules over discrete time intervals. Models forces: Newtonian physics, van der Waals forces, other interactions. Tracks each molecule's position and velocity in the X, Y, and Z directions. http://en.wikipedia.org/wiki/Molecular_dynamics
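A single time step of such a simulation can be sketched as below. The slides do not state which force model or integrator the case study used; this example assumes the Lennard-Jones potential (a common model for van der Waals forces), reduced units, and a simple semi-implicit Euler update, all chosen only for illustration.

```python
def lj_force(r_vec, epsilon=1.0, sigma=1.0):
    """Lennard-Jones force on particle i due to j, with r_vec = pos_i - pos_j."""
    r2 = sum(c * c for c in r_vec)
    s6 = (sigma * sigma / r2) ** 3          # (sigma / r)^6
    scale = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2
    return tuple(scale * c for c in r_vec)


def step(positions, velocities, mass, dt):
    """Advance all particles by one discrete time interval dt."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r_vec = tuple(a - b for a, b in zip(positions[i], positions[j]))
            f = lj_force(r_vec)
            for k in range(3):
                forces[i][k] += f[k]        # force on i from j
                forces[j][k] -= f[k]        # Newton's third law
    # Newton's second law: update velocity from force, then position (X, Y, Z).
    new_v = [
        tuple(v[k] + forces[i][k] / mass * dt for k in range(3))
        for i, v in enumerate(velocities)
    ]
    new_p = [
        tuple(p[k] + new_v[i][k] * dt for k in range(3))
        for i, p in enumerate(positions)
    ]
    return new_p, new_v


# Two particles one sigma apart: the force is repulsive at this distance.
pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
vel = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
pos, vel = step(pos, vel, mass=1.0, dt=0.001)
```

The all-pairs force loop is the part that grows quadratically with the number of particles, which is why it is a natural candidate for pipelined hardware kernels.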
20
Case Study Setup Impulse C v2.2 XD1000 platform Opteron 2.2 GHz XD1000 module with Altera Stratix-II EP2S180 FPGA in second processor socket MD communication architecture Chunks of MD data are read from SRAM Data is streamed to multiple MD kernels that are pipelined Results are stored back to SRAM
21
Impulse-C Profile Percentages Output stream of Molecular Dynamics kernel is a bottleneck.
22
The stream buffer size was increased 32-fold, raising application speedup over the serial baseline from 6.2 to 7.8.
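Why a deeper stream buffer can help is easy to see in a toy cycle-level model. The sketch below is not a model of the Impulse C streams or the XD1000 hardware: the function name, the one-item-per-cycle producer, and the on/off consumer pattern are all assumptions chosen to show how buffering absorbs rate changes.

```python
def producer_stall_cycles(buffer_size, items, on_cycles, off_cycles):
    """Count cycles in which a producer is blocked by a full stream buffer.

    The producer tries to push one item per cycle. The consumer pops one item
    per cycle for on_cycles cycles, then is idle for off_cycles cycles.
    """
    occupancy = 0
    pushed = 0
    stalls = 0
    cycle = 0
    period = on_cycles + off_cycles
    while pushed < items:
        consumer_active = (cycle % period) < on_cycles
        if consumer_active and occupancy > 0:
            occupancy -= 1                  # consumer drains one item
        if occupancy < buffer_size:
            occupancy += 1                  # producer writes one item
            pushed += 1
        else:
            stalls += 1                     # buffer full: producer stalls
        cycle += 1
    return stalls


# Hypothetical workload: 64 output items, consumer active 4 of every 8 cycles.
small = producer_stall_cycles(buffer_size=1, items=64, on_cycles=4, off_cycles=4)
large = producer_stall_cycles(buffer_size=32, items=64, on_cycles=4, off_cycles=4)
```

In this toy workload the one-entry buffer stalls the producer whenever the consumer pauses, while the 32-entry buffer absorbs almost all of those pauses. The real speedup depends on the actual traffic pattern, which is what the profile data in the case study was used to identify.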
23
Performance Analysis Overhead

EP2S180                    Original           Modified           Difference
Logic used (143520)        126252 (87.97%)    131851 (91.87%)    +5599 (+3.90%)
Comb. ALUT (143520)        100344 (69.92%)    104262 (72.65%)    +3918 (+2.73%)
Registers (143520)         104882 (73.08%)    110188 (76.78%)    +5306 (+3.70%)
Block RAM (9383040 bits)   3437568 (36.64%)   3557376 (37.91%)   +119808 (+1.27%)
Frequency (MHz)            80.57              78.44              -2.13 (-2.64%)

Additional FPGA resource usage: less than 4%
Frequency reduction: less than 3%
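The overhead figures can be rechecked with a few lines of arithmetic. In this sketch the resource percentage difference is taken as the difference between the two rounded utilization percentages, which is an assumption about how the slide's numbers were derived (it reproduces the +1.27% Block RAM entry, whereas dividing the raw difference by capacity gives roughly 1.28%). The dictionary layout and function names are illustrative.

```python
# (capacity, original, modified) taken from the overhead table.
RESOURCES = {
    "Logic used": (143520, 126252, 131851),
    "Comb. ALUT": (143520, 100344, 104262),
    "Registers": (143520, 104882, 110188),
    "Block RAM (bits)": (9383040, 3437568, 3557376),
}


def resource_overhead(capacity, original, modified):
    """Return (absolute increase, increase in percentage points of capacity)."""
    pct_original = round(100.0 * original / capacity, 2)
    pct_modified = round(100.0 * modified / capacity, 2)
    return modified - original, round(pct_modified - pct_original, 2)


def frequency_reduction(original_mhz, modified_mhz):
    """Return (drop in MHz, drop as a percentage of the original frequency)."""
    drop = original_mhz - modified_mhz
    return round(drop, 2), round(100.0 * drop / original_mhz, 2)


overheads = {name: resource_overhead(*row) for name, row in RESOURCES.items()}
freq_drop_mhz, freq_drop_pct = frequency_reduction(80.57, 78.44)
worst_resource_pct = max(pct for _, pct in overheads.values())
```

The largest resource increase is the logic utilization, which stays below the 4% bound stated on the slide, and the frequency drop stays below 3%.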
24
Conclusions Developed prototype HLL-oriented RC performance analysis tool First such runtime performance analysis tool framework (per extensive literature review) Tracing & profiling available Automated instrumentation in progress Application case study performed Observed minimal overhead from tool Speedup achieved due to performance analysis Future work SRC support, automated instrumentation and analysis, integration with software PAT, further case studies
25
References
Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging FPGA circuits. In Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society.
Sameer S. Shende and Allen D. Malony. The Tau parallel performance system. International Journal of High Performance Computing Applications (HPCA), 20(2):287-311, May 2006.
C. Eric Wu, Anthony Bolmarcich, Marc Snir, David Wootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE Conference on Supercomputing (CDROM) (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society.
Adam Leko and Max Billingsley, III. Parallel performance wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.
S. Koehler, J. Curreri, and A. George. Challenges for Performance Analysis in High-Performance Reconfigurable Computing. In Proc. of Reconfigurable Systems Summer Institute 2007 (RSSI), Urbana, IL, July 17-20, 2007.
J. Curreri, S. Koehler, B. Holland, and A. George. Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing. In Proc. of 16th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Palo Alto, CA, Apr. 14-15, 2008.