John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis.

John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Outline Introduction  Application mappers  Historical background  Performance analysis today HLL runtime performance analysis tool  Motivation  Instrumentation  Framework  Visualization Case study  Molecular Dynamics Conclusions & References

Application Mappers Translates C code to HDL  Higher level of abstraction  Usually a subset of ANSI C No pointers No standard C libraries for FPGA  HDL is generated as a project file for Xilinx or Altera tools Built-in communication  Separate C source files are made for the CPU & FPGA  Similar communication function calls between CPU & FPGA

Application Mappers (continued) Computational parallelism  Pipelining of loops for(), while(), etc.  Use of library functions HDL coded functions called at HLL FFT, Floating point operations  Replication of functions defined in hardware Types of communication  DMA transfers Efficient transfer of large chucks of data  Stream transfers Steady flow of data Buffered for transfer rate changes

Introduction to the F2 project Goals for performance analysis in RC  Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs) Motivations  Complex systems are difficult to analyze by hand Manual instrumentation is unwieldy Difficult to make sense of large volume of raw data  Tools can help quickly locate performance problems Collect and view performance data with little effort Analyze performance data to indicate potential bottlenecks Staple in HPC, limited in HPEC, and virtually non-existent in RC Challenges  How do we expand notion of software performance analysis into software-hardware realm of RC?  What are common bottlenecks for dual-paradigm applications?  What techniques are necessary to detect performance bottlenecks?  How do we analyze and present these bottlenecks to a user?

Historical Background Gettimeofday and printf  VERY cumbersome, repetitive, manual, not optimized for speed Profilers date back to 70’s with “prof” (gprof, 1982)  Provide user with information about application behavior Percentage of time spent in a function How often a function calls another function Simulators / Emulators  Too slow or too inaccurate  Require significant development time PAPI (Performance Application Programming Interface)  Portable interface to hardware performance counters on modern CPUs  Provides information about caches, CPU functional units, main memory, and more ProcessorHW counters UltraSparc II2 Pentium 32 AMD Athlon4 IA-644 POWER48 Pentium 418 *Source: Wikipedia

7 Performance Analysis Today What does performance analysis look like today?  Goals Low impact on application behavior High-fidelity performance data Flexible Portable Automated Concise Visualization  Techniques Event-based, sample-based Profile, Trace Above all, we want to understand application behavior in order to locate performance problems!

Related Research and Tools: Parallel Performance Wizard (PPW) Open-source tool developed by UPC Group at University of Florida  Performance analysis and optimization (PGAS* systems and MPI support)  Performance data can be analyzed for bottlenecks Offers several ways of exploring performance data  Graphs and charts to quickly view high-level performance information at a glance [right, top]  In-depth execution statistics for identifying communication and computational bottlenecks  Interacts with popular trace viewers (e.g. Jumpshot [right, bottom]) for detailed analysis of trace data  Comprehensive support for correlating performance back to original source code *Partitioned Global Address Space languages allow partitioned memory to be treated as global shared memory by software.

Motivation for RC Performance Analysis Dual-paradigm applications gaining more traction in HPC and HPEC  Design flexibility allows best use of FPGAs and traditional processors  Drawback: More challenging to design applications for dual-paradigm systems Parallel application tuning and FPGA core debugging are hard enough! Debug Performance Debug Performance Debug Performance Sequential Parallel Dual-Paradigm Difficulty level Less More No existing holistic solutions for analyzing dual-paradigm applications  Software-only views leave out low-level details  Hardware-only views provide incomplete performance information  Need complete system view for effective tuning of entire application

Motivation for RC Performance Analysis Q: Is my runtime load-balancing strategy working? A: ??? ChipScope waveform

Motivation for RC Performance Analysis Q: How well is my core’s pipelining strategy working? A: ??? gprof output (×N, one for each node!) Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 51.52 2.55 2.55 5 510.04 510.04 USURP_Reg_poll 29.41 4.01 1.46 34 42.82 42.82 USURP_DMA_write 11.97 4.60 0.59 14 42.31 42.31 USURP_DMA_read 4.06 4.80 0.20 1 200.80 200.80 USURP_Finalize 2.23 4.91 0.11 5 22.09 22.09 localp 1.22 4.97 0.06 5 12.05 12.05 USURP_Load 0.00 4.97 0.00 10 0.00 0.00 USURP_Reg_write 0.00 4.97 0.00 5 0.00 0.00 USURP_Set_clk 0.00 4.97 0.00 5 0.00 931.73 rcwork 0.00 4.97 0.00 1 0.00 0.00 USURP_Init

Instrumentation Level High-level language (HLL)  Requires HLL timing functions  Application mapping disturbed by instrumentation Hardware Description Language (HDL)  Portable between HLL and types FPGA families  Selected level for instrumentation FPGA bit stream  Requires targeting specific FPGA family  Instrument in minutes

Instrumentation Selection Automated - Computation  State machines Used for preserving execution order in C functions Used to control state of pipelines  Control and status signals Used by library function Automated - Communication  Control and status signals Used for streaming communication Used for DMA transfers Application specific  Monitoring variables for meaningful values

Measurement Techniques Profiling  Counters Records number of occurrences of event  Low overhead Normally uses registers Block RAM can be used for state machines Tracing  Timestamps Indicating when event occurred  Data Associated with each event  Greater overhead Uses memory to store timestamps and data  Greater fidelity Reconstruction of sequence of events * * Zaki, O., Lusk, E., Gropp, W., and Swider, D. 1999. Toward Scalable Performance Visualization with Jumpshot. Int. J. High Perform. Comput. Appl. 13, 3 (Aug. 1999), 277-288. CPU-0 3 2 1 Time

Hardware Measurement Module

Uninstrumented ProjectInstrumentation added to C sourceC source for FPGA mapped to HDLInstrumentation added to HDLImplement hardware Adding Instrumentation & Measurement HLL Hardware Wrapper HLL API Wrapper Application (C source) Measurement Extraction Process/Thread Application (C source) FPGA(s) CPU(s)HLL Tool Flow C source Compile software Implement hardware Finished design Software -hardware mapping Instrumented Signals Instrumentation Loopback (C source) Application (HDL) Loopback (HDL) Hardware Measurement Module

Reverse Mapping & Analysis Mapping of HDL data back to HLL  Variable name-matching  Observing scope and other patterns Bottleneck detection  Load-balancing of replicated functions  Monitoring for pipeline stalls  Detecting streaming communication stalls  Finding shared-memory contention

Example RC Visualization Need unified visualizations that accentuate important statistics  Must be scalable to many nodes

Molecular Dynamics Simulation  Interactions between atoms and molecules discrete time intervals  Models forces Newtonian physics Van Der Walls forces Other interactions  Tracks molecules position and velocity X, Y and Z directions http://en.wikipedia.org/wiki/Molecular_dynamics

Case Study Setup Impulse C v2.2 XD1000 platform  Opteron 2.2 GHz  XD1000 module with Altera Stratix-II EP2S180 FPGA in second processor socket MD communication architecture  Chunks of MD data are read from SRAM  Data is streamed to multiple MD kernels that are pipelined  Results are stored back to SRAM

Impulse-C Profile Percentages Output stream of Molecular Dynamics kernel is a bottleneck.

Stream buffer size was increased by 32 times allowing application speedup to increase from 6.2 to 7.8 vs. serial baseline.

Performance Analysis Overhead EP2S180OriginalModifiedDifference Logic Used (143520) 126252 (87.97%) 131851 (91.87%) +5599 (+3.90%) Comb. ALUT (143520) 100344 (69.92%) 104262 (72.65%) +3918 (+2.73%) Registers (143520) 104882 (73.08%) 110188 (76.78%) +5306 (+3.70%) Block RAM (9383040 bits) 3437568 (36.64%) 3557376 (37.91%) +119808 (+1.27%) Frequency (MHz) 80.5778.44-2.13 (-2.64%) Additional FPGA resource usage  Less than 4% Frequency reduction  Less than 3%

Conclusions Developed prototype HLL-oriented RC performance analysis tool  First such runtime performance analysis tool framework (per extensive literature review)  Tracing & profiling available  Automated instrumentation in progress Application case study performed  Observed minimal overhead from tool  Speedup achieved due to performance analysis Future work  SRC support, automated instrumentation and analysis, integration with software PAT, further case studies

References Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging FPGA circuits. In Proc. of the the 9th Annual IEEE Symposium on Field- Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society. Sameer S. Shende and Allen D. Malony. The Tau parallel performance system. International Journal of High Performance Computing Applications (HPCA), 20(2):287- 311, May 2006. C. EricWu, Anthony Bolmarcich, Marc Snir, DavidWootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE conference on Supercomputing (CDROM) (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society. Adam Leko and Max Billingsley, III. Parallel performance wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007. S. Koehler, J. Curreri, and A. George, "Challenges for Performance Analysis in High- Performance Reconfigurable Computing," Proc. of Reconfigurable Systems Summer Institute 2007 (RSSI), Urbana, IL, July 17-20, 2007. J. Curreri, S. Koehler, B. Holland, and A. George, "Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing," Proc. of 16th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Palo Alto, CA, Apr. 14-15, 2008.

John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis.

Similar presentations

Presentation on theme: "John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis.

Similar presentations

Presentation on theme: "John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis."— Presentation transcript:

Similar presentations

About project

Feedback