1
F2: RC Performance Analysis
Seth Koehler, John Curreri, Rafael Garcia
2
Presentation Overview
- Introduction: overview and motivation
- Project details
  - ReCAP framework and case studies
  - HLL performance analysis and case study
  - Common bottleneck analysis and optimization
  - Architecture-aware visualization
- Conclusions

Speaker notes: I'll first provide an introduction and give background and related research for this proposal, then give an overview of this proposed research as well as its main objective. Next, I'll detail the approach and contributions of each phase of this research, and finally I'll give a projected timeline and conclude.
3
Introduction
- Reconfigurable Computing (RC) continues to grow and mature
- General-purpose architectures can be wasteful in terms of performance and power
- Impractical to have an ASIC for every application
- FPGAs (Field-Programmable Gate Arrays)
  - Application-specific hardware and parallelism
  - Retain flexibility and programmability
- RC applications typically employ CPUs and FPGAs
  - Leverage strengths of both types of processors
  - Potential for higher performance (as well as lower power)
- System and application complexity can make it difficult to achieve this potential

Speaker notes: While the concept of RC was originally proposed in the 1960s, it is really only in the last two decades or so that this field has begun to expand and mature. The main motivation for RC was to bridge the wide gap between CPUs and ASICs in terms of performance, flexibility, and power. On one hand, CPUs are extremely flexible and easy to program, but can be wasteful in performance and power (ref 1.1). On the other hand, ASICs provide low power and high performance, but it is impractical to have an ASIC for every application (ref 1.2); in modern times, high volume is usually required to amortize the costs of fabrication. Reconfigurable hardware, such as FPGAs, complements CPUs and ASICs, providing custom hardware and parallelism while retaining flexibility and programmability to accomplish the next task. RC applications combine the strengths of CPUs and RC hardware, typically FPGAs, to potentially increase performance and use less power than a CPU-only solution. Unfortunately, system and application complexity can make it difficult to achieve this potential (ref 3).
4
Introduction (2)
- Must understand application behavior to improve performance
  - Where does the application spend most of its time?
  - What system resources are heavily (or lightly) used by the application?
  - Where and why does the application get delayed?
- Applications are too complex to study performance solely by analytics and simulation
  - Extremely difficult to represent an application accurately via formulas and simple paper-and-pencil analysis
  - Accurate simulation can be computationally intensive and difficult to set up
- Need to observe actual behavior of the application executing on an actual system at runtime
  - Use (experimental) performance analysis to observe application behavior and determine a strategy for optimization
  - Validates (or invalidates) models used in analytics or simulation

Speaker notes: It is thus important to be able to improve performance, and to do this we must understand application behavior. This is where performance analysis comes in: observing the actual behavior of the application executing on an actual system at runtime (ref 3.1). In essence, performance analysis gives the final story of application performance.
5
Overview – Performance Analysis
- Performance analysis reduces guesswork in optimization
  - Aids the designer in locating and remedying application bottlenecks
  - Replaces tedious, error-prone manual analysis methods (timing functions and printf statements; see the sketch below)
- Goals (adapted from Malony et al.): high fidelity, low perturbation, adaptable, portable, convenient, concise, intuitive
- Stages:
  - Instrument: enable access to application (or system) data to be measured and stored at runtime
  - Measure: record and store data while the application executes on the target system
  - Analyze (optional): identify potential performance bottlenecks from measured data
  - Present: visualize measured data and (optionally) results from analysis
  - Analyze (manual): from performance data, deduce potential bottlenecks and optimization strategies
  - Optimize: modify the application, hopefully gaining speedup
- Goal definitions:
  - Fidelity: record sufficient detail to reconstruct application behavior
  - Conservation: do not affect program correctness
  - Low perturbation: alter the application's performance behavior as little as possible
  - Adaptability & portability: monitor diverse applications & systems
  - Convenience: require little effort from the designer
  - Concise: present only data that captures application behavior (don't overwhelm the user)
  - Intuitive: present data to allow rapid understanding
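For contrast with the automated framework, here is a minimal C sketch of the manual analysis method named above: wrapping a region of interest in timing calls and a printf. compute_kernel() is a hypothetical stand-in for real application work, not code from ReCAP.

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for the application region being studied. */
    static void compute_kernel(void) {
        volatile long i;
        for (i = 0; i < 100000000L; i++) ;
    }

    int main(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);   /* instrument by hand */
        compute_kernel();                      /* measure one region */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("compute_kernel took %.6f s\n", secs);  /* "present" via printf */
        return 0;
    }

This is exactly the tedious, error-prone process the tool replaces: every region must be wrapped by hand, and the printf output must be interpreted manually.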
6
Overview – RC systems / applications
- RC systems and applications are even more complex than in HPC
  - Heterogeneous components
  - Hierarchy of parallelism among components
  - Lack of visibility inside RC devices
- Optimizing applications is crucial for effective use of these systems
- Performance analysis tools are relied on heavily in HPC to productively optimize applications
- Performance analysis tools are even more essential in RC due to additional system and application complexity, and yet research is lacking
- Objective: expand the notion and benefits of software performance analysis into the software-hardware realm of RC
7
Overview – Instrumentation
- Choosing an instrumentation level (source, binary, intermediate)
  - Intermediate: may be inaccessible and offers little benefit
  - Source: portable across devices, flexible, low overhead, easy
  - Binary: fast (minutes vs. hours), portable across languages
- Modifying an application
  - Source-level: standard preprocessor / grammar-based parsing
  - Binary-level: alter LUTs and routing fabric (Xilinx JBits SDK)
    - Not portable, even across devices in the same family
    - May fail if the routing fabric is heavily used in a given location
- Choosing signals to instrument
  - Must monitor key data within resource constraints
  - Monitor off-chip communication via ports of the top-level file
  - Monitor clocked control via signals used in conditional blocks, e.g., state machines, if-else blocks, pipeline stall / flush, loop counters
  - HDL source expressiveness can complicate matters, but is far better than a flattened, placed, and routed design
- Thus, source instrumentation was chosen for our framework

Speaker notes: User input for communication includes designating whether a signal is data or control, and specifically what control specifies when communication is or isn't occurring. There are really two cons for source instrumentation: the need to handle multiple HDLs, and the need to incur an entire synthesize-and-implement cycle. The first is mitigated by the fact that there are two common HDLs that can be mixed; the second is an unfortunate fact of life for HDL designs currently, namely that place-and-route takes a long time. [Image c/o Altera]
8
Overview – Measurement
- Obtaining & storing performance data without perturbing the application
  - Tracing: timestamp (cycle counter) with event and associated data
    - May generate a significant amount of data
    - Carefully define events to limit the amount of data generated
    - Transfer performance data to larger memories at runtime
  - Profiling: summary statistics (counts, averages, etc.)
    - May be difficult to determine application behavior from statistics
  - Sampling: data points
    - Measure a value periodically to reduce overhead but still retain some temporal information
- Managing shared resources
  - Measurement may share off-chip resources (communication channels & memory)
  - Arbitrate the communication or memory channel, assigning unused addresses or bits to measurement hardware
    - May be impossible to automatically detect which bits or addresses are unused
    - Could cause significant perturbation of the application
- We employ all three forms of measurement (see the sketch below) and use platform-specific knowledge to manage shared resources

Speaker notes: Interestingly, we have the potential to incur no overhead for our monitoring given parallelism inside the device, unlike software performance analysis. Sampling makes sense for slow-moving or monotonic data. Large memories are needed to store performance data, and these are usually not available on-chip.
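A conceptual C view of the three measurement forms described above. Field names and widths are illustrative assumptions, not ReCAP's actual record layout.

    #include <stdint.h>

    /* Tracing: timestamped event records, buffered on-chip, drained at runtime. */
    typedef struct {
        uint64_t cycle;     /* cycle-counter timestamp */
        uint16_t event_id;  /* which instrumented signal/event fired */
        uint64_t data;      /* associated value, e.g., a loop counter */
    } trace_record_t;

    /* Profiling: summary statistics only, e.g., one counter per FSM state
     * (the N-Queens study uses sixteen 32-bit profile counters). */
    typedef struct {
        uint32_t cycles_in_state[16];
    } profile_counters_t;

    /* Sampling: read a slowly changing value every N cycles. */
    typedef struct {
        uint32_t period;       /* sampling interval in cycles */
        uint64_t last_sample;  /* most recent sampled value */
    } sampler_t;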
9
ReCAP Framework
Reconfigurable-Computing Application Performance (ReCAP) framework components:
- HDL Instrumenter
- Hardware Measurement Module (HMM)
- Hardware Data Transfer Module (HDTM)
- RC-enabled version of PPW (PPW+RC)

Speaker notes: With this research, we developed the ReCAP framework to address these challenges.
10
ReCAP: HDL Instrumenter
- Parses VHDL, returning a list of signals, variables, inputs, outputs, etc. from the design hierarchy
  - Any items may be selected for monitoring
  - A more advanced screen allows configuration parameters to be set precisely
- Extracts selected data through the design hierarchy to the top-level file
- Adds the Hardware Measurement Module (HMM) into the design
  - Connects extracted data to the HMM
  - Overrides the interface to allow communication with the HMM through a shared channel

[Figures: HDL Instrumenter; Instrumentation Process]
11
ReCAP: Hardware Measurement Module
- Hardware necessary to record, store, and retrieve data at runtime
  - Profiling, tracing, and sampling
  - Cycle counter (for timing) and other module statistics (trace records available/dropped, counter overflow, etc.)
  - Buffers for storing trace data
  - Module control for performance-data retrieval and miscellaneous control (e.g., clear and stop)

[Figures: Instrumentation Process; Hardware Measurement Module (HMM)]
12
ReCAP: PPW+RC and HDTM
- PPW+RC backend adds the Hardware Data Transfer Module (HDTM)
  - Data transfer thread
  - Locking mechanism (since we now have shared FPGA access)
  - Data storage and migration of FPGA performance data into PPW data structures
- Normal PPW backend instruments software code
- PPW+RC frontend presents measured data for CPUs and FPGAs
  - Table and chart views across multiple experiments
  - Export to Jumpshot for timeline views

[Figures: Instrumentation Process; PPW+RC]
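A minimal sketch of the HDTM's software side as described above: a thread that periodically drains FPGA performance data under a lock, since the FPGA channel is now shared between the application and the measurement code. read_fpga_trace_buffer(), store_ppw_records(), and the drain period are hypothetical, stubbed here so the sketch compiles.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <unistd.h>

    static pthread_mutex_t fpga_lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile bool app_running = true;

    /* Hypothetical platform calls; stubbed for compilability. */
    static size_t read_fpga_trace_buffer(void *dst, size_t max) { (void)dst; (void)max; return 0; }
    static void store_ppw_records(const void *src, size_t n) { (void)src; (void)n; }

    static void *hdtm_thread(void *arg) {
        char buf[4096];
        (void)arg;
        while (app_running) {
            pthread_mutex_lock(&fpga_lock);    /* FPGA access is shared: lock it */
            size_t n = read_fpga_trace_buffer(buf, sizeof buf);
            pthread_mutex_unlock(&fpga_lock);
            if (n > 0)
                store_ppw_records(buf, n);     /* migrate into PPW data structures */
            usleep(1000);                      /* drain period: an assumption */
        }
        return NULL;
    }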
13
Case Study: N-Queens
- Overview: find the number of distinct ways n queens can be placed on an n×n board without attacking each other (via a backtracking algorithm)
- Multi-CPU/FPGA application (UPC/VHDL)
- Overhead
  - <= 6% area (sixteen 32-bit profile counters for state machines)
  - <= 2% memory (96-bit-wide trace buffer for core finish time)
  - Negligible frequency degradation observed

N-Queens results for board size of 16:

                                      XD1                        Xeon-H101
                               Original   Instrumented    Original   Instrumented
  Slices (% rel. to device)    9,041      9,901 (+4%)     23,086     26,218 (+6%)
  Block RAM                    11         15 (+2%)        21         22 (0%)
  Frequency (MHz, % rel.)      124        123 (-1%)       101        -
  Communication (KB/s)         <1         33              -          30
14
Case Study: N-Queens (cont)
- Main state machine optimized based on performance data: the reset-attack-checker state was combined with the valid-solution state (see the sketch below)
- 8-node speedup increased from 33.9 to 37.1 over the serial baseline (10.5%)

[Graphics c/o John Curreri]
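The optimization itself lives in VHDL; below is a conceptual C rendering of the state-machine change, a sketch with invented state names rather than the project's actual code.

    #include <stdio.h>

    /* Invented state names; RESET_CHECKER is the state removed by the merge. */
    enum state { PLACE, CHECK, VALID_SOLUTION, RESET_CHECKER };

    /* Before the optimization, each solution cost two cycles:
     *   VALID_SOLUTION: count++;      next state RESET_CHECKER
     *   RESET_CHECKER:  checker = 0;  next state PLACE
     * After, the reset is folded into VALID_SOLUTION, saving a cycle: */
    static enum state step(enum state s, int *checker, long *count) {
        switch (s) {
        case VALID_SOLUTION:
            (*count)++;
            *checker = 0;              /* reset merged into this state */
            return PLACE;
        default:
            return s;                  /* other transitions omitted */
        }
    }

    int main(void) {
        int checker = 5; long count = 0;
        enum state s = step(VALID_SOLUTION, &checker, &count);
        printf("next=%d checker=%d count=%ld\n", (int)s, checker, count);
        return 0;
    }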
15
Case Study: 2D-PDF estimation*
Application
- Estimate a 2D probability density function (i.e., a nearly smooth histogram) given a set of (x, y) coordinate data
- 3.2GHz Xeon, Virtex-4 LX100 FPGA, PCI-X
- 14% slices, 15% block RAM, 100MHz
Setup
- 5 profile counters monitored the main state machine
- Additional 576 slices (1.2% of device), no frequency degradation, 6% runtime increase
Results
- 45.1% of the time, hardware is idle waiting for data
- Hardware spends less than 2% of its time on communication; software spends over 95% of its time on communication (polling)
- Double buffering data on the FPGA could reclaim all idle time, providing (ideally) 85% speedup (see the sketch below)

[Timeline figure: software functions (FPGA Write, FPGA Read) vs. hardware state machine (Idle, Compute), with idle time highlighted]

* 2D-PDF code written by Karthik Nagarajan
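A minimal C sketch of the suggested double-buffering fix: the CPU transfers the next chunk into one FPGA-side buffer while the FPGA computes on the other, hiding transfer time behind computation. fpga_write(), fpga_start(), fpga_wait(), and next_chunk() are hypothetical calls, not a real vendor API.

    #include <stddef.h>

    #define CHUNK 4096
    static float host_buf[CHUNK];

    /* Hypothetical platform calls; buffers 0 and 1 live on the FPGA. */
    extern void   fpga_write(int which, const float *src, size_t n);
    extern void   fpga_start(int which);
    extern void   fpga_wait(void);
    extern size_t next_chunk(float *dst, size_t max);  /* produce input data */

    void stream_all(void) {
        int cur = 0;
        size_t n = next_chunk(host_buf, CHUNK);
        if (n == 0) return;
        fpga_write(cur, host_buf, n);              /* preload first buffer */
        while (n > 0) {
            fpga_start(cur);                       /* FPGA computes on `cur`... */
            size_t m = next_chunk(host_buf, CHUNK);
            if (m > 0)
                fpga_write(1 - cur, host_buf, m);  /* ...while we fill the other */
            fpga_wait();                           /* wait for compute on `cur` */
            cur = 1 - cur;
            n = m;
        }
    }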
16
Case Study: Collatz conjecture (3x+1)
Application
- Search for sequences that do not reach 1 under the Collatz function: f(n) = n/2 for even n, 3n+1 for odd n
- 3.2GHz Xeon, Virtex-4 LX100, PCI-X
- 88% slices, 22% block RAM, 100MHz
Setup
- 17 profile counters across 3 state machines
- Additional 1,089 slices (2.2% of device), no frequency degradation
Results
- Bottleneck 1: frequent, small CPU-to-FPGA communication
  - 31% speedup achieved by buffering data before sending it to the FPGA (see the sketch below)
  - Larger buffers yield up to 74% speedup
- Bottleneck 2: data distribution
  - Ideal speedup of 11% is not large enough to merit additional work currently

[Timeline figures: FPGA Write, FPGA Read, CPU Prefilter; FPGA Read, FPGA Write, CPU Verify]
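A small C sketch of the buffering optimization behind Bottleneck 1: instead of one small CPU-to-FPGA write per work item, accumulate items and send a batch. fpga_send() and the BATCH size are illustrative assumptions.

    #include <stddef.h>
    #include <stdint.h>

    #define BATCH 1024
    static uint64_t batch[BATCH];
    static size_t fill = 0;

    extern void fpga_send(const uint64_t *items, size_t n);  /* hypothetical */

    /* Queue one candidate; transfer only when a full batch is ready,
     * amortizing the per-transfer overhead that caused Bottleneck 1. */
    void submit(uint64_t item) {
        batch[fill++] = item;
        if (fill == BATCH) {
            fpga_send(batch, fill);
            fill = 0;
        }
    }

    void flush_batch(void) {      /* send any remainder at the end of a run */
        if (fill > 0) {
            fpga_send(batch, fill);
            fill = 0;
        }
    }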
17
HLL Performance Analysis
High-level languages: Impulse C and Carte C
- Convert a subset of C to HDL
- Employ DMA and streaming communication
- Speedup gained by pipelining loops, library functions, and replicated functions
- Impulse C: pipelining of loops determined by pragmas in the code (see the sketch below)
- Carte (SRC): automatic pipelining of the innermost loop
- Library functions: called as C functions, coded in HDL
Automated instrumentation
- Computation
  - State machines: used for preserving execution order in C functions and for controlling pipelines
  - Control and status signals used by library functions
- Communication
  - Control and status signals
  - Streaming communication
  - DMA transfers
User-assisted instrumentation
- Application-specific variables: monitor meaningful values selected by the user
Measurement
- Employ the HMM from the HDL framework
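An illustrative Impulse C-style kernel showing how loop pipelining is requested via a pragma, as mentioned above. The header name, stream calls, types, and pragma spelling follow Impulse C examples but should be treated as assumptions, not verified against a specific tool version.

    #include "co.h"   /* Impulse C header (assumed name) */

    void accum_kernel(co_stream input, co_stream output) {
        int32 x, sum = 0;
        co_stream_open(input, O_RDONLY, INT_TYPE(32));
        co_stream_open(output, O_WRONLY, INT_TYPE(32));
        while (co_stream_read(input, &x, sizeof(x)) == co_err_none) {
    #pragma CO PIPELINE
            sum += x;              /* loop body compiled into a hardware pipeline */
        }
        co_stream_write(output, &sum, sizeof(sum));
        co_stream_close(input);
        co_stream_close(output);
    }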
18
HLL Instrumentation & Measurement
[Tool-flow figure: starting from the uninstrumented project, instrumentation is added to the C source; the FPGA-bound C source is mapped to HDL, where further instrumentation and the Hardware Measurement Module are added; compiling the software and implementing the hardware yield the finished design. Measurement extraction runs as a process/thread on the CPU(s), communicating with the FPGA(s) through an HLL API wrapper, an HLL hardware wrapper, and a loopback path (C source and HDL), with instrumented signals routed from the application HDL into the Hardware Measurement Module.]
19
HLL Analysis & Visualizations
- Bottleneck detection (currently user-assisted)
  - Load balancing of replicated functions
  - Monitoring for pipeline stalls
  - Detecting streaming-communication stalls
  - Finding shared-memory contention
- Integration with performance analysis tool
  - Profiling data
  - Pie charts showing time utilization
  - Tree view of CPU and FPGA timing

[Figure: mapping between C source (main MD loop, input stream, pipeline transition, output stream) and HDL state-machine states (b4s0-b4s4, b6s0, b6s1)]
20
Case Study: Molecular Dynamics
Molecular Dynamics: simulates the interaction of molecules over discrete time steps
- Impulse C version 2.2
- XD1000 platform: dual-processor motherboard, 2.2GHz Opteron, Stratix-II EP2S180 XD1000 module
MD communication architecture
- Chunks of MD data read from SRAM
- Data streamed to multiple pipelined MD kernels
- Results stored back to SRAM
HLL stream buffer
- Increased buffer size by 32 times
- Speedup change: 6.2 vs. serial baseline before enhancements; 7.8 vs. serial baseline after enhancements
21
Common Bottleneck Analysis
- Explore common RC bottlenecks, associated detection techniques, and optimization strategies
  - What are common RC bottlenecks?
  - For each common bottleneck...
    - What conditions constitute a bottleneck?
    - What data must be acquired to detect it?
    - Is the measurement overhead acceptable?
    - What optimizations can be suggested?
    - What additional information would be useful in making a better decision?
- Goal: reduce the time and expertise needed to optimize an application
  - Embed "expert" information in the tool

Speaker notes: One of the methodologies demonstrated in traditional HPC performance analysis is common bottleneck detection, which reduces the effort and expertise required from the application designer. Diverse applications often have similar performance bottlenecks (e.g., communication overhead, poor load balancing).
22
Common RC Bottlenecks

[Figure: landscape of potential RC bottlenecks]
- Leverage bottlenecks discovered and categorizations employed in the traditional performance analysis literature
- Leverage CHREC RC experience
23
Detection and Optimization
For each identified bottleneck category, determine:
- Detection strategies
  - Behavior that constitutes this bottleneck, e.g., "write enable" is high more than 50% of cycles (see the sketch below)
  - Possibilities for locating this bottleneck, e.g., trace "write enable" OR determine the average number of cycles "write enable" is high
  - Benefits and drawbacks associated with each strategy
- Supporting data and optimization suggestions
  - Severity of the bottleneck
  - Relative priority of the bottleneck (based on global interactions between all components as well as severity)
  - Suggested design changes, e.g., amount of replication, additional buffering, stage balancing, etc.
  - Predicted performance if suggested improvements were implemented
- Additional information (if any)
  - Data needed to better determine a course of action
  - Facilitates honing in on bottlenecks via an iterative process

Speaker notes: A number of items are searched for initially, with more in-depth suggestions for monitoring thereafter.
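A small C sketch of the kind of embedded "expert" rule described above, applied to profile data: flag a bottleneck when a monitored control signal such as "write enable" is asserted more than 50% of cycles. The struct fields and the suggested remedy are illustrative assumptions; only the 50% rule comes from the slide.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative profile record for one monitored control signal. */
    typedef struct {
        const char *signal;     /* e.g., "write_enable" */
        uint64_t cycles_high;   /* profile counter: cycles the signal was asserted */
        uint64_t cycles_total;  /* cycle counter over the same interval */
    } duty_profile_t;

    bool detect_write_pressure(const duty_profile_t *p) {
        double duty = (double)p->cycles_high / (double)p->cycles_total;
        if (duty > 0.50) {      /* rule from the slide's example */
            printf("%s asserted %.0f%% of cycles: consider additional buffering\n",
                   p->signal, duty * 100.0);
            return true;
        }
        return false;
    }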
24
Presentation Challenges/Techniques
- Need scalable visualizations to locate and intuitively understand common and application-specific bottlenecks
- Traditional visualizations use a homogeneous-parallel-thread model
  - All threads execute on the same or similar hardware
  - All threads are assumed to be at the same level in the application hierarchy
  - A single thread cannot execute tasks in parallel
  - Connecting threads to physical hardware (CPUs) is unnecessary
  - No explicit relationship between threads (e.g., groups, master/slave) is given in the visualization; a thread is either communicating, waiting, or computing
- This model is ill-suited for RC applications
  - Components are not general-purpose
  - Heterogeneous parallelism
  - Multiple hierarchies of parallelism
  - Parallelism inside components
  - Simultaneous communication & computation
- Enable designers to quickly locate and understand both common and application-specific performance bottlenecks

Speaker notes: The goal is to provide enough information to allow future tools to make well-informed decisions on what visualizations to implement, as well as the needed data and techniques for acquiring that data.

[Figure: traditional timeline visualization]
25
Visualization Landscape
Performance data contexts
- Application context: the designer knows the application as written in source, so performance should be presented in this context (e.g., functions, components, source files)
- System context: the application runs on a real system, so performance data should be presented in this context (e.g., utilization of CPUs, FPGAs, network and interconnects)
- Common-bottleneck context: analysis data should be included to pinpoint problems (e.g., common synchronization or communication issues)

[Screenshots: common-bottleneck context (KOJAK [5]); application context (PPW); system context (PPW)]
26
Architecture-Aware Visualization
- Visualization within application & system context
  - Integrate common-bottleneck data
  - Must be scalable to large systems
  - Choose data to include/exclude at each level
- Auto-generating the system and application architecture is desirable
  - System level
  - Node level
  - Component categorization (e.g., buffer, process, memory)
  - Categorization of states or other conditional blocks (e.g., idle, busy, full, empty)
27
Conclusions
- Runtime performance analysis of RC applications is critical for productively optimizing these applications and effectively using RC systems
- ReCAP provides an automated framework/tool to study application behavior while minimizing overhead & effort
  - Instrumentation and measurement for recording, storing, and transferring performance data at runtime
  - Common bottleneck detection & optimization strategies
  - Presentation of CPU/FPGA performance data & common bottlenecks in scalable, intuitive visualizations (prototype)
- ReCAP represents the first RC application performance framework and tool (per extensive literature review)