Liquid Architecture
Extracting & Improving Micro-architecture Performance on Reconfigurable Architectures
Shobana Padmanabhan, Phillip Jones, David Schuehler, Praveen Krishnamurthy, Scott Friedman, Huakai Zhang, Ron Cytron, John Lockwood, Roger Chamberlain, Jason Fritts
Washington University in St. Louis
Funded by NSF under grant
Sep 22
Application performance depends on: Architecture, Compiler, Algorithm
Customization: cost/performance tradeoff (Generic, FPGA, Custom)
- Generic processor: cheap but application-agnostic; compilers exist; compiler optimization is the key
- Reconfigurable logic (FPGA): subject of our study; architecture and compiler research are the key
- Customized logic: ideal for an application but expensive; logic/architecture research is the key
Liquid architecture combines the best of all options

Standard Architecture
  + Standardized ISA, existing compilers
  x Not optimized for any specific application
  x Fixed instructions and hardware
  Cost: ~ $200 - $500

Liquid Architecture on FPGA
  + ISA + extras, can use modified open-source tools
  + Hardware can be optimized for specific application
  + Reconfigurable ISA; ~100us - 100ms; person hours and not $millions
  Cost: ~ $200 - $2000

Custom Architecture on Integrated Circuit
  x One-of-a-kind, nonstandard
  + Optimized for specific application
  x Fixed instructions and hardware
  x Cost: ~ $500, ,000,000+
Hardware platform overview
- FPGA: standard ISA (SPARC 8); instrumentation and variations
- FPX: interface support modules (VHDL); memory, network interface chip, …
- Internet
- Development workstation
FPX research was supported by NSF: ANI and Xilinx Corp.
Hardware platform details
[Diagram: FPGA on the FPX containing the LEON core, I-cache, D-cache, cache controller, AHB address/data bus, and memory controller, with SRAM / SDRAM attached.]
LEON: SPARC V8 compatible, open soft core
Application execution
[Diagram: FPX platform (LEON core, I-cache, D-cache, cache controller, AHB address/data bus, memory controller, SRAM / SDRAM) with a Command Controller, a Control S/W Interface, and the workstation.]
Program: BLASTN (DNA sequence comparison), compiled with gcc
Application runtime
[Diagram: FPX platform; results and timing are returned to the workstation.]
Slow! Where is time spent?
Software approach to profiling “time”
1. Start with the program
2. Introduce timers
3. Run the instrumented program
4. Collect execution timings
Drawbacks:
- Timers must account for their own overhead
- The instrumented program will run slower
- Instrumentation skews runtime because it affects system behavior such as the cache, …
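The overhead problem can be made concrete with a minimal Python sketch. This is not the tooling used in this work: the names `measure_timer_overhead` and `timed_call` are hypothetical, and the clock is passed in as a parameter only so the behavior can be checked deterministically.

```python
import time


def measure_timer_overhead(clock=time.perf_counter_ns, samples=1000):
    """Estimate the cost of reading the clock twice back to back.

    The minimum over many samples is used because larger values
    usually include unrelated noise (interrupts, scheduling).
    """
    best = None
    for _ in range(samples):
        start = clock()
        stop = clock()
        delta = stop - start
        if best is None or delta < best:
            best = delta
    return best


def timed_call(fn, *args, clock=time.perf_counter_ns, overhead=0):
    """Run fn(*args) between two clock reads.

    Returns (result, elapsed), where elapsed has the estimated timer
    overhead subtracted and is clamped at zero.
    """
    start = clock()
    result = fn(*args)
    stop = clock()
    return result, max(0, (stop - start) - overhead)


if __name__ == "__main__":
    overhead = measure_timer_overhead()
    _, elapsed = timed_call(sum, range(100_000), overhead=overhead)
    print(f"timer overhead ~{overhead} ns, sum() took ~{elapsed} ns")
```

Even with this correction, the extra clock reads still occupy cache lines and pipeline slots, which subtraction cannot undo.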
Profiling is free with liquid architecture!
Cycle-accurate profiling for free
[Diagram: FPX platform with a Statistics Module added on the FPGA. The LEON program counter (pc) and the event monitor bus feed the Statistics Module; the workstation sends a request and receives timings.]
Liquid architecture: cycle-accurate profiling for free
- Choose methods to profile from the user interface. The interface lists the methods in .text (main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd) with a Time / Cycles column.
- Each selected method corresponds to an address range (Hi, Lo) in .text.
[Diagram: Stats Module on the Event Monitor Bus with PC and CLK inputs. For each profiled function, a pair of ≤ comparators checks the PC against the Hi and Lo addresses and drives INCR on a Counter. Values shown: 0x400003EF Hi, 0x C Lo, 0x A; 0x F Hi, 0x400005D8 Lo, 0x A. Counter values go to the Command Controller.]
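A behavioral sketch of the comparator-and-counter scheme, written in Python rather than the VHDL actually used. The class name `RangeCounter`, the address ranges, and the trace are hypothetical and chosen only for illustration; the model assumes one PC sample per clock cycle.

```python
from dataclasses import dataclass


@dataclass
class RangeCounter:
    """One comparator pair plus counter, modeled in software."""
    name: str
    lo: int  # lowest instruction address of the function (inclusive)
    hi: int  # highest instruction address of the function (inclusive)
    cycles: int = 0

    def clock(self, pc: int) -> None:
        # Lo <= PC <= Hi drives the counter's INCR input for this cycle.
        if self.lo <= pc <= self.hi:
            self.cycles += 1


def profile(pc_per_cycle, counters):
    """Feed one PC value per clock cycle to every counter."""
    for pc in pc_per_cycle:
        for counter in counters:
            counter.clock(pc)
    return {c.name: c.cycles for c in counters}


# Hypothetical address ranges and trace, for illustration only.
counters = [
    RangeCounter("findMatch", lo=0x40000100, hi=0x400001FF),
    RangeCounter("coreLoop", lo=0x40000200, hi=0x400002FF),
]
# A repeated PC models a cycle in which the processor is stalled.
trace = [0x40000100, 0x40000104, 0x40000104, 0x40000200, 0x40000300]
print(profile(trace, counters))  # {'findMatch': 3, 'coreLoop': 1}
```

Because the comparison happens beside the processor rather than inside the program, the profiled application executes exactly the same instructions as the unprofiled one.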
Cycle-accurate profiling for free
[Diagram: FPX platform with the Statistics Module fed by pc and the event monitor bus; the workstation sends a request and receives timings.]
Timings returned: findMatch 500 ms, coreLoop 300 ms
“Where time was spent” for BLASTN…
Cycle-accurate profiling
- No application overhead
- Hence, runs at full speed
Cycle-accurate profiling for free
[Diagram: FPX platform with the Statistics Module fed by pc and the event monitor bus.]
Is cache the problem?
Software approach to profiling “cache”
- Not possible to profile by coding!!
- Instead, simulate cache behavior: run the program through a cache simulator to obtain timings
- Slow!! Cannot afford to simulate the entire program, so the program must be scaled down
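To show why simulation is costly, here is a minimal sketch of a direct-mapped cache simulator in Python. It is not the simulator referred to in the talk; the function name, the 1 KB size, and the 16-byte line are assumptions, and reads and writes are treated alike. Every memory access of the program has to pass through the loop body, which is what makes whole-program simulation slow.

```python
def simulate_direct_mapped(addresses, cache_bytes=1024, line_bytes=16):
    """Count hits and misses for a direct-mapped cache.

    addresses: iterable of byte addresses, in program order.
    Returns (hits, misses).
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines  # one stored tag per cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_bytes   # which memory block is touched
        index = block % num_lines    # cache line the block maps to
        tag = block // num_lines     # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag        # replace whatever was there
    return hits, misses


if __name__ == "__main__":
    # Addresses 0 and 1024 map to the same line and evict each other.
    print(simulate_direct_mapped([0, 4, 8, 16, 0, 1024, 0]))  # (3, 4)
```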
How do we detect and report cache behavior using Liquid Architecture?
Liquid architecture: cache behavior for free
Interface extends to include cache behavior options…
- Rows: functions in .text (main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd)
- Columns: Time / Cycles; Cache Hits / Misses (Read, Write)
Cache profiling
[Diagram: FPX platform with the Statistics Module fed by pc and the event monitor bus.]
Cache behavior
- Hit and miss signals are generated in LEON
- These signals are fed into the Event Monitoring Bus
- The Statistics Module counts the events
Cache profiling
[Diagram: FPX platform with the Statistics Module fed by pc and the event monitor bus. Counts reported: read hits and misses, write hits and misses.]
[Chart: % cache hit rate for D-cache: 1KB]
Function-wise cache profiling, in reasonable time
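The function-wise hit rate can be derived from the four event counts. The Python sketch below models that bookkeeping in software; the event names, the `(pc, event)` sample format, and the function names are assumptions made for illustration, not the Statistics Module's actual interface.

```python
from collections import Counter

EVENTS = ("read_hit", "read_miss", "write_hit", "write_miss")


def count_cache_events(samples, ranges):
    """Attribute cache events to functions by program counter.

    samples: iterable of (pc, event) pairs; event is one of EVENTS,
             anything else is ignored.
    ranges:  {function_name: (lo, hi)} inclusive address ranges.
    """
    counts = {name: Counter() for name in ranges}
    for pc, event in samples:
        if event not in EVENTS:
            continue
        for name, (lo, hi) in ranges.items():
            if lo <= pc <= hi:
                counts[name][event] += 1
    return counts


def hit_rate(counter):
    """Percentage of accesses that hit, or None if there were none."""
    hits = counter["read_hit"] + counter["write_hit"]
    total = hits + counter["read_miss"] + counter["write_miss"]
    return 100.0 * hits / total if total else None
```

A call such as `hit_rate(count_cache_events(samples, ranges)["findMatch"])` would then give the percentage for one function, assuming `ranges` contains an entry named `findMatch`.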
Liquid architecture enables fast, accurate results
- Seconds: fast, but no cache performance data available
- Days: so slow you wouldn’t do this on the whole program
- ½ hour: practical, reasonably fast, totally accurate
Can profile all other aspects of micro-architecture too…
- Rows: functions in .text (main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd)
- Columns: Time / Cycles; Cache Hits / Misses (Read, Write); Pipeline Stalls; Branch Predict
How do we use the profiling info to improve application performance?
Reconfigure micro-architecture
Reconfiguration
[Diagram: FPX platform with the Statistics Module and event monitor bus; the program is compiled with gcc on the workstation. The core's I-cache, D-cache, and cache controller appear twice, showing the cache subsystem being swapped for a reconfigured one.]
Cache hits after D-cache reconfiguration
Conclusion for “large” run: D-cache doesn’t make much difference. Hit rate is already very high
Cache hits after D-cache reconfiguration
Conclusion for “small” run: Larger cache helps… Increased Associativity does not help as much
App runtime after I-cache reconfiguration
Larger I-cache doubles application performance for both “small” and “large” runs
What have we learned about BLASTN?
- ½ of execution time is spent in two methods
- D-cache size is not an influence on performance
- Large I-cache doubles the performance
- Area is better spent on I-cache, not D-cache, for this application
What can we do next?
Most execution time is spent on hash functions
[Diagram: findMatch(String); Hash; array index; Access array]
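The hash-then-index pattern can be sketched as follows. This is an illustration only, not the BLASTN implementation: the 2-bit base encoding, the word length, and the bodies of `compute_key`, `build_table`, and `find_match` are assumptions, with names loosely borrowed from the profiled methods.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def compute_key(word):
    """Pack a DNA word into an integer, two bits per base."""
    key = 0
    for base in word:
        key = (key << 2) | BASE_CODE[base]
    return key


def build_table(query, word_len):
    """Index every word of the query by its key.

    The table has one slot per possible word (4 ** word_len), so the
    key can be used directly as the array index.
    """
    table = [[] for _ in range(4 ** word_len)]
    for pos in range(len(query) - word_len + 1):
        table[compute_key(query[pos:pos + word_len])].append(pos)
    return table


def find_match(table, word):
    """Hash the word, then access the array at that index."""
    return table[compute_key(word)]


if __name__ == "__main__":
    table = build_table("ACGTACG", word_len=3)
    print(find_match(table, "ACG"))  # [0, 4]
```

The shift-and-or loop in `compute_key` is the kind of short, fixed operation that could plausibly be folded into a single custom instruction.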
Reconfigure ISA: add a hash instruction
[Diagram: FPX platform (LEON core, I-cache, D-cache, cache controller, AHB address/data bus, memory controller, SRAM / SDRAM, Command Controller, Control S/W Interface); the program is compiled with gcc on the workstation.]
Our development environment
- To avoid reloading programs on every re-run, we loaded an embedded operating system: uClinux kernel (~200K)
- UART serial port
- Ethernet device driver to mount NFS file systems
Operating system call profiling
Just select them in the interface…
- Rows: functions in .text (main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery) plus the system call read
- Columns: Time / Cycles; Cache Hits / Misses (Read, Write); Pipeline Stalls; Branch Predict
Recap - Extracting & Improving Performance on Reconfigurable Architectures
Platform
- Standard ISA, to leverage existing compilers
- FPGAs, to instrument and reconfigure
Profiling
- Cycle-accurate
- Non-intrusive
- At full speed
Reconfiguration
- Reconfigure micro-architecture to improve performance
Currently
- Reconfigure ISA and modify compiler
- Automate
- Profile operating system calls
Questions?
[Photo: FPX hardware module built at WashU, labeled with the serial port, Gigabit Ethernet, and the FPGA device with the LEON core.]
Hardware development flow
Inputs: LEON VHDL; interface support module VHDL
Steps: Compile; Simulate (ModelSim); Synthesize (Synplicity); Place and route (Virtex 2000E); Verify
Modular Design Flow (our contribution)
- Front End: Specify Regular Expression (Web, PHP)
- Back End (1): Extract search terms from SQL database
- Back End (2): Generate Finite State Machines in VHDL
- Synthesize logic to gates & flops (Synplicity Pro)
- Set boundary I/O & routing constraints (DHP)
- Place and route with constraints (Xilinx)
- Generate bitstream (Xilinx)
- Install and deploy modules over Internet to remote scanners (NCHARGE)
- In-system data scanning on FPX Platform
New, 2 million-gate packet scanner: 9 minutes
Function-wise profiling
Next steps - Automate configuration
[Diagram components: Application; Trace Analyzer; Architecture Generator; Synthesis; Compiler; Simulation; FPX Platform; Reconfiguration Server; Reconfiguration Cache; Configuration Archive. Groupings: Dynamic Adaptation; Analysis + Architecture Generation.]
Next steps - Automate (re)configuration
[Diagram: FPX platform with the Statistics Module and event monitor bus, a LEON Controller, and a Config Controller; the program is compiled with gcc on the workstation. Several LEON variants are available (LEON-v1.0, LEON-v2.0, LEON-v3.0), each with its own I-cache, D-cache, and cache controller.]
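One way the selection step of such automation might look is sketched below in Python. It is a hypothetical illustration, not the planned system: `CacheConfig`, `choose_config`, and the use of total cache size as a stand-in for FPGA area are all assumptions, and `measure_cycles` represents whatever process builds a variant, runs the application, and reads back the cycle count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """A candidate processor variant, described only by its cache sizes."""
    icache_kb: int
    dcache_kb: int

    @property
    def area_kb(self):
        # Assumption: total cache size is used as a crude proxy for area.
        return self.icache_kb + self.dcache_kb


def choose_config(candidates, measure_cycles, area_budget_kb):
    """Pick the fastest candidate that fits the area budget.

    measure_cycles: callable mapping a CacheConfig to the profiled
    cycle count of the application on that variant.
    """
    feasible = [c for c in candidates if c.area_kb <= area_budget_kb]
    if not feasible:
        raise ValueError("no candidate fits the area budget")
    return min(feasible, key=measure_cycles)
```

Keeping the measurement behind a callable means the same selection logic could be driven by stored profiles from a configuration archive or by fresh runs on the platform.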