Published by Rey Dingley. Modified over 9 years ago.
1
Liquid Architecture: Extracting & Improving Micro-architecture Performance on Reconfigurable Architectures
Shobana Padmanabhan, Phillip Jones, David Schuehler, Praveen Krishnamurthy, Scott Friedman, Huakai Zhang, Ron Cytron, John Lockwood, Roger Chamberlain, Jason Fritts
Washington University in St. Louis
http://liquid.arl.wustl.edu
Funded by NSF under grant 03-13203
Sep 22
2
Application Performance: Architecture, Compiler, Algorithm
3
Customization cost/performance tradeoff: Generic, FPGA, Custom
– Generic processor: cheap but application-agnostic; compilers exist; compiler optimization is the key
– Reconfigurable logic: subject of our study; architecture and compiler research are the key
– Customized logic: ideal for an application but expensive; logic/architecture research is the key
7
Liquid architecture combines the best of all options
Standard Architecture
– Standardized ISA, existing compilers
– × Not optimized for any specific application
– × Fixed instructions and hardware
– ~$200 - $500
Liquid Architecture on FPGA
– ISA + extras, can use modified open-source tools
– Hardware can be optimized for specific application
– Reconfigurable ISA; ~100 µs to 100 ms; person hours and not $millions
– ~$200 - $2000
Custom Architecture on Integrated Circuit
– × One-of-a-kind, nonstandard
– Optimized for specific application
– × Fixed instructions and hardware
– × ~$500,000 - $1,000,000+
8
Hardware platform overview
[Diagram: FPX board connected over the Internet to a development workstation]
– FPGA: standard ISA (SPARC 8); instrumentation and variations; interface support modules (VHDL)
– FPX: memory, network interface chip, …
FPX research was supported by NSF: ANI-0096052 and Xilinx Corp.
11
Hardware platform details
[Diagram: FPGA on the FPX containing LEON (core, I-cache, D-cache, cache controller), AHB address/data bus, memory controller, SRAM/SDRAM]
LEON: SPARC8-compatible, open soft core
12
Application execution
[Diagram: on the workstation, the program is compiled with gcc into a binary (001010 110110 001110); control S/W, interface and command controller link the workstation to the FPX, where the FPGA holds LEON (core, I-cache, D-cache, cache controller), AHB address/data bus, memory controller and SRAM/SDRAM]
Application: BLASTN DNA sequence comparison
13
Application runtime
[Diagram: FPX platform (LEON, caches, AHB address/data bus, memory controller, SRAM/SDRAM) linked to the workstation through the command controller, interface and control S/W; results & timing are shown at the workstation]
Slow! Where is time spent?
14
Software approach to profiling “time”
Steps: start with the program; introduce timers; run the instrumented program; collect execution timings
– Timers must account for their own overhead
– Instrumented program will run slower
– Instrumentation skews runtime as it affects system behavior such as cache, …
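The software approach can be sketched in a few lines of Python. This is only an illustration of timer-based instrumentation, not the tooling used in the talk: `timer_overhead`, `timed` and the stand-in workload `find_match` are hypothetical names, and the overhead correction assumes the cost of a timer pair is roughly constant.

```python
import time

def timer_overhead(samples=10_000):
    """Estimate the cost of one back-to-back timer pair, in nanoseconds."""
    total = 0
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        total += t1 - t0
    return total / samples

def timed(func, overhead_ns, timings):
    """Wrap func so that each call adds its corrected duration to timings."""
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter_ns()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter_ns() - t0
            calls, total = timings.get(func.__name__, (0, 0.0))
            # Subtract the estimated timer cost; never report negative time.
            timings[func.__name__] = (calls + 1,
                                      total + max(elapsed - overhead_ns, 0.0))
    return wrapper

def find_match(query):
    """Hypothetical stand-in workload, not the real BLASTN routine."""
    return sum(ord(c) for c in query) % 1024

timings = {}
overhead = timer_overhead()
find_match_timed = timed(find_match, overhead, timings)
for _ in range(1000):
    find_match_timed("ACGTACGT")

calls, total_ns = timings["find_match"]
print(f"find_match: {calls} calls, {total_ns:.0f} ns after overhead correction")
```

Even with the correction, the wrapper itself changes what the program executes, so the measured run differs from the uninstrumented one in code size and cache behavior.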
15
Profiling is free with liquid architecture!
16
Cycle-accurate profiling for free
[Diagram: FPX platform (LEON, caches, AHB address/data bus, memory controller, SRAM/SDRAM) with a Statistics Module attached to pc and the event monitor bus; the workstation requests timings through the control S/W, interface and command controller]
17
Liquid architecture: cycle-accurate profiling for free
Choose methods to profile from the user interface
Columns: Method | Time / Cycles
Methods (.text): main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd
18
Liquid architecture: cycle-accurate profiling for free
Columns: Method | Address Range
Methods (.text): main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd
Address range of the selected method: Lo 0x4000027C, Hi 0x400003EF
21
Liquid architecture: cycle-accurate profiling for free
Functions (.text): main, addQuery, findMatch, computeKey, computeBase, coreLoop, fillQuery, Rnd
Stats Module (inputs: PC and CLK, via the Event Monitor Bus), with one comparator pair and counter per selected function:
– Lo 0x4000027C ≤ PC (0x4000035A) ≤ Hi 0x400003EF: INCR Counter
– Lo 0x400005D8 ≤ PC (0x4000035A) ≤ Hi 0x4000061F: Counter
22
Liquid architecture: cycle-accurate profiling for free
Stats Module (inputs: PC and CLK, via the Event Monitor Bus):
– Lo 0x4000027C ≤ PC (0x4000035A) ≤ Hi 0x400003EF: INCR Counter
– Lo 0x400005D8 ≤ PC (0x4000035A) ≤ Hi 0x4000061F: Counter
Counter values go to the Command Controller
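The range-comparator scheme on these slides can be modeled in software as follows. This is a behavioral sketch only, not the VHDL of the actual Stats Module: the class names are hypothetical, the two address ranges are the ones shown on the slides, and the labels `func_a` and `func_b` are placeholders because the slides do not say which function owns which range.

```python
from dataclasses import dataclass

@dataclass
class RangeCounter:
    """One comparator pair (Lo <= PC <= Hi) feeding one cycle counter."""
    name: str
    lo: int
    hi: int
    count: int = 0

    def clock(self, pc):
        # On every clock edge, increment if the PC lies inside the range.
        if self.lo <= pc <= self.hi:
            self.count += 1

class StatsModule:
    """Behavioral model: observes the PC each cycle, never alters the program."""
    def __init__(self, ranges):
        self.counters = [RangeCounter(name, lo, hi) for name, lo, hi in ranges]

    def clock(self, pc):
        for counter in self.counters:
            counter.clock(pc)

    def report(self):
        return {c.name: c.count for c in self.counters}

# Address ranges from the slides; the function labels are placeholders.
stats = StatsModule([
    ("func_a", 0x4000027C, 0x400003EF),
    ("func_b", 0x400005D8, 0x4000061F),
])

# Hypothetical PC trace, one value per clock cycle.
for pc in [0x4000035A, 0x4000035A, 0x400005D8, 0x40000700, 0x4000027C]:
    stats.clock(pc)

print(stats.report())
```

Because the counters only watch the PC, the profiled program runs unmodified, which is the property the slides call "for free".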
23
Cycle-accurate profiling for free
[Diagram: FPX platform with the Statistics Module attached to pc and the event monitor bus; the workstation requests timings through the control S/W, interface and command controller]
Timings: findMatch 500ms, coreLoop 300ms
24
“Where time was spent” for BLASTN…
25
Cycle-accurate profiling: no application overhead; hence, at full speed
26
Cycle-accurate profiling for free
[Diagram: FPX platform with the Statistics Module attached to pc and the event monitor bus]
Is cache the problem?
27
Software approach to profiling cache
– Not possible to profile by coding!!
– Simulate cache behavior: a cache simulator produces the timings
– Slow!!
28
Software approach to profiling “cache”
– Not possible to profile by coding!!
– Scale down the program, then simulate cache behavior: a cache simulator produces the timings
– Cannot afford to simulate the entire program
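To see why software cache simulation is costly, consider a minimal simulator sketch. It assumes a direct-mapped cache with 32-byte lines, which is an illustrative choice and not necessarily LEON's configuration; the class name and the address trace are hypothetical. Every memory access of the program has to pass through `access`, which is why simulating a whole run is slow.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model that counts hits and misses."""
    def __init__(self, size_bytes=1024, line_bytes=32):
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes
        self.tags = [None] * self.num_lines  # one stored tag per cache line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.line_bytes   # memory block holding the address
        index = block % self.num_lines       # cache line the block maps to
        tag = block // self.num_lines        # identifies the block in that line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag               # miss: fill the line
        self.misses += 1
        return False

def run_trace(cache, trace):
    for address in trace:
        cache.access(address)
    return cache.hits, cache.misses

# Hypothetical trace: read a 2 KB array word by word, twice.
trace = [addr for _ in range(2) for addr in range(0, 2048, 4)]

small = DirectMappedCache(size_bytes=1024)
large = DirectMappedCache(size_bytes=4096)
print("1KB:", run_trace(small, trace))
print("4KB:", run_trace(large, trace))
```

With this trace the 2 KB working set does not fit in the 1 KB cache, so the second pass misses again on every line, while the 4 KB cache only takes the initial cold misses.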
29
How do we detect and report cache behavior using Liquid Architecture?
30
Liquid architecture: cache behavior for free
Interface extends to include cache behavior options…
Columns: Function | Time / Cycles
Functions (.text): main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd
31
Columns: Function | Time / Cycles | Cache Hits / Misses (Read, Write)
Functions (.text): main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd
32
Cache profiling
[Diagram: FPX platform with the Statistics Module attached to pc and the event monitor bus]
33
Cache behavior: hits and misses in LEON
34
Cache behavior: these signals are fed into the Event Monitoring Bus
36
Cache behavior: the Statistics Module counts events
37
Cache profiling
[Diagram: FPX platform with the Statistics Module attached to pc and the event monitor bus]
Counters: reads (hits, misses); writes (hits, misses)
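The per-function hit and miss counts can be turned into hit rates as sketched below. The slides do not spell out how cache events are attributed to functions; this sketch assumes each event is tagged with the PC at which it occurred and matched against the function's address range. The event names, the function labels `func_a` and `func_b`, and the event list are all hypothetical.

```python
from collections import Counter

# Address ranges from the profiling slides; the labels are placeholders.
FUNCTION_RANGES = {
    "func_a": (0x4000027C, 0x400003EF),
    "func_b": (0x400005D8, 0x4000061F),
}

def count_events(events):
    """Count (function, event) pairs; events are (pc, event_name) tuples."""
    counts = Counter()
    for pc, event in events:
        for name, (lo, hi) in FUNCTION_RANGES.items():
            if lo <= pc <= hi:
                counts[(name, event)] += 1
    return counts

def hit_rate(counts, name, kind):
    """Percent hit rate for kind 'read' or 'write'; None if no accesses."""
    hits = counts[(name, f"{kind}_hit")]
    misses = counts[(name, f"{kind}_miss")]
    total = hits + misses
    return 100.0 * hits / total if total else None

# Hypothetical event stream as it might appear on an event monitor bus.
events = (
    [(0x4000035A, "read_hit")] * 3
    + [(0x4000035A, "read_miss")]
    + [(0x400005D8, "write_hit"), (0x400005D8, "write_miss")]
)

counts = count_events(events)
print(hit_rate(counts, "func_a", "read"))
print(hit_rate(counts, "func_b", "write"))
```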
38
[Chart: % cache hit rate for a 1KB D-cache]
Function-wise cache profiling, in reasonable time
39
Liquid architecture enables fast, accurate results
– Seconds: fast, but no cache performance data available
– Days: so slow you wouldn’t do this on the whole program
– ½ hour: practical, reasonably fast, totally accurate
42
Can profile all other aspects of micro-architecture too…
Columns: Function | Time / Cycles | Cache Hits / Misses (Read, Write) | Pipeline Stalls | Branch Predict
Functions (.text): main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd
43
How do we use the profiling info to improve application performance?
44
Reconfigure micro-architecture
47
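Reconfiguration
[Diagram: on the workstation, the program is compiled with gcc; the FPX FPGA holds the core, I-cache, D-cache, cache controller, AHB address/data bus, memory controller, SRAM/SDRAM, Statistics Module and event monitor bus, linked to the workstation through the command controller, interface and control S/W; a second I-cache / D-cache / cache controller block (001010 110110 001110) is shown for the reconfigured caches]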
48
Cache hits after D-cache reconfiguration
50
Conclusion for “large” run: D-cache configuration doesn’t make much difference; the hit rate is already very high.
51
Cache hits after D-cache reconfiguration
52
Conclusion for “small” run: a larger cache helps; increased associativity does not help as much.
53
App runtime after I-cache reconfiguration
54
Larger I-cache doubles application performance for both “small” and “large” runs
59
What have we learned about BLASTN?
– ½ execution time in two methods
– D-cache size not an influence on performance
– Large I-cache doubles the performance
– Area better spent on I-cache, not D-cache, for this application
60
What can we do next?
61
Most execution time is spent on hash functions
[Diagram labels: findMatch(String); Hash; array index; Access array]
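The pattern on this slide, hashing a string to an array index and then accessing the array, can be sketched as follows. This is not the BLASTN source: the function names echo the method list on the profiling slides, but their bodies, the 2-bit base encoding and the modulo hash are assumptions chosen for illustration.

```python
# Assumed 2-bit encoding of DNA bases; the real BLASTN encoding may differ.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def compute_key(word):
    """Pack a DNA word into an integer, two bits per base."""
    key = 0
    for base in word:
        key = (key << 2) | BASE_CODE[base]
    return key

def hash_index(key, table_size):
    """Hypothetical hash: reduce the key to an array index."""
    return key % table_size

def add_query(table, word):
    """Store a query word in the bucket selected by its hash."""
    table[hash_index(compute_key(word), len(table))].append(word)

def find_match(table, word):
    """Hash the word, index the array, and check the bucket."""
    return word in table[hash_index(compute_key(word), len(table))]

table = [[] for _ in range(16)]
add_query(table, "ACGT")
print(compute_key("ACGT"), hash_index(compute_key("ACGT"), len(table)))
print(find_match(table, "ACGT"), find_match(table, "TTTT"))
```

A routine this small, executed for every word of the input, is the kind of hot spot that a dedicated hash instruction would target.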
63
Reconfigure ISA: + hash instruction
[Diagram: on the workstation, the program is compiled with gcc into a binary (001010 110110 001110); the FPX FPGA holds LEON (core, I-cache, D-cache, cache controller), AHB address/data bus, memory controller and SRAM/SDRAM, linked to the workstation through the command controller, interface and control S/W]
68
Our development environment
To avoid reloading programs during re-runs, we loaded an embedded operating system:
– uClinux kernel (~200K)
– UART serial port
– Ethernet device driver to mount NFS file systems
69
Operating system call profiling: just select them in the interface…
70
Columns: Function | Time / Cycles | Cache Hits / Misses (Read, Write) | Pipeline Stalls | Branch Predict
Functions (.text): main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, read
76
Recap - Extracting & Improving Performance on Reconfigurable Architectures
Platform
–Standard ISA, to leverage existing compilers
–FPGAs, to instrument and reconfigure
Profiling
–Cycle-accurate
–Non-intrusive
–At full speed
Reconfiguration
–Reconfigure micro-architecture to improve performance
Currently
–Reconfigure ISA and modify compiler
–Automate
–Profile operating system calls
77
Questions?
http://liquid.arl.wustl.edu
[Photo: FPX hardware module built at WashU; labels: serial port, Gigabit Ethernet, FPGA device with LEON core]
78
Hardware development flow
Inputs: LEON VHDL and interface support module VHDL
Steps: Compile, Simulate (Modelsim), Synthesize (Synplicity), Place n’ Route (Virtex 2000E), Verify
79
Modular Design Flow (our contribution)
– Front End: Specify Regular Expression (Web, PHP)
– Back End (1): Extract Search terms from SQL database
– Back End (2): Generate Finite State Machines in VHDL
– Synthesize Logic to gates & flops (Synplicity Pro)
– Set Boundary I/O & Routing Constraints (DHP)
– Place and Route with constraints (Xilinx)
– Generate bitstream (Xilinx)
– Install and deploy modules over Internet to remote scanners (NCHARGE)
– In-System Data Scanning on FPX Platform
New, 2 Million-gate Packet Scanner: 9 Minutes
80
Function-wise profiling
81
Next steps - Automate configuration
[Diagram blocks: Application, Trace Analyzer, Architecture Generator, Synthesis, Compiler, Simulation, FPX Platform, Reconfiguration Server, Reconfiguration Cache, Configuration Archive; group labels: Dynamic Adaptation, Analysis + Architecture Generation]
82
Next steps - Automate (re)configuration
[Diagram: on the workstation, the program is compiled with gcc into a binary (001010 110110 001110); the FPX FPGA holds a LEON Controller, AHB address/data bus, memory controller, SRAM/SDRAM, Statistics Module and event monitor bus, linked through the interface and control S/W; a Config Controller is shown with LEON-v1.0, LEON-v2.0 and LEON-v3.0, each with its own I-cache, D-cache and cache controller]