Shobana Padmanabhan Phillip Jones, David Schuehler, Praveen Krishnamurthy, Scott Friedman, Huakai Zhang, Ron Cytron, John Lockwood, Roger Chamberlain,

Presentation transcript:

Liquid Architecture: Extracting & Improving Micro-architecture Performance on Reconfigurable Architectures
Shobana Padmanabhan, Phillip Jones, David Schuehler, Praveen Krishnamurthy, Scott Friedman, Huakai Zhang, Ron Cytron, John Lockwood, Roger Chamberlain, Jason Fritts
Washington University in St. Louis
Funded by NSF under grant
Sep 22

Application Performance: Architecture, Compiler, Algorithm

Customization: cost/performance tradeoff (Generic, FPGA, Custom)
–Generic processor: cheap but application-agnostic; compilers exist; compiler optimization is the key
–Reconfigurable logic: subject of our study; architecture and compiler research are the key
–Customized logic: ideal for an application but expensive; logic/architecture research is key

Liquid architecture combines the best of all options
Standard Architecture
–Standardized ISA, existing compilers
–× Not optimized for any specific application
–× Fixed instructions and hardware
–~ $200 - $500
Liquid Architecture on FPGA
–ISA + extras, can use modified open-source tools
–Hardware can be optimized for specific application
–Reconfigurable ISA; ~100us – 100ms; person hours and not $millions
–~ $200 - $2000
Custom Architecture on Integrated Circuit
–× One-of-a-kind, nonstandard
–Optimized for specific application
–× Fixed instructions and hardware
–× ~ $500, ,000,000+

Hardware platform overview FPGA Standard ISA SPARC 8 Instrumentation and variations FPX Interface support modules (VHDL) Memory, Network interface chip, … Internet Development Workstation FPX research was supported by NSF: ANI and Xilinx Corp.

Hardware platform details
FPX FPGA: LEON core, I-cache, D-cache, cache controller, AHB address/data bus, memory controller, SRAM / SDRAM
LEON: SPARC V8 compatible, open soft core

Application execution FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation program gcc BLASTN DNA Sequence Comparison

Application runtime FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation Results & Timing Slow! Where is time spent?

Software approach to profiling “time”
–Start with the program
–Introduce timers
–Run the instrumented program
–Collect execution timings
–Timers must account for their own overhead
–Instrumented program will run slower
–Instrumentation skews runtime as it affects system behavior such as cache, …
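As a rough illustration of the software approach on this slide, the sketch below wraps functions in timers and estimates the cost of the timer calls themselves. It is not taken from the presentation: `make_profiler`, `timer_overhead`, and the injectable `clock` parameter are hypothetical names chosen for this example, and Python's `time.perf_counter` stands in for whatever timer the instrumented program would use.

```python
import time
from collections import defaultdict


def make_profiler(clock=time.perf_counter):
    """Return (timed, totals): a wrapper factory and the dict it fills in.

    Hypothetical helper for illustration; `clock` is injectable so the
    behavior can be checked with a deterministic fake clock.
    """
    totals = defaultdict(float)

    def timed(fn):
        def wrapper(*args, **kwargs):
            start = clock()                      # timer call 1 (adds overhead)
            try:
                return fn(*args, **kwargs)
            finally:
                totals[fn.__name__] += clock() - start   # timer call 2
        return wrapper

    return timed, totals


def timer_overhead(clock=time.perf_counter, samples=1000):
    """Estimate the cost of one start/stop pair by timing an empty region."""
    begin = clock()
    for _ in range(samples):
        start = clock()
        _ = clock() - start
    return (clock() - begin) / samples
```

Even after subtracting the measured overhead, the wrapped program still executes extra instructions and touches extra memory, which is the skew the slide warns about.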

Profiling is free with liquid architecture!

Cycle-accurate profiling for free FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation pc Statistics Module Event monitor bus Request Timings

Liquid architecture: cycle-accurate profiling for free
Choose methods to profile from the user interface: .text, main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd
The interface reports Time / Cycles per method
Each selected function maps to an address range with Lo and Hi bounds (e.g. Hi 0x400003EF for one function, Lo 0x400005D8 for another)
Stats Module: on each CLK, the PC from the Event Monitor Bus is compared with each range (Lo ≤ PC ≤ Hi); a match increments (INCR) that range's Counter
One comparator pair and Counter per profiled function
Counter values go to the Command Controller
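The comparator-and-counter scheme on these slides can be modeled in a few lines of software. The sketch below is only a behavioral model, not the VHDL used on the FPX: `RangeCounter` and `StatsModule` are hypothetical names, and the address ranges in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RangeCounter:
    """One Lo/Hi comparator pair plus its cycle counter (illustrative model)."""
    name: str
    lo: int
    hi: int
    cycles: int = 0

    def clock(self, pc: int) -> None:
        # Increment only when Lo <= PC <= Hi on this clock tick.
        if self.lo <= pc <= self.hi:
            self.cycles += 1


class StatsModule:
    """Software model: every clock, feed the PC to all range counters."""

    def __init__(self, ranges):
        # ranges: iterable of (name, lo, hi); the addresses are assumptions.
        self.counters = [RangeCounter(n, lo, hi) for n, lo, hi in ranges]

    def clock(self, pc: int) -> None:
        for counter in self.counters:
            counter.clock(pc)

    def report(self):
        return {c.name: c.cycles for c in self.counters}
```

For example, with hypothetical ranges `[("findMatch", 0x100, 0x1FF), ("coreLoop", 0x200, 0x2FF)]`, calling `clock(pc)` once per simulated cycle and then `report()` gives the cycles spent in each function. In hardware the comparisons run in parallel beside the processor, which is why the application sees no overhead.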

Cycle-accurate profiling for free FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation pc Statistics Module Event monitor bus Request Timings findMatch 500ms coreLoop 300ms

“Where time was spent” for BLASTN…

Cycle-accurate profiling No application overhead Hence, at full speed

Cycle-accurate profiling for free FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation Statistics Module Event monitor bus pc Is cache the problem?

Software approach to profiling “cache”
–Not possible to profile by coding!!
–Simulate cache behavior with a cache simulator to obtain timings: slow!!
–Cannot afford to simulate the entire program, so scale down the program
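To make "simulate cache behavior" concrete, here is a minimal sketch of a direct-mapped cache simulator driven by an address trace. It is an assumption-laden toy, not the simulator the authors used: the function name, the 16-byte line size, and the direct-mapped organization are all chosen for illustration (the 1 KB default simply echoes the D-cache size mentioned later in the talk).

```python
def simulate_direct_mapped(trace, cache_bytes=1024, line_bytes=16):
    """Count hits and misses for a direct-mapped cache over an address trace.

    trace: iterable of byte addresses (ints). Sizes are illustrative defaults.
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # one stored tag per cache line
    hits = misses = 0
    for addr in trace:
        block = addr // line_bytes     # which memory block the address is in
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # fill the line on a miss
    return hits, misses
```

Every memory reference of the real program has to pass through a loop like this, which is why simulating a full run is so slow and why the program is usually scaled down first.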

How do we detect and report cache behavior using Liquid Architecture?

Interface extends to include cache behavior options… Liquid architecture: cache behavior for free Function Time / Cycles.text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd

Function Time / Cycles.text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd Cache Hits / Misses ReadWrite

Cache profiling FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation Statistics Module Event monitor bus pc

Cache behavior
–Hits and misses are signaled in LEON
–These signals are fed into the Event Monitoring Bus
–The Statistics Module counts events

Cache profiling FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation Statistics Module Event monitor bus Reads hits misses Writes hits misses pc
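The per-function read/write hit and miss counts shown here can be modeled by combining the PC range check with an event signal. The sketch below is a software analogy only: `CacheEventCounter`, the event names, and the address ranges are hypothetical and do not come from the LEON or FPX sources.

```python
from collections import Counter

# Event names are assumptions standing in for signals on the event monitor bus.
EVENTS = ("read_hit", "read_miss", "write_hit", "write_miss")


class CacheEventCounter:
    """Attribute cache events to whichever profiled function the PC is in."""

    def __init__(self, ranges):
        self.ranges = dict(ranges)     # name -> (lo, hi), illustrative addresses
        self.counts = Counter()

    def observe(self, pc, event):
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        for name, (lo, hi) in self.ranges.items():
            if lo <= pc <= hi:
                self.counts[(name, event)] += 1

    def hit_rate(self, name):
        """Fraction of accesses that hit, or None if nothing was observed."""
        hits = self.counts[(name, "read_hit")] + self.counts[(name, "write_hit")]
        misses = self.counts[(name, "read_miss")] + self.counts[(name, "write_miss")]
        total = hits + misses
        return hits / total if total else None
```

Calling `observe(pc, event)` for each cache access and then `hit_rate("findMatch")` yields a function-wise hit rate of the kind plotted on the following slide.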

% Cache hit rate for D-cache: 1KB Function-wise cache profiling, in reasonable time

Liquid architecture enables fast, accurate results
–Seconds: fast, but no cache performance data available
–Days: so slow you wouldn’t do this on the whole program
–½ hour: practical, reasonably fast, totally accurate

Function Time / Cycles Cache Hits / Misses ReadWrite.text main findMatch addQuery computeKey computeBase coreLoop fillQuery Rnd Pipeline Stalls Branch Predict Can profile all other aspects of micro-architecture too…

How do we use the profiling info to improve application performance?

Reconfigure micro-architecture

Reconfiguration FPGA Control S/W Interface Command Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Statistics Module Event monitor bus FPX program gcc Workstation Core I-CACHE D-CACHE Cache Controller I-CACHE D-CACHE Cache Controller

Cache hits after D-cache reconfiguration

Conclusion for “large” run: D-cache doesn’t make much difference. Hit rate is already very high

Cache hits after D-cache reconfiguration

Conclusion for “small” run: a larger cache helps; increased associativity does not help as much

App runtime after I-cache reconfiguration

Larger I-cache doubles application performance for both “small” and “large” runs

What have we learned about BLASTN?
–½ of execution time is spent in two methods
–D-cache size is not an influence on performance
–A large I-cache doubles the performance
–Area is better spent on I-cache than D-cache for this application

What can we do next?

Most execution time is spent on hash functions
findMatch(String): Hash → array index; access array
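The slide does not say which hash BLASTN's findMatch uses. Purely as an illustration of "hash to array index, then access array", the sketch below packs a short DNA word into an integer at 2 bits per base and uses it to index a table of query positions. All names (`word_key`, `build_table`, `find_match`) and the 2-bit packing are assumptions for this example, not the profiled code.

```python
# Hypothetical 2-bit encoding of the four DNA bases.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_key(word):
    """Hash step: pack a DNA word into an integer array index."""
    key = 0
    for base in word:
        key = (key << 2) | BASE_CODE[base]
    return key


def build_table(query, w):
    """Array with one slot per possible w-mer, holding its query positions."""
    table = [[] for _ in range(4 ** w)]
    for i in range(len(query) - w + 1):
        table[word_key(query[i:i + w])].append(i)
    return table


def find_match(table, word):
    """Array-access step: look up where `word` occurs in the query."""
    return table[word_key(word)]
```

A loop like the one in `word_key` runs for every word examined, which suggests why a dedicated hash instruction is an attractive ISA extension for this workload.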

FPGA FPX LEON Core I-CACHE D-CACHE Cache Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Control S/W Interface Command Controller Workstation program gcc Reconfigure ISA + hash instruction

Our development environment
To avoid reloading programs during re-runs, we loaded an embedded operating system
–uClinux kernel (~200K)
–UART serial port
–Ethernet device driver to mount NFS file systems

Operating system call profiling Just select them in the interface…

Function Time / Cycles Cache Hits / Misses ReadWrite.text main findMatch addQuery computeKey computeBase coreLoop fillQuery read Pipeline Stalls Branch Predict

Recap - Extracting & Improving Performance on Reconfigurable Architectures
Platform
–Standard ISA, to leverage existing compilers
–FPGAs, to instrument and reconfigure
Profiling
–Cycle-accurate
–Non-intrusive
–At full speed
Reconfiguration
–Reconfigure micro-architecture to improve performance
Currently
–Reconfigure ISA and modify compiler
–Automate
–Profile operating system calls

Questions? FPX Hardware Module built At WashU Serial port Gigabit Ethernet FPGA device with LEON core

Hardware development flow
LEON VHDL + interface support module VHDL → Compile → Simulate (Modelsim) → Synthesize (Synplicity) → Place and Route (Virtex 2000E) → Verify

Modular Design Flow (our contribution)
–Front End: Specify Regular Expression (Web, PHP)
–Back End (1): Extract search terms from SQL database
–Back End (2): Generate Finite State Machines in VHDL
–Synthesize logic to gates & flops (Synplicity Pro)
–Set boundary I/O & routing constraints (DHP)
–Place and Route with constraints (Xilinx)
–Generate bitstream (Xilinx)
–Install and deploy modules over Internet to remote scanners (NCHARGE)
–In-system data scanning on FPX Platform
New, 2 million-gate packet scanner: 9 minutes

Function-wise profiling

Next steps - Automate configuration Application Trace Analyzer Architecture Generator Synthesis Compiler FPX Platform Reconfiguration Server Reconfiguration Cache Dynamic Adaptation Analysis + Architecture Generation Configuration Archive Simulation

Next steps - Automate (re)configuration FPGA Control S/W Interface LEON Controller AHB Address/ Data bus Memory Controller SRAM / SDRAM Statistics Module Event monitor bus FPX program gcc Workstation Config Controller LEON-v1.0 I-CACHE D-CACHE Cache Controller LEON-v2.0 I-CACHE D-CACHE Cache Controller LEON-v3.0 I-CACHE D-CACHE Cache Controller