Codesigned On-Chip Logic Minimization
Roman Lysecky & Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

Introduction (On-chip Logic Minimization)
[Figure: system-on-chip block diagram. Labels: Proc., I$, D$, MEM, DMA, and On-chip Minimizer (ARM7, MEM). Numbered steps: 1. Initialize minimizer; 2. Execute minimizer; 3. Indicate completion.]

On-Chip Minimization Applications (IP Routing Table Reduction)
[Figure: the destination IP of an incoming packet is looked up in the routing table (columns: Prefix, Next hop) using longest prefix match.]
IP routing table reduction
- Routing tables of large network routers have over 30,000 entries
- Fast IP routing lookup is difficult without using large hardware resources
Ternary CAM (McAuley & Francis, 1993)
- A TCAM can perform a routing table lookup in a single cycle
- Requires large hardware resources and consumes a large amount of power
Mask Extension (Liu, 2002)
- Uses two-level logic minimization to reduce the size of the routing table
- Good results, but did not consider off-chip communication
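The longest prefix match named on this slide can be sketched in software as a linear scan. This is only an illustration of the lookup semantics, not the TCAM or Mask Extension method; the `Route` type, the `lookup` function, and any table entries used with it are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* One routing-table entry (hypothetical layout for illustration). */
typedef struct {
    uint32_t prefix;   /* network prefix, host byte order */
    uint8_t  length;   /* number of significant leading bits, 0..32 */
    int      next_hop; /* output port */
} Route;

/* Build the netmask for a prefix length; length 0 must yield 0
   because shifting a 32-bit value by 32 is undefined in C. */
static uint32_t mask_of(uint8_t length) {
    return length == 0 ? 0u : 0xFFFFFFFFu << (32 - length);
}

/* Return the next hop of the longest matching prefix, or -1 if none. */
int lookup(const Route *table, size_t n, uint32_t dest) {
    int best_hop = -1;
    int best_len = -1;
    for (size_t i = 0; i < n; i++) {
        uint32_t m = mask_of(table[i].length);
        if ((dest & m) == (table[i].prefix & m) && table[i].length > best_len) {
            best_len = table[i].length;
            best_hop = table[i].next_hop;
        }
    }
    return best_hop;
}
```

The scan visits every entry, so its cost grows with table size. Reducing the number of entries through logic minimization shrinks that cost for a software lookup and the number of rows needed in a TCAM.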

On-Chip Minimization Applications (Access Control List Reduction)
Access Control List (ACL)
- Used to restrict IP traffic through network routers
- ACL size can range from 300 entries (UCR CS&E Dept.) to 10,000 entries (AOL)
- A common use is to block a particular protocol or port number to avoid attacks such as denial-of-service attacks
ACL minimization
- Similar approach to the one used for IP routing table reduction
- However, the order of the list must be preserved
ACL input format: Type | Protocol | In IP | Out Port | In Port | Out IP | Action
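A small sketch of why list order matters, assuming the usual first-match evaluation of an ACL (the slide states only that order must be preserved). The `AclRule` type is a simplified, hypothetical stand-in for the seven-field input format, and the default-deny fallback is likewise an assumption.

```c
#include <stddef.h>

typedef enum { DENY = 0, PERMIT = 1 } Action;

/* Hypothetical, simplified rule: match on protocol and port only.
   A value of -1 acts as a wildcard. */
typedef struct {
    int    protocol;
    int    port;
    Action action;
} AclRule;

/* First-match semantics: the first rule that matches decides the action. */
Action acl_check(const AclRule *rules, size_t n, int protocol, int port) {
    for (size_t i = 0; i < n; i++) {
        int proto_ok = rules[i].protocol == -1 || rules[i].protocol == protocol;
        int port_ok  = rules[i].port == -1 || rules[i].port == port;
        if (proto_ok && port_ok)
            return rules[i].action;
    }
    return DENY; /* assumed default when no rule matches */
}
```

With the rules "deny TCP port 23" followed by "permit all TCP", a packet to port 23 is denied. Swapping the two rules permits it, so a minimizer may only merge or reorder rules when doing so cannot change which rule matches first.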

On-Chip Minimization Applications (Dynamic Hardware/Software Partitioning)
Dynamic hardware/software partitioning (JIT compilation for FPGAs)
- Dynamically detects frequently executed loops and re-implements those software loops using on-chip configurable logic
- Requires logic synthesis tools to be embedded on-chip
[Figure: Warp Processor block diagram. Labels: MIPS/ARM, I$, D$, Profiler, Configurable Logic, Dynamic Partitioning Module.]

ROCM
On-chip logic minimization requirements
- Limited data and instruction memory available
- Quality of results must still be close to optimal
- Execution time should remain reasonable
On-chip logic minimization goal
- Develop an on-chip logic minimization tool that produces acceptable results with reasonable increases in execution time while using limited memory resources
ROCM (Riverside On-Chip Minimizer)
- Two-level minimization tool
- Uses a combination of approaches from Espresso-II (Brayton et al., 1984) and Presto (Svoboda & White, 1979)
- Eliminates the need to compute the off-set, which reduces memory usage
- Uses a single expand phase instead of multiple iterations
- Results are on average only 2% larger than the optimal solution

ROCM Results (Performance/Memory Usage)
[Figure: results chart for two platforms: 500 MHz Sun Ultra60 and 40 MHz ARM7 (Triscend A7).]
- ROCM executing on the 40 MHz ARM7 requires less than 1 second
- Small code size of only 22 kilobytes
- Average data memory usage of only 1 megabyte

Codesign ROCM (Hardware Coprocessor)
- Customizing ROCM enables us to develop an efficient hardware coprocessor
- Profiled the execution of ROCM-32 and ROCM-128 using the ARM port of the SimpleScalar simulator
- Determined the critical loops/functions that are suitable for implementation in hardware
- Identified six critical kernels that comprise 91% of the total execution time but only 2% of the code size
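As a rough sanity check on what moving these kernels to hardware could yield, Amdahl's law gives an upper bound. This sketch assumes the 91% fraction holds for the benchmark being measured and ignores communication overhead; the symbols f (accelerated fraction) and s (kernel speedup) are introduced here for illustration.

```latex
S(s) = \frac{1}{(1 - f) + \dfrac{f}{s}}, \qquad f = 0.91
\quad\Longrightarrow\quad
S_{\max} = \lim_{s \to \infty} S(s) = \frac{1}{1 - 0.91} \approx 11.1
```

Under these assumptions the overall speedup cannot exceed roughly 11x however fast the coprocessor is, so the average speedup of 7.8 reported in the results is within that bound.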

Codesign ROCM (Minimization Coprocessor)
[Figure: on-chip minimizer consisting of MEM, ARM7, and the minimization coprocessor (Min. Coproc.). The coprocessor connects through a Proc/Mem interface (data, addr) to six kernels: DoesInter, IsCov, GetLit, SetLit, Tautology.1, Cofactor.1.]

Codesign ROCM (Minimization Coprocessor)
[Figure: detail of the DoesIntersect (DoesInter) kernel within the minimization coprocessor. Datapath labels: inputs aImpl, dImpl, numLits; shift (<<), 1, 32, (odd), (even), comparison == 0; output retVal.]
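One common way to test whether two implicants intersect uses a 2-bit-per-literal encoding, which makes the check a handful of bitwise operations and therefore a natural fit for hardware. The sketch below assumes positional cube notation and a single 32-bit word per implicant; the slides do not show ROCM's actual encoding or datapath, and `does_intersect` is a hypothetical name.

```c
#include <stdint.h>

/* Assumed encoding (positional cube notation), 2 bits per literal:
   01 = variable must be 0, 10 = variable must be 1, 11 = don't care.
   Literal i occupies bits 2i and 2i+1, so a 32-bit word holds 16 literals. */
#define EVEN_BITS 0x55555555u

/* Returns 1 if the two implicants share at least one minterm, else 0. */
int does_intersect(uint32_t a, uint32_t b, unsigned num_lits) {
    /* Per-literal intersection: a field of 00 means the literals conflict. */
    uint32_t both = a & b;
    /* Fold the odd bit of each field onto the even bit: 1 where field != 00. */
    uint32_t nonempty = (both | (both >> 1)) & EVEN_BITS;
    /* Even-bit mask covering only the literals actually in use. */
    uint32_t used = num_lits >= 16
        ? EVEN_BITS
        : ((((uint32_t)1 << (2 * num_lits)) - 1) & EVEN_BITS);
    return (nonempty & used) == used;
}
```

For example, with three literals the cubes x0·x2' (0x1E) and x0·x1' (0x36) intersect, while x0·x2' and x0' (0x3D) do not, because the x0 field of their AND is 00.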

Codesign ROCM Results (Execution Time)
- Average speedup of 7.8

Codesign ROCM Results (Energy Consumption)
- Average energy reduction of 59.2%

Codesign ROCM (Minimization Coprocessor)
- Software modifications were required to achieve the speedup of 7.8
- The original data structures and algorithms were not suitable for hardware implementation
  - Reorganized data structures
  - Customized the width of data items
  - Eliminated memory allocation within critical regions
- These changes are not automated by current hardware/software partitioning tools

Codesign ROCM (Minimization Coprocessor)
Original C Code

    for (i = 0; i < numImplicants; i++) {
        if (!DoesIntersect(implicant, xj)) continue;
        for (k = 0; k < numLiterals; k++) {
            // determine coImplicant
            ...
        }
        AddImplicant(cofactor, &coImplicant);
    }

- The loop is the candidate to move to hardware: 28.5% of total execution time
- AddImplicant(cofactor, &coImplicant) accounts for only 3.5% of total execution time but requires dynamic memory allocation

Codesign ROCM (Minimization Coprocessor)
Modified C Code

    // determine size of cofactor initially
    cofactorSize = 0;
    for (i = 0; i < numImplicants; i++) {
        if (!DoesIntersect(implicant, xj)) continue;
        cofactorSize++;
    }

    // allocate all memory outside of main loop
    cofactor->implicants = malloc(…);

    for (i = 0; i < numImplicants; i++) {
        if (!DoesIntersect(implicant, xj)) continue;
        // additional initialization code needed for each iteration
        coImplicant = &(cofactor->implicants[index++]);
        for (k = 0; k < numLiterals; k++) {
            ...
        }
    }
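The count-then-allocate pattern in the modified code can be shown as a self-contained sketch. The `Implicant` and `Cover` types, the `intersects` stand-in, and the single AND that replaces the per-literal loop are all hypothetical; the slides do not show ROCM's real data structures.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical types for illustration only. */
typedef struct { uint32_t bits; } Implicant;
typedef struct { Implicant *implicants; size_t count; } Cover;

/* Stand-in intersection test: true if the two bit patterns overlap. */
static int intersects(const Implicant *a, const Implicant *b) {
    return (a->bits & b->bits) != 0;
}

/* Pass 1 counts the surviving implicants, a single malloc sizes the result,
   and pass 2 fills it. Returns 0 on success, -1 if allocation fails. */
int build_cofactor(const Cover *src, const Implicant *xj, Cover *out) {
    size_t n = 0;
    for (size_t i = 0; i < src->count; i++)
        if (intersects(&src->implicants[i], xj))
            n++;

    out->count = 0;
    out->implicants = NULL;
    if (n == 0)
        return 0;

    out->implicants = malloc(n * sizeof *out->implicants);
    if (out->implicants == NULL)
        return -1;

    for (size_t i = 0; i < src->count; i++) {
        if (!intersects(&src->implicants[i], xj))
            continue;
        /* Claim the next preallocated slot; no allocation in the hot loop. */
        Implicant *co = &out->implicants[out->count++];
        /* Placeholder for the per-literal loop that computes the cofactor. */
        co->bits = src->implicants[i].bits & xj->bits;
    }
    return 0;
}
```

The trade-off is a second pass of intersection tests in exchange for a main loop free of allocator calls, which is what lets the loop body map onto a coprocessor that has no access to the software heap manager.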

Conclusions & Future Work
Developed codesigned on-chip logic minimization
- Performance improvement of nearly 8X compared to the earlier software-only implementation
- Energy reduction of almost 60%
New directions in hardware/software partitioning
- Designer effort was required to rewrite algorithms and fine-tune data structures
- Could better hardware/software partitioning tools automate this?