Roman Lysecky, University of California, Riverside. Pre-fetching for Improved Core Interfacing. Roman Lysecky, Frank Vahid, Tony Givargis, & Rilesh Patel.

Presentation transcript:

Pre-fetching for Improved Core Interfacing
Roman Lysecky, Frank Vahid, Tony Givargis, & Rilesh Patel
Department of Computer Science, University of California, Riverside, CA
{rlysecky, vahid, givargis,
This work was supported in part by the NSF and a DAC scholarship.

Introduction
[Figure labels: Core Library, MIPS, MEM, Cache, DSP, DMA, Core X, Core Y]
Core-based designs are becoming common
  – Cores are available in both soft and hard form
Problem: How can interfacing be simplified to ease integration?

Introduction
One solution: a single standard on-chip bus
  – All cores have the same interface
  – Appears to be unlikely (VSIA)
Another solution: divide each core into a bus wrapper and internal parts
  – Rowson and Sangiovanni-Vincentelli '97: Interface-Based Design
  – VSIA is developing a standard for the interface between wrapper and internals
  – Far simpler than a standard on-chip bus
  – We refer to the bus wrapper as an interface module (IM)

Introduction
Problem: using an interface module can result in extra cycles for reads
Pre-fetching can reduce or eliminate the extra cycles
Outline
  – Interfacing options
  – Classification of registers and common register occurrences
  – Architecture of the IM and pre-fetch heuristics
  – Experiments
  – Conclusions

No Interface Module (IM)
Interface logic is designed as part of the core's internal logic
Pros
  – Small size
  – High performance (no overhead)
Cons
  – May be hard to integrate with different busses

Separating a Core into IM & Internals
The interface module is separate from the core internals
  – Standard bus between IM and internals
Pros
  – Easily integrates with different busses
  – Any changes are restricted to the IM
Cons
  – May incur performance overhead due to the interface module
  – Possible increases in size and power

Proposed Solution - Pre-fetching in IM
Pre-fetching
  – Analogous to caching: store local copies of registers inside the interface module
  – Enables quick response time
  – Eliminates extra cycles for register reads
  – Transparent to system bus and core internals
Pros
  – Easily integrates with different busses
  – No performance overhead
Cons
  – Possible increases in size and power
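The caching analogy can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the class name `InterfaceModule`, its methods, and both cycle constants are hypothetical and do not come from the slides. It also ignores how local copies are kept consistent with the core, which a real pre-fetch scheme has to handle.

```python
from dataclasses import dataclass, field

# Hypothetical cycle costs, chosen only for illustration (not from the slides).
BUS_READ_CYCLES = 2        # cycles for a read answered directly at the system bus
INTERNAL_READ_CYCLES = 2   # extra cycles for the IM to fetch from the core internals


@dataclass
class InterfaceModule:
    """Toy model of an interface module (IM) with optional pre-fetch registers."""
    core_registers: dict                             # stands in for the core internals
    prefetch_enabled: bool = True
    prefetched: dict = field(default_factory=dict)   # local copies held in the IM

    def prefetch(self, name):
        """Copy a core register into the IM ahead of any system-bus read."""
        if self.prefetch_enabled:
            self.prefetched[name] = self.core_registers[name]

    def read(self, name):
        """Return (value, cycles) for a system-bus read of register `name`."""
        if name in self.prefetched:
            # Hit: answer from the local copy, no trip to the core internals.
            return self.prefetched[name], BUS_READ_CYCLES
        # Miss: forward the read to the internals and pay the extra cycles.
        return self.core_registers[name], BUS_READ_CYCLES + INTERNAL_READ_CYCLES
```

In this model a read that hits a pre-fetched copy costs the same as a read with no IM at all, which is the sense in which pre-fetching removes the extra cycles.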

Classification of Core Registers
Different registers need different pre-fetch schemes
A classification of registers is needed
  – Update Type
  – Access Type
  – Notification Type
  – Structure Type

Common Register Types
We identified three common register combinations found in cores
  – Configuration, Task, and Input-buffered registers
  – We implemented cores representative of each of these three combinations
  – We provide a classification for the registers in each of the cores

Common Register Types
Core1 - Configuration Registers
  – Example: configuration registers in a UART or DMA controller
Configuration Register (D)

Common Register Types
Core2 - Task Registers
  – Example: JPEG or MPEG CODEC, or DES encryption
Data Input Register (DI)
Data Output Register (DO)
Status Register (S)

Common Register Types
Core3 - Input-buffered Registers
  – Example: FIFO or UART
Status Register (S)
Data Register (D)

Architecture of IM pre-fetch registers
Pre-fetch Unit - Implements the pre-fetching heuristic
  – Goal: maximize the number of hits
Controller - Interfaces to system bus

Pre-fetch Heuristic for Core2
Core2 - Task Registers
  – After the system writes to register DI:
      Read S into pre-fetch register S'
      When S indicates completion, read DO from the core into pre-fetch register DO'
  – Repeat this process
Similar heuristics were developed for Core1 and Core3
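The Core2 heuristic can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: `TaskCore`, `PrefetchUnit`, `run`, the tick-based timing, the doubling "task", and the clearing of S' on a DI write are all hypothetical choices made for the example, and `latency` is assumed to be at least 1.

```python
class TaskCore:
    """Hypothetical task core: writing DI starts a job lasting `latency` ticks."""

    def __init__(self, latency=3):
        self.latency = latency   # assumed to be >= 1
        self.remaining = 0
        self.S = 0               # status register: 1 means DO holds a finished result
        self.DO = None           # data output register
        self._di = None          # last value written to DI

    def write_DI(self, value):
        self._di = value
        self.remaining = self.latency
        self.S = 0

    def tick(self):
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.DO = self._di * 2   # placeholder for the real task
                self.S = 1


class PrefetchUnit:
    """Sketch of the Core2 heuristic: after a DI write, poll S, then fetch DO."""

    def __init__(self, core):
        self.core = core
        self.polling = False
        self.S_copy = 0       # pre-fetch register S'
        self.DO_copy = None   # pre-fetch register DO'

    def system_write_DI(self, value):
        self.core.write_DI(value)
        self.S_copy = 0       # assumption: clear S' so a stale "done" is never served
        self.polling = True

    def tick(self):
        if not self.polling:
            return
        status = self.core.S               # read S from the core
        if status == 1:
            self.DO_copy = self.core.DO    # fetch the result before publishing S'
            self.polling = False
        self.S_copy = status


def run(value, latency=3):
    """Write one value to DI and tick until S' reports completion."""
    core = TaskCore(latency)
    unit = PrefetchUnit(core)
    unit.system_write_DI(value)
    ticks = 0
    while unit.S_copy != 1:
        core.tick()
        unit.tick()
        ticks += 1
    return unit.DO_copy, ticks
```

Once S' reads 1, a system-bus read of either S or DO can be answered from the pre-fetch registers without another access to the core internals.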

Experiments - Area (Gates)
Note: To better evaluate the effects of IMs, our cores were kept simple, resulting in smaller-than-normal sizes.
Average increase of IM without pre-fetching (PF) over no IM: 1.4K gates
Average increase of IM with PF over IM without PF: 1.3K gates

Experiments - Performance (ns)

Experiments - Energy (nJ)

Digital Camera Peripheral Read Access (cycles)
12% of execution time for peripheral reads
50% decrease in peripheral read access
25% decrease in overall peripheral access
3.2% improvement in overall system performance

Conclusion
Separating the interface from the internals eases core integration but may increase read cycles
Pre-fetching eliminated the performance degradation in common cases
  – Increases in size and power were acceptable
  – Transparent to system bus and core internals
  – Pre-fetching thus improves the marketability of cores