Techniques for Reducing Read Latency of Core Bus Wrappers. Roman L. Lysecky, Frank Vahid, & Tony D. Givargis (University of California, Riverside).

Presentation transcript:

Slide 1: Techniques for Reducing Read Latency of Core Bus Wrappers
Roman L. Lysecky, Frank Vahid, & Tony D. Givargis
Department of Computer Science, University of California, Riverside, CA
{rlysecky, vahid,
This work was supported in part by the NSF and a DAC scholarship.

Slide 2: Introduction
[Figure: core library containing MIPS, MEM, Cache, DSP, DMA, Core X, and Core Y]
Core-based designs are becoming common
  – Cores are available in both soft and hard forms
Problem: How can interfacing be simplified to ease integration?

Slide 3: Introduction
One solution: a single standard on-chip bus
  – All cores have the same interface
  – Appears to be unlikely (VSIA)
Another solution: divide each core into a bus wrapper and internal parts
  – Rowson and Sangiovanni-Vincentelli '97, Interface-Based Design
  – VSIA is developing a standard for the interface between wrapper and internals
      Far simpler than a standard on-chip bus
  – We refer to the bus wrapper as an interface module (IM)

Slide 4: Previous Work - Pre-fetching
Analogous to caching: store local copies of registers inside the interface module
Enables quick response time
Eliminates extra cycles for register reads
Transparent to system bus and core internals
Easily integrates with different busses
No performance overhead
Acceptable increases in size and power
Pre-fetching was manually added to each core

Slide 5: Previous Work - Architecture of IM
Pre-fetch registers
Pre-fetch Unit (PFU) - Implements the pre-fetching heuristic
  Goal: maximize the number of hits
Controller - Interfaces to system bus
How can we automate the design of the PFU?

Slide 6: Outline
"Real-time" Pre-fetching
  – Mapping to real-time scheduling
Update Dependency Model
  – General Register Attributes
  – Petri Net model construction
  – Petri Net model refinement
  – Pre-fetch Scheduling
Experiments
Conclusions

Slide 7: Real-time Pre-fetching
Example: register A has age constraint = 4, register B has age constraint = 6, access-time constraint = 2
[Figure: naïve schedule versus more efficient schedule]
Age constraint
  – Number of cycles old the data may be when read
Access-time constraint
  – Maximum number of cycles a read access may take

Slide 8: Real-time Pre-fetching
Mapping to real-time scheduling
  – Register -> Process
  – Internal bus -> Processor
  – Pre-fetch -> Process execution
  – Register age constraint -> Process period
  – Register access-time constraint -> Process deadline
  – Pre-fetch time -> Process computation time
Assume a pre-fetch requires 2 cycles
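The mapping on this slide can be written down as a small data-structure sketch. The names `Register`, `Task`, `to_task`, and `PREFETCH_CYCLES` are hypothetical and do not come from the slides; the 2-cycle pre-fetch time is the assumption stated on the slide, and the A/B values are taken from the example on the previous slide.

```python
from dataclasses import dataclass

# Slide assumption: one pre-fetch occupies the internal bus for 2 cycles.
PREFETCH_CYCLES = 2


@dataclass(frozen=True)
class Register:
    name: str
    age_constraint: int          # max cycles old the data may be when read
    access_time_constraint: int  # max cycles a read access may take


@dataclass(frozen=True)
class Task:
    name: str
    period: int       # taken from the register age constraint
    deadline: int     # taken from the register access-time constraint
    computation: int  # taken from the pre-fetch time


def to_task(reg: Register) -> Task:
    """Map a register to a real-time process, following the slide's table."""
    return Task(reg.name, reg.age_constraint,
                reg.access_time_constraint, PREFETCH_CYCLES)


tasks = [to_task(Register("A", 4, 2)), to_task(Register("B", 6, 2))]
```

The internal bus plays the role of the processor, so all tasks in `tasks` compete for the same single resource.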

Slide 9: Real-time Pre-fetching
Cyclic Executive
  – Major cycle = time required to pre-fetch all registers
  – Minor cycle = rate at which the highest-priority process will be executed
  – Problems
      Sporadic writes
      All process periods must be multiples of the minor cycle
      Computationally infeasible for large register sets

Slide 10: Real-time Pre-fetching
Rate monotonic priority assignment
  – The register with the smallest age constraint has the highest priority
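A minimal sketch of rate monotonic priority assignment as described on this slide. The function name `rate_monotonic_order` and register "C" are illustrative assumptions; breaking ties alphabetically is also an assumption, since the slide does not say how ties are resolved.

```python
def rate_monotonic_order(age_constraints):
    """Return register names from highest to lowest priority.

    age_constraints maps a register name to its age constraint in cycles.
    The smallest age constraint (shortest period) gets the highest priority.
    Ties are broken by name, which is an assumption of this sketch.
    """
    return sorted(age_constraints, key=lambda name: (age_constraints[name], name))


order = rate_monotonic_order({"A": 4, "B": 6, "C": 3})
print(order)
```

Here register C would be pre-fetched ahead of A and B whenever more than one register is due.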

Slide 11: Real-time Pre-fetching
Utilization-based schedulability test
[Equation: not preserved in the transcript]
Ci = Computation Time for register i
Ai = Pre-fetch Time for register i
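The slide's formula is not preserved in this transcript. The sketch below assumes it is the standard Liu and Layland utilization bound from rate monotonic theory, with each register's age constraint used as its period (per the mapping slide) and the 2-cycle pre-fetch time as its computation time. The function names are hypothetical.

```python
def utilization(tasks):
    """Total bus utilization; tasks is a list of (computation, period) pairs."""
    return sum(c / t for c, t in tasks)


def liu_layland_bound(n):
    """Classic sufficient bound n * (2^(1/n) - 1) for n rate monotonic tasks."""
    return n * (2 ** (1 / n) - 1)


def passes_utilization_test(tasks):
    """True if the set is guaranteed schedulable by the (assumed) bound."""
    return utilization(tasks) <= liu_layland_bound(len(tasks))


# Registers A and B from the earlier example: 2-cycle pre-fetch, ages 4 and 6.
tight = [(2, 4), (2, 6)]
# The same registers with relaxed (doubled) age constraints.
relaxed = [(2, 8), (2, 12)]

tight_ok = passes_utilization_test(tight)
relaxed_ok = passes_utilization_test(relaxed)
```

Under these assumptions the test is sufficient but not necessary: a register set that fails it may still be schedulable, which is why a more exact analysis is useful.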

Slide 12: Real-time Pre-fetching
Response Time Analysis
  – Response time of register i is defined as Ri = Ci + Ii
  – The register set is schedulable if, for each register, the response time is less than or equal to its age constraint
Ri = Response Time for register i
Ci = Computation Time for register i
Ii = Maximum interference in interval [t, t+Ri)
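A sketch of how the response time could be computed, assuming the standard fixed-point iteration from fixed-priority scheduling theory, in which the interference on register i is the sum over higher-priority registers j of ceil(Ri / Tj) * Cj. That interference formula and the function names are assumptions; the slide only names the terms Ri, Ci, and Ii.

```python
def response_time(index, tasks):
    """Worst-case response time of tasks[index], or None if it misses.

    tasks is a list of (computation, period) pairs sorted from highest
    to lowest priority; the period is the register's age constraint.
    """
    c_i, age_constraint = tasks[index]
    r = c_i
    while r <= age_constraint:
        # Integer ceiling of r / t_j, summed over higher-priority tasks.
        interference = sum(-(-r // t_j) * c_j for c_j, t_j in tasks[:index])
        r_next = c_i + interference
        if r_next == r:
            return r  # fixed point reached
        r = r_next
    return None  # response time exceeds the age constraint


def schedulable(tasks):
    """Schedulable if every register meets its age constraint."""
    return all(response_time(i, tasks) is not None for i in range(len(tasks)))


# Registers A (age 4) and B (age 6), each with a 2-cycle pre-fetch.
example = [(2, 4), (2, 6)]
r_a = response_time(0, example)
r_b = response_time(1, example)
```

For this two-register example the iteration converges within both age constraints, even though the same set does not satisfy the simpler utilization bound.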

Slide 13: Real-time Pre-fetching
Sporadic register writes
  – Writes to registers are sporadic
  – Writes take control of the internal bus, thus delaying pre-fetching of registers
Deadline monotonic priority
  – The register with the smallest access-time constraint has the highest priority
  – Add a write register WR to the register set
      Access-time constraint = deadline
      Age constraint = maximum rate at which writes will occur
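A minimal sketch of deadline monotonic ordering with the extra write register WR. The function name, the tuple layout, and the numeric values are hypothetical; in particular, modelling the "maximum rate at which writes will occur" as a minimum number of cycles between writes is an assumption of this sketch.

```python
def deadline_monotonic_order(registers, write_deadline, write_min_interval):
    """Return register names from highest to lowest priority.

    registers maps a name to (access_time_constraint, age_constraint).
    A pseudo-register "WR" is added to stand for sporadic writes:
    its access-time constraint is the write deadline and its age
    constraint is the assumed minimum interval between writes.
    """
    entries = dict(registers)
    entries["WR"] = (write_deadline, write_min_interval)
    # Smallest access-time constraint first; ties broken by name (assumption).
    return sorted(entries, key=lambda name: (entries[name][0], name))


dm_order = deadline_monotonic_order(
    {"A": (2, 4), "B": (3, 6)}, write_deadline=1, write_min_interval=10
)
print(dm_order)
```

Treating writes as one more entry in the register set lets the same schedulability analysis account for the bus time that writes consume.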

Slide 14: Experiments - Area (Gates)
Note: To better evaluate the effects of IMs, our cores were kept simple, resulting in a smaller than normal size.
Average increase of IM w/ RTPF over IM w/ BW: 1.4K gates

Slide 15: Experiments - Performance (ns)

Slide 16: Experiments - Energy (nJ)

Slide 17:
Register Attributes
  – Update type, access type, notification type, and structure type
Update dependencies
  – Internal dependencies
      Dependencies between registers
  – External dependencies
      Updates to a register via reads and writes from the on-chip bus
      Updates from external ports to an internal core register
Petri Nets
  – We determined that Petri Nets can model our update dependencies

Slide 18: Petri Net Based Dependency Model
[Figure: Petri net with labeled bus place, random transition, register places, and update dependencies]

Slide 19: Refined Petri Net Model
[Figure: refined Petri net with labeled data dependency and refined transition]

Slide 20: Pre-fetch Schedule
Create a heap of registers to be pre-fetched
Create a list for update arcs
Repeat
  – if request detected then
      add outgoing arcs to heap
      set write register access-time to 0 and add to heap
  – if read request detected then
      add outgoing arcs to update arc list
  – for register at top of heap do
      if access-time = 0 then pre-fetch register, remove from heap
      if current age = 0 then pre-fetch register, reset current age, add register to heap
  – while update arcs list is not empty do
      if transition fires then set register's access-time to 0 and add to heap

Slide 21: Experiments - Area (Gates)
Note: To better evaluate the effects of IMs, our cores were kept simple, resulting in a smaller than normal size.
Average increase of IM w/ PF over IM w/ BW: 1.5K gates
Average increase of IM w/ PF over IM w/ RTPF: 0.1K gates

Slide 22: Experiments - Performance (ns)

Slide 23: Experiments - Energy (nJ)

Slide 24: Conclusions
Real-time pre-fetching and update dependency pre-fetching both produce good results
The update dependency model is more efficient in pre-fetching registers
The two approaches are complementary
Both enable the automatic generation of the pre-fetching unit