Implementing a RISC Multi-Core Processor Using an HLS Language – Bluespec
Final Presentation. Liam Wigdor. Advisor: Mony Orbach. Shirel Josef. One-semester project, Winter 2013.
Department of Electrical Engineering (Electronics, Computers, Communications), Technion – Israel Institute of Technology

AGENDA
– Introduction
– BlueSpec Development Environment
– Project Goals
– Project Requirements
– Design Overview
– Design Stage 1 – Instruction Memory
– Design Stage 2 – Data Memory
– Design Stage 3 – Multi-Core
– The Scalable Processor
– Benchmarks & Results
– Summary & Conclusion

Introduction
– The future of the single core is gloomy.
– Multiple cores can be used for parallel computing.
– Multiple cores may be used as application-specific accelerators as well as general-purpose cores.
– Ecclesiastes 4:9: "Two are better than one."

BlueSpec Development Environment
– A high-level language for hardware description.
– Rules describe dynamic behavior: they are atomic, fire at most once per cycle, can run concurrently when they do not conflict, and are scheduled automatically by BlueSpec.
– Module: the same as an object in an object-oriented language.
– Interface: a module can be manipulated via the methods of its interface, and an interface can be used by the parent module only!
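For illustration, a minimal Bluespec sketch of these concepts (the classic GCD example, not taken from the project's code): two guarded rules describe the dynamic behavior, and the parent module can reach the state only through the interface methods.

interface GCD;
    method Action start(Bit#(32) a, Bit#(32) b);
    method Bit#(32) result;
endinterface

module mkGCD (GCD);
    Reg#(Bit#(32)) x <- mkReg(0);
    Reg#(Bit#(32)) y <- mkReg(0);

    // Each rule is atomic and fires at most once per cycle; the
    // BlueSpec compiler schedules them automatically.
    rule swap (x > y && y != 0);
        x <= y;
        y <= x;
    endrule

    rule subtract (x <= y && y != 0);
        y <= y - x;
    endrule

    // The module's state is visible to its parent only through
    // these interface methods.
    method Action start(Bit#(32) a, Bit#(32) b) if (y == 0);
        x <= a;
        y <= b;
    endmethod

    method Bit#(32) result if (y == 0);
        return x;
    endmethod
endmodule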

Project Goals
Main goals:
– Implement a RISC multi-core processor using BlueSpec.
– Evaluate and analyze the performance of the multi-core design compared to a single core.
Derived goals:
– Learn the BlueSpec principles, syntax, and working environment.
– Understand and use a single-core RISC processor to implement the multi-core processor.
– Validate the design at the BlueSpec level by running simple benchmark programs and comparing performance to the single core.

Project Requirements
– Scalable architecture: the architecture does not depend on the number of cores (single-core, dual-core, quad-core, ...).
– Shared data memory.
[Diagram: single-core, dual-core, and quad-core configurations, with all cores connected to one shared data memory]

Baseline Processor – Single Core
– The SMIPS BlueSpec code is taken from "Architecting and Implementing Microprocessors in BlueSpec."
– 2-stage pipeline.
– Data and instruction memories are submodules of the CPU.
– Includes a naïve branch predictor.

Design Overview
To achieve the project's goals, the design proceeded in three stages:
– Stage 1 – instruction memory
– Stage 2 – data memory
– Stage 3 – multi-core

Stage 1 – Instruction Memory: Motivation
– Each core executes different instructions; this cannot be achieved with the I.Mem as a submodule of the CPU.
– Solution: pull the I.Mem out to the same hierarchy level as the CPU module.
[Diagram: Core 1 with its I.Mem and D.Mem]

Stage 1 – Instruction Memory: Implementation Method
– Modules communicate through Get/Put interfaces (the CPU acts as a Client, the memory as a Server).
– The connect_reqs and connect_resps rules use the CPU and I.Mem interfaces to connect the requests and the responses.
[Diagram: test bench in which connect_reqs moves requests from the CPU's f_out FIFO to the I.Mem's f_in FIFO, and connect_resps moves responses from the I.Mem's f_out FIFO to the CPU's f_in FIFO]
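A self-contained Bluespec sketch of this plumbing. mkCpuStub and mkIMemStub are hypothetical stand-ins for the real CPU and I.Mem modules; only the rule and FIFO names follow the slides.

import GetPut::*;
import ClientServer::*;
import FIFOF::*;

typedef Bit#(32) MemReq;    // placeholder types (assumptions)
typedef Bit#(32) MemResp;

module mkCpuStub (Client#(MemReq, MemResp));
    FIFOF#(MemReq)  f_out <- mkFIFOF;
    FIFOF#(MemResp) f_in  <- mkFIFOF;

    rule doFetch;                  // issue a (placeholder) PC as a request
        f_out.enq(0);
    endrule

    rule doProcess;                // consume arriving responses
        f_in.deq;
    endrule

    interface Get request  = toGet(f_out);
    interface Put response = toPut(f_in);
endmodule

module mkIMemStub (Server#(MemReq, MemResp));
    FIFOF#(MemReq)  f_in  <- mkFIFOF;
    FIFOF#(MemResp) f_out <- mkFIFOF;

    rule process;                  // stands in for the instruction lookup
        f_out.enq(f_in.first);
        f_in.deq;
    endrule

    interface Put request  = toPut(f_in);
    interface Get response = toGet(f_out);
endmodule

module mkTestBench (Empty);
    Client#(MemReq, MemResp) cpu  <- mkCpuStub;
    Server#(MemReq, MemResp) imem <- mkIMemStub;

    rule connect_reqs;             // CPU f_out -> I.Mem f_in
        let req <- cpu.request.get;
        imem.request.put(req);
    endrule

    rule connect_resps;            // I.Mem f_out -> CPU f_in
        let resp <- imem.response.get;
        cpu.response.put(resp);
    endrule
endmodule

(The same pair of rules can be generated automatically with mkConnection from the Connectable package; they are written out here because the latency discussion below refers to them.)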

Stage 1 – Instruction Memory: Latency Problem
Problem: the instruction-fetch latency is 5 cycles.
– Cycle 1: a CPU rule enqueues the PC address as a memory request into f_out.
– Cycle 2: connect_reqs dequeues the request from the CPU's f_out and enqueues it into the I.Mem's f_in FIFO.
– Cycle 3: the I.Mem dequeues the request from f_in, processes it, and enqueues the response into f_out.
– Cycle 4: connect_resps dequeues the response from the I.Mem's f_out and enqueues it into the CPU's f_in FIFO.
– Cycle 5: a CPU rule dequeues the response from f_in and processes it.
[Diagram: the test bench from the previous slide, with the active FIFO highlighted at each cycle]

Stage 1 – Instruction Memory: Solution Overview
– Solution: use bypass FIFOs for f_in and f_out instead of regular FIFOs, allowing an enqueue and a dequeue in the same cycle.
– New latency: 1 cycle; doFetch executes right after the response arrives.
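A minimal sketch of the fix, assuming the FIFOs shown above: mkBypassFIFOF, from the standard SpecialFIFOs package, forwards a value enqueued in a cycle to a dequeue in that same cycle.

import FIFOF::*;
import SpecialFIFOs::*;

module mkBypassDemo (Empty);
    FIFOF#(Bit#(32)) f <- mkBypassFIFOF;   // was: mkFIFOF
    Reg#(Bit#(32))   n <- mkReg(0);

    rule produce;
        f.enq(n);
        n <= n + 1;
    endrule

    rule consume;                  // fires in the same cycle as produce
        $display("got %0d", f.first);
        f.deq;
    endrule
endmodule

Swapping mkFIFOF for mkBypassFIFOF in each of the four FIFOs on the request/response path removes the extra cycle that each hop added.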

Stage 2 – Data Memory: Motivation
– Each core accesses the same data memory to achieve parallelism; this cannot be achieved with the D.Mem as a submodule of the CPU.
– Solution: pull the D.Mem out to the same hierarchy level as the CPU module.
[Diagram: Core 1 with its I.Mem and D.Mem]

Stage 2 – Data Memory: Implementation Method
– Modules communicate through Get/Put interfaces (the CPU acts as a Client, the memory as a Server).
– The dconnect_reqs and dconnect_resps rules use the CPU and D.Mem interfaces to connect the requests and the responses, just as in Stage 1.

Stage 2 – Data Memory: Latency Problem
– A rule can only fire once per cycle.
– doExecute both initiates the memory operation and processes the response; since it cannot fire twice in the same cycle, a load cannot complete in a single cycle.

Stage 2 – Data Memory: Suggested Solution
– Solution: add a Memory stage to the pipeline datapath, requesting the data in the Execute stage and receiving it in the Memory stage.
– This solution was not implemented, as we focused on creating the multi-core processor.
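A hedged structural sketch of this suggested (and unimplemented) split; the e2m FIFO and the rule bodies are our own placeholders, not the project's code. doExecute only issues the request, and a separate doMemory rule consumes the response one cycle later.

import GetPut::*;
import ClientServer::*;
import FIFOF::*;
import SpecialFIFOs::*;

typedef Bit#(32) MemReq;    // placeholder types (assumptions)
typedef Bit#(32) MemResp;

module mkTwoStageLoadSketch (Client#(MemReq, MemResp));
    FIFOF#(MemReq)   f_out <- mkBypassFIFOF;
    FIFOF#(MemResp)  f_in  <- mkBypassFIFOF;
    FIFOF#(Bit#(32)) e2m   <- mkFIFOF;     // Execute-to-Memory pipeline FIFO

    rule doExecute;                // Execute: issue a (placeholder) load
        f_out.enq(0);
        e2m.enq(0);                // bookkeeping for the Memory stage
    endrule

    rule doMemory;                 // Memory: receive the load response
        e2m.deq;
        f_in.deq;
    endrule

    interface Get request  = toGet(f_out);
    interface Put response = toPut(f_in);
endmodule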

Stage 3 – Multi-Core Processor
– Multiple cores are connected to their own instruction memories and to the shared data memory.
– A higher-hierarchy module must be created in order to establish these connections.
[Diagram: Core 1 and Core 2, each with its own I.Mem, both connected to the shared data memory]

Stage 3 – Multi-Core Processor: Implementation Method
– The connections are established using dconnect_reqs and dconnect_resps rules between each core and the same data memory.

Stage 3 – Multi-Core Processor: Issue 1 – Scheduling
– Issue: the D.Mem has only one port; how can memory accesses be scheduled?
– Solution: BlueSpec automatically schedules the design's rule execution, giving priority to the lower-numbered core.
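A small sketch of how such a conflict looks: the two request rules contend for the single enq port, so at most one fires per cycle, and the descending_urgency attribute makes the priority explicit. The rule names follow the slides; the FIFOs, and the omission of the rules that feed them, are ours.

import FIFOF::*;

module mkDMemArbiterSketch (Empty);
    FIFOF#(Bit#(32)) core1_req <- mkFIFOF;   // fed by core 1 (feeder omitted)
    FIFOF#(Bit#(32)) core2_req <- mkFIFOF;   // fed by core 2 (feeder omitted)
    FIFOF#(Bit#(32)) dmem_f_in <- mkFIFOF;   // the single D.Mem request port

    // Both rules call dmem_f_in.enq, so they conflict; the compiler
    // picks a static priority, made explicit here: core 1 wins.
    (* descending_urgency = "dconnect_reqs_core1, dconnect_reqs_core2" *)
    rule dconnect_reqs_core1;
        dmem_f_in.enq(core1_req.first);
        core1_req.deq;
    endrule

    rule dconnect_reqs_core2;
        dmem_f_in.enq(core2_req.first);
        core2_req.deq;
    endrule
endmodule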

Stage 3 – Multi-Core Processor: Issue 2 – Response Path
– The connection rules constantly try to fire.
– We must ensure that the response is obtained by the CPU that accessed the memory, and not by another core.

Stage 3 – Multi-Core Processor: Issue 3 – Performance
– When simulating the processor, 2 cores were unable to operate together, resulting in poor performance.

Stage 3 – Multi-Core Processor: Issue 3 – Debugging
– Using the BlueSpec tools, we observed that dconnect_resps_core2 was blocked by dconnect_resps_core1.
– Therefore, core 2's execute stage was blocked whenever core 1 operated.

Stage 3 – Multi-Core Processor: Issues 2 & 3 – Solution
– In the original get_response interface of the D.Mem, due to the f_out.deq, only one core could obtain the response; all other cores were blocked because the D.Mem's f_out FIFO was then empty.
– The get_response interface was changed to peek at the response without dequeuing it, so that every core can check whether the response belongs to it (a sketch follows the three steps below).

Stage 3 – Multi-Core Processor: The Change in D.Mem
– Step 1: sendMessage enqueues the response that was prepared in the previous cycle.
– Step 2: the connection rules use fifo.first (they do not dequeue f_out) while a new request arrives.
– Step 3: dMemoryResponse prepares the new response and dequeues the response that was sent at the beginning of the cycle.
[Diagram: D.Mem with the sendMessage and dMemoryResponse rules around its f_in and f_out FIFOs]
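A hedged reconstruction of the changed D.Mem response path. The original code screenshots did not survive in this transcript, so apart from the rule names and the switch from deq to first, the details below are our guesses.

import GetPut::*;
import ClientServer::*;
import FIFOF::*;

typedef Bit#(32) MemReq;    // placeholder types (assumptions)
typedef Bit#(32) MemResp;

module mkDMemStub (Server#(MemReq, MemResp));
    FIFOF#(MemReq)  f_in     <- mkFIFOF;
    FIFOF#(MemResp) f_out    <- mkFIFOF;   // the currently published response
    FIFOF#(MemResp) prepared <- mkFIFOF;   // response built in the previous cycle

    // Step 1: publish the response prepared in the previous cycle.
    rule sendMessage;
        f_out.enq(prepared.first);
        prepared.deq;
    endrule

    // Step 3: prepare the next response and retire the response that
    // was published at the beginning of the cycle.
    rule dMemoryResponse;
        prepared.enq(f_in.first);          // stands in for the memory lookup
        f_in.deq;
        if (f_out.notEmpty) f_out.deq;
    endrule

    interface Put request = toPut(f_in);

    // The fix: get only peeks (f_out.first) instead of dequeuing, so
    // every core can inspect the response and validate its ownership.
    // The old version performed f_out.deq here, starving the other cores.
    interface Get response;
        method ActionValue#(MemResp) get;
            return f_out.first;
        endmethod
    endinterface
endmodule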

Stage 3 – Multi-Core Processor: Parallel Execution
– Two cores execute instructions simultaneously while sharing the same data memory.

The Scalable Processor
The architecture is independent of the number of cores. Three easy steps are required to add a core (see the sketch below):
– Step 1: create a new instruction memory.
– Step 2: connect the core to the data and instruction memories.
– Step 3: add a monitoring mechanism for the core.
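A toy sketch of the replication pattern, assuming the per-core work is wrapped in a module; the Core interface here is a stand-in, not the project's SMIPS core. Scaling then amounts to changing one type-level number.

import Vector::*;

typedef 4 NumCores;   // change this one number to scale the design

interface Core;
    method Bit#(32) pc;
endinterface

module mkCore (Core);   // stand-in for a core plus its I.Mem
    Reg#(Bit#(32)) pcReg <- mkReg(0);

    rule step;
        pcReg <= pcReg + 4;
    endrule

    method Bit#(32) pc = pcReg;
endmodule

module mkMultiCore (Empty);
    // One replicateM call instantiates all the cores; connecting them
    // to the shared D.Mem would be a loop over this vector.
    Vector#(NumCores, Core) cores <- replicateM(mkCore);
endmodule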

Benchmark 1 – Description
Benchmark 1 is a purely computational program:
– No memory instructions.
– Pure parallelism, since no core is ever blocked.

Benchmark 1 – Results
– With no memory instructions, all cores work independently and simultaneously.
– The results match the multi-core concept: 8 cores do the same "job" as 1 core in 1/8 of the time.

Benchmark 2 – Description
Benchmark 2 is a short image-processing program:
– Input: a 32×32 binary image.
– Output: the inverted image.
– Uses memory instructions.

Benchmark 2 – Example
[Image: the 32×32 binary input image and its inverted result]

Benchmark 2 – Results
– 2 cores multiplied performance by 2.
– The improvement from 4 and 8 cores declined, as predicted by the rule of diminishing marginal productivity.
– The gap between the memory instructions was enough for 2 cores to operate with a phase difference, allowing each core to access the memory without blocking the other.

Benchmarks 3/4 – Description
Benchmarks 3/4 are pure memory-access programs:
– Mostly SW instructions or LW instructions.
– SW is a "fire and forget" instruction, whereas a load instruction waits for the response.

Benchmarks 3/4 – Results
– A single core also allocates cycles to computation, during which the memory is idle.
– With multiple cores, some cores execute computation instructions while others execute memory instructions, maximizing memory utilization.

Benchmark 5 – Description
Benchmark 5 is a long image-processing program:
– Input: a 32×32 binary image.
– Output: the inverted image.
– Uses memory instructions, but the processing part takes longer than in benchmark 2.
– Motivation: a larger gap between memory instructions.

Benchmark 5 – Results
– As predicted from benchmark 2, the larger gap between memory instructions resulted in greater performance for the quad-core.
– The larger the gap, the more cores can operate in different phases, avoiding being blocked by the other cores' memory accesses.

Summary & Conclusion
– The design comprised 3 stages: Stage 1 – instruction memory; Stage 2 – data memory; Stage 3 – multi-core.
– The scalability and shared-data-memory requirements were achieved.
– Multi-core increases data-memory utilization (shown in benchmarks 3/4).

Summary & Conclusion
– The number of cores should be chosen with regard to the executed program.
– A multi-core processor can enhance performance, but beyond a certain number of cores, adding more cores will not result in better performance.

Summary & Conclusion – BlueSpec
Pros:
– High abstraction level of design: easier to focus on the goal.
– Automatic scheduling of module interactions.
– High-level language: more human-readable.
Cons:
– Hard to optimize: understanding the automatic scheduling mechanism takes time.
– Scheduling errors and warnings are hard to decipher.
– Lack of a "knowledge base."

Summary & Conclusion – FAQ
– Problem: each core executes the same instructions. Solution: pull the I.Mem out to the same hierarchy level as the CPU module.
– Problem: the Client/Server interface latency is 5 cycles. Solution: use bypass FIFOs instead of regular FIFOs.
– Problem: load-instruction latency cannot be 1 cycle even when using bypass FIFOs. Solution: add a Memory stage to the pipeline datapath, requesting the data in the Execute stage and receiving it in the Memory stage (not implemented).

Summary & Conclusion – FAQ
– Problem: the D.Mem has only one port; how can memory accesses be scheduled? Solution: BlueSpec automatically schedules the design's rule execution, giving priority to the lower-numbered core.
– Problem: we must ensure that the response is obtained by the CPU that accessed the memory, and not by another core. Solution: change the interface so that every core can receive the response and validate its ownership.

Future Project Possibilities: What's Next – MultiCore 2.0
– Verify the design on hardware.
– Add a Memory stage to reduce load latency: send the request in the Execute stage and receive the response in the Memory stage.
– Implement a cache to reduce memory accesses.
– Implement a multi-port data memory.
– Design a mechanism for memory coherence.

As BlueSpec's alluring advertisement says:
[Image: BlueSpec advertisement]