
1 Implementing RISC Multi-Core Processor Using HLS Language – BLUESPEC
Final Presentation
Liam Wigdor
Advisor: Mony Orbach
Shirel Josef
Semesterial Project, Winter 2013
Department of Electrical Engineering – Electronics, Computers, Communications – Technion, Israel Institute of Technology

2 AGENDA
– Introduction
– BlueSpec Development Environment
– Project’s Goals
– Project’s Requirements
– Design Overview
– Design Stage 1 – Instruction Memory
– Design Stage 2 – Data Memory
– Design Stage 3 – MultiCore
– The Scalable Processor
– Benchmarks & Results
– Summary & Conclusion

3 Introduction
– The future of the single core is gloomy.
– Multiple cores can be used for parallel computing.
– Multiple cores may be used as specific accelerators as well as general-purpose cores.
– Ecclesiastes 4:9-12 – “Two are better than one.”

4 BlueSpec Development Environment
– A high-level language for hardware description.
– Rules describe dynamic behavior: they are atomic, fire at most once per cycle, can run concurrently when they do not conflict, and are scheduled by BlueSpec automatically.
– Module: the analogue of an object in an object-oriented language.
– Interface: a module can be manipulated only via the methods of its interface, and an interface can be used by the parent module only. (A minimal sketch of these concepts follows.)
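A minimal sketch of these concepts, with illustrative names (Counter, mkCounter) that are not from the project code: one register, one rule, and one interface method.

    interface Counter;
       method Bit#(32) read();
    endinterface

    module mkCounter (Counter);
       // State is held in registers instantiated inside the module.
       Reg#(Bit#(32)) count <- mkReg(0);

       // Rule: atomic, fires at most once per cycle, scheduled by BlueSpec.
       rule increment;
          count <= count + 1;
       endrule

       // Interface method: the only way the parent module can observe count.
       method Bit#(32) read();
          return count;
       endmethod
    endmodule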

5 Project’s Goals
Main goals:
– Implement a RISC multi-core processor using BlueSpec.
– Evaluate and analyze the multi-core design’s performance compared to a single core.
Derived goals:
– Learn the BlueSpec principles, syntax, and working environment.
– Understand and use a single-core RISC processor to implement the multi-core processor.
– Validate the design at the BlueSpec level by running simple benchmark programs and comparing performance to the single core.

6 Project’s Requirements
– Scalable architecture: the architecture does not depend on the number of cores.
– Shared data memory.
[Diagram: single-core, dual-core, and quad-core configurations, with the cores sharing one data memory]

7 Baseline Processor – Single Core
– The SMIPS BlueSpec code was taken from course 046004 – Architecting and Implementing Microprocessors in BlueSpec.
– 2-stage pipeline.
– Data and instruction memories as submodules.
– Includes a naïve branch predictor.

8 Design Overview
To achieve the project’s goals, our design consisted of three stages:
– Stage 1 – Instruction memory
– Stage 2 – Data memory
– Stage 3 – Multi-core

9 Stage 1 – Instruction Memory: Motivation
– Each core executes different instructions.
– This cannot be achieved with the I.Mem as a CPU submodule.
– Solution: pull the I.Mem out to the same hierarchy level as the CPU module.
[Diagram: Core 1 with the D.Mem and I.Mem outside the core]

10 Stage 1 – Instruction Memory: Implementation Method
– Modules use the Get/Put interface (the CPU as a Client, the memory as a Server).
– The connect_reqs and connect_resps rules use the CPU and I.Mem interfaces to connect the requests and the responses. (A sketch of the two rules follows.)
[Diagram: Test Bench containing the CPU (f_out/f_in, get_request.get / put_response.put) and the I.Mem (f_in/f_out, put_request.put / get_response.get), connected by the connect_reqs and connect_resps rules]
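A hedged sketch of the two connection rules, assuming the CPU exposes a Client#(MemReq, MemResp) subinterface named imem_client and the I.Mem exposes a matching Server named server (the names and types are assumptions, not the project’s exact identifiers):

    rule connect_reqs;
       let req <- cpu.imem_client.request.get();   // dequeues the CPU's f_out
       imem.server.request.put(req);               // enqueues the I.Mem's f_in
    endrule

    rule connect_resps;
       let resp <- imem.server.response.get();     // dequeues the I.Mem's f_out
       cpu.imem_client.response.put(resp);         // enqueues the CPU's f_in
    endrule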

11 Stage 1 – Instruction Memory: Latency Problem – Cycle 1
Problem: the instruction-fetch latency is 5 cycles.
– Cycle 1: a CPU rule enqueues the PC address as a memory request into f_out.

12 Stage 1 – Instruction Memory: Latency Problem – Cycle 2
– Cycle 2: connect_reqs dequeues the request from the CPU’s f_out and enqueues it into the I.Mem’s f_in FIFO.

13 Stage 1 – Instruction Memory: Latency Problem – Cycle 3
– Cycle 3: the I.Mem dequeues the request from f_in, processes it, and enqueues the response into f_out.

14 Stage 1 – Instruction Memory: Latency Problem – Cycle 4
– Cycle 4: connect_resps dequeues the response from the I.Mem’s f_out and enqueues it into the CPU’s f_in FIFO.

15 Stage 1 – Instruction Memory: Latency Problem – Cycle 5
– Cycle 5: a CPU rule dequeues the response from f_in and processes it.

16 Stage 1 – Instruction Memory: Solution – Overview
– Solution: use bypass FIFOs for f_in and f_out instead of regular FIFOs, allowing an enqueue and a dequeue in the same cycle.
– New latency: 1 cycle; doFetch executes after the response arrives. (A sketch follows.)
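A sketch of the fix inside the communicating modules, assuming request/response types named MemReq and MemResp; mkBypassFIFOF comes from the standard SpecialFIFOs package:

    import FIFOF :: *;
    import SpecialFIFOs :: *;

    // A bypass FIFO forwards a value enqueued in a cycle to a dequeue in the
    // same cycle, collapsing the 5-cycle round trip into a single cycle.
    FIFOF#(MemReq)  f_out <- mkBypassFIFOF;
    FIFOF#(MemResp) f_in  <- mkBypassFIFOF;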

17 Stage 2 – Data Memory: Motivation
– Each core accesses the same data memory to achieve parallelism.
– This cannot be achieved with the D.Mem as a CPU submodule.
– Solution: pull the D.Mem out to the same hierarchy level as the CPU module.

18 Stage 2 – Data Memory: Implementation Method
– Modules use the Get/Put interface (the CPU as a Client, the memory as a Server).
– The dconnect_reqs and dconnect_resps rules use the CPU and D.Mem interfaces to connect the requests and the responses. (A library alternative is sketched below.)
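For completeness: when a Client and a Server match, the BlueSpec standard library can also make this point-to-point connection in one line. This is a hedged alternative to the hand-written rules, not necessarily what the project used (dmem_client and server are assumed subinterface names):

    import ClientServer :: *;
    import Connectable :: *;

    // Equivalent to the dconnect_reqs/dconnect_resps pair for one core.
    mkConnection(cpu.dmem_client, dmem.server);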

19 Stage 2 – Data Memory: Latency Problem
– A rule can fire only once per cycle.
– doExecute both initiates the memory operation and processes the response, and it cannot fire twice in the same cycle, so a load cannot complete in a single cycle even with bypass FIFOs.

20 Stage 2 – Data Memory: Suggested Solution
– Add a Memory stage to the pipeline datapath: request the data in the Execute stage and receive it in the Memory stage. (See the sketch below.)
– This solution was not implemented, as we focused on creating the multi-core processor.
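A sketch of the suggested (unimplemented) split, assuming pipeline FIFOs d2e and e2m between the stages; isLoad, loadReqOf, and writeback are hypothetical helpers:

    rule doExecute;
       let it = d2e.first(); d2e.deq();
       if (isLoad(it))
          dmem.request.put(loadReqOf(it));   // request issued in Execute
       e2m.enq(it);                          // hand the instruction to Memory
    endrule

    rule doMemory;
       let it = e2m.first(); e2m.deq();
       if (isLoad(it)) begin
          let resp <- dmem.response.get();   // response consumed in Memory
          writeback(it, resp);               // hypothetical write-back action
       end
    endrule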

21 Stage 3 – Multi-Core Processor
– Connect multiple cores to their instruction memories and to the shared data memory.
– A higher-hierarchy module must be created to establish these connections.
[Diagram: Core 1 and Core 2, each with its own I.Mem, sharing one data memory]

22 Stage 3 – Multi-Core Processor: Implementation Method
– Connections are established using dconnect_reqs and dconnect_resps rules between each core and the same data memory.

23 Stage 3 – Multi-Core Processor: Issue 1 – Scheduling
– Issue 1: the D.Mem has only one port; how can memory accesses be scheduled?
– Solution: BlueSpec automatically schedules the execution of the design’s rules, giving priority to the lower-numbered core. (A sketch of making that priority explicit follows.)
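The BlueSpec compiler picks this urgency order automatically; it can also be pinned down explicitly with the standard descending_urgency attribute. A sketch for two cores (the interface names are assumptions):

    (* descending_urgency = "dconnect_reqs_core1, dconnect_reqs_core2" *)
    rule dconnect_reqs_core1;    // core 1 wins when both request the port
       let r <- cores[0].dmem_client.request.get();
       dmem.request.put(r);
    endrule

    rule dconnect_reqs_core2;
       let r <- cores[1].dmem_client.request.get();
       dmem.request.put(r);
    endrule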

24 Stage 3 – Multi-Core Processor: Issue 2 – Response Path
– Issue 2: the connection rules constantly try to fire, and we need to ensure that the CPU that accessed the memory obtains the response, not another core.

25 Stage 3 – Multi-Core Processor: Issue 3 – Performance
– Issue 3: when simulating the processor, two cores were unable to operate together, resulting in poor performance.

26 Stage 3 – Multi-Core Processor: Issue 3 – Debugging
– Using the BlueSpec tools, we observed that dconnect_resps_core2 was blocked by dconnect_resps_core1.
– Therefore, core 2’s execute stage was blocked whenever core 1 operated.

27 Stage 3 – Multi-Core Processor: Issues 2 & 3 – Solution
– The get_response interface in the D.Mem originally dequeued f_out; due to that f_out.deq, only one core could obtain a response, and all other cores were then blocked because the D.Mem’s f_out FIFO was empty.
– The get_response interface was changed so that every core can observe the response and take only the one it owns. (A before/after sketch follows.)
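The method bodies were shown as code on the original slide; the following is a reconstruction from the description, assuming a response type named MemResp:

    // Before: the method dequeued, so the first core to fire emptied f_out
    // and blocked every other core's connection rule.
    method ActionValue#(MemResp) get_response();
       let resp = f_out.first();
       f_out.deq();
       return resp;
    endmethod

    // After: the method only peeks; a separate D.Mem rule dequeues once per
    // cycle, so every core's connection rule can observe the response and
    // check whether it owns it.
    method MemResp get_response();
       return f_out.first();
    endmethod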

28 Stage 3 – Multi-Core Processor: Change in D.Mem – Step 1
– Step 1: sendMessage enqueues the response that was prepared in the previous cycle.
[Diagram: D.Mem with f_in/f_out FIFOs, the internal sendMessage and dMemoryResponse rules, and the external dconnect_reqs and dconnect_resps rules]

29 Stage 3 – Multi-Core Processor: Change in D.Mem – Step 2
– Step 2: the connection uses fifo.first (it does not dequeue f_out) while a new request arrives.

30 Stage 3 – Multi-Core Processor: Change in D.Mem – Step 3
– Step 3: dMemoryResponse prepares the new response and dequeues the response that was sent at the beginning of the cycle. (A sketch of the two rules follows.)
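A sketch of the two D.Mem rules, with an assumed state register next_resp and a hypothetical serve function; FIFO scheduling details are elided:

    Reg#(MemResp) next_resp <- mkRegU;

    rule sendMessage;                        // step 1
       f_out.enq(next_resp);                 // response prepared last cycle
    endrule

    rule dMemoryResponse;                    // step 3
       let req = f_in.first(); f_in.deq();   // accept the new request (step 2)
       next_resp <= serve(req);              // prepare the next response
       f_out.deq();                          // retire the response sent this cycle
    endrule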

31 Stage 3 – Multi-Core Processor: Parallel Execution
– Two cores execute instructions simultaneously, sharing the same data memory.

32 The Scalable Processor
Three easy steps are required to add a core:
– Step 1: create a new instruction memory.
– Step 2: connect the core to the data and instruction memories.
– Step 3: add a monitoring mechanism for the core.
The architecture is independent of the number of cores. (A sketch of the scaled top level follows.)
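A sketch of how the top-level module could scale using a BSV Vector, with assumed interface and module names (Core_IFC, IMem_IFC, DMem_IFC, mkCore, mkIMem, mkDMem):

    import Vector :: *;

    typedef 4 NumCores;   // the only place the core count appears

    module mkMultiCore (Empty);
       DMem_IFC dmem <- mkDMem;
       Vector#(NumCores, Core_IFC) cores <- replicateM(mkCore);
       Vector#(NumCores, IMem_IFC) imems <- replicateM(mkIMem);

       // Per-core connection rules are generated in an elaboration-time loop,
       // so adding a core only means changing NumCores.
       for (Integer i = 0; i < valueOf(NumCores); i = i + 1) begin
          rule connect_reqs;
             let r <- cores[i].imem_client.request.get();
             imems[i].server.request.put(r);
          endrule
       end
    endmodule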

33 Benchmark 1 – Description
Benchmark 1 – pure computational program:
– No memory instructions.
– Pure parallelism, since nothing blocks.

34 Benchmark 1 – Results
– With no memory instructions, all cores work independently and simultaneously.
– The results match the multi-core concept: 8 cores do the same “job” as 1 core in 1/8 of the time.

35 Benchmark 2 – Description
Benchmark 2 – short image-processing program:
– Input: a 32×32 binary image.
– Output: the inverted image.
– Uses memory instructions.

36 Benchmark 2 – Example
– Image-processing result: [image of the original 32×32 binary image and its inversion]

37 Benchmark 2 – Results
– 2 cores managed to double performance.
– With 4 and 8 cores the improvement declined, as predicted by the law of diminishing marginal returns.
– The gap between memory instructions was long enough for 2 cores to operate with a phase difference, allowing each core to access the memory without blocking the other.

38 Benchmarks 3/4 – Description
Benchmarks 3/4 – pure memory-access programs:
– Mostly SW instructions or LW instructions.
– SW is a “fire and forget” instruction, whereas a load instruction waits for the response.

39 Benchmarks 3/4 – Results
– A single core must allocate cycles to computation, during which the memory is idle.
– With multiple cores, some cores execute computation instructions while others execute memory instructions, maximizing memory utilization.

40 Benchmark 5 – Description
Benchmark 5 – long image-processing program:
– Input: a 32×32 binary image.
– Output: the inverted image.
– Uses memory instructions.
– However, the processing part takes longer than in benchmark 2.
– Motivation: a larger gap between memory instructions.

41 Benchmark 5 – Results
– As predicted from benchmark 2, the larger gap between memory instructions yielded greater performance for the quad-core configuration.
– The larger the gap, the more cores can operate in different phases, so they are not blocked by other cores’ memory accesses.

42 Summary & Conclusion Design included 3 stages: –Stage 1 – Instruction memory –Stage 2 – Data memory –Stage 3 – Multi core Scalable and shared data memory requirements achieved. MultiCore increase data memory utilization (shown in benchmark 3/4)

43 Summary & Conclusion the number of cores should be chosen with regards to executed program Using mutlicore processor can enhance performance but after certain number of cores adding more cores will not result in better performance.

44 Summary & Conclusion - BlueSpec Pros: –High abstraction level of design – easier to focus on goal. –Automatic scheduling of modules interactions. –High level language – more human readable. Cons: –Hard to optimize – understanding the automatic scheduling mechanism takes time. –Decipher scheduling errors and warnings. –Lack of “knowledge-base”.

45 Summary & Conclusion - FAQ Problem: Each core execute same instructions Solution: Draw out the I.Mem to the same hierarchy as the CPU module. Problem: Client/Server interface latency is 5 cycles. Solution: Use bypass fifo instead of regular fifo. load instructions latency cannot be 1 cycle even when using bypass fifo. Solution: Add memory stage in the pipeline data path, requesting data in the execution Stage and receiving it in the memory stage. (Not implemented)

46 Summary & Conclusion - FAQ Problem: D.Mem has only one port, How can memory access be scheduled? Solution: BlueSpec automatically schedule design rules execution, giving priority to lower numbered core. Problem: Need to ensure that the CPU which accessed the memory will obtain the response and not other core. Solution: Change interface so that every core can receive and validate response possession.

47 Future Project Possibilities: What’s Next – MultiCore 2.0
– Verify the design on hardware.
– Add a Memory stage to reduce load latency: send the request in the Execute stage and receive the response in the Memory stage.
– Implement a cache to reduce memory accesses.
– Implement a multi-port data memory.
– Design a mechanism for memory coherence.

48 As BlueSpec’s alluring advertisement says: [advertisement image]

