Multicore CPU with Multi-Threading Operating Environment.


Overview: Team, Goals, Job Distribution, Multicore, IRQ, WriteBackBuffer, BlockRAM, CLK4x, Generic Memory Arbiter

Team: Marcel Schaal, Benedikt Weber, Bastian Reitschuster, Rudolf Netzel

Goals: dual-core CPU; WriteBackBuffer; double Mandelbrot in split screen

Job Distribution. Bastian + Rudolf: WriteBackBuffer, interrupts. Marcel + Benedikt: multicore CPU implementation, OS + Hase research.

Problems: not enough Block RAMs; overclocking the register file; number of read/write ports; timing is everything (clock skew and routing delay); only 24 hours in a day; software environment.

Multicore Architecture

WriteBackBuffer. Why? To speed up store instructions. How? Buffer data and addresses in a FIFO and write the buffered elements to memory whenever possible.
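The buffering idea can be sketched in software. The following is a toy Python model, not the team's hardware design: the class name, the `depth` parameter, and the load-forwarding behaviour in `load` are assumptions made for illustration only.

```python
from collections import deque


class WriteBackBuffer:
    """Toy model of a FIFO store buffer (hypothetical, for illustration)."""

    def __init__(self, memory, depth=4):
        self.memory = memory   # dict mapping address -> value
        self.depth = depth     # assumed number of FIFO entries
        self.fifo = deque()    # pending (address, value) pairs, oldest first

    def store(self, address, value):
        """Accept a store; return False if the FIFO is full (core would stall)."""
        if len(self.fifo) >= self.depth:
            return False
        self.fifo.append((address, value))
        return True

    def load(self, address):
        """Assumed behaviour: forward the newest buffered value, else read memory."""
        for addr, value in reversed(self.fifo):
            if addr == address:
                return value
        return self.memory.get(address, 0)

    def drain_one(self):
        """Write the oldest entry to memory; call whenever the memory port is free."""
        if not self.fifo:
            return False
        addr, value = self.fifo.popleft()
        self.memory[addr] = value
        return True
```

The core only waits when `store` returns False; otherwise the store completes immediately from its point of view, and `drain_one` moves data to memory in the background.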

WBB implementation: one BlockRAM is used for the WriteBackBuffer; two read and two write ports are needed, for two read and two write operations per cycle; internal read and write pointers.

WBB – first design

WBB - FSM

BlockRAM 2W2R. Why? 2*32 bits of data must be written and read, but one BlockRAM has only 2*16 bits. How? Overclock the BlockRAM by a factor of four: write A, write B, read A, read B.
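The time-multiplexing scheme can be illustrated with a small behavioural model. This is a Python sketch under assumptions, not the VHDL from the project: the class and parameter names are hypothetical, and the memory is treated as a plain array that serves one access per fast sub-cycle.

```python
class TimeMultiplexedRam:
    """Toy model: one physical RAM serving 2 writes + 2 reads per core cycle."""

    def __init__(self, size):
        self.cells = [0] * size

    def core_cycle(self, write_a=None, write_b=None, read_a=None, read_b=None):
        """Run four fast sub-cycles inside one core-clock cycle.

        write_a / write_b: (address, value) tuple or None.
        read_a / read_b: address or None.
        Returns (data_a, data_b); None for an unused read port.
        """
        # Sub-cycle 0: write A
        if write_a is not None:
            addr, value = write_a
            self.cells[addr] = value
        # Sub-cycle 1: write B
        if write_b is not None:
            addr, value = write_b
            self.cells[addr] = value
        # Sub-cycle 2: read A
        data_a = self.cells[read_a] if read_a is not None else None
        # Sub-cycle 3: read B
        data_b = self.cells[read_b] if read_b is not None else None
        return data_a, data_b
```

Because both writes are scheduled before both reads, a read in this model observes a value written in the same core cycle. Whether the real hardware was relied on to behave that way is not stated in the slides.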

ERROR. Because of limited BlockRAMs and even harder timing constraints: implementation as Distributed RAM with far fewer entries.

BlockRAM 2R1W. Why? A simple register file needs four BlockRAMs, which is too many for more than four cores. How? Again overclock the BlockRAM by a factor of four: register input, read A, read B and write.

Generic Memory Arbiter (GMA): every core needs access to memory; the GMA handles memory requests; generic in the number of ports; similar to the memory arbiter of task 2; needed for instruction fetch and the load/store unit; round-robin implementation.
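Round-robin arbitration over a configurable number of ports can be sketched as follows. This is a minimal Python model, not the project's arbiter: the class name, the `grant` method, and the choice that port 0 wins first after reset are assumptions.

```python
class RoundRobinArbiter:
    """Toy round-robin arbiter, generic in the number of request ports."""

    def __init__(self, num_ports):
        if num_ports < 1:
            raise ValueError("need at least one port")
        self.num_ports = num_ports
        # Start so that port 0 has the highest priority on the first grant.
        self.last_grant = num_ports - 1

    def grant(self, requests):
        """Pick one requesting port per call.

        requests: sequence of booleans, one per port.
        Returns the granted port index, or None if nobody requests.
        """
        if len(requests) != self.num_ports:
            raise ValueError("one request flag per port expected")
        # Search starting just after the most recently granted port.
        for offset in range(1, self.num_ports + 1):
            port = (self.last_grant + offset) % self.num_ports
            if requests[port]:
                self.last_grant = port
                return port
        return None
```

Since the search always starts after the last winner, a port that keeps requesting cannot starve the others: every requesting port is served within `num_ports` grants.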

System stuff: only the master core can handle interrupts; every core is able to start execution on another core; the brancher handles the execute opcode (in/out).

Hase: adding new opcodes; don't let anybody see the source code; Hase calculates the destination of labels incorrectly.

Sample Video Please wait a moment

Surprise: Multi-Threading. Why? Multipliers need ~10% of the board area; SIMD doesn't utilize the multipliers enough; load/store unit utilization is < 1% (Mandelbrot); it uses less board area than multi-core; there were still two weeks available; it is possible, so why not?

Features: generic number of cores, threads and SIMD; synchronization unit (lock, unlock, access data); shared multipliers and load/store unit; load/store scheduler, similar to the memory arbiters.
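The slides name three operations for the synchronization unit (lock, unlock, access data) but do not define their semantics. The Python sketch below shows one plausible reading, a single lock guarding one shared data word. All names and behaviours here are assumptions, not the project's design.

```python
class SyncUnit:
    """Hypothetical model: one lock guarding one shared data word."""

    def __init__(self):
        self.owner = None   # thread id currently holding the lock, or None
        self.data = 0       # the guarded data word

    def lock(self, thread_id):
        """Try to take the lock; True on success or if already the owner."""
        if self.owner is None:
            self.owner = thread_id
            return True
        return self.owner == thread_id

    def unlock(self, thread_id):
        """Release the lock; only the current owner may do so."""
        if self.owner != thread_id:
            return False
        self.owner = None
        return True

    def access(self, thread_id, new_value=None):
        """Return the guarded data, optionally overwriting it (owner only)."""
        if self.owner != thread_id:
            raise PermissionError("thread does not hold the lock")
        old = self.data
        if new_value is not None:
            self.data = new_value
        return old
```

A thread whose `lock` call returns False would retry later. In hardware this could correspond to stalling or rescheduling the thread, but the slides do not say which.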

Multi-Threading-Core

No final version: everything went wrong; too many timing problems; 10 ns is hard to achieve (somebody said impossible); two weeks are shorter than thought; strange behaviour of XST, Modelsim and the boards; the old working example doesn't work anymore.

Thank You. Questions?