MPOC “Many Processors, One Chip”

MPOC “Many Processors, One Chip” presented by N. Özlem ÖZCAN

AGENDA Introduction Project Team Design Aims 4-stage Pipeline Memory Result

INTRODUCTION A project at Hewlett-Packard's Palo Alto Research Lab, running from 1998 to 2001. MPOC stands for "Many Processors, One Chip".

INTRODUCTION A single-chip community of identical high-speed RISC processors surrounding a large common storage area. Each processor has its own clock, cache and program counter. To minimize power consumption, each processor is kept small and simple, yet runs very fast within that small power budget.

PROJECT TEAM Stuart Siu, circuit design; Stephen Richardson, logic design and project management; Gary Vondran, project lead and logic design; Paul Keltcher, processor simulation and application development; Shankar Venkataraman, OS and application development; Krishnan Venkitakrishnan, logic design and bus architecture; Joseph Ku, circuit design; Manohar Prabhu and Ayodele Embry, interns.

DESIGN AIMS 1) novel funding for microprocessor research; 2) introducing multiprocessing to the embedded market; 3) trading design complexity for coarse grain parallelism; 4) a novel 4-stage microprocessor pipeline; 5) using co-resident on-chip DRAM to supply chip multiprocessor memory needs.

4-STAGE PIPELINE The classic five-stage pipeline MPOC starts from: F-stage to fetch the instruction from the instruction cache; D-stage to decode the instruction; E-stage to calculate arithmetic results and/or a memory address; M-stage, during which the processor can access the data cache; W-stage, during which the processor writes results back to its register file. MPOC eliminates the M-stage, leaving four stages.
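The stage-per-cycle flow above can be sketched in a few lines. This is a minimal illustrative model (not code from the MPOC project): each instruction advances one stage per cycle through MPOC's four remaining stages F, D, E, W.

```python
# Minimal sketch of MPOC's 4-stage pipeline (illustrative, not from the paper):
# each instruction moves one stage per cycle; instruction i is in stage s
# during cycle i + s.
STAGES = ["F", "D", "E", "W"]

def run_pipeline(instructions):
    """Return, for each cycle, which instruction occupies each stage."""
    timeline = []
    n_cycles = len(instructions) + len(STAGES) - 1
    for cycle in range(n_cycles):
        occupancy = {}
        for s, stage in enumerate(STAGES):
            i = cycle - s  # index of the instruction currently in this stage
            if 0 <= i < len(instructions):
                occupancy[stage] = instructions[i]
        timeline.append(occupancy)
    return timeline

for cycle, occ in enumerate(run_pipeline(["lw", "add", "sw"])):
    print(cycle, occ)
```

With three instructions the pipeline drains after 3 + 4 - 1 = 6 cycles; a hazard-free stream retires one instruction per cycle once the pipeline is full.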

4-STAGE PIPELINE Reasons for eliminating the M-stage: the small first-level caches can be accessed in a single cycle, and the simple base-plus-offset addressing scheme of the MIPS instruction set allows addresses to be calculated in the second half of the D-stage.
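The point about base-plus-offset addressing is that the effective address is a single add of a register and a sign-extended 16-bit immediate, cheap enough to fit in half a cycle. A small sketch (register values and instruction shown are hypothetical examples):

```python
# MIPS base-plus-offset addressing: effective address = base register
# + sign-extended 16-bit offset -- a single 32-bit add.
def sign_extend16(imm):
    """Interpret a 16-bit value as a signed offset."""
    return imm - 0x10000 if imm & 0x8000 else imm

def effective_address(base_reg_value, offset16):
    return (base_reg_value + sign_extend16(offset16)) & 0xFFFFFFFF

# e.g. lw $t0, -4($sp) with $sp = 0x7FFFF000 (-4 encodes as 0xFFFC)
print(hex(effective_address(0x7FFFF000, 0xFFFC)))  # -> 0x7fffeffc
```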

4-STAGE PIPELINE / LOAD & STORE

4-STAGE PIPELINE / LOAD & STORE The hit or miss signal for a data cache access does not appear until late in the E stage of the pipeline.

4-STAGE PIPELINE / LOAD & STORE LOAD => The data has already been fetched from the cache, but the miss signal arrives in time to halt the pipeline and keep the incorrect data from being written to the register file until the correct data arrives.

4-STAGE PIPELINE / LOAD & STORE STORE => The miss signal arrives in time to keep incorrect data out of the cache while the tags are updated, dirty data is written back to memory if necessary, and the pipeline is restarted.

4-STAGE PIPELINE / LOAD & STORE STORE THEN LOAD => Single bubble!

4-STAGE PIPELINE / LOAD & STORE ALU THEN LOAD => Single bubble!

4-STAGE PIPELINE / LOAD & STORE LOAD THEN A DEPENDENT INST => NO bubble!
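The three bubble rules on the slides above can be encoded directly. This is an illustrative sketch of the stated costs only (it does not model the pipeline itself): a store or ALU op immediately followed by a load costs one bubble; a load followed by a dependent instruction costs none.

```python
# Bubble rules from the slides: (store, load) and (alu, load) adjacent
# pairs each cost one bubble; load followed by a dependent instruction
# costs nothing.
ONE_BUBBLE_PAIRS = {("store", "load"), ("alu", "load")}

def count_bubbles(ops):
    """Count stall bubbles for a sequence of op classes."""
    return sum((a, b) in ONE_BUBBLE_PAIRS for a, b in zip(ops, ops[1:]))

print(count_bubbles(["store", "load", "alu", "load", "load", "alu"]))  # -> 2
```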

4-STAGE PIPELINE / BRANCH The instruction following a branch is always executed, regardless of the direction eventually taken by a branch. This extra instruction allows the pipeline to calculate the target of the branch, so that it can speculatively fetch the target in the following cycle.
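The delay-slot behaviour can be sketched as a fetch-order simulation. This is an illustrative model (labels and function names are invented for the example): the instruction after the branch always executes, and fetching then resumes at the target when the branch is taken.

```python
# Branch delay slot sketch: the instruction right after a branch always
# executes; control then transfers to the target only if the branch is taken.
def fetch_order(program, branch_pc, taken, target_pc, n=5):
    """Return the first n instruction labels executed."""
    order, pc = [], 0
    while len(order) < n and pc < len(program):
        order.append(program[pc])
        if pc == branch_pc:
            order.append(program[pc + 1])  # delay slot: always runs
            pc = target_pc if taken else pc + 2
        else:
            pc += 1
    return order[:n]

prog = ["i0", "beq", "slot", "i3", "i4", "tgt"]
print(fetch_order(prog, branch_pc=1, taken=True, target_pc=5))
print(fetch_order(prog, branch_pc=1, taken=False, target_pc=5))
```

Either way the slot instruction runs, which is exactly the cycle the hardware uses to compute and speculatively fetch the branch target.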

4-STAGE PIPELINE / BRANCH No penalty for a taken branch; 1-cycle penalty for a not-taken branch.

4-STAGE PIPELINE / BRANCH

MEMORY In MPOC's original plan, 1 MB to 4 MB of DRAM was to be placed on the same silicon die as the four processors.

MEMORY Two ways of managing the local memory: as a cache organized as a set of data lines, or as part of a unified physical address space that includes both local and remote memory.
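The second option can be illustrated with a toy address map. This sketch is hypothetical (the base address and window size are assumptions, not figures from the paper): each processor treats a fixed window of the physical address space as its local on-chip DRAM and everything else as remote.

```python
# Hypothetical address map for the "unified physical address space"
# option: a fixed 1 MB window is local on-chip DRAM; all other
# addresses are remote (assumed layout, not from the MPOC paper).
LOCAL_BASE = 0x00000000
LOCAL_SIZE = 1 << 20  # assume a 1 MB local window

def classify(addr):
    """Label a physical address as backed by local or remote memory."""
    if LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE:
        return "local"
    return "remote"

print(classify(0x000F0000), classify(0x02000000))  # -> local remote
```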

RESULT Final status of the design:

REFERENCE Stephen Richardson, "MPOC: A Chip Multiprocessor for Embedded Systems," Internet Systems and Storage Laboratory, HP Laboratories Palo Alto.