SUN ULTRASPARC-III ARCHITECTURE


SUN ULTRASPARC-III ARCHITECTURE CMPE 511 PRESENTATION Prepared by: Balkır Kayaaltı

Introduction
SPARC stands for Scalable Processor ARChitecture. It is an open processor architecture: member companies of the SPARC community can freely produce implementations. SPARC V9 is a robust RISC architecture with:
- 64-bit integer addresses and data
- superscalar implementations
- extremely fast trap handling and context switching
This presentation looks in detail at Sun Microsystems' UltraSPARC-III, a SPARC V9 implementation.

Major architectural units
The processor's microarchitecture has six major functional units that operate relatively independently:
- Instruction issue unit (IIU)
- Floating-point unit (FPU)
- Integer execution unit (IEU)
- Data cache unit (DCU)
- External memory unit (EMU)
- System interface unit (SIU)
The units communicate requests and results among themselves through well-defined interface protocols, as the next figure shows.

Communication paths between architectural units

Instruction issue unit
This unit feeds the execution pipelines with instructions. It independently predicts the control flow through a program and fetches the predicted path from the memory system. Fetched instructions are staged in a queue before being forwarded to the two execution units (integer and floating point). The unit includes:
- a 32-Kbyte, four-way associative instruction cache
- the instruction address translation buffer
- a 16K-entry branch predictor

UltraSPARC-III pipeline and physical data
Instruction issue: 4 integer, 2 floating point, 2 graphics
Level-one (L1) caches:
- Data: 64-Kbyte, 4-way
- Instructions: 32-Kbyte, 4-way
- Prefetch: 2-Kbyte, 4-way
- Write: 2-Kbyte, 4-way
Level-two (L2) cache: unified (data and instructions), 4- or 8-Mbyte, 1-way; on-chip tags, off-chip data

Pipeline

Pipeline blocks
Stage A: Generate instruction fetch addresses; generate predecoded instruction bits
Stage P: Fetch first cycle of instructions from cache; access first cycle of branch prediction
Stage F: Fetch second cycle of instructions from cache; access second cycle of branch prediction; translate virtual to physical address
Stage B: Calculate branch target addresses; decode first cycle of instructions
Stage I: Decode second cycle of instructions; enqueue instructions into the queue
Stage J: Steer instructions to execution units
Stage R: Read integer register file operands; check operand dependencies
Stage E: Execute arithmetic, logical, and shift integer instructions; access first cycle of data cache; read, and check dependencies of, the floating-point register file

Pipeline blocks (2)
Stage C: Access second cycle of data cache and forward load data for word and doubleword loads; execute first cycle of floating-point instructions
Stage M: Align load data for half-word and byte loads; execute second cycle of floating-point instructions
Stage W: Write speculative integer register file; execute third cycle of floating-point instructions
Stage X: Extend integer pipeline for precise floating-point traps; execute fourth cycle of floating-point instructions
Stage T: Report traps
Stage D: Write architectural register file

Pipeline
Instruction issue unit: stages A through J
Integer execution unit: stages R through D
Data cache: the E, C, M, and W stages, in parallel with the integer execution unit
Floating-point unit: a side pipeline running parallel to the E through D stages of the integer pipeline

Pipeline

Instruction issue unit (cont.)
High instruction-level parallelism is desired for performance. UltraSPARC-III is a static speculation machine.
- Dynamic speculation machines require very high fetch bandwidth to fill an instruction window and find instruction-level parallelism.
- In a static speculation machine, the compiler can lay the speculated path out sequentially, placing fewer demands on the instruction fetch unit.

Instruction issue unit: stage details
Stage A: fetch addresses enter the instruction cache; all fetch-address generation and selection occurs here.
Stages P and F: instruction cache access; branch-prediction access; instruction address translation.

By the time instructions are available from the cache in the B stage, the physical address from the translator and a prediction for any fetched branch are also available. The processor uses all of this information in the B stage to decide whether to follow the sequential or the taken-branch path.

Branch prediction
The processor also determines whether the instruction cache access was a hit or a miss. If the processor predicts a taken branch in the B stage, it sends the branch's target address back to the A stage to redirect the fetch stream. Waiting until the B stage to redirect the fetch stream permits a large, accurate branch predictor. The predictor uses a G-share algorithm with 16K 2-bit saturating up/down counters. Because the predictor is large, it is pipelined.
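The G-share scheme described above can be sketched as follows. This is a minimal illustrative model, not the hardware design: the 16K-counter table matches the slide, but the history length, PC hashing, and counter initialization are assumptions.

```python
# Sketch of a G-share branch predictor: a table of 2-bit saturating
# counters indexed by (load PC XOR global branch history).
TABLE_SIZE = 16 * 1024   # 16K two-bit counters, as on UltraSPARC-III
HIST_BITS = 14           # assumed history length (log2 of table size)

class GShare:
    def __init__(self):
        self.table = [1] * TABLE_SIZE   # counters start weakly not-taken
        self.history = 0                # global branch-history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & (TABLE_SIZE - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # True means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        # 2-bit saturating up/down counter
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # shift the outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

p = GShare()
for _ in range(20):               # train on an always-taken branch until
    p.update(0x4000, True)        # the history register saturates
print(p.predict(0x4000))          # True: the trained branch predicts taken
```

XORing the history into the index is what distinguishes G-share from a plain bimodal predictor: the same branch gets different counters under different global histories, which captures correlated branches.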

Instruction buffer (queue)
Two instruction queues are provided: the instruction queue and the miss queue. The 20-entry instruction queue decouples the fetch unit from the execution units, allowing each to proceed at its own rate. When a taken branch would otherwise leave a two-cycle gap before the queue refills with correct-path instructions, instructions held in the miss queue can be used immediately.

Integer execute unit
The execution pipelines can launch up to six instructions concurrently:
- two integer operations (A0/A1 pipelines)
- two floating-point operations (FP pipelines)
- one memory operation (load/store, MS pipeline)
- one special-purpose memory operation (prefetch cache load only)
- one control-transfer instruction (CTI, BR pipeline)
However, only four instructions per cycle (IPC) can be sustained.
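The per-pipeline launch constraints above can be expressed as a simple check. This is an illustrative sketch of the resource limits only, not the real grouping logic; the class names are invented for the example.

```python
# Check whether a candidate group of instruction classes fits the
# issue constraints described above: at most six instructions, with
# per-pipeline limits (2 integer, 2 FP, 1 load/store, 1 prefetch, 1 branch).
from collections import Counter

LIMITS = {"int": 2, "fp": 2, "mem": 1, "prefetch": 1, "branch": 1}

def can_issue(group):
    """Return True if this group of instruction classes can launch together."""
    counts = Counter(group)
    return (len(group) <= 6
            and all(c in LIMITS for c in counts)          # only known classes
            and all(counts[c] <= n for c, n in LIMITS.items()))

print(can_issue(["int", "int", "fp", "fp", "mem", "branch"]))  # True: a full group
print(can_issue(["int", "int", "int"]))  # False: only two integer pipes exist
```

Note the slide's distinction: six is the peak launch width, but sustained throughput is four instructions per cycle.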

Working and Architectural Register File (WARF)
Physically it is one block, but logically it can be seen as two separate register files: the working register file (WRF) and the architectural register file (ARF). SPARC architectures use register files with a windowing technique. Eight global registers (g0 through g7) are reachable at any time, and global register g0 always reads as 0. At any time, an instruction can access the 8 globals plus a 24-register window. A register window comprises the 8 'in' and 8 'local' registers of a particular register set, together with the 8 'in' registers of an adjacent register set, which are addressable from the current window as 'out' registers.

Register windows
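The window-sharing rule above can be made concrete with a small addressing sketch. This is a simplified model under assumed conventions (8 windows, a flat backing file, and 'out' registers aliasing the 'in' registers of the next-numbered window); the real hardware layout differs.

```python
# Sketch of SPARC register-window addressing: each instruction sees
# r0-r7 (globals), r8-r15 (outs), r16-r23 (locals), r24-r31 (ins).
# The outs of one window are the ins of the adjacent window, so each
# window contributes only 16 unique registers to the backing file.
NWINDOWS = 8

def flat_index(cwp, r):
    """Map architectural register r (0-31), seen from current window
    pointer cwp, to an index in a flat file of 8 globals + 16/window."""
    if r < 8:                       # globals are shared by all windows
        return r
    if r < 16:                      # outs alias the next window's ins
        return 8 + ((cwp + 1) % NWINDOWS) * 16 + (r - 8)
    if r < 24:                      # locals are private to this window
        return 8 + cwp * 16 + 8 + (r - 16)
    return 8 + cwp * 16 + (r - 24)  # ins

# The sharing property: %o0 in window 2 is the same physical register
# as %i0 in window 3, so a call passes arguments without copying.
print(flat_index(2, 8) == flat_index(3, 24))   # True
```

This overlap is what makes register windows cheap for procedure calls: shifting the window pointer renames a caller's outs into the callee's ins with no data movement.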

WARF
The WRF consists of 32 64-bit registers, each with 3 write ports and 7 read ports, plus a 1,984-bit broadside write port (31 nonzero registers x 64 bits) for transporting data from the architectural register file. The ARF has 160 entries (8 register windows in total):
- 8 x 8 = 64 local registers
- 8 x 8 = 64 registers for the 16 shared in/out registers per window
- 32 registers for 4 sets of 8 global registers
The WRF is managed as a single window and is updated as results are computed.

The processor accesses the WRF in the pipeline's R stage to supply integer operands to the execution units. Most integer operations complete in one cycle, so results can be written immediately in the C stage. If an exceptional event occurs, the written results must be undone: the original integer register values are restored by a broadside copy from the appropriate ARF window. The ARF is written at the end of the pipeline, once all exceptions have been resolved. The ARF fills 16 WRF entries after a window change; on an exception, the 31 nonzero WRF registers must be restored.

On-chip memory system
Cache diagram used in the architecture

On-chip memory system
Level-one (L1) caches:
- Data: 64-Kbyte, 4-way
- Instructions: 32-Kbyte, 4-way
- Prefetch: 2-Kbyte, 4-way
- Write: 2-Kbyte, 4-way
Level-two (L2) cache: unified (data and instructions), 4- or 8-Mbyte, 1-way; on-chip tags, off-chip data.
average latency = L1 hit time + L1 miss rate x L2 hit time + global L2 miss rate x main-memory latency
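The average-latency formula can be made concrete with a worked example. The miss rates and most latencies below are illustrative assumptions, not measured UltraSPARC-III figures; only the 12-cycle L2 fill time comes from these slides, and the L2 miss rate is taken as a global rate (misses per access, not per L1 miss).

```python
# Worked example of the two-level average memory access time formula.
l1_hit_time  = 2      # cycles (assumed)
l1_miss_rate = 0.05   # 5% of accesses miss in L1 (assumed)
l2_hit_time  = 12     # cycles to fill a 32-byte line from L2 (from the slides)
l2_miss_rate = 0.01   # 1% of all accesses also miss in L2 (global rate, assumed)
mem_latency  = 100    # cycles to main memory (assumed)

avg = l1_hit_time + l1_miss_rate * l2_hit_time + l2_miss_rate * mem_latency
print(f"average latency = {avg:.2f} cycles")   # 2 + 0.6 + 1.0 = 3.60
```

The breakdown shows why the design spends effort on each term: the prefetch cache attacks the memory-latency term, the on-chip L2 tags shorten the L2 term, and the write cache reduces traffic competing for L2 bandwidth.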

Prefetch cache
Performance is greatly improved by a prefetch cache operating in parallel with the L1 data cache. By issuing up to eight in-flight prefetches to main memory, the prefetch cache lets a program use 100% of the available main-memory bandwidth without being slowed by main-memory latency.

Prefetch cache
The prefetch cache is a 2-Kbyte SRAM organized as 32 entries of 64 bytes, four-way associative with an LRU replacement policy. A multiported SRAM design achieves very high throughput; data can be streamed through the prefetch cache much as in a stream buffer. On every cycle, two independent read ports each supply 8 bytes of data to the pipeline while a third write port fills the cache with 16 bytes.

Prefetch cache
Some earlier processors, such as UltraSPARC-II, rely on software prefetch instructions. UltraSPARC-III adds an autonomous stride-prefetch engine that tracks the program counters of load instructions and detects when a load is striding through memory. When the engine detects a striding load, it issues a hardware prefetch independent of any software prefetch. This makes the prefetch cache effective even on code that contains no prefetch instructions.
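The stride-detection idea can be sketched as a small table keyed by load PC. The confirmation threshold and table organization are assumptions for illustration, not the actual hardware design.

```python
# Sketch of a stride-prefetch engine: remember each load PC's last
# address and stride; once the same stride repeats enough times,
# issue a prefetch for the next predicted address.
class StridePrefetcher:
    def __init__(self, confirm=2):
        self.table = {}        # load PC -> (last_addr, stride, confidence)
        self.confirm = confirm

    def observe(self, pc, addr):
        """Feed one executed load; return a prefetch address or None."""
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        if last is not None:
            new_stride = addr - last
            conf = conf + 1 if new_stride == stride else 0
            stride = new_stride
        self.table[pc] = (addr, stride, conf)
        if conf >= self.confirm and stride != 0:
            return addr + stride    # predicted next access
        return None

# One load instruction (PC 0x1000) marching through memory in 64-byte steps:
p = StridePrefetcher()
issued = [p.observe(0x1000, a) for a in range(0x8000, 0x8000 + 5 * 64, 64)]
print(issued)   # first observations only train; later ones prefetch ahead
```

After the stride has been confirmed, each further load triggers a prefetch one stride ahead, which is how such an engine hides main-memory latency on array traversals without any software prefetch instructions.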

Write cache
Write caching is an excellent way to reduce the bandwidth consumed by store traffic. UltraSPARC-III uses a 2-Kbyte, 4-way associative write cache to reduce store-traffic bandwidth to the off-chip L2 data cache. A further advantage: as the sole source of on-chip dirty data, the write cache easily handles both multiprocessor and on-chip cache consistency. Error recovery also becomes easier, since the write cache keeps all other on-chip caches clean and simply invalidates them when an error is detected.

Write caching
The write cache uses a byte-validate policy. Rather than reading from the L2 cache the bytes of a line that are not being overwritten, the design keeps an individual valid bit for each byte. Skipping this read-on-allocate saves considerable L2 cache bandwidth by postponing the read-modify-write until the write cache evicts the line; frequently, the entire line has been written by eviction time, so the read can be eliminated altogether. The write cache is included in the L2 data cache, and write-cache data can supersede read data from the L2. This is handled by a byte-merging multiplexer on the incoming L2 cache data bus, which can choose either write-cache or L2 cache data for each byte.
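The byte-merging multiplexer can be modeled in a few lines. This is a toy sketch: the 8-byte line and list-of-ints representation are simplifications chosen for clarity.

```python
# Model of the byte-merging multiplexer: per-byte valid bits let
# write-cache data supersede L2 data byte by byte.
LINE = 8   # bytes per line in this toy example (real lines are larger)

def byte_merge(l2_line, wc_line, wc_valid):
    """For each byte, pick write-cache data when its valid bit is set,
    otherwise the L2 data."""
    return [wc if v else l2 for l2, wc, v in zip(l2_line, wc_line, wc_valid)]

l2_data  = [0x00] * LINE             # line as read from the L2 cache
wc_data  = [0xAA] * LINE             # newer store data in the write cache
wc_valid = [1, 1, 0, 0, 1, 0, 0, 0]  # only bytes 0, 1, and 4 were stored to

# Bytes 0, 1, and 4 come from the write cache; the rest from L2.
print(byte_merge(l2_data, wc_data, wc_valid))
```

The same valid bits are what make the deferred read-modify-write cheap: on eviction, only lines with some invalid bytes still need an L2 read before writeback.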

Floating point unit
This unit contains the datapaths and control logic to execute floating-point and partitioned fixed-point (graphics) instructions. Three datapaths concurrently execute floating-point or graphics instructions, one per cycle from each of the following classes:
- divide/multiply (single or double precision, or partitioned)
- add/subtract/compare (single or double precision, or partitioned)
- an independent division datapath, which lets the non-pipelined divide proceed concurrently with the fully pipelined multiply and add paths
To meet the processor's cycle time, the floating-point operations need extra latency cycles; advanced circuit techniques keep the added latency of the floating-point add and multiply units small.

External memory interface
External memory consists of a large off-chip L2 cache and a main memory built from synchronous DRAMs.
- L2 cache size: 4 or 8 Mbytes
- Latency: 12 clock cycles to deliver a 32-byte line to L1
The L2 tags are placed on chip to detect an L2 miss early: the L2 cache controller accesses the on-chip tags in parallel with the start of the off-chip SRAM access and provides a way-select signal to a late-select address pin on the off-chip SRAMs.

The L2 caches are wave-pipelined and operate at 600 MHz. The main-memory DRAM controller is on chip, which reduces memory latency and scales memory bandwidth with the number of processors. The memory controller supports up to 4 Gbytes of SDRAM organized as four independent banks.

Trap stage in the pipeline
In this architecture, the classical stall signal, which freezes the state of the pipeline, is eliminated for performance reasons. Instead, a trap stage at the end of the pipeline restores the machine state when an unexpected event occurs. The event is handled like a trap: the instructions in the pipeline are refetched from stage A.

Conclusion
The Sun Microsystems UltraSPARC-III is one of the advanced RISC microprocessors, with many applications in desktops, network systems, and scientific computing. This presentation described the internal architecture of the UltraSPARC-III and examined its major parts: instruction issue, execution, and the on-chip and external memory systems.

References
1) 'UltraSPARC-III: Designing Third-Generation 64-Bit Performance', IEEE Micro, June 1999.
2) 'Design Decisions Influencing the UltraSPARC's Instruction Fetch Architecture', 29th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 178-190, Paris, 1996.
3) UltraSPARC III V9 Manual, Sun Microsystems.

THANK YOU