The Alpha 21264 Thomas Daniels Other Dude Matt Ziegler.

Slides:



Advertisements
Similar presentations
PIPELINE AND VECTOR PROCESSING
Advertisements

Computer Organization and Architecture
Intel Pentium 4 ENCM Jonathan Bienert Tyson Marchuk.
1 Lecture 3: Instruction Set Architecture ISA types, register usage, memory addressing, endian and alignment, quantitative evaluation.
1 ECE462/562 ISA and Datapath Review Ali Akoglu. 2 Instruction Set Architecture A very important abstraction –interface between hardware and low-level.
Lecture 12 Reduce Miss Penalty and Hit Time
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
Some Other Instruction Set Architectures. Overview Alpha SPARC i386.
Fall EE 333 Lillevik 333f06-l20 University of Portland School of Engineering Computer Organization Lecture 20 Pipelining: “bucket brigade” MIPS.
THE MIPS R10000 SUPERSCALAR MICROPROCESSOR Kenneth C. Yeager IEEE Micro in April 1996 Presented by Nitin Gupta.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.
Computer Organization and Architecture
Processor Technology and Architecture
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.
RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.
7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.
Pipelining. Overview Pipelining is widely used in modern processors. Pipelining improves system performance in terms of throughput. Pipelined organization.
(6.1) Central Processing Unit Architecture  Architecture overview  Machine organization – von Neumann  Speeding up CPU operations – multiple registers.
Intel Pentium 4 Processor Presented by Presented by Steve Kelley Steve Kelley Zhijian Lu Zhijian Lu.
Basic Microcomputer Design. Inside the CPU Registers – storage locations Control Unit (CU) – coordinates the sequencing of steps involved in executing.
Chapter 5 Basic Processing Unit
The MIPS R10000 Superscalar Microprocessor Kenneth C. Yeager Nishanth Haranahalli February 11, 2004.
CSC 4250 Computer Architectures September 26, 2006 Appendix A. Pipelining.
AMD Opteron Overview Michael Trotter (mjt5v) Tim Kang (tjk2n) Jeff Barbieri (jjb3v)
ALPHA Introduction I- Stream ALPHA Introduction I- Stream Dharmesh Parikh.
Implementing Precise Interrupts in Pipelined Processors James E. Smith Andrew R.Pleszkun Presented By: Ravikumar Source:
1 Instruction Set Architecture (ISA) Alexander Titov 10/20/2012.
Dynamic Pipelines. Interstage Buffers Superscalar Pipeline Stages In Program Order In Program Order Out of Order.
The TM3270 Media-Processor. Introduction Design objective – exploit the high level of parallelism available. GPPs with Multi-media extensions (Ex: Intel’s.
Principles of Linear Pipelining
Lecture 11: 10/1/2002CS170 Fall CS170 Computer Organization and Architecture I Ayman Abdel-Hamid Department of Computer Science Old Dominion University.
ULTRASPARC 2005 INTRODUCTION AND ISA BY JAMES MURITHI.
ECEG-3202 Computer Architecture and Organization Chapter 7 Reduced Instruction Set Computers.
Pipelining and Parallelism Mark Staveley
UltraSPARC III Hari P. Ananthanarayanan Anand S. Rajan.
Implementing Precise Interrupts in Pipelined Processors James E. Smith Andrew R.Pleszkun Presented By: Shrikant G.
Recap Multicycle Operations –MIPS Floating Point Putting It All Together: the MIPS R4000 Pipeline.
PART 5: (1/2) Processor Internals CHAPTER 14: INSTRUCTION-LEVEL PARALLELISM AND SUPERSCALAR PROCESSORS 1.
RISC / CISC Architecture by Derek Ng. Overview CISC Architecture RISC Architecture  Pipelining RISC vs CISC.
High Performance Computing1 High Performance Computing (CS 680) Lecture 2a: Overview of High Performance Processors * Jeremy R. Johnson *This lecture was.
Memory Hierarchy— Five Ways to Reduce Miss Penalty.
Computer Organization Exam Review CS345 David Monismith.
ALPHA Introduction I- Stream
Central Processing Unit Architecture
William Stallings Computer Organization and Architecture 8th Edition
ELEN 468 Advanced Logic Design
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Vector Processing => Multimedia
MMX Multi Media eXtensions
Superscalar Pipelines Part 2
CS170 Computer Organization and Architecture I
Computer Organization “Central” Processing Unit (CPU)
Comparison of Two Processors
Alpha Microarchitecture
* From AMD 1996 Publication #18522 Revision E
COMS 361 Computer Organization
Chapter 4 The Von Neumann Model
What is Computer Architecture?
Computer Architecture
Course Outline for Computer Architecture
Chapter 4 The Von Neumann Model
Presentation transcript:

The Alpha Thomas Daniels Other Dude Matt Ziegler

Outline Overview Instruction Set Architecture Instruction Stream Data Stream

System Overview RISC instruction Set RISC instruction Set 64-bit processor 64-bit processor 15 million transistors 15 million transistors 1 st Alpha w/ out-of-order execution 1 st Alpha w/ out-of-order execution Speculative execution Speculative execution

Specs 6/7-stage pipeline 6/7-stage pipeline 4-way integer issue 4-way integer issue 2-way floating point issue 2-way floating point issue Peak of 6 instructions per cycle Peak of 6 instructions per cycle Sustainable of 4 instructions per cycle Sustainable of 4 instructions per cycle Can have up to 80 instructions in process at one time Can have up to 80 instructions in process at one time 4-way out-of-order issue 4-way out-of-order issue

Specs 4 integer execution units 4 integer execution units 2 general purpose units 2 general purpose units 2 address ALUs 2 address ALUs 32 int & 31 fp registers (64-bits wide) 32 int & 31 fp registers (64-bits wide) 48 int & 40 fp reorder registers 48 int & 40 fp reorder registers 80 additional int registers (copies other 80) 80 additional int registers (copies other 80)

RISC vs. CISC P3 and Athlon are CISC processors P3 and Athlon are CISC processors is a RISC processor is a RISC processor Today’s difference is that RISC ISA generally do not use operands from memory. Today’s difference is that RISC ISA generally do not use operands from memory.

Differences from Out of order issue Out of order issue Smaller pipeline Smaller pipeline Increased memory bandwidth Increased memory bandwidth Memory references can be accessed in parallel to caches Memory references can be accessed in parallel to caches One pipeline for both floating point and integer operations One pipeline for both floating point and integer operations 4x the bandwidth 4x the bandwidth

Previous instruction types Branch Branch Floating point Floating point Memory Memory Memory/Function code Memory/Function code Memory/branch Memory/branch Operate Operate PALcode PALcode

New instructions Floating Point Floating Point Cache prefetching Cache prefetching MVI (motion video instructions) MVI (motion video instructions)

Floating point instructions To Better calculate square root To Better calculate square root Single precision (SQRTS) Single precision (SQRTS) Double precision (SQRTT) Double precision (SQRTT) Move data between floating point and integer register files Move data between floating point and integer register files

Prefetch instructions Allows the compiler to exploit the higher bandwidth Allows the compiler to exploit the higher bandwidth Five instructions introduced Five instructions introduced

Prefetch instructions Cache Prefetch and Management Instructions Normal Prefetch The fetches the (64-byte) block into the (level one data and level two) cache. Prefetch with Modify Intent The same as the normal prefetch except that the block is loaded into the cache in dirty state so that subsequent stores can immediately update the block. Prefetch and Evict Next The same as the normal prefetch except that the block will be evicted from the (level one) data cache as soon as there is another block loaded at the same cache index. Write Hint 64The obtains write access to the 64-byte block without reading the old contents of the block. The application typically intends to over-write the entire contents of the block. EvictThe cache block is evicted from the caches.

MVI Or Motion Video Instructions Or Motion Video Instructions Set of Alpha processor instructions that are categorized as SIMD Set of Alpha processor instructions that are categorized as SIMD Intended to implement high quality software video encoding (MPEG-1, MPEG-2, H.261 and H.263 Intended to implement high quality software video encoding (MPEG-1, MPEG-2, H.261 and H.263

MVI (unpack) UNPKBW UNPKBW Unpack bytes to words Unpack bytes to words UNPKBL UNPKBL Unpack bytes to long words Unpack bytes to long words

MVI (pack) PACKWB PACKWB Truncates the four component words of the input register and writes them to the low four bytes of the output register. Truncates the four component words of the input register and writes them to the low four bytes of the output register. PACKLB PACKLB Truncates the 2 component long words of the input register to byte values and writes them to the low two bytes of the output register. Truncates the 2 component long words of the input register to byte values and writes them to the low two bytes of the output register.

MVI (Byte & Word Min. and Max.) MINUB8 MINUB8 MINUW4 MINUW4 MINSB8 MINSB8 MINSW4 MINSW4 MAXUB8 MAXUB8 MAXUW4 MAXUW4 MAXSB8 MAXSB8 MAXSW4 MAXSW4 Take the form MINxxx Ra, Rb, Rc

MVI (PERR) Replaced nine instructions for motion estimation. Replaced nine instructions for motion estimation. It takes the 8 bytes packed into 2 quadword registers and computes the absolute difference between them, then adds the eight intermediate results and right aligns the result in the destination register. It takes the 8 bytes packed into 2 quadword registers and computes the absolute difference between them, then adds the eight intermediate results and right aligns the result in the destination register. RESULT: Motion estimation calculations on 8 pixels in a single clock tick. RESULT: Motion estimation calculations on 8 pixels in a single clock tick.