Winter 2004 Class Presentation for Advanced VLSI Course. Instructor: Dr. S. M. Fakhraie. Presented by: Naser Sedaghati. Major Reference: Design and Implementation of the POWER5 Microprocessor.

Winter 2004
Class Presentation for Advanced VLSI Course
Instructor: Dr. S. M. Fakhraie
Presented by: Naser Sedaghati

Major Reference: Design and Implementation of the POWER5™ Microprocessor
J. Clabes (1), J. Friedrich (1), M. Sweet (1), J. DiLullo (1), S. Chu (1), D. Plass (2), J. Dawson (2), P. Muench (2), L. Powell (1), M. Floyd (1), B. Sinharoy (2), M. Lee (1), M. Goulet (1), J. Wagoner (1), N. Schwarz (1), S. Runyon (1), G. Gorman (1), P. Restle (3), R. Kalla (1), J. McGill (1), S. Dodson (1)
(1) IBM System Group, Austin, TX
(2) IBM System Group, Poughkeepsie, NY
(3) IBM Research, Yorktown Heights, NY
IEEE International Solid-State Circuits Conference 2004

Outline
- Motivation
- Background
- Threading Fundamentals
- Enhancement: SMT Implementation in POWER5
- Memory Subsystem Enhancements
- Power Efficiency
- Additional SMT Considerations
- Summary

Motivation: Microprocessor Design Optimization Focus Areas
- Memory latency
  - Increased processor speeds make memory appear further away
  - Longer stalls possible
- Branch processing
  - A mispredict becomes more costly as pipeline depth increases, resulting in stalls and wasted power
  - Predication drives increased power and larger chip area
- Execution unit utilization
  - Currently 20-25% execution unit utilization is common
- Simultaneous multi-threading (SMT) and the POWER architecture address these areas
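The point that faster processors make memory "appear further away" can be illustrated with a little arithmetic: a fixed memory latency in nanoseconds costs more core cycles as the clock rate rises. The following is a minimal sketch; the 100 ns latency and the clock rates are hypothetical values chosen for illustration, not figures from the presentation.

```python
def stall_cycles(memory_latency_ns: float, clock_ghz: float) -> float:
    """Core cycles spent waiting for one memory access of the given latency."""
    # 1 GHz means 1 cycle per nanosecond, so cycles = ns * GHz.
    return memory_latency_ns * clock_ghz


if __name__ == "__main__":
    LATENCY_NS = 100.0  # hypothetical, held constant across clock rates
    for ghz in (0.5, 1.0, 2.0):  # hypothetical clock rates
        cycles = stall_cycles(LATENCY_NS, ghz)
        print(f"{ghz:.1f} GHz: {cycles:.0f} cycles per memory access")
```

With the latency held constant, doubling the clock doubles the number of cycles lost per miss, which is the motivation for hiding latency with a second thread.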

Background: POWER4 (shipped in systems December 2001)
- Technology: 180 nm lithography, Cu, SOI
  - POWER4+ shipping in 130 nm today
  - 267 mm², 185M transistors
- Dual processor core
- 8-way superscalar
  - Out-of-order execution
  - 2 Load/Store units
  - 2 Fixed Point units
  - 2 Floating Point units
  - Logical operations on Condition Register
  - Branch Execution unit
- More than 200 instructions in flight
- Hardware instruction and data prefetch

Background: POWER5 (the next step)
- Technology: 130 nm lithography, Cu, SOI
  - 389 mm², 276M transistors
- Dual processor core
- 8-way superscalar
- Simultaneous multithreaded (SMT) core
  - Up to 2 virtual processors per real processor
- Natural extension to the POWER4 design

Background: System-level view of POWER5 [figure slide]

Threading: Multi-threading Evolution [figure slide]

Enhancement: Changes Going From ST to SMT Core
SMT is easily added to a superscalar micro-architecture:
- Second Program Counter (PC) added to share I-fetch bandwidth
- GPR/FPR rename mapper expanded to map a second set of registers (high-order address bit indicates thread)
- Completion logic replicated to track two threads
- Thread bit added to most address/tag buses
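The expanded rename mapper can be pictured as one table whose index carries the thread ID as its high-order bit. The sketch below is only an illustration of that indexing idea, not IBM's implementation: the names `mapper_index` and `RenameMapper` are hypothetical, the 32 architected GPRs per thread come from the POWER architecture, and the pool of 120 physical registers is the figure quoted on the resources slide. Freeing of old mappings at completion is deliberately left out.

```python
ARCH_REGS_PER_THREAD = 32  # architected GPRs per thread
THREAD_BIT_SHIFT = 5       # log2(32): the thread ID sits above the register number


def mapper_index(thread_id: int, arch_reg: int) -> int:
    """Index into the shared mapper; the high-order bit selects the thread."""
    if thread_id not in (0, 1):
        raise ValueError("two hardware threads only")
    if not 0 <= arch_reg < ARCH_REGS_PER_THREAD:
        raise ValueError("architected register out of range")
    return (thread_id << THREAD_BIT_SHIFT) | arch_reg


class RenameMapper:
    """Hypothetical shared rename table for two threads (no register freeing)."""

    def __init__(self, physical_regs: int = 120):
        self.table = [None] * (2 * ARCH_REGS_PER_THREAD)
        self.free = list(range(physical_regs))  # shared physical register pool

    def rename_dest(self, thread_id: int, arch_reg: int) -> int:
        """Allocate a physical register for a destination operand."""
        if not self.free:
            raise RuntimeError("no free physical register; dispatch would stall")
        phys = self.free.pop(0)
        self.table[mapper_index(thread_id, arch_reg)] = phys
        return phys

    def lookup_src(self, thread_id: int, arch_reg: int):
        """Return the current physical register for a source operand."""
        return self.table[mapper_index(thread_id, arch_reg)]


if __name__ == "__main__":
    mapper = RenameMapper()
    mapper.rename_dest(0, 3)  # thread 0 writes r3
    mapper.rename_dest(1, 3)  # thread 1 writes its own r3
    print(mapper.lookup_src(0, 3), mapper.lookup_src(1, 3))
```

Both threads name the same architected register, yet they land in different mapper entries and draw from one shared physical pool, which is why the added hardware is small.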

Enhancement: POWER5 Resources Size Enhancements
- Enhanced caches and translation resources
  - I-cache: 64 KB, 2-way set associative, LRU
  - D-cache: 32 KB, 4-way set associative, LRU
  - First-level data translation: 128 entries, fully associative, LRU
  - L2 cache: 1.92 MB, 10-way set associative, LRU
- Larger resource pools
  - Rename registers: GPRs and FPRs increased to 120 each
  - L2 cache coherency engines: increased by 100%
- Enhanced data stream prefetching
- Memory controller moved on chip

Enhancement: Thread Priority
- Instances when unbalanced execution is desirable
  - No work for the opposite thread
  - Thread waiting on a lock
  - Software-determined non-uniform balance
  - Power management
  - …
- Solution: control the instruction decode rate
  - Software/hardware controls 8 priority levels for each thread
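Controlling the decode rate by priority can be sketched as splitting decode cycles between the two threads according to their priority difference. The rule assumed here, R = 2^(|X - Y| + 1) with the lower-priority thread receiving 1 of every R decode cycles, follows published descriptions of POWER5 as best understood and should be treated as an assumption; the slide itself gives no formula. The function name `decode_shares` is hypothetical, and the special behaviour of the lowest priority levels is not modelled.

```python
from fractions import Fraction


def decode_shares(prio0: int, prio1: int) -> tuple[Fraction, Fraction]:
    """Fraction of decode cycles given to (thread 0, thread 1).

    Assumed rule: R = 2 ** (|X - Y| + 1); the lower-priority thread
    gets 1/R of the decode cycles and the other thread gets the rest.
    """
    for prio in (prio0, prio1):
        if not 2 <= prio <= 7:
            raise ValueError("this sketch models priority levels 2..7 only")
    ratio = 2 ** (abs(prio0 - prio1) + 1)
    low_share = Fraction(1, ratio)
    high_share = 1 - low_share
    if prio0 >= prio1:
        return high_share, low_share
    return low_share, high_share


if __name__ == "__main__":
    for pair in ((4, 4), (5, 4), (6, 4), (2, 7)):
        t0, t1 = decode_shares(*pair)
        print(f"priorities {pair}: thread0 {t0}, thread1 {t1}")
```

Equal priorities split decode cycles evenly, and each additional level of difference halves the share of the lower-priority thread.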

Memory: Modifications to POWER4 System Structure [figure slide]

Power: Power Efficient Design Implementation
- DC power mitigation
  - Leverage triple-Vt technology
    - Decrease low-Vt usage by 90%
    - Increase high-Vt usage by 30%
  - Leverage triple-Tox technology
    - Thick-Tox usage for decoupling capacitors
- AC power mitigation
  - Minimal usage of dynamic circuits
  - Reduce loading on the clock mesh
  - Incorporation of dynamic clock gating
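The AC power measures above all act on terms of the standard CMOS switching-power relation P = a * C * Vdd^2 * f: reducing clock-mesh loading lowers C, and clock gating lowers the effective activity a. The sketch below uses that textbook relation only; the capacitance, voltage, frequency, and idle fraction are hypothetical numbers, not POWER5 data, and the function names are illustrative.

```python
def dynamic_power(c_farads: float, vdd_volts: float, freq_hz: float,
                  activity: float = 1.0) -> float:
    """Textbook CMOS switching power: P = activity * C * Vdd^2 * f (watts)."""
    return activity * c_farads * vdd_volts ** 2 * freq_hz


def gated_clock_power(c_clock_farads: float, vdd_volts: float, freq_hz: float,
                      idle_fraction: float) -> float:
    """Clock power when gated latches stop toggling during idle cycles."""
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return dynamic_power(c_clock_farads, vdd_volts, freq_hz,
                         activity=1.0 - idle_fraction)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    C_CLOCK = 10e-9   # 10 nF of gated clock load
    VDD = 1.3         # volts
    FREQ = 1.5e9      # 1.5 GHz
    ungated = dynamic_power(C_CLOCK, VDD, FREQ)
    gated = gated_clock_power(C_CLOCK, VDD, FREQ, idle_fraction=0.4)
    print(f"ungated: {ungated:.1f} W, gated: {gated:.1f} W")
```

In this model the saving is proportional to the fraction of cycles in which the gated logic is idle.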

Power: Thermal control logic and sample thermal response [figure slide]

Additional: 16-way Building Block [figure slide]

Additional: POWER5 Multi-Chip Module
- 95 mm × 95 mm
- Four POWER5 chips
- Four cache chips
- 4,491 signal I/Os
- 89 layers of metal

Additional: 64-way SMP Interconnection
- Interconnection exploits the enhanced distributed switch
- All chip interconnections operate at half processor frequency and scale with processor frequency
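The system sizes quoted in the slides can be cross-checked with simple arithmetic from the per-chip figures given earlier: two cores per chip, two SMT threads per core, and four POWER5 chips per multi-chip module. The sketch below assumes that "n-way" counts physical cores; the function name `system_shape` is hypothetical.

```python
CORES_PER_CHIP = 2    # "Dual processor core"
THREADS_PER_CORE = 2  # "Up to 2 virtual processors per real processor"
CHIPS_PER_MCM = 4     # "Four POWER5 chips" per multi-chip module


def system_shape(n_way: int) -> dict:
    """Chips, MCMs, and hardware threads for an n-way SMP.

    Assumption: "n-way" counts physical processor cores.
    """
    chips, remainder = divmod(n_way, CORES_PER_CHIP)
    if remainder:
        raise ValueError("n_way must be a multiple of the cores per chip")
    mcms = -(-chips // CHIPS_PER_MCM)  # ceiling division
    return {
        "cores": n_way,
        "chips": chips,
        "mcms": mcms,
        "hardware_threads": n_way * THREADS_PER_CORE,
    }


if __name__ == "__main__":
    for n in (16, 64):
        print(n, system_shape(n))
```

Under these assumptions a 16-way building block corresponds to two MCMs, and a 64-way system to eight MCMs exposing 128 hardware threads.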

Additional: POWER4 and POWER5 Storage Hierarchy

                                    POWER4                POWER5
L2 cache capacity                   1.44 MB               1.92 MB
L2 cache line size                  128 B                 128 B
L2 cache associativity              8-way                 10-way
L2 cache replacement                LRU                   LRU
Off-chip L3 cache capacity          32 MB                 36 MB
Off-chip L3 cache line size         512 B                 256 B
Off-chip L3 cache associativity     8-way                 12-way
Off-chip L3 cache replacement       LRU                   LRU
Chip interconnect type              Distributed switch    Enhanced distributed switch
Intra-MCM / inter-MCM data buses    ½ processor speed     Processor speed
Memory (maximum)                    512 GB                1024 GB (1 TB)
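The capacity, line size, and associativity figures above determine the number of sets in each cache: sets = capacity / (line size × ways). The sketch below works this out, assuming that the quoted "MB" values are binary and that 1.44 MB and 1.92 MB stand for 1440 KiB and 1920 KiB respectively; that interpretation is an assumption made so the numbers divide evenly, and the function name `num_sets` is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def num_sets(capacity_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(capacity_bytes, line_bytes * ways)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


# (capacity, line size, ways); capacities interpreted as noted above.
CACHES = {
    "POWER4 L2": (1440 * KIB, 128, 8),
    "POWER5 L2": (1920 * KIB, 128, 10),
    "POWER4 L3": (32 * MIB, 512, 8),
    "POWER5 L3": (36 * MIB, 256, 12),
}

if __name__ == "__main__":
    for name, (capacity, line, ways) in CACHES.items():
        print(f"{name}: {num_sets(capacity, line, ways)} sets")
```

Under these assumptions the POWER5 L2 gains both ways and sets over POWER4, while the POWER5 L3 trades a shorter line for more sets and higher associativity.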

Additional: POWER Server Roadmap [figure slide]

Summary
- First dual-core SMT microprocessor
- Extended SMP to 64-way
- Operating in laboratory
- Power dynamically managed with no performance penalty
- Implementation permits future technology scalability from a circuit and power perspective
- Innovative approach leveraging technology with a system focus for high performance in a power-efficient design

Other References
[1] R. Kalla, B. Sinharoy, J. Tendler, "IBM POWER5 Chip: A Dual-Core Multithreaded Processor", IEEE Micro (IEEE Computer Society), March-April 2004.
[2] R. Kalla, IBM System Group, "IBM's POWER5 Design and Methodology", IBM Corporation, 2003.