CS 152 Computer Architecture and Engineering Lecture 23: Putting it all together: Intel Nehalem Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley

5/5/2009 CS152-Spring’09

Intel Nehalem
Review the entire semester by looking at the most recent microprocessor from Intel. Nehalem is the code name for the microarchitecture at the heart of the Core i7 and the Xeon 5500 series server chips, first released at the end of 2008.

Nehalem System Example: Apple Mac Pro Desktop 2009
Two Nehalem chips (“sockets”), each containing four processors (“cores”) running at up to 2.93GHz.
Each chip has three DRAM channels attached, each 8 bytes wide at 1.066Gb/s (3 * 8.5GB/s). Can have up to two DIMMs on each channel (up to 4GB/DIMM).
“QuickPath” point-to-point system interconnect between CPUs and I/O, up to 25.6 GB/s per link.
PCI Express connections for graphics cards and other expansion boards, up to 8 GB/s per slot.
Disk drives attached with 3Gb/s serial ATA links.
Slower peripherals (Ethernet, USB, FireWire, WiFi, Bluetooth, audio).
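The bandwidth figures above follow from simple arithmetic; a quick sketch checking them (values taken from the slide, where 1.066Gb/s is the per-pin DDR3-1066 transfer rate):

```python
# Back-of-the-envelope check of the Mac Pro memory bandwidth figures.
transfer_rate = 1.066e9   # transfers per second per pin (DDR3-1066)
channel_width = 8         # bytes per transfer (64-bit data channel)
channels = 3              # DRAM channels per Nehalem chip

per_channel = transfer_rate * channel_width   # bytes/s per channel
total = per_channel * channels                # bytes/s per chip

print(f"per channel: {per_channel / 1e9:.2f} GB/s")  # ~8.53 GB/s
print(f"per chip:    {total / 1e9:.1f} GB/s")        # ~25.6 GB/s
```

This matches the slide’s “3 * 8.5GB/s” per chip, and explains why the 25.6 GB/s QuickPath link bandwidth is a comfortable match for one socket’s memory traffic.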

Building Blocks to support a “Family” of processors

Nehalem Die Photo

Nehalem Memory Hierarchy Overview
[Diagram: each CPU core has a 32KB L1 I$, a 32KB L1 D$, and a private 256KB L2$; 4-8 cores share an 8MB L3$, the DDR3 DRAM memory controllers, and the QuickPath system interconnect.]
Each DRAM channel is 64/72b wide at up to 1.33Gb/s.
Private L1/L2 per core; the L3 is fully inclusive of the higher levels (but the L2 is not inclusive of the L1).
Other sockets’ caches are kept coherent using QuickPath messages.
Local memory access latency is ~60ns.
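A toy sketch of what “L3 fully inclusive” means in practice (this is an illustration of the invariant, not Nehalem’s actual coherence protocol; the sets and addresses are hypothetical): every line held in a core’s private L1/L2 must also be present in the L3, so evicting a line from the L3 forces a back-invalidation of any copies above it.

```python
# Toy model of an inclusive shared L3 over private per-core caches.
l3 = {0x100, 0x140, 0x180}    # lines currently in the shared L3
core_l2 = {0x100, 0x140}      # lines in one core's private L2
core_l1 = {0x100}             # lines in that core's L1
                              # (L2 need not include L1 on Nehalem)

def evict_from_l3(line):
    """Evict a line from L3; inclusion requires invalidating it above."""
    l3.discard(line)
    core_l2.discard(line)     # back-invalidate the private caches
    core_l1.discard(line)

evict_from_l3(0x100)
# Inclusion invariant: private caches hold only lines the L3 also holds.
assert core_l1 <= l3 and core_l2 <= l3
```

The benefit of inclusion is that a snoop from another socket only needs to check the L3: if the line is not there, it cannot be in any core’s private caches either.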

All Sockets can Access all Data
Local memory access is ~60ns; access to memory attached to another socket is ~100ns.

In-Order Fetch
In-Order Decode and Register Renaming
Out-of-Order Execution
In-Order Commit, Out-of-Order Completion
2 SMT Threads per Core

Front-End Instruction Fetch & Decode
µOP is Intel’s name for an internal RISC-like instruction, into which x86 instructions are translated (variable-length x86 instruction bits become fixed-format internal µOP bits).
Loop Stream Detector (can run short loops out of the buffer).

Branch Prediction
Part of the instruction fetch unit. Several different types of branch predictor:
–Details not public
–Two-level BTB
–Loop count predictor: how many backwards-taken branches before loop exit (also a predictor for the length of microcode loops, e.g., string move)
–Return Stack Buffer: holds subroutine return targets; the stack buffer is renamed so that it is repaired after mispredicted returns; separate return stack buffer for each SMT thread
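The return stack buffer can be sketched as a small bounded stack (an illustrative model only, not Intel’s implementation; the class name, depth, and addresses are made up): calls push the return address and returns predict by popping, so returns predict correctly even when a subroutine is called from many different sites, where a plain BTB would keep mispredicting.

```python
# Illustrative sketch of a return stack buffer (RAS) predictor.
class ReturnStackBuffer:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:  # fixed capacity: drop the oldest
            self.stack.pop(0)
        self.stack.append(return_address)

    def predict_return(self):
        # Predict the target of a return; empty stack means no prediction.
        return self.stack.pop() if self.stack else None

rsb = ReturnStackBuffer()
rsb.on_call(0x400123)   # call from site A
rsb.on_call(0x400456)   # nested call from site B
assert rsb.predict_return() == 0x400456  # innermost return first
assert rsb.predict_return() == 0x400123
```

Giving each SMT thread its own buffer, as the slide notes, keeps one thread’s call/return pattern from corrupting the other’s predictions.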

x86 Decoding
Translates up to 4 x86 instructions into µOPs each cycle. Only the first x86 instruction in a group can be complex (maps to 1-4 µOPs); the rest must be simple (each maps to one µOP). Even more complex instructions jump into the microcode engine, which spits out a stream of µOPs.
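The grouping rule above can be sketched as a toy model (illustrative only; the instruction names and µOP counts here are hypothetical, and the real decoders apply many more constraints):

```python
# Toy model of the 4-wide decode constraint: slot 0 may take a "complex"
# instruction (1-4 uOPs); slots 1-3 accept only "simple" (1-uOP) ones.
def decode_group(instructions):
    """Consume instructions for one decode cycle.
    Each instruction is (name, uops_needed); returns (taken, uop_count)."""
    taken, uops = [], 0
    for i, (name, n) in enumerate(instructions[:4]):
        if i == 0:
            if n > 4:
                break          # too complex: handled by the microcode engine
        elif n != 1:
            break              # a complex inst past slot 0 ends the group
        taken.append(name)
        uops += n
    return taken, uops

# Hypothetical 'cplx' decodes to 3 uOPs, 'add' to 1.
assert decode_group([("cplx", 3), ("add", 1), ("add", 1), ("add", 1)]) \
    == (["cplx", "add", "add", "add"], 6)
# The same complex instruction in slot 1 cuts the group short.
assert decode_group([("add", 1), ("cplx", 3), ("add", 1)]) == (["add"], 1)
```

This is why compilers targeting these cores try to schedule complex x86 instructions so they land at the front of a decode group.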

Split x86 instructions into small µOPs, then fuse them back into bigger units

Loop Stream Detectors save Power

Out-of-Order Execution Engine
Renaming happens at the µOP level (not on the original macro x86 instructions).

SMT Effects in the OoO Execution Core
The reorder buffer (which remembers program order and exception status for in-order commit) has 128 entries, divided statically and equally between the two SMT threads.
The reservation stations (instructions waiting for operands before execution) have 36 entries, competitively shared by the threads.
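A minimal sketch of the two sharing policies just described (the entry counts come from the slide; the policy model itself is a simplification):

```python
# Static partitioning vs. competitive sharing of OoO resources (2 threads).
ROB_ENTRIES = 128   # reorder buffer: statically split between SMT threads
RS_ENTRIES = 36     # reservation stations: competitively shared

# Static: each thread always owns exactly half the ROB,
# even when the other thread is idle.
rob_per_thread = ROB_ENTRIES // 2
assert rob_per_thread == 64

# Competitive: a thread grabs any free RS entry, so one busy thread
# may (in this toy model) hold almost all of them.
rs_in_use = {"thread0": 30, "thread1": 6}
assert sum(rs_in_use.values()) <= RS_ENTRIES
```

Static partitioning guarantees fairness and simple commit logic; competitive sharing lets a single-threaded workload use the whole structure, which is why the two policies are mixed.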

Nehalem Virtual Memory Details
Implements a 48-bit virtual address space and a 40-bit physical address space.
Two-level TLB:
–The I-TLB (L1) has 128 shared entries, 4-way associative, for 4KB pages, plus 7 dedicated fully-associative entries per SMT thread for large-page (2/4MB) entries
–The D-TLB (L1) has 64 entries for 4KB pages and 32 entries for 2/4MB pages, both 4-way associative and dynamically shared between the SMT threads
–The unified L2 TLB has 512 entries for 4KB pages only, also 4-way associative
Additional support for system-level virtual machines.
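The address-space sizes above imply the page-number widths directly; a quick worked computation (4KB pages give a 12-bit page offset):

```python
# Address-space arithmetic for Nehalem's virtual memory (from the slide).
VA_BITS, PA_BITS, PAGE_BITS = 48, 40, 12   # 4KB page -> 12 offset bits

virtual_space = 2 ** VA_BITS     # 2^48 bytes = 256 TB virtual
physical_space = 2 ** PA_BITS    # 2^40 bytes = 1 TB physical
vpn_bits = VA_BITS - PAGE_BITS   # virtual page number: 36 bits
ppn_bits = PA_BITS - PAGE_BITS   # physical page number: 28 bits

assert virtual_space == 256 * 2**40
assert physical_space == 2**40
print(f"VPN bits: {vpn_bits}, PPN bits: {ppn_bits}")  # VPN 36, PPN 28
```

So a TLB entry for a 4KB page must translate a 36-bit VPN into a 28-bit PPN, plus permission and status bits.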

Core’s Private Memory System
Load queue: 48 entries. Store queue: 32 entries. Both divided statically between the SMT threads. Up to 16 outstanding misses in flight per core.


Core Area Breakdown


Quiz Results

Related Courses
CS 61C: basic computer organization, first look at pipelines + caches (strong prerequisite for CS 152)
CS 150: Digital Logic Design
CS 152: Computer Architecture, first look at parallel architectures
CS 250: Complex Digital Design (chip design)
CS 252: Graduate Computer Architecture, advanced topics
CS 258: Parallel Architectures, Languages, Systems

Advice: Get involved in research
E.g., RAD Lab (data center), Par Lab (parallel clients). Undergrad research experience is the most important part of an application to top grad schools.

End of CS152
Final Quiz 6 on Thursday (lectures 19, 20, 21). HKN survey to follow. Thanks for all your feedback; we’ll keep trying to make CS152 better.