1
Computer architectures M
Latest processors
2
Introduction – new requirements
No more clock frequency increases (power consumption – temperature – heat dissipation). Power consumption and configurations totally scalable. QPI. Larger caches (integrated L1, L2 and L3). First example (2009), Nehalem: mono/dual/quad/8-core processor, 2-way multithreaded (2/4/8/16 virtual processors), 64-bit parallelism, 4-wide superscalar (4 CISC instructions to the decoders in parallel => 4+3 = 7 u-ops per clock, 4 u-ops to the RAT per clock, 6 u-ops to the EUs per clock), 40-bit physical address => 1 TeraByte (one million MBytes), 45 nm technology. From 700M to 2G transistors per chip. Sandy Bridge: 32 nm technology -> 2.3G transistors.
3
Nehalem Processor (block diagram): four cores, each with an I-level cache and sharing a II-level cache and central queue, an L3 shared cache, an integrated memory controller, two QPI links (QPI 0 and QPI 1) and miscellaneous I/O. CORE architecture, but with Hyperthreading.
4
Cores, Caches and Links
Reconfigurable architecture. Notebook: 1- or 2-core systems which must have reduced cost, low power consumption and low execution latency for single tasks. Desktop: similar characteristics but less importance for power consumption; high bandwidth for graphical applications. Server: high core count, very high bandwidth and low latency for many different tasks; RAS – Reliability, Availability, Serviceability – of paramount importance.
3 integrated dynamic DDR3 memory controllers. (Diagram: DRAM and QPI attached to the chip; the cores plus an "uncore" region containing the IMC, power & clock circuitry and the shared L3 cache.)
5
Nehalem characteristics
QPI bus. Each processor (core) has a 32 KB, 4-way associative instruction cache, a 32 KB, 8-way associative data cache, and a unified (data and instructions) II-level cache of 256 KB, 8-way associative. The II-level cache is not inclusive. Each quad-core socket (node) relies on a maximum of three DDR3 channels which together reach a 32 GB/s peak bandwidth. Each channel operates in independent mode and the controller handles the requests out of order so as to minimize the total latency. Each core can handle up to 10 data cache misses and up to 16 outstanding transactions concurrently (i.e. instruction and data retrievals from a higher-level cache); in comparison, Core 2 could handle 8 data cache misses and 14 transactions. Third-level cache (inclusive). A central queue allows the interconnection and arbitration between the cores and the "uncore" region (common to all cores), that is the L3, the memory controller and the QPI interface. From the performance point of view an inclusive L3 is the ideal configuration since it allows the chip's coherence problems to be handled efficiently (see later) and avoids data replication: since it is inclusive, any datum present in any core is present in L3 too (although possibly in a non-coherent state). The cache sizes change according to the model.
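A quick sanity check of the ~32 GB/s figure (a sketch: the slide only gives the aggregate number, so the DDR3-1333 speed and 8-byte channel width assumed here are one configuration that reaches it):

```c
/* Rough sanity check of the ~32 GB/s peak bandwidth quoted above.
 * Assumption (not from the slide): three channels of DDR3-1333, 64-bit wide. */
#include <stdio.h>

int main(void) {
    const double transfers_per_sec = 1333e6; /* DDR3-1333: 1333 MT/s per channel */
    const double bytes_per_transfer = 8.0;   /* 64-bit channel */
    const int channels = 3;                  /* three integrated DDR3 channels */

    double peak = transfers_per_sec * bytes_per_transfer * channels;
    printf("Peak bandwidth: %.1f GB/s\n", peak / 1e9);  /* ~32.0 GB/s */
    return 0;
}
```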
6
Increased power. Nehalem pipeline (block diagram): Instruction Fetch (ITLB, 32 kB Instruction Cache) → Instruction Queue → Decoders (1+3) → 7 u-ops → Rename/Allocate → 4 u-ops → Reservation Station → 6 u-ops → Execution Units, with the Retirement Unit (ReOrder Buffer), DTLB and 32 kB Data Cache alongside. Two unified levels behind the core: 2nd-level TLB and 256 kB 2nd-level cache, then L3 and beyond.
7
Macrofusion All Core macrofusion cases plus…
CMP+Jcc macrofusion added for these branch conditions too: JL/JNGE, JGE/JNL, JLE/JNG, JG/JNLE
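To make the idea concrete, a small C function whose compare-and-branch pairs are the kind a compiler typically lowers to CMP+Jcc and that Nehalem can fuse into a single u-op; the assembly shown in the comments is only illustrative (actual compiler output varies):

```c
/* A loop whose back-edge and guard compile to CMP + Jcc pairs of the kind
 * Nehalem can macrofuse into a single compare-and-branch u-op.  The asm in
 * the comments is what a typical x86-64 compiler might emit, not taken
 * from the slides. */
#include <stdio.h>

int sum_below(const int *v, int n, int limit) {
    int s = 0;
    for (int i = 0; i < n; i++) {   /* cmp esi, edi ; jl .loop   -> fusable CMP+JL  */
        if (v[i] < limit)           /* cmp eax, edx ; jge .skip  -> fusable CMP+JGE */
            s += v[i];
    }
    return s;
}

int main(void) {
    int v[] = {1, 5, 9, 2};
    printf("%d\n", sum_below(v, 4, 6));  /* prints 8 */
    return 0;
}
```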
8
Core loop detector. Exploits hardware loop detection: the loop detector analyzes the branches and determines whether they form a loop (a jump with fixed direction, taken back to the same target each time). It avoids the repetitive fetch and branch prediction… but still requires decoding each cycle. (Diagram: Fetch → Branch Prediction → Loop Stream Detector holding 18 CISC instructions → Decode.)
9
Nehalem loop detector
Similar concept, but a higher number of instructions is considered: the Loop Stream Detector now sits after the decoders and holds 28 micro-ops (similar to a trace cache). (Diagram: Fetch → Branch Prediction → Decode → Loop Stream Detector, 28 u-ops.) After the Loop Stream Detector, the last step is a separate stack engine which removes all u-ops regarding the stack. The u-ops which speculatively modify the stack pointer are handled by a separate adder which writes to a "delta" register (a register apart – RSB), periodically synchronized with the architectural register containing the non-speculated stack pointer. U-ops which only manipulate the stack pointer therefore do not enter the execution units.
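The kind of loop the mechanism targets, as a sketch: a body of only a handful of instructions which, under the assumption that it decodes to well under 28 u-ops, can be replayed from the Loop Stream Detector without re-fetching, re-predicting or re-decoding. Whether a given compiler's output actually stays in the LSD is not guaranteed.

```c
/* A tight loop small enough to (plausibly) fit in the 28-u-op Loop Stream
 * Detector: after the first iteration the front end can replay the buffered
 * u-ops instead of fetching and decoding them again.  Illustrative only. */
#include <stdio.h>

long checksum(const unsigned char *p, long n) {
    long acc = 0;
    for (long i = 0; i < n; i++)   /* load + add + increment + compare/branch */
        acc += p[i];
    return acc;
}

int main(void) {
    unsigned char buf[256];
    for (int i = 0; i < 256; i++) buf[i] = (unsigned char)i;
    printf("%ld\n", checksum(buf, 256));   /* prints 32640 */
    return 0;
}
```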
10
Two levels branch predictor
Double level in order to include an ever-increasing number of predictions, with a mechanism similar to that of the caches. The first-level BTB has 256/512 entries (according to the model): if the data is not found there, a second level with 2-8K entries (activated only upon access) is interrogated. This reduces the power consumption, since the second-level BTB is very seldom activated. Increased number of RSB (Return Stack Buffer) slots for deeper speculative execution.
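A toy model of the two-level lookup described above (the sizes, direct-mapped indexing and promotion policy are assumptions for illustration, not Intel's actual design): the small first level is always consulted, and only on a miss is the large second level touched, which is why it is seldom activated and saves power.

```c
/* Toy two-level BTB lookup (illustrative only). */
#include <stdint.h>
#include <stdio.h>

#define L1_ENTRIES 256     /* small first-level BTB (slide: 256/512 entries) */
#define L2_ENTRIES 4096    /* large second-level BTB (slide: 2-8K entries)   */

typedef struct { uint64_t tag; uint64_t target; int valid; } btb_entry;

static btb_entry l1[L1_ENTRIES];   /* fast, always consulted        */
static btb_entry l2[L2_ENTRIES];   /* consulted only on an L1 miss  */

/* Returns the predicted target, or 0 if neither level knows the branch. */
uint64_t btb_lookup(uint64_t branch_pc) {
    btb_entry *e1 = &l1[branch_pc % L1_ENTRIES];
    if (e1->valid && e1->tag == branch_pc)
        return e1->target;                       /* common case: L2 stays idle */

    btb_entry *e2 = &l2[branch_pc % L2_ENTRIES];
    if (e2->valid && e2->tag == branch_pc) {
        *e1 = *e2;                               /* promote into the first level */
        return e2->target;
    }
    return 0;                                    /* no prediction available */
}

int main(void) {
    l2[0x400123 % L2_ENTRIES] = (btb_entry){0x400123, 0x400200, 1};
    printf("%#llx\n", (unsigned long long)btb_lookup(0x400123)); /* hits in L2 */
    printf("%#llx\n", (unsigned long long)btb_lookup(0x400123)); /* now hits in L1 */
    return 0;
}
```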
11
More powerful memory subsystem
Fast access to non-aligned data: greater freedom for the compiler.
Hierarchical TLB (number of entries, 4 kB pages):
1st-level instruction TLB: 128 entries – 4 ways
1st-level data TLB: 64 entries – 4 ways
2nd-level unified TLB: 512 entries – 4 ways
12
Nehalem internal architecture
Nehalem's two TLB levels are dynamically allocated between the two threads. A very big difference with Core is the cache coverage degree. Core had a 6 MB L2 cache and a TLB with 136 entries (4 ways): considering 4 kB pages, the coverage was 136 × 4 × 4 KB = 2176 KB, about a third of its 6 MB L2. Nehalem has a 576-entry DTLB (512 second level + 64 first level), which means 4 × 576 = 2304 translations, amounting to 2304 × 4 KB = 9216 KB of covered memory – more than the entire L3 (8 MB). (The meaning is that an address translation has a great probability of targeting data already in L3.)
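The coverage arithmetic above, spelled out (following the slide's own entries × ways × page-size computation):

```c
/* TLB coverage as the slide computes it: entries x ways x 4 KB page. */
#include <stdio.h>

int main(void) {
    long core_kb    = 136 * 4 * 4;          /* Core:    2176 KB, ~1/3 of its 6 MB L2  */
    long nehalem_kb = (512 + 64) * 4 * 4;   /* Nehalem: 9216 KB, more than the 8 MB L3 */
    printf("Core coverage:    %ld KB\n", core_kb);
    printf("Nehalem coverage: %ld KB\n", nehalem_kb);
    return 0;
}
```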
13
(Smart) Caches
Three hierarchical levels:
First level: 32 KB instructions (4 ways) and 32 KB data (8 ways)
Second level: unified, 256 KB (8 ways – 10 cycles for access)
Third level: shared among the various cores; the size depends on the number of cores (for a quad-core, 8 MB, 16 ways), with a latency of a few tens of cycles. Designed for future expansion.
Inclusive: all addresses in L1 and L2 are present in L3 too (possibly with different states). Each L3 line has n "core valid" bits (4 in the quad-core case) which indicate which cores, if any, have a copy of the line: a datum in L1 or L2 is certainly in L3 too, but not vice versa.
The L3 has a private power plane and operates at its own frequency (not that of the cores) and at a higher voltage, because power must be spared and big caches very often show errors if their voltage is too low.
(Diagram: each core with its L1 caches and L2 cache, all connected to the shared L3 cache.)
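A minimal sketch of how the per-level latencies are usually observed: a chain of dependent loads over a working set sized to the cache level of interest (here 256 KB, roughly one Nehalem L2). The chain is randomized so the prefetcher cannot hide the latency; timing with clock() is deliberately crude, a real measurement would use a cycle counter.

```c
/* Pointer-chasing latency sketch: each load depends on the previous one. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINES  4096         /* 4096 lines * 64 B = 256 KB working set */
#define STRIDE 8            /* 8 * sizeof(size_t) = 64 B: one element per cache line */
#define ITERS  10000000L

int main(void) {
    size_t *mem  = malloc(LINES * STRIDE * sizeof *mem);
    size_t *perm = malloc(LINES * sizeof *perm);

    /* Random permutation of the cache lines, so hardware prefetch cannot help. */
    for (size_t i = 0; i < LINES; i++) perm[i] = i;
    srand(1);
    for (size_t i = LINES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    /* Link the lines into one circular pointer chain. */
    for (size_t i = 0; i < LINES; i++)
        mem[perm[i] * STRIDE] = perm[(i + 1) % LINES] * STRIDE;

    clock_t t0 = clock();
    size_t p = 0;
    for (long i = 0; i < ITERS; i++)
        p = mem[p];                     /* dependent load chain */
    clock_t t1 = clock();

    printf("avg %.1f ns per dependent load (p=%zu)\n",
           1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / ITERS, p);
    free(mem); free(perm);
    return 0;
}
```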
14
Inclusive cache vs. exclusive
(Diagrams: two quad-core chips, one with an inclusive L3 and one with an exclusive L3.) An example: the data requested by Core 0 is not present in its L1 and L2 and is therefore requested from the shared L3.
15
Inclusive cache vs. exclusive
(Diagrams: the lookup misses in both L3 caches.) The requested datum cannot be retrieved from L3 either.
16
Inclusive cache vs. exclusive (miss)
(Diagrams: miss in both L3 caches.) With an inclusive L3 the datum is certainly not on the chip; with an exclusive L3 a request must still be sent to the other cores.
17
Inclusive cache vs. exclusive (hit)
(Diagrams: the lookup hits in both L3 caches.) With an exclusive L3 no further requests to the other cores are needed: if the datum is in L3 it cannot be in any core. With an inclusive L3 the datum could be in other cores too, but… (see next slide).
18
Inclusive cache vs. exclusive
The L3 cache has a directory (one bit per core) which indicates whether, and in which cores, the datum is present. A snoop is necessary only if exactly one bit is set (that core could hold a modified copy). If two or more bits are set, the line in L3 is «clean» and can be forwarded directly from L3 to the requesting core. (The philosophy is directory-based coherence.) (Diagram: hit in the inclusive L3.)
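A toy sketch of that decision rule (purely illustrative, with a hypothetical helper name, not Intel's implementation): given the core-valid bit vector of an L3 line, snoop only when exactly one bit is set; with zero or two-or-more bits set, the L3 copy can be served directly.

```c
/* Core-valid-bits filter: decide whether a snoop is needed. */
#include <stdio.h>

/* Returns the core to snoop, or -1 if the L3 copy can be forwarded directly. */
int core_to_snoop(unsigned core_valid) {
    if (core_valid != 0 && (core_valid & (core_valid - 1)) == 0) { /* exactly one bit set */
        int c = 0;
        while (!(core_valid & (1u << c)))   /* find the single set bit */
            c++;
        return c;      /* that core could hold a modified copy: snoop only it */
    }
    return -1;         /* 0 bits: no core has it; >=2 bits: line is clean in L3 */
}

int main(void) {
    printf("%d\n", core_to_snoop(0x4));  /* only core 2 has a copy -> snoop core 2 */
    printf("%d\n", core_to_snoop(0x5));  /* cores 0 and 2 -> clean, serve from L3: -1 */
    printf("%d\n", core_to_snoop(0x0));  /* no core has it -> serve from L3: -1 */
    return 0;
}
```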
19
Inclusive cache vs. exclusive
(Diagrams: inclusive L3 with core-valid bits vs. exclusive L3.) With the inclusive L3, only the core that holds the datum (here Core 2) must be tested. With the exclusive L3, on a miss all cores must be tested.
20
Unified Reservation Station
Execution unit. Unified Reservation Station: a single scheduler for all the execution units (integer, FP, etc.) which schedules operations to them. It can execute 6 u-ops/cycle: 3 memory u-ops (1 Load, 1 Store Address, 1 Store Data) and 3 "computational" u-ops. 6 ports, as in Core:
Port 0: Integer ALU & Shift – FP Multiply – Divide – SSE Integer ALU, Integer Shuffles
Port 1: Integer ALU & LEA – FP Add – Complex Integer – SSE Integer Multiply
Port 2: Load
Port 3: Store Address
Port 4: Store Data
Port 5: Integer ALU & Shift – Branch – FP Shuffle – SSE Integer ALU, Integer Shuffles
21
Execution unit Loop Stream Detector
Each fetch retrieves 16 bytes (128 bits) from the cache; they are inserted into a predecode buffer from which 6 instructions at a time are sent to an 18-instruction queue. 4 CISC instructions at a time are sent to the 4 decoders (when possible) and the decoded u-ops are sent to a 28-slot Loop Stream Detector. A new technique is implemented (Unaligned Cache Accesses) which grants the same execution speed to aligned and non-aligned accesses (i.e. those crossing a cache line). Previously, non-aligned accesses carried a big execution penalty which very often prevented the use of particular instructions: from Nehalem onwards this is no longer the case.
22
Nehalem internal architecture – EUs for non-memory instructions
The RAT can rename up to 4 u-ops per cycle (the number of physical registers differs according to the implementation). The renamed instructions are placed in the ROB and, when their operands are ready, inserted into the Unified Reservation Station. The ROB and the RS are shared between the two threads (multithreading!). The ROB is statically subdivided: identical speculative execution "depth" for the two threads. The RS is "competitively" shared according to the situation: a thread could be waiting for a memory datum and therefore be using few or no RS entries, and it would be senseless to reserve entries for a blocked thread.
23
EU for memory instructions
Nehalem internal architecture – EUs for memory instructions. Up to 48 Loads and 32 Stores in the MOB. From the Load and Store buffers the u-ops access the caches hierarchically. As in the Pentium IV, caches and TLBs are dynamically shared between the two threads. Each core accepts up to 16 outstanding misses for the best use of the increased memory bandwidth.
24
Nehalem full picture. L3 shared among the chip cores. Other improvements – SSE 4.2 instructions: string manipulations (very important for XML handling) and instructions for CRC computation (important for transmissions).
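A quick sketch of the hardware CRC mentioned above, using the SSE4.2 CRC32 intrinsic `_mm_crc32_u8` (CRC-32C polynomial); it requires an SSE4.2-capable x86 CPU and `-msse4.2` at compile time.

```c
/* Hardware CRC-32C via the SSE4.2 CRC32 instruction. */
#include <nmmintrin.h>   /* _mm_crc32_u8 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

uint32_t crc32c(const void *data, size_t len) {
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, p[i]);   /* one CRC32 instruction per byte */
    return ~crc;
}

int main(void) {
    const char *msg = "123456789";
    printf("%08x\n", crc32c(msg, strlen(msg)));   /* CRC-32C of the test string */
    return 0;
}
```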
25
Execution parallelism improvement
Increased ROB size (33% more slots) and improvement of the related structures:
Structure | Intel Core (Merom) | Nehalem | Comment
Reservation Station | 32 | 36 | Dispatches operations to the execution units
Load Buffers | 32 | 48 | Track all allocated load operations
Store Buffers | 20 | 32 | Track all allocated store operations
26
Nehalem vs Core: the Core internal architecture modified for QPI
The execution engine is the same, with some blocks added for integer and FP optimisation. In practice the increased number of RS entries allows full use of the EUs, which in Core were sometimes starved. For the same reason multithreading was reintroduced (and therefore the larger ROB – 128 slots vs the 96 of Core – and the larger number of RS entries – 36 vs the 32 of Core). Load buffers are now 48 (32 in Core) and store buffers 32 (20 in Core). The buffers are partitioned between the threads (fairness). It must be noted that the static sharing between threads gives each thread a reduced number of ROB slots (64, i.e. 128/2, instead of 96), but Intel states that in case of a single thread all resources are given to it…
27
Power consumption control
PLL = Phase Locked Loop: from a (quartz) base frequency all the requested frequencies are generated. Power Control Unit: current, temperature and power are controlled in real time; flexible, with a sophisticated hardware algorithm for power consumption optimization. (Diagram: each core has its own PLL, sensors, frequency and Vcc with a power supply gate; the uncore/LLC has its own Vcc; everything is driven from BCLK and monitored by the PCU.)
28
Core power consumption
Total power consumption breakdown (chart): Clock distribution (blue) – a high-frequency design requires an efficient global clock distribution; Leakage currents (green) – high-frequency systems are affected by unavoidable losses; Local clocks and logic (red) – clocks and gates.
29
Power minimisation
Idle CPU states are called Cn. (Chart: idle power in W vs. exit latency in ms, from C0 to Cn.) The higher n is, the lower the power consumption, BUT the longer the time needed to exit the «idle» state. The OS informs the CPU that no processes must be executed through the privileged instruction MWAIT(Cn).
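An illustrative, compile-only sketch of how an OS idle loop issues MONITOR/MWAIT to request a deep C state. This is ring-0 code and cannot run from user space; the `need_resched` flag and the hint value 0x20 are placeholders (real kernels take the model-specific MWAIT hints from the ACPI tables).

```c
/* Ring-0 only sketch of an MWAIT-based idle loop (not runnable from user space). */
#include <stdint.h>

static volatile uint32_t need_resched;   /* hypothetical flag an interrupt handler would set */

static inline void cpu_idle_mwait(uint32_t cstate_hint) {
    /* Arm the monitor on the address whose write should wake us. */
    __asm__ volatile("monitor" :: "a"(&need_resched), "c"(0), "d"(0));
    if (!need_resched)
        /* Enter the requested C state until an interrupt or a write
         * to the monitored line wakes the core. */
        __asm__ volatile("mwait" :: "a"(cstate_hint), "c"(0));
}

void idle_loop(void) {
    while (!need_resched)
        cpu_idle_mwait(0x20);   /* placeholder hint for a deep C state */
}
```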
30
C states (before Nehalem)
C0 state: active CPU. C1 and C2 states: pipeline clock and most of the other clocks stopped. C3 state: all clocks stopped. C4, C5 and C6 states: progressive reduction of the operating voltage. Cores had only one power plane and all devices had to be idle before the voltage could be reduced. The higher C values were the most expensive in exit time (voltage increase, state restore, pipeline restart, etc.).
31
C states behaviour (Core)
(Chart: core power vs. time for Core 0 and Core 1.) Task completed, no task ready: instruction MWAIT(C6).
32
C states behaviour (Core)
Core 1 execution stopped, its state saved and its clocks stopped. Core 0 keeps executing.
33
C states behaviour (Core)
Task completed, no task ready: instruction MWAIT(C6).
34
C states behaviour (Core)
Only now is it possible: reduced voltage and power (both cores are idle).
35
C states behaviour (Core)
Core 1 interrupt: the voltage is increased for both cores, Core 1's clocks are reactivated, its state is restored and the instruction following MWAIT(C6) is executed. Core 0 stays idle.
36
C states behaviour (Core)
Core 0 interrupt: Core 0 returns to state C0 and the instruction following MWAIT(C6) is executed. Core 1 keeps executing.
37
C6 Nehalem
Cores 0, 1, 2 and 3 active. Separate power supply for each core!!!!
38
C6 Nehalem
Core 2 task completed, no task ready: MWAIT(C6).
39
C6 Nehalem. Core 2 stopped, its clocks stopped. Cores 0, 1 and 3 keep executing.
40
C6 Nehalem. Core 2's power gate is switched off: its voltage drops to 0 and it is in state C6. Cores 0, 1 and 3 keep executing.
41
C6 Nehalem. Core 0 task completed, no task ready: MWAIT(C6), Core 0 enters C6. Cores 1 and 3 keep executing.
42
C6 Nehalem. Core 2 interrupt: Core 2 returns to C0 and resumes execution from the instruction after MWAIT(C6). Cores 1 and 3 keep executing.
43
C6 Nehalem. Core 0 interrupt: its power gate is turned back on, clocks reactivated, state restored, execution resumes from the instruction after MWAIT(C6). Cores 1, 2 and 3 keep executing.
44
C6 Nehalem – core power consumption
(Chart: per-core power – losses, clock distribution, clock and logic, × N cores – plus uncore power – losses, uncore clock distribution, I/O, uncore logic.) With all cores in state C6: core power drops to ~0, the uncore clock distribution is stopped and the I/O goes to low power – the entire package enters package C6.
45
Further power reduction
Memory: memory clocks are stopped between requests when usage is low; memory refresh keeps working in the package C3 (clocks stopped) and C6 (power down) states too. Links: low power when the processor enters deeper Cx states. The Power Control Unit monitors the interrupt frequency and changes the C states accordingly. The C states requested through the operating system depend on processor utilization; with some light workloads the utilization can be low but the latency can be of paramount importance (e.g. real-time systems), so the CPU can implement complex behaviour-optimisation algorithms. The system changes the operating clock frequency according to the requirements in order to minimize power consumption (processor P states). The Power Control Unit sets the operating voltage for each clock frequency, operating condition and silicon characteristic. When a core enters a low-power C state its operating voltage is reduced while that of the other cores is unchanged.
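On Linux the frequency (P state) actually chosen can be observed through the standard cpufreq sysfs interface; a minimal sketch (the paths are the usual cpufreq ones, which may be absent on systems without cpufreq support, and nothing here is Nehalem-specific):

```c
/* Observe the OS/hardware-selected frequency via Linux cpufreq sysfs. */
#include <stdio.h>

static void show(const char *path) {
    char buf[128];
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof buf, f))
        printf("%-55s %s", path, buf);
    if (f) fclose(f);
}

int main(void) {
    show("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    show("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");   /* kHz */
    show("/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq");   /* kHz */
    return 0;
}
```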
46
Turbo pre-Nehalem (Core) – no Turbo
(Chart: frequency bars for Core 0 and Core 1 under a lightly threaded workload.) Clock stopped: power reduction in the inactive core.
47
Turbo pre-Nehalem (Core) – Turbo Mode
Clock stopped: power reduction in the inactive core. In response to the workload, Turbo Mode adds additional performance bins within the headroom. (Chart: the active core's frequency bar grows above F.)
48
Turbo Nehalem – no Turbo
It uses the available clock frequency headroom to maximize performance for both multi-threaded and single-threaded workloads. Power gating: zero power for the inactive cores. (Chart: frequency bars for Cores 0-3 under a lightly threaded workload or one below TDP.) TDP: Thermal Design Power – an indication of the heat (energy) produced by a processor, which is also the maximum power the cooling system must dissipate; measured in Watts.
49
Turbo Nehalem – Turbo Mode
Power gating: zero power for the inactive cores. In response to the workload, Turbo Mode adds additional performance bins (frequency increases) within the headroom. (Chart: with two cores gated, the active cores' frequency bars grow above F.)
50
Turbo Nehalem – Turbo Mode
Power gating: zero power for the inactive cores. Turbo Mode keeps adding performance bins within the headroom. (Chart: the active cores' frequency bars grow further.)
51
Turbo Nehalem – no Turbo
Active cores running workloads below the Thermal Design Power. (Chart: all four cores active at the nominal frequency F.)
52
Turbo Nehalem – Turbo Mode with all cores active
Active cores running workloads < TDP: in response to the workload, Turbo Mode adds additional performance bins within the headroom. (Chart: all four cores' frequency bars grow above F.) TDP = Thermal Design Power.
53
Turbo Nehalem – Turbo Mode
Power gating: zero power for the inactive cores. In response to the workload, Turbo Mode adds additional performance bins within the headroom. (Chart: the remaining active cores' frequency bars grow even further.)
54
Turbo enabling – Turbo Mode is transparent
Frequency transitions are handled in hardware. The operating system asks for P-state changes (frequency and voltage) in a transparent way, activating Turbo mode only when needed for better performance. The Power Control Unit keeps the silicon within the required limits.
55
Westmere. Westmere is the name of the 32 nm Nehalem shrink and is the basis of Core i3, Core i5 and the multi-core i7 line. Characteristics: 2-12 native cores (multithreaded => up to 24 logical processors); 12 MB L3 cache; some versions have an integrated graphics controller; a new instruction set for the Advanced Encryption Standard (AES-NI) and a new instruction, PCLMULQDQ, which executes carry-less multiplications as required by cryptography (e.g. disk encryption); 1 GB page support.
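A minimal demo of the two instruction families just mentioned, via their intrinsics in `<wmmintrin.h>`: one AES encryption round (`_mm_aesenc_si128`) and one carry-less multiply (`_mm_clmulepi64_si128`). Compile with `-maes -mpclmul`; the input values are arbitrary, and this is not a full AES or GCM implementation, just the raw instructions.

```c
/* One AES-NI round and one PCLMULQDQ carry-less multiply. */
#include <wmmintrin.h>   /* AES-NI and PCLMULQDQ intrinsics */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    __m128i state    = _mm_set_epi32(0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c);
    __m128i roundkey = _mm_set1_epi32(0x5A5A5A5A);

    /* One AES round: SubBytes, ShiftRows, MixColumns, AddRoundKey in one instruction. */
    __m128i next = _mm_aesenc_si128(state, roundkey);

    /* Carry-less (GF(2)) multiplication of the low 64-bit halves of a and b. */
    __m128i a = _mm_set_epi64x(0, 0x8765);
    __m128i b = _mm_set_epi64x(0, 0x4321);
    __m128i clmul = _mm_clmulepi64_si128(a, b, 0x00);

    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, next);
    printf("aesenc : %016llx %016llx\n", (unsigned long long)out[1], (unsigned long long)out[0]);
    _mm_storeu_si128((__m128i *)out, clmul);
    printf("pclmul : %016llx %016llx\n", (unsigned long long)out[1], (unsigned long long)out[0]);
    return 0;
}
```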
57
Roadmap
58
Hexa-Core Gulftown – 12 Threads
59
EVGA W555 - Dual processor Westmere – (2x6x2) = 24 threads !!
Video controller
60
Intel 5520 chipset and two nForce 200 controllers under the heat sink; 8 SATA ports (2 × 6 Gb/s and 6 × 3 Gb/s); two processor sockets; 6 × 2 DDR3 DIMM slots; 7 PCI slots. Non-standard E-ATX size, 1 or 2 processors, and overclocking support.
62
Cooling towers
63
Roadmap (Haswell): 22 nm technology – 3D tri-gate transistors – 14-stage pipeline – dual-channel DDR3 – 64 KB L1 (32 KB instructions + 32 KB data) and 256 KB L2 per core, reduced cache latency – can use DDR4 – three possible GPUs: the most powerful (GT3) has 20 EUs – integrated voltage regulator (moved from the motherboard onto the chip) – better power consumption, up to 100 W TDP – 10% improved performance – 5 decoded CISC instructions, macrofused, produce 4 u-ops per clock – up to 8 u-ops dispatched per clock.
64
Haswell front end
65
Haswell front end
66
Haswell execution unit
8 ports!! Increasing the OoO window allows the execution units to extract more parallelism and thus improves single-threaded performance. Priority was given to extracting instruction-level parallelism.
67
Haswell execution unit
68
Haswell