Drinking from the Firehose: cool and cold transfer prediction in the Mill™ CPU
Out-of-the-Box Computing (patents pending), IEEE-SVC, 2013/11/12

Presentation transcript:


The Mill Architecture: transfer prediction without delay. New with the Mill:
- Run-ahead prediction: prediction before code is loaded
- Explicit prefetch prediction: no wasted instruction loads
- Automatic profiling: prediction in cold code

What is prediction?
Prediction is a micro-architecture mechanism to smooth the flow of instructions in today's slow-memory, long-pipeline CPUs. Like caches, the prediction mechanism and its success or failure are invisible to the program, except in their performance and power impact. Present prediction methods work quite well in small, regular benchmarks run on bare machines. They break down when code has irregular flow of control, and when processes are started or switched frequently.

The Mill CPU
The Mill is a new general-purpose commercial CPU family. The Mill has a 10x single-thread power/performance gain over conventional out-of-order superscalar architectures, yet runs the same programs, without rewrite. This talk will explain:
- the problems that prediction is intended to alleviate
- how conventional prediction works
- the Mill CPU's novel approach to prediction

Talks in this series
1. Encoding
2. The Belt
3. Cache hierarchy
4. Prediction (you are here)
5. Metadata and speculation
6. Specification
7. …
Slides and videos of other talks are at: ootbcomp.com/docs

Caution: gross over-simplification!
This talk tries to convey an intuitive understanding to the non-specialist. The reality is more complicated.

Branches vs. pipelines
    if (I == 0) F(); else G();
Do we call F() or G()? Compiled:
    load I
    eql 0
    brfl lab
    call F
    …
    lab: call G
    …
[Pipeline diagram with cache, decode, execute, and schedule stages: 32 cycles deep on an Intel Pentium 4 Prescott, 5 cycles on a Mill.]

Branches vs. pipelines
[Pipeline diagram: load I, eql 0, and brfl flow through the pipeline, then everything stalls until the branch resolves, and only then does call G issue. More stall than work!]

So we guess…
[Pipeline diagram: load I, eql 0, brfl, and then call G issued immediately on a guess to call G (correct).] Guess right? No stall!

So we guess…
[Pipeline diagram: load I, eql 0, brfl, and then call F issued on a guess to call F (wrong).] Guess wrong? Mispredict stalls!

So we guess…
[Pipeline diagram: the pipeline stalls while the prediction is fixed to call G, and call G issues at last.] Finally!

How the guess works
    if (I == 0) F(); else G();
[Pipeline diagram: empty pipeline with cache, decode, execute, and schedule stages.]

How the guess works
[Pipeline diagram: load I, eql 0, brfl, call F flowing through the pipeline.]

How the guess works
[Pipeline diagram: as before, but now a branch history table feeds the front of the pipeline.]

How the guess works
[Pipeline diagram: the branch history table supplies the guess for brfl; after a brief stall, call G issues.] Many fewer stalls!

So what's it cost?
When (as is typical):
- one instruction in eight is a branch
- the predictor guesses right 95% of the time
- the mispredict penalty is 15 cycles
then predict failure wastes 8.5% of cycles.
The simplest fix is to lower the miss penalty: shorten the pipeline! The Mill pipeline is five cycles, not 15, so a Mill misprediction wastes only 3% of cycles.
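These waste figures follow from simple arithmetic: stall cycles per instruction are (branch frequency) x (miss rate) x (penalty), and the waste is that stall as a fraction of total cycles. A minimal sketch in C, assuming one cycle per instruction when prediction succeeds (the numbers are the slide's assumptions, not measurements):

    #include <stdio.h>

    /* Fraction of all cycles lost to mispredict stalls, assuming one
       cycle per instruction when the guess is right. */
    static double mispredict_waste(double branch_freq, double hit_rate,
                                   double penalty_cycles)
    {
        double stall = branch_freq * (1.0 - hit_rate) * penalty_cycles;
        return stall / (1.0 + stall);
    }

    int main(void)
    {
        printf("15-cycle penalty: %.1f%%\n",
               100.0 * mispredict_waste(1.0 / 8, 0.95, 15)); /* ~8.5% */
        printf(" 5-cycle penalty: %.1f%%\n",
               100.0 * mispredict_waste(1.0 / 8, 0.95, 5));  /* ~3%   */
        return 0;
    }

The same formula with hit_rate = 0.5 reproduces the cold-code figures on the next slide: about 48% at a 15-cycle penalty and about 24% at 5.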

The catch: cold code
The guess is based on prior history with the branch. What happens if there is no prior history? Cold code means a random guess. In cold code:
- one instruction in eight is a branch
- the predictor guesses right 50% of the time
- the mispredict penalty is 15 cycles
so predict failure wastes 48% of cycles (23% on a Mill). Ouch!

But wait, it gets worse!
Cold code means no relevant branch history table contents. It also means no relevant cache contents.
[Diagram: a mispredict costs 15 cycles, but an instruction fetch that misses all the way to DRAM costs 300+ cycles.]

Miss cost in cold code
In cold code, when:
- one instruction in eight is a branch
- the predictor guesses right 50% of the time
- the mispredict penalty is 15 cycles
- the cache miss penalty is 300 cycles
- a cache line is 64 bytes, holding 16 instructions
then cold misses waste 96% of cycles (94% on a Mill). Ouch!
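The same style of model extends to instruction-cache misses: in fully cold code every group of 16 instructions pays one 300-cycle line fill on top of the mispredict stalls. A rough sketch; this simple accounting lands within a point or two of the slide's 96% and 94%, so the slide's exact model may differ slightly:

    /* Rough cold-code model: each group of insts_per_line instructions
       pays one full cache-miss penalty, plus the mispredict stalls. */
    static double cold_waste(double branch_freq, double hit_rate,
                             double penalty_cycles, double miss_cycles,
                             double insts_per_line)
    {
        double stall = branch_freq * (1.0 - hit_rate) * penalty_cycles
                     + miss_cycles / insts_per_line;
        return stall / (1.0 + stall);
    }

    /* cold_waste(1.0/8, 0.5, 15, 300, 16) is about 0.95: nearly every
       cycle is stall, and even a short pipeline barely helps, because
       DRAM latency dominates. */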

What to do?
- Use bigger cache lines? Internal fragmentation means no gain.
- Fetch more lines per miss? Cache thrashing means no gain.
Nothing technical works very well.

What to do?
- Choose short benchmarks! No problem when the benchmark is only a thousand instructions.
- Blame the software! Code bloat is a software vendor problem, not a CPU problem.
- Blame the memory vendor! Memory speed is a memory vendor problem, not a CPU problem.
This approach works. (For some value of “works”.)

Fundamental problems
- Don't know how much to load from DRAM. (The Mill knows how much will execute.)
- Can't spot branches until code is loaded and decoded. (The Mill knows where branches are, in unseen code.)
- Can't predict spotted branches without history. (The Mill can predict in never-executed code.)
The rest of the talk shows how the Mill does this.

Extended Basic Blocks (EBBs)
The Mill groups code into Extended Basic Blocks: single-entry, multiple-exit sequences of instructions. Branches can only target EBB entry points; it is not possible to jump into the middle of an EBB. Execution flows through a chain of EBBs.
[Diagram: the program counter moving through a chain of EBBs linked by branches.]
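To make the shape of an EBB concrete, here is a small C function and a hypothetical decomposition; the label and instruction names below are illustrative only, not actual Mill assembly:

    int clamp(int x, int lo, int hi)
    {
        if (x < lo) return lo;  /* first possible exit  */
        if (x > hi) return hi;  /* second possible exit */
        return x;               /* third possible exit  */
    }

    /* As one EBB (illustrative notation):
     *
     * clamp:              <- the only legal branch target
     *     cmp   x, lo
     *     retlt lo         <- conditional return: exit 1
     *     cmp   x, hi
     *     retgt hi         <- conditional return: exit 2
     *     ret   x          <- unconditional return: exit 3
     *
     * Control can enter only at clamp:; no branch anywhere in the
     * program may land between these instructions.
     */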

Predicting EBBs
With an EBB organization, you don't have to predict each branch. Only one of the possibly many branches will pass control out of the EBB, so predict which one: if control enters here, predict that control will exit there. The Mill predicts exits, not branches.

Representing exits
Code is sequential in memory, and is held in cache lines which are also sequential.

Representing exits
There is one EBB entry point and one predicted exit point, represented as the difference between the two: the prediction.

Representing exits
Rather than a byte or instruction count, the Mill predicts:
- the number of cache lines (e.g. line count 2)
- the number of instructions in the last line (e.g. inst count 3)

Representing exits
Predictions also contain:
- the offset of the transfer target from the entry point (e.g. target offset 0xabcd)
- the kind of transfer: jump, return, inner call, outer call
“When we enter the EBB: fetch two lines, decode from the entry through the third instruction in the second line, and then jump to (entry+0xabcd).”
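The four fields named on this slide map naturally onto a small record. A sketch in C; the field widths are guesses, since the real encoding is specific to each Mill family member:

    #include <stdint.h>

    enum exit_kind { XK_JUMP, XK_RETURN, XK_INNER_CALL, XK_OUTER_CALL };

    /* One exit prediction, keyed by its EBB entry address. */
    struct prediction {
        uint8_t        line_count;    /* cache lines to fetch            */
        uint8_t        inst_count;    /* instructions in the last line   */
        int32_t        target_offset; /* transfer target, entry-relative */
        enum exit_kind kind;          /* jump/return/inner/outer call    */
    };

The slide's example is then {2, 3, 0xabcd, XK_JUMP}: fetch two lines, decode through the third instruction of the second line, then jump to entry + 0xabcd.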

The Exit Table
[Diagram: the prediction record from the previous slide (line count 2, inst count 3, target 0xabcd, kind jump).]

The Exit Table
Predictions are stored in the hardware Exit Table. The Exit Table:
- is direct-mapped, with victim buffers
- is keyed by the EBB entry address and history info
- has check bits to detect collisions
- can use any history-based algorithm
Capacity varies by Mill family member.
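A toy model of a probe into such a table, continuing the struct prediction sketch above; the hash, the form of the history, and the width of the check bits are all invented for illustration:

    #define EXIT_TABLE_SLOTS 4096   /* capacity varies by family member */

    struct exit_entry {
        uint32_t          check;    /* check bits to detect collisions */
        struct prediction pred;
    };

    static struct exit_entry exit_table[EXIT_TABLE_SLOTS];

    /* Direct-mapped probe keyed by entry address plus history info.
       The check bits distinguish a real hit from a colliding alias. */
    static struct prediction *probe(uint64_t entry_addr, uint64_t history)
    {
        uint64_t key = entry_addr ^ (history * 0x9e3779b97f4a7c15ull);
        struct exit_entry *e = &exit_table[key % EXIT_TABLE_SLOTS];
        if (e->check != (uint32_t)(key / EXIT_TABLE_SLOTS))
            return NULL;            /* miss (or detected collision) */
        return &e->pred;
    }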

Exit chains
Starting with an entry point, the Mill can chain through successive predictions without actually looking at the code. Probe the Exit Table using the entry address as the key, returning the keyed prediction.

Exit chains
Then add the prediction's offset to the EBB entry address to get the next EBB entry address.

Exit chains
Rinse and repeat, until:
- there is no prediction in the table, or
- the entry has been seen before (a loop), or
- you have gone as far as you wanted to go.
A sketch of the whole walk follows.
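Putting the three slides together, the chain walk is one short loop. A sketch built on the probe() above; already_queued() is a hypothetical seen-before check, queue_prefetch() is sketched after the next slide, and the updating of the history value is glossed over entirely:

    /* Walk the predicted path without ever looking at the code. */
    static void chain(uint64_t entry_addr, uint64_t history, int max_hops)
    {
        for (int hop = 0; hop < max_hops; hop++) { /* as far as wanted */
            struct prediction *p = probe(entry_addr, history);
            if (p == NULL)
                break;                   /* no prediction in the table */
            if (already_queued(entry_addr))      /* hypothetical check */
                break;                   /* entry seen before: a loop  */
            if (!queue_prefetch(entry_addr, p->line_count))
                break;                   /* prefetch refused: stop     */
            entry_addr += p->target_offset;  /* next EBB entry address */
        }
    }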

Prefetch
Predictions chained from the Exit Table are handed (entry address and line count) to the Prefetcher, which prefetches the lines from cache or DRAM. Prefetches cannot fault or trap; a prefetch that would fault instead stops the chaining. Prefetches are low priority and use idle cycles to memory. The queue_prefetch() helper left abstract above might look like the sketch below.
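A possible shape for that helper; address_translates() and issue_low_priority_fetch() are hypothetical stand-ins for the translation check and the memory request:

    #include <stdbool.h>

    #define LINE_BYTES 64

    /* Prefetches cannot fault or trap: an address that would fault
       simply ends the chain instead of raising anything. */
    static bool queue_prefetch(uint64_t entry_addr, unsigned line_count)
    {
        for (unsigned i = 0; i < line_count; i++) {
            uint64_t line = entry_addr + (uint64_t)i * LINE_BYTES;
            if (!address_translates(line))       /* hypothetical */
                return false;                /* stop chaining, no trap */
            issue_low_priority_fetch(line);      /* hypothetical; uses
                                                    idle memory cycles */
        }
        return true;
    }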

The Prediction Cache
After prefetch, chained predictions are stored in the Prediction Cache. The Prediction Cache is small, fast, and fully associative. Chaining from the Exit Table stops if a prediction is found to be already in the Cache, typically a loop. Chaining continues in the cache, possibly looping; a miss resumes from the Exit Table.

The Fetcher
Predictions are chained from the Prediction Cache (following loops) to the Fetcher.

The Fetcher
Using each prediction's entry address and line count, the Fetcher fetches lines from the regular cache hierarchy into a microcache attached to the decoder.

The Decoder
Prediction chains end at the Decoder, which also receives a stream of the corresponding cache lines from the Microcache. The result is that the Decoder has a queue of predictions, and another queue of the matching cache lines, that are kept continuously full and available. It can decode down the predicted path at the full 30+ instructions per cycle speed.

Timing
[Diagram: the Exit Table, Prefetcher, Prediction Cache, Fetcher, Microcache, and Decoder laid out in time; vertically aligned units work in parallel; labeled 3-cycle and 2-cycle legs make up the mispredict penalty.]
Once started, the predictor can sustain one prediction every three cycles from the Exit Table.

Fundamental problems redux
- Don't know how much to load from DRAM. (The Mill knows how much will execute.)
- Can't spot branches until code is loaded and decoded. (The Mill knows where branches are, in unseen code.)
- Can't predict spotted branches without history. (The Mill can predict in never-executed code.)

Prediction feedback
All predictors use feedback from execution experience to alter predictions, tracking changing program behavior. If a prediction was wrong, then it can be changed to predict what actually did happen. The Exit Table contents reflect current history for all contained predictions.
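In the toy Exit Table above, feedback is a store back into the probed slot once the EBB's real exit is known. The replace-on-wrong policy here is the simplest possible one; a real predictor would typically update more gradually:

    /* After the EBB actually exits, make the table reflect what
       really happened. */
    static void feedback(uint64_t entry_addr, uint64_t history,
                         const struct prediction *actual)
    {
        uint64_t key = entry_addr ^ (history * 0x9e3779b97f4a7c15ull);
        struct exit_entry *e = &exit_table[key % EXIT_TABLE_SLOTS];
        e->check = (uint32_t)(key / EXIT_TABLE_SLOTS);
        e->pred  = *actual;
    }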

“All contained predictions”?
Not one prediction for each EBB in the program? No! Tables are much too small to hold predictions for all EBBs. In a conventional branch predictor, each prediction is built up over time with increasing experience with the particular branch. But if the CPU is switched to another process, that prediction is thrown away and overwritten. Every process switch is followed by a period of poor predictions while experience is built up again.

A second source of predictions
Like other predictors, the Mill builds predictions from experience. However, it has a second source: the program load module, which carries a predictions section alongside the code and static data. The load module is used when there is no experience: missing predictions are read from the load module into the Exit Table.

But there's a catch…
Loading a prediction from DRAM (or even from the L2 cache) takes much longer than a mispredict penalty! By the time it's loaded we no longer need it. Solution: load bunches of likely-needed predictions. But which predictions are likely-needed?

Likely-needed predictions
Should we load on a misprediction? No: we have a prediction, it's just wrong. Should we load on a missing prediction? No: it may only be a rarely-taken path that aged out of the table. We should bulk-load only when entering a whole new region of program activity that we haven't been to before (recently), and may stay in for a while, or re-enter. Like a function.

Likely-needed predictions
The Mill bulk-loads the predictions of a function when a call finds no prediction for the entry EBB.
    int main() {
        phase1();
        phase2();
        phase3();
        return 0;
    }
Each call triggers loading of the predictions for the code of that function into the Exit Table.

Program phase-change
At a phase change (or on re-entering code that was swapped out long enough):
1. Recognize when a chain or misprediction leads to a call for which there is no Exit Table entry.
2. Bulk-load the predictions for the function.
3. Start the prediction chain in the called function.
4. Chaining will prefetch the predicted code path.
5. Execute as fast as the code comes in.
Overall delay: one load time for the first predictions, plus one load time for the initial code prefetch; two loads total, with everything after that in parallel. Versus conventional: one code load time per branch. A sketch of the trigger follows.
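The trigger in step 1 drops naturally into the toy model's call path. Here bulk_load_module_predictions() is a hypothetical stand-in for reading the function's entries from the load module's prediction section, and MAX_HOPS for a family-specific chaining depth:

    /* On a call: no Exit Table entry for the callee's entry EBB means
       a phase change, so bulk-load before chaining ahead. */
    static void on_call(uint64_t callee_entry, uint64_t history)
    {
        if (probe(callee_entry, history) == NULL)             /* step 1 */
            bulk_load_module_predictions(callee_entry);       /* step 2:
                                                      hypothetical; the
                                                      first load time  */
        chain(callee_entry, history, MAX_HOPS);            /* steps 3-4:
                                                      the second load  */
        /* Step 5: decode and execution overlap everything after the
           two serial load times. */
    }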

Where does the load module get its predictions?
The compiler can perfectly predict EBBs that contain no conditional branches: calls, returns, and jumps. A profiler can measure conditional behavior, but instrumenting the load module changes the behavior. So the Mill does it for you: the Exit Table hardware logs experience with predictions, and post-processing of the log updates the load module. The log info is also available for JITs and optimizers. Mill programs get faster every time they run.

The fine print
Newly-compiled predictions assume every EBB will execute to the final transfer. This policy causes all cache lines of the EBB to be prefetched, improving performance at the expense of loading unused lines; later experience corrects the line counts. When experience shows that an EBB in a function is almost never entered (often error code), it is omitted from the bulk-load list, saving Exit Table space and memory traffic.

Fundamental problem summary
- Don't know how much to load from DRAM. (The Mill knows how much will execute.)
- Can't spot branches until code is loaded and decoded. (The Mill knows where the exits are.)
- Can't predict spotted branches without history. (The Mill can predict in never-executed code.)
Mill programs get faster every time they run.

Shameless plug
For technical info about the Mill CPU architecture: ootbcomp.com/docs
To sign up for future announcements, white papers etc.: ootbcomp.com/mailing-list, ootbcomp.com/investor-list