Time-predictability of a computer system Master project in progress By Wouter van der Put.

2 How long does it take?

3 Goal – Problem, approach and final goal
- Problem
  - How can timing requirements be met on an x86 multi-core, multi-CPU computer system?
- Method
  - Investigate and characterise x86 multi-core, multi-CPU computer systems and give advice on increasing their time-predictability
- Final goal
  - Advise how to maximise time-predictability, minimise latency and maximise throughput

4 Overview – Time-predictability
- Influenced by (bottom-up approach)
  - Hardware
    - Processor (architecture)
    - Memory (hierarchy)
    - System architecture (motherboard)
  - Software
    - Operating system (scheduling)
    - Algorithms and their data (regularity)
- Approach
  - Theory: explore (CPU) architectures
  - Practice: perform measurements
  - Conclusion
- Focus on a contemporary architecture
  - Quad-core dual-CPU Intel Nehalem server (next slide)

5 Time-predictability
(Diagram: white = observed behaviour, black = reality)

6 Overview – Nehalem/Tylersburg architecture
(Block diagram: two Nehalem-EP CPUs, each with DDR3 memory, connected to each other and to a Tylersburg 36D IOH by QPI; PCIe lanes (2x16, 8x4, 4x8, 1x4) on the IOH; ESI (2x2) to the ICH10R.)

7 Overview – Processor
(Architecture block diagram repeated from slide 6, with the processors in focus.)

8 Processor Theory – Time-predictability
- Designed to improve average-case latency
  - Memory access
    - Caches: reduce average access time
  - Hazards
    - Prediction: reduce average impact
- Complexity increases
  - Time-predictability almost impossible to describe
  - Instruction Set Architecture expands (next slide)

9 Processor Theory – Historical overview

10 Processor Theory – Nehalem architecture
- In novel processors
  - Core i7 & Xeon 5500 series
- 3 cache levels
- 2 TLB levels
- 2 branch predictors
- Out-of-order execution
- Simultaneous multithreading
- Loop stream decoder
- Dynamic frequency scaling

11 Processor Theory – Nehalem pipeline (1/2)
(Pipeline diagram: instruction fetch and predecode → instruction queue → decode (with microcode ROM) → rename/alloc → scheduler → execution unit clusters 0, 1 and 5 plus load/store units → retirement unit (re-order buffer); memory side: L1D cache and DTLB, L2 cache, inclusive L3 cache shared by all cores, QPI.)

12 Processor Theory – Nehalem pipeline (2/2)

13 Processor Theory – Hazards
- Negative impact on time-predictability
  - Data hazards
    - RAW, WAR & WAW
  - Structural hazards
    - Functional unit in use
      - Stall
      - SMT
  - Control hazards
    - Exception and interrupt handling
      - Irregular
    - Branch hazards
      - Branch misprediction penalty (next slide)

14 Processor Practice – Branch prediction

    for (a = 0; a < N; a++) {          // N: loop bound lost in transcript
        if (random < BranchPoint)      // BranchPoint = 0% ... 100%
            DoSomething;
        else
            DoSomething;
    }

Lower latency by max 30%

15 Processor Conclusion
- Branch prediction
  - Make your branches predictable
    - Lower latency by max 30%
  - If input-dependent
    - Decreases time-predictability
- Other features increase throughput, but decrease time-predictability
  - Out-of-order execution
  - Simultaneous multithreading
  - Loop stream decoder
  - Dynamic frequency scaling

16 Overview – Memory hierarchy
(Architecture block diagram repeated from slide 6, with the memory hierarchy in focus.)

17 Memory hierarchy Theory – Overview (1/2)

Level | Capacity    | Associativity (ways) | Line size (bytes) | Access latency (clocks) | Access throughput (clocks) | Write update policy
L1D   | 4 x 32 KiB  | 8                    | 64                | 4                       | 1                          | Writeback
L1I   | 4 x 32 KiB  | 4                    |                   |                         |                            | N/A
L2U   | 4 x 256 KiB | 8                    | 64                | 10                      | Varies                     | Writeback
L3U   | 1 x 8 MiB   |                      |                   |                         | Varies                     | Writeback

18 Memory hierarchy Theory – Overview (2/2)

Level | Hit rate | Access time (clocks)
L1$   | 95%      | 4
L2$   | 95%      | 10
L3$   | 95%      | 40
Mem   |          | 100

Minimum: 4 clock cycles
Average: 4.383 clock cycles
Maximum: 100 clock cycles

- Goal
  - Minimise average latency
- Result
  - The program (and its input) influences the hit rate and thus the average latency
  - Input may influence time-predictability

19 Memory hierarchy Theory – Caches (1/2)
- Negative impact on time-predictability
  - Locality of reference
    - Temporal locality
    - Spatial locality
      - Sequential locality
      - Equidistant locality
      - Branch locality
  - Write policy
    - Write-through (latency: write = 1, read = 1)
    - Write-back (latency: write = 0, read = 2)

20 Memory hierarchy Theory – Caches (2/2)
- Negative impact on time-predictability
  - Cache types
    - Instruction cache
    - Data cache
    - Translation Lookaside Buffer (TLB)
  - (Non-)blocking caches
  - Replacement policy
    - Fully associative
    - N-way set associative
    - Direct mapped (1-way associative)

21 Memory hierarchy Practice – Method

Assembly (no compiler):

    .code
    start:
        mov eax, alloc( )
        mov ecx, 0
    loopy:
        mov ebx, [eax ]
        mov ebx, [eax ]
        ... (100,000x)
        mov ebx, [eax ]
        mov ebx, [eax ]
        inc ecx
        cmp ecx, 
        jnz loopy
        free eax
        exit
    end start

Pseudocode:

    Begin
    Allocate variable number of bytes
    For ecx = 0 to BIG_NUMBER (run 10 s)
        Read random data from array
        ... (100,000x)
        Read random data from array
    Next ecx
    Free memory
    End

22 Memory hierarchy Practice – Results (1/3)

23 Memory hierarchy Practice – Results (2/3)

24 Memory hierarchy Practice – Results (3/3)

25 Memory hierarchy Conclusion
- Stay in the cache (here 4 x 32 KiB L1 / 2 x 6 MiB L2), e.g. by splitting a large dataset into smaller pieces
- Possible speed gain of more than 50x!

26 Overview – System architecture
(Architecture block diagram repeated from slide 6, with the system architecture in focus.)

27 System architecture Theory – Layout and limits
- Limits
  - DDR - GB/s
  - QPI - 2 x 13 GB/s
  - PCIe Gen2 16x - 8 GB/s
  - 10GbE - 1 GB/s
  - SATA II - MB/s
  - USB - MB/s

28 System architecture Practice – Results (1/4)

29 System architecture Practice – Results (2/4)

30 System architecture Practice – Results (3/4)

31 System architecture Practice – Results (4/4)

32 System architecture Conclusion
- Divide load between NUMA nodes
  - Cores in one node compete for memory bandwidth
  - Increase throughput by the number of nodes
- Run one process on one core
  - To increase time-predictability
- Run time-critical processes on a core (and CPU) without interrupts
  - Interrupts increase latency and decrease time-predictability

33 Overview – Operating system
(Architecture block diagram repeated from slide 6.)

34 Operating System
- Theory
  - Multitasking
    - Context switch
    - Virtual addressing (RAM → L2 TLB → L1 TLB)
    - Different process priorities (highly unpredictable)
    - Kernel
  - General-purpose / real-time OS
    - Focus on predictable latency (not minimum latency)
- Practice
  - Low priority
- Conclusion
  - Run your program at high priority (on an RTOS)

35 Conclusion
- Processor
  - Make your branches predictable (30%)
- Memory hierarchy
  - Stay in the cache (50x)
- System architecture
  - Divide load between NUMA nodes (Nx)
  - Avoid interrupted cores (and CPUs)
  - Run one process on one core
- Operating system
  - Run your program at high priority (on an RTOS)