Systems & networking MSR Cambridge Tim Harris 2 July 2009

Multi-path wireless mesh routing

Epidemic-style information distribution

Development processes and failure prediction

Better bug reporting with better privacy

Multi-core programming, combining foundations and practice

Data-centre storage

Barrelfish: a sensible OS for multi-core hardware
What place for SSDs in enterprise storage?
WIT: lightweight defence against malicious inputs

Software is vulnerable
Unsafe languages are prone to memory errors
– many programs are written in C/C++
Many attacks exploit memory errors
– buffer overflows, dangling pointers, double frees
Still a problem despite years of research
– half of all vulnerabilities reported by CERT

Problems with previous solutions
Static analysis is great but insufficient
– finds defects before software ships
– but does not find all defects
Runtime solutions that are used
– have low overhead but low coverage
Many runtime solutions are not used
– high overhead
– require changes to programs and runtime systems

WIT: write integrity testing
Static analysis extracts the intended behaviour
– computes the set of objects each instruction can write
– computes the set of functions each instruction can call
This behaviour is then checked dynamically
– write integrity prevents writes to objects outside the analysis set
– control-flow integrity prevents calls to functions outside the analysis set

WIT advantages
Works with C/C++ programs with no changes
No changes to the language runtime required
High coverage
– prevents a large class of attacks
– flags only true memory errors
Low overhead
– 7% time overhead on the SPEC CPU benchmarks
– 13% space overhead on the SPEC CPU benchmarks

Example vulnerable program

char cgiCommand[1024];
char cgiDir[1024];

void ProcessCGIRequest(char* msg, int sz) {
    int i = 0;
    while (i < sz) {            // sz is not checked against 1024:
        cgiCommand[i] = msg[i]; // buffer overflow
        i++;
    }
    ExecuteRequest(cgiDir, cgiCommand);
}

The buffer overflow in this function allows the attacker to change cgiDir: a non-control-data attack.

Write safety analysis
A write is safe if it cannot violate write integrity
– writes at constant offsets from the stack pointer
– writes at constant offsets from the data segment
– statically determined in-bounds indirect writes
An object is safe if all writes to it are safe
Unsafe objects and writes are instrumented with runtime checks

char array[1024];
for (i = 0; i < 10; i++)
    array[i] = 0;  // safe write: statically in bounds

Colouring with static analysis
WIT assigns colours to objects and writes
– each object has a single colour
– all writes to an object have the same colour
– write integrity ensures that the colours of a write and its target match
It also assigns colours to functions and indirect calls
– each function has a single colour
– all indirect calls to a function have the same colour
– control-flow integrity ensures that the colours of an indirect call and its target match

Colouring
Colouring uses the points-to and write safety results
– start with the points-to sets of unsafe pointers
– merge sets into one equivalence class if they intersect
– assign a distinct colour to each class
[Figure: points-to sets of p1, p2, and p3; intersecting sets merge into one class]
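The merging step is essentially union-find over object ids. A minimal, self-contained sketch (the object ids and points-to sets are hypothetical; the real analysis runs over the compiler's intermediate representation):

/* Union-find over objects: intersecting points-to sets collapse
 * into one equivalence class, and each class gets one colour. */
#include <stdio.h>

#define MAX_OBJS 64

static int parent[MAX_OBJS];

static int find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  /* path halving */
        x = parent[x];
    }
    return x;
}

static void merge(int a, int b) { parent[find(a)] = find(b); }

int main(void) {
    for (int i = 0; i < MAX_OBJS; i++) parent[i] = i;

    /* Hypothetical points-to sets: p1 = {1,2}, p2 = {2,3}, p3 = {4,5}.
     * p1 and p2 intersect at object 2, so their sets end up in one
     * class; p3's set stays separate. */
    merge(1, 2);   /* objects pointed to by p1 share a colour */
    merge(2, 3);   /* p2 intersects p1, so the classes collapse */
    merge(4, 5);   /* p3's set forms its own class */

    /* Give each class representative a distinct colour, offset past
     * the guard colours 0 and 1 that WIT reserves. */
    for (int obj = 1; obj <= 5; obj++)
        printf("object %d -> colour %d\n", obj, 2 + find(obj));
    return 0;
}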

Colour table
The colour table is an array, for efficient access
– a 1-byte colour for each 8-byte memory slot
– alignment ensures one colour per slot
– 1/8th of the address space is reserved for the table
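In C, the lookup performed by the inserted checks might look like the sketch below. This is an assumption-laden illustration: WIT reserves 1/8th of the address space for the table, whereas here a small array stands in for it (hence the modulo), just so the sketch runs.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SLOTS (1 << 16)            /* stand-in for 1/8th of memory */
static uint8_t colour_table[TABLE_SLOTS];

static void check_write(void *target, uint8_t write_colour) {
    /* one byte of colour per 8-byte slot: index = address / 8 */
    uint8_t slot_colour = colour_table[((uintptr_t)target >> 3) % TABLE_SLOTS];
    if (slot_colour != write_colour)
        abort();                         /* colour mismatch: block the write */
}

int main(void) {
    static uint64_t obj;                                     /* one 8-byte slot */
    colour_table[((uintptr_t)&obj >> 3) % TABLE_SLOTS] = 3;  /* colour it {3} */
    check_write(&obj, 3);                /* colours match: write allowed */
    puts("write allowed");
    check_write(&obj, 4);                /* mismatch: aborts, like int 3 */
    return 0;
}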

Inserting guards
WIT inserts guards around unsafe objects
– 8-byte guards
– guards have a distinct colour: 1 in the heap, 0 elsewhere

Write checks
Safe writes are not instrumented
Instrumentation is inserted before each unsafe write:

lea edx, [ecx]          ; address of write target
shr edx, 3              ; colour table index -> edx
cmp byte ptr [edx], 8   ; compare table colour with the write's colour (8)
je out                  ; colours equal: allow the write
int 3                   ; colours differ: raise an exception
out: mov byte ptr [ecx], bl  ; the unsafe write itself

With colours assigned (cgiCommand gets {3}, cgiDir gets {4}), the example becomes:

char cgiCommand[1024];  // colour {3}
char cgiDir[1024];      // colour {4}

void ProcessCGIRequest(char* msg, int sz) {
    int i = 0;
    while (i < sz) {
        cgiCommand[i] = msg[i];  // checked write, colour 3
        i++;
    }
    ExecuteRequest(cgiDir, cgiCommand);
}

The check inserted before the write is:

lea edx, [ecx]
shr edx, 3
cmp byte ptr [edx], 3
je out
int 3
out: mov byte ptr [ecx], bl

The attack is detected when the overflow reaches the guard (guard colour ≠ object colour), and it would be detected even without guards, because the objects have different colours (3 ≠ 4).

Evaluation
Implemented as a set of compiler plug-ins
– using the Phoenix compiler framework
Evaluated:
– runtime overhead on the SPEC CPU and Olden benchmarks
– memory overhead
– ability to prevent attacks

Runtime overhead on SPEC CPU (figure)

Memory overhead on SPEC CPU (figure)

Ability to prevent attacks
WIT prevents all attacks in our benchmarks
– 18 synthetic attacks from a benchmark suite; guards alone are sufficient for 17 of them
– real attacks against SQL Server, nullhttpd, stunnel, ghttpd, and libpng

Barrelfish: a sensible OS for multi-core hardware
What place for SSDs in enterprise storage?
WIT: lightweight defence against malicious inputs

Solid-state drive (SSD)
NAND flash memory behind a Flash Translation Layer (FTL), exposed through a block storage interface
– persistent
– random access
– low power

Enterprise storage is different
Laptop storage is judged on:
– form factor
– single-request latency
– ruggedness
– battery life
Enterprise storage is judged on:
– fault tolerance
– throughput
– capacity
– energy ($)

Replacing disks with SSDs
Provisioning flash to match the disks' performance is cheap: flash ($) vs. disks ($$).

Replacing disks with SSDs
Provisioning flash to match the disks' capacity is very expensive: flash ($$$$$) vs. disks ($$).

Challenge
Given a workload:
– which device type, how many, and one tier or two?
We traced many real enterprise workloads, benchmarked enterprise SSDs and disks, and built an automated provisioning tool
– it takes a workload and device models
– and computes the best configuration for that workload

High-level design (figure)

Devices (2008)

Device                  Price   Size     Sequential throughput   Random-access throughput
Seagate Cheetah 10K     –       – GB     85 MB/s                 288 IOPS
Seagate Cheetah 15K     –       – GB     88 MB/s                 384 IOPS
Memoright MR25.2        $739    32 GB    121 MB/s                6,450 IOPS
Intel X25-E (2009)      $415    32 GB    250 MB/s                35,000 IOPS
Seagate Momentus 7200   $53     160 GB   64 MB/s                 102 IOPS

Device metrics

Metric                     Unit   Source
Price                      $      Retail
Capacity                   GB     Vendor
Random-access read rate    IOPS   Measured
Random-access write rate   IOPS   Measured
Sequential read rate       MB/s   Measured
Sequential write rate      MB/s   Measured
Power                      W      Vendor

Enterprise workload traces
Block-level I/O traces from production servers
– Exchange server (5,000 users): 24-hour trace
– MSN back-end file store: 6-hour trace
– 13 servers from a small data centre (MSRC): file servers, web server, web cache, etc.; 1-week traces
Traces were captured below the buffer cache and above the RAID controller
In total: 15 servers, 49 volumes, 313 disks, 14 TB
– volumes are RAID-1, RAID-10, or RAID-5

Workload metrics

Metric                                       Unit
Capacity                                     GB
Peak random-access read rate                 IOPS
Peak random-access write rate                IOPS
Peak random-access I/O rate (reads+writes)   IOPS
Peak sequential read rate                    MB/s
Peak sequential write rate                   MB/s
Fault tolerance                              Redundancy level

Model assumptions
First-order models
– fine for provisioning, which is coarse-grained
– not for detailed performance modelling
Open-loop traces
– the I/O rate was not limited by the traced storage hardware
– the traced servers are well provisioned with disks, so the bottleneck is elsewhere: the assumption holds

Single-tier solver
For each workload and device type:
– compute the number of devices needed in a RAID array (throughput and capacity scale linearly with the number of devices)
– every workload requirement must be met: the "most costly" workload metric determines the device count
– add the devices needed for fault tolerance
– compute the total cost
A sketch of this computation follows.
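A minimal sketch of the solver under the linear-scaling assumption above. All device and workload numbers here are illustrative stand-ins, not the measured values from the device table:

/* single_tier_cost: devices needed = the max over all metrics,
 * plus extras for fault tolerance; cost = count * unit price. */
#include <math.h>
#include <stdio.h>

struct device   { const char *name; double price, gb, riops, wiops, rmbps, wmbps; };
struct workload { double gb, riops, wiops, rmbps, wmbps; int spare_devices; };

static double single_tier_cost(const struct device *d, const struct workload *w) {
    double n = 1.0;
    n = fmax(n, ceil(w->gb    / d->gb));     /* capacity */
    n = fmax(n, ceil(w->riops / d->riops));  /* random-access reads */
    n = fmax(n, ceil(w->wiops / d->wiops));  /* random-access writes */
    n = fmax(n, ceil(w->rmbps / d->rmbps));  /* sequential reads */
    n = fmax(n, ceil(w->wmbps / d->wmbps));  /* sequential writes */
    n += w->spare_devices;                   /* fault tolerance */
    return n * d->price;
}

int main(void) {
    struct device   disk = { "disk", 300, 300, 288, 288, 85, 85 };
    struct device   ssd  = { "ssd",  739, 32, 6450, 1000, 121, 80 };
    struct workload w    = { 2000, 900, 300, 120, 60, 1 };
    printf("%s: $%.0f\n", disk.name, single_tier_cost(&disk, &w));
    printf("%s: $%.0f\n", ssd.name,  single_tier_cost(&ssd,  &w));
    return 0;
}

With these made-up numbers the capacity term dominates for the SSD, which is exactly the pattern the results below show.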

Two-tier model (figure)

Solving for the two-tier model
Feed the I/O trace to a cache simulator
– it emits a top-tier trace and a bottom-tier trace, which feed the single-tier solver
Iterate over cache sizes and policies
– write-back or write-through for logging
– LRU or LTR (long-term random) for caching
Inclusive cache model
– an exclusive (partitioning) model is also possible
– but it adds complexity for negligible capacity savings

Single-tier results
The Cheetah 10K is the best device for all workloads!
– SSDs cost too much per GB
Capacity or read IOPS determines the cost
– not read MB/s, write MB/s, or write IOPS
– for SSDs, it is always capacity
– for disks, it is either capacity or read IOPS
Read IOPS vs. GB is the key trade-off

Workload IOPS vs. GB (figure)

SSD break-even point
When will SSDs beat disks?
– when IOPS dominates the cost
The break-even price point (SSD $/GB) is where
– cost of GB (SSD) = cost of IOPS (disk)
Our tool also computes this point, as sketched below
– for a new SSD, compare its $/GB to the break-even point
– then decide whether to buy it
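The rule above reduces to one line of arithmetic. A small sketch (the workload and disk numbers in main are illustrative):

/* An SSD wins once supplying the workload's capacity at the SSD's
 * $/GB costs no more than supplying its IOPS at the disk's $/IOPS. */
#include <stdio.h>

static double breakeven_dollars_per_gb(double workload_gb,
                                       double workload_iops,
                                       double disk_dollars_per_iops) {
    /* cost of GB (SSD) = cost of IOPS (disk)
       => break-even $/GB = IOPS * disk $/IOPS / GB */
    return workload_iops * disk_dollars_per_iops / workload_gb;
}

int main(void) {
    /* e.g. a 500 GB volume peaking at 2,000 IOPS, served by a disk
       priced at $300 for 288 IOPS (illustrative numbers) */
    double p = breakeven_dollars_per_gb(500, 2000, 300.0 / 288.0);
    printf("SSD breaks even below $%.2f per GB\n", p);
    return 0;
}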

Break-even point CDF (figure)

SSD as an intermediate tier?
Read caching benefits few workloads
– servers already cache in DRAM
– an SSD tier does not reduce the disk-tier provisioning
A persistent write-ahead log is useful
– a small log can improve write latency
– but it does not reduce disk-tier provisioning either, because writes are not the limiting factor

Power and wear
SSDs use less power than Cheetahs
– but the overall $ savings are small
– and cannot justify the higher cost of SSDs
Flash wear is not an issue
– SSDs have a finite number of write cycles
– but they will last well beyond 5 years: the workloads' long-term write rates are not that high, and you will upgrade before you wear the device out

Conclusion
Capacity limits flash SSDs in the enterprise
– not performance, and not wear
Flash might never get cheap enough
– if all silicon capacity moved to flash today, it would match only 12% of HDD production
– there are more profitable uses of silicon capacity
Higher density/scale is needed (PCM?)

Barrelfish: a sensible OS for multi-core hardware
What place for SSDs in enterprise storage?
WIT: lightweight defence against malicious inputs

Don't these look like networks to you?
– Intel Larrabee
– 32-core Tilera TilePro64 CPU
– AMD 8x4 HyperTransport system

Communication latency (figure)

Node heterogeneity
Within a system:
– programmable NICs
– GPUs
– FPGAs (in CPU sockets)
Architectural differences on a single die:
– streaming instructions (SIMD, SSE, etc.)
– virtualisation support, power management
– a mix of “large/sequential” and “small/concurrent” core sizes
Existing OS architectures have trouble accommodating all this

Dynamic changes
Hot-plug of devices, memory, (cores?)
Power management
Partial failure

Extreme position: clean-slate design
– fully explore the ramifications
– no regard for compatibility
What are the implications of building an OS as a distributed system?

The multikernel architecture (figure)

Why message passing?
We can reason about it
It decouples system structure from the inter-core communication mechanism
– communication patterns are explicitly expressed
– naturally supports heterogeneous cores
– naturally supports non-coherent interconnects (PCIe)
It is a better match for future hardware
– cheap explicit message passing (e.g. TilePro64)
– non-cache-coherence (e.g. Intel's 80-core Polaris)

Message passing vs. shared memory
An access to remote shared data in effect forms a blocking RPC
– the processor stalls while the cache line is fetched or invalidated
– limited by the latency of interconnect round trips
– performance scales with the size of the data (number of cache lines)
By sending an explicit RPC (message), we:
– send a compact, high-level description of the operation
– reduce the time spent blocked waiting for the interconnect
There is potential for more efficient use of interconnect bandwidth
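To make the contrast concrete, here is a toy sketch of a cache-line message channel in the style of URPC (used in the experiments below): the sender writes one line, the receiver polls it, so a message costs about one interconnect round trip regardless of how large an operation it describes. The names and layout are assumptions, not Barrelfish's actual implementation:

/* One cache-line channel: sender fills the payload and bumps seq;
 * the receiver spins until seq changes, pulling the line across
 * the interconnect exactly once. */
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64

struct __attribute__((aligned(CACHE_LINE))) urpc_slot {
    _Atomic uint32_t seq;      /* bumped when a new payload is ready */
    uint32_t payload[(CACHE_LINE - sizeof(_Atomic uint32_t)) / 4];
};

static void urpc_send(struct urpc_slot *s, uint32_t op) {
    s->payload[0] = op;        /* compact description of the operation */
    atomic_fetch_add_explicit(&s->seq, 1, memory_order_release);
}

static uint32_t urpc_recv(struct urpc_slot *s, uint32_t last_seq) {
    while (atomic_load_explicit(&s->seq, memory_order_acquire) == last_seq)
        ;                      /* the line arrives via cache coherence */
    return s->payload[0];
}

int main(void) {
    static struct urpc_slot slot;   /* normally mapped by both cores */
    urpc_send(&slot, 42);
    return urpc_recv(&slot, 0) == 42 ? 0 : 1;
}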

Sharing as an optimisation
Re-introduce shared memory as an optimisation
– hidden and local
– only when it is faster, as decided at runtime
– the basic model remains split-phase messaging
Sharing/locking might still be faster between some cores
– hyperthreads, or cores with a shared L2/L3 cache

Message passing vs. shared memory: the trade-off (figure)
– 2 x 4-core Intel machine (shared bus)
– Shared: clients modify a shared array (no locking!)
– Message: URPC to a single server

Replication
Given no sharing, what do we do with the state?
Some state naturally partitions; other state must be replicated
Replication was used as an optimisation in previous systems:
– Tornado and K42 clustered objects
– Linux read-only data and kernel text
We argue that replication should be the default

Consistency
How do we maintain the consistency of replicated data?
It depends on the consistency and ordering requirements, e.g.:
– TLBs (unmap): single-phase commit
– memory reallocation (capabilities): two-phase commit
– cores coming and going (power management, hot-plug): agreement

A concrete example: Unmap (TLB shootdown)
“Send a message to every core with a mapping; wait for all to be acknowledged”
Linux/Windows:
1. the kernel sends IPIs
2. and spins on a shared acknowledgement count/event
Barrelfish:
1. a user request goes to the local monitor domain
2. which runs a single-phase commit across the remote cores
This is a possible worst case for a multikernel. How should the communication be implemented? A sketch of the commit step follows.
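A minimal, runnable sketch of the single-phase commit, with the per-core channels simulated by a plain array (in Barrelfish they would be URPC channels like the one sketched earlier; send_msg and poll_ack are hypothetical names):

/* Single-phase commit for Unmap: message every core that holds the
 * mapping, then wait for all acknowledgements; no shared counter. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_CORES 4

static bool acked[MAX_CORES];   /* simulated reply channels */

static void send_msg(int core, uintptr_t vaddr) {
    printf("core %d: invalidate TLB entry for %#lx\n", core, (unsigned long)vaddr);
    acked[core] = true;         /* the simulated remote core acks at once */
}

static bool poll_ack(int core) { return acked[core]; }

static void unmap_commit(uintptr_t vaddr, int self, const bool has_mapping[]) {
    for (int c = 0; c < MAX_CORES; c++)        /* phase 1: send */
        if (c != self && has_mapping[c])
            send_msg(c, vaddr);
    for (int c = 0; c < MAX_CORES; c++)        /* phase 2: collect acks */
        if (c != self && has_mapping[c])
            while (!poll_ack(c))
                ;                              /* poll the channel, not shared memory */
}

int main(void) {
    bool mapped[MAX_CORES] = { true, true, false, true };
    unmap_commit(0x7f0000, 0, mapped);
    puts("unmap committed");
    return 0;
}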

Three different Unmap message protocols (figure)
– unicast, multicast, and broadcast
– messages are carried by cache-line writes and reads
– cores in the same package share an L3 cache; reaching other packages takes more HyperTransport hops

Choosing a message protocol on the 8x4 AMD system (figure)

Total Unmap latency for various OSes (figure)

Heterogeneity
Message-based communication handles core heterogeneity
– implementations and data structures can be specialised at runtime
It doesn't deal with other aspects:
– what should run where?
– how should complex resources be allocated?
Our prototype uses constraint logic programming to perform online reasoning
– a system knowledge base stores a rich, detailed representation of hardware performance

Current status
Ongoing collaboration with ETH Zurich
– several keen PhD students working on a variety of aspects
A prototype multikernel OS has been implemented: Barrelfish
– runs on emulated and real hardware
– smallish set of drivers
– can run a web server, SQLite, slideshows, etc.
A position paper was presented at HotOS; the full paper is to appear at SOSP; a public code release is likely soon

Barrelfish: a sensible OS for multi-core hardware
What place for SSDs in enterprise storage?
WIT: lightweight defence against malicious inputs