University of Texas at Austin

Slides:



Advertisements
Similar presentations
Copyright 2008 Sun Microsystems, Inc Better Expressiveness for HTM using Split Hardware Transactions Yossi Lev Brown University & Sun Microsystems Laboratories.
Advertisements

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
WHAT IS AN OPERATING SYSTEM? An interface between users and hardware - an environment "architecture ” Allows convenient usage; hides the tedious stuff.
Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.
Fast Paths in Concurrent Programs Wen Xu, Princeton University Sanjeev Kumar, Intel Labs. Kai Li, Princeton University.
1 MetaTM/TxLinux: Transactional Memory For An Operating System Hany E. Ramadan, Christopher J. Rossbach, Donald E. Porter and Owen S. Hofmann Presenter:
[ 1 ] Agenda Overview of transactional memory (now) Two talks on challenges of transactional memory Rebuttals/panel discussion.
I/O Hardware n Incredible variety of I/O devices n Common concepts: – Port – connection point to the computer – Bus (daisy chain or shared direct access)
TxLinux: Using and Managing Hardware Transactional Memory in an Operating System Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan,
Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Aditya Bhandari, and Emmett Witchel - Presentation By Sathish P.
Unbounded Transactional Memory Paper by Ananian et al. of MIT CSAIL Presented by Daniel.
Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory Written by: Paul E. McKenney Jonathan Walpole Maged.
CSE 451: Operating Systems Autumn 2013 Module 6 Review of Processes, Kernel Threads, User-Level Threads Ed Lazowska 570 Allen.
Operating System Review September 10, 2012Introduction to Computer Security ©2004 Matt Bishop Slide #1-1.
The Linux Kernel: A Challenging Workload for Transactional Memory Hany E. Ramadan Christopher J. Rossbach Emmett Witchel Operating Systems & Architecture.
Maximum Benefit from a Minimal HTM Owen Hofmann, Chris Rossbach, and Emmett Witchel The University of Texas at Austin.
Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems CSCI-6140 – Computer Operating Systems David Goldschmidt, Ph.D.
Introduction to Operating Systems and Concurrency.
CS510 Concurrent Systems Why the Grass May Not Be Greener on the Other Side: A Comparison of Locking and Transactional Memory.
Multithreaded Programing. Outline Overview of threads Threads Multithreaded Models  Many-to-One  One-to-One  Many-to-Many Thread Libraries  Pthread.
Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood Presented by: Eduardo Cuervo.
Solving Difficult HTM Problems Without Difficult Hardware Owen Hofmann, Donald Porter, Hany Ramadan, Christopher Rossbach, and Emmett Witchel University.
Architectural Features of Transactional Memory Designs for an Operating System Chris Rossbach, Hany Ramadan, Don Porter Advanced Computer Architecture.
Transactional Flash V. Prabhakaran, T. L. Rodeheffer, L. Zhou (MSR, Silicon Valley), OSDI 2008 Shimin Chen Big Data Reading Group.
A BORTING C ONFLICTING T RANSACTIONS IN AN STM PP O PP’09 2/17/2009 Hany Ramadan, Indrajit Roy, Emmett Witchel University of Texas at Austin Maurice Herlihy.
Real-Time Operating Systems RTOS For Embedded systems.
Introduction to Operating Systems Concepts
Improving Multi-Core Performance Using Mixed-Cell Cache Architecture
Maurice Herlihy and J. Eliot B. Moss,  ISCA '93
Module 12: I/O Systems I/O hardware Application I/O Interface
Chapter 13: I/O Systems Modified by Dr. Neerja Mhaskar for CS 3SH3.
Processes and threads.
Threads vs. Events SEDA – An Event Model 5204 – Operating Systems.
Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun
Transactional Memory : Hardware Proposals Overview
Outline Other synchronization primitives
Mechanism: Limited Direct Execution
Other Important Synchronization Primitives
Effective Data-Race Detection for the Kernel
Journaling File Systems
Morgan Kaufmann Publishers
Operating Systems and Systems Programming
Two Ideas of This Paper Using Permissions-only Cache to deduce the rate at which less-efficient overflow handling mechanisms are invoked. When the overflow.
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
Department of Computer Science University of California, Santa Barbara
Chapter 2: System Structures
Xen Network I/O Performance Analysis and Opportunities for Improvement
Lecture 6: Transactions
Operating System Concepts
13: I/O Systems I/O hardwared Application I/O Interface
CS703 - Advanced Operating Systems
CSE 451: Operating Systems Spring 2012 Module 6 Review of Processes, Kernel Threads, User-Level Threads Ed Lazowska 570 Allen.
Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E
Today’s agenda Hardware architecture and runtime system
Thread Implementation Issues
/ Computer Architecture and Design
Hybrid Transactional Memory
CSE 451: Operating Systems Autumn 2001 Lecture 2 Architectural Support for Operating Systems Brian Bershad 310 Sieg Hall 1.
Speculative execution and storage
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
CS510 Operating System Foundations
CSE 471 Autumn 1998 Virtual memory
CSE 153 Design of Operating Systems Winter 2019
Lecture 23: Transactional Memory
Department of Computer Science University of California, Santa Barbara
Contact Information Office: 225 Neville Hall Office Hours: Monday and Wednesday 12:00-1:00 and by appointment. Phone:
Concurrent Cache-Oblivious B-trees Using Transactional Memory
CS703 – Advanced Operating Systems
Controlled Interleaving for Transactions
Module 12: I/O Systems I/O hardwared Application I/O Interface
Presentation transcript:

University of Texas at Austin MetaTM & TxLinux Hany Ramadan, Christopher Rossbach, Donald Porter, Owen Hofmann, Aditya Bhandari, Emmett Witchel University of Texas at Austin Good afternoon. My name is Hany Ramadan, Today I’ll be presenting MetaTM & TxLinux, which are systems built at UT Austin by Dr. Witchel’s group. At a high level, this talk is about Transactional Memory & Operating Systems.

TM Background Transactional programming is an emerging alternative to locks Avoids problems such as deadlock Avoids performance-complexity tradeoffs HTM holds the promise of simpler programming and good performance I’ll begin with Transactional Memory. It is no secret that Tx Prog is a promising new technique for managing concurrency in parallel systems, with great potential compared to existing techniques such as coarse and fine-grained locking. It is very exciting to us that Programmers can be free’d from having to choose between bad performance, or programming complexity. Now I also mentioned Operating Systems, so you might wonder, what on earth could such dinosaurs have to do with TM?

TM: “What’s the OS got to do with it?” Lack of realistic workloads (counter, splash-2) Will current results hold on real programs? Unclear design tradeoffs; Feature set unsettled OS is a real-life, parallel workload OS will benefit from transactions Reduces synchronization complexity System-call and interrupt control paths will benefit Architectural support is needed for OS Well, I’m glad you asked. We believe there is a mutually beneficial relationship bw TM & OS On one hand, the lack of realistic workloads has been a chronic weakness of TM research, with implications for the validity of results and the appropriateness of designs. We argue that the OS is a realistic workload. It’s a parallel program, which we all use. So it will benefit TM research to take a sustained look at the OS as a workload. In return, OS [complex sw] stand to benefit from what TM offers. To develop reltn. Additional Architectural Support is needed [rel simpl] It’s worth taking a look at this relationship in a sustained fashion. Spiderman: With great realistic workloads, comes great number of Tx.

Average Transaction Count This is just to give you a rough idea of what we’re talking about It’s hard to find transaction counts, and especially rates in existing work, but this is from what we did find, because they only provide normalized results. Obviously it’s not about how long you run your experiments, but more seriously, 1) are there any functional issues or requirements that will not come out from smaller programs, and 2) Results and intuitions from small progs. hold in realistic programs?

Outline TxLinux MetaTM Issue: Stack memory Experimental results Goals Features Interrupt handling Issue: Stack memory Experimental results Stack memory = To give you a flavor of the kinds of issues we had to deal with, that arise from the interaction of TxLinux & MetaTM.

TxLinux 2.6.16.1 Sequence locks RCU (read-copy- update) Spin-locks Converted ~30% of dynamic synchronization to transactions Sequence locks RCU (read-copy- update) Spin-locks TxLinux is our transactional variant of the Linux kernel. It’s based on a recent 2.6 version of Linux. Kernel has ~ a dozen different sync. primitives used throughout code base. Our approach in developing TxLinux was to focus on some of the heavily used primitives, such as spinlocks, and introduce tx variants of them. We then modified components in several subsystems (such as FS, networking), to use the transactional APIs. Overall, we converted about 30% of dynamic synchronization to tx’s. We didn’t reach 100% for variety of reasons, for instance we didn’t allow transactions to do I/O. Stress that in TxLinux Tx co-exist with locks. Slab allocator Directory cache IP routing Pathname translation Socket locking Zone allocator Various MM structures File system Networking Memory management

MetaTM: Design goals HTM model co-designed with TxLinux Extensions to x86 ISA Architectural support for OS Execution-driven simulation A platform for TM research Multiple HTM design points Eager & lazy version management Eager conflict detection Platform: Configurable policies, costs

MetaTM: Model features xbegin xend Tx demarcation xpush xpop Multiple Tx polite karma eruption Contention management (eager) timestamp polka sizematters Here are some of the main features. Xpush & Xpop provide support for suspending transactions, I’ll give more details on them shortly. exponential linear random Backoff policy Version management commit cost (lazy) abort cost (eager)

TxLinux: Interrupt handling Question: What happens to active tx on an interrupt? Interrupt handlers allowed to use transactions Factors weighing against abort Transaction length growing Interrupt frequency Answer: Active transactions are suspended on interrupt (~900 cycles) cycles (~25 kcycles) interrupts

MetaTM: Multiple Tx support Multiple active transactions on a processor At most one running, all others are suspended Interface xpush suspends current transaction xpop resumes suspended transaction Suspended transactions maintained in LIFO order New execution context is unrelated to old one Same conflict semantics with all other transactions May start new transactions New exec context: Cannot see isolated state from the xpush’d transaction. This distinguishes it from various nesting paradigms.

Outline TxLinux MetaTM Issue: Stack memory Experimental results Goals Features Interrupt handling Issue: Stack memory Experimental results

Issue: Stack memory Transactions can span stack frames Why: Retain same flexibility as locks Problem: Live stack overwrite (correctness) Solution: Stack Pointer Checkpoint foo() { atomic } foo() { bar() baz() } bar() { xbegin } baz() { xend } We support this in MetaTM, because: Linux does this, and We didn’t want to rewrite the OS There’s another stack problem addressed in the paper. Stack included in transaction working set Why: Stack shared (e.g. interrupts, etc.) Problem: Dead stack restart (performance) Solution: Stack-based early release

Live stack overwrite 0xC0 locals locals 0x80 foo+8 intr state 0x40 Error: invalid return address locals foo locals foo StkPtr foo+4: call bar foo+8: <work> foo+12:xend bar+0: xbegin bar+4: ret do_irq: iret 0x80 foo+8 bar intr state do_IRQ 0x40 0x00 Tx Reg. Checkpoint PC: bar+4 StkPtr: 0x40 (other regs..) Conflict Only interrupts that arrive in kernel mode have this problem

Live stack overwrite, fixed 0xC0 locals foo locals foo StkPtr foo+4: call bar foo+8: <work> foo+12:xend bar+0: xbegin bar+4: ret do_irq: iret 0x80 foo+8 bar 0x40 intr state do_irq This can happen in user mode when using signals; Detailed, but this is kind of stuff that you see and must fix to boot & run an OS. 0x00 Tx Reg. Checkpoint PC: bar+4 StkPtr: 0x40 (other regs..) Conflict Fixed by setting ESP to Checkpointed ESP on interrupt

Outline TxLinux MetaTM Issue: Stack memory Experimental results Goals Features Interrupt handling Issue: Stack memory Experimental results

Experiments Setup Workloads System characteristics Studies Execution time Transaction rates Transaction origins Studies Contention management Commit & Abort penalties

Setup Simics 3.0.17 8-processor, x86 system (1 Ghz) Memory hierarchy L1: sep D/I, 16KB, 4-way, 1-cycle hit L2: 4MB, 8-way, 16-cycle hit, MESI protocol Main memory: 1GB, 200-cycle hit Other devices Disk device (DMA, 5.5ms latency) Tigon3 gigabit nic (DMA,0.1ms latency)

Workloads to exercise TxLinux counter shared counter micro- benchmark (8 threads) pmake Runs make -j 8 to compile files from libFLAC 1.1.2 netcat streams data over TCP network conn. MAB simulates software development file system workloads configure 8 instances of configure for tetex find 8 instances of find on a 78MB directory searching for text Note: Only TxLinux creates transactions

Kernel Execution Time We measure Kernel Execution time (since that is what we modified) Benchmarks run from 1 to 10 seconds of Kernel time, does not include Boot. Counter shows a good 2x speedup with Tx. This is similar to speedups others have reported with micro-benchmarks. On the up-side, performance did not degrade. Linux doesn’t spend a lot of time in sychronization at 8 processors. We also show proportion of CPU time in Kernel. While it differs from app to app, at least on these applications, ~ 1/2 time in kernel One more reason for allowing OS to use transaction; capabilities of processor. [Idle time in Pmake & netcat] counter %Kern. time 91% 13% 54% 57% 43% 50% High kernel time justifies transactions in the OS

Transaction Rates Restart Rate 2.6% 3.1% 1.7% 2.1% 10.2% Log scale Restart Rate 2.6% 3.1% 1.7% 2.1% 10.2% Find workload has highest contention in TxLinux

Transaction Origins This IS the Xpush / Xpop slide. TODO HANY must mention this. We put no special effort into putting these. Interrupt handling path, which other proposals in this slide. Kernel locks accessed from both system call and interrupt handling contexts

Contention Management Study counter Normalized to TIMESTAMP Counter example, how it doesn’t show variances. [Restart rate -> Perf ] Polka best, SizeMatters viable. Polka best performer, but complex to implement; SizeMatters viable Stall-on-conflict – reduces conflicts, but not always performance

Commit & Abort Study Commit Cost Abort Cost Normalized Kernel Time Abort Cost Normalized kernel time (to 0 cost). Y-Axis is different. And here we see in the Abort cost, the counterintuitive result that high abort cost can function as a backoff, yielding situation where larger abort cost results in an improvement cost. Normalized Kernel Time Performance sensitive to commit penalty, not abort Confirms benefit of eager version management (fast commits)

Related Work TM Models Suspension techniques Interrupt handling TCC [Hammond04], UTM [Anaian05], LogTM [Moore06], VTM [Rajwar05] Suspension techniques Escape actions [Zilles06] – can’t start tx Interrupt handling XTM [Chung06] – also tries to avoid aborts Contention management Scherer & Scott [PODC’05] – in STM context TM Models haven’t allowed OS transactions. UTM took traces from Linux 2.4 Escape actions = limited, can’t start tx XTM = multi-pronged approach, try to avoid aborts

Conclusions TM needs realistic workloads OS needs TM TxLinux the largest TM benchmark OS needs TM Complex synchronization; large % of runtime Building & running TxLinux reveals much Architectural support needed (Tx suspension) Contention management is important Cost studies confirm fast commits … more in the paper