Everything you always wanted to know about threading... But were afraid to ask. Elliot H. Omiya (EHO), Principal SDE, Windows Developer Experience
Agenda A short history of threading in Windows Threading in the Windows Runtime Threading and UI programming Async
Short History of Threading in Windows Anyone remember _beginthread(ex)? Going back to VC++ 1.0 and the early to mid-90’s (16-bit real mode Windows). Eventually we sorted out CreateThread versus _beginthread issues and interactions with the CRT. Inevitably this led to an explosion of threads in Windows programs. Everyone started hand-rolling their own “thread pools” until we shipped ThreadPool in Windows NT. Dedicated threads (i.e. CreateThread) are still popular but harder for applications to manage.
ThreadPools Simple: queue, “n” threads, and an algorithm to dequeue work and run it. Not simple: the actual implementation Job 1: queue (actually, this part is relatively simple) If you have priorities, one queue per priority level is a fine implementation Job 2: threads First decision: how many threads? One per core?
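To make the "queue, n threads, and an algorithm to dequeue work" description concrete, here is a minimal sketch in portable ISO C++. The names SimplePool and runCountingDemo are hypothetical and exist only for illustration; this is not how the Windows thread pool is implemented, and it deliberately uses the naive one-thread-per-core sizing that the following slides question.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical name: SimplePool is an illustration, not the Windows thread pool.
class SimplePool {
public:
    explicit SimplePool(std::size_t threadCount) {
        for (std::size_t i = 0; i < threadCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    // Lets the workers drain any remaining work, then joins them.
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : workers_) t.join();
    }

    // Job 1: the queue. Any thread may enqueue a work item.
    void submit(std::function<void()> work) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(work));
        }
        wake_.notify_one();
    }

private:
    // Job 2: the threads. Each worker dequeues one item at a time and runs it.
    void workerLoop() {
        for (;;) {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;  // only reached when stopping
                work = std::move(queue_.front());
                queue_.pop();
            }
            work();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> queue_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;  // declared last: threads start after the rest exists
};

// Runs `items` increments on a pool sized to the core count and returns the total.
int runCountingDemo(int items) {
    std::atomic<int> counter{0};
    {
        unsigned cores = std::thread::hardware_concurrency();
        SimplePool pool(cores == 0 ? 2 : cores);  // fall back to 2 if the count is unknown
        for (int i = 0; i < items; ++i)
            pool.submit([&counter] { ++counter; });
    }  // destructor drains the queue and joins the workers
    return counter.load();
}
```

Calling runCountingDemo(100) should return 100 once every queued item has run. A priority scheme could be layered on by keeping one such queue per priority level, as the slide suggests.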
Worker threads for thread pool One per core is fine unless you have: IO; threads that can block on an event/semaphore. A “busy” thread (waiting on IO or an event) can do no more work. Result: you need a pool of threads greater than the # of cores because every worker thread can potentially block ... Not to mention page faults. Apps (user-mode) can’t possibly avoid being blocked.
Work types Turns out you need “work types” (in the internal implementation) TP_WORK TP_WAIT TP_IO TP_TIMER etc. Basically these are “execution triggers” – how they start executing. You could also explicitly set up work items that you know do IO or will wait on an event/semaphore. So the API just speaks of work items, but there is additional complexity in the actual implementation.
Thread explosion There are several classic programming problems that can lead to thread explosion. Simple example: web server servicing web requests: “n” worker threads servicing incoming requests One worker thread to read actual html file and associated resources from disk (into cache) Simple programming error can lead to disastrous results.
[Diagram: work item queue feeding worker threads; event (web page read)] Every worker thread is blocked on an event
[Diagram: work item queue feeding worker threads; event (web page read)] But the worker thread that signals the event is in the queue! (Bad code!)
[Diagram: work item queue feeding worker threads; event (web page read)] So you have to create another thread in the pool...
[Diagram: work item queue feeding worker threads; event (web page read)] But you could have a large number of requests for the same web page...
[Diagram: work item queue feeding worker threads; event (web page read)] Result: thread explosion
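The web server defect can be reproduced in miniature. The sketch below assumes a fixed-size pool that never injects extra threads, and it bounds each wait with a timeout so that the demo terminates instead of hanging. The function name countStarvedRequests and the 200 ms timeout are hypothetical choices for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Returns how many request work items timed out waiting for the page-read
// event when `workers` pool threads service `requests` waiters followed by
// one signaling work item.
int countStarvedRequests(int workers, int requests) {
    std::mutex queueMutex;
    std::deque<std::function<void()>> queue;

    std::mutex eventMutex;
    std::condition_variable pageRead;
    bool pageIsCached = false;
    int starved = 0;  // guarded by eventMutex

    // Request items: block until the page is cached (bounded so the demo ends).
    for (int i = 0; i < requests; ++i) {
        queue.push_back([&] {
            std::unique_lock<std::mutex> lock(eventMutex);
            bool ok = pageRead.wait_for(lock, std::chrono::milliseconds(200),
                                        [&] { return pageIsCached; });
            if (!ok) ++starved;
        });
    }
    // The bug: the item that signals the event sits behind all the waiters.
    queue.push_back([&] {
        {
            std::lock_guard<std::mutex> lock(eventMutex);
            pageIsCached = true;
        }
        pageRead.notify_all();
    });

    // A fixed pool: each worker dequeues in FIFO order until the queue is empty.
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            for (;;) {
                std::function<void()> item;
                {
                    std::lock_guard<std::mutex> lock(queueMutex);
                    if (queue.empty()) return;
                    item = std::move(queue.front());
                    queue.pop_front();
                }
                item();
            }
        });
    }
    for (auto& t : pool) t.join();
    return starved;
}
```

With as many requests as workers, every worker is occupied by a waiter before the signaling item can be dequeued, so at least one request must time out, and in a typical run all of them do. Giving the pool one more worker than there are requests usually lets the signaler run in time, which is exactly the pressure that pushes a pool toward creating more threads.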
Thread explosion Turns out there are many variations of this problem. Another example: freezing threads for a GC operation We call this programming defect “popular internal dependency” Solution: create a short (or long) delay – (yes, a heuristic) Algorithm assumes the resource dependency will resolve (i.e. does not cover the programming error we previously described) Alleviates the thread explosion during this dependency resolution
Fairness Recap: “n” priority work queues and a system that does IO and has events/semaphores. The OS has a highly efficient signaling mechanism: IO completion ports – the kernel knows when IO completes. They can also be used for events. So the kernel knows immediately when a worker thread that was blocked is ready to run (ReadyThread). It can release work to the user mode side of the threadpool when it is idle.
Fairness Windows 7: TP_WORK always trumped TP_IO and TP_WAIT So lots of TP_WORK could starve work items that unblocked (i.e. unfair) A fairer algorithm blends TP_WORK, TP_IO, and TP_WAIT. Servicing “IRP’s” is a kernel mode concept so the “worker thread factory” must be in kernel mode.
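The difference between "TP_WORK always wins" and a blended policy can be shown with a toy dequeue routine. The round-robin rotation below is an assumption chosen for illustration; the slides do not say which blending algorithm Windows uses, and BlendedQueues, popStrict, and popFair are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

enum class WorkType : std::size_t { Work = 0, Io = 1, Wait = 2 };

// Hypothetical sketch: one queue per work type and two dequeue policies.
class BlendedQueues {
public:
    void push(WorkType type, std::string name) {
        queues_[static_cast<std::size_t>(type)].push_back(std::move(name));
    }

    // Strict policy: plain work always wins, so unblocked IO/wait items can starve.
    bool popStrict(std::string& out) {
        for (auto& q : queues_)
            if (!q.empty()) return take(q, out);
        return false;
    }

    // Blended policy (assumed round-robin): rotate the starting queue each time.
    bool popFair(std::string& out) {
        for (std::size_t i = 0; i < queues_.size(); ++i) {
            auto& q = queues_[(next_ + i) % queues_.size()];
            if (!q.empty()) {
                next_ = (next_ + i + 1) % queues_.size();
                return take(q, out);
            }
        }
        return false;
    }

private:
    static bool take(std::deque<std::string>& q, std::string& out) {
        out = std::move(q.front());
        q.pop_front();
        return true;
    }

    std::array<std::deque<std::string>, 3> queues_;
    std::size_t next_ = 0;
};

// Drains three plain work items and one unblocked IO item with the chosen policy.
std::vector<std::string> drainOrder(bool fair) {
    BlendedQueues q;
    q.push(WorkType::Work, "work1");
    q.push(WorkType::Work, "work2");
    q.push(WorkType::Work, "work3");
    q.push(WorkType::Io, "io1");
    std::vector<std::string> order;
    std::string item;
    while (fair ? q.popFair(item) : q.popStrict(item)) order.push_back(item);
    return order;
}
```

Under the strict policy the IO item runs last (work1, work2, work3, io1); under the rotating policy it runs second (work1, io1, work2, work3).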
[Diagram: user-mode work item queues above the user/kernel boundary; kernel-mode worker thread factory tracking Work, IO, and Wait items]
[Diagram: same layout as above] IRP’s complete
[Diagram: same layout as above] Transported and running
Fairness Fairness could only be implemented with kernel mode managing work items, IO, timers, and event waiters. Bonus: because IRP’s complete rapidly, we can batch them up and do a single transport for multiple work items (fewer ring transitions). It is very difficult to balance throughput, resource management, and fairness in user mode only. This is a C++ conference, why do you care?
Your unique value add is the diverse types of work you do in ISO C++ On Windows, think of how your library functions can benefit from a highly optimized threadpool. On desktop, you can access CreateThread and the latest TP API’s On WinRT, you only have the threadpool. This is where you throw tomatoes at me.
I need dedicated threads I can control... We hear this a lot. Not every workload is amenable to (hopefully short) chunks of work items. “I have long running work”. Examples: Populate off-screen parts of game board (viewport scrolling optimization) Lazy layout (optimize start-up performance) (insert your favorite scenario here, you all have one or three)
Time-sliced The Windows Threadpool does have a dedicated thread model: time-sliced. It is a trade-off of latency versus throughput (and resource consumption). Using work items is efficient: many work items are processed by relatively few threads (batched mode). Using “time-sliced” has less latency but burns an expensive resource: a thread (quantum mode). Plug: This is a summary of Pedro Teixeira’s excellent talk on the threadpool on MSDN. Many more details there.
WinRT threadpool Windows::System::Threading has support for: Run a single work item - ThreadPool::RunAsync(WorkItemHandler^) Run a single work item based on a timer - ThreadPoolTimer::CreateTimer(TimerElapsedHandler^, delay) Run a periodic work item based on a timer - ThreadPoolTimer::CreatePeriodicTimer(TimerElapsedHandler^, period)

    auto WorkItem = ref new WorkItemHandler(
        [&](IAsyncAction^ workItem)
        {
            ..... // background work
        });
    IAsyncAction^ ThreadPoolWorkItem = Windows::System::Threading::ThreadPool::RunAsync(WorkItem);

Eventually, background work is reflected in UI, and the result(s) need to run on the correct UI thread: use the CoreWindow Dispatcher.
Agile Objects (are your friend)
In WinRT objects are objects And threads are threads. Forgive me while I digress into COM for a bit (you will forgive me). How many people in this audience know what a COM apartment is? (If you do, how do you like dealing with apartments?) Better question: how many people want to know what a COM apartment is?
Traditional view of multi-threading Objects can be accessed from any thread The hard part is dealing with multi-thread contention, resource management, locking, deadlock prevention, etc. The hard part is hard enough. You already deal with “context” on a thread no matter what platform you run on: at a minimum, you deal with the context that is “UI”. But, you can set up your code so that most objects can run on any thread. And you want multi-threading to be no harder than this.
Agile Objects WinRT objects are agile by default. This bit of “magic” (1) allows WinRT objects to just be “multi-threaded objects” (as you have traditionally known them). They can run anywhere (hence the term “agile”) and you have the job of dealing with multi-threaded resource contention. Period. This is the technique that allows you to ignore that there is a thing called an apartment. There are also UI-affine objects in WinRT but you are used to dealing with these on whatever platform(s) you code for. (1) Actually not magic: it uses the FTM (free-threaded marshaler), a lousy name, but it saves you from knowing anything about apartments. Deep dive on this from Martyn Lovell’s talk at //build 2013. And our Channel9 video.
UI and ASTA
UI Threads All mainstream systems bind a single thread to UI operations. Implicitly or explicitly a UI thread has a “context” into which it is bound. In the 90’s we created the notion of “apartment” which to this date almost no one understands. But just think of it as a context. Very popular “other OS” says: It is strongly recommended not to update UI controls etc from a background thread (e.g. a timer, comms etc). This can be the cause of crashes which are sometimes very hard to identify. In other words, don’t perform UI operations out of context.
UI Contexts In this world of many threads, we’re still stuck with a single thread and UI. It gets worse: turns out rather than multi-threaded races and deadlocks, UI threads can suffer from “re-entrancy”. Re-entrancy: I’m in the middle of doing thing “A”, why am I all of a sudden doing thing “B”? Re-entrancy is the cause of a large number of crashes and deadlocks. UI has two models: re-entrancy == surprises, non-re-entrancy == potential deadlocks.
How to safely update UI 101 Option 1: Background threads call directly into UI threads. Only safe if you are REALLY careful. (See slide on re-entrancy). Option 2: Background threads post notifications that UI threads process “when they are ready”. (Dispatcher model). Can UI threads call directly into objects running on background threads? Only safe if you are somewhat careful. Oh, and don’t take too long to update UI. What is the definition of “too long”? In the last few years it was all about frame rate. But now, it means “be immediately responsive to input”.
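Option 2, the dispatcher model, can be sketched in portable C++ as a queue that any thread may post to but only the owning thread drains. The Dispatcher class and updateLabelsFromBackground function are hypothetical; they illustrate the idea and are not the CoreWindow Dispatcher API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical Dispatcher: a queue that only its owning ("UI") thread drains.
class Dispatcher {
public:
    // Safe to call from any thread.
    void post(std::function<void()> callback) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push(std::move(callback));
    }

    // Called by the UI thread when it is ready; runs callbacks outside the lock.
    std::size_t processPending() {
        std::queue<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(batch, pending_);
        }
        std::size_t ran = 0;
        for (; !batch.empty(); batch.pop(), ++ran) batch.front()();
        return ran;
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> pending_;
};

// The calling thread plays the UI thread; `labels` is touched only by it.
std::vector<std::string> updateLabelsFromBackground(int backgroundThreads) {
    Dispatcher dispatcher;
    std::vector<std::string> labels;
    const std::thread::id uiThread = std::this_thread::get_id();

    std::vector<std::thread> workers;
    for (int i = 0; i < backgroundThreads; ++i) {
        workers.emplace_back([&dispatcher, &labels, uiThread, i] {
            std::string result = "result " + std::to_string(i);  // background work
            dispatcher.post([&labels, uiThread, result] {
                assert(std::this_thread::get_id() == uiThread);
                labels.push_back(result);  // no lock needed: UI thread only
            });
        });
    }
    for (auto& w : workers) w.join();
    dispatcher.processPending();  // the UI thread applies the updates when ready
    return labels;
}
```

The demo joins the background threads before draining, which keeps it short; a real UI loop would call processPending repeatedly between input and window events. Because callbacks run only when the UI thread chooses to drain the queue, a background thread can never interrupt it in the middle of other work.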
Pretend you have the chance to define a new platform... What choices do you make? One of the first questions: Does your UI model allow re-entrancy? Next: do you allow arbitrary threading? How much synchronous UI update do you allow in your platform? (i.e. how much risk of “spinning donuts” do you want?)
We ended up calling that new platform: WinRT (“Windows Runtime”) Choice #1: Non-re-entrancy (“ASTA”) Choice #2: ThreadPool API’s but not arbitrary threading Choice #3: No spinning donuts: async across the API surface Choice #4: No arbitrary window creation (and you don’t have to know what an HWND is) We still have single-threaded UI frameworks: HTML/CSS and XAML And there are plenty of things you can do synchronously in these frameworks, e.g. toggle button state or update textbox/listbox/etc. contents.
Tying it together Turns out that you have to tie together the following elements for the UI and programming model to work: Window creation and behavior Window event (message) processing Dispatcher processing (remember those UI notifications we need?) Call processing (incoming and outgoing) Input And you have to prioritize these with respect to each other.
The entity that ties these together is ASTA ASTA == “UI context” (forget that one of the A’s is “apartment”) One thread per window; windows are created by contract activations (e.g. click on a tile or share something from an app). The main thread of a WinRT application runs in a multi-threaded context and is not a UI thread. As a result, of course, there is a separate main UI window / thread. ASTA’s are non-re-entrant. When an outgoing call is in progress, an incoming call blocks. This requires careful planning, but there are no surprises. (Re-entrancy is nearly always a surprise).
Async (no spinning donuts) There has been a concerted effort to make UI responsive; this is key to the “fast and fluid” platform promise. Every API was (and is) reviewed. Any synchronous call that takes longer than the low 10’s of milliseconds to execute must be async. Every async API conforms to a uniform pattern. The language projections (C++, C#/VB.NET, JavaScript) all have built-in async support.
WinRT Async 101 All WinRT async operations are “hot start”, as soon as they are produced, they are running. You can “attach” to a running async operation by supplying a completion handler (put_Completed). When the operation completes, it fires a completion callback. The type of each callback is the result type of the async operation. Of course, when you supply the completion callback, it may fire completion immediately, i.e. the operation completed already.
WinRT Async 101 When an async operation completes, its status is reported as one of: Completed, Canceled (user request to cancel was honored), or Error. WinRT async operations are a one-way state machine: AsyncMethod => Running => Terminal State {Completed|Error|Canceled} Processing always occurs on a background thread (usually a TP work item). The pattern is completely documented in code: see WRL’s async.h
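The hot-start, one-way state machine and the "handler may fire immediately" rule can be sketched as follows. AsyncOp, setCompleted, and finish are hypothetical names that mirror the shape of the pattern; this is not the WRL implementation, and the real operations do their work on background threads, which the sketch omits.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

enum class AsyncStatus { Running, Completed, Canceled, Error };

// Hypothetical AsyncOp: illustrates the pattern, not the WRL implementation.
class AsyncOp {
public:
    using Handler = std::function<void(AsyncStatus)>;

    // Analogous to put_Completed: fires at once if already in a terminal state.
    void setCompleted(Handler handler) {
        AsyncStatus snapshot;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (status_ == AsyncStatus::Running) {
                handler_ = std::move(handler);  // fire later, from finish()
                return;
            }
            snapshot = status_;
        }
        handler(snapshot);  // operation already finished
    }

    // Moves to a terminal state exactly once; later transitions are ignored.
    bool finish(AsyncStatus terminal) {
        Handler toFire;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (status_ != AsyncStatus::Running || terminal == AsyncStatus::Running)
                return false;
            status_ = terminal;
            toFire = std::move(handler_);
        }
        if (toFire) toFire(terminal);  // invoke outside the lock
        return true;
    }

    AsyncStatus status() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return status_;
    }

private:
    mutable std::mutex mutex_;
    AsyncStatus status_ = AsyncStatus::Running;  // "hot start": running on creation
    Handler handler_;
};

// Returns the number of completion callbacks fired, or -1 if a terminal
// state was wrongly overwritten.
int asyncDemo() {
    int fired = 0;

    AsyncOp late;
    late.finish(AsyncStatus::Completed);               // completes before anyone attaches
    late.setCompleted([&](AsyncStatus) { ++fired; });  // fires immediately

    AsyncOp early;
    early.setCompleted([&](AsyncStatus s) {
        if (s == AsyncStatus::Canceled) ++fired;
    });
    early.finish(AsyncStatus::Canceled);                  // fires the stored handler
    bool changed = early.finish(AsyncStatus::Completed);  // ignored: already terminal
    return changed ? -1 : fired;
}
```

asyncDemo exercises both orderings: a handler attached after completion and one attached before it. Each fires exactly once, and the second finish call is rejected because terminal states are final.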
Async and Agile Async Operation objects are a good example of agile objects Mostly designed to be called from UI threads But do all of their work on background (TP) threads As a result, async operation objects are agile (and therefore directly callable from UI threads)
Async support (C++ / PPL) Language projections are key to productivity in producing WinRT apps. “Natural and familiar” is a key design point: make the experience natural and familiar for the language that is being used. For C++, the natural means of consuming async is based on the Concurrency Runtime (PPL). The model is continuation-based (.then() + lambda/functor/function) with support for exception handling and cancellation.
Async and contexts Consuming async operations is often done for the benefit of UI. create_task takes note of the originating context, i.e. what kind of thread made the original call (task_continuation_context::use_default). create_task will return the result of an async operation (if any) to that originating context by default. C++ developers of course have control over this (e.g. you may want to “de-bounce” transitions back to the UI thread until the end of a series of continuations). We’ll talk about CoreWindow’s Dispatcher method soon.
CoreWindow Dispatcher One of the things that the ProcessEvents loop on a CoreWindow schedules is dispatcher work items. CoreWindowDispatcher::RunAsync() schedules a work item to run on the UI thread associated with the CoreWindow. Since the CoreWindow processing loop processes multiple types of items, a rich priority scheme is supported.
CoreWindow Dispatcher Priority Normal priority (default) means: “run dispatcher callbacks in FIFO order, cooperative with input (and window management events)”. (Everyone runs cooperatively). Low priority means: Dispatcher callbacks run when there is no input pending. (Input beats dispatcher, i.e. app is responsive to input). Idle priority means: Dispatcher callbacks run when there is nothing else in any queue. (everything beats dispatcher). High priority means: dispatcher callbacks run ahead of everything else. (dispatcher beats everything). Docs say: “don’t use this” (more on this in a second).
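One way to picture the four priorities is as lanes that the processing loop scans in order, with input sharing the Normal lane so that input and Normal callbacks interleave in FIFO order. This lane model is an assumption made for illustration; UiLoop and its methods are hypothetical and do not describe how CoreWindow is actually implemented.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

enum class Priority : std::size_t { High = 0, Normal = 1, Low = 2, Idle = 3 };

// Hypothetical UiLoop: lanes scanned from High to Idle on every iteration.
class UiLoop {
public:
    void postDispatch(Priority p, std::string name) {
        lanes_[static_cast<std::size_t>(p)].push_back(std::move(name));
    }

    // Input shares the Normal lane, so input and Normal callbacks run cooperatively.
    void postInput(std::string name) {
        lanes_[static_cast<std::size_t>(Priority::Normal)].push_back(std::move(name));
    }

    // Runs one loop iteration; returns false when every lane is empty.
    bool processOne(std::string& ran) {
        for (auto& lane : lanes_) {
            if (lane.empty()) continue;
            ran = std::move(lane.front());
            lane.pop_front();
            return true;
        }
        return false;
    }

private:
    std::array<std::deque<std::string>, 4> lanes_;
};

// Posts one item of each kind in "worst" order and returns the order they run in.
std::vector<std::string> priorityOrderDemo() {
    UiLoop loop;
    loop.postDispatch(Priority::Idle, "idle-cleanup");
    loop.postDispatch(Priority::Low, "low-prefetch");
    loop.postDispatch(Priority::Normal, "normal-update");
    loop.postInput("input-tap");
    loop.postDispatch(Priority::High, "high-layout");

    std::vector<std::string> order;
    std::string ran;
    while (loop.processOne(ran)) order.push_back(ran);
    return order;
}
```

In this model the High item runs first even though it was posted last, the Normal callback and the input event run in arrival order, the Low item waits until no input is pending, and the Idle item runs only when nothing else is queued.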
Rendering UI This is not your father’s message loop. Everything used to happen on the UI thread: message processing, paint, computation, animation, alpha blending, etc. As applications became more graphically intensive, everything became a slave to frame rate (60 fps == 16.667ms “window”). Get everything done by deadline. Very difficult model to get right. WinRT render threads are separate from ASTA UI threads. Composition is separate from render. Goodness: relieves pressure on the (precious) UI thread
Responsive UI The “paint beat” is tied to the frame rate. The frameworks (WinJS and XAML) take care of this. Most UI operations are “tweaks” to layout (change text in a text box, scroll, get image ready to render, etc.). These occur directly on the UI thread. Initial layout and large changes to layout (e.g. navigation) are critical and time-sensitive. You have to beat the next vsync. If you can’t beat the next vsync (common) then you can create a transition animation while the next layout is being prepared. (Independent animations do not run on the UI thread.)
Responsive UI What does this have to do with threading? CoreWindowDispatcher priority High serves a couple of different “system” purposes: In WinJS – High priority is used for layout changes. Think: layout beats everything In CoreApplication, the “app object” in every WinRT application, the suspend notification is delivered on UI threads via the Dispatcher (for apps, responding to Suspend is highly time critical) If you choose to use High priority, remember that it is an extremely sharp knife: highly effective and dangerous when used improperly.
Tying it all back together UI threads are created by the system and managed by CoreWindow processing loop. Background threads run via WinRT ThreadPool or language projection components (sitting on WinRT ThreadPool). CoreWindow Dispatcher schedules work on UI threads, with a rich priority scheme. Responsive and useful app UI means balancing: CPU utilization on background threads UI thread processing that is responsive to input and never blocks (including never running long workloads)
Which means... Responsive programs are sequences of async processing followed by a return to UI thread either by: Direct context capture (return to point of origination); or Explicit call to CoreWindow Dispatcher
Async Investments Microsoft continues to invest heavily and innovate in the async space. C# introduced await keyword, dramatically simplifying async programming. Await allows async consumption code to read in a more logical flow. It looks like synchronous code, but the block of code following an await statement executes “later”. Think you don’t need/want this? Go to Herb Sutter’s //build 2013 talk, and fast forward to about 50 minutes in. And then make sure you attend Deon’s talk later today where we tell you Everything You Ever Wanted To Know about C++ await.
In Summary... Probably not everything you always wanted to know The valuable assets you have in your C++ code come to light in UI And the threading and coding rules for UI are different than they were 5-10 years ago (more input responsiveness, more async, etc.) Optimize background processing in units of work and think carefully about the relative priorities of that work. (And don’t reinvent the threadpool!) You do really want await (don’t miss Deon’s talk).