Honnappa Nagarahalli Arm


RCU Integration with Libraries (Resource Reclamation Framework in DPDK)
Honnappa Nagarahalli, Arm

It would be more appropriate to call this 'Resource Reclamation Framework in DPDK'.

Agenda
Recap
Resource reclamation: general process, with lib-urcu [1], with rte_rcu_qsbr
Resource Reclamation Framework for DPDK
Performance

Recap – the intention is to get everyone on the same page on the problem definition.
Resource reclamation process:
General process – a look at the generic process for reclaiming resources, examining the different parts of the application (main thread, writers and readers) and the roles they play in the process.
With lib-urcu – since people are familiar with lib-urcu, it makes sense to talk briefly about what it provides (and then cover how the proposal here differs from it). lib-urcu offers two mechanisms: call_rcu and defer_rcu (marked as experimental).
With rte_rcu_qsbr – a look at how the process works with rte_rcu_qsbr.
Proposal for a DPDK application – the same application parts apply in DPDK too, with the major piece being the library integration.
Performance

[1] https://liburcu.org/

Recap

[Timeline figure: the writer deletes entry1 and then entry2 from data structure D1; reader threads T1, T2 and T3 pass through quiescent states (QS) between read-side critical sections; the Grace Period (GP) spans from the delete until every reader has reported a QS, after which the entries can be freed.]

Lock-free algorithms provide scalability and determinism. However, they pose the problem of reclaiming memory: memory cannot be freed immediately after deleting an entry from the data structure, as reader threads might still be using the deleted entry. So the process of deleting an entry from a data structure consists of 3 steps:
1. Remove the reference to the entry from the data structure.
2. Ensure that no readers are referencing that entry anymore.
3. Free the entry.
The RCU library helps the writer know when it is safe to free memory: the resources for entries 1 and 2 can be freed after every reader has gone through at least one quiescent state. RCU helps the writer determine the end of the Grace Period.

Resource Reclamation – General Process

The general process can be divided into the following parts:
Initialization
Quiescent state reporting
Resource reclamation – this is the focus of this discussion
Shutdown

The application/main thread handles the 'Initialization' parts of the process, such as creating the RCU variable and registering the reader threads that will report their quiescent state status.
The reader threads in the application report their quiescent state status. This gives the application control over the length of the critical section and the grace period.
The writer threads/the application wait for the end of the grace period and free the actual resources. This is the focus of this discussion.
During shutdown, the application has to make sure the reader threads are unregistered and all the resources are freed.

Once we have discussed the reclamation of resources, we will return to each of these parts in the context of DPDK.

Resource Reclamation – Trivial Process

[Sequence diagram: the writer thread deletes an entry from the lock-free data structure, starts the grace period, and repeatedly polls the RCU state while reader threads 1..N report their quiescent states; the writer frees the entry only after every reader has reported.]

    rte_data_structure_delete() {
        data_structure_delete_entry();
        rte_rcu_qsbr_check(wait == true);
        data_structure_free_entry();
    }

This is the trivial process.

Advantages:
The writer does not need to be aware that the data structure is lock-free; no code changes are required in the writer to switch to a lock-free version of the algorithm. This is the ideal situation for the writer.

Disadvantages:
The writer wastes cycles polling the RCU state between 'Delete' and 'Free', which reduces the performance of the writer thread.
Traffic on the interconnect increases, as writers and readers access the RCU state concurrently due to the polling from the writer.

Resource Reclamation – With lib-urcu: call_rcu

[Sequence diagram: writer threads delete entries and enqueue the deleted resources on a global defer queue via call_rcu; a separate reclamation thread dequeues the resources, starts a grace period with synchronize_rcu, polls the RCU state while the readers report their quiescent states, and then frees the resources.]

    rte_data_structure_delete() {
        data_structure_delete_entry();
        call_rcu(deleted resource);
    }

call_rcu is one of the methods lib-urcu provides to save cycles in the writer thread, as well as the cycles spent on resource reclamation. There are several ways to use call_rcu; the method described here is claimed to suffice for most use cases. It introduces a global 'defer queue': the writers enqueue the deleted resources on this queue for later reclamation. The actual reclamation is offloaded to a 'reclamation thread', which waits for one grace period and then reclaims all the resources that were deleted before the start of that grace period.

Advantages:
The writer does not poll continuously; instead it does other useful work, so the writer thread's performance does not drop.
Since a single GP is enough to reclaim a bunch of resources, the cost is amortized.

Disadvantages:
Another thread is added to the application.
The global defer queue is a shared resource, contended between the writers and the reclamation thread.
The time to reclaim a resource increases, as the grace period does not start immediately after the delete. Any attempt to decrease the time to reclaim increases the traffic on the interconnect.
Traffic on the interconnect is still affected, but the magnitude is far less, as the wait for the grace period happens once per bunch of deleted resources.
Polling still exists – it has just moved into the reclamation thread.
If the writer runs out of resources, it cannot reclaim any itself; it has to wait for the reclamation thread to kick in.

Resource Reclamation – With lib-urcu: defer_rcu

[Sequence diagram: as with call_rcu, but each writer thread enqueues deleted resources on its own per-thread (TLS) defer queue via defer_rcu; the reclamation thread walks the list of queues, starts a grace period with synchronize_rcu, and frees the resources once the readers have reported their quiescent states.]

    rte_data_structure_delete() {
        data_structure_delete_entry();
        defer_rcu(deleted resource);
    }

defer_rcu is marked as experimental; it is not clear why. It solves some of the problems of call_rcu by introducing fixed-size per-writer-thread queues (4K entries?). The reclamation thread wakes up every 100 ms, goes through the list of queues and reclaims the resources.

Advantages over call_rcu:
The defer queue is per writer thread, so there is less contention on it.

Disadvantages:
Nothing else improves over call_rcu. If the writer thread runs out of resources, it still cannot reclaim them itself.
If the writer finds its defer queue full, it does reclaim resources itself, but the writer thread then has to poll again for one GP.

Resource Reclamation – With rte_rcu_qsbr

No reclamation thread: reclamation runs in the context of the writer thread. There is one defer queue per data structure.

Deletion:
    rte_data_structure_delete() {
        data_structure_delete_entry();
        rte_rcu_qsbr_start();       /* Start the GP ASAP */
        if (defer_queue_full)
            reclaim_resource();     /* Mostly no waiting for GP */
        enqueue_resource();
    }

Reclamation:
    __rte_reclaim_resource() {
        peek_queue();
        if (rte_rcu_qsbr_check(wait = FALSE) == SUCCESS) { /* No continuous polling */
            dequeue_resource();
            free_resource();
        }
    }

Addition:
    rte_data_structure_add() {
        if (no_free_resources)
            reclaim_resource();     /* Reclaim the exact resources needed */
        data_structure_add_entry();
    }

Batching benefits are enabled by a new patch in the RCU library [1].

[Sequence diagram: the writer thread deletes an entry, starts the GP with rte_rcu_qsbr_start, and enqueues the resource on the data structure's defer queue; later it peeks the queue, checks the GP with rte_rcu_qsbr_check, and dequeues and frees the resource once the readers have reported their quiescent states.]

The method proposed here solves some of the issues described. It gets rid of the reclamation thread, which means less contention; the writer thread's own context is used for memory reclamation. This is possible because the synchronize_rcu API of lib-urcu is split into 2 APIs in the rte_rcu library, which allows the grace period to start immediately after deleting the resource and to complete while the writer is doing other work.

There is one defer queue per data structure, so all the resources belonging to a data structure are on one queue. This allows reclaiming exactly the resources needed. If lock-free reclamation is required, one could create a defer queue per data structure and per thread as well.

During deletion – if the defer queue is full, resources are reclaimed. There will mostly be no waiting for the grace period, since the GP was started when the resources were deleted.
During addition – if there are no free resources, resources are reclaimed. Since the queue contains only the resources belonging to this data structure, only the required resources are reclaimed. This guarantees that the writer fails only when there are genuinely no resources left in the system.

There is no contention on the defer queue and less contention on the RCU state.

[1] https://patchwork.dpdk.org/patch/58960/

Resource Reclamation Proposal for DPDK

Initialization
Responsibility – application/main thread:
Allocate the RCU variable.
Register the reader threads.
Provide the RCU variable to the data structure library.

Quiescent state reporting
Responsibility – application/reader threads:
Provides flexibility to the application.

This proposal takes the general process described earlier and maps the work onto the different parts of a DPDK application. It is the responsibility of the main thread/application to allocate the RCU variable, register the reader threads and configure the RCU variable for use with the library. The readers report quiescent states on the RCU variable. (Talk about the defer queue size.)

Resource Reclamation Proposal for DPDK

Resource reclamation
Responsibility – data structure library. This removes a significant burden from the application: no code changes are needed in the application's writer thread.
Provide an API to register the RCU variable to use.
Create a defer queue to store the deleted resource and token.
Augment the data structure's delete-entry API:
Start the grace period after deleting the resource by calling rte_rcu_qsbr_start.
If the defer queue is full – reclaim resources.
Otherwise, enqueue the deleted resource and token on the defer queue.

Resource Reclamation Proposal for DPDK

Resource reclamation (continued)
Augment the data structure's add-entry API:
If there are no free resources – reclaim resources.
Add the entry to the data structure.
Reclaim resources:
Peek at the token at the head of the defer queue.
Use the non-blocking rte_rcu_qsbr_check API to query the quiescent state.
On success, dequeue the resource/token from the defer queue and free the resource.

Resource Reclamation Proposal for DPDK

Shutdown
Responsibility – application and data structure library.
Application:
Ensure the reader threads are no longer using the data structure.
Unregister the reader threads.
Data structure library:
Reclaim all the resources on the defer queue.

Performance

Test setup
LPM library integrated with the DPDK RCU library.
1 writer thread; 42M route adds/deletes with prefix length > 24.
11 reader threads, reporting their quiescent state status every 1024 lookups.

Numbers
Without RCU integration: 2484.4 cycles.
With RCU integration: 2517.25 cycles (+1.3%).

Since the reclamation is done in the context of the writer thread, it is important to look at the writer's performance.

Next Steps

Provide new APIs in rte_rcu for common functionality:
Create/delete the defer queue (rte_rcu_qsbr_dq_create, rte_rcu_qsbr_dq_delete).
Push resources to the defer queue (rte_rcu_qsbr_dq_enqueue).
Reclaim resources (rte_rcu_qsbr_dq_reclaim).

Thanks for the comments on the patches. I will add these new APIs to the rte_rcu library (yet to be done). They will abstract the common functionality required for resource reclamation and make it easy for external libraries to integrate with the RCU library. They will also make the changes to rte_lpm and rte_hash much simpler.

Thanks to Ruifeng Wang – Integrating RCU with LPM Dharmik Thakkar – Integrating RCU with Hash

Thank you Questions?