1
Honnappa Nagarahalli Arm
RCU Integration with libraries (Resource reclamation framework in DPDK)
It would be more appropriate to call this 'Resource reclamation framework in DPDK'.
2
Agenda
Recap
Resource reclamation
    General process
    With lib-urcu [1]
    With rte_rcu_qsbr
Resource Reclamation Framework for DPDK
Performance

Recap: the intention is to get everyone on the same page on the problem definition.
Resource reclamation, general process: a look at the generic process for reclaiming resources, the different parts of the application (main thread, writers and readers) and the roles they play in the process.
With lib-urcu: people seem to be familiar with lib-urcu, so it makes sense to talk briefly about what it provides and then cover how this proposal differs from it. lib-urcu provides 2 mechanisms: call_rcu and defer_rcu (marked as experimental).
With rte_rcu_qsbr: how the same process looks with rte_rcu_qsbr.
Proposal for a DPDK application: the same parts exist in a DPDK application too, but the focus is on the major part, which is the library integration.
Performance
[1]
3
RCU helps the writer determine the end of Grace Period
Recap
[Figure: timeline of reader threads T1 to T3 over data structures D1 and D2, showing critical sections, quiescent states (QS) and the grace period (GP) between 'Delete' (remove reference to entry1; delete entry1 and entry2 from D1) and 'Free'.]
Lock-free algorithms provide scalability and determinism. However, they pose the problem of reclaiming memory: memory cannot be freed immediately after an entry is deleted from the data structure, because reader threads might still be using the deleted entry. So deleting an entry from the data structure consists of 3 steps:
    1. Remove the reference to the entry from the data structure.
    2. Ensure that no reader is referencing that entry anymore.
    3. Free the entry.
The RCU library helps the writer know when it is safe to free memory. The resources for entries 1 and 2 can be freed after every reader has gone through at least 1 quiescent state.
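The relationship between quiescent states and the end of the grace period can be sketched with a toy model. This is a single-threaded illustration only, with no atomics or memory ordering; the names toy_qsbr, toy_qsbr_start, toy_qsbr_quiescent and toy_qsbr_gp_over are hypothetical and are not the rte_rcu API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_READERS 4   /* hypothetical limit for this sketch */

/* Toy QSBR state: one writer token and one counter per reader. */
struct toy_qsbr {
    uint64_t token;                        /* bumped at every 'Delete' */
    uint64_t reader_cnt[TOY_MAX_READERS];  /* last token each reader saw */
    unsigned int num_readers;
};

static void toy_qsbr_init(struct toy_qsbr *v, unsigned int num_readers)
{
    assert(num_readers <= TOY_MAX_READERS);
    v->token = 1;
    v->num_readers = num_readers;
    for (unsigned int i = 0; i < num_readers; i++)
        v->reader_cnt[i] = 1;
}

/* Writer: start a grace period and return the token to wait for. */
static uint64_t toy_qsbr_start(struct toy_qsbr *v)
{
    return ++v->token;
}

/* Reader: report a quiescent state (no references are held right now). */
static void toy_qsbr_quiescent(struct toy_qsbr *v, unsigned int id)
{
    assert(id < v->num_readers);
    v->reader_cnt[id] = v->token;
}

/* Writer: the grace period for token t has ended once every reader
 * has reported a quiescent state at or after t. */
static bool toy_qsbr_gp_over(const struct toy_qsbr *v, uint64_t t)
{
    for (unsigned int i = 0; i < v->num_readers; i++)
        if (v->reader_cnt[i] < t)
            return false;
    return true;
}
```

In this model the 'Free' step is allowed only after toy_qsbr_gp_over returns true for the token obtained at 'Delete'.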
4
Resource Reclamation – General process
The memory reclamation process using QSBR can be divided into 4 parts:
    Initialization
    Quiescent State reporting
    Resource Reclamation (the focus of this discussion)
    Shutdown

The application/main thread has to handle the initialization parts of the process, such as creating the RCU variable and registering the reader threads that will report their quiescent state.
The reader threads in the application have to report their quiescent state. This gives the application control over the length of the critical section and the grace period.
The writer threads/the application have to wait for the end of the grace period and free the actual resources. This will be the focus of this discussion.
During the shutdown process, the application has to make sure the reader threads are unregistered and all the resources are freed.
Once we have discussed the reclamation of resources, we will get back to each of these parts in the context of DPDK.
5
Resource Reclamation – Trivial Process
[Figure: sequence diagram with Writer Thread, Lock-Free Data structure, RCU State and Reader Thread1..N. The writer performs 'Delete' and 'Start GP', then repeatedly polls the QS status while the readers report QS, and finally performs 'Free'.]

    rte_data_structure_delete()
    {
        data_structure_delete_entry();
        rte_rcu_qsbr_check(wait = TRUE);   /* block until the grace period ends */
        data_structure_free_entry();
    }

This is the trivial process.
Advantages:
    The writer does not need to be aware of whether the data structure is lock-free.
    No code changes are required in the writer to switch to a lock-free version of the algorithm. This is the ideal situation for the writer.
Disadvantages:
    The writer polls the RCU state between 'Delete' and 'Free', which wastes cycles and reduces the writer's performance.
    Traffic on the interconnect increases, because the polling writer and the readers access the RCU state concurrently.
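The cost of the trivial process can be made visible with a small simulation. This is a sketch under loud assumptions: it is single-threaded, reader progress is simulated by the hypothetical helper toy_readers_step (one more reader becomes quiescent per writer poll), and none of the toy_ names belong to DPDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_READERS 3   /* hypothetical reader count for this sketch */

/* Toy stand-in for the RCU state: a writer token and per-reader counters. */
struct toy_rcu {
    uint64_t token;
    uint64_t seen[TOY_READERS];
};

static bool toy_gp_over(const struct toy_rcu *v, uint64_t t)
{
    for (int i = 0; i < TOY_READERS; i++)
        if (v->seen[i] < t)
            return false;
    return true;
}

/* Simulated reader progress: reader number `poll` reports its quiescent
 * state, so one more reader becomes quiescent per writer poll. */
static void toy_readers_step(struct toy_rcu *v, unsigned int poll)
{
    if (poll < TOY_READERS)
        v->seen[poll] = v->token;
}

/* Trivial delete: unlink, poll until the grace period ends, then free.
 * Returns the number of polls that found the grace period still open. */
static unsigned int toy_delete_blocking(struct toy_rcu *v, bool *linked,
                                        bool *freed)
{
    unsigned int wasted_polls = 0;

    assert(*linked);
    *linked = false;                  /* Delete: remove the reference */
    uint64_t t = ++v->token;          /* Start GP */
    while (!toy_gp_over(v, t)) {      /* Poll QS status */
        toy_readers_step(v, wasted_polls);
        wasted_polls++;
    }
    *freed = true;                    /* Free: the grace period has ended */
    return wasted_polls;
}
```

Every iteration of the while loop is a poll in which the writer does no useful work, which is the disadvantage listed above.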
6
Resource Reclamation – With lib-urcu, call_rcu
[Figure: sequence diagram with Writer Thread, Lock-Free Data structure, global Defer Queue, Reclamation Thread, RCU State and Reader Thread1..N. The writer performs 'Delete' and 'Enqueue Resource (call_rcu)'. The reclamation thread dequeues resources, starts the GP (synchronize_rcu), polls the QS status while the readers report QS, and then performs 'Free'.]

    rte_data_structure_delete()
    {
        data_structure_delete_entry();
        call_rcu(deleted resource);
    }

call_rcu is one of the methods that lib-urcu provides to save the cycles spent in the writer thread as well as the number of cycles spent on resource reclamation. There are several ways to use call_rcu. The method described here is claimed to suffice for most use cases.
This method introduces a global 'defer queue'. The writers enqueue the deleted resource to this queue for later reclamation. The actual reclamation is offloaded to a 'Reclamation Thread'. The reclamation thread waits for 1 grace period and reclaims all the resources that were deleted before the start of the grace period.
Advantages:
    The writer does not poll continuously. It does other useful work instead, so its performance is not affected.
    A single GP is enough to reclaim a batch of resources, so the cost of reclamation is amortized.
Disadvantages:
    Another thread is added to the application.
    The global defer queue is shared between the writers and the reclamation thread, which causes contention.
    The time to reclaim a resource increases, as the grace period does not start immediately after the delete. Any attempt to decrease the time to reclaim increases the traffic on the interconnect.
    Polling still exists, in the reclamation thread. Traffic on the interconnect is still affected, but far less, as waiting for the grace period happens once for a batch of deleted resources.
    If the writer runs out of resources, it cannot reclaim any itself. It has to wait for the reclamation thread to kick in.
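The amortization described above, one grace period covering a whole batch of deferred resources, can be sketched as follows. This is a hypothetical illustration and not how lib-urcu implements call_rcu: the toy_ names are invented, the queue is not thread-safe, and the grace period itself is left abstract (the caller is assumed to wait for it between toy_batch_begin and toy_batch_end).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_DQ_SIZE 8   /* hypothetical capacity */

/* Toy global defer queue: writers append, the reclamation thread drains. */
struct toy_defer_queue {
    void *res[TOY_DQ_SIZE];
    size_t count;
};

/* Writer side: hand the deleted resource over and return immediately. */
static bool toy_defer(struct toy_defer_queue *q, void *resource)
{
    if (q->count == TOY_DQ_SIZE)
        return false;
    q->res[q->count++] = resource;
    return true;
}

/* Reclamation side, step 1: remember how many resources were queued
 * before the grace period starts. Only these are covered by it. */
static size_t toy_batch_begin(const struct toy_defer_queue *q)
{
    return q->count;
}

/* Reclamation side, step 2: called once the grace period has ended.
 * Frees the whole batch and keeps anything enqueued after the snapshot. */
static size_t toy_batch_end(struct toy_defer_queue *q, size_t batch)
{
    assert(batch <= q->count);
    for (size_t i = 0; i < batch; i++)
        free(q->res[i]);
    for (size_t i = batch; i < q->count; i++)
        q->res[i - batch] = q->res[i];
    q->count -= batch;
    return batch;
}
```

A resource enqueued after toy_batch_begin is not freed by that pass, because a reader may have picked it up after the grace period started; it waits for the next pass.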
7
Resource Reclamation – With lib-urcu, rcu_defer
[Figure: sequence diagram with Writer Thread, Lock-Free Data structure, per-thread Defer Queue (TLS Queue), Reclamation Thread, RCU State and Reader Thread1..N. The writer performs 'Delete' and 'Enqueue Resource (rcu_defer)'. The reclamation thread dequeues resources, starts the GP (synchronize_rcu), polls the QS status while the readers report QS, and then performs 'Free'.]

    rte_data_structure_delete()
    {
        data_structure_delete_entry();
        rcu_defer(deleted resource);
    }

rcu_defer is marked as experimental; it is not clear why. rcu_defer solves some of the problems in call_rcu. It introduces fixed-size queues per writer thread (4K entries?). The reclamation thread wakes up every 100ms, goes through the list of queues and reclaims resources.
Advantages over call_rcu:
    The defer queue is per writer thread, so there is less contention on the defer queue.
Disadvantages:
    Nothing additional over call_rcu.
    If the writer finds that its defer queue is full, it reclaims resources itself, and the writer thread then has to poll again for 1 GP.
    If the writer thread runs out of resources, it still does not reclaim the resources.
8
Resource Reclamation – With rte_rcu_qsbr
No reclamation thread. Reclamation runs in the context of the writer thread.
Defer queue per data structure.

Deletion
    rte_data_structure_delete()
    {
        data_structure_delete_entry();
        rte_rcu_qsbr_start();        /* Start the GP ASAP */
        if (defer_queue_full)
            reclaim_resource();      /* Mostly no waiting for GP */
        enqueue_resource();
    }

Reclamation
    __rte_reclaim_resource()
    {
        peek_queue();
        if (rte_rcu_qsbr_check(wait = FALSE) == SUCCESS) {   /* No continuous polling */
            dequeue_resource();
            free_resource();
        }
    }

Addition
    rte_data_structure_add()
    {
        if (no_free_resources)
            reclaim_resource();      /* Reclaim the exact resources needed */
        data_structure_add_entry();
    }

Batching benefits enabled by new patch in RCU library [1]

[Figure: sequence diagram with Writer Thread, Lock-Free Data Structure + Defer Queue, RCU State and Reader Thread1..N. For each delete the writer performs 'Delete', 'Start GP (rte_rcu_qsbr_start)' and 'Enqueue Resource'. The readers report QS. Later the writer performs 'Peek Queue', 'Check GP (rte_rcu_qsbr_check)', 'Dequeue' and 'Free'.]

The method proposed here solves some of the issues described.
The method proposed for DPDK gets rid of the reclamation thread, which means less contention. It uses the writer thread's context to do memory reclamation. This is possible because the functionality of the synchronize_rcu API in lib-urcu is split into 2 APIs in the rte_rcu library (rte_rcu_qsbr_start and rte_rcu_qsbr_check), which allows the grace period to start immediately after the resource is deleted and to complete while the writer is doing other work.
There is one defer queue per data structure. All the resources belonging to a data structure are in 1 queue, which allows for reclaiming the exact resources needed. If lock-free reclamation is required, one could create a defer queue per data structure and per thread as well.
During deletion: if the defer queue is full, resources are reclaimed. Mostly there will not be any waiting for the grace period, since the GP was started when the resources were deleted.
During addition: if there are no free resources, resources are reclaimed. Since the queue contains only the resources belonging to this data structure, only the required resources are reclaimed. This guarantees that the writer fails only when there are no resources in the system.

No contention on defer queue
Less contention on RCU State
[1]
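The deletion and reclamation paths described above can be sketched with a toy defer queue that stores each deleted resource together with the grace period token obtained at delete time. This is a minimal single-threaded sketch: all toy_ names are hypothetical stand-ins (not the rte_rcu API), resources are plain integers, and there are no atomics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_READERS 2   /* hypothetical sizes for this sketch */
#define TOY_DQ_SIZE 4

struct toy_rcu { uint64_t token; uint64_t seen[TOY_READERS]; };
struct toy_dq_entry { int resource; uint64_t token; };

/* FIFO of (resource, token) pairs owned by one data structure. */
struct toy_dq {
    struct toy_dq_entry e[TOY_DQ_SIZE];
    unsigned int head, count;
};

static uint64_t toy_start(struct toy_rcu *v) { return ++v->token; }

static void toy_quiescent(struct toy_rcu *v, unsigned int id)
{
    assert(id < TOY_READERS);
    v->seen[id] = v->token;
}

/* Non-blocking check: has every reader passed token t? */
static bool toy_check(const struct toy_rcu *v, uint64_t t)
{
    for (int i = 0; i < TOY_READERS; i++)
        if (v->seen[i] < t)
            return false;
    return true;
}

/* Reclaim at most one resource: peek the head, check without waiting,
 * then dequeue. Stores the reclaimed resource in *out on success. */
static bool toy_reclaim_one(struct toy_dq *q, const struct toy_rcu *v, int *out)
{
    if (q->count == 0 || !toy_check(v, q->e[q->head].token))
        return false;
    *out = q->e[q->head].resource;
    q->head = (q->head + 1) % TOY_DQ_SIZE;
    q->count--;
    return true;
}

/* Delete path: start the GP right away, make room if the queue is full,
 * then enqueue the resource with its token. */
static bool toy_delete(struct toy_dq *q, struct toy_rcu *v, int resource)
{
    int reclaimed;  /* a real implementation would return this to a pool */
    uint64_t t = toy_start(v);

    if (q->count == TOY_DQ_SIZE && !toy_reclaim_one(q, v, &reclaimed))
        return false;   /* full, and the oldest GP has not ended yet */
    q->e[(q->head + q->count) % TOY_DQ_SIZE] =
        (struct toy_dq_entry){ resource, t };
    q->count++;
    return true;
}
```

Because entries are queued in token order, checking only the head is enough: if the head's grace period has not ended, no later entry's has either.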
9
Resource Reclamation Proposal for DPDK
Initialization
    Responsibility: application/main thread
    Allocate the RCU variable
    Register the reader threads
    Provide the RCU variable to the data structure library
Quiescent State reporting
    Responsibility: application/reader threads
    Provides flexibility to the application

This proposal takes the general process described earlier and maps the work onto the different parts of a DPDK application.
It is the responsibility of the main thread/application to allocate the RCU variable, register the reader threads and configure the RCU variable for use with the library.
The readers report quiescent states on the RCU variable.
Talk about the defer queue size.
10
Resource Reclamation Proposal for DPDK
Responsibility: data structure library
    This removes a significant burden from the application
    No code changes to the application's writer thread
Provide an API to register the RCU variable to use
Create a defer queue to store the deleted resource and token
Augment the data structure's delete entry API
    Start the grace period after deleting the resource by calling rte_rcu_qsbr_start
    If the defer queue is full, reclaim resources
    Otherwise, enqueue the deleted resource and token to the defer queue
11
Resource Reclamation Proposal for DPDK
Resource Reclamation (continued)
Augment the data structure's add entry API
    If there are no free resources, reclaim resources
    Add the entry to the data structure
Reclaim resources
    Peek the token at the head of the defer queue
    Use the non-blocking rte_rcu_qsbr_check API to query the quiescent state
    On success, dequeue the resource/token from the defer queue and free the resource
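The augmented add entry path can be sketched as a fixed pool of slots plus a defer queue of deleted slots. This is a hypothetical single-threaded sketch: the toy_table names are invented, and the RCU state is collapsed into two numbers, `token` (last grace period started) and `completed` (last grace period all readers have acknowledged), which is a simplification and not how rte_rcu tracks readers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SLOTS 2   /* hypothetical pool size */

struct toy_table {
    int free_slots[TOY_SLOTS];          /* stack of free slot indexes */
    unsigned int num_free;
    struct { int slot; uint64_t token; } dq[TOY_SLOTS];  /* defer queue */
    unsigned int dq_head, dq_count;
    uint64_t token;       /* last grace period started */
    uint64_t completed;   /* last grace period acknowledged by all readers */
};

static void toy_table_init(struct toy_table *t)
{
    *t = (struct toy_table){0};
    for (int i = 0; i < TOY_SLOTS; i++)
        t->free_slots[i] = i;
    t->num_free = TOY_SLOTS;
}

/* Delete: start a GP and park the slot on the defer queue with its token. */
static void toy_table_delete(struct toy_table *t, int slot)
{
    assert(t->dq_count < TOY_SLOTS);
    unsigned int tail = (t->dq_head + t->dq_count) % TOY_SLOTS;
    t->dq[tail].slot = slot;
    t->dq[tail].token = ++t->token;
    t->dq_count++;
}

/* Reclaim: peek the head token, check it without blocking, then dequeue
 * and return the slot to the free stack. */
static bool toy_table_reclaim(struct toy_table *t)
{
    if (t->dq_count == 0 || t->dq[t->dq_head].token > t->completed)
        return false;
    t->free_slots[t->num_free++] = t->dq[t->dq_head].slot;
    t->dq_head = (t->dq_head + 1) % TOY_SLOTS;
    t->dq_count--;
    return true;
}

/* Add: reclaim only when the pool is empty. Returns the slot, or -1. */
static int toy_table_add(struct toy_table *t)
{
    if (t->num_free == 0 && !toy_table_reclaim(t))
        return -1;
    return t->free_slots[--t->num_free];
}
```

In this toy, an add can still fail while the grace period of the oldest deleted slot is open; whether to retry, wait or report the failure at that point is a design choice the sketch leaves out.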
12
Resource Reclamation Proposal for DPDK
Shutdown
    Responsibility: application and data structure library
    Application
        Ensure reader threads are not using the data structure
        Unregister the reader threads
    Data structure library
        Reclaim all the resources on the defer queue
13
Performance
Test setup
    LPM library integrated with DPDK RCU library
    1 writer thread, 42M adds/deletes routes with prefix length > 24
    11 reader threads report the quiescent state status every 1024 lookups
Numbers
    Without RCU integration: cycles
    With RCU integration: cycles (1.3%)
Since the reclamation is done in the context of the writer thread, it is important to look at the writer's performance.
14
Next Steps
New APIs
    Provide APIs in rte_rcu for common functionality
        Create defer queue (rte_rcu_qsbr_dq_create, rte_rcu_qsbr_dq_delete)
        Push resources to defer queue (rte_rcu_qsbr_dq_enqueue)
        Reclaim resources (rte_rcu_qsbr_dq_reclaim)
Thanks for the comments on the patches. I will add 3 new APIs to the rte_rcu library (yet to be done). These APIs will abstract the common functionality required for resource reclamation and will make it easy for external libraries to integrate with the RCU library. They will also make the changes to rte_lpm and rte_hash much simpler.
15
Thanks to Ruifeng Wang – Integrating RCU with LPM
Dharmik Thakkar – Integrating RCU with Hash
16
Thank you Questions?