Monday, December 21, 2009

User space RCUs

Linux 2.6 kernel has RCU library for a long time. It is clear that RCU based locking improves performance of network processing applications in kernel space. Any network packet processing application which requires 'search' operation on data structures on per packet basis would greatly benefit in performance using RCU based locks. Read/write locks are lot more expensive even though there is no contention. I guess the number of instructions to take and release lock is high.

It recent past many network device vendors are moving their applications to user space for multiple reasons. Some of the reasons are:
  • Easy maintenance.
  • Applications are becoming complex and Linux user space has many more debugging tools compared to Linux user space.
  • Many library utilities are available for complex networking applications (Example: PCRE libary, openSSL library, XML libraries, Compression libraries etc..
Since RCUs are used to improve performance of networking applications in kernel space, it is natural to expect RCU library for user space applications too.

Fortunately, there is one user space RCU library. One can download it from here.
There is one good presentation comparing the performance with read/write mutexes and different types of RCU implementation. Please find it here.

I, personally like QSBR approach. It has very good 'read' and 'update' performance. I believe RCU DEFER with QSBR would be even better.

Usage is simple and similar to Kernel space RCU:

Read Side:
  • rcu_read_lock
  • critical section of the code. (Typically searching for matching node from data structures such as linked lists, hash lists etc...)
  • rcu_read_unlock.
Update (Add/Delete) side:
  • Thread specific Mutex Lock (outside of RCUs) : This is required to protect multiple threads doing the add/delete operations simultaneously).
  • Critical section of the code. (which removes the node from the data structures such as linked list or adds a new node to the data structure)
  • Thread specific Mutex Unlock
  • synchronize_rcu()
  • Free the node (if it is delete operation)
If you use "RCU defer" mechanism, update side usage would look like this.
  • Thread specific Mutex Lock (outside of RCUs) :
  • Critical section of the code. (
  • Thread specific Mutex Unlock
  • defer_rcu(Freefnptr, node that was removed)
  • FreeFnPtr would get called at later point of time. This function would free the node that was passed to it.
RCU defer mechanism is expected to be faster than doing synchronize_rcu for every update operation. RCU_defer mechanism does one synchronize_rcu for a batch. As you can see from the source code, synchronize_rcu is expensive. This function ensures that all threads completed their read side operations and it involves getting indications from other threads. Since it is expensive, doing once for every batch is improves performance of update operations.

It appears that even without the 'defer' mechanism, QSBR read performance is 1000 times more than pthread read/write lock and QSBR update performance is at least 15 times better than pthread read/write locks.

I hope that it would get into Linux distribution soon.

No comments: