Futex vs pthread_condvar


I was looking into building a small library for sharing data between multiple processes. I wanted it to support different channels that each store the most recently published message, while also allowing consumers to wait on a set of channels and wake up when new data is available on any of them.


I also wanted the library to be "brokerless", meaning there is no special process copying and forwarding data around. Each producer locks a channel, writes directly to the shared memory, and then notifies the consumers. Locking the channel is key to preventing data races, so each consumer must also lock the channel while reading from it. But if a reader-writer style of lock is used, we can have multiple readers at once.


An earlier version of this library operated within a single process, so I was able to use one std::shared_mutex per channel for locking and one eventfd per channel for signalling events. Consumers could then epoll on any number of the channels.

But with a multiprocess approach, eventfd cannot be used. My understanding is that, on Linux, std::shared_mutex is implemented using pthread_rwlock_t, but that it's UB to try to use the C++ std wrapper on a raw memory region from a shared memory segment. So I wrote my own wrapper around the pthread type.
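
For reference, the core of such a wrapper looks roughly like this. It's a minimal sketch, not the library's actual code, and it assumes the shared memory segment is mapped in every process and exactly one process runs the initialization:

```cpp
#include <pthread.h>

// Placed directly inside the shared memory segment (illustrative name).
struct SharedRwLock {
    pthread_rwlock_t lock;

    // Called once, by the process that creates the segment.
    int init() {
        pthread_rwlockattr_t attr;
        pthread_rwlockattr_init(&attr);
        // The key step: allow the lock to be used from multiple processes.
        pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        int rc = pthread_rwlock_init(&lock, &attr);
        pthread_rwlockattr_destroy(&attr);
        return rc;
    }

    void lock_shared()    { pthread_rwlock_rdlock(&lock); }
    void lock_exclusive() { pthread_rwlock_wrlock(&lock); }
    void unlock()         { pthread_rwlock_unlock(&lock); }
};
```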


One idea to replace the eventfd+epoll event system for multiple processes would be to have a condition variable ("cv") per consumer. But this requires each channel to manage a list of consumers, and a lot of locking on the consumer side.


Another idea would be to have one condition variable ("cv") for all of the channels combined, with the condition check failing if a channel we are not subscribed to is notified. But this means that every publish of any message wakes every consumer thread, only for most of them to immediately go back to sleep when the condition check fails. Note that multiple condition variables cannot be combined this way, because a thread can only wait on one at a time.
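
A rough sketch of that single-cv idea, with illustrative names (in the real library the mutex and cv would also have to be initialized as process-shared):

```cpp
#include <pthread.h>
#include <cstdint>

constexpr int kChannels = 8;

struct Shared {
    pthread_mutex_t mtx;        // process-shared in the real setup
    pthread_cond_t  cv;         // process-shared in the real setup
    uint64_t seq[kChannels];    // bumped on every publish to a channel
};

// Waiter: `subscribed` is a bitmask of the channels this consumer cares about.
void wait_for_any(Shared* s, uint32_t subscribed, const uint64_t last_seen[kChannels]) {
    pthread_mutex_lock(&s->mtx);
    for (;;) {
        bool ready = false;
        for (int c = 0; c < kChannels; ++c)
            if (((subscribed >> c) & 1) && s->seq[c] != last_seen[c])
                ready = true;
        if (ready)
            break;
        // Every publish on *any* channel broadcasts and wakes us here,
        // even when the check above is about to fail again.
        pthread_cond_wait(&s->cv, &s->mtx);
    }
    pthread_mutex_unlock(&s->mtx);
}

// Publisher: bump the channel's sequence number and wake everyone.
void publish(Shared* s, int channel) {
    pthread_mutex_lock(&s->mtx);
    s->seq[channel]++;
    pthread_mutex_unlock(&s->mtx);
    pthread_cond_broadcast(&s->cv);
}
```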


A similar setup can be achieved with a futex. The futex wait/wake operations behave much like a condition variable, but the bitset variants (FUTEX_WAIT_BITSET / FUTEX_WAKE_BITSET) add a 32-bit mask: a waiter supplies a mask when it waits, the waker supplies one when it wakes, and a waiter is only notified if its mask intersects the signalled one.
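
A minimal sketch of the two calls involved (illustrative wrappers, error handling omitted; the futex word lives in shared memory and acts as a generation counter):

```cpp
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <climits>
#include <cstdint>
#include <atomic>

// Sleep until a waker's mask intersects `mask`. The kernel only puts us to
// sleep if *uaddr still equals `expected`, which closes the lost-wakeup race;
// the mask itself is stored and checked later, by the waker.
long futex_wait_bitset(std::atomic<uint32_t>* uaddr, uint32_t expected, uint32_t mask) {
    return syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET, expected,
                   /*timeout=*/nullptr, /*uaddr2=*/nullptr, mask);
}

// Wake up to INT_MAX waiters whose wait mask intersects `mask`.
long futex_wake_bitset(std::atomic<uint32_t>* uaddr, uint32_t mask) {
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET, INT_MAX,
                   /*timeout=*/nullptr, /*uaddr2=*/nullptr, mask);
}
```

Note that FUTEX_PRIVATE_FLAG is deliberately not used here, since the futex word is shared between processes.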

So far this is exactly like the global condition variable approach. But the wording of the futex(2) man page made it seem that there is one important difference: the futex mask is checked by the *waker*, whereas with a condition variable *all waiters must be woken* and each checks its condition itself.


So using futex, only the threads that should be woken are woken: all of the woken threads are ones we need, and none more. But with the global condition variable, all threads are woken and potentially many go right back to sleep. pthread_cond_broadcast cannot check the condition on the waiters' behalf, so every waiting thread must wake up just to test it.
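
Mapping this onto the channel problem (again just a sketch, building on the wrappers above): one bit per channel, each consumer waits with the mask of its subscriptions, and each producer wakes with only its own channel's bit set.

```cpp
// Consumer: wait until something is published on one of the subscribed channels.
void consumer_wait(std::atomic<uint32_t>* gen, uint32_t subscribed_mask) {
    uint32_t seen = gen->load();
    futex_wait_bitset(gen, seen, subscribed_mask);
}

// Producer: advance the generation counter, then wake only matching waiters.
void producer_publish(std::atomic<uint32_t>* gen, int channel) {
    gen->fetch_add(1);
    futex_wake_bitset(gen, 1u << channel);
}
```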


At least that is what the wording of the man page seemed to suggest. To verify it, I needed to look at the kernel code, then run some benchmarks to make sure the logic worked in practice and that the futex approach would actually be faster.



Futex Source Code


Most of the relevant code is in kernel/futex/waitwake.c and kernel/futex/core.c.


Let's start with the waiter. __futex_wait calls futex_wait_queue, which sets the current task's state to TASK_INTERRUPTIBLE (i.e. to be considered sleeping/waiting) before calling schedule, which itself calls __schedule and will possibly context switch to a different task. The waiter's futex_q "queue" entry is also appended to a hash bucket chosen by hashing the uaddr; the waker later uses this entry to find the task and wake it up. It looks like __schedule_loop does not return until the task has been woken by a reschedule, which in this case must happen either when the timeout expires (I don't use timeouts) or when the waker signals it.


On the waker side, calling futex with FUTEX_WAKE leads to futex_wake. This function checks all futex_qs in the hash bucket mapped from the uaddr: it loops over the queued entries and finds the ones that match based on the futex key. Then on line 188 we can see the bit-mask check. For each match, the function calls the futex_q->wake function pointer to signal the waiting futex_q to wake up. Now I'm lost on exactly how this signalling works, and how the scheduler knows to reschedule the waiting tasks. I can't trace back where the function pointer is set, but at this point my curiosity is satisfied.



pthread Cond Var Source Code


The pthread_cond_wait function is pretty complicated. It seems to be built on futexes, but includes several other steps.


Having only used the C++ std::condition_variable API, whose "wait" method can perform the "condition" check with a user-defined predicate function, I was surprised the pthread API does not. It does suggest doing it manually, however:

> When using condition variables there is always a boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_wait() or pthread_cond_timedwait() functions may occur. Since the return from pthread_cond_wait() or pthread_cond_timedwait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.
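
For comparison, this is roughly what the two APIs look like side by side: the std::condition_variable predicate overload versus the manual loop the man page is describing (illustrative flags and names):

```cpp
#include <condition_variable>
#include <mutex>
#include <pthread.h>

std::mutex m;
std::condition_variable cv;
bool ready = false;

void cpp_wait() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready; });   // loop + predicate check built in
}

pthread_mutex_t pm = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  pc = PTHREAD_COND_INITIALIZER;
bool pready = false;

void pthread_wait() {
    pthread_mutex_lock(&pm);
    while (!pready)                      // the loop you must write yourself
        pthread_cond_wait(&pc, &pm);     // re-checked after every (possibly spurious) wakeup
    pthread_mutex_unlock(&pm);
}
```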


From this it's clear: all waiting threads are woken and each checks the condition only after waking up. That makes me wonder whether a kernel API that allows the check to be done on the waker side would be a good idea.



Benchmarks


I created a benchmark comparing the two approaches. It does not use the pthread condvar in exactly the typical way, but it's pretty close. Instead of checking the condition while the mutex is held, I check an atomic counter after the mutex is released. This makes the benchmark more comparable between the two approaches.
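
In other words, the condvar side of the benchmark waits roughly like this (my sketch of what that means, not the actual benchmark code; `new_data_on_my_channels()` stands in for the atomic-counter comparison):

```cpp
#include <pthread.h>

extern pthread_mutex_t mtx;         // process-shared in the benchmark
extern pthread_cond_t  cv;
bool new_data_on_my_channels();     // compares per-channel atomic counters

// Wake up through the condition variable, but do the "did my channels
// change?" check only after the mutex has been released.
void reader_wait() {
    while (!new_data_on_my_channels()) {
        pthread_mutex_lock(&mtx);
        pthread_cond_wait(&cv, &mtx);   // woken by any writer's broadcast
        pthread_mutex_unlock(&mtx);
    }
}
```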


The benchmark creates 8 channels, 8 writers, and 32 readers. Each reader waits for events on four channels (which four depends on the reader), and all readers additionally wait on channel 1 to simulate a very busy channel. Each writer writes to the one channel associated with it.


Messages are pushed for 10 seconds and the number of reads is counted. This is repeated several times, and the results are pretty similar across runs. Results from one run follow:


-------------------------------------------------------
 - pthread condvar test
 - Avg Read Latency:  57.4us,   10228 writes  172943 reads
 - Context switches (v 452398) (iv 11619)

-------------------------------------------------------
 - futex condvar test
 - Avg Read Latency:  25.7us,   10400 writes  175418 reads
 - Context switches (v 177688) (iv 119)

The condition variable approach has higher latency and marginally fewer reads. But it also has roughly two and a half times as many context switches, meaning it will bog the system down much more. This is what I expected given the difference explained above: a global condition variable requires waking all threads whenever it is notified, whereas the futex mask is checked on the waker side. For the futex test, the number of context switches is only slightly more than the number of reads!


Pushing fewer total messages did bring the latencies closer together, but the number of context switches was always much higher for the condition variable approach. I don't analyze the overhead of publishing a message, but for my use cases of under a dozen or so waiters, the futex wake operation having to traverse all waiters is negligible.