Programming with Threads: Questions Frequently Asked by C and C++ Programmers
Authors: Hans-J. Boehm, HP Labs & Paul McKenney, IBM
Disclaimer: These are only the authors' opinions.
In the process of trying to define the semantics of threads for C++0x, it
has become clear that there are widespread misunderstandings about the
use of threads in these languages. This is an attempt to correct
some of these.
We cheat in several ways:
- To start with, we pretend that something similar to our
current proposals were part of the current language standards, and you
could use them with your current compiler. They're not. You can't.
They're not even part of the working paper, yet. (For the most part,
the basic ideas are uncontroversial, so the odds are good they will be.
And nobody currently expects the C++0x standard to be voted out without thread
support. But details will change.) We do try to briefly address the
additional issues arising with current compilers near
the end of this document.
- Not all of these are really frequently asked questions. We made up some
that we wish people would ask.
Is this known to reflect the position of any major organization or
corporation?
No. At least not currently.
Why are you (mostly) assuming the existence of a standard that is at
least two years out?
Because we're not sufficiently satisfied with the current answers. See below.
What basic rules should I follow when writing multi-threaded C or C++
code?
The basic rule you need to follow when writing multi-threaded code is:
Do not allow data races.
This of course is not the only thing you need to know to write correct parallel
code. You still need to make sure that actions that should logically
be indivisible actually are, that you avoid deadlocks, that your algorithm
is sufficiently parallel for the performance you need, etc. But we do
think it addresses much of the confusion that has surrounded this area.
In the past, it was much easier to justify code with "benign" data races,
and such code is quite common, particularly in older code and in code
that could not tolerate the performance overhead associated with locking.
Atomic objects now provide a clearer and
safer way to write such code.
So what's a "data race"?
A data race occurs if two accesses to the same
memory location by different
threads are not ordered,
at least one of them stores to the memory location,
and at least one of them is not a synchronization action.
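For example, the following sketch (the names are ours) contains a data race
if writer and reader can run concurrently in different threads:

  #include <stdio.h>

  int x;                                    /* ordinary, non-atomic global */

  void writer(void) { x = 42; }             /* plain store */
  void reader(void) { printf("%d\n", x); }  /* plain load */

  /* The two accesses are not ordered, one of them is a store, and
     neither is a synchronization action: a data race. */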
So what's a "synchronization action"?
An operation on a specially declared atomic<T>
object (or C equivalent),
or an operation on a lock, condition variable, or the like.
So what's an atomic object?
An object that may safely be concurrently manipulated by more than one thread,
even if one of them is updating the object. We currently expect the type of
such an object to be spelled atomic<T>, though there may be
other ways to spell it by the time the standard is finished. Updates to
such objects appear indivisible; a thread concurrently reading the
object will see the old or the new value, never something in between.
In order to ensure this, an atomic<T> variable, unlike
a volatile T variable, may have stricter alignment constraints
than a plain type T variable.
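For example, assuming an atomic<T> template along the lines of the current
proposal (the header name and the member names load and fetch_add follow the
proposal, but may change):

  #include <atomic>          // assumed C++0x header providing atomic<T>
  using namespace std;       // namespace placement is not yet settled

  atomic<int> hits;          // may be accessed concurrently without a data race

  void record() {
    hits.fetch_add(1);       // indivisible read-modify-write
  }

  int read_hits() {
    return hits.load();      // sees an old or a new value, never a mixture
  }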
When are two memory operations "ordered"?
If you are not being devious, two memory operations are ordered if they
cannot occur simultaneously. If you are being devious, see the
"happens before" definition in the
standards proposal. To our knowledge, the two diverge only in the
case of blatant abuse of a trylock (nonblocking lock acquisition) primitive.
See Boehm,
Reordering Constraints for Pthread-Style Locks,
PPoPP 2007 for details.
So what's a "memory location"?
In the proposed standard, a "memory location" is any scalar object,
or contiguous sequence of bit-fields. (Such a sequence of bit-fields
may be broken into separate memory locations by a zero-length bit-field.)
Thus an update of one data member of a struct or class cannot interfere
with an operation on another, unless they are both part of the
same sequence of adjacently declared bit-fields. Concurrent updates
to members of the same sequence of bit-fields are not allowed.
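For example, under the proposed rules the following declaration (ours)
contains three distinct memory locations:

  struct S {
    char a;      // its own memory location
    int  b:5;    // b and c together form one memory location,
    int  c:9;    //   since they are a contiguous sequence of bit-fields
    int  :0;     // zero-length bit-field: ends the sequence
    int  d:3;    // therefore a separate memory location
  };

Concurrent updates to a and d by different threads are allowed;
concurrent updates to b and c are not.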
Why the weird exception for bit-fields?
Most processors cannot efficiently update individual bits without reading
and writing back adjacent bits. That might cause an update to one of
the adjacent bits to be lost.
But what about benign data races?
No such beast. At least not in C++0x programs.
(This notion does often exist at the machine code level. Hence
optimizers might introduce them. But you should never write them
in source code.)
Of course, there is no shortage of applications and operating-system kernels
written in C or C++ that do contain code with data races.
For more on this subject, see the discussion of current compilers
near the end of this document.
Does that mean that if I write a program with a data race, the machine
is allowed to catch on fire?
Yes. As far as the C++ standard is concerned. (This is probably inconsistent
with local fire regulations, however.)
But what really happens?
It would be desirable for the machine to stop your program
and report stack traces identifying the race. Unfortunately, in real
life, other outcomes, e.g. a completely unexpected segmentation fault,
are probably also possible. In fact, the difficulty of avoiding
such outcomes in heavily optimized code was part of the
motivation for giving undefined semantics to data races.
This seems rather draconian. After all these years,
why would you outlaw data races like this?
This is actually nothing new. The current pthread standard
(Single Unix Specification version 3, Base Definitions, 4.10) states:
"Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it." This is not a recent
addition.
Nor was this original with pthreads. The Ada 83 and later standards contain
a similar restriction. What is new is that C++0x is proposing to remove
the reasons to violate this rule.
There are several technical reasons, beyond continuity with
existing standards, not only to continue with this rule, but to
insist more aggressively on adherence to it; several of these
are discussed in the answers below.
I've heard that when writing multi-threaded code I have to worry about
the order in which memory operations become visible to other threads.
How can I deal with that?
If you follow the above rule (no data races!), and you avoid "low level"
atomic operations, you don't. Threads will appear to execute as though
the steps from each thread were simply interleaved, i.e.
the way you probably expect.
In reality, this illusion is maintained mostly because it would require
a data race for you to be able to tell that the code is not being executed
like that. (And trust us, it isn't.) But that again means that if
you introduce a data race, all bets are off. Unfortunately, it even
means that if you accidentally introduce a data race, you may encounter
strange results while debugging. (Certainly vendors are encouraged
to minimize that effect, particularly at low optimization levels. But
the (proposed) standard makes no claims about this.)
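For example, in the following race-free sketch (the names are ours; ready is
an atomic object as described earlier), simple interleaving reasoning is all
you need:

  #include <atomic>
  #include <cassert>

  int data;                      // ordinary variable
  std::atomic<bool> ready;       // synchronization: accesses to it never race

  void producer() {
    data = 42;                   // ordinary store
    ready = true;                // atomic store
  }

  void consumer() {
    if (ready)                   // atomic load
      assert(data == 42);        // holds in every interleaving of the steps
  }

There is no interleaving of these steps in which consumer sees ready as true
but data as unset, so the assertion cannot fail; and since the program is
race-free, the implementation must preserve that illusion.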
How do I recognize a "low level" atomic operation when I see one?
These are operations that have explicit memory ordering constraints
like "acquire", "release", or "relaxed" in their name. The standards
committee is still discussing naming, and how to make them easily identifiable.
When should I use low level atomic operations?
You shouldn't unless both:
- The alternatives lead to inadequate performance, and you've determined that
high level atomics are the problem, and
- You sufficiently understand the relevant memory ordering issues, something
that is generally well beyond this FAQ.
There is one common case in which it is relatively
safe to use store_release
and load_acquire instead of the higher level assignment
operators. This is the case in which a variable is used to signal readiness
(e.g. by storing a flag or a pointer to newly ready data) to another thread,
which simply tests it, or waits for it to change. This does not apply if
the variable is being used as part of a more complicated protocol that involves
another atomic variable.
In this case, you may see a significant performance benefit, but only on
a few architectures, and the resulting code will be harder to reason
about. Generally, low level atomic operations are intended for
implementors of a few performance-critical libraries.
We expect that in some cases coding standards will prohibit the use of
low level atomic operations.
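In the readiness-flag case described above, a sketch might look as follows
(we use the store/load spelling with explicit ordering arguments; the final
names may differ, and compute and consume stand for application code):

  #include <atomic>

  int compute();                  // application code (hypothetical)
  void consume(int);              // application code (hypothetical)

  int payload;                    // ordinary data produced by one thread
  std::atomic<bool> published;    // the signaling variable

  void make_ready() {
    payload = compute();          // ordinary stores to the data
    published.store(true, std::memory_order_release);
  }                               // release: prior writes become visible to...

  void use_when_ready() {
    while (!published.load(std::memory_order_acquire))
      ;                           // ...any thread whose acquire load sees true
    consume(payload);             // guaranteed to see the completed payload
  }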
Should shared variables be declared as volatile even if
they are protected by locks?
No. First, declaring lock-protected shared variables as volatile
is unnecessary and needlessly disables compiler optimizations.
Second, it is likely that some library vendors will be slow to
introduce multithreading. Developers of multi-threaded programs
using such libraries will therefore need to introduce locking to
protect shared variables that are used within library functions,
even when those developers do not have access to the library
source code. Given the proposed standard, in many cases,
the developer can simply hold a lock across each call to the
library function. In contrast, if the standard required all
shared variables to be declared volatile, single-threaded libraries
could not be safely linked to multithreaded programs.
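For example, a developer might wrap every call into a non-thread-safe library
with one program-wide lock (a pthreads sketch; library_function is
hypothetical):

  #include <pthread.h>

  int library_function(int);          /* not itself thread-safe */

  pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

  int call_library(int arg) {
    int result;
    pthread_mutex_lock(&lib_lock);    /* every thread uses this lock */
    result = library_function(arg);   /* library-internal shared state
                                         is now protected */
    pthread_mutex_unlock(&lib_lock);
    return result;
  }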
But what about existing applications that do contain data races?
Such applications make use of non-standard extensions and/or rely
on properties of specific compiler implementations, properties that
are explicitly not guaranteed by, for example, the 1995 Posix thread
standard. Such code is at best safe with a well-developed set of
conventions tuned to the particular compiler and possibly hardware
platform. Although statistics are hard to come by, we know of a number of
occasions on which this has led to subtle bugs, in spite of
standard-conforming, and arguably quite reasonable, compiler behavior.
Perhaps most importantly, such coding practices cannot guarantee continued
correctness and portability of the application as the compiler evolves
and more aggressive optimizations are added, or as the application is
moved to a different compiler. The point of the standard is to clearly
define the rules by which both the C++ programmer and the C++ compiler-writer
must play in order to ensure continuing application correctness.
But why can't we standardize something that better accommodates
existing programs with races?
The current standard proposal tries to precisely define rules that
are reasonably simple, and hence understandable and teachable, and that
preserve performance of the large majority of code that avoids
races using locks and similar synchronization mechanisms. Providing
semantics for races is complicated, and slows down lock-based
race-free code. Providing suitably weak semantics for data races,
as was (and had to be) done for Java, is unlikely to legitimize
many existing C or C++ programs with races, since these programs
usually either:
- Also make assumptions about memory visibility that would be
expensive to enforce universally, or
- Rely on other nonstandard constructs (e.g. some kind of explicit
memory fences) to guarantee the necessary memory visibility ordering.
The advantage of adopting the multi-threading constructs
that we expect to be available in the C++0x standard is that you
get standard-mandated multi-threaded safety and full access to
advanced optimization techniques.
It is likely that compilers that have seen extensive multi-threaded
use will provide facilities to ease transition from traditional
assumptions to the upcoming standard, but of course the standard
has no way to mandate such facilities.
If you care about such facilities, we therefore advise you to discuss
this with your compiler vendor/provider.
Can you give a specific example of optimizations that would be
prohibited if some races were allowed?
That depends on exactly what is allowed.
If we tried to guarantee that
programs with data races behaved sequentially consistently, i.e. as
though thread steps were simply interleaved, most code scheduling
and loop optimizations would break, at least in the absence of whole
program analysis. Thus in reality we would have to provide very weak
ordering guarantees for races, which would not do much to preserve
correctness of existing code.
If pointer assignment were
guaranteed to be atomic, the language would become hard to implement
on some embedded processors that don't support atomic pointer-sized
stores. Even on larger machines, it is unclear what types should
behave atomically and be allowed in races.
It would also be hard to specify, for example, what happens when
an object with virtual functions is communicated via a simple pointer
assignment to another thread. In many cases, some sort of fence would
be needed to ensure visibility of the vtable pointer to the receiving
thread. But the vtable pointer is an implementation detail that shouldn't
be programmer visible. And it would be expensive to implicitly generate
these fences.
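A sketch of the problematic pattern (the names are ours):

  struct Base {
    virtual void f();
  };

  Base *shared_p;                // plain pointer, not atomic

  void publisher() {
    shared_p = new Base();       // initializes the object, including its
  }                              //   hidden vtable pointer

  void consumer() {
    if (shared_p)                // racy read of the pointer
      shared_p->f();             // may see the object before its vtable
  }                              //   pointer is visible: a wild call

Making this work would require implicitly generated fences in one or both
threads, which is exactly the cost the proposed standard avoids imposing on
race-free code.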
Common compiler optimization assumptions
would be invalidated. The resulting failures seem to be infrequent
in practice, but hard to characterize.
Given that I don't have access to a C++0x implementation, how much
of this applies?
There are three areas in which current implementations generally
do not behave as we have described:
- There are no atomic data types or operations of the kind we have described here.
- The notion of a "memory location" used in the definition of a data
race is intentionally undefined. That means that updating a scalar
object may interfere with an update of a nearby one.
- There is no clear prohibition against "spurious" compiler-introduced
stores of the old value back into a global variable x, when such an
update is not called for by the source. This may cause a concurrent
update to x to be lost, somewhat unpredictably. Many compilers
can introduce such updates, though they rarely do so in cases in which
it turns out to matter. (See
Boehm, Threads Cannot be Implemented as a Library, PLDI 2005 for details.)
How do I deal with the absence of atomic operations before
C++0x?
The official Posix answer is to use locks instead. If it is
remotely practical, that is also the correct answer. The rest
of this applies to the remaining cases.
There are many alternatives, all imperfect. There are several
packages that provide a generic interface to atomic operations with
platform-specific implementations.
atomic_ops
is one of these. Gcc provides __sync_ operations. Microsoft
provides Interlocked operations. They are
generally still evolving as we understand the issues better.
Many of these, e.g. the gcc __sync_
operations, do not provide a special atomic load, and hence have no
way to inform the compiler, other than with the volatile
qualifier, that a value may change asynchronously. We recommend that
whichever you use, you use a macro to identify such loads,
to make it easier to later convert to the more correct C++0x approach.
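For example (a sketch; LOAD_SHARED is our name, and the volatile cast is
discussed further below):

  /* Force a real load of x, and tell the compiler its value may have
     changed asynchronously. */
  #define LOAD_SHARED(type, x) (*(volatile type *)&(x))

  /* Usage: int v = LOAD_SHARED(int, flag); */

Every use of the macro then marks a load that should eventually become an
atomic load under the C++0x approach.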
So should I use volatile to identify variables modified by
another thread?
For C++0x, no. Currently, the official answer for pthreads is also no.
Data races on volatiles are disallowed along with other races.
However, it
appears to be the only currently available standard mechanism for notifying the
compiler that an asynchronous change is possible, and thus that some
optimizations, such as the register-reuse reload discussed below, are unsafe.
Thus in cases in which a race cannot be avoided due to the absence of
suitable atomic operations, and locks are too slow, it seems by far the
safest option.
If you do use volatile remember that its detailed semantics vary
dramatically across platforms. On some platforms, two consecutive
volatile stores may become visible out of order to two consecutive
volatile loads in another thread. On other platforms, that is
explicitly prevented. Thus platform-dependent mechanisms for memory
ordering are usually also needed. (The atomic_ops package uses
volatile in platform-dependent ways internally, but adds fences
to enforce requested memory ordering.)
It is also important to remember that volatile updates
are not necessarily atomic. They may appear to be carried out piecemeal.
Whether or not they actually are atomic depends on the platform and
alignment.
How about using explicit platform-dependent memory fences with ordinary
memory operations, as the Linux kernel does?
We recommend against this for other applications, since doing so requires
non-standard implementation-specific mechanisms.
These mechanisms prevent the compiler
from carrying out optimizations that would be unsafe given
that asynchronous changes to a variable are possible.
If such mechanisms are not correctly used, problems
may arise, particularly if a load involved in a race
is not followed immediately by a fence. Suppose we write:
  {
    int local = shared;
    if (local) perform some action;
    A: <code that affects neither shared nor local>
    if (local) perform some compensating action;
  }
If the compiler runs out of registers during the block A and decides to
reuse the register holding local, it may legitimately decide to
reload shared in order to recompute local, since it has
no idea that shared may change asynchronously. Thus the compensating
action may be performed without the original action or vice-versa.
This is essentially the same problem as with the optimizations discussed
earlier: transformations that are perfectly safe for single-threaded code
become unsafe in the presence of races.
Given potential pitfalls like the one illustrated in the above
example, it should be clear that obtaining
correct results from current compilers given code containing data
races requires deep expertise in the target CPU architecture(s),
the specific compiler(s) to be used, and in the non-standard extensions that
are used to prevent the compiler(s) from carrying out optimizations
that, while perfectly safe in single-threaded code, lead to incorrect
results when applied to multi-threaded code with data races.
But can't I simply use a volatile cast to prevent this problem?
Maybe, and maybe not, depending on your compiler and on the flags passed
to it.
Some compilers emit expensive memory fences around volatile accesses,
and others don't.
You might or might not want such fences, depending on what you are doing.
This may seem like a sorry state of affairs,
but please keep in mind that the "volatile" keyword was introduced into
the C language long before multi-threaded code and multiprocessor
systems were commonly available.
So is it OK to have data races until we get real atomic operations?
No. At least not in portable code or without a thorough
understanding of the compiler involved. See the previous answers.
So how do I deal with the adjacent field over-write issue?
Some compilers provide a flag to improve thread-safety, and other
compilers provide attributes to force a given data item to be
aligned and padded so as to reduce the issues with data races.
If your compiler has such facilities, use them.
If you are generating code for an Alpha processor, tell your compiler
to generate code for recent processors that support byte loads and stores.
If you are generating code for an older Alpha processor,
you will need to modify your application to ensure that
bytes protected by separate locks are in separate words.
Or you will need to upgrade to a more modern CPU.
By far the most common violation of the C++0x rules on current compilers
is the treatment of bit-fields that share a word with non-bit-fields, as
in
  struct s { char a; int b:5; int c:9; char d; };
which might occupy a single 32-bit word. Avoid such structures if,
for example, a and b might be concurrently updated.
This can be done by separating
bit-fields from other members with a sufficiently large padding field
(usually an int suffices), as sketched below.
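For example (a sketch; the layout assumes typical alignment rules and an
int-or-smaller compiler access granularity):

  struct s {
    char a;
    int  pad1;   // keeps updates of a out of the bit-fields' word
    int  b:5;
    int  c:9;
    int  pad2;   // keeps updates of d out of the bit-fields' word
    char d;
  };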
Alternatively, one can combine the bit-field variables into a single
basic type (e.g., an int), and use explicit masking and atomic operations
to access and update the individual fields, as is often done in
operating-system kernel code.
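A sketch of that alternative, using the gcc __sync builtins mentioned
earlier (the field layout within bc is our choice):

  /* b occupies bits 0-4 of bc, c occupies bits 5-13. */
  struct s { char a; unsigned bc; char d; };

  #define B_MASK  0x1fu
  #define C_SHIFT 5
  #define C_MASK  (0x1ffu << C_SHIFT)

  void set_b(struct s *p, unsigned v) {
    unsigned oldv, newv;
    do {                /* retry until no concurrent update intervenes */
      oldv = p->bc;
      newv = (oldv & ~B_MASK) | (v & B_MASK);
    } while (!__sync_bool_compare_and_swap(&p->bc, oldv, newv));
  }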
To be even more conservative, one could separate other
potentially concurrently modified fields in a similar manner.
We know of no compilers under which an update of one object can overwrite
an adjacent object unless both are members of the same struct or class,
although this does appear to be allowed by current standards.
So how do I deal with potential spurious stores in current compilers?
The good news appears to be that these are introduced primarily in
cases in which they do not introduce a bug. If count
is a potentially shared variable, many compilers would
compile the loop
  for (p = q; p != 0; p = p->next)
    if (p->data > 0) count++;
by unconditionally loading count into a register at the beginning
of the loop, and unconditionally writing it back at the end, even if the loop
happened to never update count because all the data was negative.
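That is, the compiler behaves as though the loop had been written as
(tmp stands for the register):

  tmp = count;                       // unconditional load
  for (p = q; p != 0; p = p->next)
    if (p->data > 0) tmp++;
  count = tmp;                       // unconditional store, performed even
                                     //   if no p->data was positive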
This can introduce a race and lose an update to count if it is
being concurrently modified by another thread, presumably because the
programmer is counting on the knowledge that none of the data is positive.
Needless to say, such correctness arguments should be avoided.
Although it is common for compilers to generate code for this loop that is
technically incorrect (by the proposed C++0x rules), the actual failure
scenario seems fairly far-fetched.
More dangerous miscompilations, such as the example
described in
Boehm, Threads Cannot be Implemented as a Library, PLDI 2005,
are fortunately rare,
and it is unclear that there are effective
counter-measures, other than modifying compilers to comply with the
C++0x rules; significant reduction of optimization level;
using non-standard directives or compiler switches that prevent such
behavior for specified accesses, variables, or compilation units;
or manually checking the resulting assembly code.
Explicit optimization-suppression directives
in many current compilers may allow this type of code to be safely
written.
Of course, all such usage is non-standard and very likely also non-portable.
How much of this applies to Java?
In Java, high-level atomics are called "volatile". And races are strongly
discouraged, but not completely prohibited. (They cannot be prohibited,
due to the
need to support untrusted sand-boxed code.) Bit-fields do not exist.
Other than that, the current rules
are very similar to the proposed C++0x rules.