Programming with Threads: Questions Frequently Asked by C and C++ Programmers
Authors: Hans-J. Boehm, HP Labs & Paul McKenney, IBM
Disclaimer: These are only the authors' opinions.
In the process of trying to define the semantics of threads for C++0x, it
has become clear that there are widespread misunderstandings about the
use of threads in these languages. This is an attempt to correct
some of these.
We cheat in several ways:
- To start with, we pretend that something similar to our
current proposals were part of the current language standards, and you
could use them with your current compiler. They're not. You can't.
They're not even part of the working paper, yet. (For the most part,
the basic ideas are uncontroversial, so the odds are good they will be.
And nobody currently expects the C++0x standard to be voted out without thread
support. But details will change.) We do try to briefly address the
additional issues arising with current compilers near
the end of this document.
- Not all of these are really frequently asked questions. We made up some
that we wish people would ask.
Is this known to reflect the position of any major organization or
corporation?
No. At least not currently.
Why are you (mostly) assuming the existence of a standard that is at
least two years out?
Because we're not sufficiently satisfied with the current answers. See below.
What basic rules should I follow when writing multi-threaded C or C++
code?
The basic rule you need to follow when writing multi-threaded code is:
Do not allow data races.
This of course is not the only thing you need to know to write correct parallel
code. You still need to make sure that actions that should logically
be indivisible actually are, that you avoid deadlocks, that your algorithm
is sufficiently parallel for the performance you need, etc. But we do
think it addresses much of the confusion that has surrounded this area.
In the past, it was much easier to justify code with "benign" data races,
and such code is quite common, particularly in older code and in code
that could not tolerate the performance overhead associated with locking.
Atomic objects now provide a clearer and
safer way to write such code.
So what's a "data race"?
A data race occurs if two accesses to the same
memory location by different
threads are not ordered,
at least one of them stores to the memory location,
and at least one of them is not a synchronization action.
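For example, the following sketch (the names are ours) contains a data race
if writer and reader can run concurrently in different threads:

  #include <stdio.h>

  int x;                                    /* ordinary, non-atomic global */

  void writer(void) { x = 42; }             /* plain store */
  void reader(void) { printf("%d\n", x); }  /* plain load */

  /* The two accesses are not ordered, one of them is a store, and
     neither is a synchronization action: a data race. */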
So what's a "synchronization action"?
An operation on a specially declared atomic<T>
object (or C equivalent),
or an operation on a lock, condition variable, or the like.
So what's an atomic object?
An object that may safely be concurrently manipulated by more than one thread,
even if one of them is updating the object. We currently expect the type of
such an object to be spelled atomic<T>, though there may be
other ways to spell it by the time the standard is finished. Updates to
such objects appear indivisible; a thread concurrently reading the
object will see the old or the new value, never something in between.
In order to ensure this, an atomic<T> variable, unlike
a volatile T variable, may have stricter alignment constraints
than a plain type T variable.
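For example, assuming an atomic<T> template along the lines of the current
proposal (the header name and the member names load and fetch_add follow the
proposal, but may change):

  #include <atomic>          // assumed C++0x header providing atomic<T>
  using namespace std;       // namespace placement is not yet settled

  atomic<int> hits;          // may be accessed concurrently without a data race

  void record() {
    hits.fetch_add(1);       // indivisible read-modify-write
  }

  int read_hits() {
    return hits.load();      // sees an old or a new value, never a mixture
  }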
When are two memory operations "ordered"?
If you are not being devious, two memory operations are ordered if they
cannot occur simultaneously. If you are being devious, see the
"happens before" definition in the
standards proposal. To our knowledge, the two diverge only in the
case of blatant abuse of a trylock (nonblocking lock acquisition) primitive.
See Boehm,
Reordering Constraints for Pthread-Style Locks,
PPoPP 2007 for details.
So what's a "memory location"?
In the proposed standard, a "memory location" is any scalar object,
or contiguous sequence of bit-fields. (Such a sequence of bit-fields
may be broken into separate memory locations by a zero-length bit-field.)
Thus an update of one data member of a struct or class cannot interfere
with an operation on another, unless they are both part of the
same sequence of adjacently declared bit-fields. Concurrent updates
to members of the same sequence of bit-fields are not allowed.
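For example, under the proposed rules the following declaration (ours)
contains three distinct memory locations:

  struct S {
    char a;      // its own memory location
    int  b:5;    // b and c together form one memory location,
    int  c:9;    //   since they are a contiguous sequence of bit-fields
    int  :0;     // zero-length bit-field: ends the sequence
    int  d:3;    // therefore a separate memory location
  };

Concurrent updates to a and d by different threads are allowed;
concurrent updates to b and c are not.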
Why the weird exception for bit-fields?
Most processors cannot efficiently update individual bits without reading
and writing back adjacent bits. That might cause an update to one of
the adjacent bits to be lost.
But what about benign data races?
No such beast. At least not in C++0x programs.
(This notion does often exist at the machine code level. Hence
optimizers might introduce them. But you should never write them
in source code.)
Of course, there is no shortage of applications and operating-system kernels
written in C or C++ that do contain code with data races.
For more on this subject, see the discussion of current compilers
near the end of this document.
Does that mean that if I write a program with a data race, the machine
is allowed to catch on fire?
Yes. As far as the C++ standard is concerned. (This is probably inconsistent
with local fire regulations, however.)
But what really happens?
It would be desirable for the machine to stop your program
and report stack traces identifying the race. Unfortunately, in real
life, other outcomes, e.g. a completely unexpected segmentation fault,
are probably also possible. In fact, the difficulty of avoiding
such outcomes in heavily optimized code was part of the
motivation for giving undefined semantics to data races.
This seems rather draconian. After all these years,
why would you outlaw data races like this?
This is actually nothing new. The current pthread standard
(Single Unix Specification version 3, Base Definitions, 4.10) states:
"Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it." This is not a recent
addition.
Nor was this original with pthreads. The Ada 83 and later standards contain
a similar restriction. What is new is that C++0x is proposing to remove
the reasons to violate this rule.
There are several technical reasons, beyond continuity with
existing standards, not only to continue with this rule, but to
insist more aggressively on adherence to it; several of these
are discussed in the answers below.
I've heard that when writing multi-threaded code I have to worry about
the order in which memory operations become visible to other threads.
How can I deal with that?
If you follow the above rule (no data races!), and you avoid "low level"
atomic operations, you don't. Threads will appear to execute as though
the steps from each thread were simply interleaved, i.e.
the way you probably expect.
In reality, this illusion is maintained mostly because it would require
a data race for you to be able to tell that the code is not being executed
like that. (And trust us, it isn't.) But that again means that if
you introduce a data race, all bets are off. Unfortunately, it even
means that if you accidentally introduce a data race, you may encounter
strange results while debugging. (Certainly vendors are encouraged
to minimize that effect, particularly at low optimization levels. But
the (proposed) standard makes no claims about this.)
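For example, in the following race-free sketch (the names are ours; ready is
an atomic object as described earlier), simple interleaving reasoning is all
you need:

  #include <atomic>
  #include <cassert>

  int data;                      // ordinary variable
  std::atomic<bool> ready;       // synchronization: accesses to it never race

  void producer() {
    data = 42;                   // ordinary store
    ready = true;                // atomic store
  }

  void consumer() {
    if (ready)                   // atomic load
      assert(data == 42);        // holds in every interleaving of the steps
  }

There is no interleaving of these steps in which consumer sees ready as true
but data as unset, so the assertion cannot fail; and since the program is
race-free, the implementation must preserve that illusion.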
How do I recognize a "low level" atomic operation when I see one?
These are operations that have explicit memory ordering constraints
like "acquire", "release", or "relaxed" in their name. The standards
committee is still discussing naming, and how to make them easily identifiable.
When should I use low level atomic operations?
You shouldn't unless both:
- The alternatives lead to inadequate performance, and you've determined that
high level atomics are the problem, and
- You sufficiently understand the relevant memory ordering issues, something
that is generally well beyond this FAQ.
There is one common case in which it is relatively
safe to use store_release
and load_acquire instead of the higher level assignment
operators. This is the case in which a variable is used to signal readiness
(e.g. by storing a flag or a pointer to newly ready data) to another thread,
which simply tests it, or waits for it to change. This does not apply if
the variable is being used as part of a more complicated protocol that involves
another atomic variable.
In this case, you may see a significant performance benefit, but only on
a few architectures, and the resulting code will be harder to reason
about. Generally, low level atomic operations are intended for
implementors of a few performance-critical libraries.
We expect that in some cases coding standards will prohibit the use of
low level atomic operations.
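In the readiness-flag case described above, a sketch might look as follows
(we use the store/load spelling with explicit ordering arguments; the final
names may differ, and compute and consume stand for application code):

  #include <atomic>

  int compute();                  // application code (hypothetical)
  void consume(int);              // application code (hypothetical)

  int payload;                    // ordinary data produced by one thread
  std::atomic<bool> published;    // the signaling variable

  void make_ready() {
    payload = compute();          // ordinary stores to the data
    published.store(true, std::memory_order_release);
  }                               // release: prior writes become visible to...

  void use_when_ready() {
    while (!published.load(std::memory_order_acquire))
      ;                           // ...any thread whose acquire load sees true
    consume(payload);             // guaranteed to see the completed payload
  }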
Should shared variables be declared as volatile even if
they are protected by locks?
No. First, declaring lock-protected shared variables as volatile
is unnecessary and needlessly disables compiler optimizations.
Second, it is likely that some library vendors will be slow to
introduce multithreading. Developers of multi-threaded programs
using such libraries will therefore need to introduce locking to
protect shared variables that are used within library functions,
even when those developers do not have access to the library
source code. Given the proposed standard, in many cases,
the developer can simply hold a lock across each call to the
library function. In contrast, if the standard required all
shared variables to be declared volatile, single-threaded libraries
could not be safely linked to multithreaded programs.
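For example, a developer might wrap every call into a non-thread-safe library
with one program-wide lock (a pthreads sketch; library_function is
hypothetical):

  #include <pthread.h>

  int library_function(int);          /* not itself thread-safe */

  pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

  int call_library(int arg) {
    int result;
    pthread_mutex_lock(&lib_lock);    /* every thread uses this lock */
    result = library_function(arg);   /* library-internal shared state
                                         is now protected */
    pthread_mutex_unlock(&lib_lock);
    return result;
  }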
But what about existing applications that do contain data races?
Such applications make use of non-standard extensions and/or rely
on properties of specific compiler implementations, properties that
are explicitly not guaranteed by, for example, the 1995 Posix thread
standard. Such code is at best safe with a well-developed set of
conventions tuned to the particular compiler and possibly hardware
platform. Although statistics are hard to come by, we know of a number of
occasions on which this has led to subtle bugs, in spite of
standard-conforming, and arguably quite reasonable, compiler behavior.
Perhaps most importantly, such coding practices cannot guarantee continued
correctness and portability of the application as the compiler evolves
and more aggressive optimizations are added, or as the application is
moved to a different compiler. The point of the standard is to clearly
define the rules by which both the C++ programmer and the C++ compiler-writer
must play in order to ensure continuing application correctness.
But why can't we standardize something that better accommodates
existing programs with races?
The current standard proposal tries to precisely define rules that
are reasonably simple, and hence understandable and teachable, and that
preserve performance of the large majority of code that avoids
races using locks and similar synchronization mechanisms. Providing
semantics for races is complicated, and slows down lock-based
race-free code. Providing suitably weak semantics for data races,
as was (and had to be) done for Java, is unlikely to legitimize
many existing C or C++ programs with races, since these programs
usually either:
- Also make assumptions about memory visibility that would be
expensive to enforce universally, or
- Rely on other nonstandard constructs (e.g. some kind of explicit
memory fences) to guarantee the necessary memory visibility ordering.
The advantage of adopting the multi-threading constructs
that we expect to be available in the C++0x standard is that you
get standard-mandated multi-threaded safety and full access to
advanced optimization techniques.
It is likely that compilers that have seen extensive multi-threaded
use will provide facilities to ease transition from traditional
assumptions to the upcoming standard, but of course the standard
has no way to mandate such facilities.
If you care about such facilities, we therefore advise you to discuss
this with your compiler vendor/provider.
Can you give a specific example of optimizations that would be
prohibited if some races were allowed?
That depends on exactly what is allowed.
If we tried to guarantee that
programs with data races behaved sequentially consistently, i.e. as
though thread steps were simply interleaved, most code scheduling
and loop optimizations would break, at least in the absence of whole
program analysis. Thus in reality we would have to provide very weak
ordering guarantees for races, which would not do much to preserve
correctness of existing code.
If pointer assignment were
guaranteed to be atomic, the language would become hard to implement
on some embedded processors that don't support atomic pointer-sized
stores. Even on larger machines, it is unclear what types should
behave atomically and be allowed in races.
It would also be hard to specify, for example, what happens when
an object with virtual functions is communicated via a simple pointer
assignment to another thread. In many cases, some sort of fence would
be needed to ensure visibility of the vtable pointer to the receiving
thread. But the vtable pointer is an implementation detail that shouldn't
be programmer visible. And it would be expensive to implicitly generate
these fences.
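A sketch of the problematic pattern (the names are ours):

  struct Base {
    virtual void f();
  };

  Base *shared_p;                // plain pointer, not atomic

  void publisher() {
    shared_p = new Base();       // initializes the object, including its
  }                              //   hidden vtable pointer

  void consumer() {
    if (shared_p)                // racy read of the pointer
      shared_p->f();             // may see the object before its vtable
  }                              //   pointer is visible: a wild call

Making this work would require implicitly generated fences in one or both
threads, which is exactly the cost the proposed standard avoids imposing on
race-free code.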
Common compiler optimization assumptions
would be invalidated. The resulting failures seem to be infrequent
in practice, but hard to characterize.
Given that I don't have access to a C++0x implementation, how much
of this applies?
There are three areas in which current implementations generally
do not behave as we have described:
- There are no atomic data types or operations of the kind we have described here.
- The notion of a "memory location" used in the definition of a data
race is intentionally undefined. That means that updating a scalar
object may interfere with an update of a nearby one.
- There is no clear prohibition against "spurious" compiler-introduced
stores of the old value back into a global variable x, when such an
update is not called for by the source. This may cause a concurrent
update to x to be lost, somewhat unpredictably. Many compilers
can introduce such updates, though they rarely do so in cases in which
it turns out to matter. (See
Boehm, Threads Cannot be Implemented as a Library, PLDI 2005 for details.)
How do I deal with the absence of atomic operations before
C++0x?
The official Posix answer is to use locks instead. If it is
remotely practical, that is also the correct answer. The rest
of this applies to the remaining cases.
There are many alternatives, all imperfect. There are several
packages that provide a generic interface to atomic operations with
platform-specific implementations.
atomic_ops
is one of these. Gcc provides __sync_ operations. Microsoft
provides Interlocked operations. They are
generally still evolving as we understand the issues better.
Many of these, e.g. the gcc __sync_
operations, do not provide a special atomic load, and hence have no
way to inform the compiler, other than with the volatile
qualifier, that a value may change asynchronously. We recommend that
whichever you use, you use a macro to identify such loads,
to make it easier to later convert to the more correct C++0x approach.
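For example (a sketch; LOAD_SHARED is our name, and the volatile cast is
discussed further below):

  /* Force a real load of x, and tell the compiler its value may have
     changed asynchronously. */
  #define LOAD_SHARED(type, x) (*(volatile type *)&(x))

  /* Usage: int v = LOAD_SHARED(int, flag); */

Every use of the macro then marks a load that should eventually become an
atomic load under the C++0x approach.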
So should I use volatile to identify variables modified by
another thread?
For C++0x, no. Currently, the official answer for pthreads is also no.
Data races on volatiles are disallowed along with other races.
However, it
appears to be the only currently available standard mechanism for notifying the
compiler that an asynchronous change is possible, and thus that some
optimizations, such as the register-reuse reload discussed below, are unsafe.
Thus in cases in which a race cannot be avoided due to the absence of
suitable atomic operations, and locks are too slow, it seems by far the
safest option.
If you do use volatile remember that its detailed semantics vary
dramatically across platforms. On some platforms, two consecutive
volatile stores may become visible out of order to two consecutive
volatile loads in another thread. On other platforms, that is
explicitly prevented. Thus platform-dependent mechanisms for memory
ordering are usually also needed. (The atomic_ops package uses
volatile in platform-dependent ways internally, but adds fences
to enforce requested memory ordering.)
It is also important to remember that volatile updates
are not necessarily atomic. They may appear to be carried out piecemeal.
Whether or not they actually are atomic depends on the platform and
alignment.
How about using explicit platform-dependent memory fences with ordinary
memory operations, as the Linux kernel does?
We recommend against this for other applications, since doing so requires
non-standard implementation-specific mechanisms.
These mechanisms prevent the compiler
from carrying out optimizations that would be unsafe given
that asynchronous changes to a variable are possible.
If such mechanisms are not correctly used, problems
may arise, particularly if a load involved in a race
is not followed immediately by a fence. Suppose we write:
  {
    int local = shared;
    if (local) perform some action;
    A: <code that affects neither shared nor local>
    if (local) perform some compensating action;
  }
If the compiler runs out of registers during the block A and decides to
reuse the register holding local, it may legitimately decide to
reload shared in order to recompute local, since it has
no idea that shared may change asynchronously. Thus the compensating
action may be performed without the original action or vice-versa.
This is essentially the same problem as with the optimizations discussed
earlier: transformations that are perfectly safe for single-threaded code
become unsafe in the presence of races.
Given potential pitfalls like the one illustrated in the above
example, it should be clear that obtaining
correct results from current compilers given code containing data
races requires deep expertise in the target CPU architecture(s),
the specific compiler(s) to be used, and in the non-standard extensions that
are used to prevent the compiler(s) from carrying out optimizations
that, while perfectly safe in single-threaded code, lead to incorrect
results when applied to multi-threaded code with data races.
But can't I simply use a volatile cast to prevent this problem?
Maybe, and maybe not, depending on your compiler and on the flags passed
to it.
Some compilers emit expensive memory fences around volatile accesses,
and others don't.
You might or might not want such fences, depending on what you are doing.
This may seem like a sorry state of affairs,
but please keep in mind that the "volatile" keyword was introduced into
the C language long before multi-threaded code and multiprocessor
systems were commonly available.
So is it OK to have data races until we get real atomic operations?
No. At least not in portable code or without a thorough
understanding of the compiler involved. See the previous answers.
So how do I deal with the adjacent field over-write issue?
Some compilers provide a flag to improve thread-safety, and other
compilers provide attributes to force a given data item to be
aligned and padded so as to reduce the issues with data races.
If your compiler has such facilities, use them.
If you are generating code for an Alpha processor, tell your compiler
to generate code for recent processors that support byte loads and stores.
If you are generating code for an older Alpha processor,
you will need to modify your application to ensure that
bytes protected by separate locks are in separate words.
Or you will need to upgrade to a more modern CPU.
By far the most common violation of the C++0x rules on current compilers
is the treatment of bit-fields that share a word with non-bit-fields, as
in
  struct s { char a; int b:5; int c:9; char d; };
which might occupy a single 32-bit word. Avoid such structures if,
for example, a and b might be concurrently updated.
This can be done by separating
bit-fields from other members with a sufficiently large padding field
(usually an int suffices), as sketched below.
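For example (a sketch; the layout assumes typical alignment rules and an
int-or-smaller compiler access granularity):

  struct s {
    char a;
    int  pad1;   // keeps updates of a out of the bit-fields' word
    int  b:5;
    int  c:9;
    int  pad2;   // keeps updates of d out of the bit-fields' word
    char d;
  };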
Alternatively, one can combine the bit-field variables into a single
basic type (e.g., an int), and use explicit masking and atomic operations
to access and update the individual fields, as is often done in
operating-system kernel code.
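A sketch of that alternative, using the gcc __sync builtins mentioned
earlier (the field layout within bc is our choice):

  /* b occupies bits 0-4 of bc, c occupies bits 5-13. */
  struct s { char a; unsigned bc; char d; };

  #define B_MASK  0x1fu
  #define C_SHIFT 5
  #define C_MASK  (0x1ffu << C_SHIFT)

  void set_b(struct s *p, unsigned v) {
    unsigned oldv, newv;
    do {                /* retry until no concurrent update intervenes */
      oldv = p->bc;
      newv = (oldv & ~B_MASK) | (v & B_MASK);
    } while (!__sync_bool_compare_and_swap(&p->bc, oldv, newv));
  }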
To be even more conservative, one could separate other
potentially concurrently modified fields in a similar manner.
We know of no compilers under which an update of one object can overwrite
an adjacent object unless both are members of the same struct or class,
although this does appear to be allowed by current standards.
So how do I deal with potential spurious stores in current compilers?
The good news appears to be that these are introduced primarily in
cases in which they do not introduce a bug. If count
is a potentially shared variable, many compilers would
compile the loop
  for (p = q; p != 0; p = p->next)
    if (p->data > 0) count++;
by unconditionally loading count into a register at the beginning
of the loop, and unconditionally writing it back at the end, even if the loop
happened to never update count because all the data was negative.
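That is, the compiler behaves as though the loop had been written as
(tmp stands for the register):

  tmp = count;                       // unconditional load
  for (p = q; p != 0; p = p->next)
    if (p->data > 0) tmp++;
  count = tmp;                       // unconditional store, performed even
                                     //   if no p->data was positive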
This can introduce a race and lose an update to count if it is
being concurrently modified by another thread, presumably because the
programmer is counting on the knowledge that none of the data is positive.
Needless to say, such correctness arguments should be avoided.
Although it is common for compilers to generate code for this loop that is
technically incorrect (by the proposed C++0x rules), the actual failure
scenario seems fairly far-fetched.
More dangerous miscompilations, such as the example
described in
Boehm, Threads Cannot be Implemented as a Library, PLDI 2005,
are fortunately rare,
and it is unclear that there are effective
counter-measures, other than modifying compilers to comply with the
C++0x rules; significant reduction of optimization level;
using non-standard directives or compiler switches that prevent such
behavior for specified accesses, variables, or compilation units;
or manually checking the resulting assembly code.
Explicit optimization-suppression directives
in many current compilers may allow this type of code to be safely
written.
Of course, all such usage is non-standard and very likely also non-portable.
How much of this applies to Java?
In Java, high-level atomics are called "volatile". And races are strongly
discouraged, but not completely prohibited. (They cannot be prohibited,
due to the
need to support untrusted sand-boxed code.) Bit-fields do not exist.
Other than that, the current rules
are very similar to the proposed C++0x rules.