ISO/IEC JTC1 SC22 WG21 N2745 = 08-0255 - 2008-08-22 (REVISED)
Paul E. McKenney, paulmck@linux.vnet.ibm.com
Raul Silvera, rauls@ca.ibm.com
This document presents a PowerPC implementation of the proposed C/C++ memory-order model, including the bidirectional fences proposed in N2731, and analyzes some representative non-SC (sequentially consistent) code sequences. Sequences involving both acquire and dependency-ordering operations are analyzed. This document is based on N2745. The scope of this document is the Power and PowerPC systems, excluding the Book E variant of PowerPC.
The authors owe a debt of gratitude to Derek Williams, Cathy May, and Brad Frey for their careful and patient explanations of the PowerPC memory model. However, all mistakes within are the sole property of the authors. We are also thankful to Peter Dimov, Alexander Terekhov, and Herb Sutter for many helpful discussions.
There are any number of possible mappings from the C/C++ memory-ordering primitives onto the PowerPC instruction set. This document assumes the following mapping, originally derived by Raul:
Operation | PowerPC Implementation |
---|---|
Load Relaxed | ld |
Load Consume | ld |
Load Acquire | ld; cmp; bc; isync |
Load Seq Cst | hwsync; ld; cmp; bc; isync |
Store Relaxed | st |
Store Release | lwsync; st |
Store Seq Cst | hwsync; st |
Cmpxchg Relaxed,Relaxed (32 bit) | _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit: |
Cmpxchg Acquire,Relaxed (32 bit) | _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit: |
Cmpxchg Release,Relaxed (32 bit) | lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit: |
Cmpxchg AcqRel,Relaxed (32 bit) | lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit |
Cmpxchg SeqCst,Relaxed (32 bit) | hwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit |
Acquire Fence | lwsync |
Release Fence | lwsync |
AcqRel Fence | lwsync |
SeqCst Fence | hwsync |
Of course, compilers targeted to specific members of the PowerPC family
may be able to make use of lighter-weight code sequences.
In particular, a compiler targeting a uniprocessor-only PowerPC processor
would be able to omit all of the above code, though such a compiler would
still need to adhere to instruction-ordering
constraints in a multiprogrammed environment.
In addition, alternative code sequences may be used in some cases,
for example, the twi
instruction may be used in place of
the cmp; bc
sequences above.
Applications that only manipulate normal data structures in normal
cacheable memory (as opposed to doing MMIO operations or directly
manipulating framebuffers) can replace the hwsync; st
for sequentially consistent store with lwsync; st
.
Finally, applications that need to use C/C++ sequentially
consistent atomic operations on Power 5/5+ must adhere to the rules
set out separately.
The cmpxchg
sequences above are for strong 32-bit operations.
64-bit cmpxchg
sequences would substitute ldarx
for lwarx
and stdcx.
for stwcx.
Weak cmpxchg
sequences could omit the backwards branch to
_loop
.
There is some confusion surrounding the hwsync
and
lwsync
opcodes, due to the usual historical reasons.
The equivalences are shown below:
Operation | PowerPC Opcode | Alternative Opcode |
---|---|---|
hwsync |
sync |
sync 0 |
lwsync |
lwsync |
sync 1 |
isync |
isync |
|
For convenience, the lock-acquisition sequence from Section B.2.1.1 of Book II of the PowerPC Architecture is adapted below:
The code following the label wait
would depend strongly on
the enclosing software environment.
Pure spinlocks would of course branch directly to loop
rather than to wait
.
In some cases, implementors will prefer to use an immediate-mode compare
rather than storing the lock-free pattern in a register.
Again for convenience, the lock-release sequence from Section B.2.2.1 of Book II of the PowerPC Architecture is adapted below:
If the critical section contains only normal cached memory operations,
the lwsync
barrier may be substituted for the heavier-weight
sync
instruction above.
Alternatively, if non-cached memory operations such as MMIO operations
are associated with memory barriers of their own,
as is the case in the Linux kernel,
then once again
the lwsync
barrier may be substituted for the heavier-weight
sync
instruction shown above.
Finally, as with the lock-acquisition sequence, in some cases,
implementors will prefer to use an immediate-mode store
rather than storing the lock-free pattern in a register.
The C/C++ definition of “modification order” maps to the PowerPC notion of “memory coherence”. From Section 1.6.3 of PowerPC Book 2 describes memory coherence as follows:
An access to a Memory Coherence Required storage location is performed coherently, as follows.
Memory coherence refers to the ordering of stores to a single location. Atomic stores to a given location are coherent if they are serialized in some order, and no processor or mechanism is able to observe any subset of those stores as occurring in a conflicting order. This serialization order is an abstract sequence of values; the physical storage location need not assume each of the values written to it. For example, a processor may update a location several times before the value is written to physical storage. The result of a store operation is not available to every processor or mechanism at the same instant, and it may be that a processor or mechanism observes only some of the values that are written to a location. However, when a location is accessed atomically and coherently by all processor and mechanisms, the sequence of values loaded from the location by any processor or mechanism during any interval of time forms a subsequence of the sequence of values that the location logically held during that interval. That is, a processor or mechanism can never load a “newer” value first and then, later, load an “older” value.
Memory coherence is managed in blocks called coherence blocks. Their size is implementation-dependent (see the Book IV, PowerPC Implementation Features document for the implementation), but is larger than a word and is usually the size of a cache block.
Release sequences include two cases, subsequent stores by the same thread that performed the release, and non-relaxed atomic read-modify-write operations. The subsequent-stores case is covered by the following sentence of Section 1.6.3 of PowerPC Book 2 quoted above:
That is, a processor or mechanism can never load a “newer” value first and then, later, load an “older” value.
In addition, we shall see that the operation of PowerPC memory barriers causes the subsequent-stores case to trivially follow from the case where the acquire operation loads the head of the release chain.
The non-relaxed-atomic case has not been carefully studied,
but the authors are confident that use of either the lwsync
or the hwsync
instruction will suffice to enforce this
ordering.
A particular concern is the relationship of atomic operations and
release sequences, which are explored in a later section.
The relevant wording from PowerPC Book 2 describing PowerPC's atomic read-modify-write sequences is as follows:
The store caused by a successful “stwcx.” is ordered, by a dependence on the reservation, with respect to the load caused by the “lwarx” that established the reservation, such that the two storage accesses are performed in program order with respect to any processor or mechanism.
The “synchronizes with” relation is handled by the
placement of the memory barriers in the code sequences table above.
The stores to object M follow a lwsync
or
hwsync
instruction, and the corresponding loads
precede one of a number of instruction sequences called out in
the table.
These instructions are described in Section 1.7.1 of
PowerPC Book 2, as follows:
When a processor (P1) executes a
sync
,lwsync
, oreieio
instruction a memory barrier is created, which orders applicable storage accesses pairwise, as follows. Let A be a set of storage accesses that includes all storage accesses associated with instructions preceding the barrier-creating instruction, and let B be a set of storage accesses that includes all storage accesses associated with instructions following the barrier-creating instruction. For each applicable pair ai,bj of storage accesses such that ai is in A and bj is in B, the memory barrier ensures that ai will be performed with respect to any processor or mechanism, to the extent required by the associated Memory Coherence Required attributes, before bj is performed with respect to that processor or mechanism.
The word “performed” is defined roughly as follows:
Section 1.7.1 of PowerPC Book 2 goes on to discuss cumulativity, which is somewhat similar to a causal ordering:
The ordering done by a memory barrier is said to be “cumulative” if it also orders storage accesses that are performed by processors and mechanisms other than P1, as follows.
- A includes all applicable storage accesses by any such processor or mechanism that have been performed with respect to P1 before the memory barrier is created.
- B includes all applicable storage accesses by any such processor or mechanism that are performed after a Load instruction executed by that processor or mechanism has returned the value stored by a store that is in B.
Note that B-cumulativity recurses on stores, as illustrated by the following sequence of operations:
a
.
This store will be in the memory fence's B-set.
a
that returns
the value stored by thread 0. Thread 1's load and all
of thread 1's operations that are ordered after that load
are in thread 0's memory fence's B-set.
b
that
is ordered after the load from a
(for example,
by either code or data dependency). This store is therefore
in Thread 0's memory fence's B-set.
b
that returns
the value stored by thread 1.
Thread 2's load and all of thread 1's operations that are
ordered after that load are in thread 0's memory fence's
B-set.
The recursive nature of B-cumulativity allows this sequence to be extended indefinitely.
In contrast, A-cumulativity has no such recursion. Only those operations performed with respect to the specific thread containing the memory fence (termed “memory barrier” in PowerPC documentation) will be in that memory fence's A-set. There is no A-cumulativity recursion through other threads' loads; such recursion is confined to B-cumulativity.
In both of the above passages from Section 1.7.1, the importance of the word “applicable” cannot be overstated. This word's meaning depends on the type of memory-fence instruction as follows:
eieio
: only stores are “applicable”.
hwsync
: both loads and stores are
“applicable”.
This instruction is often called sync
,
and provides full causal ordering (in fact, full
sequential consistency).
lwsync
applicability is limited to the following
cases:
lwsync
.
bc;isync
: this is a very low-overhead and
very weak form of memory fence.
A specific set of preceding loads on
which the bc
(branch conditional) instruction
depends are guaranteed to have completed
before any subsequent instruction begins execution.
However, store-buffer and cache-state effects can
nevertheless make it appear that subsequent loads
occur before the preceding loads
upon which the twi instruction depends. That
said, the PowerPC architecture does not permit
stores to be executed speculatively, so any store
following the twi;isync instruction is guaranteed to happen
after any of the loads on which
the bc
depends.
Note that the bc;isync
instruction sequence does not provide cumulativity.
This permits the following counter-intuitive sequence of
events, with all variables initially zero, and results of
loads in square brackets following the load:
x=1
r1=x
[1]
bc;isync
y=1
r2=y
[1]
bc;isync
r3=x
[0]
The problem is that the ordering between CPU 0's store and CPU 1's load is not globally visible, nor is the ordering between CPU 1's store and CPU 2's load. This sequence of events is more likely to occur on systems where CPUs 0 and 1 are closely related, for example, when CPUs 0 and 1 are hardware threads in one core and CPU 2 is a hardware thread in another core.
The effects of the isync
instruction is described in
the program note in Section 1.7.1 of PowerPC book 2:
Because an
isync
instruction prevents the execution of instructions following theisync
until instructions preceding theisync
have completed, if anisync
follows a conditional Branch instruction that depends on the value returned by a preceding Load instruction, the load on which the Branch depends is performed before any loads caused by instructions following theisync
. This applies even if the effects of the “dependency” are independent of the value loaded (e.g., the value is compared to itself and the Branch tests the EQ bit in the selected CR field), and even if the branch target is the sequentially next instruction.
This might seem at first glance to prohibit the above code sequence,
however, please keep in mind that CPU 1's isync
is not
a cumulative memory barrier, and therefore does not guarantee that
CPUs that see CPU 1's store to y
will necessarily see
CPU 0's store to x
.
Note that it is sufficient to replace CPU 1's bc;isync
with an lwsync
in order to prevent CPU 2's load from
x
from returning zero, but the proof cannot rely on
B-cumulativity.
B-cumulativity will not come into play here because prior stores
(CPU 0's store to x) and subsequent loads (CPU 2's
load from x) are not applicable to the lwsync
instruction.
Instead, we note that CPU 0's store to x
will be performed
before CPU 1's store to y
with respect to all CPUs,
due to CPU 1's lwsync
instruction.
Furthermore, CPU 2's bc;isync
instruction ensures that
CPU 2's load from y
is performed before CPU 2's load
from x
with respect to all CPUs.
Therefore, if CPU 2's load from y
returns 1
(the new value), then CPU 2's subsequent load from x
must also return 1 (the new value), given that the store to x
was performed before the store to y
.
However, if CPU 1's bc;isync
were to be replaced with a
hwsync
, the proof can rely on simple B-cumulativity,
because prior stores and subsequent loads are applicable in the
case of hwsync
.
N2556 defines carrying a dependency as follows:
An evaluation A carries a dependency to an evaluation B if
- the value of A is used as an operand of B, and:
or
- B is not an invocation of any specialization of
std::kill_dependency
, and- A is not the left operand to the comma (',') operator,
- A writes a scalar object or bit-field M, B reads the value written by A from M, and A is sequenced before B, or
- for some evaluation X, A carries a dependency to X, and X carries a dependency to B.
In cases where evaluation B is a load, we refer to Section 1.7.1 of PowerPC Book 2:
If a Load instruction depends on the value returned by a preceding Load instruction (because the value is used to compute the effective address specified by the second Load), the corresponding storage accesses are performed in program order with respect to any processor or mechanism to the extent required by the associated Memory Coherence Required attributes. This applies even if the dependency has no effect on program logic (e.g., the value returned by the first Load is ANDed with zero and then added to the effective address specified by the second Load).
Where evaluation B is a store, we refer to Section 4.2.4 of PowerPC Book 3:
Stores are not performed out-of-order (even if the Store instructions that caused them were executed out-of-order). Moreover, address translations associated with instructions preceding the corresponding Store instruction are not performed again after the store has been performed.
The synchronizes-with examples involve one thread performing an evaluation A sequenced before a release operation on some atomic object M, concurrently with another thread performing an acquire operation on this same atomic object M sequenced before an evaluation B. These examples are the four cases where A and B are all combinations of relaxed loads and stores.
This example shows how the C++ synchronization primitives can be used to order loads. Consider the following C++ code, with all variables initially zero, and with the assertion executing after all threads have completed:
r1 = x.load(memory_order_relaxed); y.store(1, memory_order_release);
r2 = y.load(memory_order_acquire); r3 = x.load(memory_order_relaxed);
x.store(1, memory_order_relaxed);
assert(r2==0||r1==0||r3==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1=x; | r2=y; | x=1; |
lwsync; | if (r2==r2) | |
y=1; | isync; | |
r3=x; |
Discussion:
x
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
Because lwsync
orders prior loads against subsequent
stores, this means that thread 0's load is performed before its
store with respect to all threads.
r1==1
,
by A-cumulativity, thread 2's store to x
is
in the lwsync
's A-set.
Because lwsync
orders stores, thread 2's store
to x
will be performed before thread 0's store
to y
with respect to all threads.
r2==1
, then we know that thread 0's
store to y
synchronizes with thread 1's
load from y
, which in turn means that
thread 1's load from y
is performed
after thread 0's store to y
with respect to thread 1.
r1==1
and
r2==1
, thread 0's store to x
is
performed before thread 1's load from y
with respect to thread 1.
isync
,
thread 1's load from y
is performed before thread 1's
load from x
with respect to all processors.
r1==1
and r2==1
, we
know that r3==1
, so that the assert is satisfied.
(Note: we need not rely thread 1's load from x
being in the lwsync
's B-set.
This is fortunate given
that prior stores and subsequent loads are not applicable for
lwsync
.)
Note that the C++ memory model does not actually require the loads to be ordered in this case, since thread 0's relaxed store is not guaranteed to be seen in any particular order by thread 0's and 1's relaxed loads.
The fact that POWER provides ordering in this case is coincidental. The following modified code guarantees ordering according to the C++ memory model:
r1 = x.load(memory_order_acquire); y.store(1, memory_order_release);
r2 = y.load(memory_order_acquire); r3 = x.load(memory_order_acquire);
x.store(1, memory_order_release);
assert(r2==0||r1==0||r3==1);
Since this code sequence adds memory barriers and does not remove any, the assert is never violated on Power.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:
r1 = x.load(memory_order_relaxed); y.store(1, memory_order_release);
r2 = y.load(memory_order_acquire); if (r2 != 0) x.store(1,memory_order_relaxed);
assert(r1==0);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 |
---|---|
r1=x; | r2=y; |
lwsync; | if (r2!=0) |
y=1; | isync; |
x=1; |
Discussion:
x
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
Because preceding loads and subsequent stores are applicable
to lwsync
, the ordering of these two operations
is guaranteed to be globally visible.
r1==1
, thread 0's load from x
has
returned the value stored by thread 1's store to x
,
and therefore thread 1's store is in the lwsync
's
A-set. This means that thread 1's store to x
is performed before thread 0's store to y
with
respect to all threads.
x
will not be
performed at all unless thread 1's load from y
returns 1.
isync
ensure
that thread 1's laod from y
is performed before
its store to x
, which in turn is performed before
thread 0's store to y
, which prevents thread 1's
load from returning 1.
r1
cannot have the value 1
.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:
x.store(1, memory_order_relaxed); y.store(1, memory_order_release);
r1 = y.load(memory_order_acquire); if (r1 == 1) r2 = x.load(memory_order_relaxed);
assert(r1==0||r2==1);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 |
---|---|
x=1; | r1=y; |
lwsync; | if (r1==1) |
y=1; | isync; |
r2=x; |
Discussion:
x
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
This in turn means that the store to x
is performed before the store to y
with respect to any given thread.
r1==1
:
y
synchronizes with thread 1's
load from y
, which in turn means that
thread 1's load
from y
is performed after thread 0's
store to y
with respect to thread 1.
isync
means
that thread 1's load from y
is performed
before its load from x
with respect to
all threads.
x
is
performed before thread 1's load from x
.
r2==1
, satisfying the assert.
r2==0
, then the assert
is directly satisfied.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing at any time:
x.store(1, memory_order_relaxed); y.store(1, memory_order_release);
r1 = y.load(memory_order_acquire); if (r1 == 1) z.store(1,memory_order_relaxed);
assert(z==0||x==1);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 |
---|---|
x=1; | r1=y; |
lwsync; | if (r1==1) |
y=1; | isync; |
z=1; |
Discussion:
x
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
This means that the store to x
is performed before the store to y
with respect to any given thread.
r1==1
:
y
synchronizes with thread 1's
load from y
, which means that thread 1's load
from y
is performed after thread 0's store to
y
with respect to thread 1.
Therefore, thread 0's store to x
is performed
before thread 1's load from y
with respect to
thread 1.
y
is performed before its store to z
with respect
to any processor.
x
is performed
before thread 1's store to z
with respect to
any processor.
x==1
, satisfying the assert.
r1==0
, then z
is never assigned to, so its value remains zero, which
also satisfies the assert.
The dependency-ordered-before examples involve one thread performing an evaluation A sequenced before a release operation on some atomic object M, concurrently with another thread performing an consume operation on this same atomic object M sequenced before an evaluation B that depends on the consume operation. These examples are the four cases where A and B are all combinations of relaxed loads and stores. Some of the examples will require a third thread, for example, the load/load case cannot detect ordering without an independent store.
Like Example 1, this example orders a pair of loads, but using dependency ordering rather than load-acquire. Consider the following C++ code, with all variables initially zero, and with the assertion executing after all threads have completed:
r1 = p.a.load(memory_order_acquire); y.store(&p, memory_order_release);
r2 = y.load(memory_order_consume); r3 = r2->a.load(memory_order_acquire);
p.a.store(1, memory_order_release);
assert(r2==&p||r1==0||r3==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1=p.a; | r2=y; | p.a=1; |
lwsync; | if (r2==&p) | lwsync; |
y=&p; | r3=r2->a; |
Note that the ordering instructions corresponding to thread 0's
load-acquire from p.a
are folded into the following
lwsync
instruction.
Note also that the ordering instructions corresponding to thread 1's
second load have no effect, and hence have been omitted.
Discussion:
p.a
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
Because preceding loads and subsequent stores are applicable
to lwsync
, the ordering of these two operations
is guaranteed to be globally visible.
r1==1
, then we know that, by cumulativity,
thread 2's store to p.a
is also in the
lwsync
's A-set.
r2==&p
, then we know that thread 0's
store to y
synchronizes with thread 1's
load from y
, which in turn means that
thread 1's load from y
is performed after
thread 0's store to y
with respect to thread 1.
Therefore, thread 0's load from r2->a
is performed
before thread 2's store to p.a
with respect to
thread 1.
y
is performed before thread 1's
load from r2->a
with respect to all processors.
r1==1
and r2==1
, we know
that r3==1
so that the assert must always be satisfied.
(Note: we need not rely thread 1's load from r2->a
being in the lwsync
's B-set, which is good given
that prior stores and susequent loads are not applicable for
lwsync
.)
The C++ code for this example is as follows, with all variables initially zero:
r1 = p.a.load(memory_order_relaxed); y.store(&p, memory_order_release);
r2 = y.load(memory_order_consume); if (r2 == &p) r2->a.store(1,memory_order_relaxed);
assert(r1==0);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 |
---|---|
r1=p.a; | r2=y; |
lwsync; | if (r2==&p) |
y=&p; | r2->a=1; |
Discussion:
p.a
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
This in turn means that thread 0's load is performed before its store
with respect to any given processor.
r2==&p
, then we know that thread 0's
store to y
synchronizes with thread 1's
load from y
, which in turn means that thread 1's load
from y
is performed after thread 0's store to
y
with respect to thread 1.
p.a
is performed
before thread 1's store to r2->a
with respect to
thread 1.
Therefore, in this case, we have r1==0
.
r2!=&p
, then thread 1's
store to p.a
will not be executed,
so that r1==0
.
The C++ code for this example is as follows, with all variables initially zero:
p.a.store(1, memory_order_relaxed); y.store(&p, memory_order_release);
r1 = y.load(memory_order_consume); if (r1 == &p) r2 = r1->a.load(memory_order_relaxed);
assert(r1==&p||r2==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 |
---|---|
p.a=1; | r1=y; |
lwsync; | if (r1==&p) |
y=&p; | r2=r1->a; |
Discussion:
p.a
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
This in turn means that the store to p.a
is performed before the store to y
with respect to any given thread.
r1==&p
:
y
synchronizes with thread 1's
load from y
, which in turn means
that thread 1's load from y
is performed
after thread 0's store to y
with respect
to thread 1.
y
will be performed before its load from r1->a
with respect to all threads.
p.a
is performed
before thread 1's load from r1->a
with respect to
thread 1.
r2==1
,
satisfying the assert.
r2==0
, then the assert
is directly satisfied.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing at any time:
p.a.store(1, memory_order_relaxed); y.store(&p, memory_order_release);
r1 = y.load(memory_order_consume); if (r1 == &p) r1->b.store(1,memory_order_relaxed);
assert(p.b==0||p.a==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 |
---|---|
p.a=1; | r1=y; |
lwsync; | if (r1==&p) |
y=&p; | r1->b=1; |
Discussion:
p.a
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
This in turn means that the store to p.a
is performed before the store to y
with respect to any given thread.
r1==&p
:
y
synchronizes with thread 1's
load from y
, which in turn means that
thread 1's load
from y
is performed after thread 0's store
to y
with respect to thread 1.
y
is performed before
its store to r1->b
with respect to each
thread.
p.a
is
performed before thread 1's store to r1->b
with respect to each thread.
p.a==1
,
satisfying the assert.
r2==0
, then thread 2's
assignment to r1->b
is never executed, so that
p.b
is zero, again satisfying the assert.
The C++ memory model also provides for a “release sequences”, which comprise either (1) subsequent stores to the variable that was the subject of the release operation by the same thread that performed the release operation or (2) non-relaxed atomic read-modify-write operations on the variable that was the subject of the release operation by any thread. We consider these two release-sequence components separately in the following sections.
Subsequent same-thread stores are trivially accommodated. Simply add an additional atomic store (but possibly relaxed) operation after the release operation, and modify checks on the corresponding acquire operation to test for this subsequent store.
The same reasoning that worked for the original analyses applies straightforwardly to the updated examples.
This section reprises the examples in the “Synchronizes-With” section, but introducing an atomic read-modify-write operation.
This example shows how the C++ synchronization primitives can be used to order loads. Consider the following C++ code, with all variables initially zero, and with the assertion executing after all threads have completed:
r1 = x.load(memory_order_acquire); y.store(2, memory_order_release);
r2 = y.load(memory_order_acquire); r3 = x.load(memory_order_acquire);
x.store(1, memory_order_release);
y.fetch_add(1, memory_order_acq_rel);
assert(r2<=1||r1==0||r3==1);
The corresponding PowerPC memory-ordering instructions are as shown in the following table:
Thread 0 | Thread 1 | Thread 2 | Thread 3 |
---|---|---|---|
r1=x; | r2=y; | x=1; | ldarx r4,y |
lwsync; | if (r2==r2) | r5=r4+1 | |
y=2; | isync; | stdcx. y,r5 | |
r3=x; |
Discussion:
x
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
These two operations are therefore ordered with respect to
each thread.
r1==1
(as required to violate the assertion),
then by A-cumulativity thread 2's store to x
is also
in the lwsync
's A-set.
r4==1
, then thread 3's stdcx.
is also in the lwsync
's B-set.
Therefore, thread 2's store to x
is ordered before
thread 3's stdcx.
to y
with respect
to each thread.
r2==1
case was handled in example 1, and is
not repeated here.
r2==2
, thread 1's load from y
was performed after thread 3's stdcx.
with respect
to thread 1.
r1==1
and
r2==2
, thread 0's store to x
is
performed before thread 1's load from y
with respect to all processors.
isync
,
thread 1's load from y
is performed before thread 1's
load from x
with respect to all processors.
r1==1
and r2==2
, we
know that r3==1
, so that the assert is satisfied.
(Note: we need not rely thread 1's load from x
being in the lwsync
's B-set.
This is fortunate given
that prior stores and subsequent loads are not applicable for
lwsync
.)
Note that this line of reasoning does not depend on any memory barriers
in the atomic read-modify-write operation, which means that this code
sequence would maintain the synchronizes-with relationship even when
using memory_order_relaxed
.
However, the standard does not require this behavior, so portable
code must use a non-relaxed memory-order specifier in this case.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:
r1 = x.load(memory_order_relaxed); y.store(2, memory_order_release);
r2 = y.load(memory_order_acquire); if (r2 >= 2) x.store(1,memory_order_relaxed);
y.fetch_add(1, memory_order_acq_rel);
assert(r1==0);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
r1=x; | r2=y; | ldarx r3,y |
lwsync; | if (r2>=2) | r4=r3+1 |
y=2; | isync; | stdcx. y,r4 |
x=1; |
Discussion:
x
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
This in turn means that the load is performed before the store
with respect to any given thread.
r2==1
was handled in example 2, and is
not repeated here.
r2==2
, we know that thread 2's ldarx
read the value stored by thread 0, and thread 2's stdcx.
is therefore in the lwsync
's B-set.
In addition, because thread 1's load from y
read the value
stored by thread 2's stdcx.
to y
,
applicable operations ordered after that load are therefore also in
the lwsync
's B-set.
isync
ensure
that the load into r2
is performed before the
store to x
with respect all other threads, so
that the store to x
is in the lwsync
's
B-set.
Thread 0's load from x
is thus always performed
before thread 1's store to x
with respect to each
thread.
r1
will always be zero, as thread 1's
store to x
is never performed unless thread 0's
load to r1
has already been performed.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:
x.store(1, memory_order_relaxed); y.store(1, memory_order_release);
r1 = y.load(memory_order_acquire); if (r1 >= 2) r2 = x.load(memory_order_relaxed);
y.fetch_add(1, memory_order_acq_rel);
assert(r1<=1||r2==1);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
x=1; | r1=y; | ldarx r3,y |
lwsync; | if (r1>=2) | r4=r3+1 |
y=1; | isync; | stdcx. y,r4 |
r2=x; |
Discussion:
x
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
This in turn means that the store to x
is performed before the store to y
with respect to any given thread.
r1==1
, we have the case covered in example 3,
which will not be further discussed here.
r1>=2
, we know that thread 2's atomic increment
intervened, so that thread 2's stdcx.
is in
lwsync
's B-set, so that this stdcx.
is performed after thread 0's store to x
with
respect to all threads.
y
returned the value
stored by thread 2's stdcx.
, we know that thread 2's
store to y
is performed before thread 2's load from
y
with respect to thread 1.
isync
ensure that
thread 1's load to r1
is performed before
its load to r2
with respect to each thread.
r1
is equal to two,
r2
is
guaranteed to return the value stored to x
by thread 0,
namely, the value 1.
The assertion is therefore always satisfied.
The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing at any time:
x.store(1, memory_order_relaxed); y.store(1, memory_order_release);
r1 = y.load(memory_order_acquire); if (r1 == 1) z.store(1,memory_order_relaxed);
y.fetch_add(1, memory_order_acq_rel);
assert(z==0||x==1);
The corresponding PowerPC memory-ordering instructions are as shown
in the following table, combining the conditional branch from the
acquire operation with the if
statement:
Thread 0 | Thread 1 | Thread 2 |
---|---|---|
x=1; | r1=y; | ldarx r2,y |
lwsync; | if (r1>=2) | r3=r2+1 |
y=1; | isync; | stdcx. y,r2 |
z=1; |
Discussion:
x
is in the lwsync
's
A-set and thread 0's store to y
is in the
lwsync
's B-set.
This means that the store to x
is performed before the store to y
with respect to any given thread.
r1==1
, we have the case dealt with in example 4,
which will not be discussed further.
r1==2
, we know that thread 2's atomic operation
intervened, so that thread 2's
stdcx.
is in the lwsync
's B-set.
Therefore, thread 0's store to x
is performed before
thread 2's stdcx.
to y
.
y
returned the
value stored by thread 2's stdcx.
, we know that
all of thread 1's subsequent applicable operations must see the values
stored by operations in the lwsync
's A-set.
isync
guarantee
that the load into r1
is performed before
its store to z
with respect to all threads.
z
is performed,
the value of x
must already be one, satisfying
the assert.
The key point in examples 9-12 was the recursive nature of B-cumulativity. This applies straightforwardly to the dependency ordering examples as well, so that Power allows a chain of atomic operations to form a release sequence, regardless of the memory_order argument.
PowerPC support for acquire and release bidirectional fences can be demonstrated by taking each of the examples above, and making the following transformations:
memory_order_release
fence, and note that the same sequence of
instructions is generated, thereby guaranteeing the
same ordering.
memory_order_acquire
fence, and note that this replaces the
conditional-branch-isync
sequence
with an lwsync
.
Because lwsync
is uniformly stronger,
this transformation guarantees the same ordering.
The PowerPC code sequences shown at the beginning of this document suffice to implement the synchronizes-with and dependency-carrying portions of the proposed C++ memory model. These code sequences are able to take advantage of the lightweight memory barriers provided by the PowerPC architecture.