Example POWER Implementation for C/C++ Memory Model

ISO/IEC JTC1 SC22 WG21 N2745 = 08-0255 - 2008-08-22 (REVISED)

Paul E. McKenney, paulmck@linux.vnet.ibm.com
Raul Silvera, rauls@ca.ibm.com

Introduction

This document presents a PowerPC implementation of the proposed C/C++ memory-order model, including the bidirectional fences proposed in N2731, and analyzes some representative non-SC (sequentially consistent) code sequences. Sequences involving both acquire and dependency-ordering operations are analyzed. This document is based on N2745.

The authors owe a debt of gratitude to Derek Williams, Cathy May, and Brad Frey for their careful and patient explanations of the PowerPC memory model. However, all mistakes within are the sole property of the authors. We are also thankful to Peter Dimov, Alexander Terekhov, and Herb Sutter for many helpful discussions.

PowerPC Code Sequences

There are any number of possible mappings from the C/C++ memory-ordering primitives onto the PowerPC instruction set. This document assumes the following mapping, originally derived by Raul:

Operation	PowerPC Implementation
Load Relaxed	`ld`
Load Consume	`ld`
Load Acquire	`ld; cmp; bc; isync`
Load Seq Cst	`hwsync; ld; cmp; bc; isync`
Store Relaxed	`st`
Store Release	`lwsync; st`
Store Seq Cst	`hwsync; st`
Cmpxchg Relaxed,Relaxed (32 bit)	`_loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:`
Cmpxchg Acquire,Relaxed (32 bit)	`_loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit:`
Cmpxchg Release,Relaxed (32 bit)	`lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:`
Cmpxchg AcqRel,Relaxed (32 bit)	`lwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit`
Cmpxchg SeqCst,Relaxed (32 bit)	`hwsync; _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; isync; _exit`
Acquire Fence	`lwsync`
Release Fence	`lwsync`
AcqRel Fence	`lwsync`
SeqCst Fence	`hwsync`

Of course, compilers targeted to specific members of the PowerPC family may be able to make use of lighter-weight code sequences. In particular, a compiler targeting a uniprocessor-only PowerPC processor would be able to omit all of the above code, though such a compiler would still need to adhere to instruction-ordering constraints in a multiprogrammed environment. In addition, alternative code sequences may be used in some cases, for example, the twi instruction may be used in place of the cmp; bc sequences above.

The cmpxchg sequences above are for strong 32-bit operations. 64-bit cmpxchg sequences would substitute ldarx for lwarx and stdcx. for stwcx. Weak cmpxchg sequences could omit the backwards branch to _loop.

For convenience, the lock-acquisition sequence from Section B.2.1.1 of Book II of the PowerPC Architecture is adapted below:

loop: lwarx r6,0,r3 #load lock and reserve cmpw r4,r6 #(lock-free pattern in r4) bne- wait #if lock not free stwcx. loop #try to acquire lock bne- loop #if lost reservation isync #lightweight fence # begin critical section . . . wait: ... #wait for lock to become free

The code following the label wait would depend strongly on the enclosing software environment. Pure spinlocks would of course branch directly to loop rather than to wait. In some cases, implementors will prefer to use an immediate-mode compare rather than storing the lock-free pattern in a register.

Again for convenience, the lock-release sequence from Section B.2.2.1 of Book II of the PowerPC Architecture is adapted below:

sync #fence to force ordering stw r4,r3 #r4 contains lock-free pattern, #r3 contains address of lock word

If the critical section contains only normal cached memory operations, the lwsync barrier may be substituted for the heavier-weight sync instruction above. Alternatively, if non-cached memory operations such as MMIO operations are associated with memory barriers of their own, as is the case in the Linux kernel, then once again the lwsync barrier may be substituted for the heavier-weight sync instruction shown above. Finally, as with the lock-acquisition sequence, in some cases, implementors will prefer to use an immediate-mode store rather than storing the lock-free pattern in a register.

Relevant Wording From PowerPC Architecture

Modification Order and Memory Coherence

The C/C++ definition of “modification order” maps to the PowerPC notion of “memory coherence”. From Section 1.6.3 of PowerPC Book 2 describes memory coherence as follows:

An access to a Memory Coherence Required storage location is performed coherently, as follows.
Memory coherence refers to the ordering of stores to a single location. Atomic stores to a given location are coherent if they are serialized in some order, and no processor or mechanism is able to observe any subset of those stores as occurring in a conflicting order. This serialization order is an abstract sequence of values; the physical storage location need not assume each of the values written to it. For example, a processor may update a location several times before the value is written to physical storage. The result of a store operation is not available to every processor or mechanism at the same instant, and it may be that a processor or mechanism observes only some of the values that are written to a location. However, when a location is accessed atomically and coherently by all processor and mechanisms, the sequence of values loaded from the location by any processor or mechanism during any interval of time forms a subsequence of the sequence of values that the location logically held during that interval. That is, a processor or mechanism can never load a “newer” value first and then, later, load an “older” value.
Memory coherence is managed in blocks called coherence blocks. Their size is implementation-dependent (see the Book IV, PowerPC Implementation Features document for the implementation), but is larger than a word and is usually the size of a cache block.

Release Sequences

Release sequences include two cases, subsequent stores by the same thread that performed the release, and non-relaxed atomic read-modify-write operations. The subsequent-stores case is covered by the following sentence of Section 1.6.3 of PowerPC Book 2 quoted above:

That is, a processor or mechanism can never load a “newer” value first and then, later, load an “older” value.

In addition, we shall see that the operation of PowerPC memory barriers causes the subsequent-stores case to trivially follow from the case where the acquire operation loads the head of the release chain.

The non-relaxed-atomic case has not been carefully studied, but the authors are confident that use of either the lwsync or the hwsync instruction will suffice to enforce this ordering. A particular concern is the relationship of atomic operations and release sequences, which are explored in a later section.

The relevant wording from PowerPC Book 2 describing PowerPC's atomic read-modify-write sequences is as follows:

The store caused by a successful “stwcx.” is ordered, by a dependence on the reservation, with respect to the load caused by the “lwarx” that established the reservation, such that the two storage accesses are performed in program order with respect to any processor or mechanism.

Synchronizes With

The “synchronizes with” relation is handled by the placement of the memory barriers in the code sequences table above. The stores to object M follow a lwsync or hwsync instruction, and the corresponding loads precede one of a number of instruction sequences called out in the table. These instructions are described in Section 1.7.1 of PowerPC Book 2, as follows:

When a processor (P1) executes a sync, lwsync, or eieio instruction a memory barrier is created, which orders applicable storage accesses pairwise, as follows. Let A be a set of storage accesses that includes all storage accesses associated with instructions preceding the barrier-creating instruction, and let B be a set of storage accesses that includes all storage accesses associated with instructions following the barrier-creating instruction. For each applicable pair a_i,b_j of storage accesses such that a_i is in A and b_j is in B, the memory barrier ensures that a_i will be performed with respect to any processor or mechanism, to the extent required by the associated Memory Coherence Required attributes, before b_j is performed with respect to that processor or mechanism.

The word “performed” is defined roughly as follows:

A load operation has been performed with respect to a given CPU when that CPU is no longer able to change the value that is to be loaded.
A store operation has been performed with respect to a given CPU when a subsequent load by that CPU will return either the value store or some later value in the variable's modification order.

Section 1.7.1 of PowerPC Book 2 goes on to discuss cumulativity, which is somewhat similar to a causal ordering:

The ordering done by a memory barrier is said to be “cumulative” if it also orders storage accesses that are performed by processors and mechanisms other than P1, as follows.

A includes all applicable storage accesses by any such processor or mechanism that have been performed with respect to P1 before the memory barrier is created.
B includes all applicable storage accesses by any such processor or mechanism that are performed after a Load instruction executed by that processor or mechanism has returned the value stored by a store that is in B.

Note that B-cumulativity recurses on stores, as illustrated by the following sequence of operations:

Thread 0 executes a memory fence followed by a store to variable a. This store will be in the memory fence's B-set.
Thread 1 executes a load from a that returns the value stored by thread 0. Thread 1's load and all of thread 1's operations that are ordered after that load are in thread 0's memory fence's B-set.
Thread 1 executes a store to variable b that is ordered after the load from a (for example, by either code or data dependency). This store is therefore in Thread 0's memory fence's B-set.
Thread 2 executes a load from b that returns the value stored by thread 1. Thread 2's load and all of thread 1's operations that are ordered after that load are in thread 0's memory fence's B-set.

The recursive nature of B-cumulativity allows this sequence to be extended indefinitely.

In contrast, A-cumulativity has no such recursion. Only those operations performed with respect to the specific thread containing the memory fence (termed “memory barrier” in PowerPC documentation) will be in that memory fence's A-set. There is no A-cumulativity recursion through other threads' loads; such recursion is confined to B-cumulativity.

In both of the above passages from Section 1.7.1, the importance of the word “applicable” cannot be overstated. This word's meaning depends on the type of memory-fence instruction as follows:

eieio: only stores are “applicable”.
hwsync: both loads and stores are “applicable”. This instruction is often called sync, and provides full causal ordering (in fact, full sequential consistency).
lwsync applicability is limited to the following cases:
1. loads preceding and following.
2. stores preceding and following.
3. loads preceding and stores following.
Thus stores preceding and loads following are not applicable in the case of lwsync.
bc;isync: this is a very low-overhead and very weak form of memory fence. A specific set of preceding loads on which the bc (branch conditional) instruction depends are guaranteed to have completed before any subsequent instruction begins execution. However, store-buffer and cache-state effects can nevertheless make it appear that subsequent loads occur before the preceding loads upon which the twi instruction depends. That said, the PowerPC architecture does not permit stores to be executed speculatively, so any store following the twi;isync instruction is guaranteed to happen after any of the loads on which the bc depends.
Note that the bc;isync instruction sequence does not provide cumulativity. This permits the following counter-intuitive sequence of events, with all variables initially zero, and results of loads in square brackets following the load:
1. CPU 0: x=1
2. CPU 1: r1=x [1]
3. CPU 1: bc;isync
4. CPU 1: y=1
5. CPU 2: r2=y [1]
6. CPU 2: bc;isync
7. CPU 2: r3=x [0]
The problem is that the ordering between CPU 0's store and CPU 1's load is not globally visible, nor is the ordering between CPU 1's store and CPU 2's load. This sequence of events is more likely to occur on systems where CPUs 0 and 1 are closely related, for example, when CPUs 0 and 1 are hardware threads in one core and CPU 2 is a hardware thread in another core.

The effects of the isync instruction is described in the program note in Section 1.7.1 of PowerPC book 2:

Because an isync instruction prevents the execution of instructions following the isync until instructions preceding the isync have completed, if an isync follows a conditional Branch instruction that depends on the value returned by a preceding Load instruction, the load on which the Branch depends is performed before any loads caused by instructions following the isync. This applies even if the effects of the “dependency” are independent of the value loaded (e.g., the value is compared to itself and the Branch tests the EQ bit in the selected CR field), and even if the branch target is the sequentially next instruction.

This might seem at first glance to prohibit the above code sequence, however, please keep in mind that CPU 1's isync is not a cumulative memory barrier, and therefore does not guarantee that CPUs that see CPU 1's store to y will necessarily see CPU 0's store to x. Note that it is sufficient to replace CPU 1's bc;isync with an lwsync in order to prevent CPU 2's load from x from returning zero, but the proof cannot rely on B-cumulativity. B-cumulativity will not come into play here because prior stores (CPU 0's store to x) and subsequent loads (CPU 2's load from x) are not applicable to the lwsync instruction. Instead, we note that CPU 0's store to x will be performed before CPU 1's store to y with respect to all CPUs, due to CPU 1's lwsync instruction. Furthermore, CPU 2's bc;isync instruction ensures that CPU 2's load from y is performed before CPU 2's load from x with respect to all CPUs. Therefore, if CPU 2's load from y returns 1 (the new value), then CPU 2's subsequent load from x must also return 1 (the new value), given that the store to x was performed before the store to y.

However, if CPU 1's bc;isync were to be replaced with a hwsync, the proof can rely on simple B-cumulativity, because prior stores and subsequent loads are applicable in the case of hwsync.

Carrying Dependencies

N2556 defines carrying a dependency as follows:

An evaluation A carries a dependency to an evaluation B if

the value of A is used as an operand of B, and:

B is not an invocation of any specialization of std::kill_dependency, and

A is not the left operand to the comma (',') operator,

or

A writes a scalar object or bit-field M, B reads the value written by A from M, and A is sequenced before B, or

for some evaluation X, A carries a dependency to X, and X carries a dependency to B.

In cases where evaluation B is a load, we refer to Section 1.7.1 of PowerPC Book 2:

If a Load instruction depends on the value returned by a preceding Load instruction (because the value is used to compute the effective address specified by the second Load), the corresponding storage accesses are performed in program order with respect to any processor or mechanism to the extent required by the associated Memory Coherence Required attributes. This applies even if the dependency has no effect on program logic (e.g., the value returned by the first Load is ANDed with zero and then added to the effective address specified by the second Load).

Where evaluation B is a store, we refer to Section 4.2.4 of PowerPC Book 3:

Stores are not performed out-of-order (even if the Store instructions that caused them were executed out-of-order). Moreover, address translations associated with instructions preceding the corresponding Store instruction are not performed again after the store has been performed.

Examples

“Synchronizes-With” Examples

The synchronizes-with examples involve one thread performing an evaluation A sequenced before a release operation on some atomic object M, concurrently with another thread performing an acquire operation on this same atomic object M sequenced before an evaluation B. These examples are the four cases where A and B are all combinations of relaxed loads and stores.

Example 1: Load/Load

This example shows how the C++ synchronization primitives can be used to order loads. Consider the following C++ code, with all variables initially zero, and with the assertion executing after all threads have completed:

Thread 0:

	r1 = x.load(memory_order_relaxed);
	y.store(1, memory_order_release);

Thread 1:

	r2 = y.load(memory_order_acquire);
	r3 = x.load(memory_order_relaxed);

Thread 2 (required for assertion):
```
	x.store(1, memory_order_relaxed);
	
```
Assertion: assert(r2==0||r1==0||r3==1);

The corresponding PowerPC memory-ordering instructions are as shown in the following table:

Thread 0	Thread 1	Thread 2
r1=x;	r2=y;	x=1;
lwsync;	if (r2==r2)
y=1;	isync;
	r3=x;

Discussion:

Thread 0's load from x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. Because lwsync orders prior loads against subsequent stores, this means that thread 0's load is performed before its store with respect to all threads.
In addition, given that r1==1, by A-cumulativity, thread 2's store to x is in the lwsync's A-set. Because lwsync orders stores, thread 2's store to x will be performed before thread 0's store to y with respect to all threads.
If r2==1, then we know that thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is performed after thread 0's store to y with respect to thread 1.
The above conditions mean that if r1==1 and r2==1, thread 0's store to x is performed before thread 1's load from y with respect to thread 1.
Because of thread 1's conditional branch and isync, thread 1's load from y is performed before thread 1's load from x with respect to all processors.
Therefore, if r1==1 and r2==1, we know that r3==1, so that the assert is satisfied. (Note: we need not rely thread 1's load from x being in the lwsync's B-set. This is fortunate given that prior stores and subsequent loads are not applicable for lwsync.)

Note that the C++ memory model does not actually require the loads to be ordered in this case, since thread 0's relaxed store is not guaranteed to be seen in any particular order by thread 0's and 1's relaxed loads.

The fact that POWER provides ordering in this case is coincidental. The following modified code guarantees ordering according to the C++ memory model:

Thread 0:

	r1 = x.load(memory_order_acquire);
	y.store(1, memory_order_release);

Thread 1:

	r2 = y.load(memory_order_acquire);
	r3 = x.load(memory_order_acquire);

Thread 2 (required for assertion):
```
	x.store(1, memory_order_release);
	
```
Assertion: assert(r2==0||r1==0||r3==1);

Since this code sequence adds memory barriers and does not remove any, the assert is never violated on Power.

Example 2: Load/Store

The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:

Thread 0:

	r1 = x.load(memory_order_relaxed);
	y.store(1, memory_order_release);

Thread 1:

	r2 = y.load(memory_order_acquire);
	if (r2 != 0)
		x.store(1,memory_order_relaxed);

Assertion: assert(r1==0);

The corresponding PowerPC memory-ordering instructions are as shown in the following table, combining the conditional branch from the acquire operation with the if statement:

Thread 0	Thread 1
r1=x;	r2=y;
lwsync;	if (r2!=0)
y=1;	isync;
	x=1;

Discussion:

Thread 0's load from x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. Because preceding loads and subsequent stores are applicable to lwsync, the ordering of these two operations is guaranteed to be globally visible.
If r1==1, thread 0's load from x has returned the value stored by thread 1's store to x, and therefore thread 1's store is in the lwsync's A-set. This means that thread 1's store to x is performed before thread 0's store to y with respect to all threads.
However, thread 1's store to x will not be performed at all unless thread 1's load from y returns 1.
Thread 1's conditional branch and isync ensure that thread 1's laod from y is performed before its store to x, which in turn is performed before thread 0's store to y, which prevents thread 1's load from returning 1.
Therefore, r1 cannot have the value 1.

Example 3: Store/Load

The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:

Thread 0:

	x.store(1, memory_order_relaxed);
	y.store(1, memory_order_release);

Thread 1:

	r1 = y.load(memory_order_acquire);
	if (r1 == 1)
		r2 = x.load(memory_order_relaxed);

Assertion: assert(r1==0||r2==1);

The corresponding PowerPC memory-ordering instructions are as shown in the following table, combining the conditional branch from the acquire operation with the if statement:

Thread 0	Thread 1
x=1;	r1=y;
lwsync;	if (r1==1)
y=1;	isync;
	r2=x;

Discussion:

Thread 0's store to x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the store to x is performed before the store to y with respect to any given thread.
If r1==1:
1. Thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is performed after thread 0's store to y with respect to thread 1.
2. The conditional branch and the isync means that thread 1's load from y is performed before its load from x with respect to all threads.
3. Therefore, thread 0's store to x is performed before thread 1's load from x.
We therefore have r2==1, satisfying the assert.
On the other hand, if r2==0, then the assert is directly satisfied.

Example 4: Store/Store

The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing at any time:

Thread 0:

	x.store(1, memory_order_relaxed);
	y.store(1, memory_order_release);

Thread 1:

	r1 = y.load(memory_order_acquire);
	if (r1 == 1)
		z.store(1,memory_order_relaxed);

Assertion: assert(z==0||x==1);

The corresponding PowerPC memory-ordering instructions are as shown in the following table, combining the conditional branch from the acquire operation with the if statement:

Thread 0	Thread 1
x=1;	r1=y;
lwsync;	if (r1==1)
y=1;	isync;
	z=1;

Discussion:

Thread 0's store to x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This means that the store to x is performed before the store to y with respect to any given thread.
If r1==1:
1. Thread 0's store to y synchronizes with thread 1's load from y, which means that thread 1's load from y is performed after thread 0's store to y with respect to thread 1. Therefore, thread 0's store to x is performed before thread 1's load from y with respect to thread 1.
2. Because stores are not performed speculatively, thread 1's load from y is performed before its store to z with respect to any processor.
3. Therefore, thread 0's store to x is performed before thread 1's store to z with respect to any processor.
We thus have x==1, satisfying the assert.
On the other hand, if r1==0, then z is never assigned to, so its value remains zero, which also satisfies the assert.

“Dependency-Ordered-Before” Examples

The dependency-ordered-before examples involve one thread performing an evaluation A sequenced before a release operation on some atomic object M, concurrently with another thread performing an consume operation on this same atomic object M sequenced before an evaluation B that depends on the consume operation. These examples are the four cases where A and B are all combinations of relaxed loads and stores. Some of the examples will require a third thread, for example, the load/load case cannot detect ordering without an independent store.

Example 5: Load/Load

Like Example 1, this example orders a pair of loads, but using dependency ordering rather than load-acquire. Consider the following C++ code, with all variables initially zero, and with the assertion executing after all threads have completed:

Thread 0:

	r1 = p.a.load(memory_order_acquire);
	y.store(&p, memory_order_release);

Thread 1:

	r2 = y.load(memory_order_consume);
	r3 = r2->a.load(memory_order_acquire);

Thread 2 (required for assertion):
```
	p.a.store(1, memory_order_release);
	
```
Assertion: assert(r2==&p||r1==0||r3==1);

The corresponding PowerPC memory-ordering instructions are as shown in the following table:

Thread 0	Thread 1	Thread 2
r1=p.a;	r2=y;	p.a=1;
lwsync;	if (r2==&p)	lwsync;
y=&p;	r3=r2->a;

Note that the ordering instructions corresponding to thread 0's load-acquire from p.a are folded into the following lwsync instruction. Note also that the ordering instructions corresponding to thread 1's second load have no effect, and hence have been omitted.

Discussion:

Thread 0's load from p.a is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. Because preceding loads and subsequent stores are applicable to lwsync, the ordering of these two operations is guaranteed to be globally visible.
If r1==1, then we know that, by cumulativity, thread 2's store to p.a is also in the lwsync's A-set.
If r2==&p, then we know that thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is performed after thread 0's store to y with respect to thread 1. Therefore, thread 0's load from r2->a is performed before thread 2's store to p.a with respect to thread 1.
Because of dependency ordering, thread 1's load from y is performed before thread 1's load from r2->a with respect to all processors.
Therefore, if r1==1 and r2==1, we know that r3==1 so that the assert must always be satisfied. (Note: we need not rely thread 1's load from r2->a being in the lwsync's B-set, which is good given that prior stores and susequent loads are not applicable for lwsync.)

Example 6: Load/Store

The C++ code for this example is as follows, with all variables initially zero:

Thread 0:

	r1 = p.a.load(memory_order_relaxed);
	y.store(&p, memory_order_release);

Thread 1:

	r2 = y.load(memory_order_consume);
	if (r2 == &p)
		r2->a.store(1,memory_order_relaxed);

Assertion: assert(r1==0);

The corresponding PowerPC memory-ordering instructions are as shown in the following table:

Thread 0	Thread 1
r1=p.a;	r2=y;
lwsync;	if (r2==&p)
y=&p;	r2->a=1;

Discussion:

Thread 0's load from p.a is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that thread 0's load is performed before its store with respect to any given processor.
If r2==&p, then we know that thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is performed after thread 0's store to y with respect to thread 1.
Power processors do not perform stores speculatively. Therefore, thread 0's load from p.a is performed before thread 1's store to r2->a with respect to thread 1. Therefore, in this case, we have r1==0.
On the other hand, if r2!=&p, then thread 1's store to p.a will not be executed, so that r1==0.

Example 7: Store/Load

The C++ code for this example is as follows, with all variables initially zero:

Thread 0:

	p.a.store(1, memory_order_relaxed);
	y.store(&p, memory_order_release);

Thread 1:

	r1 = y.load(memory_order_consume);
	if (r1 == &p)
		r2 = r1->a.load(memory_order_relaxed);

Assertion: assert(r1==&p||r2==1);

The corresponding PowerPC memory-ordering instructions are as shown in the following table:

Thread 0	Thread 1
p.a=1;	r1=y;
lwsync;	if (r1==&p)
y=&p;	r2=r1->a;

Discussion:

Thread 0's store to p.a is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the store to p.a is performed before the store to y with respect to any given thread.
If r1==&p:
1. Thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is performed after thread 0's store to y with respect to thread 1.
2. Dependency ordering means that thread 1's load from y will be performed before its load from r1->a with respect to all threads.
3. Therefore, thread 0's store to p.a is performed before thread 1's load from r1->a with respect to thread 1.
Therefore, in this case, we have r2==1, satisfying the assert.
On the other hand, if r2==0, then the assert is directly satisfied.

Example 8: Store/Store

The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing at any time:

Thread 0:

	p.a.store(1, memory_order_relaxed);
	y.store(&p, memory_order_release);

Thread 1:

	r1 = y.load(memory_order_consume);
	if (r1 == &p)
		r1->b.store(1,memory_order_relaxed);

Assertion: assert(p.b==0||p.a==1);

The corresponding PowerPC memory-ordering instructions are as shown in the following table:

Thread 0	Thread 1
p.a=1;	r1=y;
lwsync;	if (r1==&p)
y=&p;	r1->b=1;

Discussion:

Thread 0's store to p.a is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the store to p.a is performed before the store to y with respect to any given thread.
If r1==&p:
1. Thread 0's store to y synchronizes with thread 1's load from y, which in turn means that thread 1's load from y is performed after thread 0's store to y with respect to thread 1.
2. Because Power does not do speculative stores, thread 1's load from y is performed before its store to r1->b with respect to each thread.
3. Therefore, thread 0's store to p.a is performed before thread 1's store to r1->b with respect to each thread.
Therefore, in this case, we have p.a==1, satisfying the assert.
On the other hand, if r2==0, then thread 2's assignment to r1->b is never executed, so that p.b is zero, again satisfying the assert.

Release-Sequence Examples

The C++ memory model also provides for a “release sequences”, which comprise either (1) subsequent stores to the variable that was the subject of the release operation by the same thread that performed the release operation or (2) non-relaxed atomic read-modify-write operations on the variable that was the subject of the release operation by any thread. We consider these two release-sequence components separately in the following sections.

Release-Sequence Same-Thread Examples

Subsequent same-thread stores are trivially accommodated. Simply add an additional atomic store (but possibly relaxed) operation after the release operation, and modify checks on the corresponding acquire operation to test for this subsequent store.

The same reasoning that worked for the original analyses applies straightforwardly to the updated examples.

Release-Sequence Atomic-Operation “Synchronizes-With” Examples

This section reprises the examples in the “Synchronizes-With” section, but introducing an atomic read-modify-write operation.

Example 9: Load/Load

Thread 0:

	r1 = x.load(memory_order_acquire);
	y.store(2, memory_order_release);

Thread 1:

	r2 = y.load(memory_order_acquire);
	r3 = x.load(memory_order_acquire);

Thread 2 (required for assertion):
```
	x.store(1, memory_order_release);
	
```
Thread 3 (adds atomic read-modify-write operation):
```
	y.fetch_add(1, memory_order_acq_rel);
	
```
Assertion: assert(r2<=1||r1==0||r3==1);

The corresponding PowerPC memory-ordering instructions are as shown in the following table:

Thread 0	Thread 1	Thread 2	Thread 3
r1=x;	r2=y;	x=1;	ldarx r4,y
lwsync;	if (r2==r2)		r5=r4+1
y=2;	isync;		stdcx. y,r5
	r3=x;

Discussion:

Thread 0's load from x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. These two operations are therefore ordered with respect to each thread.
If r1==1 (as required to violate the assertion), then by A-cumulativity thread 2's store to x is also in the lwsync's A-set.
If r4==1, then thread 3's stdcx. is also in the lwsync's B-set. Therefore, thread 2's store to x is ordered before thread 3's stdcx. to y with respect to each thread.
The r2==1 case was handled in example 1, and is not repeated here.
If r2==2, thread 1's load from y was performed after thread 3's stdcx. with respect to thread 1.
The above conditions mean that if r1==1 and r2==2, thread 0's store to x is performed before thread 1's load from y with respect to all processors.
Because of thread 1's conditional branch and isync, thread 1's load from y is performed before thread 1's load from x with respect to all processors.
Therefore, if r1==1 and r2==2, we know that r3==1, so that the assert is satisfied. (Note: we need not rely thread 1's load from x being in the lwsync's B-set. This is fortunate given that prior stores and subsequent loads are not applicable for lwsync.)

Note that this line of reasoning does not depend on any memory barriers in the atomic read-modify-write operation, which means that this code sequence would maintain the synchronizes-with relationship even when using memory_order_relaxed. However, the standard does not require this behavior, so portable code must use a non-relaxed memory-order specifier in this case.

Example 10: Load/Store

The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:

Thread 0:

	r1 = x.load(memory_order_relaxed);
	y.store(2, memory_order_release);

Thread 1:

	r2 = y.load(memory_order_acquire);
	if (r2 >= 2)
		x.store(1,memory_order_relaxed);

Thread 2 (adds atomic read-modify-write operation):
```
	y.fetch_add(1, memory_order_acq_rel);
	
```
Assertion: assert(r1==0);

The corresponding PowerPC memory-ordering instructions are as shown in the following table, combining the conditional branch from the acquire operation with the if statement:

Thread 0	Thread 1	Thread 2
r1=x;	r2=y;	ldarx r3,y
lwsync;	if (r2>=2)	r4=r3+1
y=2;	isync;	stdcx. y,r4
	x=1;

Discussion:

Thread 0's load from x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the load is performed before the store with respect to any given thread.
The case r2==1 was handled in example 2, and is not repeated here.
If r2==2, we know that thread 2's ldarx read the value stored by thread 0, and thread 2's stdcx. is therefore in the lwsync's B-set. In addition, because thread 1's load from y read the value stored by thread 2's stdcx. to y, applicable operations ordered after that load are therefore also in the lwsync's B-set.
Thread 1's conditional branch and isync ensure that the load into r2 is performed before the store to x with respect all other threads, so that the store to x is in the lwsync's B-set. Thread 0's load from x is thus always performed before thread 1's store to x with respect to each thread.
Therefore, r1 will always be zero, as thread 1's store to x is never performed unless thread 0's load to r1 has already been performed.

Example 11: Store/Load

The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing after all threads have completed:

Thread 0:

	x.store(1, memory_order_relaxed);
	y.store(1, memory_order_release);

Thread 1:

	r1 = y.load(memory_order_acquire);
	if (r1 >= 2)
		r2 = x.load(memory_order_relaxed);

Thread 2 (adds atomic read-modify-write operation):
```
	y.fetch_add(1, memory_order_acq_rel);
	
```
Assertion: assert(r1<=1||r2==1);

The corresponding PowerPC memory-ordering instructions are as shown in the following table, combining the conditional branch from the acquire operation with the if statement:

Thread 0	Thread 1	Thread 2
x=1;	r1=y;	ldarx r3,y
lwsync;	if (r1>=2)	r4=r3+1
y=1;	isync;	stdcx. y,r4
	r2=x;

Discussion:

Thread 0's store to x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This in turn means that the store to x is performed before the store to y with respect to any given thread.
If r1==1, we have the case covered in example 3, which will not be further discussed here.
If r1>=2, we know that thread 2's atomic increment intervened, so that thread 2's stdcx. is in lwsync's B-set, so that this stdcx. is performed after thread 0's store to x with respect to all threads.
Because thread 1's load from y returned the value stored by thread 2's stdcx., we know that thread 2's store to y is performed before thread 2's load from y with respect to thread 1.
Thread 1's conditional branch and isync ensure that thread 1's load to r1 is performed before its load to r2 with respect to each thread.
Therefore, given that r1 is equal to two, r2 is guaranteed to return the value stored to x by thread 0, namely, the value 1. The assertion is therefore always satisfied.

Example 12: Store/Store

The C++ code for this example is as follows, with all variables initially zero, and with the assertion executing at any time:

Thread 0:

	x.store(1, memory_order_relaxed);
	y.store(1, memory_order_release);

Thread 1:

	r1 = y.load(memory_order_acquire);
	if (r1 == 1)
		z.store(1,memory_order_relaxed);

Thread 2 (adds atomic read-modify-write operation):
```
	y.fetch_add(1, memory_order_acq_rel);
	
```
Assertion: assert(z==0||x==1);

The corresponding PowerPC memory-ordering instructions are as shown in the following table, combining the conditional branch from the acquire operation with the if statement:

Thread 0	Thread 1	Thread 2
x=1;	r1=y;	ldarx r2,y
lwsync;	if (r1>=2)	r3=r2+1
y=1;	isync;	stdcx. y,r2
	z=1;

Discussion:

Thread 0's store to x is in the lwsync's A-set and thread 0's store to y is in the lwsync's B-set. This means that the store to x is performed before the store to y with respect to any given thread.
If r1==1, we have the case dealt with in example 4, which will not be discussed further.
If r1==2, we know that thread 2's atomic operation intervened, so that thread 2's stdcx. is in the lwsync's B-set. Therefore, thread 0's store to x is performed before thread 2's stdcx. to y.
Given that thread 1's load from y returned the value stored by thread 2's stdcx., we know that all of thread 1's subsequent applicable operations must see the values stored by operations in the lwsync's A-set.
Thread 1's conditional branch and isync guarantee that the load into r1 is performed before its store to z with respect to all threads.
This means that if the store to z is performed, the value of x must already be one, satisfying the assert.

Release-Sequence Atomic-Operation “Dependency-Ordered-Before” Examples

The key point in examples 9-12 was the recursive nature of B-cumulativity. This applies straightforwardly to the dependency ordering examples as well, so that Power allows a chain of atomic operations to form a release sequence, regardless of the memory_order argument.

Bidirectional Fences

PowerPC support for acquire and release bidirectional fences can be demonstrated by taking each of the examples above, and making the following transformations:

Transform store-release operations into the equivalent bidirectional memory_order_release fence, and note that the same sequence of instructions is generated, thereby guaranteeing the same ordering.
Transform load-acquire operations into the equivalent bidirectional memory_order_acquire fence, and note that this replaces the conditional-branch-isync sequence with an lwsync. Because lwsync is uniformly stronger, this transformation guarantees the same ordering.

Summary

The PowerPC code sequences shown at the beginning of this document suffice to implement the synchronizes-with and dependency-carrying portions of the proposed C++ memory model. These code sequences are able to take advantage of the lightweight memory barriers provided by the PowerPC architecture.