September 30, 2016
This article was contributed by Jade Alglave,
Paul E. McKenney, Alan Stern, Luc Maranget, Andrea Parri and TBD
Introduction
This article is organized as follows, with the intended audience
for each section in parentheses:
- Introduction to the Linux-Kernel Memory Models
(people interested in understanding the memory model).
- Strong-Model Bell File
(masochists and other people interested in a deep understanding
of the Linux-kernel memory model).
- Strong-Model Cat File
(masochists and other people interested in a deep understanding
of the Linux-kernel memory model).
- (More TBD.)
This is followed by the inevitable
answers to the quick quizzes.
This section is mostly concerned with the strong memory model.
The other, less strong (we hesitate to call it “weak”)
model is derived from the strong one by relaxing several of the
less-important constraints.
The strong Linux-kernel memory model started out as an operational
model, based on the PPCMEM model for PowerPC as presented in
two papers (“Understanding POWER Multiprocessors”
[pdf]
and “Synchronising C/C++ and POWER”
[pdf])
by Susmit Sarkar, Peter Sewell, and others.
Our model was a modified version of theirs, changed to take into account
the requirements of the kernel.
herd-style Bell and Cat files
were developed as a formal axiomatization of this model.
The model then was modified to handle the peculiarities of the DEC Alpha,
and the Cat file was modified accordingly.
Some time later we incorporated ideas from the
Flowing and POP models for ARM, as presented in “Modelling the
ARMv8 Architecture, Operationally: Concurrency and ISA”
[pdf]
by Shaked Flur, Peter Sewell, and others
(together with the supplementary material
[pdf]).
This design proved to be so different from the PowerPC-oriented
operational model that there was no reasonable way to unify the two.
Instead, we abandoned our operational model and concentrated on the
formal herd model, weakening it so that it would accept litmus
tests allowed by the ARM architecture as defined by the Flowing model.
Nevertheless, the original operational model offers a very good basis
for understanding our formal model and so we present it here,
along with a discussion of the changes needed to adapt it to Alpha and
the issues raised by ARM.
The operational model divides a computer system into two parts:
the processors (or CPUs), which execute instructions,
and the memory subsystem,
which propagates information about writes and barriers among the
CPUs and is also responsible for determining the coherence order.
When a CPU executes a write or certain kinds of barriers,
it tells the memory subsystem.
And when a CPU needs to load a value from memory or cache to execute a read,
it asks the memory subsystem to provide the value.
The Processor Subsystem
Although the underlying operations involved in executing an instruction
on a modern CPU can be quite complicated,
nevertheless there always comes a point where the CPU has finished
evaluating all of the instruction's inputs and outputs and it commits
itself irrevocably to using those values.
Conceptually, each instruction that gets executed is committed
at a single, precise moment in time.
(Instructions that don't get executed, such as those started speculatively
in what turns out to be the unused arm of a conditional branch,
are not committed.)
Instructions may commit in any order and at any rate, subject to certain
constraints.
For example, an instruction controlled by a conditional branch can't
be committed before the branch itself, because until that time
the CPU doesn't know for certain which way the branch will go.
The full set of constraints on the order of instruction execution
(which is almost but not quite the same as committal, differing only
for read instructions) is listed
below.
For instructions involving only quantities that are local to the CPU,
such as those computing register-to-register arithmetic,
that's all there is to it:
The CPU carries out the operations required
by the instruction, eventually commits to the result, and moves on.
But some instructions need more.
In particular, some require the CPU to communicate with the memory subsystem.
Writes and memory barriers are the simplest case.
When a CPU commits a write instruction, it tells the memory subsystem
the target address of the write and the value to be stored there.
It can't do this before the write commits, because once the information
has been sent to the memory subsystem there's no way to take it back.
Similarly, when a CPU commits one of the barriers that affect
write-propagation order, it informs the memory subsystem, which then
uses that information to control the way writes get propagated.
Reads are more complicated.
When a CPU starts to execute a read instruction,
it first has to calculate the target address, which may involve
adding index or base register values to a constant offset.
It then checks to see if the most recent write (in program order)
to that target address is still uncommitted;
if it is then the CPU takes the value to be stored by that write
and uses it as the value for the read.
This is called store forwarding, and it is a form of out-of-order
execution (the read can be committed before the program-earlier write).
But if there was no prior write to that address or the most recent one
has already been committed, then the CPU has to
ask the memory subsystem to retrieve the value at the target address.
Either way, we say that the read is satisfied,
and this also takes place at a precise moment in time.
A read instruction cannot commit until it has been satisfied.
There's more to it than that, however.
The act of satisfying a read is not irrevocable.
It may turn out, for example, that the values used in calculating the
target address were themselves not yet committed and hence are still
subject to change.
If that happens, the read instruction may need to be restarted:
The target address must be recalculated and the read must be satisfied again.
This can happen several times before the read is committed.
In fact, it can even happen several times without the read ever being
committed, if the read was started speculatively and then abandoned.
Of course, once a read has been committed then it can no longer be restarted.
Thus, a CPU carries out a read instruction by satisfying it
(perhaps more than once) and eventually committing it.
For most other instruction types, execution only involves committing the
instruction, but there is one exception.
A strong memory barrier (such as smp_mb())
is not finished when it commits.
Instead, the CPU has to wait for the strong barrier to be
acknowledged by the memory subsystem.
This doesn't happen until the memory subsystem has propagated the
barrier to all the other CPUs in the system,
and the CPU is not allowed to begin executing any instructions that
come after the strong barrier in program order until then.
This is what makes these barriers so strong (and so slow!).
The Memory Subsystem
The memory subsystem accepts write, read, and barrier requests from the CPUs.
It propagates the write and barrier requests to all the CPUs in the system
and satisfies read requests.
It also determines the coherence order of all writes to each variable
and provides a mechanism for making certain operations atomic.
Handling read requests is quite simple.
When a CPU submits a read request for a specified target address,
the memory subsystem finds the latest write
(in the target address's coherence order) that has propagated to the
CPU and returns the value stored by that write.
This means, among other things, that a CPU cannot read from a write
until the write has propagated to that CPU, as you would expect.
It's important that the write be the coherence-latest;
otherwise the system could violate the read-read
coherence rule
if a po-earlier read had already read from the coherence-latest write.
Accepting a write request from a CPU is a little more complicated.
To begin with, the memory subsystem has to decide where the write
will fit into the coherence order for the target address.
In particular, it must ensure that the write is assigned to a position
in the coherence order that is after any other writes to the same address
which have already propagated to that CPU.
This is necessary because the CPU might have read from one of those
other writes; if the new write were to come before that other write
in the coherence order then there would be a violation of the read-write
coherence rule.
In addition, the memory subsystem has to
propagate the write to all the other CPUs
(a write or barrier is considered to “propagate” to its own CPU
at the time it is committed) and to the coherence point.
The coherence point is a notional place in the system where
writes and barriers get sent,
in much the same way that they are propagated to CPUs.
(If you like, you can think of the coherence point as being the place
where writes finally pass out of all the internal caches and buffers,
down into memory for storage.)
The key aspect of the coherence point is that different writes
to the same address arrive at the coherence point in their coherence order.
In effect, the order of their arrival at the coherence point defines
the coherence order.
Whether this is because the memory subsystem first decides on a
coherence order and then sends writes to the coherence point in that
order, or because it sends writes willy-nilly to the coherence point
and then uses the order of their arrival as the coherence order,
doesn't matter.
What does matter is that once a write has reached the coherence point,
its position in the coherence order is fixed; it is impossible for any
future writes to be assigned an earlier position in the coherence order.
This fact is crucial for atomic operations.
The memory model represents an atomic read-modify-write (RMW) operation as
two events: a read R followed by a write W
(or conditionally followed by a write, for operations like cmpxchg()).
What makes the operation atomic is that no other writes to that address,
from any CPU, are allowed to intervene between R and W.
In other words, the memory subsystem guarantees that the write immediately
preceding W in the coherence order is the write that R
reads from.
The operational model specifies that it does this, in part, by arranging for
W to reach the coherence point at the time when it commits
(as opposed to some arbitrarily later time, like an ordinary write).
As a result, no future write will be able to sneak in before W
in the coherence order, and avoiding such “sneak writes”
preserves the atomic property.
Other than these requirements and the constraints imposed by memory barriers,
the order in which writes are propagated to CPUs and reach the coherence point
is unrestricted.
In particular, these orders don't have to bear any resemblance to the order
in which the write requests were originally sent to the memory subsystem.
It's entirely possible for two CPUs to write to the same address at
different times and have the second write come before the first in the
coherence order or be propagated before the first to a third CPU.
The kernel memory model has two broad categories of memory barriers:
those whose effects are entirely local to a single CPU and those that
interact with the memory subsystem.
Barriers in the first category can only constrain the order in which
the CPU executes instructions, whereas those in the second group
can also affect the order of propagation of writes.
Associated with each memory barrier are two sets of instructions,
called the barrier's pre-set and its post-set.
These sets vary according to the type of barrier, but all barriers
share these features:
- A barrier cannot commit until every instruction in its pre-set
has committed, and
- An instruction in the barrier's post-set cannot commit or be
satisfied until the barrier has committed.
The various barriers included in this memory model,
and their types and pre- and post-sets, are listed
in the following table in order of increasing strength.
(Contrary to what you might expect, smp_load_acquire(),
rcu_dereference(), and lockless_dereference()
are represented in the memory subsystem as a read request followed by a
separate barrier request.
Similarly, smp_store_release() and rcu_assign_pointer()
are represented as a barrier request followed by a separate write request.
Thus, each requires the CPU to issue two requests to the memory subsystem.)
Barrier | Type | Pre-set | Post-set
------- | ---- | ------- | --------
rcu_dereference(), lockless_dereference() | Read-dependency | itself | all po-later reads with a dependency from this read
smp_read_barrier_depends() | Read-dependency | all po-earlier reads | all po-later reads with a dependency from a po-earlier read
smp_load_acquire() | Execution-order | itself | all po-later memory accesses
smp_rmb() | Execution-order | all po-earlier reads | all po-later reads
smp_wmb() | B-cumulative | all po-earlier writes | all po-later writes (*)
smp_store_release(), rcu_assign_pointer() | A-cumulative (**) | all po-earlier memory accesses (*) | itself and members of its release sequence
smp_mb(), synchronize_rcu() | Strong (A- and B-cumulative) | all po-earlier memory accesses (*) | all po-later memory accesses (*)
(*) as modified by the cumulativity requirements described below.
(**) also B-cumulative in combination with an acquire load that returns the
value stored, as described
here.
The read-dependency and execution-order barriers are purely local to their
own CPU.
However, when the CPU commits one of the others
(collectively referred to as “propagation-order” barriers),
it informs the memory subsystem about the barrier, and the memory subsystem
propagates the barrier to all the other CPUs.
This is where the barrier's propagation ordering effects come into play:
- The memory subsystem will not propagate a barrier to a CPU until
all the writes in the barrier's pre-set have been propagated to
that CPU; and
- The memory subsystem will not propagate a write in a barrier's
post-set to a CPU until the barrier has been propagated to that CPU.
The same is true for the order in which writes and barriers reach the
coherence point; in this respect barriers treat the coherence point
much like another CPU.
In addition, the memory subsystem does not
acknowledge a strong barrier
until the barrier has been propagated to every CPU
and has reached the coherence point
(and as mentioned above,
the CPU will not satisfy or commit any instructions po-after a strong
barrier until the barrier has been acknowledged).
The propagation-order barriers enjoy varying degrees
of cumulativity.
This means that the barriers affect the order of propagations, not just
of writes issued by the barrier's own CPU, but also of writes issued
by other CPUs.
In effect, the barriers' pre- and post-sets are enlarged:
- The pre-set of an A-cumulative barrier also includes
all writes that have propagated to the barrier's CPU before
the barrier is committed.
- The post-set of a B-cumulative barrier also includes
all writes on another CPU that commit after the barrier has
propagated to that CPU.
The memory model includes the idea of release sequences,
borrowed (and slightly altered) from C11.
For any store-release instruction (such as smp_store_release(),
rcu_assign_pointer(), or xchg_release()),
the release sequence headed
by that instruction includes the instruction itself as well as
all writes to the same address that come after it in program order.
The release sequence also includes, recursively,
any atomic RMW operation accessing the same address,
on any CPU, that reads from a write in the release sequence.
Every write in the release sequence belongs to the
associated barrier's post-set.
Some examples of cumulativity and release sequences are presented
below.
To see how this works out in practice, consider this litmus test,
an example of the “Store-Buffering” pattern:
Strong Model Litmus Test #1
1 C C-SB+o-mb-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 int r1;
9
10 WRITE_ONCE(*x, 1);
11 smp_mb();
12 r1 = READ_ONCE(*y);
13 }
14
15 P1(int *x, int *y)
16 {
17 int r2;
18
19 WRITE_ONCE(*y, 1);
20 smp_mb();
21 r2 = READ_ONCE(*x);
22 }
23
24 exists
25 (0:r1=0 /\ 1:r2=0)
Using the features of the memory model we have already seen, we can
show that this test's “exists” condition will never be satisfied.
When the test is executed, one of the two memory barriers must be
acknowledged before the other (they are both strong barriers).
Suppose the barrier in P0 gets acknowledged first.
Then the following events have to occur in the order listed, for the
reasons shown:
- P0's write to x propagates to P1 before P0's memory barrier
does, because the write is in the barrier's pre-set.
- P0's memory barrier propagates to P1 before it is acknowledged,
because a strong memory barrier is not acknowledged until
it has propagated to every CPU.
- P0's memory barrier is acknowledged before P1's barrier, by
assumption.
- P1's read of x is satisfied after P1's barrier is
acknowledged, because the read comes after the barrier
in program order.
Hence the write to
x propagates to P1 before P1's read is
satisfied.
Since that write is the last one in the coherence order for
x,
it is the one that will be used to satisfy the read.
Therefore
r2 will end up equal to 1, not 0.
The opposite case (where P1's barrier is acknowledged first) is
symmetrical.
In neither case is it possible for
r1 and
r2 both to
be equal to 0.
Processor-Local Ordering Requirements
While executing instructions, a CPU observes various ordering requirements.
Some of these are obvious (an instruction can't be executed before the CPU
knows how or whether to execute it, as mentioned
earlier).
Others are less obvious but are necessary to avoid violating the four
coherence rules.
The first and simplest requirement is that a CPU will not commit
an instruction that is po-after a conditional branch
until the branch itself is committed.
At the hardware level this is true even for trivial conditionals;
CPUs do not recognize that expressions like “x == x”
must always hold.
The situation in higher-level languages is not as simple, because
optimizing compilers do recognize such things.
They will happily eliminate the unnecessary test and conditional jump
from the object code entirely,
leaving no ordering requirement for the CPU to obey at runtime.
Compilers also realize that code following the end of an
“if (...) then {...} else {...}” statement
will be executed regardless of which branch is taken,
and they are free to move such code up before the start of the
“if” in the absence of any reason not to.
Therefore our memory model applies this ordering requirement
only to the instructions
that are within the “then {...} else {...}”
branches of the conditional, i.e., those that are directly under its control.
These instructions cannot be committed before the instructions that compute
the “if” condition and decide whether to take the branch.
The memory model has a weakness in this area.
If both branches of an “if” statement
write the same value to the same variable,
an optimizing compiler may replace both writes with a single write
executed before the “if” statement branches.
The resulting single write would not be subject to the ordering
requirement at runtime, even though the model says it should be.
(We must emphasize that this requirement applies only to committing
instructions. Reads following a conditional branch can be satisfied
before the branch is committed, and they often are.
Even if a read belongs to the arm of the branch that is ultimately not taken,
the CPU may speculatively start executing the read before it knows
which way the branch will go.
Although all architectures have barriers that can prevent speculative reads,
such as the isb and isync instructions on ARM and PowerPC,
respectively, the Linux kernel does not include any facility
specifically meant for this purpose.
If you really want to prevent a read in a conditional branch
from being satisfied speculatively, you can always use smp_rmb().)
The preceding requirement was about whether a CPU should execute
an instruction.
The next ordering requirement is about how an instruction should
be executed.
If an instruction has a
dependency
from a po-earlier read, then the instruction cannot commit until the
read does.
This is simply because the address or data the instruction will use isn't
irrevocable until the earlier read commits.
The next pair of requirements affect only writes.
A write cannot commit until all sources of address dependencies to a
memory-access instruction po-before the write have committed.
This is necessary because the po-earlier memory access might generate
an invalid-address exception, in which case the write should not be executed.
However, the CPU can't know whether the earlier access's target
address will turn out to be invalid until the target address is fully
determined, which means that all sources of an address dependency have
to be committed.
Furthermore, a write cannot commit until all po-earlier instructions
accessing the same address have committed.
This requirement is necessary to enforce the coherence rules.
If two writes to the same address committed and were sent to the memory
subsystem out of order,
the memory subsystem would put the po-earlier write later in the
address's coherence order, which would violate write-write coherence.
And if a write committed before a po-earlier read of the same address
was satisfied, the memory subsystem would use the value stored by the write
(or something even later in the coherence order) to satisfy the
read request, which would violate read-write coherence.
But since the po-earlier read may be restarted (and thus satisfied again)
at any time up until it commits,
this means the write must not commit until the read is committed.
The next three requirements are concerned with restarting reads.
A read instruction R must be restarted after:
- A po-earlier read R' is satisfied, where R' is
the source of an address dependency to R, or the source
of an address or data dependency to a write that was forwarded
to R;
- A po-earlier read R' of the same address is satisfied,
unless R read from the same write as R' or
was forwarded from a write that is po-after R'; or
- A po-earlier write W to the same address is committed,
unless R was forwarded from W or
from a write that is po-after W.
It follows that
R cannot commit until each of the
R'
or
W
accesses mentioned here has committed (because until then,
R'
might restart and thus be satisfied again or
W might commit,
requiring
R to restart).
The reason for the first case is pretty obvious; the other two
are more obscure.
For the second case,
suppose R' is po-before R and they read from different
writes, W' and W respectively.
Assuming W was not forwarded to R, this means that
W was the most recent write in the coherence order to have
propagated to the CPU as of the time when R was satisfied.
Similarly, either R' was forwarded from W' or else
W' was the most recent write in the coherence order to have
propagated to the CPU as of the time when R' was satisfied.
But if R' was satisfied after R, then
W' must come after W in the coherence order—if
it came before then R' would have read from W instead
of W'.
(We can discount the possibility that W' was forwarded to
R'; if it had been, then it or a write po-after R'
would have been forwarded to R.)
This would be a violation of read-read coherence.
Thus R' must be satisfied before R, and the only
way to guarantee this is to require that R be restarted after
R' is satisfied.
For the third case, suppose W was the last write before
R (in program order) to access the same address,
and R was satisfied before W committed but
was not forwarded from it.
(This could happen if W's target address had not yet
been determined at the time R was satisfied.)
Then R would have either read from some other write W'
which had already propagated to the CPU, or else been forwarded from
some other write W' that was po-earlier than W.
Either way, when W did commit later on,
it would be assigned a position in the coherence order after W'.
This would violate write-read coherence.
Thus R must be satisfied at some time after W commits,
and the only way to guarantee this is to
require that R be restarted when W commits.
Finally, the memory model assumes that the resources required to
carry out an atomic RMW operation (bus locks, reservation registers,
or whatever) are limited, and consequently a CPU cannot execute
two such operations concurrently or out of program order.
Thus, the read part of an RMW instruction cannot be satisfied until
all po-earlier RMW instructions have committed.
Likewise, the write part cannot commit until the read part has committed.
When we express these ordering requirements in the memory model,
it turns out that when a read instruction commits is relatively unimportant;
what really matters is when the read is satisfied for the last time.
Therefore we will say that a read executes when it is last
satisfied, whereas all other instructions execute when they commit.
In these terms, the ordering requirements take the following form.
Let A and B be instructions with A before
B in program order.
Then A must execute before B if:
- A is a conditional branch and B is a write
instruction controlled by A;
- There is a dependency from A to B;
- There is a dependency from A to a write that is
forwarded to B;
- B is a write and A is the source of an
address dependency to a memory access instruction between them;
- B is a write and A accesses the same address
as B;
- A and B are reads of the same address, and
B does not read from the same write as A and
is not forwarded from a write that is between them;
- A is a write and B is a read of the same address,
and B is not forwarded from A or from a write
that is between them;
- B is a barrier and A is in its pre-set;
- A is a barrier and B is in its post-set; or
- A and B are both atomic RMW instructions.
(Taken together, requirements 8 and 9 say that A must
execute before B whenever they are separated in program order
by a suitable barrier, such that A is in the barrier's pre-set
and B is in the barrier's post-set.)
These requirements were expressed almost verbatim in the
original version of the strong kernel memory model.
Together with the description of how the memory subsystem works,
they suffice to guarantee that the four coherence rules will always be obeyed.
The resulting model applied quite nicely to x86, Sparc, and PowerPC;
however, it was not accurate for some other architectures.
Adjustments for the DEC Alpha
Alpha deviates from the operational model described above in one
very significant way: It uses a split data cache.
In terms of the memory model, this means that writes which propagate
to a CPU in one order might be perceived by that CPU in the opposite
order.
For example, suppose that P0 writes to both x and y,
and the writes propagate in that order to P1.
If x and y happen to be located in different cache lines
and the two cache lines are handled by different parts of P1's data cache,
it may happen that the part of the cache responsible for handling the
write to x is busy while another part of the cache is able to
handle the write to y right away.
P1's CPU would then see y's new value before seeing x's,
a result that would not be allowed by the memory model as described above.
Here's an example litmus test to illustrate the point.
Strong Model Litmus Test #2
1 C alpha-split-cache-example1
2 {
3 int u = 0;
4 int v = 0;
5 int *p = &u;
6 }
7
8 P0(int **p, int *v)
9 {
10 WRITE_ONCE(*v, 1);
11 smp_mb();
12 WRITE_ONCE(*p, v);
13 }
14
15 P1(int **p)
16 {
17 int *r1;
18 int r2;
19
20 r1 = READ_ONCE(*p);
21 r2 = READ_ONCE(*r1);
22 }
23
24 exists (1:r1=v /\ 1:r2=0);
The smp_mb() in P0 forces the write to v to propagate
to P1 before the write to p, and
the address dependency from P1's read of p to its read of
*r1 forces these reads to be executed in program order.
Nevertheless, the split-cache arrangement may cause P1 to see the
new value of p (i.e., &v) and then the
old value of v (i.e., 0).
This odd behavior can be observed on real Alpha hardware,
and it shows up when we run the litmus test through the model:
Outcome for Strong Model Litmus Test #2
1 Test alpha-split-cache-example1 Allowed
2 States 3
3 1:r1=u; 1:r2=0;
4 1:r1=v; 1:r2=0;
5 1:r1=v; 1:r2=1;
6 Ok
7 Witnesses
8 Positive: 1 Negative: 2
9 Condition exists (1:r1=v /\ 1:r2=0)
10 Observation alpha-split-cache-example1 Sometimes 1 2
11 Hash=b73c984509551a6a5ffe49d86c9a2d04
Inserting a call to smp_read_barrier_depends() (see line 21):
Strong Model Litmus Test #3
1 C alpha-split-cache-example2
2 {
3 int u = 0;
4 int v = 0;
5 int *p = &u;
6 }
7
8 P0(int **p, int *v)
9 {
10 WRITE_ONCE(*v, 1);
11 smp_mb();
12 WRITE_ONCE(*p, v);
13 }
14
15 P1(int **p)
16 {
17 int *r1;
18 int r2;
19
20 r1 = READ_ONCE(*p);
21 smp_read_barrier_depends();
22 r2 = READ_ONCE(*r1);
23 }
24
25 exists (1:r1=v /\ 1:r2=0);
prevents the unwanted result:
Outcome for Strong Model Litmus Test #3
1 Test alpha-split-cache-example2 Allowed
2 States 2
3 1:r1=u; 1:r2=0;
4 1:r1=v; 1:r2=1;
5 No
6 Witnesses
7 Positive: 0 Negative: 2
8 Condition exists (1:r1=v /\ 1:r2=0)
9 Observation alpha-split-cache-example2 Never 0 2
10 Hash=9dbdccdc417b3ece717775094d822a49
The
smp_read_barrier_depends() call forces P1 to wait until
all writes that have already propagated to its cache have been fully handled
and are available for reading.
Thus, the new value of
v is always visible to P1 whenever it
sees
&v in
p.
In order to accommodate Alpha's unique behavior, we modified the memory model
to include a delay between the time when a write propagates to a CPU
and the time when the memory subsystem can use that write to satisfy
a read request.
Furthermore, instead of assuming a simple split, the model allows
the cache to be completely fragmented, with an independent segment for
each memory address.
The model introduces the notion of a horizon time.
For each processor Pn, any memory address A, and any time t,
the horizon time horiz(Pn,A,t)
is the time h (at or before t)
such that if Pn were to submit a read request for address A
to the memory subsystem at time t,
the response would be the coherence-latest write that executed on Pn
or that propagated to Pn before time h.
Writes to A that propagate to Pn after time h
are considered still to be “below the horizon” at time t,
and so are not visible and cannot be used for satisfying reads.
(This restriction does not apply to writes executed by Pn itself;
each processor can see its own writes at any time.)
The memory model requires that for each Pn and each address A,
the value of horiz(Pn,A,t) must never decrease as t increases.
This means a write cannot fall back below the horizon after
it has become visible; otherwise we could have a violation of the
read-read coherence rule.
The model also requires that every memory barrier behave like
smp_read_barrier_depends(), in that it forces the CPU to wait
for all writes that propagated to the CPU before the barrier committed
to become visible.
In other words, if Pn executes a memory barrier at time t then
it will not execute any read instructions in the barrier's post-set until
time t', where horiz(Pn,A,t') > t for all addresses A.
If the barrier is a strong one, the CPU is required to wait for all
writes that propagated to the CPU before the barrier was acknowledged
to become visible.
For any read instruction R, we naturally go on to define
horiz(R) (the horizon time for R)
to be horiz(Pn,A,exec(R)) where
Pn is the processor that executes R,
A is the address that R accesses,
and exec(R) is the time when R executes (is last satisfied).
Thus, R will read from the coherence-latest write that
has been executed by Pn or has propagated to Pn before horiz(R)
(and of course, it is always true that horiz(R) ≤ exec(R)).
Adding the concept of horizon times complicates the ordering of reads.
When we say that Pn orders read A before read B,
we could now mean any of four things:
- exec(A) ≤ horiz(B);
- horiz(A) ≤ horiz(B);
- exec(A) ≤ exec(B); or
- horiz(A) ≤ exec(B).
For example, ordering requirement 6 is the case where
A
and
B read from different writes to the same address, and
B is not forwarded from a write that is between them.
In this situation, the ordering requirement states that the CPU must
execute
B after it executes
A, so
exec(A) ≤ exec(B).
But the read-read coherence rule says that the write which
B
reads from must come later in the coherence order
than the write which
A reads from.
If
B's write had propagated to the CPU before
horiz(A)
then
A would have read from it, since reads take their value
from the coherence-latest write available.
Hence the write must have propagated to the CPU at a time after
horiz(A) but before
horiz(B) (otherwise
B
would not have read from it).
This means it must also be true that
horiz(A) ≤ horiz(B);
thus ordering requirement 6 imposes two of the four ordering relations.
The first ordering relation above is the strongest;
it implies each of the others.
Under most circumstances a read's horizon time is more useful
than its execution time, so we will take the second alternative
to be the standard meaning for ordering of reads.
Nevertheless, the third alternative has its uses.
Most notably, it is the ordering imposed by the CPU when there is
an address dependency from one read to another but no memory barrier
separating them.
The fourth alternative is not used in the memory model.
The extended memory model reduces to the original
if we assume the horizon times for all memory addresses are always equal,
that is, horiz(Pn,A,t) is independent of A.
Under this assumption the extended model will allow the same set of behaviors
as the original,
which is reassuring; it means the model still applies as before
to architectures other than Alpha.
Adjustments for ARM
Unfortunately, the memory model as developed above is not a very good fit
for the ARM architecture.
The published memory models for ARMv8 differ in a number of important respects
from the model we have described so far.
Most of the differences involve the memory subsystem,
and most of the differences that affect the processor subsystem
are concerned with how it interacts with the memory subsystem.
The difference with perhaps the most widespread ramifications involves
how the memory subsystem responds to read requests.
Earlier we said that the response would be the value stored by the
coherence-latest write that has propagated to the CPU making the request,
because otherwise the read might observe a coherence-earlier value than
a po-earlier read of the same address did,
violating the read-read coherence rule.
But this requirement is stronger than necessary; all we really need
is that the response to a read request should be coherence-later than
(or the same as) any po-earlier read responses or committed writes
for the same address.
It doesn't have to be the very latest write available,
and on ARM it often isn't.
Furthermore, the ARM memory model does not include any feature
analogous to acknowledging a strong memory barrier.
Instructions following such a barrier can be executed as soon as
the barrier has been committed.
These two facts have some rather subtle effects on the ordering properties
of memory accesses.
For example, either one of them invalidates the reasoning we used
when analyzing
Strong Model Litmus Test #1
above.
The test remains forbidden even on ARM, but not for the reasons we gave.
Instead, the ARM memory model guarantees that
if a write W reaches the coherence point before a strong
(smp_mb()) barrier, then the response to any read that is
po-after the barrier and targets the same address as W will be
the value stored by W or a coherence-later write.
(No similar guarantee is made by the PowerPC-based memory model
presented above,
which is an indication of how much ARM differs from PowerPC.)
This is enough to show that the “exists” clause in
Strong Model Litmus Test #1
will never be satisfied.
When the test is executed, one of the two memory barriers must reach
the coherence point before the other.
Suppose the barrier in P0 gets there first.
Since P0's write to x is in the barrier's pre-set,
it will reach the coherence point before the barrier does
and hence before P1's barrier does.
Thus P1's read of x, which is po-after the barrier,
is guaranteed to see P0's write (there aren't any coherence-later
writes to x in the test program),
and so r2 will end up equal to 1, not 0.
As before, the opposite case is symmetrical.
Another difference concerns the way CPUs execute writes.
On ARM, two writes to the same address are permitted to commit
out of program order.
Earlier we said that if this happened, it would cause the memory subsystem
to put the po-later write earlier in the coherence order,
thereby violating the write-write coherence rule.
ARM gets around this problem in a very straightforward way:
When a write W commits after a po-later write W'
to the same address, the CPU simply skips sending W
to the memory subsystem!
As a result, W never gets assigned an explicit location
in the coherence order (effectively, it ends up ordered immediately before
the next write, in program order, to the same address),
it never reaches the coherence point,
and it never becomes visible to any other CPUs.
We say that W has been obscured by W',
or more colloquially, erased.
Its effects don't disappear entirely, because the value stored by
W can still be forwarded to reads that lie between W
and W'.
But this is the next best thing;
from a system-wide standpoint, the end result is practically the same
as if W' had followed so closely on the heels of W that
W was overwritten before any other CPU had a chance to
read from it.
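The effect of obscuring on the coherence order can be sketched as follows (a toy encoding of our own; the pair representation is invented for this illustration):

```python
# Sketch of write "obscuring" on ARM (illustrative only).
# commits: writes to one address, as (po_index, value) pairs, listed in
# the order the CPU commits them.  A write W is obscured if a po-later
# write W' to the same address has already committed; the CPU then
# simply never sends W to the memory subsystem.
# (Forwarding of W's value to reads between W and W' is not modeled.)

def coherence_order(commits):
    sent = []
    latest_po_committed = -1
    for po_index, value in commits:
        if po_index < latest_po_committed:
            continue                    # W is obscured: never reaches memory
        latest_po_committed = po_index
        sent.append(value)
    return sent
```

If the two writes commit in program order, both enter the coherence order; if the po-later write commits first, the earlier one is erased.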
There are certain circumstances in which a write W
cannot be obscured.
For example, if the CPU encounters a memory barrier that orders W
before some other write W' to the same address,
then W' cannot commit until after the barrier does,
and the barrier cannot commit until after W does,
so the writes cannot commit out of program order
and W' will not obscure W.
Also, the writes associated with smp_store_release(),
rcu_assign_pointer(), or atomic RMW instructions
are not allowed to be obscured.
More trivially, W will not be obscured if there are no
po-later writes to the same address in its process to obscure it.
Taking all of this into account will complicate the final memory model,
as you might imagine.
There are some other, less important, differences in the operation
of the CPU subsystem in the ARM model:
- When an smp_store_release() instruction is committed,
the CPU does not issue a barrier request followed by a write request;
instead it issues a write request that is specially marked
as being a store-release.
- Similarly, an smp_load_acquire() instruction gives rise to
a single read request that is specially marked as being a load-acquire.
- A read instruction that is po-after an smp_load_acquire()
is not obliged to wait until the load-acquire instruction has been
committed; the CPU is allowed to issue the read's request any time
after the load-acquire request has been issued, and the two of them
may be satisfied out of order.
- An smp_rmb() instruction does not act entirely within the CPU;
it causes the CPU to issue a barrier request to the memory subsystem.
- The rules for restarting a read instruction after a po-earlier
instruction has accessed the same address are slightly looser;
the read does not need to be restarted if it was issued after the
po-earlier instruction was.
- A write instruction that accesses the same address as a po-earlier read
may be committed before the read is committed, provided the read has
already been issued and the CPU knows that it will not be restarted.
To better understand the ARM model,
we must examine how the memory subsystem works in some detail.
It is more highly structured than the memory subsystem in the PowerPC model,
consisting of a hierarchical arrangement of buffers lying between
the CPUs at the top and the memory at the bottom.
There is a buffer immediately below each CPU; these feed down into
some buffers below them, and then buffers below those, and so on,
down to the lowermost buffer, which feeds into memory.
The coherence point is the place at the bottom of the lowest buffer.
For example, a four-processor system would have four buffers at
the topmost layer, then at the next layer there might be
a buffer below CPUs 0 and 1 and another below CPUs 2 and 3,
and a single buffer below all the CPUs
in the final layer, as shown in this figure (panel A):

Other arrangements of buffers (such as that in panel B) are possible,
provided they follow the general hierarchical arrangement:
buffers always feed down, never up;
a buffer can receive input from multiple buffers above it but can
provide output only to a single buffer below; and there is a single
lowermost buffer which all the others eventually lead to.
The memory models do not specify the buffer sizes or topology,
and in practice you cannot even rely on the topology remaining unchanged
over time, because the scheduler can migrate a process from one CPU to another
thereby altering the arrangement of the buffers below that process.
(The essential difference between the Flowing and POP models is that
the Flowing model assumes a fixed buffer topology,
whereas the POP model does not keep explicit track of the buffers
and thus is compatible with any arrangement.
The POP model is more general, but the Flowing model is easier to
reason about.)
The main point of this design is that the memory subsystem does not provide
the response to a read request immediately.
Issuing a read request and receiving the response (which is then used
to satisfy the read) are two separate events, and the CPU is free
to work on other instructions in between.
When a CPU issues a write, barrier, or read request, the request
enters the CPU's buffer at the top, flows through the buffer and then
down into the top of the buffer below, and so on, eventually passing
out the bottom of the lowermost buffer, to memory.
Thus, the coherence order is simply the order in which write requests
flow down to memory.
When a write request reaches memory, the value in the request gets
stored at the write's target address.
When a barrier request reaches memory, its job is finished and it disappears.
And when a read request reaches memory, a response is generated using
the value held in memory at the read's target address.
However, a response to a read request may also be generated before
the request reaches memory, while it is still flowing through a buffer.
If the request immediately below the read is a write to the same address,
the memory subsystem can respond to the read using the value stored
by the write.
When this happens, the memory subsystem deletes the read request,
but it keeps track of the fact that the write request was used
to satisfy the read.
(Exception: a load-acquire request is not allowed to be satisfied by
a store-release request while still in a buffer.
The only way for a load-acquire instruction to read the value stored
by a store-release instruction is for the load-acquire request to flow
all the way down to memory.)
The flow of requests down through a buffer is not always
First-In-First-Out.
Subject to certain restrictions, a request is allowed to exchange
places with the request immediately below it (we say it passes
the lower request).
The complete list of restrictions is rather elaborate;
among the most important ones are:
- A read or write request may not pass another read or write
with the same target address.
- No barrier request may pass another barrier request.
- No request may pass an smp_mb() barrier or vice versa.
- An smp_wmb() barrier request may not pass a write request
from the same CPU, and it may not be passed by any write request.
- An smp_rmb() barrier request may not pass a write request
that was used to satisfy a read from the barrier's CPU,
and it may not be passed by a write or read request from the same
CPU or by a write request that was used to satisfy such a read.
- A store-release request may not pass any other request.
- A load-acquire request may not be passed by any read request
from the same CPU or by a write request that was used to satisfy
such a read.
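A few of these restrictions can be encoded as a simple predicate. The sketch below is our own; the dictionary representation of requests is invented for illustration, and only the first three rules above are modeled (the wmb, rmb, and release/acquire rules involve per-CPU and satisfied-read state that is omitted here):

```python
# may_pass(a, b): may request a, immediately above b in a buffer,
# exchange places with (pass) b?  Requests are dicts with keys:
#   kind    - 'read', 'write', or 'barrier'
#   addr    - target address for reads/writes, else None
#   barrier - 'mb', 'wmb', 'rmb', or None

def may_pass(a, b):
    if a["kind"] in ("read", "write") and b["kind"] in ("read", "write"):
        if a["addr"] == b["addr"]:
            return False      # same-address accesses may not pass each other
    if a["kind"] == "barrier" and b["kind"] == "barrier":
        return False          # no barrier request may pass another barrier
    if "mb" in (a.get("barrier"), b.get("barrier")):
        return False          # nothing passes smp_mb(), or vice versa
    return True
```

For example, a write may pass another write to a different address, but nothing may pass a strong barrier.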
A write is said to propagate to a CPU when its request flows into
a buffer below that CPU, because before that time there is no way for the CPU
to read the value of the write, and afterward it is possible for a read
request issued by the CPU to be satisfied by the write request,
whether in a buffer or in memory.
This picture explains why a read might not be satisfied by
the coherence-latest write to have propagated to the read's CPU.
A read request R for the value of x, for example,
might not be satisfied until it reaches memory and obtains an old value,
even though a write request W containing a new value for x
may already have flowed down to a buffer below R's CPU.
Provided that W ends up higher than R in the
chain of buffers leading from the CPU to memory,
R is unable to read the value stored by W:
The write request can't pass the read request
because they have the same target address.
Thus R ends up being satisfied by the earlier value of x
even though the coherence-later value in W had already
propagated to R's CPU by that time.
Now we can also understand how the ARM memory model enforces
the guarantee mentioned above.
Suppose F is a strong fence and R is a read that is
po-after F.
Suppose also that W is a write to the same address as R
and W reaches the coherence point before F does.
Then W must flow down to memory before F, and since
R cannot pass F, it cannot reach memory before W.
Thus, if R is satisfied from memory then it must read the value
stored by W or a coherence-later write.
But what if R is satisfied while it is still in a buffer?
Let W' be the write request that satisfies R.
Since it is immediately below R in the buffer at the time
that R is satisfied, it must also be above F.
And since W' cannot pass F, it must reach memory
after W, which means it must be coherence-later than W.
The case where R is not issued at all but is forwarded from a
po-earlier write is left to the reader.
Regardless, no matter how things work out, in the end R will
read from W or a coherence-later write, as guaranteed.
Just as with the Alpha, the fact that reads are issued and satisfied
at different times leads to an ambiguity when we want to order read
instructions.
If we say that instruction A is ordered before B,
where one or both is a read, we could mean:
where
issue(A) is the time when
A's read request is
issued to the memory subsystem, and
exec(A) is the time when
A is executed (which is the time when the read response is
received, if
A is a non-forwarded read).
Adjustments for other architectures
Currently there are none.
This may change in the future as we become aware of the individual
requirements of other CPU families.
Design of the strong model
The following litmus tests illustrate the ideas behind A- and B-cumulativity
and release sequences.
Strong Model Litmus Test #4
1 C C-wmb-is-not-A-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 WRITE_ONCE(*x, 1);
9 }
10
11 P1(int *x, int *y)
12 {
13 r1 = READ_ONCE(*x);
14 smp_wmb();
15 WRITE_ONCE(*y, 1);
16 }
17
18 P2(int *x, int *y)
19 {
20 r2 = READ_ONCE(*y);
21 smp_rmb();
22 r3 = READ_ONCE(*x);
23 }
24
25 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
This test's
exists clause can be satisfied.
Even though
P0's write to
x propagates to
P1
before the
smp_wmb() barrier commits,
as proved by the fact that
r1=1 at the end,
the write to
x is not in the barrier's pre-set,
because
smp_wmb() is not A-cumulative.
As a result, the barrier and the write to
y are allowed to
propagate to
P2 before the write to
x does.
In the end, this is just a fancy way of saying that
smp_wmb()
doesn't order two writes if the first write was executed on a different
CPU from that which executed the
smp_wmb().
Strong Model Litmus Test #5
1 C C-release-is-A-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 WRITE_ONCE(*x, 1);
9 }
10
11 P1(int *x, int *y)
12 {
13 r1 = READ_ONCE(*x);
14 smp_store_release(y, 1);
15 }
16
17 P2(int *x, int *y)
18 {
19 r2 = READ_ONCE(*y);
20 smp_rmb();
21 r3 = READ_ONCE(*x);
22 }
23
24 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
By contrast, this test's
exists clause cannot be satisfied.
Because
smp_store_release() is A-cumulative and because
P0's write to
x propagates to
P1
before the
smp_store_release() commits, the write is in the
release barrier's pre-set.
Consequently the new value of
x must propagate to
P2
before the store to
y can.
Since
P2 is forced to read
x after reading
y
(by the
smp_rmb()), and since it sees the new value of
y, it must also see the new value of
x.
Strong Model Litmus Test #6
1 C C-wmb-is-B-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_wmb();
10 WRITE_ONCE(*y, 1);
11 }
12
13 P1(int *y, int *z)
14 {
15 r1 = READ_ONCE(*y);
16 WRITE_ONCE(*z, r1);
17 }
18
19 P2(int *x, int *z)
20 {
21 r2 = READ_ONCE(*z);
22 smp_rmb();
23 r3 = READ_ONCE(*x);
24 }
25
26 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
B-cumulativity refers to writes that occur after the barrier.
Even though we don't tend to think of
smp_wmb() as ordering
writes carried out by other CPUs, in every Linux-supported architecture
for which we know the details, it does, courtesy of the fact that
smp_wmb() is B-cumulative.
In this example, the write to x, the smp_wmb() barrier,
and the write to y must propagate from P0 to P1
in order.
The data dependency from P1's read of y to its write
of z forces the write to occur after the new value of y
has been seen, and hence after the smp_wmb() barrier has propagated
to P1.
As a result, the write to z is in the barrier's post-set,
so the barrier must propagate to P2 before the write can.
That's what being B-cumulative means.
Before P2 can read the new value of z, the barrier and
hence the new value of x must have propagated there.
Therefore P2's read of x must see the new value,
and so the exists clause cannot be satisfied.
Strong Model Litmus Test #7
1 C C-release-is-not-B-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 }
11
12 P1(int *y, int *z)
13 {
14 r1 = READ_ONCE(*y);
15 WRITE_ONCE(*z, r1);
16 }
17
18 P2(int *x, int *z)
19 {
20 r2 = READ_ONCE(*z);
21 smp_rmb();
22 r3 = READ_ONCE(*x);
23 }
24
25 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
As before, the write to
x, the
smp_store_release()'s
barrier, and the write to
y must propagate from
P0
to
P1 in order, and so the write to
z must occur
after the barrier has reached
P1.
Nevertheless, because
smp_store_release() is not B-cumulative,
the write to
z isn't in the barrier's post-set,
and so the new value of
z is allowed to propagate to
P2
before either the barrier or the new value of
x.
Consequently it is possible for
P2 to read the new value of
z followed by the old value of
x.
Despite what the previous
example shows, in this memory model
smp_store_release() is B-cumulative along pathways where
it is read by smp_load_acquire().
The following example illustrates this point.
Strong Model Litmus Test #8
1 C C-release-acquire-is-B-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 }
11
12 P1(int *y, int *z)
13 {
14 r1 = smp_load_acquire(y);
15 WRITE_ONCE(*z, 1);
16 }
17
18 P2(int *x, int *z)
19 {
20 r2 = READ_ONCE(*z);
21 smp_rmb();
22 r3 = READ_ONCE(*x);
23 }
24
25 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
In this litmus test,
P0's
smp_store_release() is read
by
P1's smp_load_acquire().
As a result, the release barrier acts B-cumulatively and so
P1's
write of
z cannot propagate to
P2 until the barrier has.
Hence it is not possible for
P2 to read the new value of
z
followed by the old value of
x.
Note that this applies only along the pathway of the
smp_load_acquire().
This example:
Strong Model Litmus Test #9
1 C C-release-B-cumulative-only-on-acquire-path.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 }
11
12 P1(int *y, int *z)
13 {
14 r1 = READ_ONCE(*y);
15 WRITE_ONCE(*z, r1);
16 }
17
18 P2(int *x, int *z)
19 {
20 r2 = READ_ONCE(*z);
21 smp_rmb();
22 r3 = READ_ONCE(*x);
23 }
24
25 P3(int *y)
26 {
27 r4 = smp_load_acquire(y);
28 }
29
30 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0 /\ 3:r4=1)
is the same as the
Strong Model Litmus Test #7
example above, except that it has a fourth thread P3 which uses
smp_load_acquire() to read the value of
y stored
by
P0.
However, this interaction does not cause the release barrier's effect
on
P1 to be B-cumulative;
the
exists clause is still allowed to succeed.
(The barrier's effect on P3
is B-cumulative, but the example
does not probe this fact.)
The following litmus test shows a non-trivial use of a release sequence.
Strong Model Litmus Test #10
1 C C-relseq.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 WRITE_ONCE(*y, 2);
11 }
12
13 P1(int *y)
14 {
15 r1 = xchg_relaxed(y, 3);
16 }
17
18 P2(int *x, int *y)
19 {
20 r2 = READ_ONCE(*y);
21 smp_rmb();
22 r3 = READ_ONCE(*x);
23 }
24
25 exists (1:r1=2 /\ 2:r2=3 /\ 2:r3=0)
The release sequence headed by
P0's
smp_store_release()
to
y includes
WRITE_ONCE(*y, 2), because that
write is po-after the store-release.
It also includes the atomic
xchg_relaxed() operation in
P1,
because that operation accesses
y and reads from a write in the
release sequence (the
WRITE_ONCE()).
Consequently
P1's atomic write belongs to the release barrier's
post-set, and it cannot propagate to
P2 before the
barrier and the write to
x do.
Note that smp_store_release() barriers do not become B-cumulative
along paths where smp_load_acquire() reads from an arbitrary
member of the release sequence,
but only when the load-acquire reads directly from the store-release itself.
This is illustrated by the following example.
Strong Model Litmus Test #11
1 C C-relseq.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 WRITE_ONCE(*y, 2);
11 }
12
13 P1(int *y)
14 {
15 r1 = xchg_relaxed(y, 3);
16 }
17
18 P2(int *y, int *z)
19 {
20 r2 = smp_load_acquire(y);
21 WRITE_ONCE(*z, 1);
22 }
23
24 P3(int *x, int *z)
25 {
26 r3 = READ_ONCE(*z);
27 smp_rmb();
28 r4 = READ_ONCE(*x);
29 }
30
31 exists (1:r1=2 /\ 2:r2=3 /\ 3:r3=1 /\ 3:r4=0)
Here the
smp_load_acquire() in P2 reads from the
xchg_relaxed() in P1, which is part of the
smp_store_release()'s release sequence as before.
But because it does not read directly from the
smp_store_release()
instruction, it does not cause the barrier to act B-cumulatively.
Hence P2's write to
z is allowed to propagate to P3 before
P0's release barrier or write to
x.
The full
Bell
file for Alan Stern's strong model
(strong-kernel.bell)
is as follows:
1 "Linux kernel strong memory model"
2
3 (* Copyright (C) 2016 Alan Stern <stern@rowland.harvard.edu> *)
4
5 let RMW = domain(rmw) | range(rmw)
6
7 enum Accesses = 'once (*READ_ONCE,WRITE_ONCE,ACCESS_ONCE*) ||
8 'release (*smp_store_release*) ||
9 'acquire (*smp_load_acquire*) ||
10 'assign (*rcu_assign_pointer*) ||
11 'deref (*rcu_dereference*) ||
12 'lderef (*lockless_dereference*)
13 instructions R[{'once,'acquire,'deref,'lderef}]
14 instructions W[{'once,'release,'assign}]
15 instructions RMW[{'once,'acquire,'release}]
16
17 enum Barriers = 'wmb (*smp_wmb*) ||
18 'rmb (*smp_rmb*) ||
19 'mb (*smp_mb*) ||
20 'rb_dep (*smp_read_barrier_depends*) ||
21 'rcu_read_lock (*rcu_read_lock*) ||
22 'rcu_read_unlock (*rcu_read_unlock*) ||
23 'sync (*synchronize_rcu*)
24 instructions F[Barriers]
25
26 let rmb = fencerel(Rmb) & (R*R)
27 let wmb = fencerel(Wmb) & (W*W)
28 let mb = fencerel(Mb)
29 let sync = (po & (_ * Sync)) ; (po?)
30
31 let acq-po = po & (Acquire*_)
32 let po-rel = po & (_*Release)
33 let po-assign = po & (_*Assign)
34
35 let rb-dep = fencerel(Rb_dep) & (R*R)
36 let deref-po = po & (Deref*M)
37 let lderef-po = po & (Lderef*M)
38
39 let rd-dep-fence = rb-dep | deref-po | lderef-po
40 let exec-order-fence = rmb | acq-po
41 let weak-fence = wmb
42 let medium-fence = po-rel | po-assign
43 let strong-fence = mb | sync
44
45 let transitive-fence = strong-fence | medium-fence
46 let propagation-fence = transitive-fence | weak-fence
47 let ordering-fence = propagation-fence | exec-order-fence
48
49 (* Compute matching pairs of nested Rcu_read_locks and Rcu_read_unlocks *)
50 let matched = let rec
51 unmatched-locks = Rcu_read_lock \ domain(matched)
52 and unmatched-unlocks = Rcu_read_unlock \ range(matched)
53 and unmatched = unmatched-locks | unmatched-unlocks
54 and unmatched-po = (unmatched * unmatched) & po
55 and unmatched-locks-to-unlocks = (unmatched-locks *
56 unmatched-unlocks) & po
57 and matched = matched | (unmatched-locks-to-unlocks \
58 (unmatched-po ; unmatched-po))
59 in matched
60
61 (* Validate nesting *)
62 flag ~empty Rcu_read_lock \ domain(matched) as unbalanced-rcu-locking
63 flag ~empty Rcu_read_unlock \ range(matched) as unbalanced-rcu-locking
64
65 (* Outermost level of nesting only *)
66 let crit = matched \ (po^-1 ; matched ; po^-1)
Taking this one piece at a time:
- Bell File: Memory Accesses.
- Bell File: Barriers.
-
Bell File: Relating Barriers and Memory Accesses.
-
Bell File: Relating One-Sided Barriers and Memory Accesses.
-
Bell File: Classes of Fences.
-
Bell File: RCU Read-Side Critical Sections.
The “"Linux kernel strong memory model"” is a name
that has no effect on the model's meaning.
The following portion of the Bell file defines the
types of memory accesses, which correspond to the Linux kernel's
READ_ONCE(),
WRITE_ONCE(),
ACCESS_ONCE(),
smp_store_release(),
smp_load_acquire(),
rcu_assign_pointer(),
rcu_dereference(), and
lockless_dereference() primitives:
5 let RMW = domain(rmw) | range(rmw)
6
7 enum Accesses = 'once (*READ_ONCE,WRITE_ONCE,ACCESS_ONCE*) ||
8 'release (*smp_store_release*) ||
9 'acquire (*smp_load_acquire*) ||
10 'assign (*rcu_assign_pointer*) ||
11 'deref (*rcu_dereference*) ||
12 'lderef (*lockless_dereference*)
13 instructions R[{'once,'acquire,'deref,'lderef}]
14 instructions W[{'once,'release,'assign}]
15 instructions RMW[{'once,'acquire,'release}]
The “enum Accesses” statement defines the
types of memory references, corresponding to the C functions listed
in the comments.
These correspondences are defined in herd's
linux.def
macro file.
The “instructions R” statement identifies which of the above
types of memory references may be associated with a read instruction,
“instructions W” identifies which may be associated
with a write instruction, and “instructions RMW”
identifies which may be associated with a read-modify-write instruction.
For example, the association of 'acquire with the R
set of instructions corresponds to the Linux kernel's
smp_load_acquire() primitive.
The predefined “rmw” relation links the read and the
write component of each read-modify-write (RMW) operation.
The “let RMW” statement thus creates the
set of all read and write events corresponding to an RMW instruction
in the litmus-test program.
(This shouldn't be necessary, because the RMW
set is supposed to be built into herd.
However at the time this memory model was created,
the implementation of the RMW set wasn't working,
so it was necessary to define the set manually.)
Note well that the above code simply defines names for the Linux-kernel
memory-access primitives.
The herd tool also uses this code to check the instruction
annotations in the linux.def file,
for example, __load{acquire} is legal but __load{release}
is not.
Later code in both the Bell and Cat files will define their effect on
memory ordering.
The next portion of the Bell file defines the
types of barrier-like constructs, namely
smp_wmb(),
smp_rmb(),
smp_mb(),
smp_read_barrier_depends(),
rcu_read_lock(),
rcu_read_unlock(), and
synchronize_rcu().
17 enum Barriers = 'wmb (*smp_wmb*) ||
18 'rmb (*smp_rmb*) ||
19 'mb (*smp_mb*) ||
20 'rb_dep (*smp_read_barrier_depends*) ||
21 'rcu_read_lock (*rcu_read_lock*) ||
22 'rcu_read_unlock (*rcu_read_unlock*) ||
23 'sync (*synchronize_rcu*)
24 instructions F[Barriers]
The “enum Barriers” defines the types of barriers
corresponding to the C functions listed in the comments
(as set up in the
linux.def
macro file).
The “instructions F[Barriers]” says that these
types may be used in various sorts of barrier instructions.
Quick Quiz 1:
Given that this is about memory barriers, why
“instructions F[Barriers]” instead of perhaps
“instructions B[Barriers]”?
Answer
As with the memory accesses, the above code only defines names.
These barrier-like constructs' ordering properties will be defined by
later code in the Bell and Cat files.
The next portion of the Bell file defines the relation between a given
barrier-like construct and its process's surrounding memory accesses:
26 let rmb = fencerel(Rmb) & (R*R)
27 let wmb = fencerel(Wmb) & (W*W)
28 let mb = fencerel(Mb)
29 let sync = (po & (_ * Sync)) ; (po?)
The standard library's fencerel(S) function returns
a relation containing
all pairs of events in which the first event precedes (in program order)
an event in the S set
(for example, an “Rmb” event
in the case of line 26 above)
and the second follows it, with all three events being in the same thread.
As an example, the following snippet:
r1 = READ_ONCE(*x);
smp_rmb();
r2 = READ_ONCE(*y);
smp_mb();
WRITE_ONCE(*z, r3);
would produce an “
rmb” relation containing only one link:
- r1 = READ_ONCE(*x) ⟶
r2 = READ_ONCE(*y)
and an “
mb” relation containing three links:
- r1 = READ_ONCE(*x) ⟶
WRITE_ONCE(*z, r3),
- smp_rmb() ⟶
WRITE_ONCE(*z, r3), and
- r2 = READ_ONCE(*y) ⟶
WRITE_ONCE(*z, r3).
The “
rmb” relation doesn't include the other possible
links because of the
“
& (R*R)” clause in its definition,
which intersects the full
fencerel(Rmb) relation
with the relation containing all pairs of reads (
R*R).
This is appropriate because
smp_rmb() orders only reads,
not writes.
(For database programming fans,
the “&” operator can be thought of
as doing a database full equijoin operation, so that the result
is only those elements that appear in both operands.
Similarly, the “*” operator can be thought of as a
database unconstrained join operation, in this case
providing all combinations of pairs of read events.
Later on, we will encounter operations that cannot be easily
represented by
SQL,
so we will shift to the notation used
for mathematical sets.)
The “let wmb = fencerel(Wmb) & (W*W)” definition
acts similarly, but it extracts pairs of writes rather than reads, as required
for smp_wmb().
The “let mb = fencerel(Mb)” definition
keeps all events in the fencerel(Mb) relation,
as required for smp_mb().
(It even keeps events that don't correspond to memory accesses,
such as the smp_rmb() event in the above example,
although they are irrelevant here.)
Finally, the “let sync = (po & (_ * Sync)) ; (po?)”
definition uses a modified formula in place of
“fencerel(Sync)”.
It is different from the others in that it also includes pairs
where the second event is the synchronize_rcu() call
rather than something following it.
Otherwise it is like the definition of the mb relation.
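These relational definitions map naturally onto ordinary set operations. The following sketch (our own Python encoding, not herd itself; the event names are invented) reproduces the rmb and mb links for the five-instruction snippet above:

```python
from itertools import combinations

# Events of the snippet, in program order (one thread).
events = ["Rx", "rmb", "Ry", "mb", "Wz"]
R = {"Rx", "Ry"}                      # the read events

# po: all pairs (a, b) with a preceding b in program order.
po = {(events[i], events[j])
      for i, j in combinations(range(len(events)), 2)}

def fencerel(fence):
    """Pairs (a, b) with a po-before the fence and b po-after it."""
    return {(a, b) for (a, b) in po
            if (a, fence) in po and (fence, b) in po}

rmb = fencerel("rmb") & {(a, b) for a in R for b in R}   # & (R*R)
mb = fencerel("mb")                                      # no intersection
```

Evaluating this yields exactly the one rmb link and three mb links listed above.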
Quick Quiz 2:
Why wouldn't “let sync = fencerel(Sync)”
work just as well as the modified definition?
Answer
This portion of the Bell file relates smp_rmb(),
smp_wmb(), smp_mb(), and synchronize_rcu()
to the surrounding code within a given process, but says nothing about
cross-process ordering properties, which will be defined in later
Bell and Cat code.
The next portion of the Bell file defines some relations involving
“one-sided” barriers
(smp_load_acquire(),
smp_store_release(),
rcu_assign_pointer(),
smp_read_barrier_depends(),
rcu_dereference(), and
lockless_dereference())
and their surrounding instructions:
31 let acq-po = po & (Acquire*_)
32 let po-rel = po & (_*Release)
33 let po-assign = po & (_*Assign)
34
35 let rb-dep = fencerel(Rb_dep) & (R*R)
36 let deref-po = po & (Deref*M)
37 let lderef-po = po & (Lderef*M)
The “acq-po” line defines the relation
appropriate for smp_load_acquire() operations.
This is the intersection of the program order (po) relation with
the set of all pairs of events in which the first is an Acquire
and the second can be anything (the “_” wildcard).
The “po-rel” definition works quite similarly, but with
prior memory accesses rather than subsequent ones and with
releases rather than acquires.
The “po-assign” definition works the same as
“po-rel”, but for rcu_assign_pointer()
rather than smp_store_release().
Consider the following example containing code fragments
running on two threads, where x, y, and
z are all initially zero:
Thread 0 Thread 1
-------- --------
WRITE_ONCE(*x, 1); r2 = smp_load_acquire(y);
r1 = READ_ONCE(*z); r3 = READ_ONCE(*x);
smp_store_release(y, 1); WRITE_ONCE(*z, 1);
This results in the following
po links:
- WRITE_ONCE(*x, 1) ⟶
r1 = READ_ONCE(*z),
- WRITE_ONCE(*x, 1) ⟶
smp_store_release(y, 1),
- r1 = READ_ONCE(*z) ⟶
smp_store_release(y, 1),
- r2 = smp_load_acquire(y) ⟶
r3 = READ_ONCE(*x),
- r2 = smp_load_acquire(y) ⟶
WRITE_ONCE(*z, 1),
- r3 = READ_ONCE(*x) ⟶
WRITE_ONCE(*z, 1).
The first three links relate events in Thread 0 and
the last three relate events in Thread 1.
(The number of links in po is clearly quadratic
in the number of statements in a given thread, but that is OK because
several other things are exponential!
Knowing this, you can understand
why this sort of verification technique is unlikely to
handle all 20 million lines of the Linux kernel at one go.
Instead, these techniques should be applied to small but critical
segments of code.)
In this example, there is only one Acquire event:
“r2 = smp_load_acquire(y)”.
Intersecting po with the set of
all pairs of events in which the first is an Acquire
gives the acq-po relation:
- r2 = smp_load_acquire(y) ⟶
r3 = READ_ONCE(*x),
- r2 = smp_load_acquire(y) ⟶
WRITE_ONCE(*z, 1).
This naturally lists all pairs of instructions whose execution order is
constrained by Thread 1's
smp_load_acquire().
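This intersection can be reproduced in a few lines of Python; the labels A, B, and C stand in for Thread 1's three statements and are invented for illustration:

```python
# Thread 1's events, with invented labels:
#   A = "r2 = smp_load_acquire(y)"
#   B = "r3 = READ_ONCE(*x)"
#   C = "WRITE_ONCE(*z, 1)"
po1 = {("A", "B"), ("A", "C"), ("B", "C")}   # Thread 1's po links
acquires = {"A"}                             # the Acquire set
anything = {"A", "B", "C"}                   # the "_" wildcard

# acq-po = po & (Acquire * _)
acq_po = po1 & {(a, b) for a in acquires for b in anything}
print(sorted(acq_po))  # [('A', 'B'), ('A', 'C')]
```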
The “rb-dep” definition is the same as that of
“rmb” earlier,
except that it applies to smp_read_barrier_depends()
instead of smp_rmb().
The “deref-po” definition is the same as that of
“acq-po”, but for rcu_dereference()
instead of smp_load_acquire().
The “lderef-po” definition is the same as that of
“deref-po”,
but for lockless_dereference().
Note that these three relations do not
correspond exactly to ordering constraints,
because smp_read_barrier_depends(), rcu_dereference(),
and lockless_dereference() only
order pairs of accesses where the second is “dependent” on the first
(more precisely, where there is an address dependency between them);
this restriction is described in more detail later on.
Note also that this portion of the Bell file defines only the relationships
between these one-sided barriers and the surrounding code within a
given process.
Cross-process ordering properties are defined by later Bell and Cat code.
The next portion of the Bell file groups fences by strength:
39 let rd-dep-fence = rb-dep | deref-po | lderef-po
40 let exec-order-fence = rmb | acq-po
41 let weak-fence = wmb
42 let medium-fence = po-rel | po-assign
43 let strong-fence = mb | sync
44
45 let transitive-fence = strong-fence | medium-fence
46 let propagation-fence = transitive-fence | weak-fence
47 let ordering-fence = propagation-fence | exec-order-fence
The members of the rd-dep-fence group
(smp_read_barrier_depends(),
rcu_dereference(), and
lockless_dereference())
cannot provide any ordering at all unless a dependency is also present.
The members of the exec-order-fence group
(smp_rmb() and smp_load_acquire())
can be thought of as providing
ordering by restricting execution, for example, waiting for previous reads
to complete before executing subsequent instructions.
(In practice, hardware architects have all sorts of optimizations at
their disposal that provide the needed ordering without necessarily
actually waiting.)
The ordering properties of any member of the rd-dep-fence and
exec-order-fence groups do not propagate outside of that member's
process.
Such fences cannot provide global ordering except in situations involving only
causal reads-from (rf) links; any non-causal coherence or
from-read links (co or fr, respectively) require
a stronger type of barrier.
In contrast, the ordering properties of the lone member of the
weak-fence group (smp_wmb())
do propagate outside of its process, but no more than one hop away.
Added strength provides increased propagation, so that
the ordering properties of the stronger barriers
can propagate through an arbitrarily large number of hops,
that is to say, the stronger barriers are transitive.
For example, the members of the medium-fence group
(smp_store_release() and rcu_assign_pointer())
are transitive, but this ordering is guaranteed in general
only within a set of processes using these fences in an organized manner
(for example, pairing smp_store_release() in one process with
smp_load_acquire() in the next process).
The ordering might not be visible to an unrelated process
unless at least one member of the set uses a member of the
strong-fence group.
Finally, the members of the strong-fence group can enforce full
globally visible transitive ordering.
Pervasive use of the
members of the strong-fence family will result in agreement
on the order even of completely unrelated memory references.
In fact, as noted
earlier,
placing one of these strong fences between each pair of memory
references in each process will forbid all but SC executions.
On the other hand, stronger fences often incur larger performance penalties.
[ @@@ I suspect that we need example litmus tests for the above. Thoughts? ]
These groups will be used in the Cat file to organize the various
ordering requirements.
The final section of the Bell file is the most complex, due to the
fact that rcu_read_lock() and rcu_read_unlock()
must come in matching pairs within a given process and can be nested.
Therefore, the purpose of the following code is to find the outermost
pair of rcu_read_lock() and rcu_read_unlock() invocations
in a single nested set, and to differentiate correctly between any
unrelated nested sets in a given process.
49 (* Compute matching pairs of nested Rcu_read_locks and Rcu_read_unlocks *)
50 let matched = let rec
51 unmatched-locks = Rcu_read_lock \ domain(matched)
52 and unmatched-unlocks = Rcu_read_unlock \ range(matched)
53 and unmatched = unmatched-locks | unmatched-unlocks
54 and unmatched-po = (unmatched * unmatched) & po
55 and unmatched-locks-to-unlocks = (unmatched-locks *
56 unmatched-unlocks) & po
57 and matched = matched | (unmatched-locks-to-unlocks \
58 (unmatched-po ; unmatched-po))
59 in matched
60
61 (* Validate nesting *)
62 flag ~empty Rcu_read_lock \ domain(matched) as unbalanced-rcu-locking
63 flag ~empty Rcu_read_unlock \ range(matched) as unbalanced-rcu-locking
64
65 (* Outermost level of nesting only *)
66 let crit = matched \ (po^-1 ; matched ; po^-1)
The “matched” relation is defined by the
mutually recursive set of definitions on lines 50-59.
The idea behind this code is to associate an unmatched
Rcu_read_lock event with a later unmatched Rcu_read_unlock event
whenever no unmatched events lie between them,
and to repeat this operation recursively until nothing more can be matched.
To that end, lines 51-53 form the sets of not-yet-matched
Rcu_read_lock and Rcu_read_unlock events and their union.
Line 54 then forms the relation of all pairs of these unmatched
events that occur in the same thread, in program order.
Lines 55-56 similarly form the relation of all such pairs
where the first member of the pair is an Rcu_read_lock event
and the second is an Rcu_read_unlock.
The interesting part is lines 57-58, which take
pairs of unmatched Rcu_read_lock and Rcu_read_unlock events
and add them to the “matched” relation,
but only if there are no unmatched events in between.
They do this by applying the
“\” (backslash) subtraction operator to remove from
the unmatched-locks-to-unlocks relation
any pairs having an
intervening unmatched Rcu_read_lock or Rcu_read_unlock.
The “;” operator sequences relations
(if relation x contains a⟶b
and relation y contains b⟶c
then (x ; y) will contain a⟶c).
In this case, you can see that
“unmatched-po ; unmatched-po”
contains all pairs a⟶c of unmatched events for which
a third unmatched event b lies between them in program order.
The only purpose of line 59 is to prevent the
unmatched-locks,
unmatched-unlocks,
unmatched,
unmatched-po, and
unmatched-locks-to-unlocks
definitions from leaking out to the surrounding context.
(Grammatically speaking, the construction used here is a
let rec expression inside a let statement.
In fact, let or let rec expressions
are very much like GCC's statement expressions;
the statement in lines 50-59 is syntactically analogous to
“x = ({int x = u; if (x < v) x = v; x;})”.)
Line 62 then checks whether there are any unmatched Rcu_read_lock events,
and line 63 does the same for unmatched Rcu_read_unlock events.
The “flag ~empty” statement flags the litmus test
as containing a semantic error if the specified set isn't empty,
and the “as ...” clause merely
provides a name to identify the particular failure mode.
Lastly, line 66 computes those matching pairs which lie
at the outermost level of nesting.
They are the important ones, because they delimit
RCU read-side critical sections.
It does this by subtracting from “matched”
all pairs which lie entirely between another matched pair.
The “^-1” inversion operator computes the
converse of a given relation; that is, it computes the collection
of all links a⟶b such that
b⟶a is in the given relation.
Thus, po^-1 contains all pairs of events in reverse
program order.
To see how “(po^-1 ; matched ; po^-1)” selects
inner matched pairs, consider the following example:
1 rcu_read_lock();
2 rcu_read_lock();
3 rcu_read_unlock();
4 rcu_read_unlock();
Starting at line 2, a “po^-1” step takes us back to
line 1, a “matched” step takes us to line 4,
and a second “po^-1” step takes us back to line 3.
Thus, this expression correctly identifies line 2 ⟶ line 3
as an inner matched pair.
You can easily see that this mechanism will remove from the
matched relation any
matched pairs that are nested within another matched pair.
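Continuing the Python sketch (again with invented labels), the “crit” subtraction can be written as:

```python
# Sketch of: crit = matched \ (po^-1 ; matched ; po^-1)
def compose(r1, r2):
    """The herd ";" operator: relational composition."""
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

def inverse(r):
    """The herd "^-1" operator: the converse of a relation."""
    return {(b, a) for (a, b) in r}

events = ["L1", "L2", "U3", "U4"]       # nested lock/unlock example
po = {(a, b) for i, a in enumerate(events) for b in events[i+1:]}
matched = {("L1", "U4"), ("L2", "U3")}  # as computed by the fixpoint

inner = compose(compose(inverse(po), matched), inverse(po))
crit = matched - inner
print(sorted(crit))  # [('L1', 'U4')]
```

Only the outermost pair L1⟶U4 survives, which is exactly the pair delimiting the RCU read-side critical section.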
We are now ready to proceed to the Cat file.
The full
cat
file (linux.cat) for Alan Stern's
strong model
(strong-kernel.cat)
is as follows:
1 "Linux kernel memory model"
2
3 (* Alan Stern, 31 May 2016 *)
4
5 include "cos.cat"
6
7 let com = rf | co | fr
8 let coherence-order = po-loc | com
9 acyclic coherence-order as coherence
10
11 empty rmw & (fre;coe) as atomic
12
13
14 let rdep = addr & (_*R) & rd-dep-fence
15 let dep = addr | data
16 let dep-rfi = dep ; rfi
17 let rdw = po-loc & (fre ; rfe)
18 let detour = po-loc & (coe ; rfe)
19 let atomicpo = (RMW*RMW) & po
20 let addrpo = addr ; po
21
22 let ppo =
23 rdep | dep-rfi | rdw |
24 detour | atomicpo |
25 ((dep | po-loc | ctrl | addrpo) & (_*W))
26 let strongly-hb = ppo | fence | rfe
27 let obs = ((coe|fre) ; barrier+ ; rfe) & int
28
29 let rec transitive-propbase = rfe? ; transitive-fence ; hb*
30 and transitive-obs = (hb* ; (coe|fre) ; barrier* ;
31 transitive-propbase) & int
32 and hb = strongly-hb | (addr+ ; strongly-hb) |
33 obs | transitive-obs
34
35 acyclic hb as causality
36
37
38 let propbase = barrier | transitive-propbase
39 let strong-prop = fre? ; propbase* ; rfe? ; strong-fence ; hb*
40 let prop = (transitive-propbase & (W*W)) | strong-prop
41 let atomic-hb = hb+ & ((RMW&W) * _)
42 let cpord = co | prop | atomic-hb
43
44 acyclic cpord as propagation
45
46
47 (* Propagation between strong fences *)
48 let basic = hb* ; cpord* ; fre? ; propbase* ; rfe?
49
50 (* Chains that can prevent the RCU guarantee *)
51 let s-link = sync ; basic
52 let c-link = po ; crit^-1 ; po ; basic
53 let rcu-path0 = s-link |
54 (s-link ; c-link) |
55 (c-link ; s-link)
56 let rec rcu-path = rcu-path0 |
57 (rcu-path ; rcu-path) |
58 (s-link ; rcu-path ; c-link) |
59 (c-link ; rcu-path ; s-link)
60
61 irreflexive rcu-path as rcu
Quick Quiz 3:
This strong model is insanely complex!!!
How can anyone be expected to understand it???
Answer
First, the “"Linux kernel memory model"” string on line 1
gives the model's title,
and the “include "cos.cat"” statement pulls in some
common definitions, similar to the C language's
“#include <stdio.h>”.
Again, taking the remainder of the file one piece at a time:
-
Cat File: SC Per Location and Atomics.
-
Cat File: Intra-Thread Ordering.
- Cat File: Happens-Before.
- Cat File: Coherence Points.
- Cat File: RCU.
The first section of the cat file defines
SC per location, which again means that all CPUs agree on the
order of reads and writes to any given single location.
Therefore, any situation where CPUs disagree on the order of reads
and writes must involve more than one variable.
This section also provides ordering constraints for RMW atomic
operations.
1 let com = rf | co | fr
2 let coherence-order = po-loc | com
3 acyclic coherence-order as coherence
4
5 empty rmw & (fre;coe) as atomic
The “com” relation shown on line 1
is the union of:
- Coherence (co), which connects all writes
to any given variable, in the order that those writes
were executed.
- Reads-from (rf), which connects each read with
the write that produced the value read.
Note that initial values are considered to be “before the
beginning of time” writes, where time is measured by
the co ordering.
- From-reads (fr), which connects each read with
the writes to that same variable that follow the write
(in co order) producing the value read.
The resulting “com” relation tracks the
communication of data, hence the name.
The predefined “po-loc” relation intersects the
program-order relation “po” with the
per-location “loc” relation.
This results in “po-loc” being a union of
relations, one per variable, connecting all per-thread accesses to any
given variable, in program order.
The “coherence-order” relation on line 2
takes the
union of “po-loc” and “com”,
which combines the communication of data with the order in which each
location is accessed by each thread, while maintaining all relations on
a per-location basis.
The “acyclic” constraint on line 3
prohibits cycles in the resulting “coherence”
relation; in other words, it requires that everyone agree on the order
of accesses to each location.
Line 5 enforces the atomicity of RMW operations on a given variable:
More specifically, no write to the given variable can intervene between the read
and the write of the RMW operation.
Recall that the “rmw” relationship connects a
given RMW operation's read to its write.
Note also that “fre;coe” connects any read of a given
variable to some later write to that same variable, where at least
one of the intervening writes was executed by some other thread.
If the initial read was a given RMW operation's read and the final
write was this same RMW operation's write, atomicity has been violated:
Some other thread's write appeared after the RMW's read but before its
write.
Therefore, line 5 requires that the intersection of
“rmw” and “fre;coe” be the
empty set, thus prohibiting violations of atomicity.
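A minimal Python sketch of an atomicity violation (event names invented) shows why the intersection must be empty:

```python
# Sketch of the "empty rmw & (fre;coe)" check with invented events:
# an RMW whose read and write straddle another thread's write.
def compose(r1, r2):
    """The herd ";" operator: relational composition."""
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

rmw = {("Rrmw", "Wrmw")}     # read and write of one RMW operation
fre = {("Rrmw", "Wother")}   # another thread's write co-follows the read
coe = {("Wother", "Wrmw")}   # ...and co-precedes the RMW's own write

violations = rmw & compose(fre, coe)
print(violations)  # {('Rrmw', 'Wrmw')}: non-empty, execution forbidden
```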
The next portion of the file defines intra-thread ordering relationships.
Here “intra-thread” means that the ordered accesses are within
the same thread.
Some of the relationships will reference other threads.
1 let rdep = addr & (_*R) & rd-dep-fence
2 let dep = addr | data
3 let dep-rfi = dep ; rfi
4 let rdw = po-loc & (fre ; rfe)
5 let detour = po-loc & (coe ; rfe)
6 let atomicpo = (RMW*RMW) & po
7 let addrpo = addr ; po
The “addr” and
“data”
relations define address and data dependencies respectively.
An address dependency occurs when a previously loaded value is used
to form the address of a subsequent load or store within the same thread.
A data dependency occurs when a previously loaded value is used to form
the value stored by a subsequent store within the same thread.
However, the Linux kernel respects neither address nor data dependencies
unless: (1) The dependency is headed by rcu_dereference()
or lockless_dereference() or
(2) There is an smp_read_barrier_depends() between the
load heading the dependency chain and the dependent memory reference.
This requirement for a special operation helps to document the intent,
and also allows architectures to include any special instructions
required to enforce dependency ordering, for example, DEC Alpha
requires a memory barrier if the dependent access is a read.
Line 1 defines the “rdep” relation by intersecting
the “rd-dep-fence” relation (which covers
rcu_dereference(), lockless_dereference() and
smp_read_barrier_depends()) with the set of address
dependencies and with “(_*R)”, which
is the set of pairs of operations where the second member of the
pair is a read.
This results in address dependencies leading to a read, but only
those cases that enforce ordering between the load of the address
and the read from that address.
This distinction is necessary for DEC Alpha: For all other systems,
the definition of “rdep” could omit the
intersection with “rd-dep-fence”.
Quick Quiz 4:
For what code would this distinction matter?
Answer
Line 2 defines the “dep”
relationship, which is simply the union of address and data
dependencies.
Line 3 defines the “dep-rfi”
relationship, which contains dependencies leading to a store,
but where that store is read by a later load in that same thread.
Line 4 defines the “rdw”
relationship, which contains load-store pairs within a given thread,
where the load and store are to the same variable, but where at least
one store to this same variable from some other thread intervened
between this thread's load and store.
Line 5 does the same for store-load pairs, resulting in
the “detour” relationship.
Line 6 forms the “atomicpo” relationship,
which accumulates pairs of RMW operations where both members of
each pair are on the same thread, and where the first member of the
pair precedes the second member in program order.
Finally, line 7 defines the “addrpo” relationship,
which relates operations heading address dependencies with any operations
following the dependent operation in program order.
Quick Quiz 5:
Why would an operation following an address dependency get any special
treatment?
After all, there does not appear to be any particular ordering relationship
in the general case.
Answer
The next portion of the file combines the effects of dependencies,
barriers, and grace periods to arrive at a causally ordered happens-before
(“hb”) relationship.
1 let ppo =
2 rdep | dep-rfi | rdw |
3 detour | atomicpo |
4 ((dep | po-loc | ctrl | addrpo) & (_*W))
5 let strongly-hb = ppo | fence | rfe
6 let obs = ((coe|fre) ; barrier+ ; rfe) & int
7
8 let rec transitive-propbase = rfe? ; transitive-fence ; hb*
9 and transitive-obs = (hb* ; (coe|fre) ; barrier* ;
10 transitive-propbase) & int
11 and hb = strongly-hb | (addr+ ; strongly-hb) |
12 obs | transitive-obs
13
14 acyclic hb as causality
The “ppo” (preserved program order) relationship
is defined on lines 1-4.
This definition simply unions the sets of fence-free relationships
for which ordering is guaranteed when the corresponding operations are
executed within a given thread.
The prohibition against speculating writes is taken into account by the
intersection with “(_*W)” on line 4.
Quick Quiz 6:
Why does “ppo” intersect “po-loc”
with “(_*W)”?
Don't we need to enforce full cache coherence, not just cache coherence
for trailing writes?
Answer
Quick Quiz 7:
Why do rcu_dereference() and
lockless_dereference() respect control dependencies?
Answer
Line 5 defines “strongly-hb”
(strongly happens-before), which combines
“ppo” with fences and with cross-thread reads-from
relationships (“rfe”).
This relationship provides those single-step causal relationships
that remain causal on DEC Alpha.
Line 6 defines the “obs” (observation)
set of relationships, in which a read operation observes
the indirect effect of one of that thread's preceding writes,
mediated by a read and a write separated by at least one barrier
on some other thread.
This set contains pairs of operations on a given
thread (courtesy of the intersection with “int”)
that are ordered by at least one fence on some other thread.
The “(coe|fre)” gets us to that other thread,
the “barrier+” is one or more memory-barrier
instructions (including smp_load_acquire() and
smp_store_release()) on that same thread, and finally the
“rfe” gets us back to the original thread.
(Of course, if it weren't for the intersection with “int”,
that “rfe” might instead take us to a third thread.)
The “ppo”, “strongly-hb”, and
“obs” relationships have provided us with
causal relationships involving at most two threads.
Causal relationships can extend over an arbitrarily large number of
threads (in theory, anyway), and the purpose of the recursive definition
of “hb” (happens-before) spanning lines 8-12
is exactly to extend causality, but in a way supported by all architectures
that run Linux.
It is best to start with the base case on lines 11 and 12,
which union the “strongly-hb” relationship
(optionally preceded by an indefinitely long series of address
dependencies) with the “obs” relationship.
This results in all of the orderings involving a pair of threads.
Then line 8's “transitive-propbase” (transitive
propagation base) relationship works backwards in time, adding
an optional cross-thread reads-from (“rfe”)
relationship and a transitive fence to an existing series of
“hb” relationships.
This existing series is permitted to be empty, so that
“rfe? ; transitive-fence” is a base case
for “transitive-propbase” (but not for
“hb”).
Lines 9 and 10 define the “transitive-obs”
(transitive observation) relationship.
Because this is unioned directly into “hb”,
it forms another base case in combination with
“transitive-propbase”, namely
“((coe|fre);barrier*;rfe;transitive-fence)&int”.
This base case is roughly similar to the
“obs” relationship defined on line 6,
hence the similar name.
The inductive case is an arbitrarily long sequence of causally related
events that begins and ends on the same thread.
Putting the “transitive-propbase”,
“transitive-obs”, and “hb”
relationships together (recursively!), we get an arbitrarily long
causal happens-before relationship.
Line 14 says that these causal relationships cannot form cycles,
which should be intuitively appealing to anyone who does not possess
a time machine.
Even in weakly ordered systems, ordering extends somewhat beyond
strict causality, for example, it includes the notion of coherence
points.
The corresponding ordering constraints are described below.
1 let propbase = barrier | transitive-propbase
2 let strong-prop = fre? ; propbase* ; rfe? ; strong-fence ; hb*
3 let prop = (transitive-propbase & (W*W)) | strong-prop
4 let atomic-hb = hb+ & ((RMW&W) * _)
5 let cpord = co | prop | atomic-hb
6
7 acyclic cpord as propagation
Line 1 defines “propbase” (propagation base).
This can be either some sort of memory barrier (“barrier”)
or a “transitive-propbase”:
an arbitrarily long (possibly zero-length) series of happens-before
relationships that begins with a transitive fence
(and that is optionally preceded by an external reads-from relationship).
Line 2 defines “strong-prop” (strong propagation),
which adds a strong fence (that is, either smp_mb() or
synchronize_rcu()), and optionally much else besides to a
(possibly empty) series of “propbase” relationships.
Next, line 3 defines “prop” (propagation),
which augments “strong-prop” with
“transitive-propbase”, but restricted to begin
and end with a write operation.
Next, line 4 folds in the beginnings of support for atomic RMW operations
by defining the “atomic-hb” (atomic happens-before)
relationship.
This relationship is any non-zero-length series of happens-before
relationships where the first operation is the write portion of
an atomic RMW operation.
The stage is then set for line 5 to define the “cpord”
(coherence-point order) relationship, which is just the union of
the coherence, propagation, and atomic-happens-before relationships.
Line 7 then requires that this relationship be acyclic.
This should be intuitively appealing to hardware architects who do not possess
a time machine.
The happens-before and coherence-points machinery can be complex, but
fortunately, many common use cases take simple paths through this
happens-before machinery, for example:
Strong Model Litmus Test #12
1 C C-ISA2+o-rel+acq-rel+acq-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 1);
9 smp_store_release(b, 1);
10 }
11
12 P1(int *b, int *c)
13 {
14 int r1;
15
16 r1 = smp_load_acquire(b);
17 smp_store_release(c, 1);
18 }
19
20 P2(int *c, int *a)
21 {
22 int r2;
23 int r3;
24
25 r2 = smp_load_acquire(c);
26 r3 = READ_ONCE(*a);
27 }
28
29 exists
30 (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
After all three threads have completed, is the outcome shown
on line 30 possible?
Referring to the bell file, we see that lines 8⟶9, 16⟶17,
and 25⟶26
are each examples of the “transitive-weak-fence”
relationship.
Other bell-file definitions mean that all three are also examples of the
“weak-fence”,
“transitive-fence”,
“barrier”, and
“fence” relationships.
In addition, lines 9⟶16 and 17⟶25 are examples of the
“rfe” relationship.
Finally, 26⟶8 is an example of the
“fre” relationship.
Referring now to the cat file, we see that lines 8⟶9,
16⟶17, 25⟶26,
9⟶16, and 17⟶25
are each examples of the “strongly-hb”
relationship, by virtue of them being examples of either the
“fence” or “rfe”
relationships.
This in turn means that each of these relationships is an example of
the “hb” base case.
Let's call 8⟶9 a “transitive-fence” and each
of 16⟶17, 25⟶26, 9⟶16, and 17⟶25
an “hb”.
Then the series
8⟶9⟶16⟶17⟶25⟶26
is an example of
“transitive-propbase”.
Now let's look at “transitive-obs”,
ignoring the possibly-empty “hb*” and
“barrier*” components of that relationship.
This leaves us with
“(coe|fre);transitive-propbase”.
Now 26⟶8 is an “fre”, so given that
8⟶9⟶16⟶17⟶25⟶26
is a “transitive-propbase”,
and given that the series
26⟶8⟶9⟶16⟶17⟶25⟶26
begins and ends in the
same process, this series is a “transitive-obs”.
Because “transitive-obs” can be an
“hb” and “hb” must be acyclic,
the cycle
26⟶8⟶9⟶16⟶17⟶25⟶26
is forbidden.
This is confirmed by running the command:
herd7 -macros linux.def -bell strong-kernel.bell -cat strong-kernel.cat C-ISA2+o-rel+acq-rel+acq-o.litmus
which produces the following output:
Test C-ISA2+o-rel+acq-rel+acq-o Allowed
States 7
1:r1=0; 2:r2=0; 2:r3=0;
1:r1=0; 2:r2=0; 2:r3=1;
1:r1=0; 2:r2=1; 2:r3=0;
1:r1=0; 2:r2=1; 2:r3=1;
1:r1=1; 2:r2=0; 2:r3=0;
1:r1=1; 2:r2=0; 2:r3=1;
1:r1=1; 2:r2=1; 2:r3=1;
No
Witnesses
Positive: 0 Negative: 7
Condition exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
Observation C-ISA2+o-rel+acq-rel+acq-o Never 0 7
Hash=9762857b08e4db85dbbf52a7b43068e9
The “Never 0 7” should be reassuring, given that
this cycle is analogous to a series of lock releases and acquires, which
had jolly well better be fully ordered!
Let's now look at a roughly similar example:
Strong Model Litmus Test #13
1 C C-W+WRC+o-rel+acq-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 1);
9 smp_store_release(b, 1);
10 }
11
12 P1(int *b, int *c)
13 {
14 int r1;
15 int r2;
16
17 r1 = smp_load_acquire(b);
18 r2 = READ_ONCE(*c);
19 }
20
21 P2(int *c, int *a)
22 {
23 int r3;
24
25 WRITE_ONCE(*c, 1);
26 smp_mb();
27 r3 = READ_ONCE(*a);
28 }
29
30 exists
31 (1:r1=1 /\ 1:r2=0 /\ 2:r3=0)
After all three threads have completed, is the result
shown on line 31 possible?
Referring again to the bell file, we see that lines 8⟶9
and 17⟶18
are both examples of the “transitive-weak-fence”
relationship.
Other bell-file definitions mean that both are also examples of the
“weak-fence”,
“transitive-fence”,
“barrier”, and
“fence” relationships.
Also, 25⟶27 is an example of the
“strong-fence” relationship, which in turn means
that it is an example of the
“barrier” and
“fence” relationships.
Finally, 9⟶17 is an example of the
“rfe” relationship, and
18⟶25 and 27⟶8 are examples of the
“fre” relationship.
Switching our attention to the cat file, we see that lines 8⟶9,
9⟶17, 17⟶18, and 25⟶27 are all examples of the
“strongly-hb” relationship by virtue of
being examples of either the
“fence” or “rfe” relationships.
However, our previous strategy using the
“transitive-obs” relationship fails
because it only allows a single “fre” relationship
in a series that is required to begin and end on the same thread, and
we instead have two “fre” relationships, both of
which must be traversed to get back to the same thread.
Therefore, our “hb” relationships cover only the
series 8⟶9⟶17⟶18 and the series 25⟶27,
with no way to connect the two.
The “strong-prop” relationship is another
possibility, but it again only allows for a single
“fre” relationship.
There are no “rmw” relationships, so the
“atomic” check cannot help, either.
The “rdw” relationship requires a
“(fre;rfe)” series that does not exist in this
litmus test.
Finally, “obs” allows only a single
“fre” relationship in a series that begins and
ends on the same thread.
Therefore, the outcome r1&&!r2&&!r3 really can
happen, which will be no surprise
to anyone who has heard that powerpc's locks do not provide global ordering.
After all, this litmus test can be thought of as modeling
P0() releasing a lock, P1() acquiring it, and
P2() being an external observer checking for misordering
of P0()'s write to a with P1()'s
read from c.
In addition, “fre” relationships are non-causal,
so it makes sense that they can play only a limited role in the forbidding
of cycles.
In contrast, “rfe” relationships are causal,
and thus much more likely to result in forbidden cycles.
And this can be confirmed by running the following command line:
herd7 -macros linux.def -bell strong-kernel.bell -cat strong-kernel.cat C-W+WRC+o-rel+acq-o+o-mb-o.litmus
which results in the following output:
Test C-W+WRC+o-rel+acq-o+o-mb-o Allowed
States 8
1:r1=0; 1:r2=0; 2:r3=0;
1:r1=0; 1:r2=0; 2:r3=1;
1:r1=0; 1:r2=1; 2:r3=0;
1:r1=0; 1:r2=1; 2:r3=1;
1:r1=1; 1:r2=0; 2:r3=0;
1:r1=1; 1:r2=0; 2:r3=1;
1:r1=1; 1:r2=1; 2:r3=0;
1:r1=1; 1:r2=1; 2:r3=1;
Ok
Witnesses
Positive: 1 Negative: 7
Condition exists (1:r1=1 /\ 1:r2=0 /\ 2:r3=0)
Observation C-W+WRC+o-rel+acq-o+o-mb-o Sometimes 1 7
Hash=8e3c5d7d5d36f2b1484ff237e8d22f91
However, full barriers (smp_mb()) can be used to force the
Linux kernel to respect full non-causal ordering, and this is the
main job of the “cpord” relationship.
To see this, consider the following store-buffering litmus test
shown in
Strong Model Litmus Test #1.
Can the cyclic outcome “!r1&&!r2”
called out in line 25 really happen?
Lines 10⟶12 and 19⟶21 are both examples of the
“strong-fence” relationship, which in turn means
that they are also examples of the
“barrier” and
“fence” relationships.
Lines 12⟶19 and 21⟶10 are examples of the
“fre”
relationships.
As noted earlier, the fact that you have to go through two
“fre” relationships to get back to the original
thread means that the
“hb” relationship does not apply.
We therefore need to look to the “cpord” relationship.
Quick Quiz 8:
Is there an easy way to tell which definitions have effect for a
given litmus test?
Answer
Omitting most of the optional elements from the
“strong-prop” relationship results in the following:
“fre?;strong-fence”, a relationship that has
as members the pair of series 10⟶12⟶19 and
19⟶21⟶10.
Any “strong-prop” relationship is also a
“prop” relationship and a
“cpord” relationship.
However, the “cpord” relationship is required to
be acyclic, that is, no matter how you string together
“cpord” relations, there must not be a cycle.
Given that stringing together 10⟶12⟶19 and
19⟶21⟶10 results in the cycle
10⟶12⟶19⟶21⟶10,
we are forced to conclude that the store-buffering
litmus test's cyclic outcome “!r1&&!r2”
is forbidden.
Quick Quiz 9:
Why does “cpord” prohibit a cycle containing two
“fre” relationships when “hb”
does not?
They are both acyclic, after all!
Answer
The previous section showed how smp_mb() can restore
sequential consistency.
However, as Jade noted, synchronize_rcu() is even stronger
still, and therefore requires even more cat-file code.
The final portion of the cat file therefore covers RCU relationships.
Quick Quiz 10:
Say what???
How can anything possibly be stronger than sequential consistency???
Answer
RCU's fragment of the cat file is as follows:
1 (* Propagation between strong fences *)
2 let basic = hb* ; cpord* ; fre? ; propbase* ; rfe?
3
4 (* Chains that can prevent the RCU guarantee *)
5 let s-link = sync ; basic
6 let c-link = po ; crit^-1 ; po ; basic
7 let rcu-path0 = s-link |
8 (s-link ; c-link) |
9 (c-link ; s-link)
10 let rec rcu-path = rcu-path0 |
11 (rcu-path ; rcu-path) |
12 (s-link ; rcu-path ; c-link) |
13 (c-link ; rcu-path ; s-link)
14
15 irreflexive rcu-path as rcu
Line 2 defines the “basic” relationship,
which orders smp_mb(), synchronize_rcu(), and
RCU read-side critical sections.
The ordering of smp_mb() is handled implicitly within
the definitions of the component relationships that have been
unioned together to form “basic”.
Quick Quiz 11:
But “basic” might match an empty sequence of steps,
so that it would directly connect what preceded it with what followed it.
How can that be right?
Answer
Line 5 defines the “s-link”
(synchronize_rcu() link) relationship to be an RCU
grace period followed by some sequence of operations that provides
ordering.
Similarly, line 6 defines the “c-link”
(critical-section link) relationship to be an RCU read-side critical section
followed by some sequence of operations that provides ordering.
However, the formulation for “c-link” is interesting
in that it allows any access preceding an RCU read-side critical section
in that same thread to be used as evidence that an earlier grace period
is ordered before the critical section, and vice versa.
The importance of this is shown by the following litmus test:
Strong Model Litmus Test #14
1 C C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 int r1;
9
10 r1 = READ_ONCE(*a);
11 synchronize_rcu();
12 WRITE_ONCE(*b, 1);
13 }
14
15 P1(int *b, int *c)
16 {
17 int r2;
18
19 rcu_read_lock();
20 r2 = READ_ONCE(*b);
21 WRITE_ONCE(*c, 1);
22 rcu_read_unlock();
23 }
24
25 P2(int *c, int *d)
26 {
27 int r3;
28
29 r3 = READ_ONCE(*c);
30 rcu_read_lock();
31 // do_something_else();
32 rcu_read_unlock();
33 WRITE_ONCE(*d, 1);
34 }
35
36 P3(int *d, int *a)
37 {
38 int r4;
39
40 r4 = READ_ONCE(*d);
41 synchronize_rcu();
42 WRITE_ONCE(*a, 1);
43 }
44
45 exists
46 (0:r1=1 /\ 1:r2=1 /\ 2:r3=1 /\ 3:r4=1)
The normal usage of “c-link” is illustrated by
P1().
The “c-link” definition could start at line 20,
take a “po” step to the rcu_read_unlock() on
line 22, a “crit^-1” step back to the
rcu_read_lock() on line 19,
and finally a “po” step to line 21.
This implements the rule: “If any part of an RCU read-side
critical section follows anything after a given RCU grace period,
then the entirety of that critical section follows anything preceding
that grace period”, where the preceding grace period is the
one in P0().
The more expansive usage is illustrated by P2().
The “c-link” definition could start at line 29,
take a “po” step to the
rcu_read_unlock() on line 32, then a
“crit^-1” step back to the rcu_read_lock()
on line 30, and finally a “po” step to
line 33.
This allows “c-link” (in conjunction with
“basic”) to link the access on
line 21 of P1() with the access on line 40
of P3().
Without this more expansive definition of “c-link”,
the questionable outcome
r1&&r2&&r3&&r4 would be permitted.
That it is in fact forbidden can be seen by running:
herd7 -macros linux.def -bell strong-kernel.bell -cat strong-kernel.cat \
C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o.litmus
This gives the reassuring output:
Test C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o Allowed
States 15
0:r1=0; 1:r2=0; 2:r3=0; 3:r4=0;
0:r1=0; 1:r2=0; 2:r3=0; 3:r4=1;
0:r1=0; 1:r2=0; 2:r3=1; 3:r4=0;
0:r1=0; 1:r2=0; 2:r3=1; 3:r4=1;
0:r1=0; 1:r2=1; 2:r3=0; 3:r4=0;
0:r1=0; 1:r2=1; 2:r3=0; 3:r4=1;
0:r1=0; 1:r2=1; 2:r3=1; 3:r4=0;
0:r1=0; 1:r2=1; 2:r3=1; 3:r4=1;
0:r1=1; 1:r2=0; 2:r3=0; 3:r4=0;
0:r1=1; 1:r2=0; 2:r3=0; 3:r4=1;
0:r1=1; 1:r2=0; 2:r3=1; 3:r4=0;
0:r1=1; 1:r2=0; 2:r3=1; 3:r4=1;
0:r1=1; 1:r2=1; 2:r3=0; 3:r4=0;
0:r1=1; 1:r2=1; 2:r3=0; 3:r4=1;
0:r1=1; 1:r2=1; 2:r3=1; 3:r4=0;
No
Witnesses
Positive: 0 Negative: 15
Condition exists (0:r1=1 /\ 1:r2=1 /\ 2:r3=1 /\ 3:r4=1)
Observation C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o Never 0 15
Hash=c792a4c620a9d5244c0bee80da2a90fa
In short, if anything within or preceding a given RCU read-side critical
section follows anything after a given RCU grace period, then it is
probably best if that entire RCU read-side critical section follows
anything preceding the grace period, and vice versa.
Lines 7-9 of RCU's cat-file fragment
define the “rcu-path0”
(RCU-path base case) relationship to be the three basic ways that
RCU provides ordering:
- A single synchronize_rcu() invocation, which
in theory may be substituted for smp_mb().
(In practice, good luck substituting it for instances of
smp_mb() in preempt-disabled regions of code, to say
nothing of the disastrous performance degradation.)
- A synchronize_rcu() that is ordered before an
RCU read-side critical section.
This commonly used case guarantees that if some RCU read-side
critical section extends beyond the end of a grace period,
then all of that RCU read-side critical section happens after
anything preceding that grace period.
In other words, if any part of the critical section might happen
after the kfree(), all of that critical section will
happen after the corresponding list_del_rcu().
This case groups the RCU grace period in P0()
and the RCU read-side critical section in P1()
in the example above.
- An RCU read-side critical section that is ordered before a
synchronize_rcu().
This commonly used case guarantees that if some RCU read-side
critical section extends before the beginning of a grace period,
then all of that RCU read-side critical section happens before
anything following that grace period.
In other words, if any part of the critical section might happen
before the list_del_rcu(), all of that critical section will
happen before the corresponding kfree().
This case groups the RCU read-side critical section in
P2() and the RCU grace period in P3()
in the example above.
The recursive definition of “rcu-path” on lines 10-13
builds on “rcu-path0”.
The “rcu-path0” on line 10 supplies the base
case.
Line 11's “(rcu-path;rcu-path)” states that
if any two sequences of RCU grace periods and read-side critical sections
each provide ordering, then the concatenation of those two sequences also
provides ordering.
This rule applies to the P0()-P1()
and P2()-P3() groups in the example above,
thus guaranteeing that the questionable outcome
r1&&r2&&r3&&r4 is forbidden.
Line 12's “(s-link;rcu-path;c-link)” states
that if some sequence of RCU grace periods and read-side critical sections
provides ordering, then ordering is still provided when that sequence
is preceded by synchronize_rcu() and followed by an RCU
read-side critical section.
Finally, line 13's “(c-link;rcu-path;s-link)” states
that if some sequence of RCU grace periods and read-side critical sections
provides ordering, then ordering is still provided when that sequence
is preceded by an RCU read-side critical section and followed by
synchronize_rcu().
Line 15 states that “rcu-path” cannot loop back
on itself, in other words, that “rcu-path”
provides ordering.
Another way of thinking of “rcu-path” is as a counter
and comparison, implemented recursively.
If there are at least as many calls to synchronize_rcu()
as there are RCU read-side critical sections in a given
“rcu-path”, ordering is guaranteed; otherwise it is not.
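This counter interpretation can be checked directly against the
recursive definition.
The following Python sketch (an illustration only, not part of the
model) encodes a candidate chain as a string of links, “s” for
an s-link and “c” for a c-link, and verifies that membership in
“rcu-path” coincides with the grace-period count being at
least the critical-section count:

```python
# Illustrative sketch: test membership in "rcu-path" straight from its
# recursive definition, with "s" denoting an s-link and "c" a c-link.

from functools import lru_cache

@lru_cache(maxsize=None)
def is_rcu_path(links):
    """True iff the link string 'links' is in rcu-path."""
    if links in ("s", "sc", "cs"):               # rcu-path0
        return True
    # (rcu-path ; rcu-path): any split into two rcu-paths
    if any(is_rcu_path(links[:i]) and is_rcu_path(links[i:])
           for i in range(1, len(links))):
        return True
    # (s-link ; rcu-path ; c-link) and (c-link ; rcu-path ; s-link)
    if len(links) >= 3 and {links[0], links[-1]} == {"s", "c"}:
        return is_rcu_path(links[1:-1])
    return False

def counter_rule(links):
    """At least one grace period, and at least as many grace periods
    as read-side critical sections."""
    return "s" in links and links.count("s") >= links.count("c")

# The recursive definition and the counter rule agree on all
# link strings up to length 7:
for n in range(1, 8):
    for bits in range(2 ** n):
        w = "".join("sc"[(bits >> i) & 1] for i in range(n))
        assert is_rcu_path(w) == counter_rule(w)

print(is_rcu_path("sc"))    # True:  one grace period, one critical section
print(is_rcu_path("csc"))   # False: one grace period, two critical sections
```

Here “sc” models a single grace period paired with a single
read-side critical section, for which ordering is guaranteed, while
“csc” models one grace period sandwiched between two critical
sections, for which it is not.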
Let's use this machinery to analyze the prototypical RCU-deferred-free
scenario:
Strong Model Litmus Test #15
1 C C-LB+rl-deref-o-rul+o-sync-o.litmus
2
3 {
4 a=x;
5 }
6
7 P0(int **a)
8 {
9 int *r1;
10 int r2;
11
12 rcu_read_lock();
13 r1 = rcu_dereference(*a);
14 r2 = READ_ONCE(*r1);
15 rcu_read_unlock();
16 }
17
18 P1(int **a, int *x, int *y)
19 {
20 WRITE_ONCE(*a, y);
21 synchronize_rcu();
22 WRITE_ONCE(*x, 1); /* Emulate kfree(). */
23 }
24
25 exists
26 (0:r1=x /\ 0:r2=1)
The variable a initially references the variable x,
which is initially zero.
The P1() function sets a to reference
the variable y (also initially zero), then sets the value
of x to 1 to emulate the effects of kfree().
Any RCU reader accessing and dereferencing a should therefore
see the value zero, in other words, the outcome r2==1 should
be forbidden.
In other words, we would expect the cycle
20⟶22⟶14⟶15⟶12⟶13⟶20
to be forbidden.
Let's check!
Lines 12⟶15 form a
“crit” relationship, while
lines 20⟶22 form a “sync” relationship.
If the cycle is allowed,
lines 13⟶20 form an “fre” relationship
and lines 22⟶14 form an “rfe” relationship.
This means that lines 13⟶20 and lines 22⟶14 are also
“basic” relationships.
It follows that the series 20⟶22⟶14 is an
“s-link”
relationship.
Given that lines 14⟶15 and 12⟶13 are
“po” relationships,
the series 14⟶15⟶12⟶13⟶20 is a
“c-link” relationship.
We therefore have an “s-link” relationship
followed by a “c-link” (or vice versa), so
that the series
20⟶22⟶14⟶15⟶12⟶13⟶20
is an “rcu-path0”
relationship, which means that this same series is also an
“rcu-path” relationship.
Because it ends where it starts, on line 20, it is reflexive,
and thus forbidden.
The following command confirms this:
herd7 -macros linux.def -bell strong-kernel.bell -cat strong-kernel.cat C-LB+rl-deref-o-rul+o-sync-o.litmus
This command produces the following output:
Test C-LB+rl-deref-o-rul+o-sync-o Allowed
States 2
0:r1=x; 0:r2=0;
0:r1=y; 0:r2=0;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (0:r1=x /\ 0:r2=1)
Observation C-LB+rl-deref-o-rul+o-sync-o Never 0 2
Hash=0c483dc427960c11ac9395e4282a41d7
Therefore, the RCU read-side critical section in P0()
cannot see the emulated kfree() following P1()'s
grace period, which should be some comfort to users of RCU.
But what happens if we add another RCU read-side critical section to
the mix, as in the following somewhat inane but hopefully instructive
example?
Strong Model Litmus Test #16
1 C C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu.litmus
2
3 {
4 a=x;
5 }
6
7 P0(int **a)
8 {
9 int *r1;
10 int r2;
11
12 rcu_read_lock();
13 r1 = rcu_dereference(*a);
14 r2 = READ_ONCE(*r1);
15 rcu_read_unlock();
16 }
17
18 P1(int **a, int *y, int *z)
19 {
20 WRITE_ONCE(*a, y);
21 synchronize_rcu();
22 WRITE_ONCE(*z, 1);
23 }
24
25 P2(int *x, int *z)
26 {
27 int r3;
28
29 rcu_read_lock();
30 r3 = READ_ONCE(*z);
31 WRITE_ONCE(*x, 1); /* Emulate kfree(). */
32 rcu_read_unlock();
33 }
34
35 exists
36 (0:r1=x /\ 0:r2=1 /\ 2:r3=1)
Can the outcome r2==1 happen now?
Lines 12⟶15 and 29⟶32 form
“crit” relationships, while
lines 20⟶22 form a “sync” relationship.
Lines 22⟶30 and 31⟶14 form “rfe”
relationships and lines 13⟶20 form an “fre”
relationship, which means that all three are also
“basic” relationships.
It follows that the series 20⟶22⟶30 is an
“s-link”
relationship.
Given that lines 14⟶15 and 12⟶13 are
“po” relationships,
the series
14⟶15⟶12⟶13⟶20 is a “c-link”
relationship.
Similarly, because lines 30⟶32 and 29⟶31 are
“po” relationships,
the series 30⟶32⟶29⟶31⟶14
is also a “c-link”
relationship.
We therefore have one “c-link” relationship
followed by an “s-link” relationship, which in
turn is followed by another “c-link” relationship.
The “c-link” relationship
14⟶15⟶12⟶13⟶20
can combine with the
“s-link” relationship
20⟶22⟶30
to form the “rcu-path0” relationship
14⟶15⟶12⟶13⟶20⟶22⟶30.
However, there is no way to add the remaining “c-link”
relationship 30⟶32⟶29⟶31⟶14
and still remain within “rcu-path”,
so the cycle resulting in
r2==1 can in fact happen.
This is confirmed by the command:
herd7 -macros linux.def -bell strong-kernel.bell -cat strong-kernel.cat \
C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu.litmus
This command produces the following output:
Test C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu Allowed
States 6
0:r1=x; 0:r2=0; 2:r3=0;
0:r1=x; 0:r2=0; 2:r3=1;
0:r1=x; 0:r2=1; 2:r3=0;
0:r1=x; 0:r2=1; 2:r3=1;
0:r1=y; 0:r2=0; 2:r3=0;
0:r1=y; 0:r2=0; 2:r3=1;
Ok
Witnesses
Positive: 1 Negative: 5
Condition exists (0:r1=x /\ 0:r2=1 /\ 2:r3=1)
Observation C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu Sometimes 1 5
Hash=b591d622245952a2fc8eaad233203817
This should be no surprise, given that we have more RCU read-side
critical sections than we have grace periods.
This situation underscores the need to avoid doing inane things with RCU.
However, one nice thing about the fact that the memory model incorporates
RCU is that such inanity can now be detected, at least when it is
confined to relatively small code fragments.
Acknowledgments
We owe thanks to H. Peter Anvin, Will Deacon, Andy Glew,
Derek Williams, Leonid Yegoshin, and Peter Zijlstra for their
patient explanations of their respective systems' memory models.
We are indebted to Peter Sewell, Susmit Sarkar, and their groups
for their seminal work formalizing many of these same memory models.
We all owe thanks to Dmitry Vyukov, Boqun Feng, and Peter Zijlstra for
their help making this human-readable.
We are also grateful to Michelle Rankin and Jim Wasko for their support
of this effort.
This work represents the views of the authors and does not necessarily
represent the views of University College London, INRIA Paris,
Scuola Superiore Sant'Anna, Harvard University, or IBM Corporation.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or
service marks of others.
Quick Quiz 1:
Given that this is about memory barriers, why
“instructions F[Barriers]” instead of perhaps
“instructions B[Barriers]”?
Answer:
“Memory barriers” are also sometimes called
“memory fences”.
This can be confusing, but both terms are used so we might
as well get used to it.
Besides, the “B” instruction class
was already reserved for Branches.
Back to Quick Quiz 1.
Quick Quiz 2:
Why wouldn't “let sync = fencerel(Sync)”
work just as well as the modified definition?
Answer:
The modified definition is necessary because the model needs to recognize that
code like:
WRITE_ONCE(*x, 1);
synchronize_rcu();
synchronize_rcu();
r2 = READ_ONCE(*y);
will insert two grace periods between the memory accesses, not just one.
With the modified definition, there is a “sync”
pair linking the
WRITE_ONCE() to the first synchronize_rcu() as well as
a pair linking that event to the READ_ONCE(),
so it is possible to pass from the write to the read via two links.
With the “let sync = fencerel(Sync)” definition,
there would be no link from the WRITE_ONCE() to the
first synchronize_rcu().
Consequently there would be a path from the write to the read
involving one link, but no path involving two.
Back to Quick Quiz 2.
Quick Quiz 3:
This strong model is insanely complex!!!
How can anyone be expected to understand it???
Answer:
Given that this model is set up to be as strong as reasonably possible given
the rather wide variety of memory models that the Linux kernel runs
on, it is actually surprisingly simple.
Furthermore, this model has a tool that goes with it, which is more
than can be said of memory-barriers.txt.
Nevertheless, it is quite possible that this model should be carefully
weakened, if it turns out that doing so simplifies the model
without invalidating any use cases.
Any such weakening should of course be carried out with extreme caution.
Back to Quick Quiz 3.
Quick Quiz 4:
For what code would this distinction matter?
Answer:
One example is as follows:
p = READ_ONCE(gp);
do_something_with(p->a);
DEC Alpha would not provide ordering in this case, and the
definition of “rdep” therefore excludes this case.
Back to Quick Quiz 4.
Quick Quiz 5:
Why would an operation following an address dependency get any special
treatment?
After all, there does not appear to be any particular ordering relationship
in the general case.
Answer:
It turns out that Power guarantees that writes following an
address-dependency pair will not be reordered before
the load heading up the dependency pair, as can be seen from this
load-buffering litmus test
and its output (note the
“Never” on the last line) and from this
message-passing litmus test
and its output.
Why would Power provide such ordering to an unrelated store?
Because until the load completes, Power has no idea whether or not it
is unrelated.
If the load returns the same address that is used by the “unrelated”
store, then the two stores are no longer unrelated, and the CPU must
provide coherence ordering between them.
But the CPU can't know what ordering requirements there might be until
the load completes, so all later writes must wait until the load completes.
But what about loads?
Don't they have the same coherency requirements?
Indeed they do, but the CPU can safely speculate such loads, squashing the
speculation if it later learns that there was an unexpected address
collision.
For more information on this dependency/coherence corner case, please see
section 10.5 of
A Tutorial Introduction to the ARM and POWER Relaxed Memory Models.
Other sections cover many other interesting corner cases.
There is also the possibility that the compiler might know all the values
assigned to the variable loaded via rcu_dereference() or
lockless_dereference(), in which case it might be able to
break the dependency.
Back to Quick Quiz 5.
Quick Quiz 6:
Why does “ppo” intersect “po-loc”
with “(_*W)”?
Don't we need to enforce full cache coherence, not just cache coherence
for trailing writes?
Answer:
We do need to enforce full cache coherence, but that has already been
done, see the “coherence-order” relationship
discussed earlier.
What “ppo” is adding is the memory-order interactions
between multiple variables and multiple threads.
Back to Quick Quiz 6.
Quick Quiz 7:
Why do rcu_dereference() and
lockless_dereference() respect control dependencies?
Answer:
Modern hardware is not permitted to speculate stores, so any
well-formed compiler-proof conditional will respect control
dependencies, including those involving
rcu_dereference() and lockless_dereference()
as well as those involving READ_ONCE().
Back to Quick Quiz 7.
Quick Quiz 8:
Is there an easy way to tell which definitions have effect for a
given litmus test?
Answer:
One very straightforward approach is to edit the .cat and .bell files
to remove “acyclic” or
“irreflexive” statements.
For example, for the above store-buffering litmus test, removing
the “acyclic cpord as propagation” allows
the cyclic outcome.
Alternatively, you can pass the
“-skipcheck propagation” command-line argument to
herd7.
However, editing the .bell and .cat files to omit different elements
can be an extremely educational activity.
Back to Quick Quiz 8.
Quick Quiz 9:
Why does “cpord” prohibit a cycle containing two
“fre” relationships when “hb”
does not?
They are both acyclic, after all!
Answer:
The difference is that “hb” requires that any
path including an “fre” relationship begin and
end at the same thread.
Therefore, no matter how you string “hb”
relationships together, they cannot prohibit a cycle that goes
through two “fre” relationships before returning
to the original thread, and thus cannot prohibit the store-buffering
litmus test.
In contrast, the “strong-prop” relationship that
leads up to the “cpord” relationship makes no
same-thread restriction, which means that “cpord”
can forbid a cycle containing more than one “fre”
relationship.
Back to Quick Quiz 9.
Quick Quiz 10:
Say what???
How can anything possibly be stronger than sequential consistency???
Answer:
Easily.
To see this, recall the store-buffering example from the previous section,
in which smp_mb() prevented any executions that were not
simple interleavings, in other words, it prohibits the cyclic outcome
“!r1&&!r2”.
If we replace the first smp_mb() with synchronize_rcu(),
replace the second smp_mb() with an RCU read-side
critical section, and reverse P1()'s memory references,
we get the following:
Strong Model Litmus Test #17
1 C C-LB+o-sync-o+rl-o-o-rul.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 int r1;
9
10 r1 = READ_ONCE(*a);
11 synchronize_rcu();
12 WRITE_ONCE(*b, 1);
13 }
14
15 P1(int *b, int *a)
16 {
17 int r2;
18
19 rcu_read_lock();
20 r2 = READ_ONCE(*b);
21 WRITE_ONCE(*a, 1);
22 rcu_read_unlock();
23 }
24
25 exists
26 (0:r1=1 /\ 1:r2=1)
It turns out that synchronize_rcu() is so strong that it
is able to forbid the cyclic outcome “r1&&r2”
even though P1() places no ordering constraints whatsoever
on its two memory references.
Now that is strong ordering!
There is of course no free lunch.
On systems having more than one CPU, the overhead of
synchronize_rcu() is orders of magnitude greater than that of
smp_mb().
You get what you pay for!
Back to Quick Quiz 10.
Quick Quiz 11:
But “basic” might match an empty sequence of steps,
so that it would directly connect what preceded it with what followed it.
How can that be right?
Answer:
It is not just right, but absolutely necessary.
This permits a pair of consecutive grace periods to do the right thing.
For example, consider the following litmus test, where, as usual,
a, b, and c are initially all zero:
Strong Model Litmus Test #18
1 C C-LB+o-sync-sync-o+rl-o-o-rul+rl-o-o-rul.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 int r1;
9
10 r1 = READ_ONCE(*a);
11 synchronize_rcu();
12 synchronize_rcu();
13 WRITE_ONCE(*b, 1);
14 }
15
16 P1(int *b, int *c)
17 {
18 int r2;
19
20 rcu_read_lock();
21 r2 = READ_ONCE(*b);
22 WRITE_ONCE(*c, 1);
23 rcu_read_unlock();
24 }
25
26 P2(int *c, int *a)
27 {
28 int r3;
29
30 rcu_read_lock();
31 r3 = READ_ONCE(*c);
32 WRITE_ONCE(*a, 1);
33 rcu_read_unlock();
34 }
35
36 exists
37 (0:r1=1 /\ 1:r2=1 /\ 2:r3=1)
If “basic” could not match an empty sequence of steps,
the pair of synchronize_rcu() invocations on lines 11 and 12
could not be chained together directly, but would instead effectively
merge into a single synchronize_rcu().
Thus, the possibility of an empty “basic” is
absolutely required to forbid the undesirable outcome
r1&&r2&&r3.
Back to Quick Quiz 11.