[Valgrind-users] helgrind and atomic operations

Discussion:

David Faure

2012-08-25 08:45:12 UTC

How do I tell helgrind that some atomic operations on integers are OK?

A suppression like this works, but it would make helgrind more useful to all Qt users if either atomic operations were handled automatically, or if this was added to valgrind's own suppressions.

{
qt5_basic_atomic_integer
Helgrind:Race
fun:_ZNK19QBasicAtomicIntegerIiE4loadEv
}

Also, this only works for integers, not for pointers...

==8489== Possible data race during read of size 8 at 0x8B9C968 by thread #3
==8489== at 0x57549C0: QBasicAtomicPointer<QTextCodec>::loadAcquire() const (qgenericatomic.h:110)
==8489== by 0x5753A26: QTextCodec::codecForLocale() (qtextcodec.cpp:683)
==8489== by 0x558BD08: QString::toLocal8Bit() const (qstring.cpp:3959)
==8489==
==8489== This conflicts with a previous write of size 8 by thread #2
==8489== at 0x57549A2: QBasicAtomicPointer<QTextCodec>::storeRelease(QTextCodec*) (qgenericatomic.h:119)
==8489== by 0x5757D5D: QIcuCodec::defaultCodecUnlocked() (qicucodec.cpp:441)
==8489== by 0x5753A43: QTextCodec::codecForLocale() (qtextcodec.cpp:687)
==8489== by 0x558BD08: QString::toLocal8Bit() const (qstring.cpp:3959)

Can't make suppressions for these, given that the template class could use any type
(here QTextCodec).

The implementation of loadAcquire/storeRelease is compiler and platform dependent.
With C++11 it uses std::atomic, with gcc (in non-c++11 mode) it uses some __sync_synchronize stuff,
on ARM there's a bit of assembly, etc.

I presume all this is too low-level for helgrind? I.e. can it actually detect atomic operations
and therefore validate Qt's implementation, or should we rather just "trust Qt" and aim for
silencing helgrind whenever it sees QBasicAtomic{Integer,Pointer}?

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Sponsored by Nokia to work on KDE, incl. KDE Frameworks 5

Bart Van Assche

2012-08-25 10:30:45 UTC

Permalink

Post by David Faure
How do I tell helgrind that some atomic operations on integers are OK?
A suppression like this works, but it would make helgrind more useful to all Qt users if either atomic operations were handled automatically, or if this was added to valgrind's own suppressions.
{
qt5_basic_atomic_integer
Helgrind:Race
fun:_ZNK19QBasicAtomicIntegerIiE4loadEv
}
Also, this only works for integers, not for pointers...
==8489== Possible data race during read of size 8 at 0x8B9C968 by thread #3
==8489== at 0x57549C0: QBasicAtomicPointer<QTextCodec>::loadAcquire() const (qgenericatomic.h:110)
==8489== by 0x5753A26: QTextCodec::codecForLocale() (qtextcodec.cpp:683)
==8489== by 0x558BD08: QString::toLocal8Bit() const (qstring.cpp:3959)
==8489==
==8489== This conflicts with a previous write of size 8 by thread #2
==8489== at 0x57549A2: QBasicAtomicPointer<QTextCodec>::storeRelease(QTextCodec*) (qgenericatomic.h:119)
==8489== by 0x5757D5D: QIcuCodec::defaultCodecUnlocked() (qicucodec.cpp:441)
==8489== by 0x5753A43: QTextCodec::codecForLocale() (qtextcodec.cpp:687)
==8489== by 0x558BD08: QString::toLocal8Bit() const (qstring.cpp:3959)
Can't make suppressions for these, given that the template class could use any type
(here QTextCodec).
The implementation of loadAcquire/storeRelease is compiler and platform dependent.
With C++11 it uses std::atomic, with gcc (in non-c++11 mode) it uses some __sync_synchronize stuff,
on ARM there's a bit of assembly, etc.
I presume all this is too low-level for helgrind? I.e. can it actually detect atomic operations
and therefore validate Qt's implementation, or should we rather just "trust Qt" and aim for
silencing helgrind whenever it sees QBasicAtomic{Integer,Pointer}?

The Qt authors will have to modify the Qt library somewhat before it
will become convenient to analyze Qt programs with Helgrind or DRD. See
also this bug report (created three years ago):
http://bugreports.qt-project.org/browse/QTBUG-5655.

Bart.

Julian Seward

2012-08-25 20:43:58 UTC

Permalink

Post by David Faure
How do I tell helgrind that some atomic operations on integers are OK?
[...]

Without looking in more detail at this, it is difficult to tell if that
is really the right question to ask.

Helgrind treats atomic operations (atomic read-modify-write) as if they
were simply reads, both for x86/x86_64, and similarly for LL/SC pairs on
ARM, PPC, MIPS, etc. So it should not report races on code that uses
atomic operations.

Now, that said, a reason it might be reporting races like this is
that Qt is using the atomic ops to create inter-thread synchronisation
that Helgrind cannot see, hence it reports a race where there is none.

Or maybe Qt really is racey -- perhaps harmlessly. In which case H is
correct to complain.

Without looking further, it's not really possible to say what's happening.
I would be happy to do that if you make it easy (viz, send commands
starting with "wget qt5.tarball" etc) to build a simple test case.

One way to hide the complaints is use
VALGRIND_{DISABLE,ENABLE}_ERROR_REPORTING
around the affected areas. Read the docs in valgrind.h about that.

Or you can use VALGRIND_HG_{DISABLE,ENABLE}_CHECKING on the address
range (in helgrind.h) to tell H not to check the particular address
range. That's quite dangerous though because you need be sure to
re-enable checking on the range when the memory is freed.

Both of those approaches are basically bad though, because they
might hide real bugs, both in Qt and in apps linked against it.
Better to investigate what is going on here, fix any bugs found, and
mark up Qt to notify Helgrind/DRD of any inter-thread synchronisation
that the normal pthread intercepts cannot see. Once that is done,
it should be possible to use Helgrind/DRD on Qt without false errors.

That could be complex though, because it may involve describing the
synchronisation induced by thread-safe reference counting. That's
doable -- I marked up Firefox's XPCOM implementation so it can be
Helgrinded, so I know it is possible.

LMK if I can help more.

J

David Faure

2012-08-25 23:30:43 UTC

Permalink

Post by Julian Seward
Or maybe Qt really is racey

Bingo :)
I tried extracting a testcase of the problem, and it turns out that Qt5's
feature detection (which implementation to use) goes wrong, and a no-op code
path ends up being used for this particular operation.

I reported this to Thiago, I'll be back with a real problem next time, not a
no-op implementation :-)

Thanks for the input.

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Sponsored by Nokia to work on KDE, incl. KDE Frameworks 5

David Faure

2012-08-26 09:06:36 UTC

Permalink

Post by David Faure

Post by Julian Seward
Or maybe Qt really is racey

Bingo :)

OK, maybe not. Turns out the code runs as intended by the Qt developers.

Marc Mutz said "
The standard says it's racy, but the implementation of
std::atomic::load(memory_order_acquire) won't look different. Simple reads
and writes on x86 are already sequentially consistent. Think MESI cache
coherency. Before a CPU writes to a memory location it needs to acquire
exclusive ownership (E) of the cache line, the well-known "hardware mutex" on
a cache line that produces False Sharing, too. This seems to hold for all
architectures, cf. threads re: "Brief example ..." at
http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/thread.html
"

I attached a pure C++ testcase of the issue. Compile it with
g++ testcase_atomic_ops_helgrind.cpp -o testcase_atomic_ops_helgrind -lpthread

Thiago expects that helgrind can't autodetect this case and that helgrind-
macros markup is needed in Qt, I'm fine with adding that if you guys agree --
after you show me how by modifying the attached example :-)

Thanks.

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Sponsored by Nokia to work on KDE, incl. KDE Frameworks 5

Julian Seward

2012-08-26 09:28:06 UTC

Permalink

Post by David Faure
Thiago expects that helgrind can't autodetect this case and that helgrind-
macros markup is needed in Qt, I'm fine with adding that if you guys agree
-- after you show me how by modifying the attached example :-)

IIUC then, QBasicAtomicInt::{loadAcquire,storeRelease} just does normal
loads and stores of an int. Right?

What atomicity properties are you expecting onDemandNumber() to have?
Viz, how is onDemandNumber supposed to behave when you have multiple
threads doing it at the same time on the same location?

J

David Faure

2012-08-26 09:40:52 UTC

Permalink

Post by Julian Seward

IIUC then, QBasicAtomicInt::{loadAcquire,storeRelease} just does normal
loads and stores of an int. Right?

Yep (on x86). This whole API exists so that other architectures can get
another implementation.

Post by Julian Seward
What atomicity properties are you expecting onDemandNumber() to have?
Viz, how is onDemandNumber supposed to behave when you have multiple
threads doing it at the same time on the same location?

static int onDemandNumber() {
static QBasicAtomicInt sharedInt = { 0 };
if (!sharedInt.loadAcquire()) {
// some calculation goes here
sharedInt.storeRelease(42);
}
return sharedInt.loadAcquire();
}

I think the point is that the first call to loadAcquire() should either return
0 or 42, but never some intermediate value due to a write in progress.

Hmm, I see the point though. This *is* racy by design, since you don't know if
you'll get 0 or 42, if another thread is running at the same time. The only
thing is, we know it's fine to get either one, since we'll simply do the
calculation twice if two threads get 0 at the same time. Overall this is more
performant than using a mutex every single time this is called.

I'm not sure what can be done then, to avoid a helgrind warning.

I presume the only solution is to annotate onDemandNumber() itself, not
QBasicAtomicInt. I.e. annotate all uses of QBasicAtomicInt where the overall
logic makes it safe, in order to still be able to detect unsafe uses...
(like load, increment, store).

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Sponsored by Nokia to work on KDE, incl. KDE Frameworks 5

Julian Seward

2012-08-26 10:49:27 UTC

Permalink

Post by David Faure
I'm not sure what can be done then, to avoid a helgrind warning.

If you can guarantee that "// some calculation goes here" touches only
thread-local state, then there is only a race on sharedInt itself. In
which case, use VALGRIND_HG_{DISABLE,ENABLE}_CHECKING to disable reporting
on the relevant specific instances of the sharedInt.

If "// some calculation goes here" touches shared state then you will
have to use VALGRIND_{DISABLE,ENABLE}_ERROR_REPORTING instead. This is
a heavyweight and undesirable solution, though.

----------

My understanding of this is that it is in violation of the C++11 memory
model, which says that "the implementation" may deliver stores from one
core to another in any order, in the absence of any C++11 mandated inter-
thread synchronisation.

You can argue that for x86 the hardware's TSO guarantees make this
harmless ...

Post by David Faure
Marc Mutz said "
The standard says it's racy, but the implementation of

... but AIUI, "the implementation" also includes the compiler, and I
believe it has been observed that GCC will indeed break your code in
unexpected ways, in some cases. In short you need to prove that not
only the hardware won't reorder stores between cores -- which for x86
happens to be true -- but also the compiler won't.

J

David Faure

2012-10-24 06:21:07 UTC

Permalink

Hi, I'm finally coming back to this issue.

Post by Julian Seward

Post by David Faure
I'm not sure what can be done then, to avoid a helgrind warning.

This seems to be what I need.
VALGRIND_HG_DISABLE_CHECKING(&_q_value, sizeof(_q_value));
in loadAcquire silences the warning.

Surprisingly, VALGRIND_HG_ENABLE_CHECKING doesn't appear to work, though.
All races are suppressed, even obvious races that warn if disable+enable was
never used before. Testcase attached, see the call to oops(). In my tests,
ENABLE_CHECKING basically behaves like DISABLE_CHECKING (for instance if you
simply put a ENABLE_CHECKING at the beginning of loadAcquire and nothing else,
then there's no warning at all anymore).

Post by Julian Seward
----------
My understanding of this is that it is in violation of the C++11 memory
model, which says that "the implementation" may deliver stores from one
core to another in any order, in the absence of any C++11 mandated inter-
thread synchronisation.
You can argue that for x86 the hardware's TSO guarantees make this
harmless ...

Post by David Faure
Marc Mutz said "
The standard says it's racy, but the implementation of

Yes but AFAIU that's what the "volatile" does -- prevent the compiler from
reordering.

However valgrind can't possibly find out that "volatile" was used, if all that
does is disable compiler optimizations, so I agree that this cannot all work
out of the box, valgrind-specific markup is definitely needed in the Qt atomics
class. Current version of my patch attached -- no re-ENABLE for now, since it
doesn't work anyway ;)

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

David Faure

2013-01-19 10:00:37 UTC

Permalink

For a different approach to this issue: if helgrind can't reliably detect
atomic operations, as discussed before, and if VALGRIND_HG_ENABLE_CHECKING is
broken (see my previous email in this thread), then a simple solution is to
add suppressions.

{
QBasicAtomicPointer_load
Helgrind:Race
fun:_ZNK19QBasicAtomicPointer*loadAcquireEv
}
has cut down the noise considerably in my tests.
Plus a few similar ones, like QBasicAtomicInt.

Historically we had Qt-related suppressions in a suppression file shipped with
KDE, but there are many more Qt users than the KDE developers. Would it be OK
to ship a qt5.supp file within valgrind, and load it unconditionnally, like
xfree-4.supp is currently handled?

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

David Faure

2012-08-27 11:44:03 UTC

Permalink

Post by David Faure

Post by Julian Seward

Post by David Faure
Thiago expects that helgrind can't autodetect this case and that
helgrind- macros markup is needed in Qt, I'm fine with adding that if
you guys agree -- after you show me how by modifying the attached
example :-)

IIUC then, QBasicAtomicInt::{loadAcquire,storeRelease} just does normal
loads and stores of an int. Right?

Yep (on x86). This whole API exists so that other architectures can get
another implementation.

static int onDemandNumber() {
static QBasicAtomicInt sharedInt = { 0 };
if (!sharedInt.loadAcquire()) {
// some calculation goes here
sharedInt.storeRelease(42);
}
return sharedInt.loadAcquire();
}

If sharedInt is a std::atomic<int>, then the above is not a C++11 data race,

No, it's a bare int, just like Qt5's QBasicAtomicInt does by default on x86.
You missed the initial email with the testcase, here it is
(testcase_atomic_ops_helgrind.cpp).

so Helgrind shouldn't warn.

Helgrind warns, hence my mail here.

If 'some calculation goes here' doesn't write
to memory global memory, you could even drop the memory ordering. Atomics
don't participate in data races, but they might also not establish the
happens-before that you need elsewhere.

This is obviously a reduced testcase ;)
The calculation, in the example I took, is to register the metatype, which is
global stuff, but which is somewhat safe if done twice (it will return the same
number). Inside QMetaType::registerNormalizedType there's a mutex. The whole
point, though, as I see it, is to avoid locking in the very common case where
the metatype has been registered already.

Post by David Faure
I think the point is that the first call to loadAcquire() should either
return 0 or 42, but never some intermediate value due to a write in
progress.

Correct. What's more: for any thread, any read of an atomic variable must
return a value previously read, or a value written later by another thread.
IOW: once a thread sees 42, it can't magically see 0 the next time.
According to Hans Boehm, this may happen on IA-64 because the architecture
allows to reorder reads from the same memory location.

Which is why the atomic stuff expands to different code on IA-64, so no problem.

If it writes to a shared memory location, the code contains a C++11 data
race, and you need more complex synchronisation

David Faure

2013-01-23 14:08:27 UTC

Permalink

If atomic loads and stores on x86 are implemented with a volatile cast,
then the compiler can't reorder stuff around them, either. Not more than
with a std::atomic, at least. QAtomic does that. For load-relaxed, Thiago
thinks that a normal read (non-volatile) is correct and I couldn't prove
him wrong.

I was talking to Julian about this again today, and he pointed me to this writeup:

http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming

We're looking at how to silence valgrind about Qt atomic ops, but before that,
it would actually be good to be sure that what Qt does it correct, on x86....

Where does the claim about "volatile cast means the compiler can't reorder stuff around them"
come from?

In the Qt source code I see qatomic_gcc.h (which is unused, unfortunately) calling
__sync_synchronize() in loadAcquire(). Shouldn't Qt's code call that, when compiled with gcc?

This does lead to different assembly, too, on x86.
So the claim that x86 doesn't need memory barriers seems wrong?

--- testcase_atomic_ops_helgrind.s.orig 2013-01-23 15:04:20.889417624 +0100
+++ testcase_atomic_ops_helgrind.s 2013-01-23 15:07:06.938422071 +0100
@@ -380,6 +380,7 @@ _ZN10QAtomicOpsIiE11loadAcquireERKi:
movq -24(%rbp), %rax
movl (%rax), %eax
movl %eax, -4(%rbp)
+ mfence
movl -4(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
@@ -403,6 +404,7 @@ _ZN10QAtomicOpsIiE12storeReleaseERii:
movq -8(%rbp), %rax
movl -12(%rbp), %edx
movl %edx, (%rax)
+ mfence
popq %rbp
.cfi_def_cfa 7, 8
ret

--
David Faure, ***@kde.org, http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5

Julian Seward

2013-01-23 14:57:35 UTC

Permalink

Post by David Faure
http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming
We're looking at how to silence valgrind about Qt atomic ops, but before that,
it would actually be good to be sure that what Qt does it correct, on x86....
Where does the claim about "volatile cast means the compiler can't reorder stuff around them"
come from?

I think that claim is incorrect. It may be that the compiler is not
allowed to reorder volatile stores, but for sure it can reorder volatile
stores w.r.t. non-volatile stores, which makes the volatile stores
completely useless from a multithreading perspective.

That writeup contains a nice example:

volatile int Ready;
int Message[100];

void foo( int i ) {
Message[i/10] = 42;
Ready = 1;
}

Imagine that 'message' is some chunk of info that we intend to give to
another thread, and 'Ready' is a flag that we intend to use to tell the
other thread when the message is ready to be read. gcc-4.7.2 -m32 -O2
produces this assembly (omitting the CFI junk):

foo:
movl 4(%esp), %ecx
movl $1717986919, %edx
movl $1, Ready <---------- Ready = 1;
movl %ecx, %eax
imull %edx
sarl $31, %ecx
sarl $2, %edx
subl %ecx, %edx
movl $42, Message(,%edx,4) <--------- Message[i / 10] = 42;
ret

so the volatile clearly didn't have the claimed effect.

Bottom line is, as I understand it, that C++11 allows implementations
(compiler + hardware) to assume code is race free, and optimise
accordingly. On x86, we have the added guarantee that the hardware
will deliver stores in order, but the compiler is still free to reorder
ordinary stores.

J

Dave Goodell

2013-01-23 15:22:49 UTC

Permalink

Post by David Faure

[…]

Post by David Faure
Where does the claim about "volatile cast means the compiler can't reorder stuff around them"
come from?

Julian already addressed this in a separate post, and I agree with his assessment re: access to volatile-qualified variables.

Post by David Faure
In the Qt source code I see qatomic_gcc.h (which is unused, unfortunately) calling
__sync_synchronize() in loadAcquire(). Shouldn't Qt's code call that, when compiled with gcc?

If you don't want to write inline assembly, this might be your best bet. But on TSO systems like x86, you only need a "compiler barrier". In x86 inline assembly syntax, this looks like:

__asm__ __volatile__ ( "" ::: "memory" )

This prevents GCC (and compilers that correctly support this syntax) from reordering accesses across this statement or making assumptions about the state of memory across this point.

The only case you need to worry about on x86 (assuming you are not fiddling with "special" memory) is that earlier stores could be reordered after subsequent loads. That should be the only time you need the real "mfence" instructions that you have below. "load-acquire" and "store-release" don't fall into that category.

Post by David Faure
This does lead to different assembly, too, on x86.
So the claim that x86 doesn't need memory barriers seems wrong?
--- testcase_atomic_ops_helgrind.s.orig 2013-01-23 15:04:20.889417624 +0100
+++ testcase_atomic_ops_helgrind.s 2013-01-23 15:07:06.938422071 +0100
movq -24(%rbp), %rax
movl (%rax), %eax
movl %eax, -4(%rbp)
+ mfence
movl -4(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
movq -8(%rbp), %rax
movl -12(%rbp), %edx
movl %edx, (%rax)
+ mfence
popq %rbp
.cfi_def_cfa 7, 8
ret

-Dave

Julian Seward

2013-01-23 15:47:32 UTC

Permalink

Post by Dave Goodell
If you don't want to write inline assembly, this might be your best bet.
But on TSO systems like x86, you only need a "compiler barrier".
__asm__ __volatile__ ( "" ::: "memory" )
This prevents GCC (and compilers that correctly support this syntax)
from reordering accesses across this statement or making assumptions
about the state of memory across this point.

Yeah, you're right. Thanks for bringing clarity to my (concurrently-)
fried brain.

To make a stab at an executive summary, then:

* in intentionally racey code, we need to worry about both hardware and
compiler reordering stores. That requires, at a minimum, inserting
suitable load and store fences.

* on x86/x86_64, we only need to worry about the compiler reordering
stores, since the hardware promises not to. That means all we need
is a compiler barrier, as you show -- probably on both sides of
both the load and store.

* regardless of the above 2 points, since C++11 allows compilers to
assume code is race free, we can't rule out the possibility that the
compiler does some other legal transformation which makes the code
not work as the authors intended.

Because of this last point, overall I'm still with Pat LoPresti, who
summarised the situation admirably thus: "There is no such thing as a
benign data race. Ever." (v-users, 4 Dec 2012)

IMO, the idea that some races are harmless is a dangerous myth that
we should try to stamp out.

J

Dave Goodell

2013-01-23 16:33:08 UTC

Permalink

Post by Dave Goodell
If you don't want to write inline assembly, this might be your best bet.
But on TSO systems like x86, you only need a "compiler barrier". In x86
__asm__ __volatile__ ( "" ::: "memory" )
This prevents GCC (and compilers that correctly support this syntax) from
reordering accesses across this statement or making assumptions about the
state of memory across this point.

Indeed, but it also prevents proper merging of stores and loads. For example,
if x is an atomic variable, the compiler can and should merge these two
x = 1;
x = 2;
It's a trade-off until we have the proper implementation with std::atomic.

Agreed. My comments refer to the current common state of the world, not a C11/C++11 world. That's all too new for me to be able to rely on its availability yet. I am eagerly awaiting the day that inline assembly and atomic libraries can be ditched altogether.

Post by Dave Goodell
The only case you need to worry about on x86 (assuming you are not fiddling
with "special" memory) is that earlier stores could be reordered after
subsequent loads. That should be the only time you need the real "mfence"
instructions that you have below. "load-acquire" and "store-release" don't
fall into that category.

Which is why the qatomic_gcc.h implementation is a last resort for Qt. Since
it does a full memory barrier including mfence, it's way too expensive.
Note that x86 does have sfence and lfence instructions too. I should go ask
some colleagues at Intel about when they should be used, because I haven't yet
found a case where they are needed.

It does, but AIUI they are only needed for write-combining (WC) memory regions, such as memory mapped video card frame buffers.

Here's the listing of the current implementations of atomics in Qt and their
0) MSVC: applies only to MSVC, using intrinsics
1) Arch-specific implementation: uses assembly for the fetchAndAdd, testAndSet
and fetchAndStore operations, but uses direct volatile loads/stores for
loadAcquire and storeRelease, and non-volatile loads/stores for load/store,
which are subject to reordering by the compiler. This is the default on x86
with GCC, Clang and ICC.

The above will probably only work on Itanium machines where compilers seem to emit acquire/release suffixes on volatile load/store operations. You should probably update this implementation to include compiler barriers.

2) std::atomic implementation, if std::atomic is supported. This appeared on
GCC in 4.6, but was implemented using the old intrinsics with full barrier. In
GCC 4.7, it uses the new intrinsics that support more relaxed barriers. This
works fine for x86, but the implementation is incredibly sub-optimal on other
architectures (uses locks on ARM).
3) GCC old intrinsics implementation, the one with full barriers.
From GCC's plans, the implementation in 4.8 will solve the implementation
issues on other architectures, so we may start using that. That means Qt 5.2.

Makes sense.

In any of the implementations, all the code is inlined, so there's nothing for
valgrind to react to. The best we could do is insert some of valgrind's no-op
hints, on debug or special builds. Can you recommend what hints to add?
May I also suggest an out-of-line marker strategy, similar to what the Linux
kernel does for catching processor faults, and the ABI does for exceptions? If
we had this, we'd leave the markers enabled even in release builds.

I'm not sure I'm familiar with this technique. Do you have a link to any reading (even code is fine) that I could do on the subject?

-Dave

Dave Goodell

2013-01-23 16:35:02 UTC

Permalink

Sorry for the extra email. I accidentally whacked the "send email" key combination instead of the "paste" key combination. The link that I intended to send was:

http://thread.gmane.org/gmane.comp.debugging.valgrind.devel/11471/focus=11480

-Dave

Patrick J. LoPresti

2012-08-27 22:03:55 UTC

Permalink

Post by David Faure
Marc Mutz said "
The standard says it's racy, but the implementation of
std::atomic::load(memory_order_acquire) won't look different. Simple reads
and writes on x86 are already sequentially consistent.

I do not think this is true. Loads on x86 do have acquire semantics,
and stores do have release semantics, but that is not the same as
sequential consistency. If loads and stores on x86 were "already
sequentially consistent", there would be no need for the mfence
instruction...

And as Julian points out, the compiler has plenty of license to
reorder things even if the CPU does not. Dealing with all of this is
why C++11 has a memory model.

Out of curiosity, is Helgrind expected to generate errors for
correctly-implemented lock-free algorithms? (That is, algorithms with
correct memory barriers and no mutexes?)

- Pat

Julian Seward

2012-08-28 08:28:32 UTC

Permalink

Post by Patrick J. LoPresti

Post by David Faure
Marc Mutz said "
The standard says it's racy, but the implementation of
std::atomic::load(memory_order_acquire) won't look different. Simple
reads and writes on x86 are already sequentially consistent.

On re-reading this, I am wondering now if Marc meant "x86 TSO (total
store order)" when he said "sequential consistency". IIRC from Hennessy
and Patterson, no real processor provides sequential consistency because
to do so precludes the use of per-processor store queues, and without
those a high performance memory system is more or less impossible.
x86 does provide TSO, a guarantee that stores by one processor appear
to other processors in the same order. But that's a weaker guarantee
than sequential consistency.

Post by Patrick J. LoPresti
Out of curiosity, is Helgrind expected to generate errors for
correctly-implemented lock-free algorithms? (That is, algorithms with
correct memory barriers and no mutexes?)

It will generate false errors in cases where it can't "see" all of the
happens-before relationships between threads. By default it ships with
intercepts that correctly detect those relationships for POSIX pthreads
(by intercepting pthread_mutex_lock, unlock, etc). However, if you
create your own locking primitives or do synchronisation some other way,
there's a flexible set of client requests for marking up code, so it
can indeed be checked. I did this just the other day for home-rolled
spinlocks.

J

David Faure

2012-08-28 16:39:14 UTC

Permalink

---------- Forwarded Message ----------

Subject: Re: [Valgrind-users] helgrind and atomic operations
Date: Monday 27 August 2012, 15:25:14
From: Marc Mutz <***@kdab.com>
To: David Faure <***@kde.org>
CC: valgrind-***@lists.sourceforge.net

[I can't post to valgrind-users, please fwd]
[...]

Post by David Faure

Post by David Faure
static int onDemandNumber() {
static QBasicAtomicInt sharedInt = { 0 };
if (!sharedInt.loadAcquire()) {
// some calculation goes here
sharedInt.storeRelease(42);
}
return sharedInt.loadAcquire();
}

If sharedInt is a std::atomic<int>, then the above is not a C++11 data
race,

No, it's a bare int, just like Qt5's QBasicAtomicInt does by default on
x86. You missed the initial email with the testcase, here it is
(testcase_atomic_ops_helgrind.cpp).

The point I was trying to make was that unless Q*Atomic has the same semantics
as std::atomic, we don't need to continue talk about this, so let's assume
that it should have, and check what's missing, then. And, according to the
threads here:
http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/
the implementation of std::atomic would (except for IA-64 load) look exactly
the same as QAtomic, save that the compiler must magically know it's dealing
with an atomic operation, and not reorder them. AFAIU, this is ensured by the
use of asm() in the QAtomic code, or alternatively with a volatile cast.

I'm still not sure that Helgrind will be able to detect atomic operations from
assembler instructions alone, because at least on x86, all loads and stores
are already atomic (because of E in MESI), unless they're misaligned and/or
cross cache-line boundaries.

What's worth, memory barriers on x86 are no-ops, unless you go for
std::memory_order_seq_cst, which QAtomics don't implement.

So, IMO, QAtomic needs to mark up the code such that valgrind can see that
*this* MOV comes from an atomic, and thus never races against another
*atomic* MOV on the same memory location, but races against a MOV that's not
emitted from std::atomic or QAtomic: the identical assembler code could mean
a C++11 data race or not; the semantics have been lost at the compilation
stage.

Post by David Faure

so Helgrind shouldn't warn.

Helgrind warns, hence my mail here.

This is obviously a reduced testcase ;)
The calculation, in the example I took, is to register the metatype, which
is global stuff, but which is somewhat safe if done twice (it will return
the same number). Inside QMetaType::registerNormalizedType there's a mutex.
The whole point, though, as I see it, is to avoid locking in the very
common case where the metatype has been registered already.

It's the job of QMetaType to ensure visibility. All data that could be thought
of being associated with the ID is private to QMetaType and both in the
current as well as the new one proposed in
https://codereview.qt-project.org/#change,30559 contain additional fences to
order writes to and reads from the custom metatype storage. So the memory in
qt_metatype_id() are indeed unnecessary.

[...]

Post by David Faure

If it writes to a shared memory location, the code contains a C++11 data
race, and you need more complex synchronisation

Yes, which is there, but that's not the point.