Arnold Neumaier <Arnold.Neumaier@xxxxxxxxxxxx> wrote on 08/05/2011 07:35:51 AM:
> Re: fwd from Jim Demmel: More on repeatability
>
> On 08/04/2011 11:51 PM, Dan Zuras Intervals wrote:
> >> From: James Demmel<demmel@xxxxxxxxxxxxxxxxx>
> >>
> >> Just to supply a little more background on the need for being able
> >> to get the same answer when you run your program more than once:
> >>
> >> Many of you may recall a post to NA-Digest a couple of years ago,
> >> in which a commercial FEM software developer was asking whether
> >> anyone knew of a "repeatable" parallel sparse linear system solver;
> >> here "repeatable" means getting the same answer when you type
> >> a.out twice on the same machine, not the harder problem of the
> >> same answer on different machines.
>
> It makes sense to require repeatability on the same machine. I expect
> that, too, for debugging purposes.
>
> But the prior discussion was about repeatability across platforms, which
> can hardly be guaranteed without specifying exactly the required result
> of each operation (including rounding errors), and hence is an
> unreasonable requirement.
>
> >
> >> The point of all this is to say that repeatability on the same machine
> >> (let alone on different machines) is both widely expected and desired,
> >> hard to attain, and likely to be an unpleasant surprise to many users
> >> if and when they realize this. This is true with or without intervals.
> >> Of course interval bounds that are reliably narrow, if not repeatable,
> >> will mitigate the problem.
Repeatability on the same machine is both widely expected and sometimes unachieved. Several things can interfere with it:
- Uninitialized variables.
- Race conditions, where it is partly a matter of luck whether a load in one thread happens before or after a store in another. Race conditions are usually bugs, but they are sometimes deliberately allowed and considered a valid part of the algorithm.
- Less obviously, you can get race conditions without multithreading if the operating system and hardware let you access the same physical memory through different virtual addresses; e.g., by using mmap to map the same file to two different blocks of memory.
- Adaptive multithreading, where the number of threads used depends on how many are available, which may change due to workload, upgrades, scheduled maintenance, or partial system failures. Changing the number of threads may change how the work is divided, which can affect things like parallelized reductions.
- Running the program in a debugger versus directly. It's not obvious, but some instructions in some architectures (including some used for atomic operations, thread synchronization or transactional memory) cannot function exactly as normal when a debugger does a context switch to stop at breakpoints, to step through instructions one at a time, or to trace instructions.
- Dependencies on external data or events, such as data files, network input, or using the current time to set a random seed.
- A hardware design bug (most CPUs ever built have had at least one).
- A hardware malfunction.
Some of those are rare, but I've experienced every one of them.
Some of them would manifest themselves as slightly wider intervals. Some would just produce wrong answers violating containment. Generally they should not affect our standard, except that we need to be careful of wording. Requiring "always reproducible results" is impossible, even running the same executable program on the same system. Requiring "reproducible results under identical conditions" is practical, whether it is part of the standard or not.
- Ian McIntosh, IBM Canada Lab, Compiler Back End Support and Development