
Re: Another relevant consideration Re: Motion P1788/M0009.01_ExactDotProduct



Determinism is indeed challenging, since floating point is not
associative, and a lot of codes will not necessarily give the same
answer twice.
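
To make the non-associativity concrete, here is a minimal, self-contained
illustration (my own toy example, not taken from any particular code): the
same three summands give two different results in IEEE double precision
depending only on how they are grouped, which is exactly the freedom a
parallel reduction exploits when its tree shape changes.

    #include <stdio.h>

    int main(void)
    {
        double a = 1.0e16, b = -1.0e16, c = 1.0;

        double left  = (a + b) + c;   /* = 1.0: the large terms cancel first  */
        double right = a + (b + c);   /* = 0.0: c is absorbed into b and lost */

        printf("(a+b)+c = %g\n", left);
        printf("a+(b+c) = %g\n", right);
        return 0;
    }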

An interesting example arose when we were trying to debug the
ScaLAPACK nonsymmetric eigensolver, which computes the Schur
form A = U*T*U^T. It has two modes, depending on whether one wants
the Schur vectors U or not. When U is requested, one checks
correctness by making sure the residuals norm(A - U*T*U^T)
and norm(U*U^T - I) are small enough, and that T is (quasi)triangular.
But when U is not requested, all one can do is compare the T
computed without U to the T previously computed with U.
Because of nonassociative floating point, however, the eigenvalues
converged in different orders, making the two Ts very different,
even during the same run, and so impossible to compare (the
off-diagonals of T are quite different as well, not just the order
of the eigenvalues on the diagonal).
So the only way to pass the tests was to insist that the
PBLAS (parallel BLAS) run in deterministic mode, where they
always use the same reduction tree, sacrificing a bit of
performance, though fortunately rather little, at least for the
problem sizes we tested.
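
For what it's worth, here is a rough sketch of what "deterministic mode"
buys (just an illustration of the idea, not the actual PBLAS code): a
summation whose reduction tree is a fixed function of the problem size
alone, so the rounding pattern, and hence the result, is the same no
matter how the pieces are scheduled across processors.

    #include <stddef.h>

    /* Sum x[0..n-1] with a fixed binary tree: always split at the midpoint.
     * The two halves could be computed by different processors, but the
     * split points -- and therefore the order of the roundings -- depend
     * only on n, so the result is reproducible run to run. */
    static double fixed_tree_sum(const double *x, size_t n)
    {
        if (n == 0) return 0.0;
        if (n == 1) return x[0];
        size_t half = n / 2;
        return fixed_tree_sum(x, half) + fixed_tree_sum(x + half, n - half);
    }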

We have had feedback from some users (e.g., MathWorks) that
determinism is critical to them.

If a standard specifies a unique answer (like "round to nearest even")
then you can count on determinism. If you only specify "faithful rounding",
say, then I see no guarantee. And determinism can easily be lost by
higher-level routines not specified by the standard, such as the
(P)BLAS, equation solvers, or whatever any user may write.

So the question is whether you want the standard to enable a concerned
user to guarantee determinism of higher-level codes, if they want to.
This seems like a worthy goal, given that you are targeting a user
community that is likely to be the most sensitive to these issues.
On the other hand, if you do this by insisting on a much more expensive
implementation than otherwise necessary, you may discourage other users.

I'm not sure how to quantify this tradeoff, or whether it makes sense
to address the needs of both kinds of users (at the likely cost of a more
complicated standard).

In the case of summations or dot products, insisting on correct rounding,
or on narrowest interval enclosures, obviously guarantees determinism. I'm
not sure how much faster a looser requirement of just, say, a few ulps
would be, with or without specifying "determinism" in the standard.
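
As a concrete instance of the "few ulps" family, here is a sketch in the
style of the compensated dot product Dot2 of Ogita, Rump, and Oishi
(written from memory and assuming a correctly rounded fma; not anyone's
production code). It tracks the rounding error of each product and each
partial sum and adds the accumulated correction at the end, reaching
roughly level-2 accuracy (~ eps^2 * cond) at modest cost. Note that its
result still depends on the evaluation order, so by itself it does not
give the bitwise reproducibility that a correctly rounded result would.

    #include <math.h>      /* fma */
    #include <stddef.h>

    /* Error-free transformation of a sum: a + b = s + e exactly (Knuth). */
    static void two_sum(double a, double b, double *s, double *e)
    {
        *s = a + b;
        double z = *s - a;
        *e = (a - (*s - z)) + (b - z);
    }

    /* Compensated dot product of x and y, length n (Dot2-style sketch). */
    static double dot2(const double *x, const double *y, size_t n)
    {
        double s = 0.0, c = 0.0;            /* running sum and correction   */
        for (size_t i = 0; i < n; i++) {
            double h = x[i] * y[i];
            double r = fma(x[i], y[i], -h); /* exact error of the product   */
            double q;
            two_sum(s, h, &s, &q);          /* exact error of the sum, in q */
            c += q + r;                     /* accumulate both corrections  */
        }
        return s + c;
    }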

Jim



Ralph Baker Kearfott wrote:
> Jim,
>
> Does the fact that virtually all machines are parallel now
> imply that it is impossible for computations to be totally reproducible
> anyway, or does it imply that reproducibility is still
> achievable if we have sufficiently strict arithmetic
> standards and they are implemented?
>
> Baker
>
> James Demmel wrote:
>   
>> Independent (mostly) of the 4 levels of accuracy that Siegfried points
>> out below,
>> there is the question of whether the standard should require the
>> operation to be
>> deterministic, i.e. get the same answer whether it is run on different
>> machines,
>> or even twice on the same machine. Unless the answer is uniquely defined
>> ("round
>> to nearest even", "round down" etc.) there is no reason an implementation
>> on a platform with different numbers of processors available at
>> different times,
>> even during a single run, would always use the same reduction tree to
>> compute
>> a sum or analogous operation, and so get the same answer.
>>
>> We have both kinds of users of our linear algebra software, those for
>> whom speed
>> is more important than determinism (and so using the fastest reduction
>> tree is more
>> important than the same reduction tree), and those who prioritize
>> determinism
>> (to ease debugging, not confuse users, and sometimes because it is
>> required).
>> Since it is rapidly becoming impossible to buy a non-parallel machine
>> (because of
>> multicore, etc.), I wonder what people think the standard should say
>> about determinism?
>>
>> Jim
>>
>> Siegfried M. Rump wrote:
>>     
>>> On Sat, 31 Oct 2009 23:56:04 +0100, Ralph Baker Kearfott
>>> <rbk@xxxxxxxxxxxx> wrote:
>>>
>>>       
>>>> P-1788,
>>>>
>>>> I've been trying to decide how to vote on this issue, and
>>>> am wrestling with the following consideration: If we demand
>>>> the exact dot product as stated in the motion, we will obtain
>>>> reproducibility. However, this will be at the expense of having
>>>> an efficient software implementation. We have efficient software
>>>> implementations (e.g. Rump et al) of faithfully rounded dot products,
>>>> which
>>>> would be standard-conforming, and would to my knowledge satisfy
>>>> the needs of applications requiring an accurate dot product, if we
>>>> loosened the requirement, but I don't know if we can word the
>>>> standard to both allow these and for the results to be bit-wise
>>>> reproducible.
>>>>
>>>> This is just some food for thought, or for refutation.
>>>>
>>>> Baker
>>>>
>>>>         
>>> P-1788,
>>>
>>> Arnold asked me to comment on the issue, so here are some thoughts.
>>>
>>> Let a floating-point arithmetic with relative rounding error unit eps
>>> be given, e.g. 2^-53 in double precision. For the result of a sum or
>>> dot product there are basically 4 different "accuracy levels":
>>>
>>> 1 ordinary (recursive) evaluation, accuracy ~ eps*cond
>>> 2 evaluation in precision eps^2, accuracy ~ eps^2*cond
>>> 3 faithfully rounded result, accuracy ~ eps
>>> 4 rounded to nearest result, accuracy ~ 0.5*eps
>>>
>>> It was asked, what do we need, and how much does it cost, very
>>> reasonable questions.
>>>
>>> First, what do we need. IMHO and for my scope of applications:
>>>
>>> frequently level 1 is not enough.
>>> In many cases level 2 is needed and it is sufficient.
>>> Level 3 would be nice to have because it closes the gap between
>>> precision and accuracy.
>>> In special cases level 4 may be nice.
>>>
>>> The difference between 3 and 4 reminds me of faithful and to-nearest
>>> conversion. For example,
>>>
>>> f =
>>> .93132257461547861902257656912845935892608650874535669572651386260986328126e-9
>>>
>>>
>>> would likely yield the result 2^(-30), and scanning all except the last
>>> mantissa digit "6" this is correctly rounded to nearest. But with the
>>> last
>>> digit it is not correct. In both cases, 2^(-30) would be a correct
>>> faithful
>>> rounding.
>>>
>>> A big advantage of rounding to nearest is that the result is uniquely
>>> defined, whereas for faithful rounding, in general, two results meet
>>> the specification.
>>>
>>> From a numerical point of view, the computing time of an algorithm
>>> producing a faithfully rounded result is proportional to the logarithm
>>> of the condition number, in my opinion a nice property.
>>>
>>> For rounding to nearest the computing time also depends on the nearness
>>> of the result to the midpoint of adjacent floating-point numbers (call
>>> that switching point), strange from a numerical point of view.
>>>
>>> Second, how much does it cost.
>>> I designed my algorithms to be fast in a high level language like C or
>>> Fortran, and therefore branches or access to mantissa or exponents are
>>> avoided. For a hardware implementation this changes completely.
>>>
>>> For summation fast algorithms are available, and in general there is not
>>> too much difference in computing time for levels 2 or 3 or 4. Of course,
>>> if the result is very near a switching point, then time for level 4
>>> increases;
>>> but those cases may be rare.
>>>
>>> For dot product the picture changes. The new difficulties come from
>>> more or less constructed examples, namely intermediate under- or
>>> overflow.
>>> Here fast algorithms for level 2 are available, and for many applications
>>> sufficient.
>>> For levels 3 or 4 care is necessary for intermediate under/overflow.
>>> One possibility is scaling. Since I expect more or less constructed
>>> examples with intermediate under/overflow, the time penalty may in
>>> general be small.
>>>
>>> With some care, the algorithms already discussed can be adapted from
>>> faithful rounding to rounding to nearest, so that basically a time
>>> penalty applies only if the exact result is near a switching point.
>>>
>>> If one does not like to guess how often intermediate under/overflow or
>>> nearness
>>> to a switching point occurs, then Malcolm's algorithm or a long
>>> accumulator
>>> advocated by Kulisch may be used.
>>>
>>> Although few people may find it necessary to have _always_ a rounded to
>>> nearest result, a clear advantage is that the result is unique. From a
>>> numerical point of view, even level 2 would be a big step and often
>>> sufficient.
>>>
>>> Best wishes
>>>
>>> Siegfried M. Rump
>>>
>>> --=====================================================
>>> Prof. Dr. Siegfried M. Rump
>>> Institute for Reliable Computing
>>> Hamburg University of Technology
>>> Schwarzenbergstr. 95
>>> 21071 Hamburg
>>> Germany
>>> phone +49 40 42878 3027
>>> fax +49 40 42878 2489
>>> http://www.ti3.tu-harburg.de
>>>
>>> and
>>>
>>> Visiting Professor at Waseda University
>>> Faculty of Science and Engineering
>>> Shinjuku Lambdax Bldg. 902
>>> 2-4-12 Okubo, Shinjuku-ku
>>> Tokyo 169-0072
>>> Japan
>>> phone/fax in Japan +81 3 5286 3414
>>>       
>
>
>