m = 0.5 * (l + u) will overflow when the sum l + u does, e.g., when both
l and u are in the larger-magnitude half of the positive representable
range, or both in the larger-magnitude half of the negative representable
range.
m = l + 0.5*(u - l) will overflow when the difference u - l does, e.g.,
when l is negative and u positive and each is in the larger-magnitude half
of the representable range for its sign. Changing that calculation to
round towards zero avoids the overflow.
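The two overflow cases above can be reproduced with a small C sketch. It
assumes IEEE 754 doubles in the default round-to-nearest mode; the function
names mid_sum and mid_diff are made up for illustration.

```c
#include <float.h>   /* DBL_MAX, for the calls described below */

/* m = 0.5 * (l + u): overflows when the sum l + u does. */
double mid_sum(double l, double u)
{
    /* volatile forces the intermediate to be rounded to double, even on
     * targets that evaluate expressions in extended precision */
    volatile double s = l + u;
    return 0.5 * s;
}

/* m = l + 0.5*(u - l): overflows when the difference u - l does. */
double mid_diff(double l, double u)
{
    volatile double d = u - l;
    return l + 0.5 * d;
}
```

Under those assumptions, mid_sum(DBL_MAX, DBL_MAX) is infinite while
mid_diff(DBL_MAX, DBL_MAX) is DBL_MAX, and for l = -DBL_MAX, u = DBL_MAX the
roles swap: mid_sum gives 0 and mid_diff is infinite.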
m = 0.5*l + 0.5*u will only overflow if l and u have the same sign, both
are equal to the maximum-magnitude representable value of that sign, and
the rounding method used for the calculation rounds away from zero. It can
also underflow, so it would unnecessarily turn
0.5*SUBNORMAL_MIN + 0.5*SUBNORMAL_MIN into zero when the other ways would
give SUBNORMAL_MIN. (Note that SUBNORMAL_MIN is not a standard name.)
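The underflow case can be sketched in C as follows. This assumes IEEE 754
doubles with gradual underflow and the default round-to-nearest-even mode.
Mapping SUBNORMAL_MIN to C11's DBL_TRUE_MIN (the smallest positive subnormal
double) is my assumption, and the function names are illustrative only.

```c
#include <float.h>

#ifndef DBL_TRUE_MIN                         /* C11 name; fallback for older headers */
#define DBL_TRUE_MIN 4.9406564584124654e-324
#endif
#define SUBNORMAL_MIN DBL_TRUE_MIN           /* assumed meaning of SUBNORMAL_MIN */

/* m = 0.5*l + 0.5*u: each product is rounded before the add. */
double mid_scaled(double l, double u)
{
    volatile double a = 0.5 * l;   /* 0.5*SUBNORMAL_MIN is a tie and rounds to 0 */
    volatile double b = 0.5 * u;
    return a + b;
}

/* m = 0.5 * (l + u): the sum is formed first, then halved. */
double mid_sum_first(double l, double u)
{
    volatile double s = l + u;     /* 2*SUBNORMAL_MIN is exactly representable */
    return 0.5 * s;
}
```

With l = u = SUBNORMAL_MIN, mid_scaled returns zero under these assumptions,
whereas mid_sum_first returns SUBNORMAL_MIN.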
It may be possible to avoid problems by using m = 0.5*l + 0.5*u and
rounding the multiplies in opposite directions (somebody please check
that). Since changing rounding modes is expensive on some architectures,
a faster alternative on them is m = 0.5*l - (-0.5*u), rounding both
multiplies in the same direction, if you can prevent your compiler from
optimizing it into m = 0.5*l + 0.5*u.
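A rough C sketch of the negated form follows. It does not change the
rounding mode itself; the caller is assumed to have selected a directed mode
already. Using volatile temporaries is only one possible way to discourage
the compiler from rewriting the expression, and whether it is sufficient
depends on the compiler and its floating-point options. The name
midpoint_neg_form is hypothetical.

```c
/* m = 0.5*l - (-0.5*u), evaluated in whatever rounding mode is current.
 * Under a directed mode, rounding -0.5*u in that mode is the same as
 * rounding 0.5*u in the opposite direction and then negating, so the two
 * products end up rounded in opposite directions without a mode switch. */
double midpoint_neg_form(double l, double u)
{
    volatile double a = 0.5 * l;    /* first product, current mode */
    volatile double b = -0.5 * u;   /* negated second product, same mode */
    return a - b;                   /* the subtraction is rounded in the current mode too */
}
```

In the default round-to-nearest mode this simply behaves like
0.5*l + 0.5*u, for example midpoint_neg_form(2.0, 4.0) gives 3.0.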
These can all give reasonable but slightly differently rounded answers.
Providing portability and reproducibility means all implementations must
use the same formula.
Costs are:
  m = 0.5 * (l + u):
      1 constant load + 1 add/subtract + 1 multiply
  m = l + 0.5*(u - l):
      1 constant load + 2 adds/subtracts + 1 multiply
  m = l + 0.5*(u - l) rounded down:
      1 constant load + 2 adds/subtracts + 1 multiply + on some
      architectures the cost of saving, setting and restoring the
      rounding mode
  m = 0.5*l + 0.5*u:
      1 constant load + 1 add/subtract + 2 multiplies
  m = 0.5*l - (-0.5*u):
      either 2 constant loads + 1 add/subtract + 2 multiplies
      or 1 constant load + 2 adds/subtracts/negates + 2 multiplies
If the cost doesn't matter, avoiding
overflow and underflow error in all cases is easy - just check the signs
and values first, then choose a formula safe for those.
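One way such a check could look in C is sketched below. It assumes finite
inputs (no NaN or infinity) and IEEE 754 doubles, it only guards against
overflow and avoidable underflow, and it makes no claim of being correctly
rounded. The name midpoint_checked is made up for this example.

```c
/* Choose a formula based on the signs of l and u. */
double midpoint_checked(double l, double u)
{
    if ((l < 0.0) != (u < 0.0)) {
        /* Opposite signs: the magnitudes partly cancel, so |l + u| is at
         * most max(|l|, |u|) and the sum cannot overflow. */
        return 0.5 * (l + u);
    }
    /* Same sign (or zero): |u - l| is at most max(|l|, |u|), so the
     * difference cannot overflow, and l itself is never halved, so a
     * subnormal l is not flushed to zero. */
    return l + 0.5 * (u - l);
}
```

The price is one comparison and a branch on top of the cheapest formulas,
which is why it only makes sense when cost does not matter.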
Integer index calculations have fewer problems because in C an array index
is always nonnegative, so l and u must have the same sign, and integer
division always rounds towards zero, so INT_MAX/2 + INT_MAX/2 can't
overflow.
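For the integer case, a minimal C sketch, assuming 0 <= lo <= hi as for
array indices (the function names are illustrative):

```c
#include <limits.h>   /* INT_MAX, for the calls described below */

/* Difference form: with 0 <= lo <= hi, hi - lo lies in [0, INT_MAX],
 * so neither the subtraction nor the final add can overflow. */
int mid_index(int lo, int hi)
{
    return lo + (hi - lo) / 2;
}

/* Halve-first form: each quotient is at most INT_MAX/2 and division
 * truncates towards zero, so the sum cannot overflow. When lo and hi are
 * both odd the result is one less than mid_index gives. */
int mid_index_halved(int lo, int hi)
{
    return lo / 2 + hi / 2;
}
```

For instance, mid_index(INT_MAX - 1, INT_MAX) is INT_MAX - 1, and
mid_index_halved(INT_MAX, INT_MAX) is also INT_MAX - 1 because INT_MAX is
odd.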
- Ian Toronto
IBM Lab 8200 Warden D2-445 905-413-3411
----- Forwarded by Ian McIntosh/Toronto/IBM on 23/02/2009 02:18 PM -----
Please respond to: Guillaume Melquiond <guillaume.melquiond@xxxxxxxx>
To: Ian McIntosh/Toronto/IBM@IBMCA
cc:
Subject: Re: small correction to the Vienna Proposal
On Monday 23 February 2009 at 11:14 -0700, Nelson H. F. Beebe wrote:
> Chenyi Hu writes today:
>
> >> I'd prefer m = 0.5 * (l + u) rather than m = l + 0.5*(u - l)
> >> because of
> >>
> >> 1. Avoiding the loss of significance in performing u-l for
> >>    narrow intervals
> >> 2. One less arithmetic operation.
>
> Chenyi's alternative suffers from premature overflow when l and u are
> greater than half the maximum representable number. The proposed form
> l + 0.5*(u - l) is safe.
It depends what you mean by "safe". Getting a midpoint that is not a
finite number (l big negative, u big positive) could break a lot of
things. So it would be better to ensure that the midpoint is always
finite:
    set round down
    m = l + 0.5 * (u - l)
    r = -(m - u)
(Disclaimer, I have no idea if the formula above is always correct.)