tininess detection emulation -- before/after rounding
- To: stds-754 <stds-754@xxxxxxxxxxxxxxxxx>
- Subject: tininess detection emulation -- before/after rounding
- From: Michel Hack (1-914-784-7648) <hack@xxxxxxxxxxxxxx>
- Date: Monday 15 Oct 2007 at 6:32 p.m. EDT (2007-10-15 22:32 GMT)
It turns out that my misunderstanding of after-rounding tininess detection
does NOT affect the analysis of one method over the other. It remains the
case that the two methods differ only (but not always) when the returned
result magnitude is Nmin.
(1) Emulating "after rounding" on a "before rounding" machine.
The hardware may generate some exceptions that must be covered up.
On System z (IBM mainframes), run with IEEE Underflow traps
enabled. On this architecture, the Underflow exception flag
is not raised when the exception is trapped, and in-line restart
of the failing instruction is possible. So, if the result value
is Nmin (in magnitude), a total coverup is possible. If not, and
the trap was desired by the application, it can be handled normally.
If the application wanted default exception handling, and the result
is not Nmin, the instruction has to be re-executed with the same
arguments in order to obtain the denormalised result. This may be
a problem, because one operand may have been overwritten. So we
have to do the denormalisation explicitly from the wrapped-exponent
result. This can be done with a single multiplication by the scaling
factor, executed with traps disabled. The correct summary flags will
be set as a side effect. So it's doable, but there is a performance
impact if underflow is frequent, and the application will not expect
it: as far as the application is concerned, traps are disabled
(default handling is in effect).
On System p (PowerPC), we have two problems: (a) restartable traps
are expensive, and (b) the trap raises the underflow flag. It is
therefore necessary to capture the flag before every susceptible
operation -- at which point depending on trap handling to save the
day becomes irrelevant. We might as well just use inline macros
or subroutines (compiler-generated for maximum benefit) for mult,
div and casts. Perhaps the compiler can track whether underflow
has already been signalled and transfer control to a second set
of instructions generated more efficiently -- but in the general
case the Underflow flag must be checked before the operation, and
if clear, the operation has to be followed by an Nmin check or a
second Underflow check (whichever is more efficient) to see whether
the Underflow flag needs to be reset.
So it is painful.
On x86, the issue is different. If operands are guaranteed to
remain in the standard format range (including subnormal precision
when applicable), after-rounding tininess detection already works
as desired. I'm told this is a very big IF, however, requiring
storing and reloading operands after or before every operation,
because the precision-control setting limits only the significand
precision, not the exponent range, and so does not in itself lead to
denormalisation. Somebody else would have to fill in the details,
but I suspect that there is a performance penalty somewhere.
If the compiler offers no assistance, but the programmer wants to
achieve the desired reproducible behaviour in a portable manner
(for default exception handling, otherwise portability really
becomes questionable), I *think* the primitives are there to carry
out the trapless mechanism I described above. It requires something
before every susceptible operation as well as after.
(2) Emulating "before rounding" on an "after rounding" machine.
The hardware fails to generate some exceptions that are required.
All of these cases are distinguished by the fact that the result
magnitude is Nmin. Nothing needs to be undone, so there is no
need to check anything before a susceptible operation, but the
result of susceptible operations must be checked for one or two
explicit values -- i.e. the exceptional branches will almost never
be taken. The issue is therefore code space, not performance (at
least not directly; code space does impact performance indirectly,
which is why I have mentioned outlining techniques).
Earlier I showed a portable way to achieve this, and the only
compiler requirements are decent optimisation capabilities to
detect and remove dead code, especially when static modes
are in effect.
There is a tradeoff between asking the compiler to enforce the
reproducibility and doing it "by hand", as the programmer may
know that underflow is not an issue in large sections of code.
Perhaps there should be control over this capability.
An x86 specialist (paging Terje Mathisen) might be able to show
how well this could be done on that platform.
Michel.
Sent: 2007-10-15 23:33:55 UTC