
tininess detection emulation -- before/after rounding



It turns out that my misunderstanding of after-rounding tininess detection
does NOT affect the analysis of one method over the other.  It remains the
case that the two methods differ only (but not always) when the returned
result magnitude is Nmin.

(1) Emulating "after rounding" on a "before rounding" machine.

    The hardware may generate some exceptions that must be covered up.

    On System z (IBM mainframes), run with IEEE Underflow traps
    enabled.  On this architecture, the Underflow exception flag
    is not raised when the exception is trapped, and in-line restart
    of the failing instruction is possible.  So, if the result value
    is Nmin (in magnitude), a total coverup is possible.  If not, and
    the trap was desired by the application, it can be handled normally.

    If the application wanted default exception handling, and the result
    is not Nmin, the instruction has to be re-executed with the same
    arguments in order to obtain the denormalised result.  This may be
    a problem, because one operand may have been overwritten.  So we
    have to do the denormalisation explicitly from the wrapped-exponent
    result.  This can be done with a single multiplication by the scaling
    factor, executed with traps disabled.  The correct summary flags will
    be set as a side effect.  So it's doable, but there is a performance
    impact if underflow is frequent, and the application will not expect
    one, because as far as it knows traps are disabled (default handling
    is in effect).

    On System p (PowerPC), we have two problems:  (a) restartable traps
    are expensive, and (b) the trap raises the underflow flag.  It is
    therefore necessary to capture the flag before every susceptible
    operation -- at which point depending on trap handling to save the
    day becomes irrelevant.  We might as well just use inline macros
    or subroutines (compiler-generated for maximum benefit) for mult,
    div and casts.  Perhaps the compiler can track whether underflow
    has already been signalled and transfer control to a second set
    of instructions generated more efficiently -- but in the general
    case the Underflow flag must be checked before the operation, and
    if clear, the operation has to be followed by an Nmin check or a
    second Underflow check (whichever is more efficient) to see whether
    the Underflow flag needs to be reset.

    So it is painful.

    On x86, the issue is different.  If operands are guaranteed to
    remain in the standard format range (including subnormal precision
    when applicable), after-rounding tininess detection already works
    as desired.  I'm told this is a very big IF, however, requiring
    storing and reloading operands after or before every operation,
    because the bounded-precision setting does not in itself lead to
    denormalisation.  Somebody else would have to fill in the details,
    but I suspect that there is a performance penalty somewhere.

    If the compiler offers no assistance, but the programmer wants to
    achieve the desired reproducible behaviour in a portable manner
    (for default exception handling, otherwise portability really
    becomes questionable), I *think* the primitives are there to carry
    out the trapless mechanism I described above.  It requires something
    before every susceptible operation as well as after.

(2) Emulating "before rounding" on an "after rounding" machine.

    The hardware fails to generate some exceptions that are required.

    All of these cases are distinguished by the fact that the result
    magnitude is Nmin.  Nothing needs to be undone, so there is no
    need to check anything before a susceptible operation, but the
    result of susceptible operations must be checked for one or two
    explicit values -- i.e. the exceptional branches will almost never
    be taken.  The issue is therefore code space, not performance (at
    least not directly; code space does impact performance indirectly,
    which is why I have mentioned outlining techniques).

    Earlier I showed a portable way to achieve this, and the only
    compiler requirements are decent optimisation capabilities to
    detect and remove dead code, especially when static modes
    are in effect.

    There is a tradeoff between asking the compiler to enforce the
    reproducibility and doing it "by hand", as the programmer may
    know that underflow is not an issue in large sections of code.
    Perhaps there should be control over this capability.

    An x86 specialist (paging Terje Mathisen) might be able to show
    how well this could be done on that platform.


Michel.
Sent: 2007-10-15 23:33:55 UTC
