(long) sNaNs not what they could be...
For anyone not interested in this topic, it
will be a long diatribe on the inadequacies
of NaNs as a diagnostic tool in 754.
I apologise for posting it in this forum.
Feel free to ignore & delete.
For the rest (if any) ...
> Date: 15 Oct 2010 21:35:26 +0100
> From: "N.M. Maclaren" <nmm1@xxxxxxxxx>
> To: Dan Zuras Intervals <intervals08@xxxxxxxxxxxxxx>
> Cc: Lee Winter <lee.j.i.winter@xxxxxxxxx>, stds-1788@xxxxxxxxxxxxxxxxx
> Subject: Re: Fw: Useless sNaNs... or useful?
>
> On Oct 15 2010, Dan Zuras Intervals wrote:
> >> >>
> >> >> One important use is to initialize all floating point variables to
> >> >> signaling NaNs. If they are inadvertently not properly initialized
> >
> >    What you & Ian probably don't know is that
> >    I, along with Prof Kahan & Bob, advocated
> >    for consistent behaviors for signalling NaNs
> >    that would permit exactly this application.
>
> Good. That would restore functionality that some compilers had
> in 1970, though signalling NaNs is only one of many ways to achieve
> the objective. It does, however, have the advantage that it has no
> performance impact on code without such errors.
>
> >    As it happened there were sound technical
> >    reasons for features that made it infeasible.
> >    All of which were thought to be more important
> >    than this application. All of which I can
> >    describe for you in detail but hesitate to do
> >    so in this forum.
>
> I am unaware of any, though I am aware of a great
> many UNSOUND technical reasons :-(
>
> Please could you describe the sound technical reasons?
> If you are reluctant to do so in a semi-public forum,
> Email will do.
>
>
> regards,
> Nick Maclaren.
I will outline it.
But even that outline will take some time to
explain.
What we were after was a 'touch it & die NaN'.
Something for which even a dereference would
cause an invalid trap. Presumably to land in
a debugger or to flag the use of an
uninitialized variable.
The method would be to fill memory with this
fatal signalling NaN so that any read access
would set off the mine. If your first use of
the memory was to write to it, you were safe.
But if you read from it first, you would die
the death of the uninitialized-memory
signalling NaN.
Sounds simple enough.
Why was that hard to do?
Well, on most systems the load instruction is
not typed. It has a width but not a type.
Thus, if I am reading, say, a 32-bit quantity
out of memory I generally don't know whether
it is an integer, 4 characters, a single
precision floating-point number, or part of
a larger structure (either a larger floating-
point number or some non-floating-point
structure).
So, die on load was not really feasible.
No matter, it would be sufficient to die on
first floating-point touch.
Which would be fine if all floating-point
touches went through the floating-point ALU.
Alas, on many systems (Intel included) there
are 3 floating-point operations that do not.
They are copy, negate, & absolute value.
They can avoid the ALU because they are just
bit copies with a possible modification of
the sign bit. And they are singled out in
the standard for this reason.
Of these the most important is copy. It is
used on assignment.
Or not. You see, modern optimizers are such
that copies are generally eliminated except
in rare cases. So, even if we were to 'arm'
copies to trigger invalid we can't count on
them actually being there.
It is a bit more complex, but much the same
can be said for negate. Unary negates are
mostly eliminated by manipulation of prior
or subsequent add-like operations (changing
add to subtract or one kind of FMA to
another).
Absolute value is generally safe but also
not often used.
So these operations provide a hole through
which an uninitialized value can slip
unnoticed.
No matter, we'll get them on the first
arithmetic operation.
But wait a minute, just what was the value
of that uninitialized NaN?
The one we were seeking was the all 1's NaN.
The reason for this was that, for most
computers, it is easy to fill memory with
all zeros or all ones. Or even all copies
of some particular byte. But filling memory
with all values of anything more complex
than that involves copying from a register
or one place in memory to another. And that
is a much slower operation.
So, we wanted all 1's. That would make it
easy & fast.
But, as it happened, we recommended (in the
sense of 'should') that the all 1's NaN be
a quiet NaN. The reason for this is that
the most common thing one does with a
signalling NaN is to quiet it. If we had
to do that by turning an "I'm a signalling
NaN" bit from a 1 to a 0, there was a
danger of turning a NaN (with only that
bit set) into an infinity.
The technical term for this was "It would
be bad".
So the (strong) recommendation was that the
bit that distinguished signalling NaNs from
quiet ones take on the values 0 for signalling
& 1 for quiet. That way there would always
be a quiet NaN to 'land on' when one quiets
some valid signalling NaN.
So the all 1's NaN would not do. It had to
be something else. It had to be something
that had ones in some places (where the
exponent was) & at least one zero elsewhere
(the signalling bit).
But, single precision floating-point numbers
have 8-bit exponents. Doubles have 11. And
quads have 15. In each case the position of
the signalling bit is (recommended to be) 2
bits to the right of the rightmost exponent
bit. Counting the sign bit (just for byte
alignment) that means that 11, 14, or 18 bits
matter. The rest don't.
But that means that we have to fill memory
with a value that presumes we know which type
will be incorrectly referenced there.
How can we know that?
Further, some systems align 16-, 32-, & 64-bit
memory references on 16-, 32-, or 64-bit
boundaries. Some don't.
So not only do we have to know which type will
be incorrectly referenced, we have to know its
memory alignment.
If we get either one wrong, the bit pattern
will just look like some otherwise innocent
floating-point number.
As the reference is presumed to be incorrect in
the first place, how can we know how or why it
is incorrect?
Let's see, I may have missed something but I
think that's most of it.
Some of them may not apply to your computer.
But I guarantee some of them do.
So...
-- We can't count on systems triggering
an invalid trap if they encounter a
signalling NaN because loads are not
required to be typed.
-- We can't count on knowing when we are
touching a NaN because copies (& negates)
are not required to go through the ALU.
-- Even if we only count on trapping on
arithmetic operations, some (like negate)
are optimized out.
-- We can't fill memory with all 1's
because that is a quiet NaN.
-- The kernel people won't fill it with
any more complex pattern because it is
noticeably slower to do so.
-- Even if they would, we could only
catch invalid references to a NaN of a
known type & alignment. All others
would slip through as some other number.
When all was said & done, the remaining diagnostic
value of what could be done if you met all these
limitations was considered to be of far less value
than the limitations themselves.
So we had to give up on it.
Still, some enterprising compiler writer or debugger
writer out there COULD do something along these lines.
It wouldn't buy them much but it would be interesting.
So, I ask again: Please name anyone who is doing this.
I'd really like to talk to them about it.
So would Prof Kahan.
<End of long sad story>
Dan