
Re: (long) sNaNs not what they could be...



On Oct 15 2010, Dan Zuras Intervals wrote:

	For anyone not interested in this topic, it
	will be a long diatribe on the inadequacies
	of NaNs as a diagnostic tool in 754.

Thank you for explaining.  I will respond, explaining why
I say that those are all unsound technical reasons - though
some of them were once sound.  Please do not take offence,
though I make no apologies for being a heretic!

	What we were after was a 'touch it & die NaN'.
	Something which even dereferencing would cause
	an invalid trap.  Presumably to a debugger or
	to signal the use of an uninitialized variable.

Reasonable in theory, but not in practice.  It's incompatible
with Fortran's association model, which dates from 1966
(arguably even 1958).

	The method would be to fill memory with this
	fatal signalling NaN so that any read access
	would explode the mine.  If your first use of
	memory was to write to it, you were safe.  But
	if you read from it you would die the death of
	the uninitialized memory signalling NaN.

Been there - used that :-)

	Well, on most systems the load instruction is
	not typed.  It has a width but not a type.

Eh?  That is NOT the case.  Only a few systems have worked
that way, though I accept that it applies to Intel's SSE,
which is now dominant.

	So, die on load was not really feasible.

	No matter, it would be sufficient to die on
	first floating-point touch.

Right.  Which is the solution that is required for Fortran,
anyway.

	Which would be fine if all floating-point
	touches went through the floating-point ALU.
	Alas, on many systems (Intel included) there
	are 3 floating-point operations that do not.
	They are copy, negate, & absolute value. ...
	Of these the most important is copy.  It is
	used on assignment.

	Or not.  You see, modern optimizers are such
	that copies are generally eliminated except
	in rare cases.  So, even if we were to 'arm'
	copies to trigger invalid we can't count on
	them actually being there.

	So these operations provide a hole through
	which an uninitialized value can slip
	unnoticed.

Well, yes.  But that's been true for 50 years, and is the
main reason that the IEEE 754 assumption of bitwise
predictability was and is a serious mistake.  Signalling
NaNs add nothing whatsoever to the problem.

	The one we were seeking was the all 1's NaN.
	The reason for this was that, for most
	computers, it is easy to fill memory with
	all zeros or all ones.  Or even all copies
	of some particular byte.  But filling memory
	with all values of anything more complex
	than that involves copying from a register
	or one place in memory to another.  And that
	is a much slower operation.

Eh?  All ones has had no advantage for over 40 years, and a
single-byte pattern none for many decades.  They had gone in
the mainframe world by about 1975 (earlier in the case of the
dominating IBM System/370) and, while microprocessors
reinvented many of the mistakes of the 1950s, they had caught
up by 1990.

	But that means that we have to fill memory
	with a value that presumes we know which type
	will be incorrectly referenced there.

	How can we know that?

Because almost all high-level languages specify it.  Fortran,
Cobol, C++ and almost everything else do.  C malloc doesn't,
nor do some library interfaces, but they are strongly
deprecated for use in programs where reliability is important.

There is also a technical solution to the multiple-format
problem, given that the only IEEE 754 cases were/are 32-, 64-
and perhaps 128-bit and you can assume comparable layout.
All you have to do is ensure that the word you use for
initialisation has all ones in the 128-bit exponent area (the
signalling bit is NOT in that area), and replicate the 32-bit
format across memory.
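Whether a given 32-bit fill word does the job can be checked purely at the bit level.  This C sketch (my own helper names; it assumes the 754-2008 convention that a clear significand MSB marks a signalling NaN - the pre-2008 MIPS/PA-RISC convention inverted that bit, which changes which patterns qualify) classifies what a replicated fill word looks like when read back at 32 or 64 bits:

```c
#include <stdint.h>

/* NaN = exponent all ones, significand nonzero; significand MSB
   clear marks a signalling NaN under the 754-2008 convention. */
enum nan_kind { NOT_NAN, QUIET_NAN, SIGNALLING_NAN };

/* The fill word read back as a binary32. */
static enum nan_kind classify32(uint32_t w)
{
    uint32_t exp = (w >> 23) & 0xFF;
    uint32_t man = w & 0x7FFFFF;
    if (exp != 0xFF || man == 0) return NOT_NAN;
    return (man & 0x400000) ? QUIET_NAN : SIGNALLING_NAN;
}

/* Two adjacent copies of the fill word read back as a binary64. */
static enum nan_kind classify64(uint32_t w)
{
    uint64_t d = ((uint64_t)w << 32) | w;
    uint64_t exp = (d >> 52) & 0x7FF;
    uint64_t man = d & 0xFFFFFFFFFFFFFULL;
    if (exp != 0x7FF || man == 0) return NOT_NAN;
    return (man & 0x8000000000000ULL) ? QUIET_NAN : SIGNALLING_NAN;
}
```

For example, 0x7FF40000 replicated gives a signalling NaN at 64 bits but only a quiet NaN at 32, while 0x7FA00000 is signalling at 32 bits but not even a NaN at 64 - which is exactly why the fill word has to be chosen with care against the encodings actually in use.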

	Further, some systems align 16, 32, & 64 bit
	memory references on 16, 32, or 64 bit aligned
	memory locations.  Some don't.

All languages that I know of do, including many assemblers,
and essentially nobody writes in machine code any longer.
Fortran does (essentially) allow 64-bit access with a 32-bit
alignment, as do many systems, but that's not a problem (see
above).

16-bit systems are essentially dead, except in the smaller
embedded systems, few of which use floating-point (and rarely
use strict IEEE 754 if they do).

	When all was said & done, the remaining diagnostic
	value of what could be done if you met all these
	limitations was considered to be of far less value
	than the limitations themselves.

The reason for that seems to be that you were looking for a
hardware-only solution to a problem caused largely by software.
There is no way that the hardware can second-guess the intent
of the software.

	Still, some enterprising compiler writer or debugger
	writer out there COULD do something along these lines.
	It wouldn't buy them much but it would be interesting.

WATFIV and several other compilers did it, as someone said.
Intel currently have a '-check uninit' option, and so do several
other compilers, though I doubt that they use signalling NaNs
for the purpose.

However, the real reason that it wouldn't buy them much is that
90% of such problems in 'numeric' code occur for integer and
(in C etc.) pointer data, not floating-point.  And, of course,
100% of them in non-numeric code do.


Regards,
Nick Maclaren.