
(long) sNaNs not what they could be...



	For anyone not interested in this topic, it
	will be a long diatribe on the inadequacies
	of NaNs as a diagnostic tool in 754.

	I apologise for posting it in this forum.

	Feel free to ignore & delete.

	For the rest (if any) ...

> Date: 15 Oct 2010 21:35:26 +0100
> From: "N.M. Maclaren" <nmm1@xxxxxxxxx>
> To: Dan Zuras Intervals <intervals08@xxxxxxxxxxxxxx>
> Cc: Lee Winter <lee.j.i.winter@xxxxxxxxx>, stds-1788@xxxxxxxxxxxxxxxxx
> Subject: Re: Fw: Useless sNaNs... or useful?
> 
> On Oct 15 2010, Dan Zuras Intervals wrote:
> >> >>
> >> >> One important use is to initialize all floating point variables to
> >> >> signaling NaNs. If they are inadvertently not properly initialized
> >
> >	What you & Ian probably don't know is that
> >	I, along with Prof Kahan & Bob, advocated
> >	for consistent behaviors for signalling NaNs
> >	that would permit exactly this application.
> 
> Good.  That would restore functionality that some compilers had
> in 1970, though signalling NaNs is only one of many ways to achieve
> the objective.  It does, however, have the advantage that it has no
> performance impact on code without such errors.
> 
> >	As it happened there were sound technical
> >	reasons for features that made it infeasible.
> >	All of which were thought to be more important
> >	than this application.  All of which I can
> >	describe for you in detail but hesitate to do
> >	so in this forum.
> 
> I am unaware of any, though I am aware of a great
> many UNSOUND technical reasons :-(
> 
> Please could you describe the sound technical reasons?
> If you are reluctant to do so in a semi-public forum,
> Email will do.
> 
> 
> regards,
> Nick Maclaren.

	I will outline it.

	But even that outline will take some time to
	explain.

	What we were after was a 'touch it & die NaN'.
	Something for which even a dereference would
	cause an invalid trap.  Presumably trapping to
	a debugger, or to signal the use of an
	uninitialized variable.

	The method would be to fill memory with this
	fatal signalling NaN so that any read access
	would explode the mine.  If your first use of
	memory was to write to it, you were safe.  But
	if you read from it you would die the death of
	the uninitialized memory signalling NaN.

	Sounds simple enough.
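	In C, the intended scheme might look something
	like the sketch below.  It is only an
	illustration: it tests the invalid flag rather
	than actually trapping, the pattern 0x7FA00000
	is just one possible single-precision
	signalling NaN encoding, & whether the flag is
	even raised depends on the compiler & hardware.
	Which, as it turns out, is the whole problem.

	#include <fenv.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
	    /* Fill a pool of "uninitialized" floats with a
	       32-bit signalling NaN bit pattern. */
	    uint32_t snan_bits = 0x7FA00000u;
	    float pool[16];
	    for (int i = 0; i < 16; i++)
	        memcpy(&pool[i], &snan_bits, sizeof snan_bits);

	    feclearexcept(FE_ALL_EXCEPT);
	    float x = pool[3] + 1.0f;       /* first arithmetic touch */
	    printf("invalid raised: %d, x = %g\n",
	           fetestexcept(FE_INVALID) != 0, (double)x);
	    return 0;
	}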

	Why was that hard to do?

	Well, on most systems the load instruction is
	not typed.  It has a width but not a type.
	Thus, if I am reading, say, a 32-bit quantity
	out of memory, I generally don't know whether
	it is an integer, 4 characters, a single
	precision floating-point number, or part of
	a larger structure (either a larger floating-
	point number or some non-floating-point
	structure).

	So, die on load was not really feasible.
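	To make that concrete: the same 32 bits that
	spell a signalling NaN to the floating-point
	unit are a perfectly ordinary integer, or four
	ordinary bytes, to any other load.  A small
	illustration (the pattern is again the assumed
	0x7FA00000):

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
	    uint32_t bits = 0x7FA00000u;    /* one sNaN encoding */
	    float f;
	    memcpy(&f, &bits, sizeof f);    /* a plain 32-bit copy */

	    /* Read as an integer the pattern is just a number;
	       nothing about the load itself can trap. */
	    printf("as uint32: 0x%08X\n", bits);
	    printf("as float : %g\n", (double)f);   /* a NaN, but only
	                                               the FPU knows */
	    return 0;
	}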

	No matter, it would be sufficient to die on
	first floating-point touch.

	Which would be fine if all floating-point
	touches went through the floating-point ALU.
	Alas, on many systems (Intel included) there
	are 3 floating-point operations that do not.
	They are copy, negate, & absolute value.
	They can avoid the ALU because they are just
	bit copies with a possible modification of
	the sign bit.  And they are singled out in
	the standard for this reason.
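	A hedged sketch of the hole this opens (actual
	flag behavior varies with compiler, optimization
	level, & hardware, which is exactly the
	problem): negating or copying a signalling NaN
	need not raise invalid, while an ordinary add
	must.

	#include <fenv.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
	    uint32_t snan_bits = 0x7FA00000u;   /* assumed sNaN pattern */
	    float x;
	    memcpy(&x, &snan_bits, sizeof x);

	    feclearexcept(FE_ALL_EXCEPT);
	    float y = -x;                   /* negate: just a sign-bit flip */
	    int neg_raised = fetestexcept(FE_INVALID) != 0;

	    feclearexcept(FE_ALL_EXCEPT);
	    float z = x + 0.0f;             /* arithmetic: must signal */
	    int add_raised = fetestexcept(FE_INVALID) != 0;

	    printf("negate raised invalid: %d (y = %g)\n", neg_raised, (double)y);
	    printf("add    raised invalid: %d (z = %g)\n", add_raised, (double)z);
	    return 0;
	}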

	Of these the most important is copy.  It is
	used on assignment.

	Or not.  You see, modern optimizers are such
	that copies are generally eliminated except
	in rare cases.  So, even if we were to 'arm'
	copies to trigger invalid we can't count on
	them actually being there.

	It is a bit more complex, but much the same
	can be said for negate.  Unary negates are
	mostly eliminated by manipulation of prior
	or subsequent add-like operations (changing
	add to subtract or one kind of FMA to
	another).

	Absolute value is generally safe but also
	not often used.

	So these operations provide a hole through
	which an uninitialized value can slip
	unnoticed.

	No matter, we'll get them on the first
	arithmetic operation.

	But wait a minute, just what was the value
	of that uninitialized NaN?

	The one we were seeking was the all 1's NaN.
	The reason for this was that, for most
	computers, it is easy to fill memory with
	all zeros or all ones.  Or even all copies
	of some particular byte.  But filling memory
	with any pattern more complex than that
	involves copying from a register or from one
	place in memory to another.  And that is a
	much slower operation.

	So, we wanted all 1's.  That would make it
	easy & fast.
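	(In C that cheap fill is just a memset; a tiny
	sketch, sizes & names purely illustrative.  As
	the next paragraph explains, the resulting all
	1's pattern turns out to be a quiet NaN.)

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
	    /* The cheap, fast fill: every byte 0xFF, so every
	       word read back is the all 1's NaN. */
	    float pool[1024];
	    memset(pool, 0xFF, sizeof pool);
	    printf("pool[0] = %g\n", (double)pool[0]);  /* a NaN */
	    return 0;
	}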

	But, as it happened, we recommended (in the
	sense of 'should') that the all 1's NaN be
	a quiet NaN.  The reason for this is that
	the most common thing one does with a
	signalling NaN is to quiet it.  If we had
	to do that by turning an "I'm a signalling
	NaN" bit from a 1 to a 0, there was a
	danger of turning a NaN (with only that
	bit set) into an infinity.

	The technical term for this was "It would
	be bad".

	So the (strong) recommendation was that the
	bit that distinguished signalling NaNs from
	quiet ones take on the values 0 for signalling
	& 1 for quiet.  That way there would always
	be a quiet NaN to 'land on' when one quiets
	some valid signalling NaN.
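	To see the hazard in bits: suppose, contrary to
	the recommendation, signalling NaNs were marked
	by that bit being a 1.  Then the minimal such
	NaN, quieted by clearing the bit, would land on
	an infinity.  A sketch (single precision,
	hypothetical inverted convention):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
	    /* Hypothetical inverted convention: signalling marked
	       by bit 22 = 1, with no other significand bits set. */
	    uint32_t snan_inverted = 0x7FC00000u;
	    uint32_t quieted = snan_inverted & ~(1u << 22);
	    printf("0x%08X quieted -> 0x%08X\n", snan_inverted, quieted);
	    /* 0x7F800000 is +infinity, not a NaN at all.  With the
	       recommended convention (quiet bit = 1), quieting SETS
	       the bit, so the result is always still a NaN. */
	    return 0;
	}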

	So the all 1's NaN would not do.  It had to
	be something else.  It had to be something
	that had ones in some places (where the
	exponent was) & at least one zero elsewhere
	(the signalling bit).

	But, single precision floating-point numbers
	have 8-bit exponents.  Doubles have 11.  And
	quads have 15.  In each case the bit that
	marks a NaN as quiet or signalling is
	(recommended to be) the first bit to the right
	of the rightmost exponent bit, & a signalling
	NaN must also carry at least one other nonzero
	significand bit.  Counting the sign bit (just
	for byte alignment) that means that 11, 14, or
	18 bits matter.  The rest don't.

	But that means that we have to fill memory
	with a value that presumes we know which type
	will be incorrectly referenced there.

	How can we know that?

	Further, some systems align 16-, 32-, & 64-bit
	memory references on 16-, 32-, or 64-bit
	aligned memory locations.  Some don't.

	So not only do we have to know which type will
	be incorrectly referenced, we have to know its
	memory alignment.

	If we get either one wrong, the bit pattern
	will just look like some otherwise innocent
	floating-point number.
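	As a hedged illustration of how quietly that
	failure happens, fill eight bytes with the
	repeated single-precision signalling NaN
	pattern assumed above & then read them back,
	incorrectly, as a double:

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
	    uint32_t snan32 = 0x7FA00000u;  /* single-precision sNaN */
	    unsigned char buf[8];
	    memcpy(buf,     &snan32, 4);
	    memcpy(buf + 4, &snan32, 4);

	    double d;
	    memcpy(&d, buf, sizeof d);      /* wrong type, same bytes */
	    printf("as double: %g\n", d);   /* a huge but finite number,
	                                       not a NaN: the mine is
	                                       quietly defused */
	    return 0;
	}

	A misaligned read goes wrong the same way: the
	exponent field lands on payload bits & the
	pattern usually decodes as some perfectly
	ordinary number.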

	As the reference is presumed to be incorrect in
	the first place, how can we know how or why it
	is incorrect?


	Let's see,  I may have missed something but I
	think that's most of it.

	Some of them may not apply to your computer.

	But I guarantee some of them do.

	So...

		-- We can't count on systems triggering
		an invalid trap if they encounter a
		signalling NaN because loads are not
		required to be typed.

		-- We can't count on knowing when we are
		touching a NaN because copies (& negates)
		are not required to go through the ALU.

		-- Even if we only count on trapping on
		arithmetic operations, some (like negate)
		are optimized out.

		-- We can't fill memory with all 1's
		because that is a quiet NaN.

		-- The kernel people won't fill it with
		any more complex pattern because it is
		noticeably slower to do so.

		-- Even if they would, we could only
		catch invalid references to a NaN of a
		known type & alignment.  All others
		would slip through as some other number.

	When all was said & done, the remaining
	diagnostic value of what could be done if you
	met all these limitations was considered to be
	far outweighed by the cost of the limitations
	themselves.

	So we had to give up on it.

	Still, some enterprising compiler writer or debugger
	writer out there COULD do something along these lines.
	It wouldn't buy them much but it would be interesting.

	So, I ask again: Please name anyone who is doing this.

	I'd really like to talk to them about it.

	So would Prof Kahan.

	<End of long sad story>


				Dan