
Re: (long) sNaNs not what they could be...



Dan,

Thanks for sharing the story.  It is informative.

George
On Oct 15, 2010, at 5:01 PM, Dan Zuras Intervals wrote:

> 	For anyone not interested in this topic, it
> 	will be a long diatribe on the inadequacies
> 	of NaNs as a diagnostic tool in 754.
> 
> 	I apologise for posting it in this forum.
> 
> 	Feel free to ignore & delete.
> 
> 	For the rest (if any) ...
> 
>> Date: 15 Oct 2010 21:35:26 +0100
>> From: "N.M. Maclaren" <nmm1@xxxxxxxxx>
>> To: Dan Zuras Intervals <intervals08@xxxxxxxxxxxxxx>
>> Cc: Lee Winter <lee.j.i.winter@xxxxxxxxx>, stds-1788@xxxxxxxxxxxxxxxxx
>> Subject: Re: Fw: Useless sNaNs... or useful?
>> 
>> On Oct 15 2010, Dan Zuras Intervals wrote:
>>>>>> 
>>>>>> One important use is to initialize all floating point variables to
>>>>>> signaling NaNs. If they are inadvertently not properly initialized
>>> 
>>> 	What you & Ian probably don't know is that
>>> 	I, along with Prof Kahan & Bob, advocated
>>> 	for consistent behaviors for signalling NaNs
>>> 	that would permit exactly this application.
>> 
>> Good.  That would restore functionality that some compilers had
>> in 1970, though signalling NaNs is only one of many ways to achieve
>> the objective.  It does, however, have the advantage that it has no
>> performance impact on code without such errors.
>> 
>>> 	As it happened there were sound technical
>>> 	reasons for features that made it infeasible.
>>> 	All of which were thought to be more important
>>> 	than this application.  All of which I can
>>> 	describe for you in detail but hesitate to do
>>> 	so in this forum.
>> 
>> I am unaware of any, though I am aware of a great
>> many UNSOUND technical reasons :-(
>> 
>> Please could you describe the sound technical reasons?
>> If you are reluctant to do so in a semi-public forum,
>> Email will do.
>> 
>> 
>> regards,
>> Nick Maclaren.
> 
> 	I will outline it.
> 
> 	But even that outline will take some time to
> 	explain.
> 
> 	What we were after was a 'touch it & die NaN':
> 	something for which even a bare load would
> 	cause an invalid trap, presumably into a
> 	debugger or to signal the use of an
> 	uninitialized variable.
> 
> 	The method would be to fill memory with this
> 	fatal signalling NaN so that any read access
> 	would explode the mine.  If your first use of
> 	memory was to write to it, you were safe.  But
> 	if you read from it you would die the death of
> 	the uninitialized memory signalling NaN.
> 
> 	Sounds simple enough.
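> 	A sketch of the scheme (the 0x7FA00000
> 	poison word here is an assumed encoding,
> 	with the quiet bit taken to be the MSB of
> 	the significand):

```python
import math
import struct

# Hypothetical "poison" word: exponent all ones, quiet bit clear,
# one payload bit set -- a signalling NaN under the common encoding.
SNAN_BITS = 0x7FA00000

# "Fill memory" with the pattern, then read part of it as binary32.
raw = struct.pack("<I", SNAN_BITS) * 4
(word,) = struct.unpack_from("<f", raw, 0)

print(math.isnan(word))   # True: the mine is armed for float reads
```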
> 
> 	Why was that hard to do?
> 
> 	Well, on most systems the load instruction is
> 	not typed.  It has a width but not a type.
> 	Thus, if I am reading, say, a 32-bit quantity
> 	out of memory I generally don't know whether
> 	it is an integer, 4 characters, a single
> 	precision floating-point number, or part of
> 	a larger structure (either a larger floating-
> 	point number or some non-floating-point
> 	structure).
> 
> 	So, die on load was not really feasible.
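> 	In code: the same four poisoned bytes
> 	(pattern hypothetical), read back three
> 	ways, & only one of the three views is
> 	even a NaN:

```python
import math
import struct

raw = struct.pack("<I", 0x7FA00000)      # hypothetical poison word

# A load has a width but no type; none of these can trap as such:
as_int   = struct.unpack("<I", raw)[0]   # an ordinary integer
as_bytes = struct.unpack("4s", raw)[0]   # four ordinary bytes
as_float = struct.unpack("<f", raw)[0]   # only this view is a NaN

print(hex(as_int), as_bytes, math.isnan(as_float))
```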
> 
> 	No matter, it would be sufficient to die on
> 	first floating-point touch.
> 
> 	Which would be fine if all floating-point
> 	touches went through the floating-point ALU.
> 	Alas, on many systems (Intel included) there
> 	are 3 floating-point operations that do not.
> 	They are copy, negate, & absolute value.
> 	They can avoid the ALU because they are just
> 	bit copies with a possible modification of
> 	the sign bit.  And they are singled out in
> 	the standard for this reason.
> 
> 	Of these the most important is copy.  It is
> 	used on assignment.
> 
> 	Or not.  You see, modern optimizers are such
> 	that copies are generally eliminated except
> 	in rare cases.  So, even if we were to 'arm'
> 	copies to trigger invalid we can't count on
> 	them actually being there.
> 
> 	It is a bit more complex, but much the same
> 	can be said for negate.  Unary negates are
> 	mostly eliminated by manipulation of prior
> 	or subsequent add-like operations (changing
> 	add to subtract or one kind of FMA to
> 	another).
> 
> 	Absolute value is generally safe but also
> 	not often used.
> 
> 	So these operations provide a hole through
> 	which an uninitialized value can slip
> 	unnoticed.
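> 	The hole is easy to see at the bit level;
> 	negate & absolute value are pure sign-bit
> 	manipulations (poison pattern assumed as
> 	before):

```python
SNAN_BITS = 0x7FA00000              # hypothetical binary32 sNaN
SIGN_BIT  = 0x80000000

# Copy, negate, & absolute value never touch the exponent or
# significand, so the signalling NaN slips through unexamined:
copied  = SNAN_BITS                 # copy: bits unchanged
negated = SNAN_BITS ^ SIGN_BIT      # negate: flip the sign bit
absval  = SNAN_BITS & ~SIGN_BIT     # abs: clear the sign bit

assert copied == absval == SNAN_BITS
assert negated == SNAN_BITS | SIGN_BIT
```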
> 
> 	No matter, we'll get them on the first
> 	arithmetic operation.
> 
> 	But wait a minute, just what was the value
> 	of that uninitialized NaN?
> 
> 	The one we were seeking was the all 1's NaN.
> 	The reason for this was that, for most
> 	computers, it is easy to fill memory with
> 	all zeros or all ones.  Or even all copies
> 	of some particular byte.  But filling memory
> 	with all values of anything more complex
> 	than that involves copying from a register
> 	or one place in memory to another.  And that
> 	is a much slower operation.
> 
> 	So, we wanted all 1's.  That would make it
> 	easy & fast.
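> 	The cheap fill is just a repeated byte, a
> 	memset; & the all 1's word decodes as a
> 	NaN with the (conventional) quiet bit set:

```python
import math
import struct

raw = bytes([0xFF]) * 4                 # cheap, memset-style fill
bits = struct.unpack("<I", raw)[0]      # 0xFFFFFFFF

QUIET_BIT = 0x00400000                  # MSB of the binary32 significand
assert bits & QUIET_BIT                 # the quiet bit is already set...
assert math.isnan(struct.unpack("<f", raw)[0])  # ...so it never signals
```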
> 
> 	But, as it happened, we recommended (in the
> 	sense of 'should') that the all 1's NaN be
> 	a quiet NaN.  The reason for this is that
> 	the most common thing one does with a
> 	signalling NaN is to quiet it.  If we had
> 	to do that by turning an "I'm a signalling
> 	NaN" bit from a 1 to a zero there was a
> 	danger of turning a NaN (with only that
> 	bit set) into an infinity.
> 
> 	The technical term for this was "It would
> 	be bad".
> 
> 	So the (strong) recommendation was that the
> 	bit that distinguished signalling NaNs from
> 	quiet ones take on the values 0 for signalling
> 	& 1 for quiet.  That way there would always
> 	be a quiet NaN to 'land on' when one quiets
> 	some valid signalling NaN.
> 
> 	So the all 1's NaN would not do.  It had to
> 	be something else.  It had to be something
> 	that had ones in some places (where the
> 	exponent was) & at least one zero elsewhere
> 	(the signalling bit).
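> 	The hazard under the rejected convention
> 	(marker bit set = signalling), in a few
> 	lines:

```python
import math
import struct

def as_f32(bits):
    """Reinterpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

MARKER = 0x00400000                 # MSB of the binary32 significand

# Under "1 means signalling", an sNaN whose only significand bit
# is the marker itself:
snan = 0x7F800000 | MARKER
assert math.isnan(as_f32(snan))

# Quieting it by clearing the marker leaves an all-zero
# significand, i.e. infinity.  "It would be bad."
assert as_f32(snan & ~MARKER) == math.inf
```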
> 
> 	But, single precision floating-point numbers
> 	have 8-bit exponents.  Doubles have 11.  And
> 	quads have 15.  In each case the position of
> 	the signalling bit is (recommended to be) 2
> 	bits to the right of the rightmost exponent
> 	bit.  Counting the sign bit (just for byte
> 	alignment) that means that 11, 14, or 18 bits
> 	matter.  The rest don't.
> 
> 	But that means that we have to fill memory
> 	with a value that presumes we know which type
> 	will be incorrectly referenced there.
> 
> 	How can we know that?
> 
> 	Further, some systems align 16, 32, & 64 bit
> 	memory references on 16, 32, or 64 bit aligned
> 	memory locations.  Some don't.
> 
> 	So not only do we have to know which type will
> 	be incorrectly referenced, we have to know its
> 	memory alignment.
> 
> 	If we get either one wrong, the bit pattern
> 	will just look like some otherwise innocent
> 	floating-point number.
> 
> 	As the reference is presumed to be incorrect in
> 	the first place, how can we know how or why it
> 	is incorrect?
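> 	For instance, the same poisoned buffer read
> 	2 bytes off its intended alignment decodes
> 	as a harmless (tiny) finite number:

```python
import math
import struct

raw = struct.pack("<I", 0x7FA00000) * 4   # hypothetical poison fill

aligned = struct.unpack_from("<f", raw, 0)[0]
assert math.isnan(aligned)                # the intended read: a NaN

misread = struct.unpack_from("<f", raw, 2)[0]
assert not math.isnan(misread)            # 2 bytes off: an innocent
assert math.isfinite(misread)             # subnormal slips through
```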
> 
> 
> 	Let's see,  I may have missed something but I
> 	think that's most of it.
> 
> 	Some of them may not apply to your computer.
> 
> 	But I guarantee some of them do.
> 
> 	So...
> 
> 		-- We can't count on systems triggering
> 		an invalid trap if they encounter a
> 		signalling NaN because loads are not
> 		required to be typed.
> 
> 		-- We can't count on knowing when we are
> 		touching a NaN because copies (& negates)
> 		are not required to go through the ALU.
> 
> 		-- Even if we only count on trapping on
> 		arithmetic operations, some (like negate)
> 		are optimized out.
> 
> 		-- We can't fill memory with all 1's
> 		because that is a quiet NaN.
> 
> 		-- The kernel people won't fill it with
> 		any more complex pattern because it is
> 		noticeably slower to do so.
> 
> 		-- Even if they would, we could only
> 		catch invalid references to a NaN of a
> 		known type & alignment.  All others
> 		would slip through as some other number.
> 
> 	When all was said & done, the remaining diagnostic
> 	value of what could be done if you met all these
> 	limitations was considered to be of far less value
> 	than the limitations themselves.
> 
> 	So we had to give up on it.
> 
> 	Still, some enterprising compiler writer or debugger
> 	writer out there COULD do something along these lines.
> 	It wouldn't buy them much but it would be interesting.
> 
> 	So, I ask again: Please name anyone who is doing this.
> 
> 	I'd really like to talk to them about it.
> 
> 	So would Prof Kahan.
> 
> 	<End of long sad story>
> 
> 
> 				Dan

Dr. George F. Corliss
Electrical and Computer Engineering
Marquette University
P.O. Box 1881
1515 W. Wisconsin Ave
Milwaukee WI 53201-1881 USA
414-288-6599; GasDay: 288-4400; Fax 288-5579
George.Corliss@xxxxxxxxxxxxx
www.eng.mu.edu/corlissg