[
Date Prev][
Date Next][
Thread Prev][
Thread Next][
Date Index][
Thread Index]
Annex for Optional Floating-Point Formats...
To all & sundry,
Much of the subcommittee's time in the last month as been
devoted to the subject of other floating-point formats.
History has shown us that implementers of highly specialized
applications often choose their own floating-point formats in
a real or perceived need for something more befitting their
application.
History has also shown us that they often choose unwisely.
Below, you will find the proposed text for an optional annex
(numbered 'Z', until we know if we want it & where we are
going to put it) that introduces the notion of optional binary
floating-point formats.
If you can make the time, please read it before the meeting
on Thursday. And, when you read it, I would like you to keep
several things in mind.
* The entire annex is OPTIONAL. Therefore, this does
not constrain anyone to do anything.
* The formats described can be implemented in any way
desired with any range & precision desired through
a rather large variety of choices.
* The only 'shalls' in the entire annex concern the
arithmetic. That is, if you choose to implement one
or more of these formats, you SHALL implement the
arithmetic as in the rest of 754.
* There is a section devoted to how to choose a format
that will meet your needs. And, how NOT to screw
yourself over with an unwise choice.
* Near the end, there is a SUGGESTED tower of formats
(somewhat improved since you last saw it) that meets
a general class of good figures of merit for floating
point datatypes.
* Finally, it concludes with a description of the sorts
of problems that are well addressed by these formats.
The Reader's Digest version of all this is that this annex
contains a SUGGESTED tower of formats in an OPTIONAL annex that
otherwise ONLY constrains the implementer to provide good
arithmetic. If you focus your attention on the tower you will
miss the rest of the annex.
Everyone (David Hough, Prof Kahan, Jeff Kidder, & I) has
contributed useful ideas to this. And, this is the 6th version
of this annex.
But the writing is mine.
So yell at me.
But not till Thursday. I'm way overbooked this week.
Enjoy,
Dan Zuras
-------------------------------------------------------------------
08/17/04 (1)
ANNEX Z - Optional Binary Floating-Point Formats
In addition to the formats previously defined, this standard
allows for optional binary formats to be provided at the
descretion of the implementer.
Z.1 OPTIONAL BINARY FORMATS
(optional) (optional)
<--- 64 ---> 1 <- Ebits -> <- Fbits -> <---- z ----->
+------------+------+-----------+-----------+--------------+
| descriptor | sign | exponent | fraction | zero padding |
+------------+------+-----------+-----------+--------------+
<-------------------- L -------------------->
1 <--- 6 ---> <- 8 -> <- 49 ->
+--------+-----------+-------+--------+
Descriptor: | hidden | alignment | Ebits | Fbits |
+--------+-----------+-------+--------+
<--------------- 64 ---------------->
These formats shall be characterized entirely by the 5-tuple
{prepend, hidden, alignment, Ebits, Fbits} where:
prepend = a boolean that is true if the descriptor is
to be prepended and false if not
hidden = a boolean that is true if there is a hidden
bit in the format and false if not
alignment = exponent of the power of two (in bytes) to
which a number in this format is to be aligned
Ebits = number of bits in the exponent
Fbits = number of bits in the fraction
From these can be deduced the structural parameters:
L = (2^(alignment+3))*
Ceiling[(1 + Ebits + Fbits)/2^(alignment+3)]
= width of a number in this format
z = L - (1 + Ebits + Fbits) = the number (possibly zero)
of zeros bits padded to a number when stored
And, the arithmetic parameters:
bias = 2^(Ebits - 1) - 1 = exponent bias
Emax = 2^Ebits - 2
08/17/04 (2)
if (hidden) then
Emin = 1
else
Emin = 0
endif
(PURPLE) This descriptor permits an exponent as large as 255
(PURPLE) bits and a fraction as long as half a Petabit which
(PURPLE) would require 70 TeraBytes to store one number. The
(PURPLE) exponent range of 2^2^255 is so large that it fits
(PURPLE) neatly between two well known numbers: 1 googol =
(PURPLE) 10^100 << (10^100)^38 < 2^2^255 ~ 10^(2*10^76) <<
(PURPLE) (2^2^255)^(5*10^23) < 10^10^100 = 1 googolplex. As
(PURPLE) the total number of bits available in the observable
(PURPLE) universe must be substantially smaller than 10^254,
(PURPLE) that should be sufficient for the time being. I feel
(PURPLE) quite comfortable with the range & precision available
(PURPLE) here & leaving anyone who needs anything beyond that
(PURPLE) to their own devices.
The interpretation of numbers in this format is:
1) if exponent == Emax + 1 && fraction == 0 then
value = (-1)^sign*Infinity
2) if exponent == Emax + 1 && fraction != 0 then
value = NaN
3) if Emin <= exponent && exponent <= Emax then
if (hidden) then
value = (-1)^sign*2^(exponent - bias)*
(1 + fraction/2^Fbits)
else
value = (-1)^sign*2^(exponent - bias)*
(fraction/2^Fbits)
endif
4) if exponent == 0 then
if (hidden) then
value = (-1)^sign*2^(Emin - bias)*
(fraction/2^Fbits)
else
value = (-1)^sign*2^(exponent - bias)*
(fraction/2^Fbits)
endif
Note that the standard formats are characterized by the
following descriptors:
Format | prepend hidden alignment Ebits Fbits
-------|--------+------+---------+-----+-----
Single | false true 2 8 23
Double | false true 3 11 52
Quad | false true 4 15 112
08/17/04 (3)
Also, the extended formats may be described similarly with a
descriptor that varies from implementation to implementation.
It shall be possible for the user to obtain a variable in any
format of this kind by the declaration (for declarative
languages) or promotion (for non-declarative languages)
binaryFP(prepend, hidden, alignment, Ebits, Fbits).
While manipulating numbers in these formats, an implementation
may choose to represent them in some other format. For example,
an implementation may choose to
* separate the sign, exponent, and/or fraction fields
along word or cache line boundaries or to fit them
into specialized registers,
* represent the fraction as a collection of residues
modulo a suitably chosen collection of moduli, or
* represent the fraction as a collection of suitably
scaled floating-point numbers in a smaller format.
When reading or writing numbers in these formats or transmitting
or receiving them, an implementation may choose to
* represent the descriptor in ASCII,
* suitably truncate any padding on output, or
* suitably increase padding on input.
In all cases where such manipulations are chosen, operations
shall behave as if in the specified format including respecting
over/underflow thresholds and specified rounding precision.
Also, in all cases chosen, an implementation shall arrange that
such numbers always appear to the user as if they were in the
specified format. This shall include such things as
* printing of a number in hexadecimal,
* examining any part of a number as if with a Fortran
equivalence or C union, and
* changing any part of a number as if with a Fortran
equivalence or C union
but exclude such things as debuggers, crash dump analyzers, or
other tools designed to examine the implementation environment
directly.
08/17/04 (4)
Z.2 ENVIRONMENTAL INQUIRIES
In addition, the implementer may wish to support a proper subset
of these formats in hardware. Therefore, if the user merely
wants access to a format that is AT LEAST as good as the format
specified, the environmental inquiry subsumptiveFP(prepend,
hidden, alignment, Ebits, Fbits) shall return the descriptor of
the least subsumptive format that is well supported in this
implementation. In this case, binaryFP(subsumptiveFP(prepend,
hidden, alignment, Ebits, Fbits)) is what the user wants.
Within any given format F = {prepend, hidden, alignment, Ebits,
Fbits}, all operations shall be supported as defined in this
standard. In particular, nextafter (nextup?) shall be
supported. Also, all conversions between F and all other
supported formats, integers, and all other available optional
formats shall be supported. Conversions between F and decimal
strings shall be exactly rounded and as defined in the standard.
For any format that has a non-empty padding field, zeros shall
be placed there on stores and any non-zero bits found on loads
shall be ignored.
Z.3 CHOOSING A BALANCED FORMAT
In deciding which format to choose to solve a given problem, the
user must seek some balance in its description.
Note that any format F = {prepend, hidden, alignment, Ebits,
Fbits}, for which the exponent field is too large relative to
its fraction risks that some algorithms will be slow to converge
for the accuracy of their results especially when the result is
zero. Similarly, any format for which the exponent field is too
small risks that some algorithms may terminate early returning
less accurate results than expected. To prevent this problem it
should be possible to raise ULP(1.0) to a modest power (say
>=10) without underflow.
A format with an unusual number of total bits may incur
additional load/store expense on modern wide memory busses.
Therefore, it is often desireable to pad a format out to some
number of bytes of the form a*2^k where 'a' is a small odd
number like 1, 3, or 5, so that loads and stores may be aligned
to hardware memory constructs such as cache lines. The desired
alignment in this case is 'k'.
08/17/04 (5)
To solve some problems accurately in a given format F, it is
often desireable to perform some calculations in a larger format
known as the extended of F or extended(F). To perform these
calculations, it is sufficient to have at least 1 more exponent
bit in extended(F) than in F and 2 or more is desireable. In
addition, it is sufficient to have the fraction bits in
extended(F) be nearly Ebits(F) + Fbits(F) and it is desirable to
have it exceed (6/5)*Fbits(F). One extra exponent bit permits
an FMA to be evaluated without risk of over/underflow and 2
extra bits permits the evaluation of a 4th order polynomial of
modest coefficients. Fbits(extended(F)) >= Ebits(F) + Fbits(F)
permits accurate evaluation of x^y in cases where it doesn't
over/underflow when returned to the precision and range of F.
It is also often desirable to perform calculations in a format
at least twice as large as a given format F. This format is
known as the double of F or double(F). In double(F), it is
sufficient to have at least 2 more exponent bits than in F and 3
is desireable. In addition, it is sufficient to have at least
twice as many fraction bits as in F and 5 or 6 more bits is
desireable. Three extra exponent bits permits the accurate
evaluation of an 8th order polynomial of modest coefficients
without risk of over/underflow. Also, Fbits(double(F)) >=
2*Fbits(F) permits complex multiply to be correctly rounded to
the desired precision when evaluated in the obvious way.
In addition to the standard operations supported, an
implementation that supports both an F and a double(F) shall
support the operations
* dprod(x,y) which takes two operands in the F format
and returns the exact product in the double(F)
format, and
* sadd(x,y) which takes two operands in the double(F)
format and returns the sum rounded once to the F
format.
* ssubtract(x,y) which takes two operands in the
double(F) format and returns the difference rounded
once to the F format.
Implementations that choose to support formats that prepend
descriptors to operands shall provide an additional operand for
each supported operation that specifies the descriptor of the
desired result. Thus, expressions that are evaluated in an
increased range or precision and return a final result reduced
to the desired precision with only a single round may be easily
implemented.
08/17/04 (6)
If an implementation supports prepending of the descriptor to
optional formats, it shall support mixed operations in the
following manner
* all monadic operations op(x) as op(x, d) where 'd' is
the descriptor of the desired result,
* all dyadic operations op(x, y) as op(x, y, d) where
'd' is the descriptor of the desired result, and
* FMA(x, y, z) as FMA(x, y, z, d) where 'd' is the
descriptor of the desired result.
For x1, y1 in F and x2, y2 in double(F) we have that
dprod(x1,y1) = multiply(x1,y1,descriptor(double(F))),
sadd(x2,y2) = add(x2,y2,descriptor(F)), and
ssubtract(x2,y2) = subtract(x2,y2,descriptor(F)).
So, for complex multiply we have for a + i*b, c + i*d in complex
FxF we have:
(a + i*b)*(c+i*d) =
ssubtract(dprod(a, c), dprod(b, d))
+ i*sadd(dprod(b, c), dprod(a, d)).
Some good figures of merit for a collection of formats are:
(2^Ebits)/Fbits >= 10
Ebits(extended(F)) >= Ebits(F) + 2
Ebits(double(F)) >= Ebits(F) + 3
Fbits(extended(F)) >= (6/5)*Fbits(F) + 1
Fbits(double(F)) >= 2*Fbits(F) + 4
08/17/04 (7)
Z.4 THE STANDARD TOWER OF FORMATS
In keeping with these principles, the following subset of all
possible F shall be considered standard optional formats.
Format | hidden alignment Ebits Fbits L(bytes)
-------|+------+---------+-----+----------------+----------
F0 | true 2 8 23 4
F1 | true 2(*) 9 30 5
F2 | true 2(*) 10 37 6
F3 | true 3 11 52 8
F4 | true 3(*) 13 66 10
F5 | true 3(*) 14 81 12
F6 | true 4 15 112 16
F7 | true 4(*) 16 143 20
F8 | true 4(*) 17 174 24
F9 | true 5 18 237 32
F10 | true 5(*) 19 300 40
F11 | true 5(*) 20 363 48
F12 | true 6 21 490 64
... in general for F4 & beyond for k >= 0 ...
F(3k+4)| true k+3(*) 3k+13 5*2^(k+4)-(3k+14) 5*2^(k+1)
F(3k+5)| true k+3(*) 3k+14 3*2^(k+5)-(3k+15) 3*2^(k+2)
F(3k+6)| true k+4 3k+15 2^(k+7)-(3k+16) 2^(k+4)
(*) For these formats, an implementer may choose to
increase or decrease alignment to suit.
Note that the F0 format is standard single, the F3 format is
standard double, and F6 is standard quad.
In general, for each format, F(k), on this list, that format's
standard extended shall be F(k+2). Also, for each format, F(k),
that format's standard double shall be F(k+3).
Z.5 EFFICIENT USE OF PRECISION
In order to return a result accurate to the precision of F,
some algorithms must repeat the same calculations in an ever
increasing series of precisions until an unusually restrictive
error bound is met. To minimize the total amount of work
required, it is desireable to perform those calculations in a
series of formats which increase in precision by approximately
Sqrt[2] ~ 1.4 with each step. Either of the subsets {F3, F4,
F6, F7, F9, F10, F12, ...} or {F3, F5, F6, F8, F9, F11, F12,
...} approximate this series.
08/17/04 (8)
Other algorithms must perform a similar series of calculations
of increasing precision until an approximation of the precision
needed may be accurately estimated after which it may jump
directly to the final precision for one more calculation to
return the desired result. In such cases, the final format
chosen should not be too much larger than that desired so as to
minimize unneeded overhead.
In addition, all known efficient algorithms for implementing
large arithmetic operations (such as multiply) do not increase
smoothly in run time as a function of the precision but instead
exhibit large gaps in performance.
For both these reasons, there should be several available
precisions for each power of two in size to minimize both the
actual cost of a given calculation and the chance that one will
incur the cost of a large performance gap.
Providing three precisions for each power of 2, {4, 5, and
6}*2^k, should be sufficient to minimize these costs without too
much implementation complexity.
(PURPLE) With only two precisions available per binade, the
(PURPLE) worst case performance gap can occasionally be as large
(PURPLE) as (Sqrt[2])^2 or 2. With three precisions available,
(PURPLE) the worst case performance gap can be 2^(2/3) < 1.6 and
(PURPLE) this should be very rare.