[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Annex for Optional Floating-Point Formats...



        To all & sundry,


        Much of the subcommittee's time in the last month as been
        devoted to the subject of other floating-point formats.

        History has shown us that implementers of highly specialized
        applications often choose their own floating-point formats in
        a real or perceived need for something more befitting their
        application.

        History has also shown us that they often choose unwisely.

        Below, you will find the proposed text for an optional annex
        (numbered 'Z', until we know if we want it & where we are
        going to put it) that introduces the notion of optional binary
        floating-point formats.

        If you can make the time, please read it before the meeting
        on Thursday.  And, when you read it, I would like you to keep
        several things in mind.

                *  The entire annex is OPTIONAL.  Therefore, this does
                   not constrain anyone to do anything.

                *  The formats described can be implemented in any way
                   desired with any range & precision desired through
                   a rather large variety of choices.

                *  The only 'shalls' in the entire annex concern the
                   arithmetic.  That is, if you choose to implement one
                   or more of these formats, you SHALL implement the
                   arithmetic as in the rest of 754.

                *  There is a section devoted to how to choose a format
                   that will meet your needs.  And, how NOT to screw
                   yourself over with an unwise choice.

                *  Near the end, there is a SUGGESTED tower of formats
                   (somewhat improved since you last saw it) that meets
                   a general class of good figures of merit for floating
                   point datatypes.

                *  Finally, it concludes with a description of the sorts
                   of problems that are well addressed by these formats.

        The Reader's Digest version of all this is that this annex
        contains a SUGGESTED tower of formats in an OPTIONAL annex that
        otherwise ONLY constrains the implementer to provide good
        arithmetic.  If you focus your attention on the tower you will
        miss the rest of the annex.

        Everyone (David Hough, Prof Kahan, Jeff Kidder, & I) has
        contributed useful ideas to this.  And, this is the 6th version
        of this annex.

        But the writing is mine.

        So yell at me.

        But not till Thursday.  I'm way overbooked this week.

        Enjoy,

                                Dan Zuras

        -------------------------------------------------------------------


        08/17/04                                                        (1)


                ANNEX Z - Optional Binary Floating-Point Formats


        In addition to the formats previously defined, this standard
        allows for optional binary formats to be provided at the
        descretion of the implementer.


        Z.1  OPTIONAL BINARY FORMATS


          (optional)                                 (optional)
         <--- 64 --->   1    <- Ebits -> <- Fbits -> <---- z ----->
        +------------+------+-----------+-----------+--------------+
        | descriptor | sign | exponent  | fraction  | zero padding |
        +------------+------+-----------+-----------+--------------+
                      <-------------------- L -------------------->

                            1     <--- 6 ---> <- 8 -> <- 49 ->
                        +--------+-----------+-------+--------+
        Descriptor:     | hidden | alignment | Ebits | Fbits  |
                        +--------+-----------+-------+--------+
                         <--------------- 64 ---------------->

        These formats shall be characterized entirely by the 5-tuple
        {prepend, hidden, alignment, Ebits, Fbits} where:

                  prepend = a boolean that is true if the descriptor is
                        to be prepended and false if not

                   hidden = a boolean that is true if there is a hidden
                        bit in the format and false if not

                alignment = exponent of the power of two (in bytes) to
                        which a number in this format is to be aligned

                    Ebits = number of bits in the exponent

                    Fbits = number of bits in the fraction

        From these can be deduced the structural parameters:

                L = (2^(alignment+3))*
                        Ceiling[(1 + Ebits + Fbits)/2^(alignment+3)]
                        = width of a number in this format

                z = L - (1 + Ebits + Fbits) = the number (possibly zero)
                        of zeros bits padded to a number when stored

        And, the arithmetic parameters:

                bias = 2^(Ebits - 1) - 1 = exponent bias

                Emax = 2^Ebits - 2

        08/17/04                                                        (2)

                if (hidden) then
                        Emin = 1
                else
                        Emin = 0
                endif

        (PURPLE) This descriptor permits an exponent as large as 255
        (PURPLE) bits and a fraction as long as half a Petabit which
        (PURPLE) would require 70 TeraBytes to store one number.  The
        (PURPLE) exponent range of 2^2^255 is so large that it fits
        (PURPLE) neatly between two well known numbers: 1 googol =
        (PURPLE) 10^100 << (10^100)^38 < 2^2^255 ~ 10^(2*10^76) <<
        (PURPLE) (2^2^255)^(5*10^23) < 10^10^100 = 1 googolplex.  As
        (PURPLE) the total number of bits available in the observable
        (PURPLE) universe must be substantially smaller than 10^254,
        (PURPLE) that should be sufficient for the time being.  I feel
        (PURPLE) quite comfortable with the range & precision available
        (PURPLE) here & leaving anyone who needs anything beyond that
        (PURPLE) to their own devices.

        The interpretation of numbers in this format is:

                1) if exponent == Emax + 1 && fraction == 0 then
                        value = (-1)^sign*Infinity

                2) if exponent == Emax + 1 && fraction != 0 then
                        value = NaN

                3) if Emin <= exponent && exponent <= Emax then
                        if (hidden) then
                                value = (-1)^sign*2^(exponent - bias)*
                                        (1 + fraction/2^Fbits)
                        else
                                value = (-1)^sign*2^(exponent - bias)*
                                        (fraction/2^Fbits)
                        endif

                4) if exponent == 0 then
                        if (hidden) then
                                value = (-1)^sign*2^(Emin - bias)*
                                        (fraction/2^Fbits)
                        else
                                value = (-1)^sign*2^(exponent - bias)*
                                        (fraction/2^Fbits)
                        endif

        Note that the standard formats are characterized by the
        following descriptors:

                Format | prepend hidden alignment Ebits Fbits
                -------|--------+------+---------+-----+-----
                Single | false    true      2       8    23
                Double | false    true      3      11    52
                Quad   | false    true      4      15   112

        08/17/04                                                        (3)

        Also, the extended formats may be described similarly with a
        descriptor that varies from implementation to implementation.

        It shall be possible for the user to obtain a variable in any
        format of this kind by the declaration (for declarative
        languages) or promotion (for non-declarative languages)
        binaryFP(prepend, hidden, alignment, Ebits, Fbits).

        While manipulating numbers in these formats, an implementation
        may choose to represent them in some other format.  For example,
        an implementation may choose to

                *  separate the sign, exponent, and/or fraction fields
                   along word or cache line boundaries or to fit them
                   into specialized registers,

                *  represent the fraction as a collection of residues
                   modulo a suitably chosen collection of moduli, or

                *  represent the fraction as a collection of suitably
                   scaled floating-point numbers in a smaller format.

        When reading or writing numbers in these formats or transmitting
        or receiving them, an implementation may choose to

                *  represent the descriptor in ASCII,

                *  suitably truncate any padding on output, or

                *  suitably increase padding on input.

        In all cases where such manipulations are chosen, operations
        shall behave as if in the specified format including respecting
        over/underflow thresholds and specified rounding precision.

        Also, in all cases chosen, an implementation shall arrange that
        such numbers always appear to the user as if they were in the
        specified format.  This shall include such things as

                *  printing of a number in hexadecimal,

                *  examining any part of a number as if with a Fortran
                   equivalence or C union, and

                *  changing any part of a number as if with a Fortran
                   equivalence or C union

        but exclude such things as debuggers, crash dump analyzers, or
        other tools designed to examine the implementation environment
        directly.

        08/17/04                                                        (4)


        Z.2 ENVIRONMENTAL INQUIRIES


        In addition, the implementer may wish to support a proper subset
        of these formats in hardware.  Therefore, if the user merely
        wants access to a format that is AT LEAST as good as the format
        specified, the environmental inquiry subsumptiveFP(prepend,
        hidden, alignment, Ebits, Fbits) shall return the descriptor of
        the least subsumptive format that is well supported in this
        implementation.  In this case, binaryFP(subsumptiveFP(prepend,
        hidden, alignment, Ebits, Fbits)) is what the user wants.

        Within any given format F = {prepend, hidden, alignment, Ebits,
        Fbits}, all operations shall be supported as defined in this
        standard.  In particular, nextafter (nextup?) shall be
        supported.  Also, all conversions between F and all other
        supported formats, integers, and all other available optional
        formats shall be supported.  Conversions between F and decimal
        strings shall be exactly rounded and as defined in the standard.

        For any format that has a non-empty padding field, zeros shall
        be placed there on stores and any non-zero bits found on loads
        shall be ignored.


        Z.3  CHOOSING A BALANCED FORMAT


        In deciding which format to choose to solve a given problem, the
        user must seek some balance in its description.

        Note that any format F = {prepend, hidden, alignment, Ebits,
        Fbits}, for which the exponent field is too large relative to
        its fraction risks that some algorithms will be slow to converge
        for the accuracy of their results especially when the result is
        zero.  Similarly, any format for which the exponent field is too
        small risks that some algorithms may terminate early returning
        less accurate results than expected.  To prevent this problem it
        should be possible to raise ULP(1.0) to a modest power (say
        >=10) without underflow.

        A format with an unusual number of total bits may incur
        additional load/store expense on modern wide memory busses.
        Therefore, it is often desireable to pad a format out to some
        number of bytes of the form a*2^k where 'a' is a small odd
        number like 1, 3, or 5, so that loads and stores may be aligned
        to hardware memory constructs such as cache lines.  The desired
        alignment in this case is 'k'.

        08/17/04                                                        (5)

        To solve some problems accurately in a given format F, it is
        often desireable to perform some calculations in a larger format
        known as the extended of F or extended(F).  To perform these
        calculations, it is sufficient to have at least 1 more exponent
        bit in extended(F) than in F and 2 or more is desireable.  In
        addition, it is sufficient to have the fraction bits in
        extended(F) be nearly Ebits(F) + Fbits(F) and it is desirable to
        have it exceed (6/5)*Fbits(F).  One extra exponent bit permits
        an FMA to be evaluated without risk of over/underflow and 2
        extra bits permits the evaluation of a 4th order polynomial of
        modest coefficients.  Fbits(extended(F)) >= Ebits(F) + Fbits(F)
        permits accurate evaluation of x^y in cases where it doesn't
        over/underflow when returned to the precision and range of F.

        It is also often desirable to perform calculations in a format
        at least twice as large as a given format F.  This format is
        known as the double of F or double(F). In double(F), it is
        sufficient to have at least 2 more exponent bits than in F and 3
        is desireable.  In addition, it is sufficient to have at least
        twice as many fraction bits as in F and 5 or 6 more bits is
        desireable.  Three extra exponent bits permits the accurate
        evaluation of an 8th order polynomial of modest coefficients
        without risk of over/underflow.  Also, Fbits(double(F)) >=
        2*Fbits(F) permits complex multiply to be correctly rounded to
        the desired precision when evaluated in the obvious way.

        In addition to the standard operations supported, an
        implementation that supports both an F and a double(F) shall
        support the operations

                *  dprod(x,y) which takes two operands in the F format
                   and returns the exact product in the double(F)
                   format, and

                *  sadd(x,y) which takes two operands in the double(F)
                   format and returns the sum rounded once to the F
                   format.

                *  ssubtract(x,y) which takes two operands in the
                   double(F) format and returns the difference rounded
                   once to the F format.

        Implementations that choose to support formats that prepend
        descriptors to operands shall provide an additional operand for
        each supported operation that specifies the descriptor of the
        desired result.  Thus, expressions that are evaluated in an
        increased range or precision and return a final result reduced
        to the desired precision with only a single round may be easily
        implemented.

        08/17/04                                                        (6)

        If an implementation supports prepending of the descriptor to
        optional formats, it shall support mixed operations in the
        following manner

                *  all monadic operations op(x) as op(x, d) where 'd' is
                   the descriptor of the desired result,

                *  all dyadic operations op(x, y) as op(x, y, d) where
                   'd' is the descriptor of the desired result, and

                *  FMA(x, y, z) as FMA(x, y, z, d) where 'd' is the
                   descriptor of the desired result.

        For x1, y1 in F and x2, y2 in double(F) we have that

                   dprod(x1,y1) = multiply(x1,y1,descriptor(double(F))),

                    sadd(x2,y2) = add(x2,y2,descriptor(F)), and

               ssubtract(x2,y2) = subtract(x2,y2,descriptor(F)).

        So, for complex multiply we have for a + i*b, c + i*d in complex
        FxF we have:

                (a + i*b)*(c+i*d) =
                                ssubtract(dprod(a, c), dprod(b, d))
                                 + i*sadd(dprod(b, c), dprod(a, d)).

        Some good figures of merit for a collection of formats are:

                   (2^Ebits)/Fbits >= 10

                Ebits(extended(F)) >= Ebits(F) + 2

                  Ebits(double(F)) >= Ebits(F) + 3

                Fbits(extended(F)) >= (6/5)*Fbits(F) + 1

                  Fbits(double(F)) >= 2*Fbits(F) + 4

        08/17/04                                                        (7)


        Z.4  THE STANDARD TOWER OF FORMATS


        In keeping with these principles, the following subset of all
        possible F shall be considered standard optional formats.

        Format | hidden alignment Ebits      Fbits        L(bytes)
        -------|+------+---------+-----+----------------+----------
          F0   |  true     2        8          23             4
          F1   |  true     2(*)     9          30             5
          F2   |  true     2(*)    10          37             6
          F3   |  true     3       11          52             8
          F4   |  true     3(*)    13          66            10
          F5   |  true     3(*)    14          81            12
          F6   |  true     4       15         112            16
          F7   |  true     4(*)    16         143            20
          F8   |  true     4(*)    17         174            24
          F9   |  true     5       18         237            32
          F10  |  true     5(*)    19         300            40
          F11  |  true     5(*)    20         363            48
          F12  |  true     6       21         490            64
                ... in general for F4 & beyond for k >= 0 ...
        F(3k+4)|  true    k+3(*) 3k+13 5*2^(k+4)-(3k+14)  5*2^(k+1)
        F(3k+5)|  true    k+3(*) 3k+14 3*2^(k+5)-(3k+15)  3*2^(k+2)
        F(3k+6)|  true    k+4    3k+15   2^(k+7)-(3k+16)    2^(k+4)

                (*) For these formats, an implementer may choose to
                    increase or decrease alignment to suit.

        Note that the F0 format is standard single, the F3 format is
        standard double, and F6 is standard quad.

        In general, for each format, F(k), on this list, that format's
        standard extended shall be F(k+2).  Also, for each format, F(k),
        that format's standard double shall be F(k+3).


        Z.5  EFFICIENT USE OF PRECISION


        In order to return a result accurate to the precision of F,
        some algorithms must repeat the same calculations in an ever
        increasing series of precisions until an unusually restrictive
        error bound is met.  To minimize the total amount of work
        required, it is desireable to perform those calculations in a
        series of formats which increase in precision by approximately
        Sqrt[2] ~ 1.4 with each step.  Either of the subsets {F3, F4,
        F6, F7, F9, F10, F12, ...} or {F3, F5, F6, F8, F9, F11, F12,
        ...} approximate this series.

        08/17/04                                                        (8)

        Other algorithms must perform a similar series of calculations
        of increasing precision until an approximation of the precision
        needed may be accurately estimated after which it may jump
        directly to the final precision for one more calculation to
        return the desired result.  In such cases, the final format
        chosen should not be too much larger than that desired so as to
        minimize unneeded overhead.

        In addition, all known efficient algorithms for implementing
        large arithmetic operations (such as multiply) do not increase
        smoothly in run time as a function of the precision but instead
        exhibit large gaps in performance.

        For both these reasons, there should be several available
        precisions for each power of two in size to minimize both the
        actual cost of a given calculation and the chance that one will
        incur the cost of a large performance gap.

        Providing three precisions for each power of 2, {4, 5, and
        6}*2^k, should be sufficient to minimize these costs without too
        much implementation complexity.

        (PURPLE) With only two precisions available per binade, the
        (PURPLE) worst case performance gap can occasionally be as large
        (PURPLE) as (Sqrt[2])^2 or 2.  With three precisions available,
        (PURPLE) the worst case performance gap can be 2^(2/3) < 1.6 and
        (PURPLE) this should be very rare.

754 | revision | FAQ | references | list archive