IEEE 754R minutes from November 21, 2002

Attendance and minutes review

The October meeting minutes were accepted without objection.

Decimal subcommittee note

Zuras briefly described the status of the decimal subcommittee. A year ago, Cowlishaw approached the committee to talk about a decimal arithmetic standard, and those interested in the problem formed a subcommittee. We have settled some issues. A number of minor issues remain to be worked out, along with one major issue: normalization. We have agreed that we should come to the committee with a single proposal for debate. If we can come to an agreement soon, we would like to devote the January meeting (and perhaps more) to this issue. Whether or not we reach agreement, we will let the committee know.

Draft review

Changes:

We are still waiting for a universally acceptable definition of exception. Jim Thomas asked about the definition of algebraic operations (in the context of propagation of NaNs); he thought there was a consensus from October that we should not write a specific list of operations instead of using a term like algebraic. Bindel replied that if there was such consensus, he was confused and failed to note it.

David Scott inquired about the behavior of rem(denormal, infinity) with underflow unmasked. We decided to come back to it whenever we finish with exceptions.
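For reference, the remainder of a finite x by infinity is x itself, so rem(denormal, infinity) is an exact but subnormal result. The following is a minimal Python sketch of that one fact, using math.remainder as a stand-in for the standard's rem. Python does not expose trap masking, so the underflow question itself is not modeled here.

```python
import math
import sys

# Smallest positive subnormal (denormal) double.
tiny = 5e-324

# Remainder of a finite value by infinity is the value itself.
r = math.remainder(tiny, math.inf)

is_exact = (r == tiny)
# Subnormal means nonzero but below the smallest normal double.
is_subnormal = 0.0 < abs(r) < sys.float_info.min

print(is_exact, is_subnormal)
```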

Definitions and wording

Hough noted that we still need definitions for exception, signal, and algebraic, but thought that since we could not summarize the situation in a few simple sentences, making definitions should be deferred. David Scott agreed to try to write definitions for exception and signal. Kahan agreed to try to write down the principles governing NaN behavior, though, as Hough commented, the principles are sufficiently involved that they may not belong in the standard. Kahan noted that NaNs have a historical background going back to Zuse, but the undefined and indeterminate notions that preceded NaNs foundered due to their lack of defined behavior. NaNs have defined behavior, and an attempt was made to define that behavior well; but since it's provably impossible to predict the behavior for a NaN argument for an arbitrary function, we must adopt conventions, and that will lead to some arguments. Fahmy suggested that, as a matter of presentation, it might be easier to just specify how NaNs are treated when we define the behaviors of specific functions.

min and max

We considered three types of min/max operations, each with several variants. Besides the ordinary max (for which only the behavior with signed zeros and NaNs was debated), we talked about an absmax function which would return the argument of largest magnitude or the magnitude of the argument with largest magnitude, an operation like the BLAS ixamax which would return the index of the largest magnitude element of a vector, and corresponding min functions. We spent most of our time debating the exceptional behavior of ordinary min/max. We also decided on a definition for absmin/absmax, but we decided not to say anything about maximum index operations.

Kahan proposed two mathematical characterizations for max over the reals plus points at +/-inf which can be extended to NaNs:

  1. z := Max{x, y} iff z <= x or z <= y
  2. z := Max{x, y} iff z >= x and z >= y

Using the first definition, Max{5, NaN} = 5. Under the second definition, Max{5, NaN} = NaN. There is no mathematical reason to prefer one definition to the other.
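The difference comes from the fact that every ordered comparison involving a NaN is false. The sketch below evaluates the two predicates exactly as recorded above, for the single pair (5, NaN), with z restricted to the two arguments. The helper names are hypothetical, and the predicates alone are not meant as a complete definition of max.

```python
import math

def satisfies_first(z, x, y):
    # Characterization 1 as recorded: z <= x or z <= y
    return z <= x or z <= y

def satisfies_second(z, x, y):
    # Characterization 2 as recorded: z >= x and z >= y
    return z >= x and z >= y

def candidates(pred, x, y):
    """Arguments that satisfy pred when taken as z (hypothetical helper)."""
    return [z for z in (x, y) if pred(z, x, y)]

x, y = 5.0, math.nan
first = candidates(satisfies_first, x, y)    # only 5.0 qualifies
second = candidates(satisfies_second, x, y)  # no argument qualifies

print(first, second)
```

Under the first predicate only 5 qualifies. Under the second, neither argument qualifies, which corresponds to the NaN result.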

We listed the following problem cases for defining min and max:

We agreed from the start that min and max should be symmetric in their arguments, except possibly when both arguments are NaN (we originally said commutative rather than symmetric until Kahan suggested the latter term). Zuras suggested we apply a total ordering on all NaNs (e.g. order by the significand value) in order to decide the behavior of max(NaN, NaN), but there was no consensus. Schwarz commented that, from a hardware perspective, he would prefer not to worry about symmetry of max with respect to NaN arguments.

We argued far more about the behavior of max(x, NaN) where x is non-NaN. According to C99, max(x, NaN) is x; according to Java, max(x, NaN) is NaN. Schwarz and Riedy argued that max(x, NaN) should be NaN so that NaN results indicating a problem in a computation would not be lost. Thomas and Kahan argued that it is advantageous to get rid of NaNs whenever possible. Thomas pointed in particular to the behavior of hypot(inf, nan), and asked why similar principles did not apply to max. Reasonable behaviors seemed to include

The question, then, was which function should be the default. After some argument, Schwarz was persuaded that max(inf,NaN) at least should be defined as inf. When we returned after a break, others seemed to have accepted the notion that max should return a number when possible.
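The two candidate behaviors for max(x, NaN) can be sketched in Python as follows. The function names are hypothetical, and signed zeros and signaling are deliberately ignored. This is only an illustration of the C99-style and Java-style results mentioned above, not proposed wording.

```python
import math

def max_number_preferring(x, y):
    """Return the non-NaN argument when exactly one argument is NaN
    (the C99-style result)."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return x if x > y else y

def max_nan_propagating(x, y):
    """Return NaN when either argument is NaN (the Java-style result)."""
    if math.isnan(x) or math.isnan(y):
        return math.nan
    return x if x > y else y

print(max_number_preferring(5.0, math.nan))       # 5.0
print(max_nan_propagating(5.0, math.nan))         # nan
print(max_number_preferring(math.inf, math.nan))  # inf
# hypot already discards a NaN when the other argument is infinite.
print(math.hypot(math.inf, math.nan))             # inf
```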

We also debated the signal behavior of NaNs. One suggested compromise was that max(x, NaN) would return x, but would signal invalid. Thomas pointed out that this would make max and min the only floating point operations that return a non-NaN floating point number but raise invalid. We eventually decided not to signal invalid on max(x, NaN), or at least we quit discussing the matter.

After we decided on the fate of ordinary min and max (except in the NaN, NaN case), we turned to absmin and absmax. Markstein and Okada pointed out that absmin and absmax are good for certain compensated summation algorithms. We settled on the following definition with surprisingly little debate: absmax returns the argument which is largest in absolute value. If both arguments have equal magnitude, then we let

Integer formats

Hough described his proposal for the addition of specified signed and unsigned integer formats of the same length as the floating point types. We have talked about how conversion to integer should work; it would also be convenient to have these types in order to get the bit pattern for a floating point number.
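As an illustration of the bit-pattern use, here is a minimal Python sketch that reinterprets a 64-bit double as an unsigned 64-bit integer and back. The function names are hypothetical, and the struct module stands in for whatever mechanism a language binding would provide.

```python
import struct

def double_to_bits(x: float) -> int:
    """Reinterpret a 64-bit double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_double(b: int) -> float:
    """Reinterpret an unsigned 64-bit integer as a 64-bit double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

print(hex(double_to_bits(1.0)))   # 0x3ff0000000000000
print(hex(double_to_bits(-0.0)))  # 0x8000000000000000
```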

Hough proposed that the formats come in pairs: half size and full size. Only a few operations are specified: multiplication of two half size numbers to get a full size number, and division of a full size number by a half size number to get a half size number. Kahan asked about the application for these sizes; Hough responded that he used such things for binary to decimal conversion and for parts of certain high precision arithmetic operations inside transcendental functions and FMA. Hough noted that a better approach might be to specify same-size formats, and then specify the arithmetic operations for implementations that support both the 32-bit and 64-bit formats.
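The paired operations might look like the following Python sketch. Several things here are assumptions not taken from the proposal: 32-bit half size and 64-bit full size, unsigned operands only, returning the remainder along with the quotient, and raising an error when the quotient does not fit in the half size. The function names are hypothetical.

```python
HALF_BITS = 32                          # assumed half size
HALF_MASK = (1 << HALF_BITS) - 1
FULL_MASK = (1 << (2 * HALF_BITS)) - 1

def mul_half_to_full(a, b):
    """Unsigned half x half -> full product. The product always fits."""
    if not (0 <= a <= HALF_MASK and 0 <= b <= HALF_MASK):
        raise ValueError("operands must fit in the half size")
    return a * b

def div_full_by_half(n, d):
    """Unsigned full / half -> (half quotient, half remainder)."""
    if not (0 <= n <= FULL_MASK and 0 < d <= HALF_MASK):
        raise ValueError("operands out of range")
    # The quotient fits in the half size only if the high half of n is below d.
    if (n >> HALF_BITS) >= d:
        raise OverflowError("quotient does not fit in the half size")
    return divmod(n, d)

product = mul_half_to_full(HALF_MASK, HALF_MASK)
quotient, remainder = div_full_by_half(10**12 + 7, 10**9)
print(hex(product), quotient, remainder)
```

Dividing by a power of ten that fits in the half size, as in the last call, is the kind of step that comes up when peeling decimal digits off a wide integer during binary to decimal conversion.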

Riedy asked whether the arithmetic was actually used anywhere in the proposal. Hough replied that it was not, but that good integer support is part of the environment needed for certain types of floating point computation.

Hough then asked whether we wanted to pursue the proposal further. We decided during Hough's description to remove the half precision part, and to put any integer arithmetic operations in an appendix. That still leaves some other parts that affect the standard. Zuras summarized the committee's opinion as "we're kind of agreed in principle except other details."
