IEEE 754R minutes from September 19, 2002

Attendance and minutes review

We met in September at HP Labs; Peter Markstein was our host. We accepted the August minutes without modification. To accommodate call-in participants, Kahan discussed underflow handling at the beginning of the meeting. David Hough asked for time to talk about static modes, exception names, and alternate exception handling. Otherwise, we did not modify the agenda.

Hardware implementation of underflow

We began with a presentation on gradual underflow by Prof Kahan.

Implementation mechanisms

There are two parts to gradual underflow: rounding underflowed results, and accepting underflowed operands. We respond to underflowed products and quotients by producing correctly rounded subnormal numbers. Underflowed results of sums and differences are exact, so we need not worry about rounding them; we just need to put them in the proper format. Unless the programmer means to trap on underflow, the only time we need worry about subnormal operands is on multiplication and division. For multiply and divide, it is easiest to think about (if not to implement) subnormal operand handling as a two-part process: first normalize the operands, widening the exponent range as needed, and then perform the operation as usual.

The result of the operation may be an ordinary floating point number, or the operation might cause another underflow or an overflow.
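
As a rough illustration of this two-part view (not anything presented at the meeting), the sketch below pre-normalizes operands with C99's frexp and rescales with ldexp; the name toy_multiply is made up, and note that the sketch rounds twice (once in the multiply, once in ldexp), which is exactly the hazard the re-rounding bits discussed later are meant to avoid.

    /* A toy model of the two-part process, assuming positive finite inputs:
     * (1) pre-normalize each operand into a significand in [0.5, 1) plus a
     *     wider-range exponent (frexp does this even for subnormal inputs),
     * (2) multiply as usual and add the exponents; ldexp then scales the
     *     result back, giving an ordinary number, a subnormal, zero, or
     *     infinity as the case may be. */
    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    static double toy_multiply(double a, double b) {
        int ea, eb;
        double ma = frexp(a, &ea);        /* a == ma * 2^ea, ma in [0.5, 1) */
        double mb = frexp(b, &eb);        /* b == mb * 2^eb, mb in [0.5, 1) */
        return ldexp(ma * mb, ea + eb);   /* ordinary, underflow, or overflow */
    }

    int main(void) {
        double sub = DBL_MIN / 1024;                    /* a subnormal operand      */
        printf("%a\n", toy_multiply(sub, 1024.0));      /* ordinary result: DBL_MIN */
        printf("%a\n", toy_multiply(sub, sub));         /* underflows again, to 0   */
        printf("%a\n", toy_multiply(DBL_MAX, DBL_MAX)); /* overflows to infinity    */
        return 0;
    }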

Some implementors think gradual underflow is too complicated and expensive to handle in hardware, and so they choose to trap to software. We distinguish traps, which occur after the offending instruction has finished, from faults, in which the offending instruction behaves as if it had never issued. It is simpler to handle underflow with a trap than with a fault; however, most of the mechanisms described here do not depend on the difference.

Heavyweight traps are too expensive to be reasonable for underflow handling, and they are not necessary. Therefore trap-based underflow handling should use lightweight traps. There is no need to save the entire register file, for instance: only the operands are needed for input, and only the destination register will be modified. Remember also that handling underflow in the default manner will not change dependency relations in the instruction schedule. It will only change the time at which the result is available.

Architectures with fused multiply-adders need hardly any additional support to deal with gradual underflow, since the fused multiply-add unit already has a double-width register used for the multiply result. That's why hardware designers who provide an FMA instruction will not be terribly troubled by gradual underflow. When they have this extra-wide register, they can multiply subnormal operands. All they have to do is complicate the logic in the subsequent add. The only interesting case is the product of a normal and a subnormal; when two subnormals are multiplied, the entire result goes into the sticky bit.
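
A small check of the magnitude argument behind that last remark, assuming a correctly rounded C99 fma() and access to the exception flags: the product of two subnormal doubles is below 2^-2044, far below the least significant bit position of any nonzero double, so it can affect at most the sticky bit of the subsequent add.

    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    int main(void) {
        double tiny = 0x1p-1074;            /* smallest positive subnormal */
        feclearexcept(FE_ALL_EXCEPT);
        double r = fma(tiny, tiny, 1.0);    /* exact value is 1 + 2^-2148  */
        printf("result:  %.17g\n", r);      /* rounds to 1.0 exactly ...   */
        printf("inexact: %d\n",             /* ... with only inexact set   */
               fetestexcept(FE_INEXACT) != 0);
        return 0;
    }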

We prefer to implement gradual underflow without trapping. To avoid traps, we treat underflow in much the same way we treat cache misses or page misses. Like a memory miss, a gradual underflow event does not change any register dependencies; it only delays when the result can be used. We have in mind an architecture where the floating point registers are distinguished from other registers. These floating point registers hold more than just the bits for the standard representation. They also have space for a tag field. On the 8087, there was a three-bit tag field which was hardly ever used; the field atrophied in the descendants of the 8087. The purpose of the field was to give later, faster implementations a place to store information about whether the register contents were ready to be used for a particular purpose. If the register contents are unready, the tag bits say what to do: you may need to normalize, or you may need to denormalize.

In this scheme, we need two extra instructions, each of which looks like a variant of add 0: one to normalize a register's contents, and one to denormalize them.

Besides two extra exponent bits and two tag bits to distinguish between types of representations, we need two bits to correctly re-round results. Corinna Lee describes in detail how to perform such multistep rounding in "Multistep Gradual Rounding," which appeared in IEEE Transactions on Computers, Vol. 38, No. 4, April 1989.

In total, we need six extra bits for this scheme: two to extend the exponent range; two for multistep rounding; and two to tell how the register contents can be used. We can write nanocode or microcode to handle the processing. We have no more reason to trap on underflow than we do to trap on a cache miss, except for the engineering cost. A fast trapping implementation could also use normalize and denormalize operations as described above, but these will only make sense if those extra bits are available. In particular, you need the bits that tell you how to re-round.
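
A rough sketch of the extended register format described above; the type and field names are illustrative, not from any actual design. The standard 64 bits sit alongside the six extra bits: two to widen the exponent range, two (round and sticky) for correct re-rounding, and a two-bit tag saying how the contents may be used.

    #include <stdint.h>

    enum fp_tag {
        TAG_READY       = 0,   /* an ordinary IEEE double, usable as-is        */
        TAG_NORMALIZE   = 1,   /* subnormal operand awaiting pre-normalization */
        TAG_DENORMALIZE = 2    /* underflowed result awaiting denormalization  */
    };

    struct fp_register {
        uint64_t bits;          /* sign, exponent, and significand as usual    */
        unsigned ext_exp : 2;   /* extends the exponent range for tiny results */
        unsigned round   : 2;   /* round and sticky bits for re-rounding       */
        unsigned tag     : 2;   /* one of the fp_tag values                    */
    };

    /* The two add-0-like instructions would then act on this state:
     * normalize() shifts the significand left and adjusts ext_exp, clearing
     * TAG_NORMALIZE; denormalize() shifts right, re-rounds using the round
     * and sticky bits, and clears TAG_DENORMALIZE. */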

Why implement fast gradual underflow?

Why should hardware designers worry about whether gradual underflow will be fast or slow? Hardware implementors will worry about why it should be fast; applications programmers will worry about why it shouldn't be slow.

We want underflow to be, as nearly as possible, negligible. Why? Because people will ignore it anyway! We need to enhance the prospects that code will survive such indifference. Of course, some programmers will worry about underflow, and there must be a way for them to produce code they can prove correct. We would like to have as many proven codes as we can get, so that we can use them as building blocks. The crux of the matter is this: how much more is it worth to have a program that always works instead of one that almost never fails?

So why make underflow fast? Underflows occur infrequently, but when they do occur, they often come in convoys. Whatever causes one underflow will usually cause a lot more. So occasionally a program will encounter a large batch of underflows, which makes it slow. The loss of speed will upset someone. Some deal with the loss of speed by saying, "Use flush-to-zero mode; that will go fast." In many cases, the underflowed value will ultimately disappear when added to something large; in those cases, flush-to-zero is reasonable. But flush-to-zero must not be the default. It is dangerous to make flush-to-zero the default, and it is dangerous to make flush-to-zero the de facto default by making gradual underflow too slow an alternative. It is not dangerous often, but that does not matter. We should take an actuarial approach: multiply the probability of bad things happening by the cost when they do happen. Unfortunately, we lack statistics for both the incidence and the cost of problems due to flushed underflow. The effects of flushed underflow are subtle, and bugs caused by misapplication of flush-to-zero are difficult to diagnose.
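
A small example of the danger, assuming IEEE double arithmetic: under gradual underflow the divided difference below is computed exactly, while a flush-to-zero mode turns both subnormal differences into zero and the quotient into 0/0 = NaN. (Enabling flush-to-zero is platform specific; on many systems a compiler flag or the hardware FTZ bit does it.)

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        double x1 = DBL_MIN,       x2 = 1.25 * DBL_MIN;
        double y1 = 2.0 * DBL_MIN, y2 = 2.50 * DBL_MIN;

        double dx = x2 - x1;    /* 2^-1024: exact, but subnormal */
        double dy = y2 - y1;    /* 2^-1023: exact, but subnormal */

        /* Gradual underflow: exactly 2.  Flush-to-zero: NaN. */
        printf("slope = %g\n", dy / dx);
        return 0;
    }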

If we want reliable codes, we must make codes easy to prove, and we must provide default behaviors that give numerically naive programmers the best chance of avoiding problems. The right default for both these tasks is gradual underflow. Does your code waste time computing tiny things that do not matter? If so, perhaps you should change the code to avoid wasting time computing those things. Or perhaps those tiny things do matter, and the underflow flag is warning you that you need to do something.

Discussion

Executive summary:

Zuras noted during the presentation that FMA implementations which use Booth encoding need to do something special to handle subnormal operands. The only issue is that the most significant bit is handled differently from the rest of the bits; in the absence of an implicit leading MSB, additional logic is required. Both Zuras and Kahan agreed that this is not a significant problem.

Das Sarma agreed that we should try to support gradual underflow in hardware, but expressed concern that Kahan's scheme would slow down the common case. For example, suppose you have a four cycle add instruction with bypass. Then as soon as you have the result, even before you write it to the register file, you want to forward it to another instruction that takes it as an operand. That means the result tag must also be determined early! It seems you would pay a cost even in the common case.

Zuras noted that underflow can conservatively be predicted early. That way you can avoid the two cycle hit in most of the common cases. Das Sarma noted that the penalty could be much higher for faster machines. The point is that variable latency instructions pose a lot of problems for fast execution of the common case. Zuras agreed that for any variable-latency instruction in a pipeline, it is important to determine the cycle length early in the instruction execution.

Kahan commented that if you are asked to normalize something which has already been normalized, no harm will be done; it will just take extra time. Das Sarma replied that, on the Athlon, the overhead is only paid when an over/underflow really does occur, not when numbers approach the over/underflow threshold. Zuras asked how the Athlon does that, and Das Sarma replied that the instruction is started speculatively and aborted with a pipeline flush if there are any problems. In essence, it does a lightweight trap. The performance problem, though, is not in the trap cost; it is in the pipeline flush. Kahan thought such a lightweight trap was reasonable, so long as it costs no more than perhaps 10X the normal instruction cost. An overhead of 100X would be intolerable.

Zuras interjected an anecdote about a chip he once designed which handled underflow correctly and quickly. The follow-on processor used Weitek chips, which did not handle underflow in hardware. In general, the later processor was 3X faster. However, one of the SPEC benchmarks had fairly common underflows in a summation loop. The underflowed values contributed not at all to the sum, but underflow occurred with sufficient frequency that the entire benchmark was 6X slower than it should have been. The trap itself was 300X slower than the common case execution.

Riedy asked about the possibility of adding a fifth stage, and always predicting that stage does not happen. If underflow occurs and the additional stage is required, he suggested, one could just stall the processor. Das Sarma replied that there were multiple problems with this approach. For instance, consider the case where the FPU is architected as a coprocessor. Stalling the whole machine from the FPU is unreasonable!

Riedy asked how interactions between the FPU and the memory unit work on the Athlon, and Das Sarma replied that there are two paths to the FPU: one for data from the integer unit, and one for data from the memory unit. If an add instruction needs a memory operand, it will not be initiated by the floating point scheduler until the load data arrives. That's one microarchitecture; there are several variations.

Zuras asked whether the position that gradual underflow should be done mostly in hardware was at all controversial. Kahan noted that for some applications, such as signal processing, the data remains in a sufficiently restricted range that handling of over/underflow is not an issue. Someone mentioned graphics as another example where range might not be an issue, but several participants responded that for certain geometric computations important in computer graphics, underflow is a definite possibility.

Jim Thomas noted that he sometimes heard hardware people take the position that the value of gradual underflow is insufficient to balance the cost of incorporating it into fast hardware. Steve Bass defended that position. He noted that dealing with exceptional cases can cost a lot in circuitry, and, like Das Sarma, expressed concerns about the impact on common case performance. Bass also noted that on the machine on which he works, all floating point operations take the same number of cycles, and therefore the scoreboarding logic is relatively simple. Variable latency instructions would ruin that simplicity.

Bass asked why it would be inadequate to simply keep extra exponent bits and only worry about denormalization when storing. Several responded: you still must round to the correct position. Kahan noted that, in the case of a fused multiply-add, it is not necessary to round the intermediate product.
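
The point about still having to round to the correct position can be made concrete with a toy integer model (my own illustration, not from the meeting): rounding a result first to working precision and then again to the subnormal position can differ from rounding once to the final position. The two grids below stand in for the 53-bit format and the coarser subnormal spacing.

    #include <stdio.h>

    /* Round non-negative v to the nearest multiple of step, ties to even. */
    static long round_even(long v, long step) {
        long q = v / step, r = v % step;
        if (2 * r > step || (2 * r == step && (q & 1)))
            q++;
        return q * step;
    }

    int main(void) {
        long v = 37;                                     /* the exact result  */
        long once  = round_even(v, 8);                   /* 40: correct       */
        long twice = round_even(round_even(v, 2), 8);    /* 37 -> 36 -> 32    */
        printf("rounded once: %ld, rounded twice: %ld\n", once, twice);
        return 0;
    }

This is the situation the two re-rounding bits in Kahan's scheme, and Lee's multistep rounding, are there to repair: with the round and sticky information preserved, the second rounding can still deliver the correctly rounded result.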

Discussion then turned to gradual underflow when FMA hardware is available. Hinds noted that even when an FMA is available, there is still a need to deal with underflowed results. Zuras thought it was probably not difficult, but invited any in the audience who had worked recently with FMA hardware to speak more authoritatively. Bass agreed that it was probably not difficult to add gradual underflow logic to a fused multiply-adder, but warned that any change is expensive, particularly when the design must be portable between processes. Erle mentioned that for a project with which he was involved, the designers did need to increase the width of alignment for bringing denormalized operands in without corrupting the product, and that increased the amount of circuitry. The difficulty is with the short (108 bit) adder at the end of the FMA. Bass thought that might be the same difficulty encountered in the HP design, since they also used a short adder.

Erle commented that he was against full hardware support for gradual underflow because he did not think his customer base would appreciate or enjoy it. Kahan responded that it depended whether his customers would appreciate software which is guaranteed to always work. It is possible to prove correctness of floating point software with flushed underflow, but the proofs are sufficiently more difficult that they will usually not be done.

Kahan then expounded further: The overwhelming majority of users are practically unaffected by underflow handling. A small number of programs are affected, and even those programs are not affected much. Because so few programs are affected, it is difficult for the customer base to notice whether gradual underflow is well supported or not. If that were not the case, gradual underflow never would have been made part of the original 754 standard, since it would have impacted a substantial number of programmers in a substantial way. So there is a restricted set of circumstances in which gradual underflow makes a difference, but in those circumstances it does make a difference. For example:

These examples illustrate two situations: computations which some users worry about but most do not, and computations which practically everyone does, but which hardly ever enter a regime where they underflow. Our treatment of underflow should mimic the treatment of similar problems which occur throughout engineering. We design buildings stronger than appears necessary; due diligence obliges us to similarly take care of details like gradual underflow.

Erle replied that the IBM systems on which he works do support gradual underflow. His objection is simply to providing full hardware support for denormal handling. Zuras and Kahan repeated their earlier position: a penalty of 10X is acceptable, but a penalty of hundreds of times the ordinary cost is not.

Zuras remarked that hardware to make software easier to write is an enormous win. He contended that the gains from Moore's law have come primarily as reductions in the cost of hardware, while the cost of software has remained relatively level. While most software is written by those who are not floating point experts, there are very widely used packages which are written by experts, and support that eases the development of such packages has value. If it is done right, Zuras quoted, "nobody will notice, which is your reward."

Jim Thomas noted that slow underflow handling affects compilers, too. Optimizers will not speculate floating point operations if doing so might mean dealing with subnormal operands at too high a cost.

Fahmy asked whether it would be possible to handle the additional rounding required during denormalization by using a general piece of hardware that could round to any position. After all, he observed, there is already hardware available to round to single, double, and perhaps extended and internal precisions. Das Sarma responded that even if you have an instruction to round at an arbitrary bit position, you must still determine where that bit position will be, and that determination will take time.

Bass asked why we could not simply eliminate the extra rounding of denormals. Kahan replied that the proofs we have rely on rounding as it is, and it would be expensive to review the proofs to see if it is appropriate to bless such an economization.

Draft review

Signaling NaN revisited

In August, we decided to remove signaling NaNs as a requirement, provided it would do no harm. After the August meeting, committee members requested feedback from readers of newsgroups and mailing lists and from language liaisons. Some responses were forwarded to the committee mailing list. Several responders used signaling NaNs for debugging, and Professor Gentleman described how he used signaling NaNs to represent missing data in statistical calculations. Gentleman wrote that his technique also works with third-party codes. We had not yet heard from the language liaisons by the September meeting.
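
Gentleman's actual code was not presented; the sketch below only illustrates the general idea of the technique: mark missing entries with a signaling NaN so that any arithmetic use of missing data raises the invalid exception (or traps, where trapping is available). C99 has no portable signaling-NaN constructor, so the bit pattern is assembled by hand, and whether it survives loads, stores, and optimization is platform dependent.

    #include <fenv.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #pragma STDC FENV_ACCESS ON

    static double missing_value(void) {
        /* exponent all ones, quiet bit clear, nonzero payload: a signaling NaN */
        uint64_t bits = UINT64_C(0x7FF0000000000001);
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }

    int main(void) {
        volatile double datum = missing_value();   /* a "missing" observation    */
        feclearexcept(FE_ALL_EXCEPT);
        double scaled = datum * 0.1;               /* any use of missing data... */
        printf("invalid raised: %d\n",             /* ...raises the invalid flag */
               fetestexcept(FE_INVALID) != 0);
        printf("now a quiet NaN: %d\n", isnan(scaled) != 0);
        return 0;
    }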

We disagreed on how to react to Gentleman's example. Hough and Liu thought Gentleman's technique was a compelling example that justified continued support for signaling NaNs. Zuras and Riedy thought the problem could equally well be handled with other techniques. Others in the audience wanted further time to read and digest the examples presented by Gentleman and others.

If missing data is a compelling application for signaling NaNs, there are implications for the rest of the standard. Gentleman's technique does not work without traps. If signaling NaNs stand in for missing data, then either the negate operation must trap or the semantics of the sign bit of a NaN must be defined. Similarly, quiet comparison operations might need to trap on signaling NaNs.

We decided to postpone further signaling NaN discussion until October. Until further discussion, we chose to leave Hough's editorial work in place, though still marked as tentative. The changes so far are modest: the requirement for at least one signaling NaN has been removed, and the discussion of signaling NaNs in section 6.2 (operations with NaNs) has been moved to appendix 6. Appendix 6 is also where recommendations regarding signaling NaNs will go in this organization.

Thomas, Zuras, and Fahmy all thought we should return to signaling NaNs soon so that we would have time to digest the feedback already sent to the list, as well as replies that we know are pending. Zuras would like actual code, or at least illustrative toy code, for the signaling NaN uses reported to the committee. With such code, we would be better able to judge whether a good alternative to signaling NaNs could be used. Riedy noted that at least one of his correspondents, an SAS developer, is unlikely to provide code.

Predicate table

The discussion of table 4 was mercifully short. In the August meeting, there was confusion about using the same symbol for quiet and signaling predicates. Hough disambiguated the symbols by using bangs to distinguish signaling predicates. Also, we did not decide in August whether we wanted to keep the <> predicate or omit it for lack of interest. By leaving it in, we would oblige language designers to implement it. Thomas noted that C99 has a binding for <>, though he would not speak for its utility. The expression A <> B can be written equally well as not (A !< B and A !> B), so it seemed that <> provided little value. Hough moved to remove <>, and there were no objections.
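
One way to see the quiet/signaling distinction for <> in C99, assuming an IEC 60559 (Annex F) conforming implementation: islessgreater() is the quiet form, while (a < b || a > b) raises invalid on unordered operands. (Which of these the C99 binding mentioned above refers to is not recorded here.)

    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    int main(void) {
        volatile double a = NAN, b = 1.0;

        feclearexcept(FE_ALL_EXCEPT);
        int quiet = islessgreater(a, b);          /* false, raises nothing */
        int quiet_inv = fetestexcept(FE_INVALID) != 0;

        feclearexcept(FE_ALL_EXCEPT);
        int loud = (a < b || a > b);              /* false, raises invalid */
        int loud_inv = fetestexcept(FE_INVALID) != 0;

        printf("quiet <>: %d (invalid=%d)   signaling <>: %d (invalid=%d)\n",
               quiet, quiet_inv, loud, loud_inv);
        return 0;
    }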

Traps, flags, and definition confusion

Kahan thought we should talk about raising and lowering flags instead of setting and resetting them. Thomas objected that, in our current usage, raising an exception indicates a possible trap, while setting a flag does not. Kahan said he wanted to distinguish exceptions that occur from exceptions that signal; this is particularly important for explaining the difference between signaled and unsignaled underflow conditions. Riedy recommended the phrase raises a condition. We seemed to settle on raise / lower, and Hough agreed to change the draft accordingly.

We briefly reviewed the different cases of underflow detection. In Kahan's terminology, underflow occurs when a subnormal result is produced, and it is signaled when there is a corresponding loss of accuracy. There are two ways to detect loss of accuracy. Intel (and other hardware manufacturers) interpret loss of accuracy on underflow to mean that the subnormal result produced is different from the infinitely precise result. The other permitted definition for loss of accuracy is that the subnormal result produced is different from the result that would be produced if the exponent range were unbounded. Kahan said he preferred the latter definition, but did not regard the former definition as a serious crime.
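
The distinction shows up directly in the flags, assuming C99 <fenv.h> and default (untrapped) handling: dividing DBL_MIN by 2 produces an exact subnormal, so underflow occurs but is not signaled, while dividing by 3 also loses accuracy, so both underflow and inexact are raised.

    #include <fenv.h>
    #include <float.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    static void report(const char *label) {
        printf("%-11s underflow=%d inexact=%d\n", label,
               fetestexcept(FE_UNDERFLOW) != 0,
               fetestexcept(FE_INEXACT) != 0);
        feclearexcept(FE_ALL_EXCEPT);
    }

    int main(void) {
        volatile double x;
        feclearexcept(FE_ALL_EXCEPT);

        x = DBL_MIN / 2.0;   /* exact subnormal result   */
        report("DBL_MIN/2:");

        x = DBL_MIN / 3.0;   /* inexact subnormal result */
        report("DBL_MIN/3:");

        (void)x;
        return 0;
    }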

Hough moved traps to an appendix, and changed section 8 to a description of alternate exception handling. Appendix 8, in this organization, would contain advice to hardware implementors about what information to provide in order to make trap-based implementations feasible. Only authors of system programs would need to worry about the actual details of the traps; applications programmers would use the alternate exception handling mechanisms of section 8.

System program proved to be a controversial term. Liu thought it was fine not to distinguish between system and applications programs, and suggested we continue simply to refer to the user. Various members suggested the phrases system software, support program, and support software, and each of these phrases met with some objection. To Kahan, software refers to a shrink-wrapped product, while a program is more general; to Liu, a program is an independent executable unit, while software refers to code in general. Riedy suggested that we describe trapping as a mechanism to support alternate exception handling, and, from agreement or exhaustion, nobody objected.
