On 2011-07-10 13:05:38 -0500, Nate Hayes wrote:
> For this particular machine, it seems the first test (explicitly
> changing/restoring the rounding mode for each individual addition
> operation) was just a tiny bit slower than the "fixup" method of
> using the nextup() routine; the small performance hit in this first
> test is probably worth the extra accuracy it provides.

Concerning your nextup() test, there are many branches, and since
the code is run on the same data, I suppose that the branches can
be predicted correctly (if this is implemented that way on the
processor). In real code, where the data vary and the branches are
harder to predict, this solution might be slower.
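
To make the branch count concrete, here is a minimal Python sketch of a
nextup()-style fixup. It is not the benchmark code from this thread: the
names nextup and add_up_loose are hypothetical, and bumping every sum
upward unconditionally is only an assumption about what a simple fixup
looks like.

```python
import math
import struct

def nextup(x: float) -> float:
    """Smallest double strictly greater than x (hypothetical helper)."""
    # Branch 1: NaN and +inf are returned unchanged.
    if math.isnan(x) or x == math.inf:
        return x
    # Branch 2: +0.0 and -0.0 both map to the smallest positive subnormal.
    if x == 0.0:
        return 5e-324
    # Reinterpret the double as a signed 64-bit integer.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Branch 3: for positive x the next double has the next bit pattern;
    # for negative x the magnitude must shrink, so the pattern decreases.
    bits += 1 if x > 0.0 else -1
    return struct.unpack("<d", struct.pack("<q", bits))[0]

def add_up_loose(a: float, b: float) -> float:
    """Upper bound on a + b: round to nearest, then always step up once.

    Assumption: this is the simplest possible fixup. The result is never
    below the exact sum, but it can be one ulp above the correctly
    rounded-up result (for example when a + b is exact).
    """
    return nextup(a + b)
```

With this sketch, add_up_loose(1.0, 1.0) returns the double just above
2.0 even though the exact sum is representable, which is the kind of
accuracy loss that a true directed-rounding addition avoids.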

> For further comparison, I just ran a third test which simply does
> 1 billion additions in "round to nearest" mode, i.e., I made NO
> attempt whatsoever to implement directed rounding. This test was an
> order of magnitude faster than either of the previous two.
>
> The moral of the story seems to be what I suspect most of us already
> knew: that rounded operations can be easily emulated in software,
> but real hardware support of these operations at the opcode level of
> the processor will surely give dramatic speed improvements.

Probably. It would be interesting to run your benchmark on other
processors, in particular those supporting static (or semi-static)
rounding modes.
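
On the point that rounded operations can be emulated in software, one
known way to get the correctly rounded-up sum without touching the
rounding mode is to recover the rounding error with the TwoSum algorithm
and step up only when the nearest-rounded sum fell below the exact one.
The sketch below is an illustration under stated assumptions, not code
from the benchmark: the names two_sum and add_up_exact are hypothetical,
it needs Python 3.9 or later for math.nextafter, and it assumes that the
addition does not overflow.

```python
import math

def two_sum(a: float, b: float):
    """Knuth's TwoSum: return (s, err) with s = fl(a + b), s + err = a + b."""
    s = a + b
    bb = s - a                       # the part of b that made it into s
    err = (a - (s - bb)) + (b - bb)  # what was lost when rounding a + b
    return s, err

def add_up_exact(a: float, b: float) -> float:
    """a + b rounded toward +infinity, emulated in round-to-nearest mode.

    Assumption: no overflow occurs; if s is infinite, err is NaN and the
    infinite s is returned unchanged.
    """
    s, err = two_sum(a, b)
    # A positive error means the exact sum lies above s, so s was rounded
    # down and the next double up is the correctly rounded result.
    if err > 0.0:
        return math.nextafter(s, math.inf)
    return s
```

For example, add_up_exact(1.0, 1.0) returns 2.0 exactly, while
add_up_exact(1.0, 2.0**-60) returns the double just above 1.0. The
data-dependent branch on err is exactly the kind of branch whose cost
depends on how well the processor can predict it.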