
Re: Rounded operations: test



Vincent Lefevre wrote:
On 2011-07-10 13:05:38 -0500, Nate Hayes wrote:
For this particular machine, it seems the first test (explicitly
changing/restoring the rounding mode for each individual addition
operation) was just a tiny bit slower than the "fixup" method of
using the nextup() routine; the small performance hit in this first
test is probably worth the extra accuracy it provides.

Concerning your nextup() test, there are many branches, and since
the code is run on the same data, I suppose that the branches can
be predicted correctly (if the processor implements branch
prediction). In real code, this solution might be slower.

In the test, note that x2 is both an operand and a result, i.e.,
 x2 = nextup( x2 + y2 );
 x2 = nextup( x2 + y2 );
 x2 = nextup( x2 + y2 );
 x2 = nextup( x2 + y2 );
 ...
So the nextup() function never receives the same operand twice in any
of the 1 billion operations.

BTW, nextup() is considerably slower when compiled and run in 32-bit mode;
on 64-bit machines the 64-bit integers can fit nicely into a register and be
operated on in a single instruction, but in 32-bit mode they must be treated
as multi-precision ints operated on in multiple instructions and with
additional branching inserted by the compiler.
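
To make the "fixup" method concrete, here is a rough sketch of how a nextup()
built on 64-bit integer operations might look, together with the chained
update described above. This is not the code used in the test; the function
bodies and the helper names add_fixup() and accumulate_up() are assumptions
for illustration, and the sketch assumes IEEE 754 binary64 doubles with
additions evaluated in "round to nearest" mode.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of nextup(): the smallest double strictly greater than x.
 * NaN and +infinity are returned unchanged. */
double nextup(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);            /* reinterpret as integer */

    if (x != x)                                /* NaN */
        return x;
    if (bits == UINT64_C(0x7FF0000000000000))  /* +infinity */
        return x;

    if ((bits << 1) == 0)                      /* +0 or -0 */
        bits = 1;                              /* smallest positive subnormal */
    else if (bits >> 63)                       /* negative: step toward zero */
        bits -= 1;
    else                                       /* positive: step away from zero */
        bits += 1;

    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Upper bound on x + y: add in round-to-nearest, then bump by one step. */
double add_fixup(double x, double y)
{
    return nextup(x + y);
}

/* Chained update in the style of x2 = nextup( x2 + y2 ): each result
 * becomes the operand of the next addition. */
double accumulate_up(double x, double y, unsigned long n)
{
    while (n--)
        x = add_fixup(x, y);
    return x;
}
```

The result is always a valid upper bound, but it can be one step wider than
necessary: add_fixup(1.0, 1.0) returns 2.0 + 2^-51 even though the exact sum
2.0 is representable. In 32-bit mode the compiler has to split each uint64_t
operation into several instructions, which is where the extra cost comes from.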

I ran the test in 32-bit mode because the Microsoft compilers don't support
inline assembly in 64-bit mode for the add() function. Someone with a better
compiler might want to try running the tests in 64-bit mode.


For further comparison, I just ran a third test which simply does
1 billion additions in "round to nearest" mode, i.e., I made NO
attempt whatsoever to implement directed rounding. This test was an
order of magnitude faster than either of the previous two.

The moral of the story seems to be what I suspect most of us already
knew: that rounded operations can be easily emulated in software,
but real hardware support of these operations at the opcode level of
the processor will surely give dramatic speed improvements.

Probably. It would be interesting to run your benchmark on other
processors, in particular those supporting static (or semi-static)
rounding modes.

I'd like to see such a comparison, too. We don't have access to such a
hardware platform.

Nate