Re: ExactDotProduct
> Date: Tue, 03 Nov 2009 14:57:50 -0500
> To: stds-1788 <stds-1788@xxxxxxxxxxxxxxxxx>
> From: Michel Hack <hack@xxxxxxxxxxxxxx>
> Subject: Re: ExactDotProduct
>
> James Demmel wrote:
> > My point was that even a single hardware register is implicitly sorting,
> > bucket sorting by exponents, because it needs to be able to add and
> > possibly cancel operands with overlapping mantissas.
>
> Ulrich's point was that a hardware implemention simply uses a subset
> of the exponent bits to index into the correct bucket, so to speak.
> Carries may of course propagate into higher-exponent buckets, and in
> order to reduce carry propagation, one would use two accumuluators
> in practice, summing positive and negative terms separately, and
> doing one big subtraction at the end. In a parallell environment
> one might use several accumulator pairs.
>
> The hardware primitive would be shift-and-add (to position the addend
> relative to the 64-bit buckets, say).
>
> Michel.
> ---Sent: 2009-11-03 20:06:08 UTC
Michel,
I suspect Jim understands this point.
On the contrary, Ulrich's approach uses a SUPER set of
exponent bits to index into a table of registers (together
with their carry bits) to implement what is known as a
radix sort. While this may be O(1) in time it is O(2^e)
in space, where e is at least one more than your exponent
bits.
And the space involved is not memory. It is registers.
Many registers. Many many registers.
Even today when computers are cheap enough to throw away
with your clothes, registers are expensive in the design
of such machines.
Perhaps there will come a time when this is not so but
I suspect registers will always be expensive in any given
design relative to the cost & application of that design.
The reason for this is that even if a designer chooses to
design in many many registers there are always other things
that can be done with those registers that have a far
greater impact on performance than to dedicate them to the
use of accurate dot products.
Still, I believe Ulrich's approach might find its way into
a general purpose design some day if it were implemented
using a large set of registers that the computer could ALSO
use for other, much more common, purposes.
So, this is the real tradeoff: Ulrich's super registers
giving a fast & accurate (if expensive) dot product.
Versus approaches using main memory & more likely given
in software than hardware, some accurate & some only
faithful.
That is, IMHO, where one's attention should be in the
consideration of this motion.
Dan