Thread Links Date Links
Thread Prev Thread Next Thread Index Date Prev Date Next Date Index

Re: ExactDotProduct



Dan Zuras Intervals schrieb:
Date: Tue, 03 Nov 2009 14:57:50 -0500
To: stds-1788                           <stds-1788@xxxxxxxxxxxxxxxxx>
From: Michel Hack                                 <hack@xxxxxxxxxxxxxx>
Subject: Re: ExactDotProduct

James Demmel wrote:
My point was that even a single hardware register is implicitly sorting,
bucket sorting by exponents, because it needs to be able to add and
possibly cancel operands with overlapping mantissas.
Ulrich's point was that a hardware implemention simply uses a subset
of the exponent bits to index into the correct bucket, so to speak.
Carries may of course propagate into higher-exponent buckets, and in
order to reduce carry propagation, one would use two accumuluators
in practice, summing positive and negative terms separately, and
doing one big subtraction at the end.  In a parallell environment
one might use several accumulator pairs.

The hardware primitive would be shift-and-add (to position the addend
relative to the 64-bit buckets, say).

Michel.
---Sent: 2009-11-03 20:06:08 UTC

	Michel,

	I suspect Jim understands this point.

	On the contrary, Ulrich's approach uses a SUPER set of
	exponent bits to index into a table of registers (together
	with their carry bits) to implement what is known as a
	radix sort.  While this may be O(1) in time it is O(2^e)
	in space, where e is at least one more than your exponent
	bits.

	And the space involved is not memory.  It is registers.
	Many registers.  Many many registers.

	Even today when computers are cheap enough to throw away
	with your clothes, registers are expensive in the design
	of such machines.
Dan:

I agree that the use of what you call the 'super register' would be the ideal and the fastest solution. To add a product to the 'super register' only two memory words must be read. The result of the accumulation of the product appears in the 'super register'.

The basic idea discussed here was realized on IBM, SIEMENS, and Hitachi computers 25 years ago. These computers did not provide enough register space. So here the 'super register' was placed in the user memory. The disadvantage of this solution is that for each accumulation step, four memory words (the three words to which the product is added and the word which absorbs the carry) must be read and written in addition to the two operand loads. So the scalar product computation is slower than by use of a 'super register'. However, in practice today the necessary memory space would probably be in cache. So the loss in speed would be marginal. Anyway, compared with any (even fast) software solution the gain in speed would still be tremendous.

Genarally, realization of Motion 9 probably will be architecture dependent. I have the highest opinion of the hardware designers and I am convinced that they shall develop clever and intelligent ideas if Motion 9 gets accepted.

Best regards
Ulrich