Thread Links | Date Links | ||||
---|---|---|---|---|---|
Thread Prev | Thread Next | Thread Index | Date Prev | Date Next | Date Index |
Dan Zuras Intervals schrieb:
Date: Tue, 03 Nov 2009 14:57:50 -0500 To: stds-1788 <stds-1788@xxxxxxxxxxxxxxxxx> From: Michel Hack <hack@xxxxxxxxxxxxxx> Subject: Re: ExactDotProduct James Demmel wrote:My point was that even a single hardware register is implicitly sorting, bucket sorting by exponents, because it needs to be able to add and possibly cancel operands with overlapping mantissas.Ulrich's point was that a hardware implemention simply uses a subset of the exponent bits to index into the correct bucket, so to speak. Carries may of course propagate into higher-exponent buckets, and in order to reduce carry propagation, one would use two accumuluators in practice, summing positive and negative terms separately, and doing one big subtraction at the end. In a parallell environment one might use several accumulator pairs. The hardware primitive would be shift-and-add (to position the addend relative to the 64-bit buckets, say). Michel. ---Sent: 2009-11-03 20:06:08 UTCMichel, I suspect Jim understands this point. On the contrary, Ulrich's approach uses a SUPER set of exponent bits to index into a table of registers (together with their carry bits) to implement what is known as a radix sort. While this may be O(1) in time it is O(2^e) in space, where e is at least one more than your exponent bits. And the space involved is not memory. It is registers. Many registers. Many many registers. Even today when computers are cheap enough to throw away with your clothes, registers are expensive in the design of such machines.
Dan:I agree that the use of what you call the 'super register' would be the ideal and the fastest solution. To add a product to the 'super register' only two memory words must be read. The result of the accumulation of the product appears in the 'super register'.
The basic idea discussed here was realized on IBM, SIEMENS, and Hitachi computers 25 years ago. These computers did not provide enough register space. So here the 'super register' was placed in the user memory. The disadvantage of this solution is that for each accumulation step, four memory words (the three words to which the product is added and the word which absorbs the carry) must be read and written in addition to the two operand loads. So the scalar product computation is slower than by use of a 'super register'. However, in practice today the necessary memory space would probably be in cache. So the loss in speed would be marginal. Anyway, compared with any (even fast) software solution the gain in speed would still be tremendous.
Genarally, realization of Motion 9 probably will be architecture dependent. I have the highest opinion of the hardware designers and I am convinced that they shall develop clever and intelligent ideas if Motion 9 gets accepted.
Best regards Ulrich