Thread Links Date Links
Thread Prev Thread Next Thread Index Date Prev Date Next Date Index

Motion 9 ExactDotProduct



Dear all:

Of course, I am very happy that Motion 9 passed.

But reading over again and again the comments that came with the 'NO' votes I am severely doubting whether Motion 9 was really understood. 'Simplicity' is required. 'The standard should not be too heavy'.

Actually Motion 9 describes the simplest and fastest way for computing a dot product. Take a chest of drawers with 67 numbered drawers. Each one holds 64 bits. The exponent of the summand (the product) consists of 12 bits. The leading 6 bits give the address of the three consecutive drawers to which the summand of 106 bits is added. The low end 6 bits of the exponent are used for the correct positioning of the summand within the selected drawers. The addition affects at most 170 bits of these drawers, i.e., an adder of 170 bits could execute the addition in a single add cycle.

A carry is absorbed by the next more significant drawer in which not all bits are 1. For fast detection of this word a flag is attached to each drawer. It is set 1 if all bits of the word are 1. This means that a carry will propagate through the entire word. As soon as the exponent of the summand is available the flags allow selecting and incrementing the carry word. This can be done simultaneously with adding the summand into the selected drawers. If the addition produces a carry the incremented word is written into the carry word. Otherwise it is left as it was. A zero flag may serve the same purpose in case of subtraction.

There is indeed no simpler way of accumulating a dot product. Any method that just computes an approximation also has to consider the relative values of the summands. This results in a more complicated method.

The fascinating property of the exact dot product (Motion 9) is the fact that it can be computed with extreme speed, ideally in the time the processor needs to read the data. No special cases have to be dealt with. The technique of adding a medium sized bit string to a very long one may have applications in other areas of computing as well.

All conventional vector processors provide a 'multiply and accumulate' operation (the dot product) to achieve high speed. In a pipeline the arithmetic (the multiplication and the accumulation) is done in the time the processor needs to read the data. Very effective vectorizing compilers have been developed that use this 'multiply and accumulate' operation within a user's program as often as possible, since this greatly speeds up program execution. However, the accumulation is done in floating-point arithmetic. The so-called 'partial sum technique' alters the sequence of the summands and causes errors in addition to the usual floating-point errors. The exact dot product avoids all numerical errors and at the same speed. The hardware needed for it is comparable to that for a fast multiplier by a Wallace tree, accepted years ago. In speed a hardware implementation of the exact dot product exceeds computing an accurately rounded dot product in software by several orders of magnitudes.

Best wishes and a happy holiday season
Ulrich Kulisch

A disappointing feature is the failure of the numerical analysts to influence computer hardware and software in the way they should. It is often said that the use of computers for scientific work represents a small part of the market and numerical analysts have resigned themselves to accepting facilities "designed" for other purposes and making the best of them.
J. H. Wilkinson: Turing Lecture 1970, J. ACM 18 (1971), 146.