Motion 9 ExactDotProduct
Dear all:
Of course, I am very happy that Motion 9 passed.
But reading over again and again the comments that came with the 'NO'
votes I am severely doubting whether Motion 9 was really understood.
'Simplicity' is required. 'The standard should not be too heavy'.
Actually Motion 9 describes the simplest and fastest way for computing a
dot product. Take a chest of drawers with 67 numbered drawers. Each one
holds 64 bits. The exponent of the summand (the product) consists of 12
bits. The leading 6 bits give the address of the three consecutive
drawers to which the summand of 106 bits is added. The low end 6 bits of
the exponent are used for the correct positioning of the summand within
the selected drawers. The addition affects at most 170 bits of these
drawers, i.e., an adder of 170 bits could execute the addition in a
single add cycle.
A carry is absorbed by the next more significant drawer in which not all
bits are 1. For fast detection of this word a flag is attached to each
drawer. It is set 1 if all bits of the word are 1. This means that a
carry will propagate through the entire word. As soon as the exponent of
the summand is available the flags allow selecting and incrementing the
carry word. This can be done simultaneously with adding the summand into
the selected drawers. If the addition produces a carry the incremented
word is written into the carry word. Otherwise it is left as it was. A
zero flag may serve the same purpose in case of subtraction.
There is indeed no simpler way of accumulating a dot product. Any method
that just computes an approximation also has to consider the relative
values of the summands. This results in a more complicated method.
The fascinating property of the exact dot product (Motion 9) is the fact
that it can be computed with extreme speed, ideally in the time the
processor needs to read the data. No special cases have to be dealt
with. The technique of adding a medium sized bit string to a very long
one may have applications in other areas of computing as well.
All conventional vector processors provide a 'multiply and accumulate'
operation (the dot product) to achieve high speed. In a pipeline the
arithmetic (the multiplication and the accumulation) is done in the time
the processor needs to read the data. Very effective vectorizing
compilers have been developed that use this 'multiply and accumulate'
operation within a user's program as often as possible, since this
greatly speeds up program execution. However, the accumulation is done
in floating-point arithmetic. The so-called 'partial sum technique'
alters the sequence of the summands and causes errors in addition to the
usual floating-point errors. The exact dot product avoids all numerical
errors and at the same speed. The hardware needed for it is comparable
to that for a fast multiplier by a Wallace tree, accepted years ago. In
speed a hardware implementation of the exact dot product exceeds
computing an accurately rounded dot product in software by several
orders of magnitudes.
Best wishes and a happy holiday season
Ulrich Kulisch
A disappointing feature is the failure of the numerical analysts to
influence computer hardware and software in the way they should. It is
often said that the use of computers for scientific work represents a
small part of the market and numerical analysts have resigned themselves
to accepting facilities "designed" for other purposes and making the
best of them.
J. H. Wilkinson: Turing Lecture 1970, J. ACM 18 (1971), 146.