On the history of CA and the EDP
Computing a correctly rounded dot product by
sorting was repeatedly recommended. The method is described in
my book with Miranker: Computer Arithmetic in Theory and
Practice, Academic Press 1981, and in an earlier book in
German: Grundlagen des Numerischen Rechnens, Bibliographisches
Institut, Mannheim Wien Zuerich, 1976. I used it in these
books to prove that semimorphic operations can be realized in
matrix and vector spaces. In the early 1970ies (it was the
time of the 8086 and the Z80) this was a method that could be
realized. It computes a correctly rounded dot product. Memory
space was expensive in these days.
But with the steadily increasing computer
technology the method with the long accumulator (CA and EDP)
turned out to be superior. It avoids sorting and allows more
applications. In a project supported by Nixdorf Computers
(1976-1980) we developed a first version of our Pascal-XSC. It
provides a function scalp
for the EDP. Scalp has three parameters. The first two specify
the vectors to be multiplied and the third one a major number
of rounding modes. Together with a large number of problem
solving routines we exhibited the system on the Hannover
Exhibition in April 1980. Nixdorf did not commercialize the
system. Numerical computing did not fit into its marketing
strategy. But we got a number of systems and decentralized the
programming education with it in the summer of 1980, while
batch processing of punch cards and punch tapes were standard.
(The PC entered the scene in 1982). In the following years the
co-developers Siegfried Rump and Juergen Wolff von Gudenberg
gave about 200 demonstrations of the Z80 based system all over
the world.
Thus IBM got aware of its superior computing properties and it
came to a cooperation with IBM of about ten years duration
(1980-1990). For its /370 architecture we developed a general
programming environment ACRITH-XSC which is a Frotran-77
extension similar to our Pascal-XSC together with a large
number of problem solving routines with automatic result
verification. Parallel to this we transferred our Pascal-XSC
to the PC, updated it and we developed a corresponding C-XSC
together with several Toolbox volumes for advanced
applications. See the list of literature. All three
XSC-languages provide CA and the EDP. The long Accumulator is
kept in memory. On the /370 architecture about 1200 bits are
needed for it because of its modest exponent range. Siemens
and Hitachi
followed immediately the development at IBM. Only 6 weeks
after IBM had announced its product ACRITH-XSC and released
details of its implementation Hitachi had it already on
own computers. So CA and the EDP were state of the art on IBM,
Siemens, and Hitachi
computers as well as in Pascal-XSC, and in C-XSC 25 years ago.
Placing the long accumulator in the user memory was not done
by choice, but by necessity. Computers at these days did not
provide enough register space. A disadvantage of this solution
is that for each accumulation step, four memory words must be
read and written in addition to the two operand loads. (For
details see my book). So the scalar product computation speed
is limited by the memory to processor transfer bandwidth and
speed.
Of course it would be better and much faster to keep the long
accumulator as local memory or register space on the
arithmetic unit. We realized it that way on our chip (1994).
See the attachment. It provides CA and the EDP for the IEEE
754 format double precision. About 4200 bits are needed for
the complete register. The chip was developed in a gate array
technology as coprocessor for the Intel processor 486 which at
that time ran at 32 MHz and got its data via a 32 bit bus. So
reading the two 64 bit factors of a product took four cycles.
This allowed doing the multiplication in a pipeline also in
four steps. So only a 27 bit multiplier was provided. A 53 bit
multiplier would at least double the silicon area. The picture
shows that the complete register and the carry resolution
logic together require about the same silicon area than the
fast multiplier. It brings a similar speed up for
accumulations than the Wallace tree for multiplications. Under
Pascal-XSC the chip performed the EDP between two and four
times faster than the main processor an approximation in
conventional floating-point arithmetic.
Experience shows that for numerical computing, manufacturers
only implement what the arithmetic standard requires. So IEEE
P1788 indeed should require Ca and the EDP. Only recommending
it is probably not enough to get it. The /370 architecture in
the old days was a lucky exception. Here IBM defined the
standard. Realization of the exact dot product by other
manufacturers was not a big problem.
On Intel and AMD x86-64 processors at least 16 registers are
available for words of 128 bits, 2048 bits altogether. This is
about twice as much as was needed for the long accumulator on
the /370 architecture and it is about 50% of the long
accumulator for the IEEE 754 format double precision. It could
be used to support an inner part of the complete register by
fast hardware. See the section Hardware Accumulation Window in my book. The
outer parts of this window would then be handled in software.
Probably they would seldom be used. Then the vast majority of
scalar products would exactly execute on fast hardware. A
scalar product computation, in general, does not require an
extremely large exponent range (as might be necessary for
polynomials with high powers).
Implementing CA and the EDP is more or less straightforward. A
general experience, however, tells that with design tools that
are publicly available a processor only reaches about 10% of
the speed a company would reach with its own optimized design
tools. So to develop its full power the EDP must be
incorporated into the ALU of the processor. I have no doubts
that we shall get it if we require it.
Interval arithmetic brings guarantees and safe
bounds into computing. These bounds frequently are overly
pessimistic. The EDP is the most general universal tool for
getting close bounds. It brings accuracy and
speed. A combination of both is what is needed from a modern
computer. We did not get the EDP from IEEE 754. So it is most
natural that we provide it in IEEE 1788.