Thread Links			Date Links
Thread Prev	Thread Next	Thread Index	Date Prev	Date Next	Date Index

History

To: stds-1788 <stds-1788@xxxxxxxxxxxxxxxxx>
Subject: History
From: Ulrich Kulisch <ulrich.kulisch@xxxxxxx>
Date: Wed, 7 Aug 2013 10:31:05 +0200
Delivered-to: mhonarc@xxxxxxxxxxxxxxxx
List-help: <https://listserv.ieee.org/cgi-bin/wa?LIST=STDS-1788>, <mailto:LISTSERV@LISTSERV.IEEE.ORG?body=INFO%20STDS-1788>
List-owner: <mailto:STDS-1788-request@LISTSERV.IEEE.ORG>
List-subscribe: <mailto:STDS-1788-subscribe-request@LISTSERV.IEEE.ORG>
List-unsubscribe: <mailto:STDS-1788-unsubscribe-request@LISTSERV.IEEE.ORG>
Sender: stds-1788@xxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130620 Thunderbird/17.0.7

On the history of CA and the EDP

Computing a correctly rounded dot product by sorting was repeatedly recommended. The method is described in my book with Miranker: Computer Arithmetic in Theory and Practice, Academic Press 1981, and in an earlier book in German: Grundlagen des Numerischen Rechnens, Bibliographisches Institut, Mannheim Wien Zuerich, 1976. I used it in these books to prove that semimorphic operations can be realized in matrix and vector spaces. In the early 1970ies (it was the time of the 8086 and the Z80) this was a method that could be realized. It computes a correctly rounded dot product. Memory space was expensive in these days.

But with the steadily increasing computer technology the method with the long accumulator (CA and EDP) turned out to be superior. It avoids sorting and allows more applications. In a project supported by Nixdorf Computers (1976-1980) we developed a first version of our Pascal-XSC. It provides a function scalp for the EDP. Scalp has three parameters. The first two specify the vectors to be multiplied and the third one a major number of rounding modes. Together with a large number of problem solving routines we exhibited the system on the Hannover Exhibition in April 1980. Nixdorf did not commercialize the system. Numerical computing did not fit into its marketing strategy. But we got a number of systems and decentralized the programming education with it in the summer of 1980, while batch processing of punch cards and punch tapes were standard. (The PC entered the scene in 1982). In the following years the co-developers Siegfried Rump and Juergen Wolff von Gudenberg gave about 200 demonstrations of the Z80 based system all over the world.

Thus IBM got aware of its superior computing properties and it came to a cooperation with IBM of about ten years duration (1980-1990). For its /370 architecture we developed a general programming environment ACRITH-XSC which is a Frotran-77 extension similar to our Pascal-XSC together with a large number of problem solving routines with automatic result verification. Parallel to this we transferred our Pascal-XSC to the PC, updated it and we developed a corresponding C-XSC together with several Toolbox volumes for advanced applications. See the list of literature. All three XSC-languages provide CA and the EDP. The long Accumulator is kept in memory. On the /370 architecture about 1200 bits are needed for it because of its modest exponent range. Siemens and Hitachi followed immediately the development at IBM. Only 6 weeks after IBM had announced its product ACRITH-XSC and released details of its implementation Hitachi had it already on own computers. So CA and the EDP were state of the art on IBM, Siemens, and Hitachi computers as well as in Pascal-XSC, and in C-XSC 25 years ago.

Placing the long accumulator in the user memory was not done by choice, but by necessity. Computers at these days did not provide enough register space. A disadvantage of this solution is that for each accumulation step, four memory words must be read and written in addition to the two operand loads. (For details see my book). So the scalar product computation speed is limited by the memory to processor transfer bandwidth and speed.

Of course it would be better and much faster to keep the long accumulator as local memory or register space on the arithmetic unit. We realized it that way on our chip (1994). See the attachment. It provides CA and the EDP for the IEEE 754 format double precision. About 4200 bits are needed for the complete register. The chip was developed in a gate array technology as coprocessor for the Intel processor 486 which at that time ran at 32 MHz and got its data via a 32 bit bus. So reading the two 64 bit factors of a product took four cycles. This allowed doing the multiplication in a pipeline also in four steps. So only a 27 bit multiplier was provided. A 53 bit multiplier would at least double the silicon area. The picture shows that the complete register and the carry resolution logic together require about the same silicon area than the fast multiplier. It brings a similar speed up for accumulations than the Wallace tree for multiplications. Under Pascal-XSC the chip performed the EDP between two and four times faster than the main processor an approximation in conventional floating-point arithmetic.

Experience shows that for numerical computing, manufacturers only implement what the arithmetic standard requires. So IEEE P1788 indeed should require Ca and the EDP. Only recommending it is probably not enough to get it. The /370 architecture in the old days was a lucky exception. Here IBM defined the standard. Realization of the exact dot product by other manufacturers was not a big problem.

On Intel and AMD x86-64 processors at least 16 registers are available for words of 128 bits, 2048 bits altogether. This is about twice as much as was needed for the long accumulator on the /370 architecture and it is about 50% of the long accumulator for the IEEE 754 format double precision. It could be used to support an inner part of the complete register by fast hardware. See the section Hardware Accumulation Window in my book. The outer parts of this window would then be handled in software. Probably they would seldom be used. Then the vast majority of scalar products would exactly execute on fast hardware. A scalar product computation, in general, does not require an extremely large exponent range (as might be necessary for polynomials with high powers).

Implementing CA and the EDP is more or less straightforward. A general experience, however, tells that with design tools that are publicly available a processor only reaches about 10% of the speed a company would reach with its own optimized design tools. So to develop its full power the EDP must be incorporated into the ALU of the processor. I have no doubts that we shall get it if we require it.

Interval arithmetic brings guarantees and safe bounds into computing. These bounds frequently are overly pessimistic. The EDP is the most general universal tool for getting close bounds. It brings accuracy and speed. A combination of both is what is needed from a modern computer. We did not get the EDP from IEEE 754. So it is most natural that we provide it in IEEE 1788.

-- 
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch

Telefon: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/

KIT - Universität des Landes Baden-Württemberg 
und nationales Großforschungszentrum in der 
Helmholtz-Gesellschaft

Attachment: fig21a.pdf
Description: Adobe PDF document

Attachment: fig21b.pdf
Description: Adobe PDF document

Attachment: References.doc
Description: MS-Word document

Prev by Date: Please listen to Ulrich here...
Next by Date: CA and EDP
Previous by thread: CA and EDP
Next by thread: Please listen to Ulrich here...
Index(es):
- Date
- Thread