
Fw: Please listen to Ulrich here...



I erred in two ways in calculating the number of bits required for binary128 CA EDP.  Since each input pair is first multiplied, the number of fraction bits is doubled, and so is the exponent range.  So I think a correct implementation of CA would need >65536 bits (not >32768 bits), or >512 (not >256) 128-bit vector registers.
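A back-of-the-envelope check of those sizes (a sketch; the exact guard-bit count is an implementation choice, and the 64 used here is only an illustrative assumption):

```python
# Rough width of a Kulisch long accumulator for an exact dot product.
# Products of two p-bit-significand numbers carry 2p fraction bits and
# exponents spanning twice the format's exponent range, so the
# fixed-point accumulator must cover about 2*(emax - emin) + 2*p bits,
# plus some guard bits against intermediate overflow.
def accumulator_bits(emax, emin, p, guard=64):
    return 2 * (emax - emin) + 2 * p + guard

print(accumulator_bits(1023, -1022, 53))     # binary64: roughly 4300 bits
print(accumulator_bits(16383, -16382, 113))  # binary128: well over 65536 bits
```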

To protect against overflow, an extreme example is to have N value pairs where each is the maximum positive finite number in the 754 representation, followed by N value pairs of the maximum negative values.  To return the correct answer of 0, there have to be enough extra bits that summing the N positive products doesn't overflow, namely ceil(log2(N)).  One extra 128-bit register should be lots.

If I'm concerned about accurate results, one of the things I'd do is use quad or at least extended precision instead of just double.  An EDP rounded to double would usually be more accurate than a non-exact double precision DP, but it would probably have fewer bits of accuracy than an extended precision DP, and only about half as many bits of accuracy as a non-exact quad precision DP (53 bits, not 112 less any intermediate rounding losses).  With typical data it takes a lot to lose more than 59 bits.
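A small illustration of the gap (hypothetical data; math.fsum plays the role of an exactly rounded sum here, which is legitimate because these particular products are exact):

```python
import math

# Naive double-precision dot product vs. an exactly rounded one.
# The products a*b below are exact (one factor is 1.0), so math.fsum
# of them behaves like an EDP rounded once to double at the end.
x = [1e100, 1.0, -1e100]
y = [1.0, 1.0, 1.0]

naive = sum(a * b for a, b in zip(x, y))           # the 1.0 is absorbed: 0.0
edp_like = math.fsum(a * b for a, b in zip(x, y))  # exactly rounded: 1.0
print(naive, edp_like)
```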

I like the elegance of CA, I just don't think it should be required.

- Ian McIntosh          IBM Canada Lab         Compiler Back End Support and Development

----- Forwarded by Ian McIntosh/Toronto/IBM on 08/26/2013 01:52 PM -----


From: Ian McIntosh/Toronto/IBM
To: stds-1788@xxxxxxxxxxxxxxxxx
Date: 08/26/2013 12:19 PM
Subject: Re: Please listen to Ulrich here...




UK>  When we implemented these ideas for IBM on /370 mainframes in the 1980s we only used about 1200 bits for the long accumulator and I am not aware of any problem where this was not enough.

I understand, but System/370 had a 7-bit exponent for all precisions.  IEEE 754 binary128 (implemented on z/Architecture and in software on others) has a 15-bit exponent, so it would need 32,768 bits = 4096 bytes, plus some to guard against intermediate overflow.  Some systems have vector registers, typically 128 bits, and this would fill 256 of them.

Even if none of the problems you saw needed more than 1200 bits, a correct implementation of CA would need >32768 bits, wouldn't it?

- Ian McIntosh          IBM Canada Lab         Compiler Back End Support and Development




From: Ulrich Kulisch <ulrich.kulisch@xxxxxxx>
To: Ian McIntosh/Toronto/IBM@IBMCA
Cc: "stds-1788@xxxxxxxxxxxxxxxxx" <stds-1788@xxxxxxxxxxxxxxxxx>
Date: 08/23/2013 04:43 PM
Subject: Re: Please listen to Ulrich here...




Dear Ian,

thank you for writing.  I think all the problems you see are solved.  I assumed that the reader of my mails is more or less familiar with the contents of my book: Computer Arithmetic and Validity - Theory, Implementation, and Applications, De Gruyter, 2008; second edition 2013.  You can probably find it in your library.

I attach copies of a few pages.  You should not be shocked by the large exponent ranges of the IEEE data formats.  See the section "Hardware Accumulation Window" in my book.  When we implemented these ideas for IBM on /370 mainframes in the 1980s we only used about 1200 bits for the long accumulator, and I am not aware of any problem where this was not enough.

With best regards
Ulrich



Am 23.08.2013 17:19, schrieb Ian McIntosh:

    I'd consider voting yes to requiring EDP but no to requiring (instead of recommending) CA.  There are other ways to calculate an EDP, so why require CA?

    If I want high precision calculations, the first thing I'd do is use quad not double precision, and quad precision makes CA less practical, needing about 16 times the long register size due to 4 more exponent bits.


    CA's long register has to be zeroed at the start and converted to a floating point value at the end, and that takes time.  That wouldn't matter with long vectors but could dominate the time with short ones.


    CA might be the fastest way to compute an EDP, but it isn't necessarily the fastest way to compute a DP.  Making CA handle one add to the long register per cycle may be challenging.  If the long register is a single register it will take time to shift the multiply result to the right position then time for carries to propagate, potentially a long way.  If the long register is implemented as separate registers, one for each group of let's say 64 exponents, then first the right two (or occasionally one) registers must be read, added to and rewritten, with the possibility of a carry to the left potentially through many registers.  If the next value has a similar exponent then it will need the same registers, either by rereading them or via a bypass.  Doing it in one cycle at today's clock rates will get complicated.
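    As a sketch of those mechanics (a toy software model, not a hardware description; positive addends only, and the word and register counts are illustrative assumptions), here is a long accumulator split into 64-bit words, where each addition touches the words a product lands in and a carry may ripple left:

```python
WORD = 64
NWORDS = 70  # illustrative: enough words for a binary64 EDP accumulator

class SegmentedAccumulator:
    """Toy long accumulator split into 64-bit words (positive values only)."""

    def __init__(self):
        self.words = [0] * NWORDS  # words[0] is the least significant

    def add(self, value, shift):
        """Add the non-negative integer `value` at bit offset `shift`."""
        idx, off = divmod(shift, WORD)
        v = value << off
        mask = (1 << WORD) - 1
        while v:
            total = self.words[idx] + (v & mask)
            self.words[idx] = total & mask
            v = (v >> WORD) + (total >> WORD)  # carry ripples leftward
            idx += 1

    def value(self):
        return sum(w << (i * WORD) for i, w in enumerate(self.words))

acc = SegmentedAccumulator()
acc.add((1 << 64) - 1, 0)  # fill the lowest word with ones
acc.add(1, 0)              # one more: the carry propagates into word 1
```

    Adding 1 to a word that is already all ones shows the ripple: the carry crosses a word boundary, and in the worst case it could propagate through many words, which is exactly the hardware headache described above.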


    For those who want speed, many of today's computers have parallel pipelines and vector registers and can calculate 2, 4, 8, 16 or more partial dot products in parallel, to be summed at the end.  That's at least competitive with CA and can be faster with no special hardware, although it wouldn't be exact.  For double precision it can be made more accurate with only a constant time slowdown by doing the calculations in quad precision.  If the exponent range was small it could often produce the same or very close to the same result as EDP.
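    A minimal sketch of that lane-parallel scheme (the 4-lane split below is an illustrative assumption, mimicking a 4-wide vector unit):

```python
def lane_dot(x, y, lanes=4):
    """Dot product via `lanes` interleaved partial sums, the way a
    vector unit would accumulate them, reduced at the end.
    Fast, but not exactly rounded in general."""
    partial = [0.0] * lanes
    for i, (a, b) in enumerate(zip(x, y)):
        partial[i % lanes] += a * b
    return sum(partial)

print(lane_dot([1.0, 2.0, 3.0, 4.0, 5.0], [1.0] * 5))  # 15.0
```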


    On the other side, unless the dot product really is exact (rounded to floating point) the error is unknown and creating an interval from it is problematic.  EDP can be useful or even essential.  CA can be useful; I'm just not convinced it's essential.


    - Ian McIntosh          IBM Canada Lab         Compiler Back End Support and Development



    From: Ulrich Kulisch <ulrich.kulisch@xxxxxxx>
    To: Ian McIntosh/Toronto/IBM@IBMCA
    Date: 08/23/2013 06:19 AM
    Subject: Re: Please listen to Ulrich here...




    If you design a chip for a particular application where you need a multiplier, the design tool already provides an adder tree for fast multiplication.  It is already predesigned.  I never heard people complain that there are other, simpler ways of implementing a multiplier which need less silicon.  Multiplication is a basic arithmetic operation, and a certain "waste of silicon" is accepted to get it fast.

    Now the dot product is a fundamental operation in the vector and matrix spaces, and it appears again and again in numerical computations and in mathematics.  The assertion is: the simplest and fastest way of computing a dot product (of floating-point numbers) is to compute it exactly.  It just needs multiplication, shift, and add, and a small amount of local memory on the arithmetic unit (which needs less silicon than an adder tree for fast multiplication).
    (No normalizations and roundings after multiplications and additions, no storing and reading of intermediate results.  By pipelining it can be done in the time the processor needs to read the data, i.e., it comes with utmost speed.  A possible carry can be absorbed right away; it does not need additional computing time.)  With the exact dot product you have a correctly or otherwise rounded dot product, of course.
    In contrast to multiplication, complaints are coming here: this is a waste of silicon, and there may be other, perhaps simpler ways of computing a dot product; or why do we need an exact dot product for data which frequently are measured and not exact?
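    The multiply / shift / add scheme can be sketched in software (a Python model of the idea, not a hardware description; the fixed-point scale BIAS is an assumption, sized so every shift is non-negative for binary64 inputs):

```python
import math

BIAS = 2300  # fixed-point scale: covers any product of two finite doubles

def split(x):
    """Return (m, e) with x == m * 2**e exactly and m an integer."""
    m, e = math.frexp(x)
    return int(m * 2**53), e - 53

def exact_dot(xs, ys):
    """Exact dot product of doubles: every product is an integer times a
    power of two, so it can be shifted into place and added into one big
    fixed-point accumulator with no intermediate rounding."""
    acc = 0
    for a, b in zip(xs, ys):
        (ma, ea), (mb, eb) = split(a), split(b)
        acc += (ma * mb) << (ea + eb + BIAS)  # shift product into position
    return acc / 2**BIAS  # a single rounding at the very end

print(exact_dot([1e100, 1.0, -1e100], [1.0, 1.0, 1.0]))  # 1.0
```

    The huge cancelling terms drop out exactly, and the final division performs the one correctly rounded conversion back to floating point.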

    At the SCAN2000 conference at Karlsruhe Bill Walster gave an invited talk and I cite a nice paragraph from his lecture concerning interval arithmetic:

    Because intervals introduce a "new order of things", we must
    ".... bear in mind that there is nothing more difficult to execute, nor more dubious for success, nor more dangerous to administer than to introduce a new order of things; for he who introduces it has all those who profit from the old order as his enemies, and he has only lukewarm allies in all those who might profit from the new."

    The Prince, by Niccolò Machiavelli.  First published in 1532, several years after his death.

    If, in the first line of this citation, you replace the word "intervals" by "exact dot product", you have a nice description of the ongoing discussion on the subject.

    By not requiring an EDP we give up a unique chance of improving our computing tool. So please vote YES on motion 47.

    With best regards
    Ulrich Kulisch


    --
    Karlsruher Institut für Technologie (KIT)
    Institut für Angewandte und Numerische Mathematik
    D-76128 Karlsruhe, Germany
    Prof. Ulrich Kulisch

    Telefon: +49 721 608-42680
    Fax: +49 721 608-46679
    E-Mail:
    ulrich.kulisch@xxxxxxx
    www.kit.edu
    www.math.kit.edu/ianm2/~kulisch/

    KIT - Universität des Landes Baden-Württemberg
    und nationales Großforschungszentrum in der
    Helmholtz-Gesellschaft



[attachment "Auszug8.tex.pdf" deleted by Ian McIntosh/Toronto/IBM]


