Thread Links			Date Links
Thread Prev	Thread Next	Thread Index	Date Prev	Date Next	Date Index

Re: Motion 31: V04.2 Revision of proposed Level 1 text

To: Ian McIntosh <ianm@xxxxxxxxxx>
Subject: Re: Motion 31: V04.2 Revision of proposed Level 1 text
From: Ulrich Kulisch <Ulrich.Kulisch@xxxxxxxxxxx>
Date: Fri, 10 Feb 2012 10:49:35 +0100
Cc: stds-1788@xxxxxxxxxxxxxxxxx
Delivered-to: mhonarc@xxxxxxxxxxxxxxxx
In-reply-to: <OF480DDAA7.A0A34771-ON8525799E.0075EA43-8525799F.0012B0F8@xxxxxxxxxx>
List-help: <https://listserv.ieee.org/cgi-bin/wa?LIST=STDS-1788>, <mailto:LISTSERV@LISTSERV.IEEE.ORG?body=INFO%20STDS-1788>
List-owner: <mailto:STDS-1788-request@LISTSERV.IEEE.ORG>
List-subscribe: <mailto:STDS-1788-subscribe-request@LISTSERV.IEEE.ORG>
List-unsubscribe: <mailto:STDS-1788-unsubscribe-request@LISTSERV.IEEE.ORG>
References: <B94F5482-8E86-4DA6-8E5E-10B8544673F1@xxxxxxxxxxxx> <20120126130210.GB21977@xxxxxxxxxxxxxxx> <DE3133C0-448C-4685-8C93-FD6EB0845466@xxxxxxxxxxxx> <4F316F88.4070503@xxxxxxxxxxx> <OF480DDAA7.A0A34771-ON8525799E.0075EA43-8525799F.0012B0F8@xxxxxxxxxx>
Sender: stds-1788@xxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:9.0) Gecko/20111222 Thunderbird/9.0.1

Dear Ian,

I thank you very much for your mail. It is very intersting and I am happy to have it.

But I feel that we are talking about two different things. You are talking about computing the dot product (exactly, simply, or fast) on today's (existing) hardware, while I am talking about computing an exact dot product on a single (perhaps not yet existing) processor. I am sure that my method also could be used on parallel machines with a speed up that perhaps is a little lower than yours.

Let me comment a little on the two methods. The fast method you describe, if I understand it correctly, alters the sequence of the summands. This was also done by the old vector units (of IBM, CRAY, Fujitsu, Hitachi, or NEC). In an average application this may lead to a reasonable result. But I have a large collection of examples for all these vector units where the computed result does not contain a single correct digit and if you know how the summands are exchanged it is easy to find such examples.

I need the dot product for implementing interval arithmetic of dynamic precision, of varible word lenghts. In interval arithmetic I have to deliver a guaranteed answer. So I can not use an approximate dot product, I need the dot product exactly. We have implemented the exact dot product as a coprocessor for the PC in the 1990s, and we have done it with IBM for their /370 architecture in the 1980s (see reference [13] in motion 9). It was later copied by Siemens and Hitachi also for their /370 architecture. By pipelining you can get it with a similar speed up than the approximate dot products on the old vector units.

For interval evaluation of an algorithm (a sequence of arithmetic operations) in the real number field an old result of interval arithmetic says roughly that increasing the precision by k digits reduces the error bounds by b^-k, i.e., results can always be guaranteed to a number of correct digits by using variable precision interval arithmetic. All intervallers dream of this kind of applications for half a century. With today's technology they are in immediate reach. The exact dot product is the central building block for it and I think it is a must that IEEE P1788 requests it.

Best regards
Ulrich

Am 09.02.2012 04:24, schrieb Ian McIntosh:

UK> Actually the simplest and fastest way for computing a dot product is to compute it exactly! By pipelining, it can be computed in the time the processor needs to read the data, i.e., it comes with utmost speed.

There are three ways to calculate dot product: exact, simple, or fast.

Computing a dot product either exactly or simply is generally not the fastest way on today's hardware. There is a faster way on any system that takes more than one cycle to calculate sum+a*b, or that has multiple floating point pipelines (including vector register operations, multiple CPUs or multiple cores), and that can also transfer data quickly enough.

For a simple example, assume just one pipeline and using an fma to calculate sum=sum+a*b in one operation, but taking 2 cycles to produce the result. You can split the vectors in half, compute the dot products of each half, and add them together at the end. By interleaving those one cycle apart you almost double the speed.

For another simple example, assume two pipelines each able to calculate sum=sum+a*b in one fma operation, each taking only 1 cycle to produce the result. Change the calculation the same way as above and you will double the speed.

If the original program is

sum = 0.;
for (i = 0; i < n; i = i+1)
{
sum = sum + a[i]*b[i];
}

one way is to change it to

sum1 = 0.; sum2 = 0.; /* done in parallel */
for (i = 0; i+1 < n; i = i+2)
{
sum1 = sum1 + a[i]*b[i]; sum2 = sum2 + a[i+1]*b[i+1]; /* done in parallel */
}
sum = sum1 + sum2;
if ((n % 2) != 0)
{
sum = sum + a[n-1]*b[n-1];
}

Extending this, it's common on today's CPUs to compute dot products almost 8 or 16 or 32 times as fast as the straightforward way. On highly parallel systems with enough data it can be thousands of times faster. The catch is that due to a different order of rounding, the result may differ. The IBM XL compilers and others do this transformation automatically when the compiler optimization options allow it.

One reason Fortran has a dotproduct function is to explicitly allow this kind of performance boost even without compiler optimization when the user doesn't mind minor differences in the result in exchange for an order of magnitude speedup. The fact that in Fortran the fast function is named just "dotproduct" is one reason I'd prefer using exactdotproduct for the exact version; the other is that I dislike ambiguity.

This doesn't rule out constructing computers with fast parallel dot product hardware, but AFAIK they don't exist yet.

- Ian McIntosh IBM Canada Lab Compiler Back End Support and Development

Ulrich Kulisch ---02/07/2012 01:44:05 PM---Am 07.02.2012 13:08, schrieb John Pryce: > Ulrich, Vincent

From:
Ulrich Kulisch <Ulrich.Kulisch@xxxxxxxxxxx>

To:
Ian McIntosh/Toronto/IBM@IBMCA

Date:
02/07/2012 01:44 PM

Subject:
Re: Motion 31: V04.2 Revision of proposed Level 1 text

Am 07.02.2012 13:08, schrieb John Pryce:

Ulrich, Vincent On 26 Jan 2012, at 13:02, Vincent Lefevre wrote:

On 2012-01-25 12:00:38 +0000, John Pryce wrote:

I send herewith a revised draft of the proposed Level 1 text, less decorations, that I originally circulated on 5 Dec 2011.

About the basic arithmetic operations: ... But why is the dot product included in the basic arithmetic operations? I don't see it as important as +, -, *, /. And not more important than recip, sqr, case, pown, abs, and perhaps some other operations.

Ulrich: on reflection I agree with Vincent. However useful the dot product is, it is essentially more complicated than +, -, *, /. I think calling it "basic" will cause more disagreement than agreement, so I have removed it from those lists. John P

John,

I totally disagree with this decision. I am very busy with other things right now. But let me simply remind you that the IFIP Working Group on Numerical Software required the exact dot product by a letter to the standards committee IEEE 754R (dated Sept. 4, 2007) and by another letter to IEEE P1788 (dated Sept. 9, 2009). I attach copies of these letters to this mail.

I also disagree with the statement that the exact dot product is essentially more complicated than +, -, *, /.

Actually the simplest and fastest way for computing a dot product is to compute it exactly! By pipelining, it can be computed in the time the processor needs to read the data, i.e., it comes with utmost speed. No software simulation can compete with a simple and direct hardware solution. I attach a copy of the poster that I prepared for the SCAN-Meeting at Lyon in 2010.

let me finally mention that the following is shown in my book 'Computer Arithmetic and Validity', de Gruyter 2008. With the two requirements of the IFIP Working Group letter to IEEE 754R: Fast hardware support for interval arithmetic and the exact dot product, all operations in the usual product spaces of computation, the complex numbers, the real and complex intervals, the real and complex vectors and matrices, and the real and complex interval vectors and interval matrices can be computed with least bit accuracy and at very high speed. These operations are distinctly different from those traditionally available on computers. This would boost both the speed of a computation and the accuracy of its result.

Best regards
Ulrich
-- Karlsruher Institut für Technologie (KIT) Institut für Angewandte und Numerische Mathematik (IANM2) D-76128 Karlsruhe, Germany Prof. Ulrich Kulisch Telefon: +49 721 608-42680 Fax: +49 721 608-46679 E-Mail:ulrich.kulisch@xxxxxxxwww.kit.eduwww.math.kit.edu/ianm2/~kulisch/KIT - Universität des Landes Baden-Württemberg und nationales Großforschungszentrum in der Helmholtz-Gemeinschaft [attachment "IFIPWG-IEEE754R.pdf" deleted by Ian McIntosh/Toronto/IBM] [attachment "IFIPWG-IEEE-P1788.pdf" deleted by Ian McIntosh/Toronto/IBM] [attachment "Poster22.pdf" deleted by Ian McIntosh/Toronto/IBM]

-- 
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik (IANM2)
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch

Telefon: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/

KIT - Universität des Landes Baden-Württemberg und nationales Großforschungszentrum in der Helmholtz-Gemeinschaft

Follow-Ups:
- Re: Motion 31: V04.2 Revision of proposed Level 1 text
  - From: Ian McIntosh

References:
- Motion 31: V04.2 Revision of proposed Level 1 text
  - From: John Pryce
- Re: Motion 31: V04.2 Revision of proposed Level 1 text
  - From: Vincent Lefevre
- Re: Motion 31: V04.2 Revision of proposed Level 1 text
  - From: John Pryce
- Re: Motion 31: V04.2 Revision of proposed Level 1 text
  - From: Ulrich Kulisch
- Re: Motion 31: V04.2 Revision of proposed Level 1 text
  - From: Ian McIntosh

Prev by Date: Re: Midpoint paper (2012-02-08 version)
Next by Date: Re: Motion 31: V04.2 Revision of proposed Level 1 text
Previous by thread: Re: Motion 31: V04.2 Revision of proposed Level 1 text
Next by thread: Re: Motion 31: V04.2 Revision of proposed Level 1 text
Index(es):
- Date
- Thread