
Re: Motion 31: V04.2 Revision of proposed Level 1 text



UK> Actually the simplest and fastest way for computing a dot product is to compute it exactly! By pipelining, it can be computed in the time the processor needs to read the data, i.e., it comes with utmost speed.

There are three ways to calculate a dot product: exact, simple, or fast.

On today's hardware, computing a dot product either exactly or simply is generally not the fastest way. There is a faster way on any system that takes more than one cycle to calculate sum + a*b, or that has multiple floating-point pipelines (including vector register operations, multiple CPUs, or multiple cores), provided it can also transfer the data quickly enough.

For a simple example, assume just one pipeline that uses an fma to calculate sum = sum + a*b in one operation but takes 2 cycles to produce the result. You can split the vectors in half, compute the dot product of each half, and add the two results at the end. By interleaving the two halves one cycle apart you almost double the speed.

For another simple example, assume two pipelines, each able to calculate sum = sum + a*b in one fma operation and each taking only 1 cycle to produce the result. Changing the calculation in the same way as above doubles the speed.

If the original program is

sum = 0.;
for (i = 0; i < n; i = i+1)
{
    sum = sum + a[i]*b[i];
}

one way is to change it to

sum1 = 0.; sum2 = 0.; /* done in parallel */
for (i = 0; i+1 < n; i = i+2)
{
    sum1 = sum1 + a[i]*b[i]; sum2 = sum2 + a[i+1]*b[i+1]; /* done in parallel */
}
sum = sum1 + sum2;
if ((n % 2) != 0)
{
    sum = sum + a[n-1]*b[n-1];
}

Extending this, it's common on today's CPUs to compute dot products almost 8, 16, or 32 times as fast as the straightforward way. On highly parallel systems with enough data it can be thousands of times faster. The catch is that, because the roundings happen in a different order, the result may differ. The IBM XL compilers and others perform this transformation automatically when the compiler optimization options allow it.
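
For illustration, a four-way version of the same transformation might look like the sketch below (the unroll factor of 4 and the function name are my assumptions here; a compiler would pick the factor, and any SIMD mapping, from the target machine):

/* Four partial sums instead of two; widen to match the number of
   fma pipelines or SIMD lanes available. */
double dot_unrolled4(const double *a, const double *b, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i]   * b[i];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)              /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}

As before, the final result may differ slightly from the sequential loop because the additions round in a different order.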

One reason Fortran has a DOT_PRODUCT intrinsic is to explicitly allow this kind of performance boost even without compiler optimization, when the user doesn't mind minor differences in the result in exchange for an order-of-magnitude speedup. The fact that in Fortran the fast function is named just "DOT_PRODUCT" is one reason I'd prefer exactdotproduct for the exact version; the other is that I dislike ambiguity.

This doesn't rule out constructing computers with fast parallel dot product hardware, but AFAIK they don't exist yet.

- Ian McIntosh IBM Canada Lab Compiler Back End Support and Development




From: Ulrich Kulisch <Ulrich.Kulisch@xxxxxxxxxxx>
To: Ian McIntosh/Toronto/IBM@IBMCA
Date: 02/07/2012 01:44 PM
Subject: Re: Motion 31: V04.2 Revision of proposed Level 1 text





On 07.02.2012 13:08, John Pryce wrote:
      Ulrich, Vincent

      On 26 Jan 2012, at 13:02, Vincent Lefevre wrote:

          On 2012-01-25 12:00:38 +0000, John Pryce wrote:
              I send herewith a revised draft of the proposed Level 1 text, less
              decorations, that I originally circulated on 5 Dec 2011.

          About the basic arithmetic operations:
          ...
          But why is the dot product included in the basic arithmetic operations?
          I don't see it as important as +, -, *, /. And not more important than
          recip, sqr, case, pown, abs, and perhaps some other operations.

      Ulrich: on reflection I agree with Vincent. However useful the dot product is, it is essentially more complicated than +, -, *, /. I think calling it "basic" will cause more disagreement than agreement, so I have removed it from those lists.

      John P

John,

I totally disagree with this decision. I am very busy with other things right now, but let me simply remind you that the IFIP Working Group on Numerical Software requested the exact dot product in a letter to the IEEE 754R standards committee (dated Sept. 4, 2007) and in another letter to IEEE P1788 (dated Sept. 9, 2009). I attach copies of these letters to this mail.

I also disagree with the statement that the exact dot product is essentially more complicated than +, -, *, /.

Actually the simplest and fastest way for computing a dot product is to compute it exactly! By pipelining, it can be computed in the time the processor needs to read the data, i.e., it comes with utmost speed. No software simulation can compete with a simple and direct hardware solution. I attach a copy of the poster that I prepared for the SCAN-Meeting at Lyon in 2010.
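
For comparison, a software simulation typically builds on error-free transformations. A minimal sketch in C, in the style of the Ogita-Rump-Oishi compensated dot product Dot2 (the function names are illustrative, and it delivers roughly twice the working precision rather than the exact result):

#include <math.h>

/* Error-free sum: s + e equals x + y exactly (Knuth's TwoSum),
   assuming IEEE round-to-nearest. */
static void two_sum(double x, double y, double *s, double *e) {
    *s = x + y;
    double z = *s - x;
    *e = (x - (*s - z)) + (y - z);
}

/* Error-free product: p + e equals x*y exactly, using a correctly
   rounded fma from C99 <math.h>. */
static void two_prod(double x, double y, double *p, double *e) {
    *p = x * y;
    *e = fma(x, y, -*p);
}

/* Compensated dot product: about as accurate as if the whole sum
   were computed in doubled precision and rounded once at the end. */
double dot2(const double *a, const double *b, int n) {
    double s = 0.0, c = 0.0;
    for (int i = 0; i < n; i++) {
        double p, ep, t, es;
        two_prod(a[i], b[i], &p, &ep);  /* a[i]*b[i] == p + ep */
        two_sum(s, p, &t, &es);         /* s + p == t + es     */
        s = t;
        c += ep + es;                   /* collect the rounding errors */
    }
    return s + c;
}

Such a routine costs several floating-point operations per element, which is exactly the overhead a pipelined hardware accumulator avoids.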

Let me finally mention that the following is shown in my book 'Computer Arithmetic and Validity' (de Gruyter, 2008). With the two requirements of the IFIP Working Group letter to IEEE 754R, fast hardware support for interval arithmetic and the exact dot product, all operations in the usual product spaces of computation (the complex numbers, the real and complex intervals, the real and complex vectors and matrices, and the real and complex interval vectors and interval matrices) can be computed with least-bit accuracy and at very high speed. These operations are distinctly different from those traditionally available on computers. They would boost both the speed of a computation and the accuracy of its results.
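
As a small concrete instance of such a product-space operation: the real and imaginary parts of a complex product are 2-term dot products. The sketch below uses Kahan's fma-based method for a*d - b*c (my choice for illustration, not the long-accumulator approach of the book; it reaches a couple of ulps rather than least-bit accuracy):

#include <math.h>

/* Kahan's method for the 2-term dot product a*d - b*c. */
double kahan_det(double a, double b, double c, double d) {
    double w = b * c;
    double e = fma(b, c, -w);   /* e == b*c - w exactly   */
    double f = fma(a, d, -w);   /* f == round(a*d - w)    */
    return f - e;               /* (a*d - w) - (b*c - w)  */
}

/* Complex product (x + yi)(u + vi): real part x*u - y*v,
   imaginary part x*v + y*u, each a 2-term dot product. */
void cmul(double x, double y, double u, double v,
          double *re, double *im) {
    *re = kahan_det(x, y, v, u);    /* x*u - y*v */
    *im = kahan_det(x, -y, u, v);   /* x*v + y*u */
}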

Best regards
Ulrich

--
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik (IANM2)
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch

Telefon: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/

KIT - Universität des Landes Baden-Württemberg und nationales Großforschungszentrum in der Helmholtz-Gemeinschaft

[attachment "IFIPWG-IEEE754R.pdf" deleted by Ian McIntosh/Toronto/IBM] [attachment "IFIPWG-IEEE-P1788.pdf" deleted by Ian McIntosh/Toronto/IBM] [attachment "Poster22.pdf" deleted by Ian McIntosh/Toronto/IBM]