Re: re motion 50 (Just to you...)

To: <fateman@xxxxxxxxxxxx>
Subject: Re: re motion 50 (Just to you...)
From: David Lester <dlester@xxxxxxxxxxxx>
Date: Wed, 18 Sep 2013 11:08:11 +0100
Cc: David Lester <dlester@xxxxxxxxxxxx>, Ulrich Kulisch <ulrich.kulisch@xxxxxxx>, <fateman@xxxxxxxxxxxxxxx>, IEEEP1788a <stds-1788@xxxxxxxx>
In-reply-to: <5238C503.6080108@berkeley.edu>
References: <52284CB5.6070703@kit.edu> <52344837.3070409@louisiana.edu> <5235DF97.1060902@berkeley.edu> <523857F0.3030305@kit.edu> <5238C503.6080108@berkeley.edu>
Sender: stds-1788@xxxxxxxx

Richard, Ulrich,

Here's a sketch of an EDP algorithm for a modern-style supercomputer chip.

Assumptions:

(*) EDP is the main thing you are interested in.

(*) we have 1024 cpus.

Then the algorithm is:

stage 1: read A[i] and B[i] into core[i]. If there are more than 1024 elements in A/B, then
        read A[i] and B[i] into core[i % 1024].

stage 2: perform multiplies.

We'll assume that Ulrich's "long accumulator" is smeared over the 1024 cores. If my guess is right,
each core gets _two_ exponent values (there are 11 bits of exponent in the IEEE double representation).

Let's give core[0] exponents +2046 and +2047, through to core[1023] with -2047 and -2048.

stage 3: Using the computed exponent, send each result to the correct core(s) over the NoC
        (network-on-chip); the mantissa may be split across two cores.

stage 4: add all the fragments sent to a core. 

stage 5: use divide & conquer to reduce to a single number.

As you'll appreciate, after stage 2 we can work in 64-bit integer arithmetic, which will be
more energy efficient. Apart from the collection process in stage 5, which has ten
[ = log2(1024) ] levels, everything is linear-time, and hence pipeline-able.

Depending on the width of the data-access paths we will be able to do stage 1 in parallel, up to
the limit imposed by the path throughput.
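
The five stages above can be simulated sequentially as a rough sketch. This is only an
illustration under stated assumptions, not a hardware design: the names decompose, edp_exact,
EXP_MIN and EXPS_PER_CORE are made up for the example; Python's unbounded integers stand in
for the 64-bit fragments, so a product is never split across two cores; and each core owns
_four_ exponent values rather than two, because the lowest bit of an exact double*double
product can sit anywhere from 2^-2148 to 2^1942. Inputs are assumed finite.

```python
import math
from fractions import Fraction

NUM_CORES = 1024
EXP_MIN = -2148       # lowest bit position of any exact double*double product
EXPS_PER_CORE = 4     # assumption of this sketch: 4096 exponent slots / 1024 cores


def decompose(x):
    """Return integers (m, e) with x == m * 2**e exactly and e >= -1074."""
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    m, e = int(m * 2**53), e - 53     # scale the mantissa to an integer
    if e < -1074:                     # subnormal input: pin the exponent
        m >>= (-1074 - e)             # exact, the dropped bits are all zero
        e = -1074
    return m, e


def edp_exact(a, b):
    """Return (exact dot product as a Fraction, number of reduction rounds)."""
    cores = [0] * NUM_CORES           # one accumulator fragment per core
    # Stages 1-2: element i would be handled by core i % NUM_CORES; the
    # multiply is done exactly on the integer mantissas.
    for x, y in zip(a, b):
        mx, ex = decompose(x)
        my, ey = decompose(y)
        m, e = mx * my, ex + ey
        # Stage 3: route the product to the core that owns its exponent.
        owner = (e - EXP_MIN) // EXPS_PER_CORE
        # Stage 4: the owner adds the fragment in integer arithmetic.
        cores[owner] += m << ((e - EXP_MIN) % EXPS_PER_CORE)
    # Stage 5: pairwise (divide & conquer) reduction to a single number.
    partial = [acc << (EXPS_PER_CORE * i) for i, acc in enumerate(cores)]
    rounds = 0
    while len(partial) > 1:
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
        rounds += 1
    return Fraction(partial[0]) * Fraction(2) ** EXP_MIN, rounds


def edp(a, b):
    """Exact dot product, rounded once to the nearest double at the end."""
    return float(edp_exact(a, b)[0])
```

For A = [1e100, 1.0, -1e100] and B = [1e100, 1.0, 1e100] the ordinary floating-point loop
loses the middle term entirely, whereas the sketch keeps it, since nothing is rounded until
the final conversion in edp().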

Regards,

Dave




On 17 Sep 2013, at 22:09, Richard Fateman wrote:

> On 9/17/2013 6:24 AM, Ulrich Kulisch wrote:
> 
>> 
>> Let me just discuss an explicit example more closely: computing the dot product of two vectors with interval components. What you would like to have is the least enclosure of the set of all dot products of real vectors out of the two interval vectors. Computing the interval dot product in conventional interval arithmetic (what we are going to standardize in P1788), for each interval product of two vector components you round the minimum of all products of the interval bounds downwards and the maximum upwards.
> There is no requirement that an interval dot product be computed by a simple loop
> for i:=1 to n sum a[i]*b[i]      {where sum and * are interval operations}
> just as there is no requirement that a dot product of floats be computed by that same loop.
> 
> If I were computing a dot product of vectors of ordinary floats I might consider
> extra-precise multiplication (via Split/TwoSum/TwoProd   etc.)
> and compensated summation.
> 
> For the analogous interval operation, perhaps the convenient operations I would need
> are already implicit in the standard, which permits multiple-precision.... For example,
> multiplication of 2 double-float intervals [a1,a2] * [b1,b2] to produce [C,D], where C and D are
> quad-float numbers, e.g. C = <e,f>, where e and f are each a double-float and e + f represents the product exactly.
> 
> This would be available as an appropriately overloaded interval mul(), with a quad target precision, e.g.
> quad_mul(a,b).
> 
> I think that quad_add() would be effective in adding the minima and the maxima, vastly decreasing the
> possibility of a significant rounding error affecting the final outcome.
> Or perhaps a compensated summation of the collection of (scalar) values separately.
> 
> While it is possible to add 3 numbers a,b,c  via  EDP(<a,b,c>, <1,1,1>)
> and multiply two numbers by EDP(<a>,<b>), it does not seem economical.
> 
> RJF
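
The Split/TwoSum/TwoProd plus compensated summation route mentioned in the quoted message
can be sketched for plain floats as follows. This is one possible reading, not necessarily
what Richard has in mind: it follows the well-known Dot2 scheme of Ogita, Rump and Oishi,
the function names are chosen for the example, and it assumes no overflow or underflow in
the intermediate products. The result is roughly as accurate as if the sum had been carried
in twice the working precision, which is weaker than an exact dot product.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b == s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def split(a):
    """Split a double into high and low halves of at most 26 bits each."""
    c = 134217729.0 * a               # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi


def two_prod(a, b):
    """Return (p, err) with p = fl(a * b) and a * b == p + err exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = al * bl - (((p - ah * bh) - al * bh) - ah * bl)
    return p, err


def dot2(x, y):
    """Compensated dot product: the rounding errors are summed separately."""
    p = 0.0                           # running sum of the rounded products
    s = 0.0                           # running sum of the error terms
    for xi, yi in zip(x, y):
        h, r = two_prod(xi, yi)       # r is the error of the product
        p, q = two_sum(p, h)          # q is the error of the addition
        s += q + r
    return p + s
```

With x = [1e8, 1.0, -1e8] and y = [1e8, 1.0, 1e8] the plain loop absorbs the 1.0 into 1e16
and then cancels it away, while dot2 carries it in the error term.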
