Re: re motion 50 (Just to you...)
Richard, Ulrich,
Here's a sketch of an EDP algorithm for a modern-style supercomputer chip.
Assumptions:
(*) EDP is the main thing you are interested in.
(*) we have 1024 cpus.
Then the algorithm is:
stage 1: read A[i] and B[i] into core[i]. If A and B have more than 1024 elements, then
read A[i] and B[i] into core[i % 1024].
stage 2: perform multiplies.
We'll assume that Ulrich's "long accumulator" is smeared over the 1024 cores. If my guess is right,
each core gets _two_ exponent values (the IEEE double representation has 11 bits of exponent).
Let's give core[0] exponents +2046, 2047, through to core[1023] with -2047, -2048.
stage 3: using the computed exponent, send each result to the correct core(s) over the NoC
[it may be that the mantissa is split across two cores].
stage 4: add all the fragments sent to a core.
stage 5: use divide & conquer to reduce to a single number.
As you'll appreciate, after stage 2 we can work in 64-bit integer arithmetic, which will be
more energy efficient. Apart from the collection process in stage 5 -- which has ten
[ = log2(1024) ] splits -- everything is linear-time, and hence pipeline-able.
Depending on the width of the data-access paths we will be able to do stage 1 in parallel, up to
the limit imposed by the path throughput.
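
To make the five stages concrete, here is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not a chip design: the names exact_dot and decompose are hypothetical, the NoC is modelled as plain list indexing, Python's unbounded integers stand in for the per-core 64-bit words, and inputs are assumed finite. The number of exponents per core is derived here from the range of product exponents in this particular encoding (it comes out as 5), which need not match the guess of two per core above.

```python
import math
from fractions import Fraction

NUM_CORES = 1024                      # assumed core count (a power of two)
EXP_MIN, EXP_MAX = -2252, 1942        # product exponents in the encoding below
EXPS_PER_CORE = -(-(EXP_MAX - EXP_MIN + 1) // NUM_CORES)   # ceiling division


def decompose(x):
    """Return integers (m, e) with x == m * 2**e exactly, for finite x."""
    frac, exp = math.frexp(x)         # x == frac * 2**exp, 0.5 <= |frac| < 1
    return int(frac * 2**53), exp - 53


def exact_dot(a, b):
    # Stage 1: element i is read into core i % NUM_CORES.
    inbox = [[] for _ in range(NUM_CORES)]
    for i, (x, y) in enumerate(zip(a, b)):
        inbox[i % NUM_CORES].append((x, y))

    partial = [0] * NUM_CORES         # one slice of the long accumulator per core
    for pairs in inbox:
        for x, y in pairs:
            # Stage 2: exact multiply in integer arithmetic.
            mx, ex = decompose(x)
            my, ey = decompose(y)
            m, e = mx * my, ex + ey
            if m == 0:
                continue
            # Stage 3: route the product to the core that owns exponent e.
            owner = (e - EXP_MIN) // EXPS_PER_CORE
            base = EXP_MIN + owner * EXPS_PER_CORE
            # Stage 4: the owning core adds the fragment (integer add only).
            partial[owner] += m << (e - base)

    # Stage 5: pairwise (divide and conquer) reduction, log2(NUM_CORES) rounds.
    terms = [Fraction(p) * Fraction(2) ** (EXP_MIN + k * EXPS_PER_CORE)
             for k, p in enumerate(partial)]
    while len(terms) > 1:
        terms = [terms[j] + terms[j + 1] for j in range(0, len(terms), 2)]
    return float(terms[0])            # the only rounding in the whole computation
```

For instance, exact_dot([1e16, 1.0, -1e16], [1.0, 1.0, 1.0]) returns 1.0, whereas a left-to-right floating-point loop over the same data returns 0.0. A real implementation would have to carry between fixed-width words instead of letting each core's integer grow without bound.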
Regards,
Dave
On 17 Sep 2013, at 22:09, Richard Fateman wrote:
> On 9/17/2013 6:24 AM, Ulrich Kulisch wrote:
>
>>
>> Let me just discuss an explicit example more closely: computing the dot product of two vectors with interval components. What you would like to have is the least enclosure of the set of all dot products of real vectors out of the two interval vectors. Computing the interval dot product in conventional interval arithmetic (what we are going to standardize in P1788), for each interval product of two vector components you round the minimum of all products of the interval bounds downwards and the maximum upwards.
> There is no requirement that an interval dot product be computed by a simple loop
> for i:=1 to n sum a[i]*b[i] {where sum and * are interval operations}
> just as there is no requirement that a dot product of floats be computed by that same loop.
>
> If I were computing a dot product of vectors of ordinary floats I might consider
> extra-precise multiplication (via Split/TwoSum/TwoProd etc.)
> and compensated summation.
>
> For the analogous interval operation, perhaps the convenient operations I would need
> are already implicit in the standard, which permits multiple precision. For example,
> multiplication of 2 double-float intervals [a1,a2] * [b1,b2] to produce [C,D], where C and D are
> quad-float numbers, e.g. C = <e,f>, where e and f are each a double-float and e + f represents the product exactly.
>
> This would be available as an appropriately overloaded interval mul(), with a quad target precision, e.g.
> quad_mul(a,b).
>
> I think that quad_add() would be effective in adding the minima and the maxima, vastly decreasing the
> possibility of a significant rounding error affecting the final outcome.
> Or perhaps a compensated summation of the collection of (scalar) values separately.
>
> While it is possible to add 3 numbers a,b,c via EDP(<a,b,c>, <1,1,1>)
> and multiply two numbers by EDP(<a>,<b>), it does not seem economical.
>
> RJF
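
For reference, the Split/TwoSum/TwoProd route with compensated summation that Richard mentions for ordinary floats can be sketched as follows. This is the Dot2 scheme of Ogita, Rump and Oishi built on Dekker's product; the function names are illustrative, and the sketch assumes round-to-nearest double arithmetic with no overflow or underflow in the intermediate products. Unlike an exact dot product, the result is only as accurate as if it had been computed in roughly twice the working precision.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and s + err == a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def split(a):
    """Veltkamp split: a == hi + lo, each half fitting in 26 bits."""
    c = 134217729.0 * a               # 2**27 + 1
    hi = c - (c - a)
    lo = a - hi
    return hi, lo


def two_prod(a, b):
    """Return (p, err) with p = fl(a * b) and p + err == a * b exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = al * bl - (((p - ah * bh) - al * bh) - ah * bl)
    return p, err


def dot2(x, y):
    """Compensated dot product: running sum p plus accumulated error term s."""
    p, s = 0.0, 0.0
    for xi, yi in zip(x, y):
        h, r = two_prod(xi, yi)       # exact product as a double-double
        p, q = two_sum(p, h)          # exact sum as a double-double
        s += q + r                    # collect both rounding errors
    return p + s
```

On the cancellation example dot2([1e16, 1.0, -1e16], [1.0, 1.0, 1.0]) this returns 1.0, where the plain loop returns 0.0.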