Thread Links Date Links
Thread Prev Thread Next Thread Index Date Prev Date Next Date Index

Re: On behalf of Jim Demmel -- Re: back to the roots



This discussion about Exact Dot Product raises other questions: how to compute the EDP in
parallel? Or more generally, what is the complete set of operations one may perform on the
long accumulator that is (at least implicitly) used to compute the EDP? And how expensive
must it be?

These questions arise because to implement correct rounding of the EDP on a parallel machine
in a reasonably efficient way seems to require more underlying operations than
just "correctly round a sum of floats (or products of floats)". Suppose the summands are 
spread across multiple processors, and we want to compute the EDP in parallel. In other words,
we don't want to send all the summands to one processor and compute the EDP there.

There are two natural parallel algorithms:
(1) Each processor computes the exact dot product of its local data. 
Then each long accumulator is passed up a reduction tree (of whatever
shape the parallel machine supports), where at each node of the
tree the exact sum of two (or more) long accumulators is computed,
and passed further up the tree.
(2) Each processor again computes the exact dot product of its local
data. Then the sum is converted into an array of floats with the same
exact sum. This array is again passed up a reduction tree, where at
each tree node all the incoming floats are summed using the EDP, converted
to another array of floats and passed further up the tree.

Scheme (1) assumes there is an operation that adds/subtracts long accumulators.
Scheme (2) assumes there is an efficient way to exactly write a long accumulator as
a sum of floats. This can be done (very inefficiently!) by rounding the EDP to
get the leading bits (call the value x), computing a second dot product
with the same summands as before in addition to -1*x to cancel the
leading bits, rounding to get the next set of leading bits, and so on.

In either scenario, in IEEE double precision, you would need to send up to 
about 4K bits along each edge of the reduction tree, a lot of communication.
This is a significant performance disadvantage of the EDP, versus
the usual multiple precision, or our (pre-rounding) reproducible summation technique.

Best regards,
    Diep

On Jul 2, 2013, at 11:46 AM, Vincent Lefevre <vincent@xxxxxxxxxx> wrote:

> On 2013-07-02 16:08:01 +0200, Ulrich Kulisch wrote:
>> Am 01.07.2013 21:10, schrieb Vincent Lefevre:
>>> On 2013-07-01 13:39:02 -0500, Ralph Baker Kearfott wrote:
>>>> Please see the following communication from Jim.
>>> My point is that the precision needs to be controlled by the user.
>>> Sometimes small precision datums are sufficient, sometimes one needs
>>> more accuracy. P1788 doesn't specify how precision is controlled
>>> (and I think it shouldn't, because there are many possibilities each
>>> one with its own advantages), but it doesn't forbid that. Ensuring
>>> exactness is in general not possible (even if you have EDP with
>>> fixed-precision inputs, as the result can't be reused as an input).
>>> 
>> But the rouned result of the EDP is of high accuaracy. It does not suffer
>> from cancellation.
> 
> So, what you need is not exact dot product, but accurate dot product
> (possibly with a result in a type with a larger precision than the
> inputs).
> 
> -- 
> Vincent Lefèvre <vincent@xxxxxxxxxx> - Web: <http://www.vinc17.net/>
> 100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)