Regarding the memory cost of using a long accumulator,
there is a simple extension of previous communication lower
bounds for linear algebra, and other algorithms that
"resemble 3-nested loops" (in a mathematically well-defined way),
with which one can prove the following:
Recall that in the conventional case, where all matrix entries
and "temporary accumulators" occupy 1 word of memory, a lower
bound on the amount of memory traffic between "fast memory"
(say cache) and "slow memory" (say DRAM) is
Omega(#flops / sqrt(M)) words,
where M is the fast memory size in words. And this lower bound can
be attained for many algorithms by well-known tiling or
divide-and-conquer techniques (for a survey see our
2014 Acta Numerica paper by Ballard/Carson/Demmel/Hoemmen/
Knight/Schwartz).
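As a concrete point of reference (my own minimal sketch, not taken
from the paper): a standard tiled matrix multiply C = A*B that
attains this bound, written in C for square n-by-n row-major
matrices, with a hypothetical tile size b chosen so that roughly
3*b*b words fit in fast memory (b ~ sqrt(M/3)):

    /* Minimal sketch: classic tiled matrix multiply, C = C + A*B.
       Each (ii,jj,kk) tile step reads about 2*b*b words of A and B,
       and there are (n/b)^3 such steps, so the total traffic is
       roughly 2*n^3/b = O(#flops/sqrt(M)) words when b ~ sqrt(M/3). */
    void tiled_matmul(int n, int b, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += b)
          for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
              /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B,
                 accumulating into the (ii,jj) tile of C */
              for (int i = ii; i < ii + b && i < n; i++)
                for (int j = jj; j < jj + b && j < n; j++) {
                  double s = C[i*n + j];          /* 1-word accumulator */
                  for (int k = kk; k < kk + b && k < n; k++)
                    s += A[i*n + k] * B[k*n + j];
                  C[i*n + j] = s;
                }
    }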
Now suppose that we want all the dot products computed
by these algorithms to be accumulated in a long accumulator
of length R words. For simplicity consider the case of matrix
multiply C=A*B: we need a different accumulator for each
entry of a tile of C, in order to reuse the entries of the tiles
of A and B in fast memory. So the lower bound analysis may be
modified to prove that the lower bound increases to
Omega(sqrt(R)*#flops/sqrt(M))
i.e., the memory traffic increases by a factor of sqrt(R). And
this bound may be attained by a different choice of tile sizes.
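To make the tile-size argument concrete, here is a back-of-the-envelope
derivation of the matching upper bound (my own sketch, assuming square
b-by-b tiles and an R-word accumulator for each entry of the C tile;
this is not the lower-bound proof itself):

    R*b^2 (accumulators for the C tile) + 2*b^2 (tiles of A and B) <= M
        =>  b ~ sqrt(M/R)
    traffic ~ 2*n^3 / b ~ 2*n^3 * sqrt(R) / sqrt(M)
            = Theta( sqrt(R) * #flops / sqrt(M) ),

i.e. a factor of sqrt(R) more than the 1-word-accumulator case (the
final write-back of the accumulators, about R*n^2 words, is lower
order for large n).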
So for an Exact Dot Product, where R >= 64, that means
at least a sqrt(R) >= 8 fold increase in memory traffic.
For other "nested loop" computations, there is a similar
analysis, with the memory traffic increasing by R to some
power, where the exponent depends on the what the
nested loops look like. For example, for the direct N-body
problem, the memory traffic would increase by a factor of R,
so even worse than linear algebra.
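Sketching the N-body case the same way (again my own back-of-the-envelope
version, not the formal proof): the blocked direct algorithm keeps the
force accumulators for one block of target particles in fast memory and
streams all n source particles past them, so

    targets per block ~ M/R
    traffic ~ n * ( n / (M/R) ) = R*n^2 / M = Theta( R * #flops / M ) words,

compared with Theta(#flops / M) when each accumulator occupies 1 word,
hence the factor of R.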
I wrote this up informally recently, with a related but different
motivation, and am meeting with some collaborators to go over
the proof tomorrow morning. I had hoped to have this meeting
before sharing with the 1788 committee, but since Baker posed
the question, I thought I'd respond. I will let you know whether
or not the proof survives tomorrow's conversation. :)
Jim
On 1/24/16 1:26 PM, Ralph Baker Kearfott wrote:
HMMM. What about memory access bandwidth and speed, and
its impact on the overall speeds of
single versus double precision?
Baker
On 01/24/2016 12:05 PM, Richard Fateman wrote:
The (re) submission of these letters to the discussion
raised a question in my mind.
I am puzzled by the WG 2.5 vote urging the IEEE 754 committee to
incorporate two facilities.
* [double precision interval arithmetic] *at the speed of* simple
floating-point arithmetic.
* provide *high-speed* arithmetic [for dynamic precision intervals]
What puzzles me is the statement of speed requirements in a standard of
this nature.
Was WG 2.5 thinking that the IEEE committee was derelict in its duties
in allowing some or all of the 754 arithmetic operations to be
provided at low speed?