
Re: discussion period begins, until Jan. 26: "natural interval extension": friendly amendment to M001.02



The memory model here (which is in the process of being written up
in more detail, and is also described in the references below) deals
with general memory hierarchies as follows: Suppose there
are multiple layers, such as L1 cache, L2, L3, DRAM and disk.
Pick any number of top layers, say L1+L2+L3, and call that
"fast memory". The remaining layers, DRAM+disk in this
case, is "slow memory". Let M be the total size of the fast memory.
The the communication lower bounds are for the number of
words moved between fast and slow memories, L3 and DRAM
in this case, if data only moves between adjacent levels.
The same theory can describe how much data has to move
from one processor ("fast memory") to another processor
("slow memory") over a network, in a parallel implementation.

Now suppose we are doing matmul: C=A*B, and want to use
a long accumulator (of length R words) to compute each
entry of C to high accuracy. Then each entry of C must be
represented by a long accumulator, until it is completely
computed and rounded back to 1 word. So except for the
final n^2 writes of C back to slow memory (which is a lower
order term compared to the lower bound), each entry of
C is represented by an R word quantity, whether it happens
to live in a long register, L1 cache, L2 cache, etc. None of
the information in these R words can be thrown away
until the final rounding. Thus we are also assuming that
these R words are not somehow losslessly compressed
before they are moved to slow memory.
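
A minimal sketch of the storage cost (illustration only; the value
R = 64 and the 64-bit word size are assumptions, and no particular
accumulator layout is implied):

    #include <stdint.h>
    #include <stdio.h>

    /* Accumulator length in words; R = 64 is only an assumed example
       value (the exact dot product discussed below needs R >= 64). */
    enum { R = 64 };

    /* One long accumulator per entry of C.  None of its R words may be
       discarded (or losslessly compressed) until the entry is completely
       computed and rounded back to a single word. */
    typedef struct { uint64_t w[R]; } long_acc;

    int main(void) {
        printf("one entry of C: %zu bytes as a long accumulator, %zu as a double\n",
               sizeof(long_acc), sizeof(double));
        return 0;
    }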

Think of the ubiquitous "blocking" optimization of matmul:
each matrix is broken into square b-by-b subblocks,
where 3 such blocks fit in fast memory. Once they are
read into fast memory (#words moved = 3*b^2),
the blocks of A and B are multiplied and added to the
block of C, so b^3 multiply-adds can be performed.
If we wanted to compute C to high accuracy,
all b^2 entries of this block of C would have to be
represented by R words. It turns out that to attain the
lower bound below, the blocks can no longer be square
but must be rectangular, with shapes that depend on R. Once
my colleagues and I go over the details tomorrow,
I'll distribute them.
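
For reference, the square-blocked loop described above looks roughly
like this (a sketch of the conventional 1-word-per-entry case only;
the rectangular, R-dependent block shapes are not shown, since those
are the details still to be worked out):

    /* Conventional blocked matmul C += A*B on n-by-n row-major matrices.
       The block size b is chosen so that three b-by-b blocks of 1-word
       entries fit in fast memory (3*b*b <= M).  Each block triple moves
       about 3*b^2 words and performs b^3 multiply-adds. */
    void blocked_matmul(int n, int b, const double *A, const double *B, double *C)
    {
        for (int i0 = 0; i0 < n; i0 += b)
            for (int j0 = 0; j0 < n; j0 += b)
                for (int k0 = 0; k0 < n; k0 += b)
                    /* one b-by-b block each of A, B and C in fast memory */
                    for (int i = i0; i < i0 + b && i < n; i++)
                        for (int j = j0; j < j0 + b && j < n; j++)
                            for (int k = k0; k < k0 + b && k < n; k++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

With R-word entries, the block of C alone would occupy R*b^2 words of
fast memory instead of b^2, which is why the block shapes have to change.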

Jim


On 1/24/16 5:49 PM, Van Snyder wrote:
Does this result assume only one long accumulator?

What are the results for more than one long accumulator?

With hundreds of millions of transistors on one chip, having more than
one long accumulator doesn't seem like much of a difficulty.

On Sun, 2016-01-24 at 16:50 -0800, James Demmel wrote:
Regarding the memory cost of using a long accumulator,
there is a simple extension of previous communication lower
bounds for linear algebra, and other algorithms that
"resemble 3-nested loops" (in a mathematically well-defined way),
with which one can prove the following:

Recall that in the conventional case, where all matrix entries
and "temporary accumulators" occupy 1 word of memory,
a lower bound on the amount of memory traffic between
"fast memory" (say cache) and "slow memory" (say DRAM) is
     Omega(#flops / sqrt(M) )  words
where M is the fast memory size. And this lower bound can
be attained for many algorithms by well-known tiling or
divide-and-conquer techniques (for a survey see our
2014 Acta Numerica paper by Ballard/Carson/Demmel/Hoemmen/
Knight/Schwartz).
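
As a quick sanity check on how blocking attains this bound for matmul
(my own rough count; constants and lower-order terms are ignored):

    #include <math.h>

    /* Rough word-traffic estimate for square-blocked n-by-n matmul:
       (n/b)^3 block triples, each moving about 3*b^2 words, with the
       largest b such that 3*b^2 <= M.  The result is ~ 3*n^3/b, i.e.
       O(n^3/sqrt(M)) = O(#flops/sqrt(M)), matching the lower bound. */
    double blocked_traffic_words(double n, double M)
    {
        double b = sqrt(M / 3.0);                  /* block size: 3*b^2 <= M */
        double block_triples = (n / b) * (n / b) * (n / b);
        return block_triples * 3.0 * b * b;        /* ~ 3*n^3 / b words      */
    }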

Now suppose that we want all the dot products accumulated
by these algorithms to be accumulated in a long accumulator
of length R words. For simplicity consider the case of matrix
multiply C=A*B:  we need a different accumulator for each
entry of a tile of C, in order to reuse the entries of the tiles
of A and B in fast memory. So the lower bound analysis may be
modified to prove that the lower bound increases to
    Omega(sqrt(R)*#flops/sqrt(M))
i.e. the memory traffic increases by a factor of sqrt(R). And
this may be attained by a different choice of tile sizes.

So for an Exact Dot Product, where R >= 64, that means
at least an 8-fold (sqrt(R) >= 8) increase in memory traffic.
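
Put side by side (a rough restatement of the two bounds above, not the
proof; constants are omitted and the example sizes in main are arbitrary):

    #include <math.h>
    #include <stdio.h>

    /* Conventional bound: traffic ~ #flops / sqrt(M) words. */
    static double conventional_bound(double flops, double M)
    {
        return flops / sqrt(M);
    }

    /* With R-word accumulators: traffic ~ sqrt(R) * #flops / sqrt(M) words,
       i.e. a factor of sqrt(R) more; for R >= 64 that factor is at least 8. */
    static double long_accumulator_bound(double flops, double M, double R)
    {
        return sqrt(R) * flops / sqrt(M);
    }

    int main(void) {
        double flops = 2e12, M = 1e6, R = 64;   /* example sizes (assumed) */
        printf("traffic grows by a factor of %g\n",
               long_accumulator_bound(flops, M, R) / conventional_bound(flops, M));
        return 0;
    }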

For other "nested loop" computations, there is a similar
analysis, with the memory traffic increasing by R to some
power, where the exponent depends on what the
nested loops look like. For example, for the direct N-body
problem, the memory traffic would increase by a factor of R,
so even worse than linear algebra.

I wrote this up informally recently, with a related but different
motivation, and am meeting with some collaborators to go over
the proof tomorrow morning. I had hoped to have this meeting
before sharing with the 1788 committee, but since Baker posed
the question, I thought I'd respond. I will let you know whether
or not the proof survives tomorrow's conversation :).

Jim

On 1/24/16 1:26 PM, Ralph Baker Kearfott wrote:
HMMM. What about memory access bandwidth and speed, and
their impact on the overall speed of
single versus double precision?

Baker

On 01/24/2016 12:05 PM, Richard Fateman wrote:
The (re) submission of these letters to the discussion
raised a question in my mind.

    I am puzzled by the WG 2.5 vote urging the IEEE 754 committee to
incorporate two facilities:

* [double precision interval arithmetic] at the speed of simple
  floating-point arithmetic
* provide high-speed arithmetic [for dynamic precision intervals]

What puzzles me is the statement of speed requirements in a standard of
this nature. Was WG 2.5 thinking that the IEEE committee was derelict in
its duties in allowing some or all of the 754 arithmetic operations to be
provided at slow speed?