
Re: discussion period begins, until Jan. 26: "natural interval extension": friendly amendment to M001.02



This discussion gives speed a higher priority than accuracy, but that can lead to a wrong result. From the mathematical point of view, accuracy must have the higher priority; in the end this leads to the faster solution, since no error analysis is necessary.

 

In Numerical Analysis the dot product is ubiquitous. It is not merely a fundamental operation in all vector and matrix spaces: it is the exact dot product that makes residual correction and iterative refinement effective.

 

The simplest and fastest way of computing a dot product is to compute it exactly. By pipelining, it can be computed in the time the processor needs to read the data, i.e., it comes at the highest possible speed. Pipelining is another form of parallel processing.
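A minimal software sketch of this "compute exactly, round once" idea (this is not the pipelined hardware unit described above; Python's Fraction type merely stands in for the long accumulator):

    # Software illustration: compute a dot product exactly, round once at the end.
    from fractions import Fraction

    def exact_dot(xs, ys):
        """Exact dot product of two float vectors, rounded once to a float."""
        # Every IEEE 754 double is an exact rational number, so Fraction
        # represents each product and the running sum without any error.
        acc = Fraction(0)
        for x, y in zip(xs, ys):
            acc += Fraction(x) * Fraction(y)
        # A single rounding at the very end, as with a long accumulator.
        return float(acc)

    # Example where naive floating-point summation loses all accuracy:
    x = [1e20, 1.0, -1e20]
    y = [1.0, 1.0, 1.0]
    print(sum(a * b for a, b in zip(x, y)))  # 0.0 (wrong)
    print(exact_dot(x, y))                   # 1.0 (correctly rounded)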


Recently two students at Berkeley have shown that the exact dot product (EDP) can be computed in 1/6 of the time the Intel processor needs to compute a possibly wrong result in conventional floating-point arithmetic. The hardware needed for the EDP is comparable to that of a fast multiplier with an adder tree, accepted years ago and now standard technology in every modern processor. The exact dot product brings the same speedup for accumulations at comparable cost.


Best regards
Ulrich




On 25.01.2016 at 05:39, James Demmel wrote:
The memory model here (which is in the process of being written up
in more detail, and is also described in the references below) deals
with general memory hierarchies as follows: Suppose there
are multiple layers, such as L1 cache, L2, L3, DRAM and disk.
Pick any number of top layers, say L1+L2+L3, and call that
"fast memory". The remaining layers, DRAM+disk in this
case, is "slow memory". Let M be the total size of the fast memory.
The the communication lower bounds are for the number of
words moved between fast and slow memories, L3 and DRAM
in this case, if data only moves between adjacent levels.
The same theory can describe how much data has to move
from one processor ("fast memory") to another processor
("slow memory") over a network, in a parallel implementation.

Now suppose we are doing matmul: C=A*B, and want to use
a long accumulator (of length R words) to compute each
entry of C to high accuracy. Then each entry of C must be
represented by a long accumulator, until it is completely
computed and rounded back to 1 word. So except for the
final n^2 writes of C back to slow memory (which is a lower
order term compared to the lower bound), each entry of
C is represented by an R word quantity, whether it happens
to live in a long register, L1 cache, L2 cache, etc. None of
the information in these R words can be thrown away
until the final rounding. Thus we are also assuming that
these R words are not somehow losslessly compressed
before they are moved to slow memory.

Think of the ubiquitous "blocking" optimization of matmul:
each matrix is broken into square b-by-b subblocks,
where 3 such blocks fit in fast memory. Once they are
read into fast memory (#words moved = 3*b^2),
the blocks of A and B are multiplied and added to the
block of C, so b^3 multiply-adds can be performed.
If we wanted to compute C to high accuracy,
all b^2 entries of this block of C would have to be
represented by R words. It turns out that to attain the
lower bound below, the blocks can no longer be square but
must be rectangular, with shapes that depend on R. Once
my colleagues and I go over the details tomorrow,
I'll distribute them.
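A minimal, unoptimized Python sketch of this blocking, with one wide accumulator per entry of the current block of C, rounded only at the very end; Fraction stands in for the R-word fixed-point accumulator, and the block size b is an assumed parameter:

    from fractions import Fraction

    def blocked_matmul_exact(A, B, n, b):
        """n-by-n matmul (lists of lists of floats), blocked with block size b."""
        C = [[0.0] * n for _ in range(n)]
        for i0 in range(0, n, b):
            for j0 in range(0, n, b):
                # One wide accumulator per entry of this block of C.
                acc = [[Fraction(0)] * b for _ in range(b)]
                for k0 in range(0, n, b):
                    # The three b-by-b blocks touched here are the data that
                    # must sit in fast memory (about 3*b^2 words moved per step).
                    for i in range(i0, min(i0 + b, n)):
                        for j in range(j0, min(j0 + b, n)):
                            s = acc[i - i0][j - j0]
                            for k in range(k0, min(k0 + b, n)):
                                s += Fraction(A[i][k]) * Fraction(B[k][j])
                            acc[i - i0][j - j0] = s
                # Round each accumulator back to one word only at the very end.
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        C[i][j] = float(acc[i - i0][j - j0])
        return C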

Jim


On 1/24/16 5:49 PM, Van Snyder wrote:
Does this result assume only one long accumulator?

What are the results for more than one long accumulator?

With hundreds of millions of transistors on one chip, having more than
one long accumulator doesn't seem like much of a difficulty.

On Sun, 2016-01-24 at 16:50 -0800, James Demmel wrote:
Regarding the memory cost of using a long accumulator,
there is a simple extension of previous communication lower
bounds for linear algebra, and other algorithms that
"resemble 3-nested loops" (in a mathematically well-defined way),
with which one can prove the following:

Recall that in the conventional case, where all matrix entries
and "temporary accumulators" occupy 1 word of memory,
a lower bound on the amount of memory traffic between
"fast memory" (say cache) and "slow memory" (say DRAM) is
     Omega(#flops / sqrt(M) )  words
where M is the fast memory size. And this lower bound can
be attained for many algorithms by well-known tiling or
divide-and-conquer techniques (for a survey see our
2014 Acta Numerica paper by Ballard/Carson/Demmel/Hoemmen/
Knight/Schwartz).
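As a back-of-envelope check (with assumed example sizes, not taken from the paper), square blocking with 3*b^2 = M attains this bound up to a constant factor:

    from math import sqrt

    # Assumed example sizes, just to put numbers on the bound.
    n = 4096                    # matrix dimension
    M = 3 * 128**2              # fast-memory size in words (holds three 128x128 blocks)
    b = int(sqrt(M / 3))        # square block size, so 3*b^2 = M
    block_steps = (n // b)**3   # (i, j, k) block triples visited by blocked matmul
    words_moved = block_steps * 3 * b**2   # ~3*b^2 words read per step = 3*n^3/b
    flops = 2 * n**3

    print(words_moved)          # ~1.6e9
    print(flops / sqrt(M))      # ~6.2e8 -- same order: Omega(#flops/sqrt(M)) is attained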

Now suppose that we want all the dot products computed
by these algorithms to be accumulated in a long accumulator
of length R words. For simplicity consider the case of matrix
multiply C=A*B:  we need a different accumulator for each
entry of a tile of C, in order to reuse the entries of the tiles
of A and B in fast memory. So the lower bound analysis may be
modified to prove that the lower bound increases to
    Omega(sqrt(R)*#flops/sqrt(M))
i.e. the memory traffic increases by a factor of sqrt(R). And
this may be attained by a different choice of tile sizes.

So for an Exact Dot Product, where R >= 64, that means
at least a sqrt(R) >= 8 fold increase in memory traffic.
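A trivial numeric illustration of that factor, with assumed example sizes:

    from math import sqrt

    flops = 2 * 4096**3    # assumed example: 4096 x 4096 matmul
    M = 49152              # assumed fast-memory size in words
    R = 64                 # long-accumulator length in words (R >= 64 for the EDP)

    conventional = flops / sqrt(M)          # Omega(#flops / sqrt(M))
    long_acc = sqrt(R) * conventional       # Omega(sqrt(R) * #flops / sqrt(M))
    print(long_acc / conventional)          # 8.0 -- at least an 8-fold traffic increase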

For other "nested loop" computations, there is a similar
analysis, with the memory traffic increasing by R to some
power, where the exponent depends on what the
nested loops look like. For example, for the direct N-body
problem, the memory traffic would increase by a factor of R,
so even worse than linear algebra.

I wrote this up informally recently, with a related but different
motivation, and am meeting with some collaborators to go over
the proof tomorrow morning. I had hoped to have this meeting
before sharing with the 1788 committee, but since Baker posed
the question, I thought I'd respond. I will let you know whether
or not the proof survives tomorrow's conversation :) .

Jim

On 1/24/16 1:26 PM, Ralph Baker Kearfott wrote:
HMMM. What about memory access bandwidth and speed, and
its impact on the overall speeds of
single versus double precision?

Baker


-- 
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch
KIT Distinguished Senior Fellow

Telefon: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/

KIT - Universität des Landes Baden-Württemberg 
und nationales Großforschungszentrum in der 
Helmholtz-Gesellschaft