
Re: discussion period begins, until Jan. 26: "natural interval extension": friendly amendment to M001.02



Dear Ulrich,

You appear to have forgotten our previous discussion.

What you are actually trading is latency on interrupts vs speed on 
a highly specialised bit of hardware.

If the EDP is to be a general operation it has to be possible
for the code to be interrupted. If this code is to be re-entrant
then the entire 1024-bit accumulator(s?) has to be flushed
to the interrupt stack.

While we are discussing prehistoric processors (Intel? Really?),
the reason that RISC is better than CISC is that small instructions
are more easily interruptible. This, if you recall, was the
problem with the RET instruction on the VAX-780, which needed to
be interruptible for real-time responsiveness; in practice it was
more efficient to insert the register saves yourself than to use
the RET instruction.

Similarly, the students at Berkeley certainly showed that _considered_
_as_a_non-interruptible_instruction_ the EDP was 6 times more
efficient than an Intel processor (or, as we say in Manchester, twice as
efficient as a modern processor). But the paper was a little vague
on how this meshes with the rest of the instruction set.

(I should say that getting the interrupts to work properly is what
keeps processor architects awake at night. I have some horror
stories from the MIPS architecture.)

On the matter of modern supercomputer design, it is not the
case that we have silicon to burn. The major issue is power
consumption, and how to get the heat out afterwards. An
Exascale supercomputer could be built now from current
technology. The problems would be: bisection bandwidth
(basically how efficiently data could be moved around the
machine), and the inability of anyone to pay the electricity
bills (order US $100M pa)!

The other issue is the instruction mix we
should expect. The vast majority of supercomputer use
today is concerned with “big data” and not numeric
simulation. My friends at FZ Jülich estimate that less
than 1% of the instructions executed on JUQUEEN
during a numeric simulation are FP
instructions.

http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUQUEEN/JUQUEEN_node.html

In short, a modern supercomputer is more often than not
just a data-mining tool, not the sort of number cruncher
that Cray would have produced.

Anyway, wouldn’t an EDP instruction requirement fit more
naturally into IEEE-754 if it’s needed at all?

Dave


On 26 Jan 2016, at 12:45, Ulrich Kulisch <ulrich.kulisch@xxxxxxx> wrote:

This discussion gives speed a higher priority than accuracy, but that may lead to a wrong result. From the mathematical point of view accuracy must have the higher priority. In the end this leads to the faster solution, and no error analysis is necessary.
 
In Numerical Analysis the dot product is ubiquitous. It is not merely a fundamental operation in all vector and matrix spaces. It is the exact dot product which makes residual correction or iterative refinement effective. 
 
The simplest and fastest way of computing a dot product is to compute it exactly. By pipelining, it can be computed in the time the processor needs to read the data, i.e., at the highest possible speed. Pipelining is another form of parallel processing.

Recently two students at Berkeley have shown that the EDP can be computed in 1/6 of the time the Intel processor needs to compute a possibly wrong result in conventional floating-point arithmetic. The hardware needed for the EDP is comparable to that of a fast multiplier with an adder tree, which was accepted years ago and is now standard technology in every modern processor. The exact dot product brings the same speedup for accumulations at comparable cost.
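
As an illustration of the semantics only (accumulate exactly, round once at the end), not of the pipelined hardware: a minimal Python sketch, using Fraction as a software stand-in for the long accumulator; the function names and the small cancellation example are just for illustration.

from fractions import Fraction

def exact_dot(x, y):
    # Each product of two binary floats is exact as a Fraction, and the
    # accumulation over Fractions is exact, so the only rounding error
    # is the final conversion back to a float.
    acc = Fraction(0)                       # stand-in for the long accumulator
    for xi, yi in zip(x, y):
        acc += Fraction(xi) * Fraction(yi)  # exact product, exact accumulation
    return float(acc)                       # one rounding, at the very end

def naive_dot(x, y):
    # Conventional floating-point dot product: rounds after every step.
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

# Cancellation example: the conventional sum loses the small term entirely.
x = [1e16, 1.0, -1e16]
y = [1.0, 1e-16, 1.0]
print(naive_dot(x, y))   # 0.0
print(exact_dot(x, y))   # 1e-16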

Best regards
Ulrich




On 25 Jan 2016, at 05:39, James Demmel wrote:
The memory model here (which is in the process of being written up 
in more detail, and is also described in the references below) deals 
with general memory hierarchies as follows: Suppose there 
are multiple layers, such as L1 cache, L2, L3, DRAM and disk. 
Pick any number of top layers, say L1+L2+L3, and call that 
"fast memory". The remaining layers, DRAM+disk in this 
case, are "slow memory". Let M be the total size of the fast memory. 
Then the communication lower bounds are for the number of 
words moved between fast and slow memories, L3 and DRAM 
in this case, if data only moves between adjacent levels. 
The same theory can describe how much data has to move 
from one processor ("fast memory") to another processor 
("slow memory") over a network, in a parallel implementation. 

Now suppose we are doing matmul: C=A*B, and want to use 
a long accumulator (of length R words) to compute each 
entry of C to high accuracy. Then each entry of C must be 
represented by a long accumulator, until it is completely 
computed and rounded back to 1 word. So except for the 
final n^2 writes of C back to slow memory (which is a lower 
order term compared to the lower bound), each entry of 
C is represented by an R word quantity, whether it happens 
to live in a long register, L1 cache, L2 cache, etc. None of 
the information in these R words can be thrown away 
until the final rounding. Thus we are also assuming that 
these R words are not somehow losslessly compressed 
before they are moved to slow memory. 

Think of the ubiquitous "blocking" optimization of matmul: 
each matrix is broken into square b-by-b subblocks, 
where 3 such blocks fit in fast memory. Once they are 
read into fast memory (#words moved = 3*b^2), 
the blocks of A and B are multiplied and added to the 
block of C, so b^3 multiply-adds can be performed. 
If we wanted to compute C to high accuracy, 
all b^2 entries of this block of C would have to be 
represented by R words. It turns out that to attain the 
lower bound below, the blocks can no longer be square, 
but rectangular, with shapes that depend on R. Once 
my colleagues and I go over the details tomorrow, 
I'll distribute them. 
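
As a rough, back-of-the-envelope sketch of that blocking argument (an assumption-laden toy, not the actual analysis: square b-by-b blocks only, the block of C assumed to stay resident in fast memory with R words per entry while it is formed; the function names and the values M = 32768 words and n = 4096 are made up for illustration):

def blocked_matmul_traffic(n, b):
    # Words moved between slow and fast memory for blocked C = A*B with
    # n-by-n matrices and square b-by-b blocks (b assumed to divide n).
    # For each of the (n/b)^2 blocks of C we stream in n/b pairs of A and
    # B blocks (2*b*b words each); the block of C stays resident until it
    # is rounded and written back once, 1 word per entry.
    nb = n // b
    reads = nb * nb * nb * 2 * b * b
    writes = n * n
    return reads + writes

def largest_block(M, R):
    # Largest power-of-two block side b such that one R-word-per-entry
    # block of C plus one block each of A and B fit in fast memory:
    # (R + 2) * b^2 <= M.
    b = 1
    while (R + 2) * (2 * b) ** 2 <= M:
        b *= 2
    return b

M = 32768   # fast memory of 32K words (made-up size)
n = 4096
for R in (1, 64):
    b = largest_block(M, R)
    print(R, b, blocked_matmul_traffic(n, b))

With these made-up numbers the largest block that fits drops from 64 to 16 as R goes from 1 to 64, and the words moved grow accordingly; the constants of such a square-block toy are not meaningful, which is where the rectangular shapes mentioned above come in.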

Jim 


On 1/24/16 5:49 PM, Van Snyder wrote: 
Does this result assume only one long accumulator? 

What are the results for more than one long accumulator? 

With hundreds of millions of transistors on one chip, having more than 
one long accumulator doesn't seem like much of a difficulty. 

On Sun, 2016-01-24 at 16:50 -0800, James Demmel wrote: 
Regarding the memory cost of using a long accumulator, 
there is a simple extension of previous communication lower 
bounds for linear algebra, and other algorithms that 
"resemble 3-nested loops" (in a mathematically well-defined way), 
with which one can prove the following: 

Recall that in the conventional case, where all matrix entries 
and "temporary accumulators" occupy 1 word of memory, 
a lower bound on the amount of memory traffic between 
"fast memory" (say cache) and "slow memory" (say DRAM) is 
     Omega(#flops / sqrt(M) )  words 
where M is the fast memory size. And this lower bound can 
be attained for many algorithms by well-known tiling or 
divide-and-conquer techniques (for a survey see our 
2014 Acta Numerica paper by Ballard/Carson/Demmel/Hoemmen/ 
Knight/Schwartz). 
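
(As an illustrative sanity check of how tiling attains this, in the usual idealized model: for n-by-n matmul with square tiles of side b = sqrt(M/3), each of the (n/b)^3 tile multiplications moves about 3*b^2 words, so roughly 3*n^3/b = 3*sqrt(3)*n^3/sqrt(M) words in total, which is Theta(#flops/sqrt(M)) since #flops = 2*n^3.) 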

Now suppose that we want all the dot products accumulated 
by these algorithms to be accumulated in a long accumulator 
of length R words. For simplicity consider the case of matrix 
multiply C=A*B:  we need a different accumulator for each 
entry of a tile of C, in order to reuse the entries of the tiles 
of A and B in fast memory. So the lower bound analysis may be 
modified to prove that the lower bound increases to 
    Omega(sqrt(R)*#flops/sqrt(M)) 
i.e. the memory traffic increases by a factor of sqrt(R). And 
this may be attained by a different choice of tile sizes. 

So for an Exact Dot Product, where R >= 64, that means 
at least an 8-fold (sqrt(R) >= 8) increase in memory traffic. 

For other "nested loop" computations, there is a similar 
analysis, with the memory traffic increasing by R to some 
power, where the exponent depends on what the 
nested loops look like. For example, for the direct N-body 
problem, the memory traffic would increase by a factor of R, 
so even worse than linear algebra. 

I wrote this up informally recently, with a related but different 
motivation, and am meeting with some collaborators to go over 
the proof tomorrow morning. I had hoped to have this meeting 
before sharing with the 1788 committee, but since Baker posed 
the question, I thought I'd respond. I will let you know whether 
or not the proof survives tomorrow's conversation :) . 

Jim 

On 1/24/16 1:26 PM, Ralph Baker Kearfott wrote: 
HMMM. What about memory access bandwidth and speed, and 
their impact on the overall speeds of 
single versus double precision? 

Baker 


-- 
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch
KIT Distinguished Senior Fellow

Telephone: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/

KIT - University of the State of Baden-Württemberg and
National Research Center of the
Helmholtz Association