Dear Ulrich,
You appear to have forgotten our previous discussion.
What you are actually trading is latency on interrupts vs speed on a highly specialised bit of hardware.
If the EDP is to be a general operation it has to be possible for the code to be interrupted. If this code is to be re-entrant, then the entire 1024-bit accumulator (or accumulators?) has to be flushed to the interrupt stack.
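A back-of-the-envelope sketch of what that flush costs per context switch (Python; the 1024-bit width is the one quoted above, and the single-accumulator count is an assumption):

    ACC_BITS = 1024        # accumulator width quoted above; larger designs exist
    WORD_BITS = 64
    N_ACC = 1              # assumption: how many accumulators are live ("(s?)")

    extra_words = N_ACC * ACC_BITS // WORD_BITS
    print(extra_words)     # -> 16 extra 64-bit words to save and restore per interrupt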
While we are discussing pre-historic processors (Intel? Really?): the reason that RISC is better than CISC is that small instructions are more easily interruptible. This, if you recall, was the problem with the RET instruction on the VAX-780, which needed to be interruptible for real-time responsiveness; it was actually more efficient to insert the register saves yourself than to use the RET instruction.
Similarly, the students at Berkeley certainly showed that, _considered as a non-interruptible instruction_, the EDP was 6 times more efficient than an Intel processor (or, as we say in Manchester, twice as efficient as a modern processor). But the paper was a little vague on how this meshes with the rest of the instruction set.
(I should say that getting the interrupts to work properly is what keeps processor architects awake at night. I have some horror stories from the MIPS architecture.)
On the matter of modern supercomputer design, it is not the case that we have silicon to burn. The major issue is power consumption, and how to get the heat out afterwards. An Exascale supercomputer could be built now from current technology. The problems would be: bisection bandwidth (basically how efficiently data could be moved around the machine), and the inability of anyone to pay the electricity bills (order US $100M pa)!
The other issue is the instruction mix we should expect. The vast majority of supercomputer use today is concerned with “big data” and not numeric simulation. My friends at FZJ Jülich estimate that less than 1% of the instructions executed on the JUQUEEN during a numeric simulation are floating-point instructions.
In short, a modern supercomputer is more often than not just a data-mining tool, not the sort of number cruncher that Cray would have produced.
Anyway, wouldn’t an EDP instruction requirement fit more naturally into IEEE-754 if it’s needed at all?
Dave
This discussion gives speed a higher priority than accuracy, but that may lead to a wrong result. From the mathematical point of view accuracy must have the higher priority; in the end this also leads to the faster solution, because no error analysis is necessary. In Numerical Analysis the dot product is ubiquitous. It is not merely a fundamental operation in all vector and matrix spaces; it is the exact dot product which makes residual correction or iterative refinement effective. The simplest and fastest way of computing a dot product is to compute it exactly. By pipelining, it can be computed in the time the processor needs to read the data, i.e., it comes with utmost speed. Pipelining is another form of parallel processing.
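A minimal sketch of that residual-correction point (Python/NumPy; extended precision via longdouble stands in for the exact dot product, and the solver set-up is illustrative only):

    import numpy as np

    def refine(A, b, iters=3):
        # Working-precision solve followed by residual correction.  The residual
        # is accumulated in extended precision as a stand-in for an EDP; a true
        # EDP would accumulate it with no rounding at all.
        x = np.linalg.solve(A, b)
        A_hi = A.astype(np.longdouble)
        b_hi = b.astype(np.longdouble)
        for _ in range(iters):
            r = b_hi - A_hi @ x                       # residual in higher precision
            x = x + np.linalg.solve(A, r.astype(np.float64))
        return x

For systems that are not hopelessly ill conditioned, a few such corrections recover essentially full working-precision accuracy, which is why the precision of the residual matters.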
Recently two students at Berkeley have shown that the EDP can be computed in 1/6 of the time an Intel processor needs to compute a possibly wrong result in conventional floating-point arithmetic. The hardware needed for the EDP is comparable to that for a fast multiplier with an adder tree, accepted years ago and now standard technology in every modern processor. The exact dot product brings the same speedup for accumulations at comparable cost.
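To make the EDP concrete in software terms (a toy model only; Python rationals stand in for the long fixed-point accumulator that the hardware adder tree would feed):

    from fractions import Fraction

    def exact_dot(xs, ys):
        # Every binary64 value is an exact rational, so this accumulator models a
        # long fixed-point register: no rounding until the final conversion.
        acc = Fraction(0)
        for x, y in zip(xs, ys):
            acc += Fraction(x) * Fraction(y)
        return float(acc)                        # one rounding, at the very end

    xs = [1e16, 1.0, -1e16]
    ys = [1.0, 1.0, 1.0]
    print(sum(x * y for x, y in zip(xs, ys)))    # 0.0: conventional accumulation cancels
    print(exact_dot(xs, ys))                     # 1.0: exact accumulation, rounded once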
Best regards
Ulrich

On 25.01.2016 at 05:39, James Demmel wrote:

The memory model here (which is in the process of being written up in more detail, and is also described in the references below) deals with general memory hierarchies as follows: Suppose there are multiple layers, such as L1 cache, L2, L3, DRAM and disk. Pick any number of top layers, say L1+L2+L3, and call that "fast memory". The remaining layers, DRAM+disk in this case, are "slow memory". Let M be the total size of the fast memory. Then the communication lower bounds are for the number of words moved between fast and slow memories, L3 and DRAM in this case, if data only moves between adjacent levels. The same theory can describe how much data has to move from one processor ("fast memory") to another processor ("slow memory") over a network, in a parallel implementation.
Now suppose we are doing matmul: C=A*B, and want to use a long accumulator (of length R words) to compute each entry of C to high accuracy. Then each entry of C must be represented by a long accumulator, until it is completely computed and rounded back to 1 word. So except for the final n^2 writes of C back to slow memory (which is a lower order term compared to the lower bound), each entry of C is represented by an R word quantity, whether it happens to live in a long register, L1 cache, L2 cache, etc. None of the information in these R words can be thrown away until the final rounding. Thus we are also assuming that these R words are not somehow losslessly compressed before they are moved to slow memory.
Think of the ubiquitous "blocking" optimization of matmul: each matrix is broken into square b-by-b subblocks, where 3 such blocks fit in fast memory. Once they are read into fast memory (#words moved = 3*b^2), the blocks of A and B are multiplied and added to the block of C, so b^3 multiply-adds can be performed. If we wanted to compute C to high accuracy, all b^2 entries of this block of C would have to be represented by R words. It turns out that to attain the lower bound below, the blocks can no longer be square, but rectangular, with shapes that depend on R. Once my colleagues and I go over the details tomorrow, I'll distribute them.
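For reference, that blocking in a few lines (a sketch; square b-by-b blocks only, with b chosen so that roughly 3*b^2 words fit in fast memory; the rectangular, R-dependent block shapes are not shown):

    import numpy as np

    def blocked_matmul(A, B, b):
        # Classic tiling: one b-by-b block each of A, B and C resident at a time,
        # i.e. about 3*b*b words of fast memory, and b^3 multiply-adds per block triple.
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C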
Jim
On 1/24/16 5:49 PM, Van Snyder wrote:
Does this result assume only one long accumulator?
What are the results for more than one long accumulator?
With hundreds of millions of transistors on one chip, having more than one long accumulator doesn't seem like much of a difficulty.
On Sun, 2016-01-24 at 16:50 -0800, James Demmel wrote:
Regarding the memory cost of using a long accumulator, there is a simple extension of previous communication lower bounds for linear algebra, and other algorithms that "resemble 3-nested loops" (in a mathematically well-defined way), with which one can prove the following:
Recall that in the conventional case, where all matrix entries and "temporary accumulators" occupy 1 word of memory, a lower bound on the amount of memory traffic between "fast memory" (say cache) and "slow memory" (say DRAM) is Omega(#flops / sqrt(M)) words, where M is the fast memory size. And this lower bound can be attained for many algorithms by well-known tiling or divide-and-conquer techniques (for a survey see our 2014 Acta Numerica paper by Ballard/Carson/Demmel/Hoemmen/Knight/Schwartz).
Now suppose that we want all the dot products accumulated by these algorithms to be accumulated in a long accumulator of length R words. For simplicity consider the case of matrix multiply C=A*B: we need a different accumulator for each entry of a tile of C, in order to reuse the entries of the tiles of A and B in fast memory. So the lower bound analysis may be modified to prove that the lower bound increases to Omega(sqrt(R) * #flops / sqrt(M)), i.e., the memory traffic increases by a factor of sqrt(R). And this may be attained by a different choice of tile sizes.
So for an Exact Dot Product, where R >= 64, that means at least a sqrt(R) >= 8-fold increase in memory traffic.
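A quick sanity check on that factor (Python; problem and cache sizes are illustrative, and the constants hidden in the Omega are dropped):

    import math

    def traffic_lower_bound(flops, M, R=1):
        # Omega(sqrt(R) * #flops / sqrt(M)) words moved, constants dropped.
        return math.sqrt(R) * flops / math.sqrt(M)

    n, M, R = 4096, 32 * 1024, 64                        # illustrative sizes
    conventional = traffic_lower_bound(2 * n**3, M)      # 1-word accumulators
    long_acc     = traffic_lower_bound(2 * n**3, M, R)   # R-word long accumulators
    print(long_acc / conventional)                       # -> 8.0, the sqrt(R) factor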
For other "nested loop" computations, there is a similar analysis, with the memory traffic increasing by R to some power, where the exponent depends on what the nested loops look like. For example, for the direct N-body problem, the memory traffic would increase by a factor of R, so even worse than linear algebra.
I wrote this up informally recently, with a related but different motivation, and am meeting with some collaborators to go over the proof tomorrow morning. I had hoped to have this meeting before sharing with the 1788 committee, but since Baker posed the question, I thought I'd respond. I will let you know whether or not the proof survives tomorrow's conversation :) .
Jim
On 1/24/16 1:26 PM, Ralph Baker Kearfott wrote:
HMMM. What about memory access bandwidth and speed, and its impact on the overall speeds of single versus double precision?
Baker
--
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch
KIT Distinguished Senior Fellow
Telefon: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/
KIT - Universität des Landes Baden-Württemberg
und nationales Großforschungszentrum in der
Helmholtz-Gesellschaft