Dear David,
I thank you again for your very interesting mail. It shows indeed that the long-accumulator method is the most energy-efficient one. Compared with computing a dot product by the conventional method in floating-point arithmetic, it avoids a large number of data movements in the computer. To accumulate a product, only its two factors have to be read; the product is then added to the long accumulator on the arithmetic unit. If the data are distributed over different processors, parts of the dot product can be computed on different processors. Then, at the end of the accumulation of products, only the long results have to be moved and added into the final result of the dot product. This is a very fast process, and as a side effect the result is exact. No error analysis is necessary.

All that is needed is a tiny local memory on the arithmetic unit. That about 4000 bits are needed for IEEE 754 double precision is due to the large exponent range of this data format. For interval arithmetic a much smaller exponent range would suffice. For the /370 data format long, only about 1200 bits were needed, and I do not remember a single application where this was not enough. For IEEE 754 single precision about 600 bits suffice.

The technique described above was well established on old calculators (1935 and earlier). I attach pictures of four such calculators. Several of these provided two long registers, which allowed two long sums or dot products to be accumulated simultaneously. The technique of using a long accumulator can be traced back to the early calculating machine of G. W. Leibniz (1693); the attached picture shows 8 (decimal) digits for the summands and 16 digits for the long accumulator. In the old days this technique was offered and recommended as the most effective way of using such a machine. And this is still true today: it is the most natural way of accumulating numbers or products.

With best regards
Ulrich

On 11.08.2013 11:43, David Lester wrote:

Ulrich,

Here are some energy consumption figures for typical operations on 28nm geometry, running at 800mV:

28nm (0.8V)
  16b integer MADD                  1   pJ        // NB combined mul/add
  32b integer MADD                  4   pJ
  32b floating-point MAF            6   pJ        // NB combined mul/add
  64b floating-point MAF           20   pJ
  Read from on-chip SRAM            1.5 pJ/byte
  Read from co-packaged DRAM       50   pJ/byte
  Read from off-chip DRAM         250   pJ/byte
  Read from SSD (6Gbps SATA)     5000   pJ/byte
  20mm wires (50% transitions)      7   pJ/byte
  Chip-to-chip parallel link       25   pJ/byte
  Chip-to-chip serial link         50   pJ/byte

And to give you an idea of scale: 1pJ is about 60,000 electron volts (though not strictly true, we could imagine each 16-bit arithmetic operation involves about 38,000 electrons).

As you can see, once we have moved data to the local SRAM, we want to do as much as possible with it. Indeed, it is better to think of programming the next generation of devices as if they were GPU-like, rather than as a single processor. Our target is 256 cores per 1cm^2 die -- provided we can get enough SRAM to keep the CPUs fed -- running at 1W per chip, clocked at ~0.5-1.0 GHz. Co-packaged DRAM: as much as we can squeeze in. Ideally, each of our cores -- or at least groups of 8 -- will be running the same instruction on different data (SIMD).

-----------------------------

Now, as I understand it, you are suggesting we trade core numbers for complexity. Can I suggest we might get 64 complex cores? What the energy numbers above indicate is that we need to pay great attention to the movement of data; the operations themselves are of secondary importance.
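To make the accumulation scheme concrete, here is a minimal software sketch of the long-accumulator dot product described at the top of this mail, emulated with Python's arbitrary-precision integers standing in for the long register. The scaling constant, the frexp-based decomposition and the final rounding step are illustrative choices, not part of either proposal or of any hardware design.

# Emulate a long fixed-point accumulator for an exact dot product of
# IEEE 754 doubles, using a Python integer as the "long register".

import math

# 2252 fractional bits cover the smallest possible product of two doubles
# (exponent >= -2252); the ~4000 bits mentioned above additionally cover
# the largest products and carries.
SCALE = 2252

def exact_dot(xs, ys):
    acc = 0                                          # the long accumulator
    for x, y in zip(xs, ys):
        mx, ex = math.frexp(x)                       # x = mx * 2**ex, 0.5 <= |mx| < 1
        my, ey = math.frexp(y)
        ix, iy = int(mx * 2**53), int(my * 2**53)    # exact integer mantissas
        acc += (ix * iy) << (SCALE + ex + ey - 106)  # exact product, exact add
    return acc / (1 << SCALE)                        # one rounding at the very end

# Conventional floating-point summation loses the 1.0 here; the long
# accumulator does not:
print(exact_dot([1e16, 1.0, -1e16], [1.0, 1.0, 1.0]))   # -> 1.0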
For example, it costs about 4 times as much energy to get a word from one side of the chip to the other as to do a pair of 32-bit float operations.

Now, to get the best performance out of this device, we will ideally need a compiler that optimises against the energy consumption. Since we do not yet have the device, we've not built the compiler, so I have not run the numbers. Depending on how we are using the dot product, it may make sense to pre-load a pair of matrices into the locations needed. Alternatively, we could think about the dataflow through the device.

Personally, I think I'd be looking at getting a parallel version of GMP up and running early, and using this as a vehicle for expressing a large fixed-point accumulator across multiple CPUs. I'm prepared to concede that this may not prove to be the best solution, but the cost of getting all the floats to one CPU for a dot-product operation may be prohibitive.

By the way, I am in no way opposed to the idea of high-accuracy arithmetic, but I _am_ suggesting that the best way to achieve this might involve a combination of hardware and software, and that it is premature to be suggesting/mandating/dictating one particular hardware solution given the changing nature of the hardware landscape. In particular, one of the challenges is devising programming models and programming languages capable of making use of "many-core" architectures (think: 250-1000 cores).

Regards,

Dave Lester

--
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik
D-76128 Karlsruhe, Germany

Prof. Ulrich Kulisch
Telefon: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/

KIT - Universität des Landes Baden-Württemberg und nationales Großforschungszentrum in der Helmholtz-Gesellschaft
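Both mails above point at the same split, so here is a sketch of it: each core accumulates its own slice of the products into a local long accumulator, and only these exact partial accumulators, not the raw floats, are moved and added at the end. Plain Python integers stand in for GMP's mpz for brevity; the chunking, helper names and scaling constant are illustrative assumptions.

# Split accumulation: per-core long accumulators, combined exactly at the end.

import math

SCALE = 2252   # fractional bits; covers any product of two IEEE 754 doubles

def local_accumulate(xs, ys):
    """What one core runs over the data it already holds locally."""
    acc = 0
    for x, y in zip(xs, ys):
        mx, ex = math.frexp(x)
        my, ey = math.frexp(y)
        acc += (int(mx * 2**53) * int(my * 2**53)) << (SCALE + ex + ey - 106)
    return acc            # only this one long integer leaves the core

def combine(partials):
    """Adding the partial accumulators is exact; round once at the end."""
    return sum(partials) / (1 << SCALE)

# Four "cores", each holding an interleaved slice of the vectors:
xs = [1e16, 1.0, -1e16, 1e-30] * 4
ys = [1.0] * len(xs)
slices = [(xs[i::4], ys[i::4]) for i in range(4)]
print(combine([local_accumulate(sx, sy) for sx, sy in slices]))
# -> 4.0, i.e. the exact sum 4 + 4e-30 rounded once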
Attachment:
Auszug4.tex.pdf
Description: Adobe PDF document
Attachment:
GWLz.pdf
Description: Adobe PDF document