Ulrich,
Is your point that if floating-point arithmetic had never been
introduced and fixed-point (or equivalently, rational) arithmetic
had remained the only way to compute with rational numbers, then
dealing with rounding errors would not have been a motivation for
introducing and developing interval arithmetic? Of course,
fixed-point or rational interval arithmetic could still have been
introduced to compute bounds on the results of computing with
non-rational real numbers. But, more likely, this would have been
done by simply increasing the accuracy of such numbers'
fixed-point approximations.
While this is an interesting thing about which to speculate, the
utility of computing with intervals is now far greater than simply
bounding floating-point rounding errors. Interval arithmetic can
and should be used to bound errors in measured inputs to
computations, something that a fixed-point system could do to
within one ULP of the fixed-point numbers used to represent an
interval's endpoints. However, in this case the utility of a long
accumulator seems questionable.
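To make the point concrete, here is a minimal sketch in Python
(purely illustrative; the Interval class and the one-ULP outward
widening are my stand-ins for proper outward rounding, not any
particular library; requires Python 3.9+ for math.nextafter):

import math

class Interval:
    """Closed interval with ordinary float endpoints."""
    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Widen each endpoint by one ULP in lieu of true directed rounding.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(p), -math.inf),
                        math.nextafter(max(p), math.inf))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

# Measured inputs known only to +/- 0.05:
x = Interval(2.30 - 0.05, 2.30 + 0.05)
y = Interval(1.70 - 0.05, 1.70 + 0.05)
print(x * y + x)   # encloses the true result, measurement uncertainty included

In this example the measurement uncertainty dominates the width of
the result; the extra ULP conceded at each endpoint is negligible
by comparison.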
So, to clarify, can you provide a real example of a computation
with non-degenerate interval inputs where computing with exact
input interval endpoints enables the exact non-degenerate interval
result to be substantially more accurately computed than it is
with exact floating-point representations of input interval
endpoints?
I know of none.
Thanks in advance,
Bill
On 8/12/13 7:24 AM, Ulrich Kulisch wrote:
Dear David,
I thank you again for your very interesting mail. It indeed
shows that the long accumulator method is most energy
efficient. Compared with computing a dot product by the
conventional method in floating-point arithmetic, it avoids a
large number of data movements in the computer. To accumulate
a product, only its two factors have to be read. The product
is then added to the long accumulator on the arithmetic unit.
If the data are distributed over different processors, parts
of the dot product can be computed on different processors.
Then, at the end of the accumulation of products, only the
long partial results have to be moved and added to obtain the
final result of the dot product.
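A small software sketch of this scheme may make it concrete (an
illustration only, not the hardware design; Python's unbounded
integers stand in for the long fixed-point register, and the
scale factor 2**2148 makes every product of two binary64 values
an exact integer):

from fractions import Fraction

SCALE = 2 ** 2148   # 2 * 1074 fractional bits cover the smallest binary64 product

def partial_accumulator(xs, ys):
    """Exact partial dot product, e.g. the share of one processor."""
    acc = 0
    for x, y in zip(xs, ys):
        term = Fraction(x) * Fraction(y) * SCALE   # exact rational arithmetic
        assert term.denominator == 1               # integral by construction
        acc += term.numerator
    return acc

def exact_dot(x, y, parts=4):
    # Split the data as if it lived on several processors, accumulate locally,
    # then add the long partial results -- only these have to be moved.
    chunks = [(x[i::parts], y[i::parts]) for i in range(parts)]
    total = sum(partial_accumulator(cx, cy) for cx, cy in chunks)
    return float(Fraction(total, SCALE))           # a single, final rounding

# Ill-conditioned example: conventional floating-point summation loses the 1.0.
x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
print(sum(a * b for a, b in zip(x, y)))   # 0.0
print(exact_dot(x, y))                    # 1.0, exact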
The accumulation is a very fast process and, as a side effect,
the result is exact. No error analysis is necessary. All that
is needed is a tiny local memory on the arithmetic unit. The
fact that about 4000 bits are needed for IEEE 754 double
precision is due to the large exponent range of this data
format. For interval arithmetic a much smaller exponent range
would suffice. For the /370 long data format only about 1200
bits were needed, and I do not remember a single application
where this was not enough. For IEEE 754 single precision about
600 bits suffice.
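These register lengths can be reproduced by a back-of-the-envelope
count (a rough reconstruction on my part, with the guard width for
carries taken as 92 bits to match the commonly cited 4288-bit
figure for binary64):

def accumulator_bits(e_max, e_min_subnormal, guard=92):
    # The register must span from the square of the smallest subnormal,
    # 2**(2*e_min_subnormal), up to the square of the largest finite value,
    # which is below 2**(2*(e_max + 1)), plus guard bits for carries.
    return 2 * (e_max + 1) + 2 * abs(e_min_subnormal) + guard

print(accumulator_bits(1023, -1074))   # binary64: 4288 ("about 4000 bits")
print(accumulator_bits(127, -149))     # binary32: 646 ("about 600 bits")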
The technique described above was well established on old
calculators (1935 and earlier). I attach pictures of four such
calculators. Several of these provided two long registers,
which allowed two long sums or dot products to be accumulated
simultaneously. The technique of using a long accumulator can
be traced back to the early computer of G. W. Leibniz (1693);
the picture shows 8 (decimal) digits for the summands and 16
digits for the long accumulator. In the old days this
technique was offered and recommended as the most effective
way of using a computer, and this is still true today. It
actually is the most natural way of accumulating numbers or
products.
With best regards
Ulrich
On 11.08.2013 at 11:43, David Lester wrote:
Ulrich,
Here are some energy consumption figures for typical operations on 28nm geometry, running at 800mV:
28nm (0.8V)
16b integer MADD                1pJ          // NB combined mul/add
32b integer MADD                4pJ
32b floating-point MAF          6pJ          // NB combined mul/add
64b floating-point MAF          20pJ
Read from on-chip SRAM          1.5pJ/byte
Read from co-packaged DRAM      50pJ/byte
Read from off-chip DRAM         250pJ/byte
Read from SSD (6Gbps SATA)      5000pJ/byte
20mm wires (50% transitions)    7pJ/byte
Chip-chip parallel link         25pJ/byte
Chip-chip serial link           50pJ/byte
And to give you an idea of scale: 1pJ is about 60,000 electron volts (not strictly accurate, but we
could imagine each 16-bit arithmetic operation involving about 38,000 electrons).
As you can see, once we have moved data to the local SRAM, we want to do as much as possible with it.
Indeed it is better to think of programming the next generation of devices as if they were GPU-like,
rather than as a single processor.
Our target is 256 cores per 1cm^2 die -- provided we can get enough SRAM to keep the CPUs fed --
running at 1W per chip. Clocked at ~ 0.5-1.0 GHz. Co-packaged DRAM: as much as we can squeeze in.
Ideally, each of our cores -- or at least groups of 8 -- will be running the same instruction on
different data (SIMD).
-----------------------------
Now, as I understand it, you are suggesting we trade core numbers for complexity. Can I suggest we
might get 64 complex cores?
What the energy numbers above are indicating is that we need to pay great attention to the movement
of data; the operations themselves are of secondary importance. For example, it costs about 4 times
as much energy to get a word from one side of the chip to the other as it does to do a pair of 32-bit
float operations.
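A toy calculation with the figures above makes the point (illustrative only: it counts just operand
movement and multiply-add energy for a plain 64-bit dot product, ignoring instruction fetch, control,
and refresh):

PJ_MAF64     = 20.0    # 64-bit floating-point multiply-add
PJ_SRAM_BYTE = 1.5     # read from on-chip SRAM
PJ_DRAM_BYTE = 250.0   # read from off-chip DRAM

def dot_energy_pj(n, bytes_per_operand=8, read_cost=PJ_DRAM_BYTE):
    move = 2 * n * bytes_per_operand * read_cost   # fetch x[i] and y[i]
    compute = n * PJ_MAF64                         # one fused multiply-add per term
    return move + compute

n = 1_000_000
print(f"from DRAM: {dot_energy_pj(n, read_cost=PJ_DRAM_BYTE) / 1e6:.0f} uJ")
print(f"from SRAM: {dot_energy_pj(n, read_cost=PJ_SRAM_BYTE) / 1e6:.0f} uJ")

Fetched from off-chip DRAM, the movement costs roughly 200 times the arithmetic; fetched from local
SRAM, the two are comparable.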
Now, to get the best performance out of this device, we will ideally need a compiler that optimises
against the energy consumption. Since we do not yet have the device, we've not built the compiler, so
I have not run the numbers. Depending on how we are using the dot-product, it may make sense to
pre-load a pair of matrices into the locations needed. Alternatively, we could think about the
dataflow through the device.
Personally, I think I'd be looking at getting a parallel version of GMP up and running early, and
using this as a vehicle for expressing a large fixed-point accumulator across multiple CPUs. I'm
prepared to concede that this may not prove to be the best solution, but the cost of getting all the
floats to one CPU for a dot-product operation may be prohibitive.
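One conceptual sketch of that limb-splitting (one possible reading, not a worked-out design; Python
ints stand in for GMP limbs, and the 4288-bit register width and 64-bit limbs are assumptions of
mine): each exactly scaled product is scattered into the limbs it overlaps, each limb could live on
its own core, and a single carry-propagation pass at the end recovers the exact value.

from fractions import Fraction

LIMB_BITS = 64
FRAC_BITS = 2148                    # fractional bits covering binary64 products
TOTAL_BITS = 4288                   # assumed long-accumulator width for binary64
N_LIMBS = TOTAL_BITS // LIMB_BITS + 1

def scaled_product(x, y):
    """Exact product of two doubles as an integer multiple of 2**-FRAC_BITS."""
    return (Fraction(x) * Fraction(y) * 2**FRAC_BITS).numerator

def scatter(term, limbs):
    """Add one scaled product into the limbs (cores) whose windows it overlaps."""
    sign = -1 if term < 0 else 1
    term, w = abs(term), 0
    while term:
        limbs[w] += sign * (term & (2**LIMB_BITS - 1))   # this core's share
        term >>= LIMB_BITS
        w += 1

def gather(limbs):
    """Final carry-propagation pass: combine all limbs into one exact value."""
    total = sum(limb << (LIMB_BITS * w) for w, limb in enumerate(limbs))
    return float(Fraction(total, 2**FRAC_BITS))          # single final rounding

limbs = [0] * N_LIMBS
for x, y in [(1e16, 1.0), (1.0, 1.0), (-1e16, 1.0)]:
    scatter(scaled_product(x, y), limbs)
print(gather(limbs))   # 1.0, although 1e16 + 1.0 - 1e16 == 0.0 in doubles

In a GMP-based version the limbs would presumably be mpn words and the carries would be resolved
incrementally rather than at the end, but the data-movement pattern is the same: each core only ever
sees the slices of products that land in its own window.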
By the way, I am in no way opposed to the idea of high accuracy arithmetic, but I _am_ suggesting that
the best way to achieve this might involve a combination of hardware and software; and that it is
premature to be suggesting/mandating/dictating one particular hardware solution given the changing
nature of the hardware landscape. In particular, one of the challenges is devising programming models
and programming languages capable of making use of "many-core" architectures (think: 250-1000 cores).
Regards,
Dave Lester
--
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch
Telefon: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/
KIT - Universität des Landes Baden-Württemberg
und nationales Großforschungszentrum in der
Helmholtz-Gesellschaft