Bill:
Just a thought about how one might generate computations with
non-degenerate intervals, though possibly not ones resulting in
substantially more accurate results, etc.
Say you have a wide interval A, and wish to compute Y = f(A)
where you have a program f.
You are unhappy with the width of Y.
You divide the interval A = [a0, az] into sub-intervals
A1 = [a0,a1], A2 = [a1,a2], A3 = [a2,a3], ..., Az = [ay,az],
and compute Y' = Union(f(A1), f(A2), ..., f(Az)), which in some
circumstances might be much tighter than Y.
The sub-intervals that do not include the original endpoints, e.g.
[a1,a2], can have exact endpoints.
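A minimal sketch of this subdivide-and-take-the-hull scheme, in Python with a toy
interval type and an illustrative f (both hypothetical; a real implementation would
also round endpoints outward):

from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def hull(self, other):
        # Interval union (convex hull) of two intervals.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

def f(x):
    # Illustrative interval extension of f(t) = t*t - t, evaluated naively
    # (no outward rounding, and t*t treated as a product of independent
    # intervals), so a wide input interval gives a badly inflated result.
    prods = (x.lo * x.lo, x.lo * x.hi, x.hi * x.hi)
    return Interval(min(prods) - x.hi, max(prods) - x.lo)

def subdivided_eval(f, a, pieces):
    # Split A = [a0, az] into A1, ..., Az, evaluate f on each piece, and
    # return the hull of the results.  The interior cut points are ordinary
    # floats we choose ourselves, hence exact endpoints.
    cuts = [a.lo + i * (a.hi - a.lo) / pieces for i in range(pieces)] + [a.hi]
    result = f(Interval(cuts[0], cuts[1]))
    for i in range(1, pieces):
        result = result.hull(f(Interval(cuts[i], cuts[i + 1])))
    return result

A = Interval(0.0, 1.0)
print(f(A))                       # Interval(lo=-1.0, hi=1.0): the wide Y
print(subdivided_eval(f, A, 8))   # Interval(lo=-0.375, hi=0.125): tighter Y',
                                  # still enclosing the true range [-0.25, 0.0]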
As I said, just a thought.
RJF
On 8/12/2013 1:49 PM, G. William (Bill) Walster wrote:
Ulrich,
Is your point that if floating-point arithmetic had never been
introduced and fixed-point (or equivalently, rational)
arithmetic had remained the only way to compute with rational
numbers, then dealing with rounding errors would not have been a
motivation for introducing and developing interval arithmetic?
Of course, fixed-point or rational interval arithmetic could
still have been introduced to compute bounds on the results of
computing with non-rational real numbers. But, more likely,
this would have been done by simply increasing the accuracy of
such numbers' fixed-point approximations.
While this is an interesting thing about which to speculate, the
utility of computing with intervals now goes far beyond simply
bounding floating-point rounding errors. Interval
arithmetic can and should be used to bound errors in measured
inputs to computations, something that a fixed-point system can
do with an ULP of a fixed-point number used to represent an
interval. However, in this case the utility of a long
accumulator seems questionable.
So, to clarify, can you provide a real example of a computation
with non-degenerate interval inputs where computing with exact
input interval endpoints enables the exact non-degenerate
interval result to be substantially more accurately computed
than it is with exact floating-point representations of input
interval endpoints?
I know of none.
Thanks in advance,
Bill
On 8/12/13 7:24 AM, Ulrich Kulisch wrote:
Dear David,
I thank you again for your very interesting mail. It shows
indeed that the long accumulator method is the most energy
efficient. Compared with computing a dot product by the
conventional method in floating-point arithmetic, it avoids a
large number of data movements in the computer. For the
accumulation of a product, just the two factors have to be
read. The product is then added to the long accumulator on
the arithmetic unit. If the data are distributed over
different processors, parts of the dot product can be
computed on different processors. Then, at the end of the
accumulation of products, only the long partial results have
to be moved and added to form the final result of the dot
product.
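As a rough illustration of the exactness claim, here is a small Python sketch in
which arbitrary-precision rationals stand in for the long fixed-point accumulator
(the data values are made up for the example):

from fractions import Fraction

def exact_dot(xs, ys):
    # Fraction(x) converts a float to the exact rational it represents, so the
    # running sum below plays the role of the long fixed-point accumulator:
    # every product is absorbed without any rounding.
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)
    return acc

# An example where ordinary floating-point accumulation loses information:
xs = [1e16, 1.0, -1e16, 1.0]
ys = [1.0,  1.0,  1.0,  1.0]
print(sum(x * y for x, y in zip(xs, ys)))  # 1.0 -- the first 1.0 vanishes into 1e16
print(float(exact_dot(xs, ys)))            # 2.0 -- exact sum, rounded only once at the end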
This is a very fast process, and as a side effect, the
result is exact. No error analysis is necessary. All that is
needed is a tiny local memory on the arithmetic unit. The
fact that about 4000 bits are needed for IEEE 754 double
precision is due to the large exponent range of this data
format. For interval arithmetic a much smaller exponent range
would suffice. For the /370 long data format only about 1200
bits were needed, and I do not remember a single application
where this was not enough. For IEEE 754 single precision
about 600 bits suffice.
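For the IEEE 754 formats, the rough arithmetic behind these bit counts can be written
out as follows (a back-of-the-envelope sketch; the number of guard bits reserved for
carries is an assumption, not the exact published figure):

# Rough width of a long accumulator that can hold ANY sum of products of two
# floats of a given format exactly.  The guard-bit count is an assumption,
# chosen so that a very large number of products can be added without overflow.
def accumulator_bits(e_max, e_min_subnormal, guard_bits=88):
    integer_bits  = 2 * (e_max + 1)           # products stay below 2**(2*(e_max+1))
    fraction_bits = 2 * abs(e_min_subnormal)  # smallest product weight is 2**(2*e_min_subnormal)
    return integer_bits + fraction_bits + guard_bits

print(accumulator_bits(e_max=1023, e_min_subnormal=-1074))  # 4284: "about 4000 bits" (binary64)
print(accumulator_bits(e_max=127,  e_min_subnormal=-149))   # 642:  "about 600 bits"  (binary32)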
The technique described above was well established on old
calculators (1935 and earlier). I attach pictures of four
such calculators. Several of these provided two long
registers, which allowed accumulation of two long sums or dot
products simultaneously. The technique of using a long
accumulator can be traced back to the early computer by G. W.
Leibniz in 1693 (the picture shows 8 decimal digits for the
summands and 16 digits for the long accumulator). In the old
days this technique was offered and recommended as the most
effective way of using a computer. And this is still true
today. It actually is the most natural way of accumulating
numbers or products.
With best regards
Ulrich
On 11.08.2013 11:43, David Lester wrote:
Ulrich,
Here are some energy consumption figures for typical operations on 28nm geometry, running at 800mV:
28nm (0.8V)
16b integer MADD 1pJ // NB combined mul/add
32b integer MADD 4pJ
32b floating-point MAF 6pJ // NB combined mul/add
64b floating-point MAF 20pJ
Read from on-chip SRAM 1.5pJ/byte
Read from co-packaged DRAM 50pJ/byte
Read from off-chip DRAM 250pJ/byte
Read from SSD (6Gbps SATA) 5000pJ/byte
20mm wires (50% transitions) 7pJ/byte
chip-chip parallel link 25pJ/byte
chip-chip serial link 50pJ/byte
And to give you an idea of scale: 1pJ is about 60,000 electron volts (though not strictly true, we
could imagine each 16-bit arithmetic operation as involving about 38,000 electrons).
As you can see, once we have moved data to the local SRAM, we want to do as much as possible with it.
Indeed it is better to think of programming the next generation of devices as if they were GPU-like,
rather than as a single processor.
Our target is 256 cores per 1cm^2 die -- provided we can get enough SRAM to keep the CPUs fed --
running at 1W per chip. Clocked at ~ 0.5-1.0 GHz. Co-packaged DRAM: as much as we can squeeze in.
Ideally, each of our cores -- or at least groups of 8 -- will be running the same instruction on
different data (SIMD).
-----------------------------
Now, as I understand it, you are suggesting we trade core numbers for complexity. Can I suggest we
might get 64 complex cores?
What the energy numbers above indicate is that we need to pay great attention to the movement
of data; the operations themselves are of secondary importance. For example, it costs about four
times as much energy to move a word from one side of the chip to the other as it does to do a pair
of 32-bit float operations.
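To put rough numbers on that, here is a small calculation based on the figures above (a sketch
only; the vector length and the assumption that every operand is read exactly once are arbitrary
choices):

# Energy estimate for a length-n dot product on 64-bit floats, using the
# per-operation figures quoted earlier (all values in pJ; illustrative only).
MAF64_PJ  = 20    # 64b floating-point multiply-add
SRAM_PJ_B = 1.5   # read from on-chip SRAM, per byte
DRAM_PJ_B = 250   # read from off-chip DRAM, per byte

n = 1024
bytes_read = 2 * n * 8        # two vectors of n doubles, read once each
arithmetic = n * MAF64_PJ     # one fused multiply-add per element pair

print(arithmetic)                 # 20480 pJ for the arithmetic itself
print(bytes_read * SRAM_PJ_B)     # 24576.0 pJ if the data is already in on-chip SRAM
print(bytes_read * DRAM_PJ_B)     # 4096000 pJ if it must come from off-chip DRAM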
Now, to get the best performance out of this device, we will ideally need a compiler that optimises
against the energy consumption. Since we do not yet have the device, we've not built the compiler, so
I have not run the numbers. Depending on how we are using the dot-product, it may make sense to
pre-load a pair of matrices into the locations needed. Alternatively, we could think about the
dataflow through the device.
Personally, I think I'd be looking at getting a parallel version of GMP up and running early, and using this as a vehicle for expressing a large fixed-point accumulator across multiple CPUs. I'm prepared to concede that this may not prove to be the best solution, but the cost of getting all the floats to one CPU for a dot-product operation may be prohibitive.
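A rough sketch of that direction, with Python's built-in big integers standing in for GMP; the
fixed-point scale and the round-robin split across "cores" are assumptions of the sketch, not a
proposed design:

from fractions import Fraction

# Every product of two binary64 floats is an integer multiple of 2**-2148, so
# scaling by 2**2148 turns each product into an exact (big) integer.  One big
# integer per core then acts as that core's slice of the long accumulator.
SCALE = 2148

def core_accumulate(xs, ys):
    acc = 0
    for x, y in zip(xs, ys):
        # Fraction(x) is the exact rational value of the float, so this
        # conversion and addition involve no rounding at all.
        acc += int(Fraction(x) * Fraction(y) * 2**SCALE)
    return acc

def distributed_dot(xs, ys, cores=4):
    # Each core accumulates its share locally; only one long partial result
    # per core has to be moved and merged at the end.
    partials = [core_accumulate(xs[i::cores], ys[i::cores]) for i in range(cores)]
    return sum(partials) / 2**SCALE   # a single rounding, at the very end

xs = [2.0**60, 1.0, -2.0**60, 3.0]
ys = [1.0,     1.0,  1.0,     1.0]
print(distributed_dot(xs, ys))        # 4.0, where plain float accumulation gives 3.0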
By the way, I am in no way opposed to the idea of high accuracy arithmetic, but I _am_ suggesting that
the best way to achieve this might involve a combination of hardware and software; and that it is
premature to be suggesting/mandating/dictating one particular hardware solution given the changing
nature of the hardware landscape. In particular, one of the challenges is devising programming models
and programming languages capable of making use of "many-core" architectures (think: 250-1000 cores).
Regards,
Dave Lester
--
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch
Telefon: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/
KIT - Universität des Landes Baden-Württemberg
und nationales Großforschungszentrum in der
Helmholtz-Gesellschaft