
Re: Please listen to Ulrich here... // Just to you





Thanks, Richard.

I agree that computing Y' = Union(f(A1), f(A2), ..., f(Az)) might produce a tighter enclosure of Y than a naive interval implementation of f(A). However, I don't see why a long accumulator is required to subdivide the original interval A or to compute Y'.

Cheers,

Bill


On 8/12/13 2:34 PM, Richard Fateman wrote:
Bill:

Just a thought about how one might generate computations with non-degenerate intervals, though
possibly not ones resulting in substantially more accurate results, etc.

Say you have a wide interval A, and wish to compute Y = f(A), where you have a program f.

You are unhappy with the width of Y.
You divide the interval A = [a0, az] into sub-intervals A1 = [a0, a1], A2 = [a1, a2], A3 = [a2, a3], ..., Az = [ay, az],

and compute Y' = Union(f(A1), f(A2), ..., f(Az)), which in some circumstances might be much tighter than Y.

The sub-intervals that do not include the original endpoints, e.g. [a1, a2], can have exact endpoints.
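
For illustration, a minimal Python sketch of the subdivision idea (the toy Interval class and the sample function f are illustrative stand-ins, not taken from any particular interval library; exact rational endpoints are used here so that no outward rounding is needed):

from fractions import Fraction

class Interval:
    """Toy interval type with exact rational endpoints."""
    def __init__(self, lo, hi):
        self.lo, self.hi = Fraction(lo), Fraction(hi)

    def __mul__(self, other):
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return Interval(min(ps), max(ps))

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __repr__(self):
        return f"[{float(self.lo):.4f}, {float(self.hi):.4f}]"

def f(x):
    # Sample function with a dependency problem: naive interval evaluation overestimates.
    return x * x - x

def subdivide(a, n):
    """Split A = [a0, az] into n sub-intervals with shared (exact) interior endpoints."""
    step = (a.hi - a.lo) / n
    return [Interval(a.lo + i * step, a.lo + (i + 1) * step) for i in range(n)]

def hull(intervals):
    """Interval hull of the union of the pieces."""
    return Interval(min(i.lo for i in intervals), max(i.hi for i in intervals))

A  = Interval(0, 1)
Y  = f(A)                                       # naive: one wide evaluation
Yp = hull([f(Ai) for Ai in subdivide(A, 16)])   # Y' = Union(f(A1), ..., f(Az))
print(Y, Yp)                                    # [-1.0000, 1.0000] [-0.3125, 0.0625]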

As I said, just a thought.

RJF



On 8/12/2013 1:49 PM, G. William (Bill) Walster wrote:

Ulrich,

Is your point that if floating-point arithmetic had never been introduced, and fixed-point (or, equivalently, rational) arithmetic had remained the only way to compute with rational numbers, then dealing with rounding errors would not have been a motivation for introducing and developing interval arithmetic? Of course, fixed-point or rational interval arithmetic could still have been introduced to compute bounds on the results of computing with non-rational real numbers. But, more likely, this would have been done simply by increasing the accuracy of such numbers' fixed-point approximations.

While this is an interesting thing about which to speculate, the utility of computing with intervals now goes far beyond simply bounding floating-point rounding errors. Interval arithmetic can and should be used to bound errors in measured inputs to computations, something a fixed-point system can do to within one ULP of the fixed-point numbers used to represent an interval. However, in this case the utility of a long accumulator seems questionable.

So, to clarify: can you provide a real example of a computation with non-degenerate interval inputs in which computing with exact input interval endpoints allows the exact non-degenerate interval result to be computed substantially more accurately than it is with exact floating-point representations of the input interval endpoints?

I know of none.

Thanks in advance,

Bill


On 8/12/13 7:24 AM, Ulrich Kulisch wrote:
Dear David,

I thank you again for your very interesting mail. It shows indeed that the long accumulator method is the most energy-efficient one. Compared with computing a dot product by the conventional method in floating-point arithmetic, it avoids a large number of data movements in the computer. To accumulate a product, just the two factors of the product have to be read; the product is then added to the long accumulator on the arithmetic unit. If the data are distributed over different processors, parts of the dot product can be computed on different processors. Then, at the end of the accumulation of products, only the long results have to be moved and added into the final result of the dot product.

This is a very fast process, and as a side effect the result is exact. No error analysis is necessary. All that is needed is a tiny local memory on the arithmetic unit. That about 4000 bits are needed for IEEE 754 double precision is due to the large exponent range of this data format. For interval arithmetic a much smaller exponent range would suffice. For the /370 long data format only about 1200 bits were needed, and I do not remember a single application where this was not enough. For IEEE 754 single precision about 600 bits suffice.
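
For illustration, a minimal Python sketch of the mechanism (Python's unbounded integers stand in for the dedicated wide register mentioned above; SHIFT is simply a generous offset so that every product of finite doubles lands at a non-negative bit position):

import math

SHIFT = 2400   # offset so any product of two finite doubles maps to a non-negative position

def exact_dot(xs, ys):
    """Exact dot product of two float vectors via a long fixed-point accumulator."""
    acc = 0                                        # the 'long accumulator' (unbounded int)
    for x, y in zip(xs, ys):
        mx, ex = math.frexp(x)                     # x = mx * 2**ex with 0.5 <= |mx| < 1
        my, ey = math.frexp(y)
        ix, iy = int(mx * 2**53), int(my * 2**53)  # exact integer significands
        acc += (ix * iy) << (ex + ey - 106 + SHIFT)  # exact integer addition, no rounding
    return acc / (1 << SHIFT)                      # a single rounding, at the very end

# An ill-conditioned example: naive float summation cancels everything away.
xs = [1e16, 1.0, -1e16]
ys = [1.0,  1.0,  1.0]
print(sum(x * y for x, y in zip(xs, ys)))          # 0.0
print(exact_dot(xs, ys))                           # 1.0 (exact)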

The technique described above was well established on old calculators (1935 and earlier). I attach pictures of four such calculators. Several of these provided two long registers, which allowed two long sums or dot products to be accumulated simultaneously. The technique of using a long accumulator can be traced back to the early computer by G. W. Leibniz (1693); the picture shows 8 (decimal) digits for the summands and 16 digits for the long accumulator. In the old days this technique was offered and recommended as the most effective way of using a computer, and this is still true today. It is actually the most natural way of accumulating numbers or products.

With best regards
Ulrich



 
On 11.08.2013 11:43, David Lester wrote:
Ulrich,

Here are some energy consumption figures for typical operations on 28nm geometry, running at 800mV:

28nm (0.8V)

16b integer MADD                1pJ          // NB combined mul/add
32b integer MADD                4pJ
32b floating-point MAF          6pJ          // NB combined mul/add
64b floating-point MAF         20pJ
Read from on-chip SRAM          1.5pJ/byte
Read from co-packaged DRAM     50pJ/byte
Read from off-chip DRAM       250pJ/byte
Read from SSD (6Gbps SATA)   5000pJ/byte
20mm wires (50% transitions)    7pJ/byte
chip-chip parallel link        25pJ/byte
chip-chip serial link          50pJ/byte

And to give you an idea of scale: 1pJ is about 60,000 electron volts (though not strictly true, we
could imagine each 16-bit arithmetic operation as involving about 38,000 electrons).

As you can see, once we have moved data to the local SRAM, we want to do as much as possible with it.
Indeed it is better to think of programming the next generation of devices as if they were GPU-like,
rather than as a single processor.

Our target is 256 cores per 1cm^2 die -- provided we can get enough SRAM to keep the CPUs fed --
running at 1W per chip. Clocked at ~ 0.5-1.0 GHz. Co-packaged DRAM: as much as we can squeeze in.

Ideally, each of our cores -- or at least groups of 8 -- will be running the same instruction on
different data (SIMD).

-----------------------------

Now, as I understand it, you are suggesting we trade core numbers for complexity. Can I suggest we
might get 64 complex cores?

What the energy numbers above are indicating is that we need to pay great attention to the movement
of data; the operations themselves are of secondary importance. For example, it costs about 4 times
as much energy to get a word from one side of the chip to the other as it does to do a pair of 32-bit
float operations.

Now, to get the best performance out of this device, we will ideally need a compiler that optimises
against the energy consumption. Since we do not yet have the device, we've not built the compiler, so
I have not run the numbers. Depending on how we are using the dot-product, it may make sense to
pre-load a pair of matrices into the locations needed. Alternatively, we could think about the
dataflow through the device.

Personally, I think I'd be looking at getting a parallel version of GMP up and running early, and using this as a vehicle for expressing a large fixed-point accumulator across multiple CPUs. I'm prepared to concede that this may not prove to be the best solution, but the cost of getting all the floats to one CPU for a dot-product operation may be prohibitive.
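
For illustration, a toy sketch of that shape in plain Python rather than GMP (Python's unbounded integers stand in for GMP's mpz, a process pool stands in for the many cores, and the 2400-bit offset is just a generous over-allocation): each worker accumulates its slice of the products exactly into a long fixed-point integer, only those long partial results are moved and combined, and a single rounding happens at the very end.

import math
from concurrent.futures import ProcessPoolExecutor

SHIFT = 2400  # wide enough that any product of finite doubles lands at a non-negative bit position

def partial_accumulator(chunk):
    """Exactly accumulate one slice of the dot product into a big-integer 'long accumulator'."""
    acc = 0
    for x, y in chunk:
        mx, ex = math.frexp(x)
        my, ey = math.frexp(y)
        ix, iy = int(mx * 2**53), int(my * 2**53)      # exact integer significands
        acc += (ix * iy) << (ex + ey - 106 + SHIFT)    # exact fixed-point addition
    return acc                                         # only this long value leaves the worker

def distributed_exact_dot(xs, ys, workers=4):
    pairs = list(zip(xs, ys))
    chunks = [pairs[i::workers] for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_accumulator, chunks)
    return sum(partials) / (1 << SHIFT)                # combine long results, round once

if __name__ == "__main__":
    xs = [1e16, 1.0, -1e16, 1e-8]
    ys = [1.0,  1.0,  1.0,  1.0]
    print(distributed_exact_dot(xs, ys))               # 1.00000001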

By the way, I am in no way opposed to the idea of high accuracy arithmetic, but I _am_ suggesting that
the best way to achieve this might involve a combination of hardware and software; and that it is
premature to be suggesting/mandating/dictating one particular hardware solution given the changing
nature of the hardware landscape. In particular, one of the challenges is devising programming models
and programming languages capable of making use of "many-core" architectures (think: 250-1000 cores).

Regards,

Dave Lester



-- 
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch

Phone: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/

KIT - University of the State of Baden-Württemberg
and National Research Center of the
Helmholtz Association