
Re: discussion period begins, until Jan. 26: "natural interval extension": friendly amendment to M001.02



Dear colleagues,

I wonder whether this computing model is realistic. I agree that interrupting a computation may be necessary; on slow computers this may even be the case during a dot product computation.
But I doubt that this is necessary on a modern computer. Let us assume that the two vectors have 10^5 components each, which is quite a large number. Computing the dot product then requires twice as many arithmetic operations, i.e. 2 x 10^5. With pipelining, this number can even be reduced by a factor of 2. On a modern processor that performs 10^15 operations per second, these can be done in 10^-10 of a second.
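
The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic in this message; the function name and its parameters are illustrative assumptions, and the model (one operation per pipeline slot, no memory latency) is deliberately crude.

```python
def dot_product_time(n, ops_per_second, pipelined=True):
    """Rough time in seconds for a dot product of two n-component vectors."""
    ops = 2 * n  # about n multiplications and n additions
    if pipelined:
        ops //= 2  # assumption: multiply and add overlap in the pipeline
    return ops / ops_per_second


# Figures taken from the message: 10^5 components, 10^15 operations per second.
print(dot_product_time(10**5, 10**15))                   # pipelined
print(dot_product_time(10**5, 10**15, pipelined=False))  # not pipelined
```

Under these assumptions the pipelined case comes out at 10^-10 seconds, matching the figure given above.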


Best regards
Ulrich



On 28.01.2016 at 02:38, David Biancolin wrote:
(I'm one of the Berkeley students that implemented the hardware EDP accelerator) 

It seems to me that an interruptible EDP is absolutely necessary; as you point out, we definitely need to allow the OS to preempt the currently running thread, and moreover loads/stores generated by the EDP unit itself could generate exceptions of their own. I don't think it is particularly difficult to enable restartable or precise exception behavior in the unit. Yes, this might require writing out the state of the accumulator to memory, but this is something we already need to support for the rest of the architectural state of the machine. For more aggressive machines with vector or wide packed-SIMD extensions, the state of the register files is much greater than that of the accumulator. We just need to handle the accumulator as we would vector register state.
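
To make the idea of a restartable EDP concrete, here is a minimal software sketch in Python. It is not the hardware design discussed in this thread: the exact accumulator is modelled with `fractions.Fraction`, and the names `edp_step`, `edp` and `slice_len` are hypothetical. The saved state is just the pair (next index, accumulator), which is what would have to be written out on a context switch.

```python
from fractions import Fraction


def edp_step(x, y, state, max_steps):
    """Run at most max_steps multiply-accumulate steps from a saved state.

    state is (next index, exact accumulator); the updated state is returned.
    """
    i, acc = state
    end = min(len(x), i + max_steps)
    while i < end:
        acc += Fraction(x[i]) * Fraction(y[i])  # exact, no rounding
        i += 1
    return i, acc


def edp(x, y, slice_len=None):
    """Exact dot product, optionally 'interrupted' every slice_len steps."""
    n = len(x)
    step = n if slice_len is None else max(1, slice_len)
    state = (0, Fraction(0))
    while state[0] < n:
        # Between calls the state could be spilled to memory and restored.
        state = edp_step(x, y, state, step)
    return float(state[1])  # single rounding at the very end


x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
print(edp(x, y))               # uninterrupted
print(edp(x, y, slice_len=1))  # interrupted after every step
```

Because the accumulator is exact, the result does not depend on where the computation is interrupted; both calls return the same value, whereas a naive floating-point loop over these inputs loses the middle term.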

Indeed, I don't think there will be a place for EDP in small cores (e.g. the 35K transistor count cores you mentioned). However, these microcontrollers often lack even a floating point unit (for precisely the reason you gave).

- D 

On Wed, Jan 27, 2016 at 3:25 AM, David Lester <dlester@xxxxxxxxxxxx> wrote:
Ulrich,

If I understand what you have previously written, what you envisage is that
an EDP for an arbitrary length pair of vectors will be executed to completion
without interrupts. This makes the execution of real-time devices impossible.
It also messes up the OS rendering of the screen on your desk-top/lap-top.

So, what I think you want is that there is the possibility of interrupts after
each pair of float/double data reads. In which case, for a general purpose
supercomputer, the long accumulator needs to be flush-able so that the
Operating System scheduler can schedule someone else’s EDP-ridden
program. And so forth. And because there will always be only a limited
number of long accumulators on a processor, this will inevitably — in
the worst case — cause flushing to main-memory.

A typical minimal modern processor (ARM6) has only 35,000 transistors.
Each bit of SRAM (your long accumulator) has 6 transistors. Thus just six of
your long accumulators will _double_ the size of the core’s foot-print.
Personally I’d go with the idea of using all that extra SRAM in a more
general way as caches or scratchpad, but that’s just me.
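
The transistor arithmetic above can be sketched as follows. The 1024-bit accumulator width is an assumption taken from the earlier message quoted further down in this thread; the function name is illustrative, and control logic around the SRAM cells is ignored.

```python
TRANSISTORS_PER_SRAM_BIT = 6   # figure from the message
CORE_TRANSISTORS = 35_000      # figure from the message


def accumulator_transistors(bits, count):
    """Transistors needed for `count` long accumulators of `bits` bits each."""
    return bits * TRANSISTORS_PER_SRAM_BIT * count


extra = accumulator_transistors(1024, 6)  # assumed 1024-bit accumulators
growth = extra / CORE_TRANSISTORS
print(extra, round(growth, 2))
```

With these figures, six accumulators need 36,864 transistors, slightly more than the 35,000 of the core itself, which is where the doubling comes from.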

The alternative is that we are considering a specialised piece of single-user
hardware which is there only for Matrix/Vector processing. If that’s the case
then by all means build it — it won’t cost much, say $0.5-1M. A sort of
bolt-on hardware accelerator, as it were.

But I’m struggling to see the usefulness of EDP in a standard for general
purpose processors.

Dave Lester


On 27 Jan 2016, at 09:45, Ulrich Kulisch <ulrich.kulisch@xxxxxxx> wrote:

Dear David,

On 26.01.2016 at 15:26, David Lester wrote:
Dear Ulrich,

You appear to have forgotten our previous discussion.

What you are actually trading is latency on interrupts vs speed on 
a highly specialised bit of hardware.

If the EDP is to be a general operation it has to be possible
for the code to be interrupted. If this code is to be re-entrant
then the entire 1024 bit accumulator(s?) have to be flushed
to interrupt stack.

In my book Computer Arithmetic and Validity, the possibility of interrupting a dot product computation is considered. See Figures 8.17 and 8.18 in the second edition and the text around these figures.

However, I am of the opinion that a dot product computation never should and never needs to be interrupted. I repeat from my mail below:

The simplest and fastest way for computing a dot product is to compute it exactly. By pipelining, it can be computed in the time the processor needs to read the data, i.e., it comes with utmost speed.  

So the question is: would you interrupt reading the data into the processor? I think that if another computation really has higher priority (which I doubt), it would be better to place the interrupt before the dot product operation.




-- 
Karlsruher Institut für Technologie (KIT)
Institut für Angewandte und Numerische Mathematik
D-76128 Karlsruhe, Germany
Prof. Ulrich Kulisch
KIT Distinguished Senior Fellow
 
Phone: +49 721 608-42680
Fax: +49 721 608-46679
E-Mail: ulrich.kulisch@xxxxxxx
www.kit.edu
www.math.kit.edu/ianm2/~kulisch/

KIT - University of the State of Baden-Württemberg
and National Research Center of the
Helmholtz Association