Re: Please listen to Ulrich here...
As a supercomputer designer (http://apt.cs.man.ac.uk/projects/SpiNNaker/SpiNNchip/), with a new
€1 billion funding stream as part of the Human Brain Project (http://www.humanbrainproject.eu/),
I had been intending to sit this one out.
The problem as I see it is that Ulrich is designing a machine for the 1980s, and utterly failing
to address the concerns of anyone actually designing or building a machine today. And that concern
is energy.
To drive energy costs down low enough to hit the 20-25 MW power budget for exascale, we need to
maximise the number of compute cores (at the expense of their complexity and clock speed),
localise memory, and cut out extraneous operations. The significance of the energy budget is
that with today's cutting-edge Intel hardware, you would need on the order of $100 billion p.a.
to pay the electricity bills.
Most designers use stock Intel chips, because there is generally thought to be insufficient
market for supercomputer chip development to be cost-effective. Because ARM is both
energy-efficient and a virtual foundry (and Steve is the original ARM designer), we have the
option to build a custom chip, unlike many of our competitors. ARM, by the way, is outselling
Intel to the extent that last year it sold over 15 billion (licensed) cores, which is greater
than the entirety of all other manufacturers' chip production since 1967.
Talking within the SC community, it seems we are all converging on essentially the same solution:
(1) Energy efficient low-performance cores.
(2) Co-packaged DRAM, including multiple layers of 3D stacking.
(3) Vectorized SIMD architecture.
(4) Paring down the complexity of the individual cores, and doing rarely used operations in
software (or perhaps adding co-processors).
The argument against co-processors is that they consume energy and die area that could be
more productively used to perform common rather than unusual computations.
For example, our partner team at FZ Jülich (the German supercomputing centre) has reported that
their current supercomputer neural simulations make frighteningly little use of the FPU.
Most of what is needed for our application area is memory and fast MPI. This is what
SpiNNaker is optimised for. It has no FP _at_all_, has local co-packaged DRAM, and has a fast
packet-switched network optimised for the task in hand: lots and lots of very small (4-byte)
packets, rather than the more modest number of 1-2 KB packets that MPI implies.
I'll say that again: no FPU. It's not heavily used, and it's cheaper and more energy-efficient
to do a software simulation of float and double, which is a reasonable architectural
compromise. Obviously the LINPACK numbers are non-competitive, but there is a growing
realisation that LINPACK may not be very indicative of future supercomputing needs.
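To give a flavour of what a software simulation of float looks like, here is a minimal sketch
(an illustration only, certainly not our production library) of single-precision multiplication
done purely with integer operations. It assumes normalised operands, ignores NaNs, infinities,
subnormals and exponent overflow, and rounds by truncation; operands and result are raw
IEEE 754 single-precision bit patterns.

    /* Illustrative soft-float multiply: normalised operands only, no NaN/Inf/
     * subnormal handling, no exponent range checks, truncation rounding.     */
    #include <stdint.h>

    uint32_t soft_fmul(uint32_t a, uint32_t b)       /* raw binary32 patterns */
    {
        uint32_t sign = (a ^ b) & 0x80000000u;
        int32_t  ea   = (int32_t)((a >> 23) & 0xFF) - 127;   /* unbiased exps */
        int32_t  eb   = (int32_t)((b >> 23) & 0xFF) - 127;
        uint64_t ma   = (a & 0x007FFFFFu) | 0x00800000u;  /* implicit leading 1 */
        uint64_t mb   = (b & 0x007FFFFFu) | 0x00800000u;

        uint64_t m = ma * mb;            /* exact 48-bit product of 24-bit significands */
        int32_t  e = ea + eb;

        if (m & (1ull << 47)) {          /* product significand in [2,4): renormalise */
            m >>= 1;
            e  += 1;
        }
        m >>= 23;                        /* keep 24 significant bits, truncate the rest */

        return sign | ((uint32_t)(e + 127) << 23) | ((uint32_t)m & 0x007FFFFFu);
    }

Addition is messier (operand alignment and cancellation), but the principle is the same.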
The question -- as I see it -- is not whether an exact dot product can be done in hardware, but whether it can be done cost-effectively. That is, in a world in which volume trumps everything, is there a large enough market for the hardware development costs to be amortised over likely sales? Otherwise, letting you clever people write a software emulation would seem the most cost-effective way to go.
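To make the software-emulation option concrete, here is a heavily simplified sketch of a
Kulisch-style long accumulator for binary64 dot products -- again an illustration only, not
code we ship. It assumes a compiler providing unsigned __int128, handles only finite
non-negative terms (negative terms would need a second accumulator or two's-complement
handling), and omits the final correctly rounded conversion back to binary64.

    /* Sketch of a software exact dot product using a wide fixed-point
     * accumulator.  Assumes GCC/Clang unsigned __int128; finite, non-negative
     * inputs only; negative terms and the final rounding step are omitted.   */
    #include <stdint.h>
    #include <string.h>

    #define ACC_BITS  4288              /* spans every finite binary64 product */
    #define ACC_WORDS (ACC_BITS / 64)   /* 67 x 64-bit words                   */
    #define ACC_BIAS  2148              /* bit index of 2^0 in the accumulator */

    typedef struct { uint64_t w[ACC_WORDS]; } longacc;

    static void acc_clear(longacc *a) { memset(a->w, 0, sizeof a->w); }

    /* Decompose finite x >= 0 into integer significand m and exponent e with
     * x == m * 2^e (m < 2^53); works for normals and subnormals alike.       */
    static void decode(double x, uint64_t *m, int *e)
    {
        uint64_t bits; memcpy(&bits, &x, sizeof bits);
        uint64_t E = (bits >> 52) & 0x7FF, f = bits & 0xFFFFFFFFFFFFFull;
        *m = E ? ((1ull << 52) | f) : f;
        *e = (int)(E ? E : 1) - 1075;
    }

    /* Add p * 2^e (p < 2^106) into the fixed-point accumulator exactly.      */
    static void acc_add(longacc *a, unsigned __int128 p, int e)
    {
        int pos = ACC_BIAS + e, word = pos >> 6, bit = pos & 63;
        unsigned __int128 carry = (unsigned __int128)a->w[word]
                                + (uint64_t)(p << bit);      /* low word      */
        a->w[word] = (uint64_t)carry;
        carry = (carry >> 64) + (p >> (64 - bit));           /* higher bits   */
        for (int i = word + 1; carry != 0 && i < ACC_WORDS; i++) {
            carry += a->w[i];
            a->w[i] = (uint64_t)carry;
            carry >>= 64;
        }
    }

    /* Accumulate the sum over i of x[i]*y[i] with no rounding error at all.  */
    static void exact_dot(longacc *acc, const double *x, const double *y, int n)
    {
        for (int i = 0; i < n; i++) {
            uint64_t mx, my; int ex, ey;
            decode(x[i], &mx, &ex);
            decode(y[i], &my, &ey);
            acc_add(acc, (unsigned __int128)mx * my, ex + ey);  /* exact product */
        }
    }

A caller clears the accumulator, streams the products in, and rounds once at the very end;
because the accumulator spans every representable product position (plus carry guard bits),
the sum is exact regardless of ordering or cancellation.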
I should point out how the current ARM co-processor architecture violates everything that Ulrich
believes is desirable. The FPU is not integrated into the main ALU pathway. Instead, the opcode is
trapped out as "unrecognised" and offered to the co-processors attached to the internal bus.
Software compatibility is maintained by raising an interrupt if the co-processor is missing, and
executing a software simulation of the operation.
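Schematically, that software path looks like the sketch below. The helper names are invented
for illustration (they are not a real ARM or SpiNNaker API), but the shape is the standard one:
the undefined-instruction trap hands over the faulting opcode, software performs the operation
against the saved register state, and execution resumes at the next instruction.

    #include <stdint.h>

    /* Register state saved by the trap entry stub; the layout is illustrative. */
    typedef struct { uint32_t r[16]; uint64_t fp[32]; } saved_ctx;

    /* Hypothetical helpers -- invented names; a real kernel supplies its own
     * decoder and soft-float library.                                          */
    typedef struct { int op, dest, src1, src2; } fp_insn;
    extern int      is_fp_opcode(uint32_t opcode);
    extern fp_insn  decode_fp(uint32_t opcode);
    extern uint64_t soft_fp_execute(fp_insn insn, const uint64_t *fp);
    extern void     deliver_undefined_fault(saved_ctx *ctx);

    /* Called when no co-processor accepts an opcode and the core raises its
     * undefined-instruction exception.  If it is a floating-point operation we
     * emulate it against the saved register file and return, letting the entry
     * stub resume at the next instruction; otherwise the fault is genuine and
     * is reported upwards.                                                     */
    void undefined_instruction_handler(saved_ctx *ctx, uint32_t opcode)
    {
        if (is_fp_opcode(opcode)) {
            fp_insn insn = decode_fp(opcode);
            ctx->fp[insn.dest] = soft_fp_execute(insn, ctx->fp);
        } else {
            deliver_undefined_fault(ctx);
        }
    }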
So my question is: is our proposed chip P1788-compliant? Provided that we do everything in software, I cannot see why it should not be. Am I missing something?
Are you requiring me to provide on-chip co-processors to be compliant? If so, how does this differ from off-chip interrupt handling for unrecognised opcodes? Are you mandating how the FPU is integrated into the ALU? If so, do you have sufficient current architectural experience to make sensible selections?
Regards,
Dave Lester
Advanced Processor Technology Group
The University of Manchester
Manchester M13 9PL
PS Baker: If I don't have current listserv permissions, please feel free to forward to the community.
On 8 Aug 2013, at 01:56, Vincent Lefevre wrote:
> On 2013-08-07 02:24:28 -0700, Dan Zuras Intervals wrote:
>> When Ulrich talks about problems with exact dot product he has some
>> experience in the matter. More than the rest of us put together.
>> If we are about to have a standard that admits the possibility of
>> a member that CANNOT do an exact EDP, all our work will be wasted.
>> Please consider rewording this document in Ulrich's favor this time.
>> He is not just blowing smoke. It is hard but it is necessary. Or,
>> at least if we make it so.
>
> I've already said that in the list in the past, but what Ulrich didn't
> consider is multiple precision. He didn't show that multiple precision
> was not sufficient, and there's no reason why it wouldn't be.
>
> Then I agree with Paul. Anyone may have his own wishes for his own
> applications. For instance, why not some well-known floating-point
> primitives, such as a TwoSum operation (you may remember that some
> variant of this was considered in the revision of IEEE 754 and it was
> removed just because of under-specification of exceptions[*], something
> that never occurs in codes seen in practice). And FYI, evaluation of
> math function is in every system library (and some of these functions
> are even required by P1788), so that such primitives may be useful to
> many users. Much more than CA/EDP.
>
> [*] http://grouper.ieee.org/groups/754/email/msg02209.html
>
> Now, I wouldn't like P1788 to deviate from interval arithmetic; let's
> recall that the PAR title is "Standard for Interval Arithmetic" and
> its scope is also exclusively on interval arithmetic.
>
> If Ulrich wants CA to be implemented (in hardware), he can talk to a
> processor vendor and try to convince him. He doesn't need a standard
> for that. And if it is so useful, then other vendors will implement
> it as well...
>
> Also note that CA takes resources. Some processors, like the Itanium,
> do not even have a division instruction.
>
> --
> Vincent Lefèvre <vincent@xxxxxxxxxx> - Web: <http://www.vinc17.net/>
> 100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)