Re: Please listen to Ulrich here...
Ulrich,
Here are some energy consumption figures for typical operations on 28nm geometry, running at 800mV:
28nm (0.8V):
16b integer MADD                 1 pJ        // NB combined mul/add
32b integer MADD                 4 pJ
32b floating-point MAF           6 pJ        // NB combined mul/add
64b floating-point MAF          20 pJ
Read from on-chip SRAM         1.5 pJ/byte
Read from co-packaged DRAM      50 pJ/byte
Read from off-chip DRAM        250 pJ/byte
Read from SSD (6Gbps SATA)    5000 pJ/byte
20mm wires (50% transitions)     7 pJ/byte
Chip-to-chip parallel link      25 pJ/byte
Chip-to-chip serial link        50 pJ/byte
And to give you an idea of scale: 1pJ is about 6 million electron volts -- or, at our 0.8V supply,
each 1pJ 16-bit MADD (loosely speaking) moves about 8 million electrons' worth of charge.
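To make that concrete, here is a back-of-the-envelope comparison using the figures above (the
vector length of 1024 and the assumption that every operand is fetched exactly once are mine,
purely for illustration):

/* Back-of-the-envelope: energy to compute a dot product of two 1024-element
 * single-precision vectors, using the 28nm figures above.  Assumes each
 * 4-byte operand is fetched exactly once and the running sum stays in a
 * register; illustrative only.
 */
#include <stdio.h>

int main(void)
{
    const double PJ_MAF32     = 6.0;     /* 32b floating-point MAF          */
    const double PJ_SRAM_BYTE = 1.5;     /* on-chip SRAM read, per byte     */
    const double PJ_DRAM_BYTE = 250.0;   /* off-chip DRAM read, per byte    */

    const int    n       = 1024;
    const double bytes   = 2.0 * n * 4;           /* two fp32 operands/term */
    const double compute = n * PJ_MAF32;          /* ~6,144 pJ              */
    const double sram    = bytes * PJ_SRAM_BYTE;  /* ~12,288 pJ             */
    const double dram    = bytes * PJ_DRAM_BYTE;  /* ~2,048,000 pJ (~2uJ)   */

    printf("arithmetic:    %9.0f pJ\n", compute);
    printf("data via SRAM: %9.0f pJ (about 2x the arithmetic)\n", sram);
    printf("data via DRAM: %9.0f pJ (about 300x the arithmetic)\n", dram);
    return 0;
}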
As you can see, once we have moved data to the local SRAM, we want to do as much as possible with it.
Indeed, it is better to think of programming the next generation of devices as GPU-like machines
rather than as single processors.
Our target is 256 cores per 1cm^2 die -- provided we can get enough SRAM to keep the CPUs fed --
running at 1W per chip. Clocked at ~ 0.5-1.0 GHz. Co-packaged DRAM: as much as we can squeeze in.
Ideally, each of our cores -- or at least groups of 8 -- will be running the same instruction on
different data (SIMD).
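To put those targets in context (my arithmetic, from the figures above): 1W across 256 cores is
about 4mW per core, which -- ignoring everything but the arithmetic itself -- pays for roughly
4 billion 16-bit MADDs per second at 1pJ each, comfortably more than one per clock at 0.5-1.0 GHz;
whereas a single byte fetched from off-chip DRAM (250pJ) costs as much as 250 of them.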
-----------------------------
Now, as I understand it, you are suggesting we trade core count for complexity. My guess is that we
might then get something like 64 complex cores.
What the energy numbers above indicate is that we need to pay great attention to the movement of
data; the operations themselves are of secondary importance. For example, it costs about four times
as much energy to move a 32-bit word from one side of the chip to the other (4 bytes x 7pJ/byte =
28pJ) as to do a pair of 32-bit float operations (a single 6pJ MAF counts as two).
Now, to get the best performance out of this device, we will ideally need a compiler that optimises
for energy consumption. Since we do not yet have the device, we have not built the compiler, so I
have not run the numbers. Depending on how we are using the dot product, it may make sense to
pre-load a pair of matrices into the locations where they are needed. Alternatively, we could think
about the dataflow through the device.
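To illustrate the pre-loading idea, the inner loops would look something like the sketch below. It
is only a sketch: the tile size, the stage_to_sram() routine and where the buffers actually live
are placeholders for whatever DMA and memory-map machinery the real device ends up with.

/* Sketch of a "pre-load then compute" dot product: operands are staged from
 * (co-packaged or off-chip) DRAM into local buffers a tile at a time, and
 * all arithmetic is then done against the local copies.  On the real device
 * tile_x/tile_y would be placed in on-chip SRAM and stage_to_sram() would
 * kick off a DMA transfer.
 */
#include <stddef.h>
#include <string.h>

#define TILE 256                          /* elements per staged tile       */

static float tile_x[TILE];                /* would live in on-chip SRAM     */
static float tile_y[TILE];

static void stage_to_sram(float *dst, const float *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);    /* stand-in for a bulk DMA        */
}

float dot_tiled(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i += TILE) {
        size_t m = (n - i < TILE) ? (n - i) : TILE;
        stage_to_sram(tile_x, x + i, m);  /* one bulk DRAM crossing ...     */
        stage_to_sram(tile_y, y + i, m);
        for (size_t j = 0; j < m; j++)    /* ... then arithmetic on SRAM    */
            acc += tile_x[j] * tile_y[j];
    }
    return acc;
}

The point is simply that each operand crosses the expensive DRAM path once, in bulk, and every
arithmetic access thereafter hits the 1.5pJ/byte SRAM.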
Personally, I think I'd be looking at getting a parallel version of GMP up and running early, and
using this as a vehicle for expressing a large fixed-point accumulator across multiple CPUs. I'm
prepared to concede that this may not prove to be the best solution, but the cost of getting all
the floats to one CPU for a dot-product operation may be prohibitive.
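For the curious, here is a single-core sketch of the sort of wide fixed-point accumulator I mean.
The limb width, the accumulator span and the deferred-carry scheme are illustrative choices, not a
design; a parallel GMP-based version would spread the limbs (and the carry propagation) across
CPUs, but the work per product is the same.

/* Sketch: exact accumulation of binary32 products into a wide fixed-point
 * ("long") accumulator on one core.  Finite inputs assumed.
 */
#include <stdint.h>
#include <math.h>
#include <stdio.h>

#define LIMBS       20        /* 20 x 32-bit digits = 640 bits              */
#define ACC_LSB_EXP (-352)    /* bit 0 of digit 0 has weight 2^-352         */

typedef struct {
    int64_t limb[LIMBS];      /* digit i holds a signed partial count of    */
} lacc_t;                     /* multiples of 2^(32*i + ACC_LSB_EXP)        */

/* Accumulate the product x*y exactly. */
static void lacc_madd(lacc_t *a, float x, float y)
{
    double p = (double)x * (double)y;        /* exact: 24+24 <= 53 sig. bits */
    if (p == 0.0) return;

    int e;
    double  m   = frexp(p, &e);              /* p = m * 2^e, 0.5 <= |m| < 1  */
    int64_t sig = (int64_t)ldexp(m, 53);     /* exact integer significand    */
    int     sh  = (e - 53) - ACC_LSB_EXP;    /* bit position of sig's LSB    */
    int     d   = sh >> 5, off = sh & 31;    /* digit index and bit offset   */

    int64_t  s  = (sig < 0) ? -1 : 1;
    uint64_t u  = (uint64_t)(s * sig);       /* |sig| < 2^53                 */
    uint64_t lo = (u & 0xffffffffu) << off;  /* < 2^63                       */
    uint64_t hi = (u >> 32)         << off;  /* < 2^53                       */

    /* Spread the shifted significand over three digits; carries are
     * deferred, so each call adds at most ~2^33 to any one limb and around
     * 2^30 products can be absorbed before normalisation is needed.         */
    a->limb[d]     += s * (int64_t)(lo & 0xffffffffu);
    a->limb[d + 1] += s * (int64_t)((lo >> 32) + (hi & 0xffffffffu));
    a->limb[d + 2] += s * (int64_t)(hi >> 32);
}

/* Collapse the deferred carries and convert to double.  (The rounding here
 * is the lazy repeated kind, not the single correct rounding a production
 * implementation would provide.)                                            */
static double lacc_value(const lacc_t *a)
{
    uint32_t dig[LIMBS];
    int64_t  carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        int64_t t = a->limb[i] + carry;
        dig[i] = (uint32_t)t;                           /* low 32 bits       */
        carry  = (t - (int64_t)dig[i]) / 4294967296LL;  /* exact floor carry */
    }
    double r = (double)carry;                /* signed overflow digit        */
    for (int i = LIMBS - 1; i >= 0; i--)
        r = r * 4294967296.0 + (double)dig[i];
    return ldexp(r, ACC_LSB_EXP);
}

int main(void)
{
    lacc_t acc = { { 0 } };
    float  x[4] = { 1e20f, 1.0f, -1e20f, 1.0f };
    float  y[4] = { 1.0f,  1.0f,  1.0f,  1.0f };
    for (int i = 0; i < 4; i++)
        lacc_madd(&acc, x[i], y[i]);
    printf("%g\n", lacc_value(&acc));        /* prints 2; naive fp32 gives 1 */
    return 0;
}

Each product costs a handful of integer shifts and adds, which is exactly the "multiply, shift and
add" recipe Ulrich describes below.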
By the way, I am in no way opposed to the idea of high accuracy arithmetic, but I _am_ suggesting that
the best way to achieve this might involve a combination of hardware and software; and that it is
premature to be suggesting/mandating/dictating one particular hardware solution given the changing
nature of the hardware landscape. In particular, one of the challenges is devising programming models
and programming languages capable of making use of "many-core" architectures (think: 250-1000 cores).
Regards,
Dave Lester
On 10 Aug 2013, at 19:48, Ulrich Kulisch wrote:
> Dear David,
>
> thank you for your interesting mail. I fully agree with you that most computer applications do not need floating-point arithmetic and can be dealt with adequately by processors that are more energy-efficient than those used in a PC today. Of course, these simpler processors can also be used for floating-point arithmetic and scientific computing, including interval arithmetic. In the late 1970s we implemented our PASCAL-XSC on the Z80, which provided just an 8-bit adder. I recommend that everybody have a look at this language! It was one of the more powerful programming languages available for scientific computing. But running it on the Z80 certainly was not energy efficient. So the question remains: how can scientific computing be made more energy efficient?
>
> < The problem as I see it is that Ulrich is designing a machine for the 1980s, and utterly failing to address the concerns of
> < anyone actually designing or building a machine today. And that concern is energy.
> My understanding of the situation is not so much influenced by the computers of the 1980s as by those that were used 50 years earlier. The better computers used in the 1930s and earlier provided a long accumulator. It allowed error-free accumulation of numbers, and of products of numbers, into a wide fixed-point register. Several of these computers provided more than one such register. This method is simpler and more energy efficient than doing a dot product in the conventional way in floating-point arithmetic. No intermediate normalizations and roundings have to be performed. No intermediate results have to be stored and read in again for the next operation. No intermediate overflow or underflow can occur or has to be checked for. No error analysis is necessary. The result is always exact. The operations are just: multiply, shift and add.
>
> With best regards
> Ulrich
>
> On 08.08.2013, at 11:41, David Lester wrote:
>> As a supercomputer designer (http://apt.cs.man.ac.uk/projects/SpiNNaker/SpiNNchip/), with a new
>> €1 billion funding stream as part of http://www.humanbrainproject.eu/, I had been intending to
>> sit this one out.
>>
>> The problem as I see it is that Ulrich is designing a machine for the 1980s, and utterly failing
>> to address the concerns of anyone actually designing or building a machine today. And that concern
>> is energy.
>>
>> To drive energy costs down low enough to hit the 20-25MW power budget for exascale, we need to
>> maximise the number of compute cores (at the expense of their complexity, and clock speed),
>> localise memory, and cut out extraneous operations. The significance of the energy budget is
>> that with today's cutting-edge Intel hardware, you would need on the order of $100 billion per
>> annum to pay the electricity bills.
>>
>> Most designers use stock Intel chips, because there is generally thought to be insufficient
>> market for supercomputer chip-development to be cost-effective. Because ARM is both
>> energy-efficient and a virtual foundry (and Steve is the original ARM designer), we have the
>> option to build a custom chip, unlike many of our competitors. ARM, by the way, is outselling
>> Intel to the extent that last year it sold over 15 billion (licensed) cores, which is more than
>> the entire chip production of all other manufacturers since 1967.
>>
>> Talking within the SC community, it seems we are all converging on essentially the same solution:
>>
>> (1) Energy efficient low-performance cores.
>>
>> (2) Co-packaged DRAM, including multiple layers of 3D stacking.
>>
>> (3) Vectorized SIMD architecture.
>>
>> (4) Paring down the complexity of the individual cores, and doing rarely used operations in
>> software (or perhaps adding co-processors).
>>
>> The argument against co-processors is that they consume energy and die area that could be more
>> productively used to perform common rather than unusual computations.
>>
>> For example, our partner team at FZ Julich (German Supercomputer Centre) has reported that
>> their current supercomputer neural simulations have a frighteningly low use of the FPU.
>> Most of what is needed for our application area is memory and fast MPI. This is what
>> SpiNNaker is optimised for. It has no FP _at_all_, has local co-packaged DRAM, and a fast
>> packet-switched network optimised for the task in hand (lots and lots of very small 4-byte
>> packets, rather than the moderate number of 1-2KB packets that MPI would suggest).
>>
>> I'll say that again: no FPU. It's not heavily used, and it's cheaper and more energy
>> efficient to do a software simulation of float and double, which is a reasonable architectural
>> compromise. Obviously the LINPACK numbers are non-competitive, but there is a growing
>> realisation that LINPACK may not be very indicative of future super-computing needs.
>>
>> The question -- as I see it -- is not whether an exact dot product can be done in hardware, but
>> whether it can be done cost-effectively. That is, in a world in which volume trumps everything,
>> is there a large enough market for the hardware development costs to be amortised over likely
>> sales? Otherwise, letting you clever people write a software emulation would seem the most
>> cost-effective way to go.
>>
>> I should point out how the current ARM co-processor architecture violates everything that Ulrich
>> believes is desirable. The FPU is not integrated into the main ALU pathway. Instead the opcode is
>> trapped out as "unrecognised" and offered to the co-processors attached to the internal bus.
>> Software compatibility is maintained by raising an interrupt if the co-processor is missing, and
>> executing a software simulation of the operation.
>>
>> So my question is: is our proposed chip P1788-compliant? Provided that we do everything in
>> software, I cannot see why it should not be. Am I missing something?
>>
>> Are you requiring me to provide on-chip co-processors to be compliant? If so, how does this
>> differ from off-chip interrupt handling for unrecognised op-codes? Are you mandating how the FPU
>> is integrated into the ALU? If so, do you have sufficient current architectural experience to
>> make sensible selections?
>>
>> Regards,
>>
>> Dave Lester
>>
>> Advanced Processor Technology Group
>> The University of Manchester
>> Manchester M13 9PL
>>
>> ps Baker: If I don't have current listserv permissions, please feel free to forward to the community.
>>
>>
>
> --
> Karlsruher Institut für Technologie (KIT)
> Institut für Angewandte und Numerische Mathematik
> D-76128 Karlsruhe, Germany
> Prof. Ulrich Kulisch
>
> Telefon: +49 721 608-42680
> Fax: +49 721 608-46679
> E-Mail: ulrich.kulisch@xxxxxxx
> www.kit.edu
> www.math.kit.edu/ianm2/~kulisch/
>
>
> KIT - University of the State of Baden-Württemberg and
> National Research Center of the Helmholtz Association
>