
Re: IEEEP1788



> On 30 Apr 2015, at 21:42, Vincent Lefevre <vincent@xxxxxxxxxx> wrote:
> 
> On 2015-04-30 17:49:51 +0200, Ulrich Kulisch wrote:
>> It computes the exact dot product totally on chip without any memory
>> involvement.
> 
> You need memory on chip for the long accumulator. At least one for
> each core.

Just so.

My suggestion to Ulrich is the following approach (which we use in SpiNNaker/Human Brain Project):

Use a minimal processor, attach a small amount of instruction memory (16-32K), and data memory (32-64K) in a Harvard configuration (separate instruction/data paths). That’s your processing node.

Attachment: PastedGraphic-1.pdf
Description: Adobe PDF document


As you can see there are 6 banks of 16K Static RAM per core. This is external to the CPU, but internal to the chip, which has 18 processing nodes. Each of the chips has a 128MB DRAM folded in (the dark 2/3 overlaying the gold substrate) prior to packaging:


Attachment: PastedGraphic-2.pdf
Description: Adobe PDF document



To give you all a feel for the exascale problem, the only processor on the market capable of hitting the 10MW power budget is the ARM Cortex-M0+ in 28nm technology (you may be able to do something with 14nm FinFET, but that’s Intel proprietary at the moment).

Here’s Wikipedia’s description of the Cortex-M series:

Cortex-M  Thumb   Thumb-2  H/W multiply  H/W divide  Saturated math  DSP insts  Optional FPU  Architecture
---------------------------------------------------------------------------------------------------------
M0        Most    Some     32-bit        No          No              No         No            ARMv6-M
M0+       Most    Some     32-bit        No          No              No         No            ARMv6-M
M1        Most    Some     32-bit        No          No              No         No            ARMv6-M
M3        Entire  Entire   32/64-bit     Yes         Yes             No         No            ARMv7-M
M4        Entire  Entire   32/64-bit     Yes         Yes             Yes        SP            ARMv7E-M
M7        Entire  Entire   32/64-bit     Yes         Yes             Yes        SP/(SP & DP)  ARMv7E-M

Currently we use a pre-Cortex processor core, but will be switching to the M4F (which has the optional, non-IEEE 754-compliant single-precision FP unit). We briefly thought about the new M7, but its chip area is about double that of the M4F, so we can afford neither the area nor the power. If you’ve been reading carefully, you’ll have realised we’ll miss exascale performance (“the M0+ is the only processor which’ll hit the 10MW power budget”).[*]

What I am advising Ulrich is that he treats the accumulator of his complete arithmetic as something that lives in the on-chip, off-core static RAM, and that his calculations are undertaken by a primitive processor on this data under software control. This permits the same code to control your fridge (“Internet of Things”) and your supercomputer (where you’ll get 1e9 of these little processor+memory units).
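
To make the idea concrete, here is a minimal sketch in C of a long accumulator held in ordinary memory and driven purely by integer instructions. It is only an illustration under stated assumptions, not Ulrich's design or anything we run: the names (long_acc, acc_fma, exact_dot), the 640-bit width, and the restriction to finite IEEE 754 single-precision inputs are all mine. Each product is added exactly; the only rounding happens in the final conversion, which here is a simple approximation rather than a guaranteed correctly rounded result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ACC_WORDS 20   /* 640 bits: 554 cover any float*float product, the rest is carry headroom */
#define ACC_BIAS  298  /* bit 0 of the accumulator has weight 2^-298 */

/* Hypothetical accumulator type: two's complement, least significant word first. */
typedef struct { uint32_t w[ACC_WORDS]; } long_acc;

/* Split a finite float so that x = (-1)^neg * mant * 2^exp (assumes IEEE 754 binary32). */
static void split_float(float x, int *neg, uint32_t *mant, int *exp) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t e = (bits >> 23) & 0xFFu;
    assert(e != 0xFFu);                        /* infinities and NaNs are not handled */
    *neg  = (int)(bits >> 31);
    *mant = (bits & 0x7FFFFFu) | (e ? 0x800000u : 0u);
    *exp  = e ? (int)e - 150 : -149;           /* subnormals share the minimum exponent */
}

/* acc += a * b with no rounding. */
static void acc_fma(long_acc *acc, float a, float b) {
    int na, nb, ea, eb;
    uint32_t ma, mb;
    split_float(a, &na, &ma, &ea);
    split_float(b, &nb, &mb, &eb);
    uint64_t p = (uint64_t)ma * mb;            /* exact, below 2^48 */
    int shift = ea + eb + ACC_BIAS;            /* 0..506 */
    int word = shift / 32, bit = shift % 32;
    uint32_t chunk[3] = {                      /* p << bit, split into three words */
        (uint32_t)(p << bit),
        (uint32_t)((p << bit) >> 32),
        bit ? (uint32_t)(p >> (64 - bit)) : 0u
    };
    int sub = na ^ nb;                         /* negative product: subtract instead */
    uint64_t carry = 0;                        /* carry when adding, borrow when subtracting */
    for (int i = word; i < ACC_WORDS; i++) {
        uint64_t c = (i - word < 3) ? chunk[i - word] : 0u;
        uint64_t t = sub ? (uint64_t)acc->w[i] - c - carry
                         : (uint64_t)acc->w[i] + c + carry;
        acc->w[i] = (uint32_t)t;
        carry = (t >> 32) & 1u;
    }
}

/* Convert to double; exact when the sum fits in 53 bits, otherwise only approximate. */
static double acc_to_double(const long_acc *acc) {
    int neg = (int)(acc->w[ACC_WORDS - 1] >> 31);
    uint32_t mag[ACC_WORDS];
    uint64_t carry = (uint64_t)neg;
    for (int i = 0; i < ACC_WORDS; i++) {      /* take the magnitude if negative */
        uint64_t t = (uint64_t)(neg ? (uint32_t)~acc->w[i] : acc->w[i]) + carry;
        mag[i] = (uint32_t)t;
        carry = t >> 32;
    }
    double r = 0.0;
    for (int i = ACC_WORDS - 1; i >= 0; i--) r = r * 4294967296.0 + (double)mag[i];
    for (int i = 0; i < ACC_BIAS; i++) r *= 0.5;   /* undo the 2^-298 bias */
    return neg ? -r : r;
}

/* Dot product of two float vectors, accumulated exactly. */
double exact_dot(const float *a, const float *b, int n) {
    long_acc acc = {{0}};
    for (int i = 0; i < n; i++) acc_fma(&acc, a[i], b[i]);
    return acc_to_double(&acc);
}
```

For example, with a = {1e20f, 1.0f, -1e20f} and b = {1.0f, 1.0f, 1.0f}, exact_dot(a, b, 3) returns 1.0, whereas a plain single-precision loop loses the 1.0f when it is added to 1e20f. The accumulator is just 80 bytes of data, so it sits comfortably in a 16K SRAM bank.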

... And it’s the Fridge/Microwave/Central Heating controller which is subsidising supercomputers, not the other way around. Quite simply there are more fridges than supercomputer data centres.

Regards,

Dave Lester

[*] And to bring it back to arithmetic, we’re adding a fixed point exp function, as a special feature.
    It’s outside the ARM core, but analysis shows it’s about 9% of the instruction mix in our simulation.
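
For readers who have not met one, this is roughly what a fixed-point exp computes, written as a software sketch in C. It says nothing about how the hardware unit mentioned above works: the Q16.16 format, the name fix_exp, and the method (range reduction by ln 2 followed by a degree-6 Taylor polynomial) are assumptions chosen for illustration only.

```c
#include <stdint.h>

#define FIX_ONE 65536   /* 1.0 in Q16.16 (assumed format) */
#define FIX_LN2 45426   /* ln 2 rounded to Q16.16 */

/* exp(x) for a Q16.16 argument, returned as Q16.16.
   Saturates at INT32_MAX for large x and returns 0 for very negative x. */
int32_t fix_exp(int32_t x) {
    int32_t k = x / FIX_LN2;            /* range reduction: x = k*ln2 + r */
    if (x % FIX_LN2 < 0) k--;           /* use floor, so that 0 <= r < ln2 */
    if (k >= 15) return INT32_MAX;      /* result would not fit in Q16.16 */
    if (k < -17) return 0;              /* result is below one unit in the last place */
    int32_t r = x - k * FIX_LN2;
    int64_t p = FIX_ONE;
    /* Horner form of 1 + r + r^2/2! + ... + r^6/6!, evaluated in Q16.16 */
    for (int n = 6; n >= 1; n--)
        p = FIX_ONE + (((int64_t)r * p / n) >> 16);
    return (int32_t)(k >= 0 ? p << k : p >> -k);   /* multiply by 2^k */
}
```

The result is accurate to within a few units in the last place; for instance, fix_exp(65536), i.e. exp(1.0), lands within a couple of units of 178145, which is e scaled by 65536. On a core with no hardware divide, the division by n would normally be replaced by multiplication with precomputed reciprocal constants.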