David McGrew wrote:
You mean advantage in terms of latency, right? I'm not
sure that this is the case, since both XCB and EME* need to do one
pass over the data before any data can be output, and I suspect that
the circuit depth of those two passes isn't much different. It would
be interesting to see a detailed comparison. For that matter, it
would be worthwhile to discuss the implementation scenarios enough to
get a good idea of what the "success criteria" for wide-block modes
like these are. (E.g. since all of these modes require the data to
be buffered, what critical path should be measured? The path to
output the first byte, or to output all of the bytes?)
I've looked at this to some extent.
From the point of view of an arbitrary block size, XCB is much more
costly. To support a block that is larger than the AES hardware
accelerator's buffer size, data must be fetched twice. This feature is
unique to XCB; I've not seen it in any other mode of any crypto
algorithm I've looked at. Having to fetch data twice is very costly.
For a block size of 4096 bits, it is reasonable to buffer the entire
block within the AES hardware accelerator.
So, assuming a block of 4096 bits (512 bytes), computation times are:
- LRW: 32 AES computations + 1 GFMult
only the first GF Multiply must be performed in serial.
- EME: 66 AES computations
all GF Mults can be parallelized
- XCB: 37 AES computations + 34 GFMult
There are a total of 68 GF mults to perform, split between
two separate GHASH computations. Computation of ciphertext stream B
depends on D, which depends on I, K, and H. Given one AES engine and
one GFMult engine, those first 3 AES computations must occur serially,
followed by 34 GF multiplies, all serially. Only once D is computed
can the AES engine and the multiplier be used concurrently.
- GCM: 34 AES computations + 2 GF multiplies
GHASH(AAD) can't start until AES(0,K) produces H. For a
typical 96 bit AAD, that takes 2 GF multiplies. I'm assuming that the
multiply accumulate of len(A) || len(C) can be performed in parallel
with a rear-loaded AES computation of CTR_0.
- CCM: 66 AES computations
Counter Mode requires 32 AES computations; CBC-MAC requires
32 AES computations, and there are two additional AES computations --
one associated with AAD, and the other to encrypt the MAC.
For LRW, the first GFMultiply must be performed prior to the first AES
computation; all others can be performed in parallel. For EME, it
appears all GFMultiply computations can be performed in parallel. For
XCB, There are a total of 68 GF Multiplies to perform, 35 of which I
think can be parallelized. However, the computation of ciphertext is
dependent on D, which AES(P_0 ^ I, K) ^ ghash(H,Z,B). Computing H
requires an AES computation. Computing I requires an AES computation.
Computing C requires and AES computation. Computing D requires a
GHASH, which is a series of GF-multiply accumulates (34 precisely).
All that MUST be serialized; after D is computed we can start using the
AES engine and the GF engine in parallel.
Lets assume 16 clock cycles per AES computation and 16 clock cycles per
GF Multiply. In that case encrypting takes:
- LRW: 33*16 = 528 clock cycles
- EME: 66*16 = 1056 clock cycles
- XCB: (37+34)*16 = 1136 clock cycles
- GCM: (34+2) * 16 = 576 clock cycles
- CCM: 66*16 = 1056 clock cycles
The 16 clock cycle per AES computation is experiential. The GF
Multiply time is arbitrary -- it assumes a 128x16 multiplier used for
16 cycles because that matches the performance of an AES engine. If
you can afford the die area and power, you can of course make the GF
multiplier much faster.
Note this analysis ignores two factors:
- bus latency -- the time resources are locked while waiting for
data to arrive. If there is a bus latency issue, it is more likely to
be an issue with XCB, where for large data sets data would likely have
to be fetched twice.
- control complexity. LRW is clearly much simpler than either EME
or XCB. EME and XCB both have special challenges to implementation
above CCM or GCM, because of message block scheduling.
mt
|