Thread Links Date Links
Thread Prev Thread Next Thread Index Date Prev Date Next Date Index

Re: p1619 (disk): ciphertext-stealing, tweak-mapping, other



Hi Michael,

thanks for your comments, more inline:

On Dec 21, 2005, at 2:09 PM, Michael Torla wrote:

> David McGrew wrote:
>> You mean advantage in terms of latency, right?  I'm not sure that   
>> this is the case, since both XCB and EME* need to do one pass  
>> over  the data before any data can be output, and I suspect that  
>> the  circuit depth of those two passes isn't much different.  It  
>> would be  interesting to see a detailed comparison.  For that  
>> matter, it would  be worthwhile to discuss the implementation  
>> scenarios enough to get a  good idea of what the "success  
>> criteria" for wide-block modes like  these are.  (E.g. since all  
>> of these modes require the data to be  buffered, what critical  
>> path should be measured?  The path to output  the first byte, or  
>> to output all of the bytes?)
> I've looked at this to some extent.
>
> From the point of view of an arbitrary block size, XCB is much more  
> costly.  To support a block that is larger than the AES hardware  
> accelerator's buffer size, data must be fetched twice.  This  
> feature is unique to XCB; I've not seen it in any other mode of any  
> crypto algorithm I've looked at.

AFAICT, the requirement that the encryptor buffer the block that it  
is encrypting is a fundamental requirement for any cipher that is a  
pseudorandom permutation with an input width that matches the  
plaintext size.  Any mode that met the goal would the need to buffer  
the data.

EME and EME* and several other modes also have this property, IIUC.    
In the EME specifications, the dependancy of the second ECB pass on  
the results of the first ECB pass is somewhat hidden because it is  
expressed indirectly through variables.  Note, for example, on page 4  
of http://seclab.cs.ucdavis.edu/papers/eme.pdf that the variable M,  
which is needed to compute the second ECB pass, is only computed  
after the first ECB pass completes.


> Having to fetch data twice is very costly.

I bet it is!

>
> For a block size of 4096 bits, it is reasonable to buffer the  
> entire block within the AES hardware accelerator.
>

Which would be a significant advantage, AFAIK.

> So, assuming a block of 4096 bits (512 bytes), computation times are:
> LRW:  32 AES computations + 1 GFMult
> only the first GF Multiply must be performed in serial.
> EME:  66 AES computations
> all GF Mults can be parallelized
> XCB:  37 AES computations + 34 GFMult
> There are a total of 68 GF mults to perform, split between two  
> separate GHASH computations.  Computation of ciphertext stream B  
> depends on D, which depends on I, K, and H.  Given one AES engine  
> and one GFMult engine, those first 3 AES computations must occur  
> serially, followed by 34 GF multiplies, all serially.  Only once D  
> is computed can the AES engine and the multiplier be used  
> concurrently.
> GCM:  34 AES computations +  2 GF multiplies
> GHASH(AAD) can't start until AES(0,K) produces H.  For a typical 96  
> bit AAD, that takes 2 GF multiplies.  I'm assuming that the  
> multiply accumulate of len(A) || len(C) can be performed in  
> parallel with a rear-loaded AES computation of CTR_0.
> CCM:  66 AES computations
> Counter Mode requires 32 AES computations; CBC-MAC requires 32 AES  
> computations, and there are two additional AES computations -- one  
> associated with AAD, and the other to encrypt the MAC.
> For LRW, the first GFMultiply must be performed prior to the first  
> AES computation; all others can be performed in parallel.  For EME,  
> it appears all GFMultiply computations can be performed in  
> parallel.  For XCB, There are a total of 68 GF Multiplies to  
> perform, 35 of which I think can be parallelized.  However, the  
> computation of ciphertext is dependent on D, which AES(P_0 ^ I, K)  
> ^ ghash(H,Z,B).  Computing H requires an AES computation.   
> Computing I requires an AES computation.  Computing C requires and  
> AES computation.  Computing D requires a GHASH, which is a series  
> of GF-multiply accumulates (34 precisely).  All that MUST be  
> serialized; after D is computed we can start using the AES engine  
> and the GF engine in parallel.

One useful strategy for implementing XCB is to store the values of H  
and I along with K.  This enables the first ghash invocation and the  
first AES invocation to be computed in parallel.  Of course, it  
requires additional storage.

>
> Lets assume 16 clock cycles per AES computation and 16 clock cycles  
> per GF Multiply.  In that case encrypting takes:
> LRW:  33*16 = 528 clock cycles
> EME:  66*16 = 1056 clock cycles
> XCB:  (37+34)*16 = 1136 clock cycles
> GCM:  (34+2) * 16 = 576 clock cycles
> CCM:  66*16 = 1056 clock cycles
>

Roughly speaking, the cost of non-malleability (psedorandom  
permutation modes) is equal to that of combined encryption &  
authentication modes.

> The 16 clock cycle per AES computation is experiential.  The GF  
> Multiply time is arbitrary -- it assumes a 128x16 multiplier used  
> for 16 cycles because that matches the performance of an AES  
> engine.  If you can afford the die area and power, you can of  
> course make the GF multiplier much faster.
>

Here's my understanding of a hardware XCB implementation, please  
check me if I'm wrong.  For large data sets, the computation is  
dominated by two operations: the initial ghash pass, and the final  
combined counter mode and ghash pass.  So the latency between when  
the data is first provided to the encryptor and when the first data  
starts to leave the encryptor is roughly equal to the time it takes  
for the first ghash pass, and the total time is the sum of the time  
for both operations.  As you point out, ghash can be made to run  
quickly at the cost of increased circuit size and power, so that the - 
first-data-out latency can be made small by cranking up the speed of  
the multiply operation to a rate faster than the AES engine.  (Though  
of course in the second stage, it is desirable to have AES and ghash  
running at the same rate.)    Since EME* has a speed determined by  
the AES engine, XCB can have lower latency, at the cost of a high- 
speed multiplier.

Comments welcome.

David

> Note this analysis ignores two factors:
> bus latency -- the time resources are locked while waiting for data  
> to arrive.  If there is a bus latency issue, it is more likely to  
> be an issue with XCB, where for large data sets data would likely  
> have to be fetched twice.
> control complexity.  LRW is clearly much simpler than either EME or  
> XCB.  EME and XCB both have special challenges to implementation  
> above CCM or GCM, because of message block scheduling.
>
> mt
>