Hari Debate, Response to Richard Taborek's Note to Dan Dove, dated 12/16/99 08:50:38 PM
To Richard Taborek and Hari Interest Group
I am uncomfortable that I must again express disagreement with some of
your statements. I had hoped that we could together present the significant
issues with their pros and cons in a factual and fair way so that not every
voting member would have to study all the details.
In the above-referenced note you refer to a presentation by Mark Ritter,
pages 15 and 16, and then proceed to some conclusions adverse to word
striping. There are elementary flaws in your argumentation: Figure 3 of
your note does not match the referenced proposal. Regardless of the
specifics, Mark Ritter presented just an example to illustrate how a
maximum number of commas can be provided within the Ethernet and FC
constraints for people who see this as a necessity, which in fact it is
not. The example also shows that word-striping does not require any ordered
set specifically dedicated to synchronization. The comma can be part of
many delimiters such as SOF, EOF, etc. as is common practice in the Fibre
Channel format, so synchronization can be acquired using any ordered set,
i.e. with word striping you get two functions wrapped into a single word.
You find fault with "for Word-Striping the Ethernet Start-of-Packet does
not occur in lane 0". I am mystified why anybody should care, as long as
the words are delivered in the proper format and sequence at the
destination, which they are. Please explain the reason for your objection.
Words are rotated from lane to lane without regard to content and I see no
disadvantage from this procedure. The rotation of SOF from lane to lane has
the beneficial effect of rotating commas from lane to lane.
You also refer to "the relative complexity of determining the last byte of
a packet" (with Word-Striping). We do not propose any changes to the 10 Gb
Ethernet EOF format, certainly not at the MAC interface. So there is no
I think you have misread a chart of the Ritter presentation on page 15. At
the top of the page is a representation of words as they appear at the Hari
parallel input or output, arranged vertically in byte order 0 through 3, as
marked on the left of the diagram. The earliest word is on the left side.
The diagram on the bottom shows the same words in the same configuration as
they might be modified inside the Hari to generate a maximum number of
commas as deemed necessary by some folks (the 4 horizontal rows represent
bytes as indicated on the left, not lanes). For me, a high number of commas
is just desirable, not necessary. I have more to say on this below. The
serialized words on 4 lanes are shown on page 16 (in non-staggered format,
for simplicity). We do not show the order of bytes transmitted on the lanes
as your Figure 3 implies; it is irrelevant anyway for the circuits outside
the SerDes and framing domain.
From objections voiced at meetings and in off-line conversations, I sense
that some people incorrectly project certain conditions and requirements
from the byte-striped approach onto the word-striped approach without
adequate study and make wrong inferences. The notion that the start of
frame should be aligned with lane #0 is one of the false assumptions which
surfaces again and again. It has misled prominent supporters of the byte
striped approach to totally false conclusions about the operation of skips
and insertions. These are handled at a higher level where the underlying
number of lanes is not even visible or detectable; therefore, the
assignment of a word to a particular lane cannot have any relevance at
all. To give the word striped approach a fair evaluation, you cannot apply
the operating rules of byte striping; there are basic differences. For
those who do not have the time or inclination to study detailed
presentations on the subject, I will try to explain the process of
staggered word striping here in compact form for the case of four lanes:
The major differences between byte and word striping are the deskewing
technique and the way byte or word synchronization is enabled via commas on
each of the lanes.
1) In a first step, any minor reformatting required is performed (e.g. for
2) A first word is then encoded for transmission on lane 0 using a starting
disparity which equals the ending disparity of the previous word
transmitted on lane 0. The ending disparity is stored in a latch for use by
the next word on the same lane.
3) The serialization of the said first word is started on lane zero as soon
as encoding is complete.
4) Exactly the same steps are applied to the second word, except that it is
transmitted on lane 1 and serialization starts one byte interval later.
5) The steps are repeated for the third and fourth word for transmission on
lanes 2 and 3 respectively.
6) The 5th word fits behind the first word on lane 0 and the process
continues as described above.
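The lane assignment and stagger described in the steps above can be
sketched as follows for the four-lane case. This is only an illustration
under stated assumptions: the names stripe_words and ScheduledWord are
hypothetical, time is counted in byte intervals from the start of the first
word, and the encoding with its per-lane disparity latch is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ScheduledWord:
    index: int            # position of the word in the input stream
    lane: int             # lane on which the word is serialized
    start_byte_time: int  # byte interval at which serialization starts


def stripe_words(num_words, num_lanes=4, bytes_per_word=4):
    """Assign words to lanes round-robin with a one-byte stagger per lane.

    Assumption (illustrative): num_lanes equals bytes_per_word, as in the
    four-lane case, so a lane becomes free exactly when its turn returns.
    """
    if num_lanes != bytes_per_word:
        raise ValueError("this sketch only models the four-lane case")
    schedule = []
    for i in range(num_words):
        lane = i % num_lanes              # words rotate through the lanes
        round_number = i // num_lanes     # full rotations completed so far
        # Each lane starts one byte interval after the preceding lane.
        start = round_number * bytes_per_word + lane
        schedule.append(ScheduledWord(i, lane, start))
    return schedule


if __name__ == "__main__":
    for word in stripe_words(6):
        print(word)
```

In this model the fifth word (index 4) lands on lane 0 at byte interval 4,
that is, directly behind the first word, without any regard to its content.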
Clock and data recovery can be identical for word and byte striping.
However, for word striping, each lane assembles the deserialized words
using derivatives of its respective receiver clock, not a shared clock
phase. Up to this point, there is no interaction and alignment required
among the four lanes. The separate words are then multiplexed into a shared
register using a set of four 78 MHz clocks derived from any one of the 4
input clocks by a hardwired arbitrary selection. The phase of each of the
four clocks is offset by a quarter cycle (3.2 ns) from the next preceding
clock. The shared register may be the first word cell of the FIFO, if rate
adjustments are required via skip word removals or insertions. Clocking out
of the register or FIFO is always done by a single shared 78 MHz clock
which may differ from the input rate in the case of a FIFO. The large 12.8
ns interval associated with each of the 4 inputs to the shared buffer
allows hardwired skew elimination for peak to peak skews among the lanes in
excess of 6 ns. For details of the relationship between the clocks and the
skewed data transitions see the diagrams at:
Once word sync has been acquired on all lanes, no more commas are required
for as long as synchronization is maintained. Skew drift (within the 6 ns
specifications) does not affect operation, in contrast to byte striping,
which may require adjustments to compensate for drift in delays. While the
transport structure is word oriented, the content can be formatted in any
desired fashion using control characters. The only rule is that the format
must not introduce any misaligned commas. Isolated spurious commas which
can be generated by transmission errors are filtered out, e.g. by a two-bit
reversible counter stepped up by aligned commas and down by misaligned
commas enabling resynchronization only in the `00' state. Well designed
circuits should maintain synchronism over periods of months. Operation in
this mode is the ultimate in protocol independence and more difficult to
achieve with byte striped operation. A misalignment would most likely be
the consequence of a failure in the clock recovery loop. However, the byte
striped approach has an additional failure mode in the skew control loop
which must be active continuously to correct skew drift over time. Recovery
from a failure generally will be associated with a longer traffic
interruption for the byte striped version.
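The comma filter mentioned above can be illustrated with a short sketch.
It is not a circuit description: the class name CommaFilter is
hypothetical, and two details are my assumptions for the purpose of the
example, namely that the counter saturates at its top value rather than
wrapping, and that resynchronization is triggered by a misaligned comma
arriving while the counter already sits in the 00 state.

```python
class CommaFilter:
    """Two-bit up/down counter guarding word resynchronization."""

    MAX_COUNT = 3  # two bits; saturation at 3 is an assumption

    def __init__(self):
        self.count = 0  # state 00: no confidence in current alignment

    def observe(self, aligned):
        """Process one detected comma.

        Returns True when the receiver may realign to this comma.
        """
        if aligned:
            # Aligned commas step the counter up.
            self.count = min(self.count + 1, self.MAX_COUNT)
            return False
        if self.count == 0:
            # Resynchronization is enabled only in the 00 state.
            return True
        # Misaligned commas step the counter down and are otherwise ignored.
        self.count -= 1
        return False


if __name__ == "__main__":
    f = CommaFilter()
    for _ in range(5):
        f.observe(True)           # a healthy link builds up confidence
    print(f.observe(False))       # an isolated spurious comma is ignored
```

With the counter at its top value, it takes several misaligned commas
without intervening aligned ones before realignment is permitted, so an
isolated error-induced comma does not disturb an established alignment.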
The reasons why we nevertheless like a decent density of commas can be
summarized as follows:
1) The 3.125 Gbaud lanes may not operate as flawlessly as similar lower
2) If a loss of word synchronization on any of the lanes occurs because of
clock circuit degradation or in the event of excessive externally induced
noise, we would like to recover synchronization as quickly as possible so
no more than a handful of packets per sync slip are lost until the failing
hardware can be replaced or bypassed.
3) The commas provide useful diagnostic information and can easily be
provided in a compatible way with the known protocols.
To repeat, word-striping does not constrain formatting in any way other
than limiting where commas can be placed: it must be an agreed uniform
byte position within a word (0, 1, 2, or 3). There are no hard requirements
for comma density except for initialization. Even during initialization, it
is not required that commas be present on all lanes at the same time. All
lanes acquire word alignment independently of each other.
Clearly, we can multiplex a Fibre Channel stream unchanged over a
word-striped transport system. Moreover, we can freely choose the number of
lanes, 5, 4, 2, or 1 and convert easily from one number of lanes to another
without any change except adjustments in the disparities. Likewise, an
Ethernet format as described by H. Frazier is compatible with the exception
of the Idle (which already is a modification of the 1 Gb version). It should not
be difficult to reach agreement on a common Idle format. Note that the
Idles would not even have to be exactly alike. Also, if you need a special
Skip word, it can readily be defined with sufficient Hamming distance from
We have the opportunity to study some improvements in word formats over
existing practice as seen on the serial links without affecting anything at
the established interfaces. We have made some limited suggestions in this
direction (to simplify comma detection and disparity adjustments) and I am
disappointed that you have used these optional changes in your
argumentation against word striping. It is a totally separate issue and of
secondary importance. Since these changes are visible only in the
serialized formats, they should be of little concern to the protocol
architects. If a majority does not want to consider improvements, such
efforts can be abandoned.
The format for word striping is independent of the number of lanes. In this
context, one might ask whether the rate of 3.125 Gbaud per lane is not too
aggressive for a number of reasons. Some competent and experienced people
from several companies think so. A word striped Hari with 5 lanes provides
1) Implementation in a less aggressive technology which may be important
because the macros have to be implemented not just in small transceiver
related chips but also in large protocol chips and should not dictate the
technology selection for the large chips.
2) The transmission rate at 2.5 Gbaud matches the Infiniband rate. A lot of
standardization work related to the physical link specifications and the
circuit designs can be shared.
3) A very low density of commas will provide reliable word synchronization
information on all five lanes, because only commas which are modulo five
words apart can show up on the same lane as the previous comma.
4) If several independent 1.250 Gbaud Ethernet links are trunked together,
there are more design options because 2 links perfectly fit into a single
lane. Conversion from 5 lanes to 4, 2, or a single lane is straightforward
and not circuit intensive.
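Item 3 can be checked with a few lines of arithmetic. The following is a
minimal sketch, assuming words are assigned to lanes round-robin starting
at lane 0 and commas recur at a fixed spacing counted in words; the
function name lanes_receiving_commas is hypothetical.

```python
def lanes_receiving_commas(spacing_words, num_lanes):
    """Return the set of lanes that eventually see a comma.

    Assumes the first comma is in word 0 (lane 0) and every later comma
    follows spacing_words words after the previous one.
    """
    # The lane pattern repeats after num_lanes commas at the latest.
    return {(k * spacing_words) % num_lanes for k in range(num_lanes)}


if __name__ == "__main__":
    for lanes in (4, 5):
        for spacing in (2, 4, 5, 6):
            hit = sorted(lanes_receiving_commas(spacing, lanes))
            print(f"{lanes} lanes, comma every {spacing} words: {hit}")
```

Because five is prime, any spacing that is not a multiple of five words
walks the comma across all five lanes, whereas with four lanes an even
spacing reaches at most two of them.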
Synchronization: The word striped solution provides enough commas on all
lanes with normal ordered sets. The 4-lane byte striped version requires a
dedicated four-word sequence for synchronization, continuous monitoring of
skew, and adjustment of skew parameters. The four-word sequence, if carried
over to the media, is not suitable for lane numbers other than 4 and must
be translated. In contrast to word striping, the sequence is wholly
dedicated to synchronization and skew adjustment. It cannot be used for
anything else, so it wastes bandwidth in the interframe or packet gap. Word
striping gives more freedom to allocate this transport capacity for
functions such as:
1) Flow control
2) Lane Identification in trunking mode
3) Packet length for scramblers on the media side of Hari
4) Comma character for coded scrambled data if the scrambler is on the MAC
side. For such a configuration, a known simple synchronization technique
overwrites a SONET framing character with a comma character for
transmission over the coded segment and then restores the original format
at the receiving end. For the case of 5 lanes, a single comma an even
number of words apart will provide successive synchronization on all 5
lanes. It is more difficult to do this with the byte striped format, in
spite of its vaunted protocol independence, because of the more complex and
larger synchronization pattern.
Complexity: You have addressed this item in several communications (e.g. to
Mark Ritter 12/12/99 03:54:11 AM), but I have the impression that you
equate complexity with number of circuits or silicon area. The circuit area
for the two approaches is comparable and not different enough to tilt the
selection one way or the other. The circuit count is only one of several
contributing factors to complexity. I would certainly rate any control
loop as generally more complex than multiplexers and register to register
transfers at rates which can readily be handled by standard logic circuits.
Custom circuits and analog circuits should also be classified as more
complex than standard logic. Circuits of this type are more costly to
develop for high volume manufacturability. There is also more effort
involved to develop suitable production tests and such tests usually
require more circuit overhead than standard logic circuits. The byte
striped solution requires a circuit macro to align the 4 lanes first at the
baud interval and then at the byte interval. This is by its nature a
control loop which operates on inputs from other control loops (Clock
recovery) and almost certainly includes some high speed custom circuits. I
think most designers would classify it as a complex circuit. Traditional
circuits performing the identical or comparable function at lower signaling
rates are familiar power hogs. Things may be different now, but it
certainly would help in understanding your position if you could point to
some applicable literature references. I would not dare to insist on an exact
circuit count or part number! Instead, I would prefer an answer to why Hari
should be saddled with this extra, complex circuit macro when we can
clearly do without it by adopting word striping.
In a previous note, Mark Ritter referred to the deskew macro as extra
complexity and you have dismissed this reminder in your note of 12/12/99
03:54:11 AM as a mere "emphatic assertion". Granted, you have never
emphatically denied the need for such a control loop, but you have managed
to consistently ignore its existence and leave it out of any evaluation.
Can you make the case that deskewing in all its aspects at the stated rate
with byte striping is simpler than the extra multiplexers and register to
register transfers at 78 MHz required for word striping?
In the same note you disagree with the statement "Byte striping makes link
deskew much more complex and requires a unique initialization sequence to
deskew fully" and ask for proof to the contrary. The proof is that word
striped systems without this skew alignment circuit operate in mainstream
products of major companies. From your own presentations, it is evident
that byte striping relies on a delicately crafted multiple word sequence
for initialization and, contrary to your assertions in another note, word
striping can recover in normal traffic from a loss of synchronization.
Since deskew timing is hard wired for the word striped approach, no
adjustments are needed beyond word synchronization, independent for each
Power Dissipation: In several notes you have claimed an advantage for byte
striping which is not justified for the following reasons:
1) The byte striped version requires a skew control loop as referenced
above which includes custom circuits for the inter-lane skew alignment
and is unlikely to operate with negligible power. There is no such circuit
required for word striping and I see no accounting for this difference in
2) You have correctly identified extra registers and multiplexers for the
word striped approach and attributed extra power to these circuits without
accounting for the differences in switching rates. I have shown in a
previous note that the power in this area can be expected to be comparable
based on a well known equation. Neither you nor anyone else has claimed
that the equation does not apply to the situation or is used incorrectly.
You simply ignored it and repeated without any new justification your claim
of an advantage.
The best case you can make is that perhaps the difference in power is not
significant compared to the power used in clock recovery and off-chip
drivers. However, a significant power penalty could accrue from the byte
striped deskew circuit. Since word striped Hari does not have a comparable
circuit, it is up to the advocates of byte striping to present the case that
their deskewing technique uses negligible power, unless we all assent to a
privileged status for the byte striped proposal.
Protocol Independence: You repeat this vague claim over and over again, but
not once have you shown why word striping would be any worse. Outside the
Idle portion, both approaches have equally loose constraints, and for the
Idle portion, byte striping is clearly more restrictive, which can be
harmful as explained for a scrambling example above. Protocol compatibility
with Infiniband is of secondary importance because of the different data
rates and other differences.
Mapping for 10 Gb Fibre Channel and Ethernet: I find it surprising and
strange that you assert an advantage here. Please recall the many hours and
days you and others put into the effort to define a suitable Idle/Skip
structure for byte-striping and then reflect how much simpler it would have
been within the word striped constraints.
PMD Dependencies: Please reveal to us for which type of PMD word striping
is worse and why. I have explained in a previous note why the KKKK/RRRR
sequence is not suitable for the 12.5 Gbaud PMD.
You also published another note on related issues on 12/16/99 05:07:52 AM.
I disagree with many of the facts and conclusions presented there as well
but refrain from a point by point refutation. I am confident that readers
familiar with both byte and word striping can figure out the reasons for my
disagreement themselves; otherwise, I can be reached at the phone below
most of the time.
Albert Widmer Tel. 914 945-2047 Email:
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598-0218