RE: [10GBASE-T] latency
Tom, I DO NOT agree that that is the implication. Not everything should be done in the link layer. In the general case, I don't think flow control should be done in the link layer for a switched network because of the congestion-spreading behavior I described below. (For some specialized situations it may be adequate.) QOS would be an incomplete solution anyway. It might be used to separate the performance the disk array is seeing from the performance a vanilla client sees by putting them in different classes, but you would still have the potential for a slow member of a class (including one that is only slow due to temporary conditions) slowing down the rest of its class. I believe QOS belongs above 802.3, at a place where it is MAC independent. Pat
From: Thomas Dineen [mailto:firstname.lastname@example.org]
Sent: Friday, February 20, 2004 1:18 PM
To: THALER,PAT (A-Roseville,ex1)
Cc: email@example.com; firstname.lastname@example.org;
Subject: Re: [10GBASE-T] latency
While I agree with your description below, it sounds like we need
QOS for 802.3!
Kahled, are you listening????
>PAUSE really isn't the critical issue for latency. The important issue for latency is the effect on the performance of applications like clustering and storage.
>PAUSE operates over a link and not end-to-end. In your diagram, if machine A sends PAUSE, the switch will stop sending to it. If frames build up in memory enough, the switch may assert PAUSE to the server, but this will stop all the traffic from the server, including the traffic to other clients. If the traffic was TCP traffic, it would be better for the performance of the network for the switch to drop some frames going to its congested port (machine A) rather than sending PAUSE to the server. The TCP protocol in the server would react to the dropped frames by applying congestion control. The uncongested flows to other machines ideally would be unaffected.
>PAUSE may be useful in some special situations, but as Pedro's quote from Rich Seifert's book points out, it often isn't a good idea to enable it. The switch isn't involved in the TCP flow control other than that the switch drops packets when it runs out of buffers. TCP operates in the end nodes and assumes dropped packets are a sign of congestion. If something doesn't use TCP, it usually is supplying its own mechanism for reliability or it is willing to live with packets being dropped due to congestion.
>Even when one does want to use PAUSE, the cost of supporting extra latency for PAUSE is small, and the 10GBASE-T latency of media plus PHY is not likely to be more than the media plus PHY latencies of the fiber links (even if one doesn't count the latency of the WIS, which is huge).
>P.S. For those who really want to understand why it could be a bad idea to turn flow control on in Stephen's scenario:
>machine A <-- 1 Gig --> Switch <-- 10 Gig --> Server
>disk array <-- 10 Gig ------^
>Say the server is sending to the disk array and to machine A. The server is blind to the speeds of the links and it is queuing up traffic about equally for machine A and the disk array. Because of the difference in link speed, packets build up in switch memory and the switch sends PAUSE to the server. The server has to stop sending everything so its write to the disk array is also stopped. After a while the switch is able to send enough to A that it releases the PAUSE on the server which sends more packets. Again the packets are a mix of packets to machine A and the disk array so pretty soon the switch sends PAUSE to the server again. The cycle continues to repeat. While the transfer is going on the server is getting to send at about 1 Gig over its 10 Gig link to the disk array.
>It might not even be a case where the link is slow. Machine A might be connected on a 10 Gig physical link but it might not be able to keep up with the link speed. If it uses PAUSE to slow the average throughput down, then the same thing will happen. Throughput to a server's fast client is slowed down by PAUSE from a slow client.
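The PAUSE cycle described above can be checked with a rough fluid-model simulation. All parameters below (watermarks, timestep, 50/50 traffic split) are illustrative assumptions, not measured values:

```python
# Fluid-model sketch of the scenario:
#   machine A <-- 1 Gig --> Switch <-- 10 Gig --> Server
#   disk array <-- 10 Gig ------^
# The server splits its output evenly between machine A and the disk
# array, so link-level PAUSE from the switch throttles BOTH flows
# whenever the queue for A's slow port fills up.

LINK_SERVER = 10e9            # server uplink, bits/s
LINK_A = 1e9                  # machine A's link, bits/s
HI, LO = 400_000, 200_000     # assumed switch queue watermarks, bits
DT = 1e-6                     # 1 microsecond timestep
SIM_TIME = 0.1                # simulate 100 ms

queue_a = 0.0                 # bits queued in the switch for machine A
paused = False
disk_bits = 0.0               # bits delivered toward the disk array

for _ in range(int(SIM_TIME / DT)):
    sent = 0.0 if paused else LINK_SERVER * DT
    queue_a += sent / 2                        # half of the traffic targets A
    disk_bits += sent / 2                      # the other half goes to the disk
    queue_a = max(0.0, queue_a - LINK_A * DT)  # A's port drains at 1 Gig
    if queue_a >= HI:
        paused = True                          # switch asserts PAUSE
    elif queue_a <= LO:
        paused = False                         # switch releases PAUSE

disk_gbps = disk_bits / SIM_TIME / 1e9
print(f"effective server-to-disk throughput: {disk_gbps:.2f} Gb/s")
```

In steady state the switch holds the server paused about 80% of the time, so the 10 Gig flow to the disk array is throttled to roughly the speed of machine A's 1 Gig link, which is exactly the congestion spreading Pat describes.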
>[mailto:email@example.com] On Behalf Of Stephen
>Sent: Friday, February 20, 2004 7:55 AM
>To: Gavin Parnaby
>Subject: RE: [10GBASE-T] latency
>Gavin and Serag
>Gavin, thanks for that overview of the PAUSE functionality. One question
>I have is does anyone know if PAUSE is implemented end-to-end or
>hop-by-hop in an Ethernet connection between source and sink via one or
>more switches? Also, what is the primary method of flow control when the
>following scenario occurs?
>machine A <-- Speed A --> Switch <-- Speed B --> Server
>My concern is if speed A is < Speed B (e.g. A=1000BASE-T and
>B=10GBASE-T) and we use PAUSE to ensure flow control then 10GBASE-T may
>have to include PAUSE. If not then buffers in the switch will overflow
>when machine A requests a large amount of data from the server. Perhaps
>flow control in a switch with multi-speed ports is handled using
>something other than PAUSE? I am assuming this is a low-level switch and
>TCP is not used to do flow control; is this correct?
>RDMA considerations are a concern; RDMA is sure to demand as low a latency
>as possible. I believe there is a CFI scheduled for March? As Gavin
>pointed out we have some headroom thanks to the maximum link length in
>wireline versus optical. This headroom is about 8000 baud periods
>(assuming an 833 MHz clock). However, perhaps RDMA applications will
>require link lengths a lot less than 3 km.
>However, is a low-latency PHY that consumes a lot of power any more
>attractive than a higher-latency PHY that consumes less than half that
>power?
>On Fri, 2004-02-20 at 07:55, Gavin Parnaby wrote:
>>I've been working on a NIC-on-a-chip for Gigabit Ethernet, and I have
>>some experience with the way flow control is used in Ethernet
>>The PAUSE function is used to ensure that the receive buffers in a NIC
>>do not overflow. A PAUSE packet is triggered when a station's receive
>>buffer is filled to a high-water mark. When the other station's MAC
>>processes the PAUSE packet, it stops transmitting. The high-water mark
>>level is calibrated so that all data potentially transmitted between the
>>PAUSE packet being sent and it being processed will not overflow the buffer.
>>The total latency between a PAUSE packet transmit being requested by
>>station A and the PAUSE packet actually pausing the transmitter in
>>station B determines how much additional data could be received before
>>the flow stops. The processing time in the receiver is a part of this
>>delay, along with the propagation delay, the time to send the PAUSE
>>frame and potentially two maximum-length frames (one on each of station
>>A & B) (these could be jumbo frames).
>>So given this upper bound on the response time, it is possible to set
>>the watermark level so that PAUSE frames will prevent buffer overflow.
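Gavin's response-time bound can be turned into a back-of-envelope watermark-headroom calculation. The propagation figure and processing allowance below are assumptions (the 5.7 ns/m value is chosen to match the ~1140 bit-time round trip quoted for 100m UTP):

```python
# Worst-case data that can still arrive after the high-water mark trips:
# round-trip propagation + one max-length frame already in flight at each
# end + the PAUSE frame itself + receiver processing time.

LINE_RATE = 1e9             # Gigabit Ethernet, bits/s
CABLE_M = 100               # 100 m UTP link
PROP_NS_PER_M = 5.7         # assumed copper propagation delay, ns/m
MAX_FRAME = 1518 * 8        # standard max frame, bits (jumbo frames are larger)
PAUSE_FRAME = 64 * 8        # minimum-size PAUSE frame, bits
PROC_BITS = 512             # assumed MAC processing delay, in bit times

rtt_bits = 2 * CABLE_M * PROP_NS_PER_M * 1e-9 * LINE_RATE
headroom_bits = rtt_bits + 2 * MAX_FRAME + PAUSE_FRAME + PROC_BITS
print(f"round-trip propagation: {rtt_bits:.0f} bit times")
print(f"buffer needed above the watermark: {headroom_bits / 8:.0f} bytes")
```

A standard that adds receiver processing latency simply grows the PROC_BITS term, which pushes the watermark down (or the buffer up) by the same amount; that is why the change is manageable in the controller design.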
>>If a standard increases the processing latency in the receiver then the
>>buffer sizes and watermark level would need to be changed in the
>>controller/switch, as more data would potentially need to be buffered
>>between the transmit/receive of a PAUSE packet. I do not believe that
>>this would create a major problem in the design of the controller.
>>As you say, since the propagation delay of a 3km fiber link is
>>substantially greater than for 100m UTP (~30,000 bit times compared to
>>~1140 bit times), the receive buffer size / space above the watermark
>>level used in fiber controllers should be substantially larger than for
>>Gigabit over copper. Jumbo frames also change the amount of buffering /
>>watermark level needed. I think this indicates that an increase in
>>receiver processing time for 10GBase-T is viable in terms of the PAUSE
>>operation. There may be other requirements regarding RDMA etc.
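The fiber-versus-copper scaling can be reproduced from propagation velocities alone. The velocity figures below are rough assumptions (~2/3 of c in fiber, slightly less in twisted pair):

```python
# Round-trip propagation delay in bit times at 1 Gb/s for the two media.
C_FIBER = 2.0e8    # assumed propagation velocity in fiber, m/s
C_UTP = 1.75e8     # assumed propagation velocity in Cat-5 UTP, m/s
RATE = 1e9         # Gigabit Ethernet, bits/s

fiber_rtt = 2 * 3000 / C_FIBER * RATE   # 3 km optical link
utp_rtt = 2 * 100 / C_UTP * RATE        # 100 m copper link
print(f"3 km fiber: {fiber_rtt:,.0f} bit times round trip")
print(f"100 m UTP:  {utp_rtt:,.0f} bit times round trip")
```

This reproduces the ~30,000 versus ~1140 bit-time figures above: a fiber controller needs roughly 26 times more buffer space above the watermark than a copper one just to cover propagation.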
>>On Fri, 2004-02-20 at 11:47, Stoltz, Mario wrote:
>>>The latency requirements in the standard are based on clause 44.3 (which
>>>refers to clause 31 and annex 31B). Underlying reason for specifying
>>>delay budgets is "Predictable operation of the MAC Control PAUSE
>>>operation" as the standard puts it.
>>>In the 802.3ae days, there was a minor discussion in 2001 around the
>>>latency budgets summed up in table 44.2 "Round-trip delay constraints".
>>>Back then, I commented against the latency numbers of draft 3.0 (which
>>>are now in the standard).
>>>My argument back then was based on two points: a) the fact that the
>>>individual delay numbers in table 44.2 seemed to be built assuming
>>>different semiconductor technologies, and b) the fact that cabling delay
>>>is several orders of magnitude above sublayer delay anyway if we look at
>>>the distance objectives of the optical PHYs. For the sake of economic
>>>feasibility, I proposed relaxing the numbers, but without success.
>>>The current situation seems as if the delay budget threatens to inhibit
>>>technically attractive solutions. What we are probably missing (today as
>>>well as back then) is some data on the MAC control PAUSE operation and
>>>how it is really used in the field. That could tell us how reasonable it
>>>may be to add some slack to the current numbers.
>>>Some data, anyone?
>>>[mailto:firstname.lastname@example.org] On Behalf Of Stephen
>>>Sent: Thursday, February 19, 2004 6:56 PM
>>>To: Booth, Bradley; email@example.com
>>>Subject: Re: [10GBASE-T] latency
>>>Hi Brad and the 10GBASE-T Group
>>>I used to work for Massana (now part of Agere) but am now an Assistant
>>>Prof at the University of Alberta. I've been talking to some of you
>>>about this latency issue as I think it has a huge bearing on the
>>>viability of 10GBASE-T.
>>>I did some work based on the presentation of Scott Powell and others
>>>that tried to estimate the power consumption of 10GBASE-T components.
>>>Based on present performance criteria and ADCs featuring in ISSCC this
>>>year I concur with his results which show that they are, by far, the
>>>dominant power drain. For this and other reasons I am coming to the
>>>conclusion that the trade-off between the SNR target at the decoder
>>>input and coding gain is not appropriate at present (I am assuming we
>>>are using the 1000BASE-T code).
>>>Part of my research is involved with coding and decoding in high-speed
>>>systems with ISI. One area of application is obviously 10GBASE-T. I know
>>>Sailesh presented some work on LDPC codes. Another coding option people
>>>have mentioned is a concatenated code. Both of these require that the
>>>latency budget in 10G be relaxed. In the first case because LDPC
>>>requires an iterative decoder and the second since we must interleave
>>>between the two codes.
>>>I have heard the figure of 1us being the limit for MAC to MAC latency in
>>>10G though I've not heard any justification or reasons for this. Even
>>>assuming we can use 50% of this in the decoder we still only have about
>>>400-500 baud periods (and hence clock cycles) to play with. This is a
>>>very small figure for both the options above.
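Stephen's cycle count is easy to sanity-check. The 1 us budget and the 50% decoder share are his assumed figures, not values from the standard:

```python
# Decoder cycle budget under the rumored 1 us MAC-to-MAC latency limit.
MAC_BUDGET_S = 1e-6      # quoted MAC-to-MAC latency limit, seconds
DECODER_SHARE = 0.5      # assume half the budget goes to the decoder
BAUD_HZ = 833e6          # 10GBASE-T symbol rate used earlier in the thread

decoder_cycles = MAC_BUDGET_S * DECODER_SHARE * BAUD_HZ
print(f"decoder budget: about {decoder_cycles:.0f} baud periods")
```

That works out to roughly 416 baud periods, consistent with the 400-500 range quoted above, and indeed very tight for an iterative LDPC decoder or an interleaved concatenated code.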
>>>I think getting a better idea of what the upper bound on latency needs
>>>to be is very important and I would be interested in hearing people's
>>>opinion on the coding options for 10GBASE-T. I hope to make another of
>>>the study group meetings as soon as my teaching commitments allow.
>>>If anyone has any questions about this please feel free to contact me.
>>>On Wed, 2004-02-18 at 12:12, Booth, Bradley wrote:
>>>>I remember Sailesh mentioning that if we are willing to make
>>>>trade-offs against latency, that we can make use of significantly more
>>>>powerful techniques to reduce the complexity. I know people have been
>>>>looking at this as a possible issue. What is an acceptable latency
>>>>trade-off? Is the current latency requirement for 1000BASE-T creating
>>>>problems for it in latency sensitive applications?
>>>>Any thoughts or comments?
>>>>Chair, IEEE 802.3 10GBASE-T Study Group