
Re: Does Ten-Gigabit Ethernet need fault tolerance?




Mick,

What level of expertise is required to properly implement a failover scheme
using P802.3ad?  How many people in the world currently have the technical
expertise to do this?

Thank you,
Roy Bynum
MCI WorldCom



Mick Seaman wrote:

> Roy,
>
> No (see your assertion below), it is going to add very little cost indeed,
> and we know this because we (several vendors) have already implemented
> prestandard versions of link aggregation on low cost switches. P802.3ad
> takes the results of this field experience (many thousands of systems
> shipped) and provides a plug and play solution, with maximal link
> aggregation as the factory default (at the switch vendor's option), across
> a very wide range of system implementations. The cost is the memory cost
> of the implementation; processing is close to negligible.
>
> I suggest you look at P802.3ad, which will be sent for 802.3 working group
> ballot very shortly.
> I have put together a simulation of this (so you can see how big an
> implementation would be, albeit a model implementation) with a number of
> test cases; this is also publicly available.
>
> 802.3ad can hold one link as a standby for the other. The protocol
> information exchanged and the rules for using it allow both systems to
> select the same link to use (rather than each making a separate transmit
> decision). However, I believe that the most likely configuration uses
> admission control at the edge of the switched network to ensure that the
> capacity after failure is sufficient to handle the applied load.
>
> Mick
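
[Illustrative sketch, not from the original posting.] The admission-control
idea in the message above, admitting only as much load as the aggregate can
still carry after a failure, might be checked as follows. The function name,
the Gb/s units, and the worst-case rule (the largest links are assumed to be
the ones that fail) are assumptions made for illustration and are not taken
from the P802.3ad draft.

```python
def load_fits_after_failure(link_gbps, admitted_gbps, failures=1):
    """Check that the admitted load still fits once `failures` links are lost.

    Worst case is assumed: the highest-capacity links are the ones that fail.
    """
    if failures >= len(link_gbps):
        # Nothing survives, so only a zero load can be carried.
        return admitted_gbps <= 0
    surviving = sorted(link_gbps)[: len(link_gbps) - failures]
    return admitted_gbps <= sum(surviving)


# Two aggregated 10 Gb/s links, one of which is allowed to fail:
print(load_fits_after_failure([10, 10], 10))  # load fits on the survivor
print(load_fits_after_failure([10, 10], 12))  # load exceeds the survivor
```

With two links this is the one-active, one-standby case: the admitted load is
capped at the capacity of a single link.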
>
> -----Original Message-----
> From: Roy Bynum [mailto:rabynum@xxxxxxxxxxx]
> Sent: Monday, July 19, 1999 9:52 PM
> To: Mick Seaman
> Cc: stds-802-3-hssg@xxxxxxxx
> Subject: Re: Does Ten-Gigabit Ethernet need fault tolerance?
>
> Mick,
>
> What you are describing is similar to the N+1 architecture of SONET/SDH
> linear spans, with the subscription rate monitored to the maximum of N.
> The monitoring of this is going to be a "woolly booger".  The system
> processing to control this is going to add a lot of cost to the systems.
> This is one reason that carriers are going to simpler architectures.
>
> Will P802.3ad work with only two links and not lose any transport
> bandwidth in the event of the failure of one link?  If it does not, it is
> not fault tolerant, it is fault allowant.
>
> Thank you,
> Roy Bynum
> MCI WorldCom
>
> Mick Seaman wrote:
>
> > The traffic restoration time of P802.3ad will be as fast as:
> >
> > 1) Failure of an individual link (in a set of aggregated links) can be
> > detected.
> >
> > 2) Each of the connected systems (routers, switches, or end stations) can
> > move 'flows' from that link to another link.
> >
> > The second of these, moving flows, is entirely under the control of the
> > system that is transmitting the frames (distribution of flows is not
> > symmetric). That is to say, each transmitter at each end of the link can
> > independently start using another link in place of the failed link.
> >
> > The constraints on moving the flows are therefore:
> >
> > (a) how fast the transmitting system can react to link failure
> >
> > (b) any deliberate allowance that that system may make to preserve
> > ordering because the remote system takes time to 'collect' the frames
> > from each individual link - see the P802.3ad draft for a detailed
> > discussion of this, which makes it easier for links to be terminated on
> > separate physical pieces of hardware (cards etc.).
> >
> > Disregarding (b) for the present - see more below - recovery from a failed
> > link takes place without any protocol being exchanged between the
> > communicating systems, so that (2) can be as short a time as the
> > implementor cares to make it. Given the simplicity of the decision to be
> > made - it can be precomputed, if this link fails I do X - it is not
> > unreasonable to do this at interrupt time in a software system. As it
> > happens, failover has been implemented in very low cost (as cheap as
> > managed Ethernet gets) volume products within 20 milliseconds. More
> > attention to the subsystem design could naturally achieve failover rather
> > more quickly, though I don't know that anyone has bothered to strive for
> > sub 10 millisecond failover yet.
> >
> > How much allowance is made for (b) is an implementation issue, but can be
> > minimized by receive side switch design. If this became a hot area I am
> > sure that vendors would start quoting it. In the IP world the right answer
> > is to make no allowance at all.
> >
> > Going on from P802.3ad, a similar approach can be used to dramatically
> > improve 802.1D spanning tree reconfiguration. Again a choice of the link
> > (or set of aggregated links) to fail over to can be preselected for dual
> > redundant tree topologies - failover can be initiated without the
> > transmission of additional protocol messages - and again 20 milliseconds
> > failover is achieved. Achieving these failover times for failure of any
> > single link in the network requires careful consideration of the topology
> > to be used; however, work underway in 802.1 is designed to free spanning
> > tree reconfiguration times from all worst case end to end timer estimates,
> > so where messages need to be exchanged the 'old' very slow spanning tree
> > reconfiguration times no longer apply. This work is underway as P802.1w
> > Rapid Reconfiguration. It will result in protocol enhancements that will
> > plug and play with existing switches - though those switches won't achieve
> > the rapid reconfiguration times (new switches attached to them may well
> > do, depending on topology). Rapid reconfiguration at this level protects a
> > level 2 network of switches.
> >
> > It should be noted that P802.3ad may well reconfigure from link failure on
> > an aggregated link basis. This is the best approach for aggregated links
> > since it does not disrupt topology at all (to first order; capacity
> > changes may well require traffic redistribution at layer 3 or more
> > probably at layer 2.5 (MPLS)). Above this, spanning tree can be used for
> > rapid local reconfiguration to avoid disrupting routing. Above that,
> > routing may have to kick in to rediscover best paths etc. Again it is
> > possible for routing to precompute alternates on a local level and fail
> > over to them rapidly as a first response to failure; good network topology
> > is however required for this to be effective, and may be impossible on a
> > long haul basis. MPLS failover may well be more challenging and require
> > setting up the MPLS label switching path again - I think this is the only
> > recovery strategy where RSVP-like MPLS signalling is used. However these
> > problems, which are very real, will not be helped at all by local link
> > recovery except inasmuch as that completely masks failure, as in the
> > P802.3ad Link Aggregation case.
> >
> > Note that in the case where protocol messages are sent to aid rapid
> > reconfiguration the design challenge is to minimize the number of messages
> > sent to achieve a given timing. Almost any protocol (well, any correct
> > protocol, of which there are rather fewer) can achieve rapid
> > reconfiguration if the designer is allowed to send many messages; it just
> > depends on how much processing is tolerated.
> >
> > All of the above, IMHO, indicates to me that Ethernet should stick to
> > providing timely indication of link failure, and making sure that both
> > communicating systems see the failure. Timely indication of link recovery
> > is also desirable but since this will undoubtedly be confirmed by a
> > protocol exchange before the link is brought into service, it is less
> > critical.
> >
> > Mick
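
[Illustrative sketch, not from the original posting.] The "it can be
precomputed, if this link fails I do X" idea from the message above could look
roughly like this on the transmit side. The class name, the CRC-based flow
hash, and the table layout are hypothetical; the P802.3ad draft does not
mandate a particular distribution algorithm, and frame-ordering allowance
(point (b) above) is ignored here.

```python
import zlib


class Distributor:
    """Transmit-side flow distribution over an aggregate (hypothetical)."""

    def __init__(self, links):
        self.links = list(links)       # configured order, used for hashing
        self.survivors = list(links)   # links currently usable
        self._precompute()

    def _precompute(self):
        # "If this link fails I do X": one ready-made survivor list per link.
        self.if_fails = {
            failed: [link for link in self.survivors if link != failed]
            for failed in self.survivors
        }

    def on_link_down(self, failed):
        # Fast path is a table lookup; nothing is exchanged with the peer.
        if failed in self.if_fails:
            self.survivors = self.if_fails[failed]
            self._precompute()         # refresh the table off the fast path

    def link_for(self, flow_id: bytes):
        if not self.survivors:
            raise RuntimeError("no working link in the aggregate")
        h = zlib.crc32(flow_id)
        primary = self.links[h % len(self.links)]
        if primary in self.survivors:
            return primary             # flows on healthy links do not move
        return self.survivors[h % len(self.survivors)]


d = Distributor(["link0", "link1", "link2"])
d.on_link_down("link1")
print(d.link_for(b"10.0.0.1->10.0.0.2"))
```

Only flows whose primary link has failed are remapped; every frame of a given
flow still hashes to a single link.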
> >
> > -----Original Message-----
> > From: owner-stds-802-3-hssg@xxxxxxxxxxxxxxxxxx
> > [mailto:owner-stds-802-3-hssg@xxxxxxxxxxxxxxxxxx]On Behalf Of Roy Bynum
> > Sent: Sunday, July 18, 1999 5:33 PM
> > To: mick@xxxxxxxxxxx
> > Cc: stds-802-3-hssg@xxxxxxxx
> > Subject: Re: Does Ten-Gigabit Ethernet need fault tolerance?
> >
> > Mick,
> >
> > When implemented, how fast is the FT traffic restoration of P802.3ad
> > supposed to work?  From that restoration time, calculate how much data got
> > lost.  One of the major features of tightly coupled error detection is
> > that the traffic restoration times are greatly reduced.  At present, the
> > fastest that L3 Matrix (I can not use any vendor implementation names.),
> > MPLS, or L1/REI/L3 is proposed is 10 times slower than 802.3 that is
> > directly mapped into SONET.  (There are several transport vendors that are
> > doing this.)  There are other issues with fiber maintenance that have not
> > been issues before because of the amount of aggregated data and the "round
> > the clock" use that 10GbE will likely see.  This makes the FT as much a
> > function of the PHY as anywhere else.
> >
> > Thank you,
> > Roy Bynum
> > MCI WorldCom
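
[Illustrative sketch, not from the original posting.] The calculation asked
for above ("from that restoration time, calculate how much data got lost") is
simple arithmetic. The sketch assumes a fully loaded 10 Gb/s link and that
everything offered during the outage is lost, so it gives an upper bound. The
outage values are the figures mentioned elsewhere in this thread (1 ms, 10 ms,
20 ms); the function name is made up for illustration.

```python
def bytes_at_risk(rate_bps: int, outage_ms: int) -> int:
    """Bytes a fully loaded link would have carried during the outage."""
    return rate_bps * outage_ms // 1000 // 8


TEN_GBPS = 10_000_000_000

for outage_ms in (1, 10, 20):
    print(outage_ms, "ms:", bytes_at_risk(TEN_GBPS, outage_ms), "bytes")
```

At 20 ms this comes to 25,000,000 bytes, i.e. about 25 megabytes at line rate.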
> >
> > Mick Seaman wrote:
> >
> > > What needs to be built in is the detection of failure. What we don't
> > > need to do is to build everything into the MAC. I suggest you look at
> > > the fault tolerant capabilities provided by P802.3ad and the work on
> > > Rapid Reconfiguration starting in 802.1.
> > >
> > > Both these (will) provide a degree of fault tolerance based on using
> > > protocols that are independent of MAC details to allow network nodes to
> > > precalculate their response to a low level indication of failure. There
> > > is really no need to build these protocols into the MAC.
> > >
> > > Mick
> > >
> > > -----Original Message-----
> > > From: owner-stds-802-3-hssg@xxxxxxxxxxxxxxxxxx
> > > [mailto:owner-stds-802-3-hssg@xxxxxxxxxxxxxxxxxx]On Behalf Of Joe Gwinn
> > > Sent: Friday, July 16, 1999 3:15 PM
> > > To: stds-802-3-hssg@xxxxxxxx
> > > Subject: Does Ten-Gigabit Ethernet need fault tolerance?
> > >
> > > The purpose of this note is to present a case for inclusion of fault
> > > tolerance in 10GbE, and to offer a suitable proven technology for
> > > consideration.  However, no salesman will call.
> > >
> > > Why Fault Tolerance?  Ten-Gigabit Ethernet is going to be a relatively
> > > expensive, high-performance technology intended for major backbones,
> > > perhaps even nibbling at the bottom end of the wide-area network (WAN)
> > > market.  In such applications, high availability is very much desired;
> > > loss of such a backbone or WAN is much too disruptive (and therefore
> > > expensive) to be much tolerated, and this kind of a market will gladly
> > > pay a reasonable premium to achieve the needed fault tolerance.
> > >
> > > Why add Fault Tolerance now?  Because it's easiest (and thus cheapest)
> > > if done from the start, and because having FT built in and therefore
> > > becoming ubiquitous will be a competitive discriminator, neutralizing
> > > one of the remaining claimed advantages of ATM.
> > >
> > > Isn't Fault Tolerance difficult?  In hub-and-spoke (logical star,
> > > physical loop) topologies such as GbE and 10GbE, it's not hard to
> > > achieve both fault tolerance (FT) and military-level damage tolerance
> > > (DT).  In networks of unrestricted topology, it's a lot harder.  The
> > > presence of bridges does not affect this conclusion.
> > >
> > > How do I know that FT is so easily achieved?  Because it's already been
> > > done, may be bought commercially, and is in use on one military system
> > > and is proposed for others.  The FT/DT technology mentioned here was
> > > developed on a US Navy project, and is publicly available without
> > > intellectual property restrictions.  Why was the technology made public?
> > > To encourage its adoption and use in COTS products, so that defense
> > > contractors can buy FT/DT LANs from catalogs, rather than having to
> > > develop them again and again, at great risk and expense.
> > >
> > > What is the difference between Fault Tolerance and Damage Tolerance?  In
> > > fault tolerance, faults are rare and do not correlate in either time or
> > > place. The classic example is the random failure of hardware components.
> > > (Small acts of damage, such as somebody tripping over a wire or breaking
> > > a connector somewhere, are treated as faults here because they are also
> > > rare and uncorrelated.) In damage tolerance, the individual faults are
> > > sharply correlated in time and place, and are often massive in number.
> > > The classic military example is a weapon strike. In the commercial
> > > world, a major power failure is a good example. Damage tolerance is
> > > considered much harder to accomplish than fault tolerance. If you have
> > > damage tolerance, you also have fault tolerance, but fault tolerance
> > > does not by itself confer damage tolerance.
> > >
> > > How is this Damage Tolerance achieved?  All changes in LAN segment
> > > topology (the loss or gain of nodes (NICs), hubs, or fibers) are
> > > detected in MAC hardware by the many link receivers, which report both
> > > loss and acquisition of modulated light. This surveillance occurs all
> > > the time on all links, and is independent of data traffic. Any change in
> > > topology provokes the hardware into "rostering mode", the automatic
> > > exploration of the segment using a flood of special "roster" packets to
> > > find the best path, where "best" is defined as that path which includes
> > > the maximum number of nodes (NICs).
> > >
> > > Just how fault tolerant and damage tolerant is this scheme?  A segment
> > > will work properly with any number of nodes and hubs, if sufficient
> > > fibers survive to connect them together, and will automatically
> > > configure itself into a working segment within a millisecond of the last
> > > fault. If the number of broken fibers is less than the number of hubs,
> > > all surviving nodes will remain accessible, regardless of the fault
> > > pattern. If the number of fiber breaks is equal to or greater than the
> > > number of hubs, there is a simple equation to predict the probability of
> > > loss of access to a typical node due to loss of hubs and/or fibers,
> > > given only the number of hubs and the probability of any fiber breaking:
> > >
> > >     Pnd[p,r] = ((2p)(1-p))^r,
> > >
> > > where p is the probability of fiber breakage and r is the number of
> > > surviving hubs (which ranges from zero to four in a quad system). This
> > > equation is exact (to within 1%) for fiber breakage probabilities of 33%
> > > or less, and applies for any number of hubs.
> > >
> > > The simplicity of this equation is a consequence of the simplicity of
> > > this protocol, which is currently implemented in standard-issue FPGAs
> > > (not ASICs), and works without software intervention.  It can also be
> > > implemented in firmware.
> > >
> > > To give a numerical example, in a 33-node 4-hub segment, loss of 42
> > > fibers (16% of the segment's 264 fibers) would lead to only 0.5% of the
> > > nodes becoming inaccessible, on average. Said another way, after 42
> > > fiber breaks, there are only five chances out of a thousand that a node
> > > will become inaccessible. This is very heavy damage, with one fiber in
> > > six broken. To take a more likely example, with three broken fibers, all
> > > nodes will be accessible, and with four broken fibers, there is less
> > > than one chance in a million that a node will become inaccessible.
> > > Recovery takes two ring tour times plus settling time (electrical plus
> > > mechanical), typically less than one millisecond in ship-size networks,
> > > measured from the last fault. Chattering and/or intermittent faults can
> > > be handled by a number of mechanisms, including delaying node entry by
> > > up to one second. Few current LAN technologies approach this degree of
> > > resilience, or speed of recovery.
> > >
> > > In commercial systems and some military systems, a dual-ring solution is
> > > sufficient.  Up to quad-ring solutions, needed for some military
> > > systems, are commercially available.  However, the ability to support up
> > > to quad redundant systems should be provided in 10GbE, for two reasons.
> > > First, quad is needed for the military market, which may be economically
> > > significant in the early years of 10GbE.  Second, quad provides a clear
> > > growth path and a way to reassure non-military customers that their most
> > > stringent problems can be solved: One can ask them if their needs really
> > > exceed those of warships duelling with supersonic missiles.
> > >
> > > The basic technical document, the RTFC Principles of Operation, is on
> > > the GbE website as
> > > "http://grouper.ieee.org/groups/802/3/10G_study/public/email_attach/gwinn_1_0699.pdf"
> > > and
> > > "http://grouper.ieee.org/groups/802/3/10G_study/public/email_attach/gwinn_2_0699.pdf".
> > > I was a member of the team that developed the technology, and am the
> > > author of these documents.
> > >
> > > Although these documents assume RTFC, a form of distributed shared
> > > memory, the basic rostering technology can easily be adapted for Gigabit
> > > and Ten-Gigabit Ethernet as well.  For nontechnical reasons, RTFC
> > > originally favored smart nodes connected via dumb hubs.  However, the
> > > overall design can be somewhat simplified if one goes the other way, to
> > > dumb nodes and smart hubs.  This also allows the same dumb nodes to be
> > > used in both non-FT and FT networks, increasing node production volume,
> > > and does not force users to throw nodes away to upgrade to FT.
> > >
> > > I therefore would submit that 10GbE would greatly benefit from fault
> > > tolerance, and also that it's very easily achieved if included in the
> > > original design of 10GbE.
> > >
> > > Joe Gwinn
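
[Illustrative sketch, not from the original posting.] The node-loss estimate
quoted in the original note, Pnd[p,r] = ((2p)(1-p))^r, can be checked against
the note's own numerical example. The fiber count below assumes one fiber pair
per node-to-hub link (33 x 4 x 2 = 264, which matches the stated total); the
function name is made up. Per the note, the formula applies once the number of
breaks is at least the number of hubs.

```python
def p_node_inaccessible(p: float, r: int) -> float:
    """Pnd[p, r] = ((2p)(1-p))^r, with p the fiber-break probability
    and r the number of hubs, as given in the note."""
    return (2 * p * (1 - p)) ** r


NODES, HUBS = 33, 4
fibers = NODES * HUBS * 2   # assumed: a fiber pair per node-hub link

for broken in (4, 42):
    p = broken / fibers
    print(broken, "broken fibers:", p_node_inaccessible(p, HUBS))
```

With 42 broken fibers the result is about 0.005 (roughly 0.5% of nodes), and
with four broken fibers it is below one in a million, in line with the figures
given in the note.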