
RE: Does Ten-Gigabit Ethernet need fault tolerance?




I think the number is into double digits already, and will rise
significantly on approval and wider dissemination of the standard. The
control protocol task is less complex than even the simplest routing
protocols, so the 'total addressable implementors' probably number in the
hundreds.

But don't take my word for it; make your own assessment of the document. In
any case, the leading switch manufacturers already have multiple
implementors who are familiar with this technology, and for those who are
less familiar with it I am sure there are consultants who would be happy to
oblige.

Mick

-----Original Message-----
From: Roy Bynum [mailto:rabynum@xxxxxxxxxxx]
Sent: Tuesday, July 20, 1999 6:27 PM
To: Mick Seaman
Cc: stds-802-3-hssg@xxxxxxxx
Subject: Re: Does Ten-Gigabit Ethernet need fault tolerance?


Mick,

What level of expertise is required to properly implement a fail over scheme
using P802.3ad?  How many people in the world currently have the technical
expertise to do this?

Thank you,
Roy Bynum
MCI WorldCom



Mick Seaman wrote:

> Roy,
>
> No (see your assertion below), it is going to add very little cost
> indeed, and we know this because we (several vendors) have already
> implemented prestandard versions of link aggregation on low cost
> switches. P802.3ad takes the results of this field experience (many
> thousands of systems shipped) and provides a solution that is plug and
> play, factory default, maximal link aggregation (at the switch vendor's
> option) across a very wide range of system implementations. The cost is
> the memory cost of the implementation; processing is close to negligible.
>
> I suggest you look at P802.3ad, which will be sent for 802.3 working
> group ballot very shortly. I have put together a simulation of it (so you
> can see how big an implementation would be, albeit a model
> implementation) with a number of test cases; this is also publicly
> available.
>
> 802.3ad can hold one link as a standby for the other. The protocol
> information exchanged, and the rules for using it, allow both systems to
> select the same link to use (rather than each making a separate transmit
> decision). However, I believe that the most likely configuration uses
> admission control at the edge of the switched network to ensure that the
> capacity after failure is sufficient to handle the applied load.
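
To make that concrete, here is a rough sketch of how both ends can settle
on the same active link with no negotiation round trip - illustrative
Python only, not from the P802.3ad draft; all field names are invented:

    # Both ends see the same exchanged port information for each link, so
    # each can apply the same deterministic rule independently and still
    # arrive at the same choice of active link.
    def select_active_link(links):
        """links: dicts with 'id', 'up', and the two ends' port ids."""
        candidates = [l for l in links if l['up']]
        if not candidates:
            return None
        # Identical sort key at both ends -> identical selection.
        return min(candidates, key=lambda l: (l['a_port'], l['b_port']))

    links = [{'id': 'link1', 'up': True, 'a_port': 1, 'b_port': 7},
             {'id': 'link2', 'up': True, 'a_port': 2, 'b_port': 5}]
    active = select_active_link(links)   # both ends pick 'link1'
    standby = [l for l in links if l is not active]

If 'link1' then fails, both ends re-run the same rule over the surviving
links and again agree, without exchanging anything further.
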
>
> Mick
>
> -----Original Message-----
> From: Roy Bynum [mailto:rabynum@xxxxxxxxxxx]
> Sent: Monday, July 19, 1999 9:52 PM
> To: Mick Seaman
> Cc: stds-802-3-hssg@xxxxxxxx
> Subject: Re: Does Ten-Gigabit Ethernet need fault tolerance?
>
> Mick,
>
> What you are describing is similar to the N+1 architecture of SONET/SDH
> linear spans, with the subscription rate monitored to the maximum of N.
> The monitoring of this is going to be a "woolly booger". The system
> processing to control this is going to add a lot of cost to the systems.
> This is one reason that carriers are going to simpler architectures.
>
> Will P802.3ad work with only two links and not lose any transport
> bandwidth in the event of the failure of one link? If it does not, it is
> not fault tolerant; it is fault allowant.
>
> Thank you,
> Roy Bynum
> MCI WorldCom
>
> Mick Seaman wrote:
>
> > The traffic restoration time of P802.3ad will be as fast as:
> >
> > 1) Failure of an individual link (in a set of aggregated links) can be
> > detected.
> >
> > 2) Each of the connected systems (routers, switches, or end stations) can
> > move 'flows' from that link to another link.
> >
> > The second of these, moving flows, is entirely under the control of the
> > system that is transmitting the frames (distribution of flows is not
> > symmetric). That is to say, each transmitter at each end of the link can
> > independently start using a link other than the failed link.
> >
> > The constraints on moving the flows are therefore:
> >
> > (a) how fast the transmitting system can react to link failure
> >
> > (b) any deliberate allowance that that system may make to preserve
> > ordering because the remote system takes time to 'collect' the frames
> > from each individual link - see the P802.3ad draft for a detailed
> > discussion of this, which makes it easier for links to be terminated on
> > separate physical pieces of hardware (cards etc.).
> >
> > Disregarding (b) for the present - see more below - recovery from a
> > failed link takes place without any protocol being exchanged between the
> > communicating systems, so (2) can be as short a time as the implementor
> > cares to make it. Given the simplicity of the decision to be made - it
> > can be precomputed: if this link fails, I do X - it is not unreasonable
> > to do this at interrupt time in a software system. So it happens that
> > failover has been implemented in very low cost (as cheap as managed
> > Ethernet gets) volume products within 20 milliseconds. More attention to
> > the subsystem design could naturally achieve failover rather more
> > quickly, though I don't know of anyone who has bothered to strive for
> > sub-10-millisecond times yet.
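
For concreteness, the precomputed decision amounts to a table lookup in
the link-down path. A rough sketch in Python - illustrative only, not from
any product or from the draft:

    # "If this link fails I do X": the failover plan is computed at
    # configuration time, so the link-down handler does only a lookup and
    # a flow move - cheap enough to run at interrupt time.
    failover_plan = {}   # link -> precomputed alternate link
    flows_on = {}        # link -> set of flow ids currently assigned to it

    def precompute(aggregation):
        """Runs at configuration time, not at failure time."""
        for link in aggregation:
            alternates = [l for l in aggregation if l != link]
            failover_plan[link] = alternates[0] if alternates else None

    def on_link_down(link):
        """Link-down handler: no protocol exchange, just the planned move."""
        alt = failover_plan.get(link)
        if alt is not None:
            flows_on.setdefault(alt, set()).update(flows_on.pop(link, set()))

    precompute(['link1', 'link2'])
    flows_on['link1'] = {'flow_a', 'flow_b'}
    on_link_down('link1')   # flows move to 'link2'; no messages are sent
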
> >
> > How much allowance is made for (b) is an implementation issue, but it
> > can be minimized by receive-side switch design. If this became a hot
> > area I am sure that vendors would start quoting it. In the IP world the
> > right answer is to make no allowance at all.
> >
> > Going on from P802.3ad, a similar approach can be used to dramatically
> > improve 802.1D spanning tree reconfiguration. Again, a choice of the
> > link (or set of aggregated links) to fail over to can be preselected for
> > dual redundant tree topologies - failover can be initiated without the
> > transmission of additional protocol messages - and again 20 millisecond
> > failover is achieved. Achieving these failover times for failure of any
> > single link in the network requires careful consideration of the
> > topology to be used; however, work underway in 802.1 is designed to free
> > spanning tree reconfiguration times from all worst-case end-to-end timer
> > estimates, so where messages need to be exchanged the 'old' very slow
> > spanning tree reconfiguration times no longer apply. This work is
> > underway as P802.1w Rapid Reconfiguration. It will result in protocol
> > enhancements that will plug and play with existing switches - though
> > those switches won't achieve the rapid reconfiguration times (new
> > switches attached to them may well do, depending on topology). Rapid
> > reconfiguration at this level protects a level 2 network of switches.
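
The same precompute-then-switch pattern applies here. A rough sketch of a
preselected spanning tree alternate - illustrative only, this is not the
P802.1w state machine:

    # For a dual redundant tree topology, an alternate port toward the
    # root is preselected while the network is stable, so root port
    # failure is handled locally, without waiting for any BPDU exchange.
    class BridgePorts:
        def __init__(self, root_port, alternate_port):
            self.root_port = root_port            # current path to the root
            self.alternate_port = alternate_port  # preselected backup path

        def on_root_port_failure(self):
            # No protocol messages: promote the preselected alternate.
            self.root_port, self.alternate_port = self.alternate_port, None

    bridge = BridgePorts(root_port='p1', alternate_port='p2')
    bridge.on_root_port_failure()   # 'p2' takes over immediately
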
> >
> > It should be noted that P802.3ad may well recover from link failure on
> > an aggregated-link basis. This is the best approach for aggregated links
> > since it does not disrupt topology at all (to first order; capacity
> > changes may well require traffic redistribution at layer 3, or more
> > probably at layer 2.5 (MPLS)). Above this, spanning tree can be used for
> > rapid local reconfiguration to avoid disrupting routing. Above that,
> > routing may have to kick in to rediscover best paths etc. Again it is
> > possible for routing to precompute alternates on a local level and fail
> > over to them rapidly as a first response to failure; good network
> > topology is however required for this to be effective, and it may be
> > impossible on a long-haul basis. MPLS failover may well be more
> > challenging and require setting up the MPLS label switching path again -
> > I think this is the only recovery strategy where RSVP-like MPLS
> > signalling is used. However these problems, which are very real, will
> > not be helped at all by local link recovery, except in as much as that
> > completely masks failure, as in the P802.3ad Link Aggregation case.
> >
> > Note that in the case where protocol messages are sent to aid rapid
> > reconfiguration, the design challenge is to minimize the number of
> > messages sent to achieve a given timing. Almost any protocol (well, any
> > correct protocol, of which there are rather fewer) can achieve rapid
> > reconfiguration if the designer is allowed to send many messages; it
> > just depends on how much processing is tolerated.
> >
> > All of the above, IMHO, indicates to me that Ethernet should stick to
> > providing timely indication of link failure, and to making sure that
> > both communicating systems see the failure. Timely indication of link
> > recovery is also desirable, but since this will undoubtedly be confirmed
> > by a protocol exchange before the link is brought into service, it is
> > less critical.
> >
> > Mick
> >
> > -----Original Message-----
> > From: owner-stds-802-3-hssg@xxxxxxxxxxxxxxxxxx
> > [mailto:owner-stds-802-3-hssg@xxxxxxxxxxxxxxxxxx]On Behalf Of Roy Bynum
> > Sent: Sunday, July 18, 1999 5:33 PM
> > To: mick@xxxxxxxxxxx
> > Cc: stds-802-3-hssg@xxxxxxxx
> > Subject: Re: Does Ten-Gigabit Ethernet need fault tolerance?
> >
> > Mick,
> >
> > When implemented, how fast is the FT traffic restoration of P802.3ad
> > supposed to work? From that restoration time, calculate how much data
> > got lost. One of the major features of tightly coupled error detection
> > is that the traffic restoration times are greatly reduced. At present,
> > the fastest that L3 Matrix (I cannot use any vendor implementation
> > names.), MPLS, or L1/REI/L3 is proposed to work is 10 times slower than
> > 802.3 that is directly mapped into SONET. (There are several transport
> > vendors that are doing this.) There are other issues with fiber
> > maintenance that have not been issues before, because of the amount of
> > aggregated data and the "round the clock" traffic that 10GbE will likely
> > see. This makes the FT as much a function of the PHY as anywhere else.
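
For scale, taking up that invitation with the 20 millisecond failover
figure quoted earlier in this thread - rough arithmetic only:

    # Data "in the gap" during restoration at the 10GbE line rate.
    line_rate_bps = 10e9      # 10 Gb/s
    restoration_s = 0.020     # 20 ms failover, as quoted above
    bits_lost = line_rate_bps * restoration_s
    print(bits_lost / 8e6)    # 2e8 bits, i.e. about 25 megabytes
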
> >
> > Thank you,
> > Roy Bynum
> > MCI WorldCom
> >
> > "Mick Sea,man" wrote:
> >
> > > What needs to be built in is the detection of failure. What we don't
> > > need to do is to build everything into the MAC. I suggest you look at
> > > the fault tolerant capabilities provided by P802.3ad and the work on
> > > Rapid Reconfiguration starting in 802.1.
> > >
> > > Both these (will) provide a degree of fault tolerance based on using
> > > protocols that are independent of MAC details to allow network nodes
> > > to precalculate their response to a low level indication of failure.
> > > There is really no need to build these protocols into the MAC.
> > >
> > > Mick
> > >
> > > -----Original Message-----
> > > From: owner-stds-802-3-hssg@xxxxxxxxxxxxxxxxxx
> > > [mailto:owner-stds-802-3-hssg@xxxxxxxxxxxxxxxxxx]On Behalf Of Joe Gwinn
> > > Sent: Friday, July 16, 1999 3:15 PM
> > > To: stds-802-3-hssg@xxxxxxxx
> > > Subject: Does Ten-Gigabit Ethernet need fault tolerance?
> > >
> > > The purpose of this note is to present a case for inclusion of fault
> > > tolerance in 10GbE, and to offer a suitable proven technology for
> > > consideration.  However, no salesman will call.
> > >
> > > Why Fault Tolerance?  Ten-Gigabit Ethernet is going to be a relatively
> > > expensive, high-performance technology intended for major backbones,
> > > perhaps even nibbling at the bottom end of the wide-area network (WAN)
> > > market.  In such applications, high availability is very much desired;
> > > loss of such a backbone or WAN is much too disruptive (and therefore
> > > expensive) to be much tolerated, and this kind of a market will gladly
> > > pay a reasonable premium to achieve the needed fault tolerance.
> > >
> > > Why add Fault Tolerance now?  Because it's easiest (and thus cheapest)
> > > if done from the start, and because having FT built in and therefore
> > > becoming ubiquitous will be a competitive discriminator, neutralizing
> > > one of the remaining claimed advantages of ATM.
> > >
> > > Isn't Fault Tolerance difficult?  In hub-and-spoke (logical star,
> > > physical loop) topologies such as GbE and 10GbE, it's not hard to
> > > achieve both fault tolerance (FT) and military-level damage tolerance
> > > (DT).  In networks of unrestricted topology, it's a lot harder.  The
> > > presence of bridges does not affect this conclusion.
> > >
> > > How do I know that FT is so easily achieved?  Because it's already
> > > been done, may be bought commercially, and is in use on one military
> > > system and proposed for others.  The FT/DT technology mentioned here
> > > was developed on a US Navy project, and is publicly available without
> > > intellectual property restrictions.  Why was the technology made
> > > public?  To encourage its adoption and use in COTS products, so that
> > > defense contractors can buy FT/DT LANs from catalogs, rather than
> > > having to develop them again and again, at great risk and expense.
> > >
> > > What is the difference between Fault Tolerance and Damage Tolerance?
> > > In fault tolerance, faults are rare and do not correlate in either
> > > time or place.  The classic example is the random failure of hardware
> > > components.  (Small acts of damage, such as somebody tripping over a
> > > wire or breaking a connector somewhere, are treated as faults here
> > > because they are also rare and uncorrelated.)  In damage tolerance,
> > > the individual faults are sharply correlated in time and place, and
> > > are often massive in number.  The classic military example is a weapon
> > > strike.  In the commercial world, a major power failure is a good
> > > example.  Damage tolerance is considered much harder to accomplish
> > > than fault tolerance.  If you have damage tolerance, you also have
> > > fault tolerance, but fault tolerance does not by itself confer damage
> > > tolerance.
> > >
> > > How is this Damage Tolerance achieved?  All changes in LAN segment
> > > topology (the loss or gain of nodes (NICs), hubs, or fibers) are
> > > detected in MAC hardware by the many link receivers, which report both
> > > loss and acquisition of modulated light.  This surveillance occurs all
> > > the time on all links, and is independent of data traffic.  Any change
> > > in topology provokes the hardware into "rostering mode", the automatic
> > > exploration of the segment using a flood of special "roster" packets
> > > to find the best path, where "best" is defined as that path which
> > > includes the maximum number of nodes (NICs).
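
A rough sketch of that "best path" criterion as described - illustrative
Python only, not the actual RTFC roster protocol:

    # Given the set of surviving fibers, a candidate path survives if
    # every hop is intact; "best" is the surviving path visiting the most
    # nodes.
    def best_path(paths, intact_fibers):
        """paths: candidate node sequences; intact_fibers: set of (a, b)."""
        surviving = [p for p in paths
                     if all((a, b) in intact_fibers
                            for a, b in zip(p, p[1:]))]
        return max(surviving, key=len, default=None)

    intact_fibers = {('n1', 'hubA'), ('hubA', 'n2'), ('n2', 'hubB')}
    paths = [('n1', 'hubA', 'n2', 'hubB'), ('n1', 'hubA', 'n2')]
    print(best_path(paths, intact_fibers))   # the 4-node path wins
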
> > >
> > > Just how fault tolerant and damage tolerant is this scheme?  A segment
> > > will work properly with any number of nodes and hubs, if sufficient
> > > fibers survive to connect them together, and will automatically
> > > configure itself into a working segment within a millisecond of the
> > > last fault.  If the number of broken fibers is less than the number of
> > > hubs, all surviving nodes will remain accessible, regardless of the
> > > fault pattern.  If the number of fiber breaks is equal to or greater
> > > than the number of hubs, there is a simple equation to predict the
> > > probability of loss of access to a typical node due to loss of hubs
> > > and/or fibers, given only the number of hubs and the probability of
> > > any fiber breaking: Pnd[p,r] = ((2p)(1-p))^r, where p is the
> > > probability of fiber breakage and r is the number of surviving hubs
> > > (which ranges from zero to four in a quad system).  This equation is
> > > exact (to within 1%) for fiber breakage probabilities of 33% or less,
> > > and applies for any number of hubs.
> > >
> > > The simplicity of this equation is a consequence of the simplicity of
> > > this protocol, which is currently implemented in standard-issue FPGAs
> > > (not ASICs), and works without software intervention.  It can also be
> > > implemented in firmware.
> > >
> > > To give a numerical example, in a 33-node 4-hub segment, loss of 42
> > > fibers (16% of the segment's 264 fibers) would lead to only 0.5% of
> > > the nodes becoming inaccessible, on average.  Said another way, after
> > > 42 fiber breaks, there are only five chances out of a thousand that a
> > > node will become inaccessible.  This is very heavy damage, with one
> > > fiber in six broken.  To take a more likely example, with three broken
> > > fibers, all nodes will be accessible, and with four broken fibers,
> > > there is less than one chance in a million that a node will become
> > > inaccessible.  Recovery takes two ring tour times plus settling time
> > > (electrical plus mechanical), typically less than one millisecond in
> > > ship-size networks, measured from the last fault.  Chattering and/or
> > > intermittent faults can be handled by a number of mechanisms,
> > > including delaying node entry by up to one second.  Few current LAN
> > > technologies approach this degree of resilience, or speed of recovery.
> > >
> > > In commercial systems and some military systems, a dual-ring solution
> > > is sufficient.  Up to quad-ring solutions are commercially available,
> > > as needed for some military systems.  However, the ability to support
> > > up to quad-redundant systems should be provided in 10GbE, for two
> > > reasons.  First, quad is needed for the military market, which may be
> > > economically significant in the early years of 10GbE.  Second, quad
> > > provides a clear growth path and a way to reassure non-military
> > > customers that their most stringent problems can be solved: one can
> > > ask them if their needs really exceed those of warships duelling with
> > > supersonic missiles.
> > >
> > > The basic technical document, the RTFC Principles of Operation, is on
> > > the GbE website as
> > > "http://grouper.ieee.org/groups/802/3/10G_study/public/email_attach/gwinn_1_0699.pdf"
> > > and
> > > "http://grouper.ieee.org/groups/802/3/10G_study/public/email_attach/gwinn_2_0699.pdf".
> > > I was a member of the team that developed the technology, and am the
> > > author of these documents.
> > >
> > > Although these documents assume RTFC, a form of distributed shared
> > > memory, the basic rostering technology can easily be adapted for
> > > Gigabit and Ten-Gigabit Ethernet as well.  For nontechnical reasons,
> > > RTFC originally favored smart nodes connected via dumb hubs.  However,
> > > the overall design can be somewhat simplified if one goes the other
> > > way, to dumb nodes and smart hubs.  This also allows the same dumb
> > > nodes to be used in both non-FT and FT networks, increasing node
> > > production volume, and does not force users to throw nodes away to
> > > upgrade to FT.
> > >
> > > I therefore would submit that 10GbE would greatly benefit from fault
> > > tolerance, and also that it's very easily achieved if included in the
> > > original design of 10GbE.
> > >
> > > Joe Gwinn