
Re: [EFM-OAM] Notes from yesterday's call




Brian Arnold wrote:
> 
> Dan, all,
> 
> I'm guessing Matt is talking about a counter that contains a sum of the 
> number of TLVs received of each type, not a sum of the actual counts 
> inside each TLV.   That right, Matt?
> 

That was my intent, yes: a count for each TLV type (how many times it's 
happened).  To Dan's question (why is it more complete), the way the 
events are currently defined, we include in the event:
  - the threshold (i.e. needed to get > 10 bad things to happen)
  - the window (needed them to happen in 1 second)
  - the value that caused the event (14 of them happened in this second)
If we just had a counter for the number of times the event was received, 
we would lose the detailed information and just know that the other guy 
generated the event for some reason (we wouldn't necessarily know his 
threshold or window sizes).  Everyone can make up their own mind on how 
important that is, but it would get lost if we kept less info.
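
To make the trade-off concrete, here is a rough Python sketch of the two 
kinds of state a receiver could keep: a bare count per event type, and the 
last event of each type with its threshold, window and value.  The class 
names, field names and the "errored_frame" label are made up for 
illustration and are not taken from the draft.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical decoded event TLV; field names are illustrative only."""
    event_type: str   # made-up label such as "errored_frame"
    threshold: int    # errors needed within the window to raise the event
    window_s: float   # window length in seconds
    value: int        # errors actually seen in that window


class EventLog:
    """Keeps a per-type receive count and the last event of each type."""

    def __init__(self):
        self.received = Counter()  # how many events of each type arrived
        self.last_seen = {}        # full detail of the most recent event

    def record(self, event):
        self.received[event.event_type] += 1
        self.last_seen[event.event_type] = event


log = EventLog()
log.record(ErrorEvent("errored_frame", threshold=10, window_s=1.0, value=14))
log.record(ErrorEvent("errored_frame", threshold=10, window_s=1.0, value=11))

# The count alone says two events arrived; only the stored event says why.
print(log.received["errored_frame"])
print(log.last_seen["errored_frame"])
```

With only `received`, the threshold and window the other end used are 
unknown; with only `last_seen`, the first event (value 14) is gone once 
the second arrives.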

> For what it's worth, I submitted a comment based on what Jonathan 
> Thatcher suggested back in February.  It adds a new field to the three 
> error_XXX TLVs (not the summary TLV) that is the running count of errors 
> of that type that have exceeded the threshold since OAM initialization.  
> They'd overflow and be non-resettable.  Basically it's a sum of the 
> values populated in each error_XXX TLV that's been generated and sent.  
> The hope was this could help address the concern from the last 
> conference call that information could be lost due to the objects only 
> containing the "last seen" PDU info.
> 

Which is another alternative, and can be considered as a separate 
suggestion from whether we keep the last event of each type available 
via C30, or whether we just keep a count of the number of times an event 
of each type was received.


> The idea was that if the information recipient received all the event 
> PDUs but didn't service the MIB objects for every one (or even lost some 
> PDUs along the way), the error count information sent in event PDUs 
> wouldn't be lost.  Since each error TLV has a timestamp (consensus of 
> 2nd conf. call), the running counter and timestamp provide a simple way 
> to keep track of the number of errors that have exceeded the threshold 
> over a known period of time, whether you check after every received 
> event PDU or every few seconds (assuming you can deal with the overflow).

True.  But if the complaint was that a management application could not 
keep up with frequent events, this doesn't solve that problem.  I guess 
you're saying that it's OK if it can't keep up with each individual 
event, because if it misses a few, you'll throw in extra data to help it 
know that "more stuff" happened.
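
As a rough sketch of how a receiver might use such a running total, the 
Python below takes two (timestamp, running total) samples and works out 
how many errors occurred in between, even if the counter wrapped once. 
The 32-bit width and the function names are assumptions for illustration; 
nothing in this thread fixes the counter size or the timestamp units.

```python
COUNTER_BITS = 32                # assumed width, not specified in the thread
COUNTER_MOD = 1 << COUNTER_BITS


def errors_between(earlier_total, later_total):
    """Errors accumulated between two samples, tolerating a single wrap."""
    return (later_total - earlier_total) % COUNTER_MOD


def error_rate(earlier, later):
    """Average errors per second between two (timestamp_s, total) samples."""
    (t0, c0), (t1, c1) = earlier, later
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in time order")
    return errors_between(c0, c1) / elapsed


# The receiver skipped every event between these two samples, yet the
# difference of the running totals still gives the errors in the interval.
print(errors_between(40, 90))
print(error_rate((100.0, 40), (105.0, 90)))

# A counter that wrapped once between samples still yields the right delta.
print(errors_between(COUNTER_MOD - 6, 8))
```

This only works if the counter wraps at most once between two samples, 
which is the "assuming you can deal with the overflow" caveat above.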

> 
> I thought Jonathan's idea was helpful and reasonable.  I didn't remember 
> seeing it in comments on drafts since then so I entered it.  Hope I 
> didn't distort it too badly, but does this help fix some of the concerns?
> 
> Brian
> 
> 
> At 12:27 PM 4/30/2003 +0300, Romascanu, Dan (Dan) wrote:
> 
>> Matt,
>>
>> As long as we define counters for each event type and assuming these 
>> counters are accurate, why would solution b) (an event table) be 'more 
>> reliable and complete' than a) (a MIB group of counters per event type)?
>>
>> Dan
>>
>>
>> > -----Original Message-----
>> > From: Matt Squire [mailto:mattsquire@acm.org]
>> > Sent: Wednesday, April 30, 2003 5:31 AM
>> > To: Jonathan Thatcher
>> > Cc: Brian Arnold; stds-802-3-efm-oam@ieee.org
>> > Subject: Re: [EFM-OAM] Notes from yesterday's call
>> >
>> >
>> >
>> >
>> > That's a great point.  As a group, we have to give proper
>> > consideration to the options before us on the subject of event
>> > notification.  We could
>> >   a) Just have counters so that we know how many notifications we've
>> >      received for every event type
>> >   b) Just have the event tables (as in D1.414), which may result in
>> >      missed information, but the information there will be reliable
>> >      and more complete than a simple counter
>> >   c) Provide both - neither is perfect and the combination is better
>> >      than the parts
>> >   d) Find a different way (don't know what this is).
>> >
>> > We touched on a couple of these options in the past, but I don't know
>> > that there's been clear consensus behind any path.
>> >
>> > Thoughts?
>> >
>> > - Matt
>> >
>> > Jonathan Thatcher wrote:
>> > > It will be hard to convince network managers to use something
>> > > that they know to be unreliable.
>> > >
>> > > But there are two kinds of unreliability: unreliable
>> > > communication (acceptable if rare) and unreliable information
>> > > (absolutely unacceptable). Or, I think that we can sell the
>> > > fact that some information might not "make it through." But
>> > > that which does get through must be known good. A related
>> > > point is that with redundant communication (to increase the
>> > > probability that information does "make it through"), there
>> > > can be no case where the redundancy confuses the information
>> > > recipient.
>> > >
>> > > jonathan