
Fw: min / max and empty intervals - Entire and Missing Data




From: Ralph Baker Kearfott <rbk@xxxxxxxxxxxx>
To: Ian McIntosh/Toronto/IBM@IBMCA
Date: 06/09/2011 02:11 AM
Subject: Re: min / max and empty intervals



On 6/9/2011, Ralph Baker wrote:
>On 6/7/2011 7:07 AM, Arnold Neumaier wrote:
>> Dan Zuras Intervals wrote:
>>>> Date: Tue, 07 Jun 2011 08:43:03 +0200
>>>> From: Arnold Neumaier <Arnold.Neumaier@xxxxxxxxxxxx>
>>>> To: Dan Zuras Intervals <intervals08@xxxxxxxxxxxxxx>
>>>> CC: John Pryce <j.d.pryce@xxxxxxxxxxxx>, stds-1788@xxxxxxxxxxxxxxxxx
>>>> Subject: Re: min / max and empty intervals
>>>>
>>>> Dan Zuras Intervals wrote:
>
>>> Actually, John touched on the reasonable application for
>>> NaNs in a matrix. That of as yet unknown data.
>>
>> Unknown values represented in interval analysis by Entire, not by Empty!
>
>I personally agree with Arnold on this point.  It is tied to the basic
>philosophy underlying interval analysis.
>
>Baker

I understand and agree with the reasons that unknown should normally be Entire, not Empty (and BTW, one of the multiple meanings of NaN is equivalent to Entire).

At the same time, a standard is better if it can apply to diverse situations, not just the usage that led to it, so we should think about other potential applications and their implications.

Suppose I had a large set of data and wanted to do some analysis on it, but many values were unknown. If unknown is represented as Entire and I use the obvious approach, then most of my answers will be Entire and I will know nothing except that I know nothing. If I skip unknown values then I can produce answers tighter than Entire, and I will know something, with the caveat that the answers are not guaranteed to be correct. I may consider that useful.

Let's take a concrete example. For the set of all adults in the USA on July 1st 2011, measure their height. Since there's some measurement error and height can vary throughout the day, use intervals with reasonable bounds.

Now find out the minimum and maximum heights. No problem.

But what if you only have data for 1% of the people? If you treat the unknowns as Entire, the minimum height is -oo and the maximum +oo. Ruling out negatives still doesn't give a useful answer. If you treat the unknowns as "Ignore this unknown value", then you get minimum and maximum heights for the people you have data for. You can't claim that the answers are exactly what you were asked for, but you can say they are correct for the 1% subset of the cases you have data for, and if you know statistics you may say that the true range should not be much larger than the range for this subset.
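
The two treatments can be compared in a small sketch. This is only an illustration under stated assumptions: intervals are plain (lo, hi) tuples, a missing measurement is stored as None, the helper names interval_min and interval_max are made up for this example, and the heights are invented sample values, not real data.

```python
import math

# Assumption: an interval is a (lo, hi) tuple; Entire is [-oo, +oo].
ENTIRE = (-math.inf, math.inf)

def interval_min(xs):
    """Interval extension of min: (min of lower bounds, min of upper bounds)."""
    return (min(lo for lo, _ in xs), min(hi for _, hi in xs))

def interval_max(xs):
    """Interval extension of max: (max of lower bounds, max of upper bounds)."""
    return (max(lo for lo, _ in xs), max(hi for _, hi in xs))

# Invented sample heights in metres; None marks a person with no measurement.
heights = [(1.62, 1.64), None, (1.80, 1.83), None, (1.55, 1.57)]

# Treatment 1: every unknown becomes Entire.
as_entire = [ENTIRE if h is None else h for h in heights]
all_min = interval_min(as_entire)   # lower bound is -oo
all_max = interval_max(as_entire)   # upper bound is +oo

# Treatment 2: unknowns are skipped; the result holds only for the known subset.
known = [h for h in heights if h is not None]
known_min = interval_min(known)
known_max = interval_max(known)

print(all_min, all_max)
print(known_min, known_max)
```

With Entire, the minimum loses its lower bound and the maximum loses its upper bound, so the overall range of heights is unbounded. Skipping the unknowns gives finite bounds, but those bounds are guaranteed only for the people actually measured.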

There are many real applications where intervals could be useful if missing data can be ignored and the limitations are understood as part of the results.

So here are my questions: Can we define a decoration for "missing data" or "unknown", and decoration operations which, on encountering it, produce "some data is missing" or "some data is unknown"? Is it better to define specific operations to be used in such cases (e.g., max_known_value)? Can either or both of those be done in a consistent way? Can they be done without damaging other things? Would that increase (or decrease?) the usefulness of the standard?
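
To make the second question concrete, here is one possible shape for such an operation. It is a hypothetical sketch, not anything taken from a draft of the standard: the name max_known_value comes from the question above, while the DecoratedResult type, its some_missing flag, and the use of None for a missing value are assumptions made for this example.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Interval = Tuple[float, float]  # assumption: (lo, hi) with lo <= hi

class DecoratedResult(NamedTuple):
    """Hypothetical result type: an interval plus a 'some data is missing' flag."""
    value: Optional[Interval]   # None when there was no known data at all
    some_missing: bool

def max_known_value(xs: Sequence[Optional[Interval]]) -> DecoratedResult:
    """Max over the known intervals only; None entries are skipped but recorded."""
    known = [x for x in xs if x is not None]
    some_missing = len(known) < len(xs)
    if not known:
        return DecoratedResult(None, some_missing)
    value = (max(lo for lo, _ in known), max(hi for _, hi in known))
    return DecoratedResult(value, some_missing)

result = max_known_value([(1.0, 2.0), None, (1.5, 1.8)])
print(result)
```

The flag travels with the answer, so a caller cannot mistake a result computed from partial data for a guaranteed enclosure. Whether this belongs in a decoration or in a separate family of operations is exactly the open question.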

- Ian McIntosh, IBM Canada Lab, Compiler Back End Support and Development