From: Ralph Baker Kearfott <rbk@xxxxxxxxxxxx>
To: Ian McIntosh/Toronto/IBM@IBMCA
Date: 06/09/2011 02:11 AM
Subject: Re: min / max and empty intervals
On 6/9/2011, Ralph Baker wrote:
>On 6/7/2011 7:07 AM, Arnold Neumaier wrote:
>> Dan Zuras Intervals wrote:
>>>> Date: Tue, 07 Jun 2011 08:43:03 +0200
>>>> From: Arnold Neumaier <Arnold.Neumaier@xxxxxxxxxxxx>
>>>> To: Dan Zuras Intervals <intervals08@xxxxxxxxxxxxxx>
>>>> CC: John Pryce <j.d.pryce@xxxxxxxxxxxx>, stds-1788@xxxxxxxxxxxxxxxxx
>>>> Subject: Re: min / max and empty intervals
>>>>
>>>> Dan Zuras Intervals wrote:
>
>>> Actually, John touched on the reasonable application for
>>> NaNs in a matrix. That of as yet unknown data.
>>
>> Unknown values represented in interval analysis by Entire, not by
>> Empty!
>
>I personally agree with Arnold on this point. It is tied to the basic
>philosophy underlying interval analysis.
>
>Baker
I understand and agree with the reasons that unknown should normally
be Entire, not Empty (and BTW, one of the multiple meanings of NaN is
equivalent to Entire).
At the same time, a standard is better if it can apply to diverse
situations, not just the usage that led to it, so we should think
about other potential applications and their implications.
Suppose I had a large set of data and wanted to do some analysis on
it, but many values were unknown. If unknown is represented as Entire
and I use the obvious approach, then most of my answers will be Entire
and I will know nothing except that I know nothing. If I skip unknown
values then I can produce answers tighter than Entire, and I will know
something, with the caveat that the answers are not guaranteed to be
correct. I may consider that useful.
Let's take a concrete example. For the set of all adults in the USA on
July 1st 2011, measure their height. Since there's some measurement
error and height can vary throughout the day, use intervals with
reasonable bounds.
Now find out the minimum and maximum heights. No problem.
But what if you only have data for 1% of the people? If you treat the
unknowns as Entire, the minimum height can be as low as -oo and the
maximum as high as +oo. Ruling out negatives still doesn't give a
useful answer. If you treat the unknowns as "Ignore this unknown
value", then you get minimum and maximum heights for the people you
have data for. You can't claim that the answers are exactly what you
were asked for, but you can say they are correct for the 1% subset of
cases you have data for, and if you know statistics you may say that
the true answers are unlikely to span a much larger range than the
answers for this subset.
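To make the difference concrete, here is a minimal sketch in Python. It
assumes the usual endpoint-wise extension of min and max to intervals,
represents an interval as a (lo, hi) pair, and uses None for a missing
measurement. The function names and the sample heights are mine, made
up for illustration, and are not taken from any draft of the standard.

```python
import math
from functools import reduce
from typing import List, Optional, Tuple

Interval = Tuple[float, float]            # (lower bound, upper bound)
ENTIRE: Interval = (-math.inf, math.inf)  # "could be anything"


def imin(x: Interval, y: Interval) -> Interval:
    """Endpoint-wise interval extension of min."""
    return (min(x[0], y[0]), min(x[1], y[1]))


def imax(x: Interval, y: Interval) -> Interval:
    """Endpoint-wise interval extension of max."""
    return (max(x[0], y[0]), max(x[1], y[1]))


def min_max_entire(data: List[Optional[Interval]]):
    """Treat every missing value as Entire: rigorous but uninformative."""
    filled = [ENTIRE if d is None else d for d in data]
    return reduce(imin, filled), reduce(imax, filled)


def min_max_known(data: List[Optional[Interval]]):
    """Skip missing values and report how many were skipped.

    The result is only guaranteed for the values actually present.
    """
    known = [d for d in data if d is not None]
    skipped = len(data) - len(known)
    if not known:
        return None, None, skipped
    return reduce(imin, known), reduce(imax, known), skipped


# Hypothetical heights in metres; two of the five are missing.
heights = [(1.70, 1.72), None, (1.55, 1.57), None, (1.90, 1.93)]
```

With this data, min_max_entire(heights) gives a minimum of
(-inf, 1.57) and a maximum of (1.90, inf), while
min_max_known(heights) gives (1.55, 1.57) and (1.90, 1.93) along with
a count of 2 skipped values. The count is the caveat that has to
travel with the answer.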
There are many real applications where intervals could be useful if
missing data can be ignored and the limitations are understood as part
of the results.
So here are my questions: Can we define a decoration for "missing
data" or "unknown", and decoration operations that, on encountering
it, produce "some data is missing" or "some data is unknown"? Is it
better to define specific operations to be used in such cases (e.g.,
max_known_value)? Can either or both of those be done in a consistent
way? Can they be done without damaging other things? Would that
increase (or decrease?) the usefulness of the standard?
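As a rough illustration of what I have in mind, here is a sketch in
Python that combines both ideas. Everything in it is hypothetical: the
some_missing flag stands in for a possible decoration, dmax for a
decorated max that propagates it, and max_known_value for a dedicated
operation that skips missing inputs (None) and sets the flag. None of
this is proposed wording, just one way the behaviour could look.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Decorated:
    """An interval [lo, hi] with a hypothetical 'some data is missing' flag."""
    lo: float
    hi: float
    some_missing: bool = False


def dmax(a: Decorated, b: Decorated) -> Decorated:
    """Endpoint-wise max; the flag is sticky (set if either input has it)."""
    return Decorated(max(a.lo, b.lo), max(a.hi, b.hi),
                     a.some_missing or b.some_missing)


def max_known_value(values: Iterable[Optional[Decorated]]) -> Optional[Decorated]:
    """Max over the values that are present, flagging any that were skipped.

    Returns None when no value is present at all.
    """
    result: Optional[Decorated] = None
    skipped = False
    for v in values:
        if v is None:
            skipped = True          # missing input: ignore it, remember it
        elif result is None:
            result = v
        else:
            result = dmax(result, v)
    if result is None:
        return None
    return Decorated(result.lo, result.hi, result.some_missing or skipped)
```

For example, max_known_value([Decorated(1.70, 1.72), None,
Decorated(1.90, 1.93)]) yields the interval [1.90, 1.93] with
some_missing set, so later operations can see that the answer rests on
incomplete data.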
- Ian McIntosh
  IBM Canada Lab
  Compiler Back End Support and Development