This month's agenda & my rant on reductions...
The IEEE-754 meetings this month will be at Sun in Menlo Park from
1:00 to 5:00. Please come to the lobby of building 14 on that campus. Dave
has already given directions & phone numbers.
The Tuesday 8/16 subcommittee meeting will be held in the San
Gregorio Room in building 16. We will be discussing the pre-ballot study
drafts as well as issues arising from the failure of many proposals to pass
last month. (More on that below.)
The draft review will be held Wednesday 8/17 in the Pulgas Water
Temple Room in building 16. The topic will be the incorporation of the
text of those issues that passed last month.
The general meeting will be held Thursday 8/18 in the Sequoia Room in
building 14. We will spend most of the meeting on active ballot issues,
with an hour set aside for John Crawford to present some decimal
performance data for discussion.
Now, I would like to mention the difficulty that arises from the
failure of many of last month's expression proposals to pass.
These are not the whims of the subcommittee or a laundry list of
things we think are cool. These are part of a larger whole that is our
attempt to deal with many of the more difficult problems we have been unable to
resolve in the last 5 years. There is a valid criticism that many are vague or
underspecified, but that is intentional. We are attempting to leave room for
implementers to innovate these functions for performance & accuracy in
ways we have not yet conceived & cannot legislate.
Let me illustrate with a single example that I ran across this week.
I was involved with the BioInformatics conference held at
Stanford this week. It turns out that BioInformaticians are either
biologists that use computers in their research or computer scientists
that are interested in developing software for those biologists to use.
Anyway, in the 3 days I was at the conference, just by chance, I
talked to a graduate student at USF & a new professor at Saint Andrews
in Scotland, both of whom had similar problems. They were modelling the
expression of certain classes of genes using a variant of a Markov model to
create a search model that could search large databases & recognise similar
genes in new contexts. A cursory discussion suggested that they were
having problems with overflow & underflow generating spurious infinities &
NaNs.
The graduate student was kind enough to sit down with me & we
debugged her problem over the course of a day & a half or so. While it turned
out to be a little more complex than I initially suspected, limited
dynamic range was the problem.
You see, in Markov modelling, one creates a finite state machine
that operates on a graph that is generally a DAG. (There are some slight
provisions for looping but only from special nodes to themselves.) The arcs
of the graph are labeled with the probability of taking that path & the
nodes are labeled with the probability of arriving there. There are
also output characters (in this case, amino acids) that are annotated
with the probability of using them. The figure of merit one tries to
optimize is the probability of strings you would like to accept (in this case,
from a training set of proteins that are all similar in some meaningful
way) over the probability of those same strings arising just by chance (the
null hypothesis). The algorithm is to run what amounts to a dynamic
programming matrix over the graph from beginning to end & then from end
to beginning, adjusting the probabilities until some convergence criterion is
met on the training set of proteins. Then you run it on some huge database
expecting that similar genes will be labeled with high probabilities.
(For reference, there are 20 amino acids. In the absence of any
knowledge of its use, the probability of any given chain of 250 of them in a
protein must be assumed to be (1/20)^250 or about 10^-325. Ratios of such
probabilities, and longer ones that arise, get into the 10^(+/-20000) range.)
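To make that arithmetic concrete, here is a minimal Python sketch. The names N_AMINO, LENGTH, log10_p and p are made up for illustration. It computes the exponent in logs and then tries to form the same probability as a plain running product in double precision.

```python
import math
import sys

N_AMINO = 20    # amino acids
LENGTH = 250    # chain length from the example above

# Exponent of (1/20)^250, computed in logs so nothing underflows.
log10_p = -LENGTH * math.log10(N_AMINO)

# The same probability formed as a plain running product in doubles.
p = 1.0
for _ in range(LENGTH):
    p *= 1.0 / N_AMINO

print(f"log10 of the probability:  {log10_p:.2f}")
print(f"running product in double: {p}")
print(f"smallest normal double:    {sys.float_info.min}")
```

The exponent comes out a little below -325, while the smallest normal double is only about 2.2e-308, so the running product underflows all the way to zero long before the ratios in the 10^(+/-20000) range are even attempted.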
In the course of this training, the calculation that wants to be
done is ratios of sums of products of probabilities. (Sound familiar to
anyone?) The added complication is that the biologists recognise that
the dynamic range of the ratios of probabilities is so huge as to blow out
the range of double precision numbers. Their solution to this problem, to
store some of the probabilities as logarithms, only partially attenuates the
problem. They are forced to convert back to probability ratios in the
course of updating their figure of merit & that's when over/underflows bite
them in the ass with the delivery of NaNs & infinities. I was able to
suggest a slight modification of the calculation that stays in logs of
probabilities (what they call log-odds) but that too is only a partial
solution. They are still left with the problem that computing with the logs
in this way is inherently much slower & more inaccurate than just multiplying
the probabilities would be.
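For readers who have not seen the log-space approach, the following is a rough Python sketch of one common way to evaluate a ratio of sums of products without ever converting back to raw probabilities. It is not necessarily the modification described above. The function names and the "list of paths, each a list of probabilities" layout are assumptions for illustration, and all probabilities are assumed to be strictly positive.

```python
import math

def log_sum_exp(log_terms):
    """Return log(sum(exp(t))) for the given logs, staying in log space."""
    m = max(log_terms)
    if m == -math.inf:          # every term is log(0)
        return -math.inf
    # Each exp(t - m) is at most 1, so it cannot overflow; terms far
    # below the largest one simply underflow to 0.0 and are dropped.
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def log_product(probs):
    """Log of a product of positive probabilities: a sum of logs."""
    return sum(math.log(p) for p in probs)

def log_ratio_of_sums_of_products(numer_paths, denom_paths):
    """Log of (sum of path products) / (sum of path products)."""
    log_num = log_sum_exp([log_product(path) for path in numer_paths])
    log_den = log_sum_exp([log_product(path) for path in denom_paths])
    return log_num - log_den    # a ratio becomes a difference of logs

# Two identical 250-factor paths over one such path: the ratio is 2,
# even though each individual product is around 10^-325.
path = [1.0 / 20] * 250
print(log_ratio_of_sums_of_products([path, path], [path]))
```

Note the cost: every factor needs a call to log, and every sum needs an exp and a log, each with its own rounding error. That is the slowness and inaccuracy complained about above.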
The REAL solution is to provide them with the ability to take long
products & sums & not worry about overflow or underflow.
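One way to picture such a facility is a product that carries its exponent separately from its significand. The Python sketch below is only an illustration of that idea, not the text of any proposal. The name scaled_product and the (mantissa, exponent) return convention are assumptions made here.

```python
import math

def scaled_product(factors):
    """Return (m, e) such that the product of factors equals m * 2**e.

    m is kept in [0.5, 1) in magnitude (or is 0.0), and e is an
    unbounded Python integer, so the running product cannot overflow
    or underflow no matter how many factors there are.
    """
    mantissa, exponent = 1.0, 0
    for x in factors:
        m, e = math.frexp(x)                    # x == m * 2**e
        # mantissa * m has magnitude in [0.25, 1), so it is safe to form.
        mantissa, shift = math.frexp(mantissa * m)
        exponent += e + shift
    return mantissa, exponent

# The 250-residue example: far below the double range, but representable
# here as a significand plus an integer exponent.
m, e = scaled_product([1.0 / 20] * 250)
print(m, e)
print("log10 of product:", math.log10(m) + e * math.log10(2))
```

A ratio of two such products is then the quotient of the two significands together with the difference of the two exponents, with one multiply per factor and no logarithms in the inner loop.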
And THIS is exactly what was shot down last month in the form of the
reductions proposal as well as others.
For us, part of the justification for including these functions was
that, by including them, we could eliminate the need for counting mode.
This is the only known application for it & we felt it was much better to
provide the functions themselves than to mandate a mode that is only a
hack when it comes to solving the problem. Without the reductions we are
forced to consider counting mode again.
The application that Prof Kahan mentioned, namely Clebsch-Gordan or
Wigner Coefficients, did not strike many members of the committee as
important enough to justify such reductions.
Perhaps that is so. But I have now discovered that there are people
out there who need these functions in their 'daily lives'. These people
are NOT numerical analysts & I think we do them a disservice if we force them
to learn our profession. Beware: Soon the biologists will have the power to
take their revenge on us in nasty & disgusting ways. :-)
Well, I guess it's time to get off my soapbox & let someone point
out that our laundry list is arbitrary or vague. Well, I urge you to attend
the subcommittee meeting Tuesday where I would like to amend the reductions
proposal to meet your objections & propose a modified version again next
month.
Because I'm certain you can't make this problem go away by
ignoring it.
Dying might help. But ignoring won't work.
Dan