
This month's agenda & my rant on reductions...



        The IEEE-754 meetings this month will be at Sun in Menlo Park from
1:00 to 5:00.  Please come to the lobby of building 14 on that campus.  Dave
has already given directions & phone numbers.
        The Tuesday 8/16 subcommittee meeting will be held in the San
Gregorio Room in building 16.  We will be discussing the pre-ballot study
drafts as well as issues arising from the failure of many proposals to pass
last month.  (More on that below.)
        The draft review will be held Wednesday 8/17 in the Pulgas Water
Temple Room in building 16.  The topic will be the incorporation of the
text of those issues that passed last month.
        The general meeting will be held Thursday 8/18 in the Sequoia Room
in building 14.  We will spend most of the meeting on active ballot issues,
with an hour set aside for John Crawford to present some decimal performance
data for discussion.
        Now, I would like to mention the difficulty that arises from the
failure of many of last month's expression proposals to pass.
        These are not the whims of the subcommittee or a laundry list of
things we think are cool.  These are part of a larger whole that is our
attempt to deal with many of the more difficult problems we have been unable
to resolve in the last 5 years.  There is a valid criticism that many are
vague or underspecified, but that is intentional.  We are attempting to
leave room for implementers to innovate on these functions for performance &
accuracy in ways we have not yet conceived & cannot legislate.
        Let me illustrate with a single example that I ran across this week.
        I was involved with the BioInformatics conference held at Stanford
this week.  It turns out that BioInformaticians are either biologists who
use computers in their research or computer scientists who are interested
in developing software for those biologists to use.
        Anyway, in the 3 days I was at the conference, just by chance, I
talked to a graduate student at USF & a new professor at Saint Andrews in
Scotland, both of whom had similar problems.  They were modelling the
expression of certain classes of genes, using a variant of a Markov model
to create a search model that could search large databases & recognise
similar genes in new contexts.  A cursory discussion suggested that they
were having problems with overflow & underflow generating spurious
infinities & NaNs.
        The graduate student was kind enough to sit down with me & we
debugged her problem over the course of a day & a half or so.  While it
turned out to be a little more complex than I initially suspected, limited
dynamic range was the problem.
        You see, in Markov modelling, one creates a finite state machine
that operates on a graph that is generally a DAG.  (There are some slight
provisions for looping, but only from special nodes to themselves.)  The
arcs of the graph are labeled with the probability of taking that path &
the nodes are labeled with the probability of arriving there.  There are
also output characters (in this case, amino acids) that are annotated with
the probability of using them.  The figure of merit one tries to optimize
is the probability of strings you would like to accept (in this case, from
a training set of proteins that are all similar in some meaningful way)
over the probability of those same strings arising just by chance (the
null hypothesis).  The algorithm is to run what amounts to a dynamic
programming matrix over the graph from beginning to end & then from end to
beginning, adjusting the probabilities until some convergence criterion is
met on the training set of proteins.  Then you run it on some huge database
expecting that similar genes will be labeled with high probabilities.
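        To make that concrete, here is a tiny sketch of the forward pass of
such a dynamic programming matrix (Python, with made-up transition &
emission numbers, nothing to do with the students' real model).  Run naively
in double precision, the entries are sums of products of probabilities &
they shrink toward underflow as the sequence grows:

    # Forward pass for a toy 2-state Markov model over 2 of the 20
    # amino acids.  alpha[j] holds the probability of the sequence so
    # far ending in state j -- a sum of products of probabilities.
    trans = [[0.9, 0.1],          # hypothetical transition probabilities
             [0.2, 0.8]]
    emit  = [[0.04, 0.06],        # hypothetical emission probabilities
             [0.07, 0.03]]
    start = [0.5, 0.5]

    def forward(observations):
        alpha = [start[j] * emit[j][observations[0]] for j in range(2)]
        for o in observations[1:]:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(2)) * emit[j][o]
                     for j in range(2)]
        return sum(alpha)

    print(forward([0, 1] * 10))    # 20 residues: a small but normal double
    print(forward([0, 1] * 500))   # 1000 residues: underflows to 0.0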
        (For reference, there are 20 amino acids.  In the absence of any
knowledge of its use, the probability of any given chain of 250 of them in
a protein must be assumed to be (1/20)^250, or about 10^-325.  Ratios of
such probabilities, and of the longer ones that arise, get into the
10^(+/-20000) range.)
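        That arithmetic is easy to check for yourself.  A few lines of
Python (illustrative only) show the naive running product already gone at
250 residues, while the magnitude itself is perfectly describable in log
space:

    import math

    p = 1.0
    for _ in range(250):
        p *= 1.0 / 20.0                # naive running product of (1/20)
    print(p)                           # underflows to 0.0 in double

    print(-250 * math.log10(20.0))     # true magnitude: about -325.3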
        In the course of this training, the calculation that wants to be
done is ratios of sums of products of probabilities.  (Sound familiar to
anyone?)  The added complication is that the biologists recognise that the
dynamic range of the ratios of probabilities is so huge as to blow out the
range of double precision numbers.  Their solution to this problem, to
store some of the probabilities as logarithms, only partially attenuates
the problem.  They are forced to convert back to probability ratios in the
course of updating their figure of merit & that's when over/underflows bite
them in the ass with the delivery of NaNs & infinities.  I was able to
suggest a slight modification of the calculation that stays in logs of
probabilities (what they call log-odds), but that too is only a partial
solution.  They are still left with the problem that computing with the
logs in this way is inherently much slower & more inaccurate than just
multiplying the probabilities would be.
        The REAL solution is to provide them with the ability to take long
products & sums & not worry about overflow or underflow.
        And THIS is exactly what was shot down last month in the form of
the reductions proposal as well as others.
        For us, part of the justification for including these functions was
that, by including them, we could eliminate the need for counting mode.
This is the only known application for it & we felt it was much better to
provide the functions themselves than to mandate a mode that is only a hack
when it comes to solving the problem.  Without the reductions we are forced
to consider counting mode again.
        The application that Prof Kahan mentioned, namely Clebsch-Gordan or
Wigner coefficients, did not strike many members of the committee as
important enough to justify such reductions.
        Perhaps that is so.  But I have now discovered that there are
people out there who need these functions in their 'daily lives'.  These
people are NOT numerical analysts & I think we do them a disservice if we
force them to learn our profession.  Beware: soon the biologists will have
the power to take their revenge on us in nasty & disgusting ways. :-)
        Well, I guess it's time to get off my soapbox & let someone point
out that our laundry list is arbitrary or vague.  Well, I urge you to
attend the subcommittee meeting Tuesday, where I would like to amend the
reductions proposal to meet your objections & propose a modified version
again next month.
        Because I'm certain you can't make this problem go away by
ignoring it.
        Dying might help.  But ignoring won't work.


                                Dan
