
Re: This month's agenda & my rant on reductions...



From your description of the bio problem, it sounds like nearly all the primitive operations would trap nearly all the time. Is that so? If so, I don't see why the reductions proposals would improve matters. Our Mill hardware supports reductions directly, but that wouldn't help on this one. We too would promptly get a NaN/inf and that's all she wrote, despite the special performance hardware.

Of course, if you are presuming that a reduction delivers its result with only one rounding for the whole thing then the bio problem is solved. But the rejected proposal did not call for that, and you would be burnt at the stake if you proposed it :-)

Ivan

r754@xxxxxxxxxxxxxx wrote:
	The IEEE-754 meetings this month will be at Sun in Menlo Park
	from 1:00 to 5:00.  Please come to the lobby of building 14 on
	that campus.  Dave has already given directions & phone numbers.

	The Tuesday 8/16 subcommittee meeting will be held in the San
	Gregorio Room in building 16.  We will be discussing the pre-
	ballot study drafts as well as issues arising from the failure
	of many proposals to pass last month.  (More on that below.)

	The draft review will be held Wednesday 8/17 in the Pulgas Water
	Temple Room in building 16.  The topic will be the incorporation
	of the text of those issues that passed last month.

	The general meeting will be held Thursday 8/18 in the Sequoia
	Room in building 14.  We will spend most of the meeting on
	active ballot issues, with an hour set aside for John Crawford
	to present some decimal performance data for discussion.
 
	Now, I would like to mention the difficulty that arises from the
	failure of many of last month's expression proposals to pass.

	These are not the whims of the subcommittee or a laundry list of
	things we think are cool.  These are part of a larger whole that
	is our attempt to deal with many of the more difficult problems
	we have been unable to resolve in the last 5 years.  There is a
	valid criticism that many are vague or underspecified, but that
	is intentional.  We are attempting to leave room for implementers
	to innovate on these functions for performance & accuracy in ways
	we have not yet conceived & cannot legislate.

	Let me illustrate with a single example that I ran across this
	week.

	I was involved with the BioInformatics conference held at
	Stanford this week.  It turns out that BioInformaticians are
	either biologists that use computers in their research or
	computer scientists that are interested in developing software
	for those biologists to use.

	Anyway, in the 3 days I was at the conference, just by chance,
	I talked to a graduate student at USF & a new professor at Saint
	Andrews in Scotland, both of whom had similar problems.  They
	were modelling the expression of certain classes of genes using
	a variant of a Markov model to create a search model that could
	search large databases & recognise similar genes in new contexts.
	A cursory discussion suggested that they were having problems
	with overflow & underflow generating spurious infinities & NaNs.

	The graduate student was kind enough to sit down with me & we
	debugged her problem over the course of a day & a half or so.
	While it turned out to be a little more complex than I initially
	suspected, limited dynamic range was the problem.

	You see, in Markov modelling, one creates a finite state machine
	that operates on a graph that is generally a DAG.  (There are
	some slight provisions for looping but only from special nodes
	to themselves.)  The arcs of the graph are labeled with the
	probability of taking that path & the nodes are labeled with the
	probability of arriving there.  There are also output characters
	(in this case, amino acids) that are annotated with the
	probability of using them.  The figure of merit one tries to
	optimize is the probability of strings you would like to accept
	(in this case, from a training set of proteins that are all
	similar in some meaningful way) over the probability of those
	same strings arising just by chance (the null hypothesis).  The
	algorithm is to run what amounts to a dynamic programming matrix
	over the graph from beginning to end & then from end to
	beginning, adjusting the probabilities until some convergence
	criterion is met on the training set of proteins.  Then you run
	it on some huge database expecting that similar genes will be
	labeled with high probabilities.
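
	To make the arithmetic concrete, here is a rough sketch (in C;
	the names & sizes are mine, not the student's code) of the
	forward half of that dynamic program, written naively in plain
	probability space.  Each entry is built from products with
	roughly as many factors as characters consumed so far, which is
	where the range trouble starts:

	    /* Naive forward pass over a simple left-to-right model.
	     * The names (N_STATES, T_MAX, trans, emit, start) are
	     * illustrative.  emit[j][t] is the probability that state j
	     * emits the character observed at position t.  alpha[t][j]
	     * is built from products of order-2t probability factors,
	     * so for chains of a few hundred characters it underflows
	     * double precision. */
	    #define N_STATES 4
	    #define T_MAX    250

	    double forward(const double trans[N_STATES][N_STATES],
	                   const double emit[N_STATES][T_MAX],
	                   const double start[N_STATES], int T)
	    {
	        double alpha[T_MAX][N_STATES];

	        for (int j = 0; j < N_STATES; j++)
	            alpha[0][j] = start[j] * emit[j][0];

	        for (int t = 1; t < T; t++)
	            for (int j = 0; j < N_STATES; j++) {
	                double sum = 0.0;           /* sum of products  */
	                for (int i = 0; i < N_STATES; i++)
	                    sum += alpha[t-1][i] * trans[i][j];
	                alpha[t][j] = sum * emit[j][t];
	            }

	        double p = 0.0;                     /* P(string | model) */
	        for (int j = 0; j < N_STATES; j++)
	            p += alpha[T-1][j];
	        return p;
	    }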

	(For reference, there are 20 amino acids.  In the absence of any
	knowledge of its use, the probability of any given chain of 250
	of them in a protein must be assumed to be (1/20)^250 or about
	10^-325.  Ratios of such probabilities, and longer ones that
	arise, get into the 10^(+/-20000) range.)
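
	(A quick check of that claim, my own illustration rather than
	anything from the conference: the smallest subnormal double is
	about 4.9e-324, so 10^-325 is already unrepresentable & a naive
	running product simply flushes to zero.)

	    /* Illustration only: the naive product of 250 factors of
	     * 1/20 drops below the smallest subnormal double and ends
	     * up as exactly 0.0; its reciprocal is then infinity. */
	    #include <stdio.h>
	    #include <math.h>

	    int main(void)
	    {
	        double p = 1.0;
	        for (int i = 0; i < 250; i++)
	            p *= 1.0 / 20.0;               /* underflows to 0 */

	        printf("naive product = %g\n", p);            /* 0    */
	        printf("1/product     = %g\n", 1.0 / p);      /* inf  */
	        printf("true value    = 10^%.1f\n",
	               250.0 * log10(1.0 / 20.0));            /* -325.3 */
	        return 0;
	    }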

	In the course of this training, the calculation that wants to be
	done is ratios of sums of products of probabilities.  (Sound
	familiar to anyone?)  The added complication is that the
	biologists recognise that the dynamic range of the ratios of
	probabilities is so huge as to blow out the range of double
	precision numbers.  Their solution to this problem, to store
	some of the probabilities as logarithms, only partially
	attenuates the problem.  They are forced to convert back to
	probability ratios in the course of updating their figure of
	merit & that's when over/underflows bite them in the ass with
	the delivery of NaNs & infinities.  I was able to suggest a
	slight modification of the calculation that stays in logs of
	probabilities (what they call log-odds) but that too is only a
	partial solution.  They are still left with the problem that
	computing with the logs in this way is inherently much slower &
	more inaccurate than just multiplying the probabilities would
	be.
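
	Something along the lines of the standard log-sum-exp device
	illustrates the kind of change (the sketch below is mine, not
	the exact suggestion): products become additions of logs, & each
	addition becomes a guarded exp/log1p pair so the exponentials
	themselves cannot overflow.  Every add now costs a log & an exp,
	which is exactly the speed & accuracy penalty mentioned above.

	    #include <math.h>

	    /* log(exp(la) + exp(lb)) without overflow or underflow */
	    double log_add(double la, double lb)
	    {
	        if (la < lb) { double t = la; la = lb; lb = t; }
	        if (lb == -HUGE_VAL) return la;    /* adding a zero term */
	        return la + log1p(exp(lb - la));   /* exp argument <= 0  */
	    }

	    /* log( sum_i p_i * q_i ) from log-probabilities lp[], lq[] */
	    double log_sum_of_products(const double *lp, const double *lq,
	                               int n)
	    {
	        double acc = -HUGE_VAL;            /* log(0) */
	        for (int i = 0; i < n; i++)
	            acc = log_add(acc, lp[i] + lq[i]);
	        return acc;
	    }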

	The REAL solution is to provide them with the ability to take
	long products & sums & not worry about overflow or underflow.

	And THIS is exactly what was shot down last month in the form
	of the reductions proposal as well as others.

	For us, part of the justification for including these functions
	was that, by including them, we could eliminate the need for
	counting mode.  This kind of computation is the only known
	application for it, & we felt it was much better to provide the
	functions themselves than
	to mandate a mode that is only a hack when it comes to solving
	the problem.  Without the reductions we are forced to consider
	counting mode again.
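
	For concreteness, here is roughly the bookkeeping that either a
	wide-range product reduction or counting mode has to provide,
	written out by hand (the names & interface are mine, purely
	illustrative): keep the fraction near 1 & carry the excess
	exponent in an integer, so the running product never leaves
	range no matter how long it gets.

	    #include <math.h>

	    /* Product of x[0..n-1], each factor assumed to be a normal
	     * double, returned as frac * 2^(*e2) with 0.5 <= |frac| < 1.
	     * The integer exponent plays the role of the over/underflow
	     * count that counting mode would accumulate. */
	    double prod_scaled(const double *x, int n, long *e2)
	    {
	        double frac = 1.0;
	        long   e    = 0;

	        for (int i = 0; i < n; i++) {
	            int k;
	            frac = frexp(frac * x[i], &k); /* renormalize */
	            e   += k;
	        }
	        *e2 = e;
	        return frac;
	    }

	The log10 of a ratio of two such products is then just
	(eA - eB)*log10(2) + log10(fracA/fracB), which covers the
	10^(+/-20000) range without ever touching an infinity or a NaN.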

	The application that Prof Kahan mentioned, namely Clebsch-Gordan
	or Wigner coefficients, did not strike many members of the
	committee as important enough to justify such reductions.

	Perhaps that is so.  But I have now discovered that there are
	people out there who need these functions in their 'daily
	lives'.  These people are NOT numerical analysts & I think we
	do them a disservice if we force them to learn our profession.
	Beware: Soon the biologists will have the power to take their
	revenge on us in nasty & disgusting ways. :-)

	Well, I guess it's time to get off my soapbox & let someone
	point out that our laundry list is arbitrary or vague.  But
	I urge you to attend the subcommittee meeting Tuesday where
	I would like to amend the reductions proposal to meet your
	objections & propose a modified version again next month.

	Because I'm certain you can't make this problem go away by
	ignoring it.

	Dying might help.  But ignoring won't work.


				Dan

  
