
(long) 'severe' problems, was Re: Decorated intervals ARE intervals...



> Date: Sun, 03 Jul 2011 15:20:35 -0500
> From: Ralph Baker Kearfott <rbk@xxxxxxxxxxxx>
> To: John Pryce <j.d.pryce@xxxxxxxxxxxx>
> CC: stds-1788 <stds-1788@xxxxxxxxxxxxxxxxx>
> Subject: Re: Decorated intervals ARE intervals...

	Ralph et al,

	This note is verbose & I buried the lead down near the
	bottom.  Suffice it to say that I don't think our choice
	is as stark as failing to assure the computation versus
	slow 17-byte objects.  I think we can have 16 & 17-byte
	objects that are both fast & correct, but you have to
	read the long note to see why I think so.

> 
> John, P-1788,
> 
> Can someone clarify what "severe storage and communication penalties"
> means?
> For example, my naive software implementation of decorations
> would be to tack on an integer (in whatever format is most efficient
> for the machine) to the two doubles.  If it is a 2-byte integer, the
> storage penalty would be 2/(16+2) < 12 percent.  Tacking on an extra
> 8 bytes (wasteful from the point of view of storage, but conceivably
> logical to do in some circumstances), the storage penalty would be
> < 34 percent.
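
	(For concreteness, the naive layout Ralph describes might be
	sketched in C roughly as below.  The type & field names are
	mine, not anything from the draft; note also that a typical
	compiler will pad the 18 bytes out to 24 to keep the doubles
	8-byte aligned unless told otherwise.)

	#include <stdint.h>
	#include <stdio.h>

	/* Two doubles for the bounds plus a 2-byte decoration,
	 * as in Ralph's example.  Illustrative names only.      */
	typedef struct {
	    double   inf;   /* lower bound                 */
	    double   sup;   /* upper bound                 */
	    uint16_t dec;   /* decoration, 2-byte integer  */
	} decorated_interval;

	int main(void)
	{
	    /* Usually 24 on common ABIs, because of padding. */
	    printf("sizeof         = %zu bytes\n",
	           sizeof(decorated_interval));
	    /* The packed minimum Ralph counts: 16 + 2 = 18.  */
	    printf("packed minimum = %zu bytes\n",
	           2 * sizeof(double) + sizeof(uint16_t));
	    return 0;
	}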

	I suppose I should weigh in on this one.

	I will let you judge the use of the word 'severe' in this.
	I suspect both the memory & performance hits are bounded
	by a factor of 2 on the upside.

	Let me divide my observations into two parts:  (1) Those
	machines that exist today & (2) Those hardware extensions
	we all hope to find supporting intervals in the future.

	Part (1):

	Machines today get the performance they do by moving data
	around quickly.  From main memory to cache, from cache to
	registers, from registers to ALUs.  Each machine architecture
	approaches the problem slightly differently but, as my old
	friend Willy used to say, when you want to increase your
	bandwidth you have to either increase your band or increase
	your width.  There are no other choices & you cannot 'clever'
	your way around the problem.

	Assuming you have done all that current physics & technology
	permits you in cranking your clock up to get your 'band',
	you are left with increasing your width.  That means making
	busses wider.  And busses are expensive.  These days they
	take up more real estate than the transistors they talk to.
	(Except for cache transistors.  We have some mighty big
	on-chip caches these days.)

	Even on so-called 32-bit machines, the typical bus width
	between a floating-point register & its ALU is 64 bits.
	(32-bit floating-point operations will either use half the
	bus or try to double up the operations.)  There are between
	5 & 11 such busses.  (Register files have many read/write
	ports in the parallel ALU & deep pipe machines that we
	currently enjoy.)

	The busses between registers & cache may or may not be a
	factor of 2 to 4 wider & will certainly handle many piped
	transfers in a row.  There may or may not be a second or
	even a third level of cache, for which the busses either
	get wider still or support more operations sequentially.

	Busses between on-chip caches & memory are very wide indeed.
	But here is where Willy's observation comes in:  One can
	trade wires for latency if one is willing to tie up the bus
	for some number (4 to 8 or more) of read/write cycles.  And
	different architectures solve this problem differently.

	The extreme at one end is the low-power hand-held embedded
	computer with narrow busses & off-loaded sequential data
	transfer bursts.

	The extreme at the other end originated with Seymour Cray
	(then at CDC) who balanced the bandwidth all the way from
	the ALUs to main memory.  If the caches were twice as slow
	as the registers, the busses were twice as wide.  And (in
	his case) if the main memory was 128 times slower than ALU
	operations, then it was 128 times wider.  Thus CDC-6400s &
	Cray-1s could work just as fast on large data sets as
	small ones if the programmer was clever enough to keep all
	these busses active.

	Typical machines today are somewhere in between &, sad to
	say, closer to the former than the latter.  But this is
	mostly for reasons of cost & as the cost comes down things
	get more parallel.  I think we are about to enter an era
	of very wide & very parallel machines with truly obscene
	performance.

	OK, this part of my opus was long & involved for the
	following reason:  You should observe, as several people
	have pointed out, that non-power-of-2 data objects don't
	fit well in systems designed to move power-of-2 objects
	around.  But you should also observe that the worst
	mismatch is right at the top, between the registers &
	the ALUs.

	At that level, a 17-byte object might take 3 transfers
	rather than 2.  Farther down, the problem may or may not be
	attenuated by your alignment policy.  Again, architectures
	fall into two different classes.

	There are machine architectures that allow objects of any
	byte length to be aligned on any byte address by use of
	elaborate byte rotators on entry to & exit from (otherwise)
	aligned busses.
	On such machines the best policy would be to go ahead &
	pack up objects every 17 bytes & save on main memory.

	But more typical machines (cf. Intel & clones) require that
	a 2^n-byte object be aligned & addressed on 2^n-byte
	boundaries.  On these machines the best policy might be to
	go ahead & blow some memory & align our objects on 32-byte
	boundaries.  The expense is not the (cheap) memory but the
	(rather more expensive) use of memory bandwidth to ferry
	around data we are not using.
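
	(Here is a rough C sketch of the two policies, purely for
	illustration.  The type names & the GCC/Clang attribute
	syntax are my own assumptions, not anything the draft
	specifies.)

	#include <stdint.h>
	#include <stdio.h>

	/* Policy A: pack every 17 bytes & save main memory.
	 * Reasonable where unaligned access is cheap; most
	 * elements end up misaligned for 8-byte loads.        */
	typedef struct __attribute__((packed)) {
	    double  inf, sup;   /* 16 bytes of interval       */
	    uint8_t dec;        /* 1-byte decoration          */
	} di_packed;            /* sizeof == 17                */

	/* Policy B: blow some memory & align on 32-byte
	 * boundaries.  Every object starts on a friendly
	 * boundary; 15 of the 32 bytes carry nothing useful.  */
	typedef struct __attribute__((aligned(32))) {
	    double  inf, sup;
	    uint8_t dec;
	} di_aligned32;         /* sizeof == 32                */

	int main(void)
	{
	    printf("packed:    %zu\n", sizeof(di_packed));
	    printf("aligned32: %zu\n", sizeof(di_aligned32));
	    return 0;
	}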

	Part (2), Future machines:

	We all hope that the hardware community will embrace 1788
	with open arms & do their very best to find interesting &
	innovative ways to run interval arithmetic as fast as their
	little transistors will carry it.

	Maybe someday.  But not today.  And not, I suspect, on their
	first attempt to support 1788.

	For you see, the choice is NOT poorly packed 17-byte objects
	versus failing to assure the computation.  There is a 3rd &
	even a 4th alternative to this seemingly stark choice.

	The 3rd choice is to count on optimizing compilers to know
	when decoration results are not needed.

	This is in direct analogy to when 754 flags & modes are or
	are not needed.  The past 3 decades have taught us that
	virtually all the time they are NOT needed.  And in the
	remaining cases they are ABSOLUTELY necessary.

	Since we had hardware (8087) almost from the start, there
	was not much to be gained by looking for & eliminating the
	need for flags & modes.

	But we have much to gain in 1788 & we should give the
	compilers as much help as we can to aid them in this.
	An optimizing compiler that can tell when the decorations
	are never referenced in a result (or, indeed, an entire
	program) can safely eliminate both the computation of
	decorations & the storage for them.

	Thus, 16-byte objects become the norm when it is known
	that assured computing does not depend on decorations.
	And 16-byte busses that serve specialized interval ALUs
	would be about as good as it gets.
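
	(A hypothetical sketch of what that looks like in C:  when
	a whole computation never inspects a decoration, the bare
	16-byte type can be used throughout, whether the compiler
	does the elimination or the programmer does it by hand.
	The names are mine & directed rounding is elided for
	brevity.)

	#include <stddef.h>
	#include <stdio.h>

	typedef struct { double inf, sup; } bare_interval;  /* 16 bytes */

	/* No decoration is ever referenced below, so neither
	 * the decoration arithmetic nor its storage is needed
	 * & the busses move nothing but useful bounds.        */
	static bare_interval add(bare_interval x, bare_interval y)
	{
	    /* A real kernel would round inf down & sup up.    */
	    bare_interval z = { x.inf + y.inf, x.sup + y.sup };
	    return z;
	}

	static bare_interval sum(const bare_interval *v, size_t n)
	{
	    bare_interval s = { 0.0, 0.0 };
	    for (size_t i = 0; i < n; i++)
	        s = add(s, v[i]);
	    return s;
	}

	int main(void)
	{
	    bare_interval v[3] = { {1,2}, {3,4}, {5,6} };
	    bare_interval s = sum(v, 3);
	    printf("[%g, %g]\n", s.inf, s.sup);   /* [9, 12]   */
	    return 0;
	}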

	The 4th alternative is something that lives at level 1 but
	which we have not thought much about at the lower levels.

	You see, just because we associate a decoration with each
	interval doesn't mean we have to STORE THEM IN THE SAME
	PLACE.

	Another approach a good compiler might take would be to
	place all interval parts in one place (say offset from
	a 128-bit aligned memory location called $INTERVALPART),
	& put all decorations in ANOTHER place (say offset from
	the memory location $DECORATION).

	Thus the i'th interval object may be referenced by finding
	its interval part at $INTERVALPART + i<<4 & its decoration
	at $DECORATION + i (or + i<<1, as in the case described by
	Ralph).
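
	(A hypothetical C sketch of that layout:  the interval
	parts packed in one array, the decorations packed in
	another, with the i'th object found exactly as above.
	$INTERVALPART & $DECORATION become two base pointers;
	the names & the 1-byte decoration are my assumptions.)

	#include <stdint.h>
	#include <stdlib.h>

	typedef struct { double inf, sup; } bare_interval;  /* 16 bytes */
	typedef uint8_t decoration;                         /*  1 byte  */

	typedef struct {
	    bare_interval *intervalpart;  /* base of 16-byte parts */
	    decoration    *decor;         /* base of 1-byte parts  */
	    size_t         n;
	} interval_soa;

	/* i'th interval part at intervalpart + i, i.e. base +
	 * (i << 4) in byte terms; i'th decoration at decor + i. */
	static bare_interval *part(interval_soa *a, size_t i)
	{ return &a->intervalpart[i]; }
	static decoration *dec(interval_soa *a, size_t i)
	{ return &a->decor[i]; }

	int main(void)
	{
	    interval_soa a;
	    a.n = 1000;                   /* error handling elided */
	    a.intervalpart = malloc(a.n * sizeof *a.intervalpart);
	    a.decor        = malloc(a.n * sizeof *a.decor);
	    if (!a.intervalpart || !a.decor) return 1;

	    *part(&a, 7) = (bare_interval){ 1.0, 2.0 };
	    *dec(&a, 7)  = 4;   /* some decoration value, illustrative */

	    free(a.intervalpart);
	    free(a.decor);
	    return 0;
	}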

	Memory is 100% packed up.  The busses are 100% used to
	transfer data rather than fluff.  And the bandwidth for
	computation is the best that money can buy.

	And these 16-byte busses feeding interval ALUs can be
	supported by 1-byte busses on the side that manage the
	decorations as well.  (Chances are both will happen
	through overloading existing floating-point & integer
	busses for these purposes.  But that's a detail not
	really important to the argument.)

	All this is possible so long as we don't do anything to
	prevent it.  So far, we're OK.  Let's keep it that way
	by keeping our specifications as much at level 1 as we
	can manage.

	I believe the 3rd alternative will be by far the most used.
	The 4th is only for those cases for which the information
	in the decoration is absolutely necessary.  A tiny but
	important minority of the time, I suspect.

> 
> In my own work, reliability and predictability of the interval
> operations has usually (but not necessarily always)
> trumped speed of the raw arithmetic.  What is the experience of
> others?
> 
> Baker
> 

	OK, about that word 'severe'.  In any case the worst hit,
	either in space or time, is bounded on the upside by a
	factor of 2.  And our user has already agreed to a loss
	of at least a factor of two because she wants assured
	computing over speed or capacity.  Choosing from among
	the 3rd & 4th alternatives avoids that second hit & is
	about as good as it gets.

	But if we fail to give her assured computing by permitting
	unneeded 'optimizations', we have failed completely.

	Let's not do that.


				Dan


	P.S. - BTW, if people believe the 3rd & 4th alternatives
	are useful, it might be a good idea to make appropriate
	informative comments about that in the draft.  Just so
	our implementers don't have to discover it for themselves.