
Re: Constructors & decorations



> Subject: Re: Constructors & decorations
> From: John Pryce <j.d.pryce@xxxxxxxxxxxx>
> Date: Tue, 2 Mar 2010 12:24:04 +0000
> To: stds-1788 <stds-1788@xxxxxxxxxxxxxxxxx>

	John, Vincent, Nate,

	In deference to the others & the consideration of
	Motion 11 at this time, I think we should take this
	conversation off line until that is done.

	Especially the issues of the meaning of the various
	decoration states & the rules for their propagation.

	But I will try to answer the hardware question below.

	Alas, even this single answer is a long one as I must
	go into the details of how modern deep pipe machines
	handle both state & exceptions.

	Those uninterested may safely delete this note.

> 
> Dan & P1788
> 
> On 23 Feb 2010, at 12:17, Dan Zuras Intervals wrote:
> > John Pryce had written:
> >> I believe not. I think the intention of Motion 8 is that once
> >> an interval is set notValid, _all_ its other attributes become
> >> meaningless. It has fallen into a black hole. So a V- interval
> >> is essentially the same as NaI. It follows that all query-functions
> >> on it should give results of maximum meaninglessness:
> >> - Its other decorations should be D0 C0 B0.
> >> - Its lower & upper bounds, midpoint, radius etc., should all return NaN.
> > 
> > 	So it seems that you are reserving the interpretation
> > 	of 'Valid' things to those things which are imported
> > 	into intervals from other arithmetics.  Constructors,
> > 	converters, & the like.
> 
> Yes.
> 
> > 	Does that not seem like a particularly narrow use for
> > 	a decoration?  Would not 'Defined' do as well?
> 
> The Motion 8.02 document says on p3:
> > -- inv is the decoration of an interval created by an invalid
> > conversion
> > ...
> > we may have ...
> > (X,ok) \union (Empty,inv) = (X,inv)
> > as an unambiguous version of X \union NaI = NaI.
> 
> This seems to me to agree with my interpretation.
> 
> 'Defined' has a quite different and specific purpose,
> to do with the applicability of fixed-point theorems.
> 
> > ...	I say this because decoration bits might end up being
> > 	expensive things some day.  If we are successful &
> > 	intervals find their way into hardware, decoration bits
> > 	will become state bits.  They will have interlocks &
> > 	busses devoted to them.  They will be saved on context
> > 	switch.  They will become involved in issues of
> > 	parallelism & cache coherency.  We should make sure
> > 	they are useful.
> 
> I am not a hardware guy so maybe I have the wrong end of the stick.
> But didn't Nate argue for decorations, in contrast to global flags,
> precisely on the grounds that
> - they are fast because they are attached to the individual data item,
> hence
> - they don't need to be in state bits
> - they don't need interlocks or busses
> - they are NOT part of a context switch
> - they don't affect parallelism & cache coherency more than any other
> kind of data
> 
> Regards
> 
> John

	You are largely correct in so far as GLOBAL state bits are
	concerned.

	But let's look at how one might implement hardware involving
	decorations attached to the intervals themselves.

	First, let me explain how modern machines handle exceptions
	in the first place.

	By analogy, here is how one might add two integers in hardware
	typical of today (& tomorrow):

		LDZ addr1,r5
		LDZ addr2,r6
		ZADD r5,r6,r7
		STZ r7,addr3

	The first two instructions resolve address offsets to memory
	locations, read (say) 32 bits from memory at those locations,
	& move that information into integer registers designed to
	accept integers of that size.  The next instruction reads the
	values previously stored in those registers, passes them
	along busses built into the very heart of the CPU to an
	integer adder circuit which has an output buss leading back
	to the integer register file.  The store puts the result back
	in memory.
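
	For the software folks, here is a rough C rendering of what
	that four instruction sequence implements (the register names
	in the comments are just the ones from the example above):

		#include <stdint.h>

		/* Rough source-level equivalent of the LDZ/LDZ/ZADD/STZ
		   sequence above: two loads, one integer add, one store. */
		void add_ints(const int32_t *addr1, const int32_t *addr2,
		              int32_t *addr3)
		{
		    int32_t r5 = *addr1;   /* LDZ addr1,r5  */
		    int32_t r6 = *addr2;   /* LDZ addr2,r6  */
		    int32_t r7 = r5 + r6;  /* ZADD r5,r6,r7 */
		    *addr3 = r7;           /* STZ r7,addr3  */
		}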

	Register bits are expensive.  They are at the very heart of
	what a computer does.  They are not like memory bits.  In
	particular, modern registers have multiple dedicated read &
	write ports to support parallel accesses simultaneously.
	For each port there will be a buss dedicated to routing the
	information on that port to/from the computation circuit it
	needs to reach.  At the very minimum, 2 read ports & one write
	port to support the above code.  More often than not, 3 reads
	& 3 writes to support all those operations in parallel.  The
	last machine I worked on had 11 read & write ports.  And I
	have been retired for a decade now.

	So, register bits are expensive.

	They are also part of the process state.  If (when) the
	currently running process must be interrupted for some other
	process, the running process must be brought to some coherent
	halt, all the state associated with it (including all the
	registers) must be stored into memory, & a new process's state
	loaded into those registers to start another thread of
	execution.

	How much of this state must be swapped & how much bandwidth to
	memory is available are the things that determine the cost of
	a context switch.

	This is the global state.

	But even with so-called local state, something similar happens.
	The Intel-style parallel operations are designed to do short
	vector operations much like (but simpler than) what Seymour
	Cray did years ago on the Cray machines.  The code still
	looks like:

		LDpZ addr1,pr5
		LDpZ addr2,pr6
		pZADD pr5,pr6,pr7
		STpZ pr7,addr3

	Only now it is carried out 4 at a time in dedicated vector
	registers.
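
	In software these show up as (for example) the SSE2 style
	intrinsics.  A rough sketch of the 4-wide sequence above, with
	the vector register names from the example in the comments:

		#include <stdint.h>
		#include <emmintrin.h>   /* SSE2 intrinsics */

		/* Rough source-level form of the LDpZ/pZADD/STpZ sequence:
		   four 32-bit adds carried out by one vector instruction. */
		void add_int4(const int32_t *addr1, const int32_t *addr2,
		              int32_t *addr3)
		{
		    __m128i pr5 = _mm_loadu_si128((const __m128i *)addr1); /* LDpZ addr1,pr5    */
		    __m128i pr6 = _mm_loadu_si128((const __m128i *)addr2); /* LDpZ addr2,pr6    */
		    __m128i pr7 = _mm_add_epi32(pr5, pr6);                 /* pZADD pr5,pr6,pr7 */
		    _mm_storeu_si128((__m128i *)addr3, pr7);               /* STpZ pr7,addr3    */
		}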

	More state.  More expense.  More stuff to swap on context
	switch.  But it is considered worth it if the busses to & from
	memory are wide enough to support such operations in parallel,
	& because such operations are considered common enough (in
	graphics code & the like) to justify the extra expense for the
	added performance gained.

	The heart of our issue here is how modern machines handle things
	like integer errors.  Each machine does it differently & advances
	continue to be made in this area.  Let me just say that those
	advances are precisely those that attempt to eliminate integer
	errors from global state.  So, for example, an overflow used to
	be recorded in a dedicated overflow bit, set as a side effect of
	the add & tested after the add.  But in a modern machine that
	bit becomes an interlock issue.
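
	In source form that old style looks something like the sketch
	below (just an illustration, not any particular machine's
	semantics): every add is followed by a test & branch on the
	overflow condition.

		#include <stdint.h>

		/* The old test-after-the-add style, sketched in C: the add
		   is followed by an explicit test & branch on "overflow". */
		int add_checked(int32_t a, int32_t b, int32_t *out)
		{
		    int64_t wide = (int64_t)a + (int64_t)b;    /* the add          */
		    if (wide > INT32_MAX || wide < INT32_MIN)  /* test the flag    */
		        return 0;                              /* branch: overflow */
		    *out = (int32_t)wide;
		    return 1;
		}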

	One of the things I have not yet mentioned is the large disparity
	between the CPU cycle time & the memory read time.  This ratio
	(mem-time/cycle-time) is large these days.  This has been dealt
	with by making the execution pipes correspondingly deep.  So, if
	I do load, load, add, store, the pipe is deep enough that the
	time taken to load the operands from memory is still less than
	the time the add spends in the pipe & the store can also be
	accommodated (possibly) in some other pipe.  Should the result
	of that add be used in some other register expression, there are
	dedicated bypass busses that take the output of the add (or the
	output of the store) & reroute it through the pipe to the
	instruction that needs it rather than wait for yet another
	cycle from memory.

	More dedicated hardware.  More expense.  But justified by the
	performance gained.

	The depth of this pipe can be 15 instructions or more.  The last
	machine I worked on was of length 25.  And I have been retired
	for a decade.  My friends at Intel were working on 35 instruction
	deep pipes at the time.  And the ratio of mem-time/cycle-time has
	only gotten larger.

	So, now you can see why storing the result of an integer overflow
	in a global overflow bit would be bad for a deep pipe machine.
	Any test on such a bit would amount to a conditional branch in
	the middle of this very deep pipe.  This means that the thread
	of execution must be brought to a halt & a new thread of
	instructions loaded & executed.  Given that many instructions
	are in a state of 'partial' or 'speculative' execution, the
	partial results of those instructions must be backed out &
	flushed to make way for the new list of instructions.  If (as
	is likely) there are multiple branches in the pipe, there are
	multiple possible threads that have to be managed.

	More complexity.  More expense.  More design time.  Less
	performance.

	In the days of global state bits & test & branch style code,
	the typical number of instructions executed before a conditional
	branch was encountered was 4.  All the expense & bother of very
	deep pipes are mostly wasted in this sort of test & branch
	architecture.

	And this cost is paid even if there are no tests or branches.
	You see, at some point in the execution of even branchless code,
	the pipe might have to be stopped IN ANTICIPATION of a branch in
	the next instruction.  Performance suffers & the effects of a
	deep pipe are lost long before the end of the pipe.

	So modern machines have found innovative ways of attaching the
	effects of a possible exception to the instruction itself.  So,
	for example, one might do

		ADD r5,r6,r7

	when one is unconcerned with the results of an overflow and

		ADDOVF r5,r6,r7,c3

	where the result of the overflow is stored in a condition bit
	(c3) and subsequent dependent operations are tagged with the
	moral equivalent of (do this instruction if c3 is (set/clear)).

	No branches.  No interlocks.  Some expense for the dedicated
	condition registers but much less cost in the complexity of the
	pipe.  We now need smart compilers to mitigate the effects of
	executing too many instructions that produce unneeded results.
	But moving this issue to the compiler is a big win.  Less global
	state.  Faster & less complicated hardware.
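
	In compiler terms the new style looks something like the sketch
	below.  The __builtin_add_overflow intrinsic (GCC & Clang) is
	just a stand-in for ADDOVF here; the point is that the overflow
	lands in a local condition value & the follow-on work is
	selected without a test & branch (a compiler will typically
	lower the select to a conditional move).

		#include <stdint.h>
		#include <stdbool.h>

		/* The condition-bit style, sketched in C.  The overflow goes
		   into a local "c3" rather than a global flag, & the follow-on
		   result is predicated on it with straight-line code. */
		int32_t add_predicated(int32_t r5, int32_t r6, int32_t fallback)
		{
		    int32_t r7;
		    bool c3 = __builtin_add_overflow(r5, r6, &r7); /* ADDOVF r5,r6,r7,c3 */

		    /* "do this if c3 is clear" -- no branch in the source;
		       typically a conditional move in the generated code */
		    return c3 ? fallback : r7;
		}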

	OK, that's the preamble.

	But having laid that ground, the issue for dedicated interval
	hardware is no different.  We will likely have:

		LDII addr1,iir5
		LDII addr2,iir6
		IIADD iir5,iir6,iir7
		STII iir7,addr3

	in the typical case.  With decorations this implies a load
	which loads both the upper & lower bounds into dedicated
	registers as well as the associated decorations in THEIR OWN
	dedicated registers.  It implies a multi-port register file
	with dedicated busses to dedicated interval CPU hardware.
	As well as dedicated busses & decoration propagation circuits
	to handle the result decorations.  And, of course, a store
	that returns the decoration along with the interval result.
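
	In software terms, the data path for that IIADD might emulate
	something like the sketch below.  The decoration names & the
	"worst decoration wins" propagation rule here are placeholders
	of my own, not the P1788 definitions, & the directed rounding
	is elided.

		/* A minimal sketch of a decorated interval & its add.  The
		   decoration enum & propagation rule are illustrative only. */
		typedef enum { DEC_OK, DEC_D_MINUS, DEC_NOT_VALID } decoration;

		typedef struct {
		    double     lo, hi;  /* the interval bounds (iir registers)   */
		    decoration dec;     /* the decoration, in its "own register" */
		} dec_interval;

		/* IIADD iir5,iir6,iir7: add the bounds (outward rounding
		   omitted here) & propagate the worse input decoration. */
		static dec_interval ii_add(dec_interval x, dec_interval y)
		{
		    dec_interval z;
		    z.lo  = x.lo + y.lo;                     /* should round down */
		    z.hi  = x.hi + y.hi;                     /* should round up   */
		    z.dec = (x.dec > y.dec) ? x.dec : y.dec; /* assumed rule      */
		    return z;
		}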

	But how would we handle a test on a decoration?  Say we are
	concerned that the result interval might be undefined (D-).
	If we were to test & branch on the decoration part of result
	register iir7, we would be in the same boat as the folks who
	used to test for integer overflow.  And the same sins would
	befall us.

	(This is the REAL global state issue that we need to avoid.)

	But, like the conditional execution model, we could replace

		IIADD iir5,iir6,iir7

	with

		IIADDD- iir5,iir6,iir7,iic3

	storing the results of an undefined condition in a condition
	bit to be used by subsequent instructions to predicate the
	results of their execution without the bother of a test &
	branch.  Or, if we consider the decoration register as if it
	were already a condition register, one might predicate
	subsequent operations on the state of the decoration of iir7.
	So

		IISUBD- (iir7),iir8,iir9,iir10

	That is, subtract iir8 from iir9, placing the result in iir10,
	IF the decoration state of iir7 is D-.
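
	Continuing the dec_interval sketch from above (again with names
	of my own invention), the predicated form might be emulated as:

		/* IISUBD- (iir7),iir8,iir9,iir10: compute iir9 - iir8 but let
		   the result take effect only when iir7 carries D-; otherwise
		   the old contents of iir10 pass through.  No test & branch. */
		static dec_interval ii_sub_if_dminus(dec_interval iir7,
		                                     dec_interval iir8,
		                                     dec_interval iir9,
		                                     dec_interval iir10)
		{
		    dec_interval t;
		    t.lo  = iir9.lo - iir8.hi;               /* should round down */
		    t.hi  = iir9.hi - iir8.lo;               /* should round up   */
		    t.dec = (iir8.dec > iir9.dec) ? iir8.dec : iir9.dec;

		    /* predicate on iir7's decoration; a compiler would lower
		       this select to conditional moves, not a branch */
		    return (iir7.dec == DEC_D_MINUS) ? t : iir10;
		}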

	So, even though the state is not global, there is a cost to
	it.  A large non-zero cost.  Interval registers will be
	expensive.  More expensive than integer registers but
	perhaps less expensive than dedicated parallel floating-point
	registers.  Interval hardware will also be more expensive
	than integer hardware but less expensive than 4x parallel
	floating-point hardware.

	And, from a standards perspective, we must design our decoration
	state in such a way that dedicated decoration propagation
	hardware is easy to design & implement.

	(This last is the issue I have been thinking about but avoiding
	talking about for so long.  But I promise to talk to you guys
	offline in the near future & to the group as a whole after the
	idea has been thoroughly shat upon. :-)

	Now, let me repeat your points:

> 
> I am not a hardware guy so maybe I have the wrong end of the stick.
> But didn't Nate argue for decorations, in contrast to global flags,
> precisely on the grounds that
> - they are fast because they are attached to the individual data item,
> hence
> - they don't need to be in state bits
> - they don't need interlocks or busses
> - they are NOT part of a context switch
> - they don't affect parallelism & cache coherency more than any other
> kind of data
> 

	I hope you can see now that attaching decorations to the interval
	itself is only the first part (necessary but not sufficient) in
	making them 'local' or at least not global.  How the hardware
	interacts with them is also of critical importance.

	And, point by point,

	- They can be made fast partially because they are attached to
	individual data items but mostly because we can interact with
	them in ways that do not disturb a deep pipe.  Hence,
	- They need not be GLOBAL state bits but they are part of any
	interval register contents & hence must be swapped on context
	switch.
	- They can be constructed in such a way that the instruction
	stream interacts with them in a manner that does not generate
	needless conditional branches.
	- They ARE part of the context switch but not necessarily part
	of a GLOBAL context.
	- And they can be made to operate in parallel, with all the
	cache coherency issues that come along with that.

	This last point is much like the situation with the Intel
	parallel instructions: since the instruction assumes a stride
	of 1 through memory, it can be assumed that any load/store is
	independent of any other load/store within the same instruction.
	Other cache coherency issues (as between multiple independent
	processors) still exist if these processors operate out of the
	same memory space.  The Cell processor solves this problem by
	giving each processor element its own memory space.  Other
	machines still have cache coherency hardware to contend with.

	Well, that's it mostly.
	
	I think I've gone over my $0.02 worth on this issue.

	You may keep the change. :-)


				Dan