Re: Constructors & decorations
> Subject: Re: Constructors & decorations
> From: John Pryce <j.d.pryce@xxxxxxxxxxxx>
> Date: Tue, 2 Mar 2010 12:24:04 +0000
> To: stds-1788 <stds-1788@xxxxxxxxxxxxxxxxx>
John, Vincent, Nate,
In deference to the others & the consideration of
Motion 11 at this time, I think we should take this
conversation offline until that is done.
Especially the issues of the meaning of the various
decoration states & the rules for their propagation.
But I will try to answer the hardware question below.
Alas, even this single answer is a long one as I must
go into the details of how modern deep pipe machines
handle both state & exceptions.
Those uninterested may safely delete this note.
>
> Dan & P1788
>
> On 23 Feb 2010, at 12:17, Dan Zuras Intervals wrote:
> > John Pryce had written:
> >> I believe not. I think the intention of Motion 8 is that once
> >> an interval is set notValid, _all_ its other attributes become
> >> meaningless. It has fallen into a black hole. So a V- interval
> >> is essentially the same as NaI. It follows that all query-functions
> >> on it should give results of maximum meaninglessness:
> >> - Its other decorations should be D0 C0 B0.
> >> - Its lower & upper bounds, midpoint, radius etc., should all return NaN.
> >
> > So it seems that you are reserving the interpretation
> > of 'Valid' things to those things which are imported
> > into intervals from other arithmetics. Constructors,
> > converters, & the like.
>
> Yes.
>
> > Does that not seem like a particularly narrow use for
> > a decoration? Would not 'Defined' do as well?
>
> The Motion 8.02 document says on p3:
> > -- inv is the decoration of an interval created by an invalid
> > conversion
> > ...
> > we may have ...
> > (X,ok) \union (Empty,inv) = (X,inv)
> > as an unambiguous version of X \union NaI = NaI.
>
> This seems to me to agree with my interpretation.
>
> 'Defined' has a quite different and specific purpose,
> to do with the applicability of fixed-point theorems.
>
> > ... I say this because decoration bits might end up being
> > expensive things some day. If we are successful &
> > intervals find their way into hardware, decoration bits
> > will become state bits. They will have interlocks &
> > busses devoted to them. They will be saved on context
> > switch. They will become involved in issues of
> > parallelism & cache coherency. We should make sure
> > they are useful.
>
> I am not a hardware guy so maybe I have the wrong end of the stick.
> But didn't Nate argue for decorations, in contrast to global flags,
> precisely on the grounds that
> - they are fast because they are attached to the individual data item,
> hence
> - they don't need to be in state bits
> - they don't need interlocks or busses
> - they are NOT part of a context switch
> - they don't affect parallelism & cache coherency more than any other
> kind of data
>
> Regards
>
> John
You are largely correct in so far as GLOBAL state bits are
concerned.
But let's look at how one might implement hardware involving
decorations attached to the intervals themselves.
First, let me explain how modern machines handle exceptions
in the first place.
By analogy, here is how one might add two integers in hardware
typical of today (& tomorrow):
LDZ addr1,r5
LDZ addr2,r6
ZADD r5,r6,r7
STZ r7,addr3
The first two instructions resolve address offsets to memory
locations, read (say) 32 bits from memory at that location,
& move that information into integer registers designed to
accept integers of that size. The next instruction reads
the previously stored information in those registers, passes
them along busses built into the very heart of the CPU to an
integer adder circuit which has an output buss leading back
to the integer register file. The store puts the result back
in memory.
Register bits are expensive. They are at the very heart of
what a computer does. They are not like memory bits. In
particular, modern registers have multiple dedicated read &
write ports to support parallel accesses simultaneously.
For each port there will be a buss dedicated to routing the
information on that port to or from the computation circuit it
serves. At the very minimum, 2 read ports & one write port are
needed to support the above code. More often than not, 3 reads
& 3 writes to support all those operations in parallel. The
last machine I worked on had 11 read & write ports. And I
have been retired for a decade now.
So, register bits are expensive.
They are also part of the process state. If (when) the
currently running process must be interrupted for some other
process, it must be brought to some coherent halt, all the
state associated with it stored into memory (including all
the registers), & a new process loaded into the registers to
start another thread of execution.
How much of this state must be swapped, & the available
bandwidth to memory, are what determine the cost of a context
switch.
This is the global state.
But even with so-called local state, something similar happens.
The Intel style parallel operations are designed to do short
vector operations much like (but simpler than) Seymour Cray did
years ago on the Cray machines. The code still looks like:
LDpZ addr1,pr5
LDpZ addr2,pr6
pZADD pr5,pr6,pr7
STpZ pr7,addr3
Only now it is carried out 4 at a time in dedicated vector
registers.
More state. More expense. More stuff to swap on context
switch. But it is considered worth it if the busses to & from
memory are wide enough to support such operations in parallel.
And such things are considered common enough (in graphics
code & the like) to justify the extra expense for the added
performance gained.
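In software terms, the 4-at-a-time operation above is just four independent lane-wise adds. Here is a minimal sketch using the GCC/Clang vector extension (the `i32x4` type name and the use of compiler vector extensions are my assumptions for illustration; real SIMD hardware performs the same four adds in a single instruction):

```c
#include <stdint.h>

/* Hypothetical 4-wide integer add, mimicking pZADD above.
   The vector_size attribute is a GCC/Clang extension. */
typedef int32_t i32x4 __attribute__((vector_size(16)));

i32x4 padd(i32x4 a, i32x4 b)
{
    /* One "instruction": four independent lane-wise adds. */
    return a + b;
}
```

The compiler typically lowers this to a single SIMD add on the dedicated vector registers described above.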
The heart of our issue here is how modern machines handle things
like integer errors. Each machine does it differently & advances
continue to be made in this. Let me just say that those advances
are precisely those that attempt to eliminate integer errors from
global state. So, for example, overflows used to be stored in a
dedicated overflow register set as a side effect of the add &
tested after the add. But in a modern machine this becomes an
interlock issue.
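The modern alternative is to attach the overflow indication to the operation itself rather than to a global flag register. A sketch of that style in C, using the GCC/Clang builtin (the builtin is an assumption about the toolchain; the hardware analogue is an add that writes a per-result condition bit):

```c
#include <stdint.h>
#include <stdbool.h>

/* The overflow result travels with this one operation; no global
   overflow register is read or written, so no interlock is needed. */
bool add_ovf(int32_t a, int32_t b, int32_t *sum)
{
    return __builtin_add_overflow(a, b, sum);  /* true on overflow */
}
```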
One of the things I have not yet mentioned is the large disparity
between the CPU cycle time & the memory read time. This ratio
(mem-time/cycle-time) is large these days. This has been dealt
with by making the execution pipes correspondingly deep. So, if
I do load, load, add, store, the pipe is deep enough that the
time taken to load the operands from memory is still less than
the time the add spends in the pipe & the store can also be
accommodated (possibly) in some other pipe. Should the result
of that add be used in some other register expression, there are
dedicated bypass busses that take the output of the add (or the
output of the store) & reroute it through the pipe to the
instruction that needs it rather than wait for yet another
cycle from memory.
More dedicated hardware. More expense. But justified by the
performance gained.
The depth of this pipe can be 15 instructions or more. The last
machine I worked on was of length 25. And I have been retired
for a decade. My friends at Intel were working on 35 instruction
deep pipes at the time. And the ratio of mem-time/cycle-time has
only gotten larger.
So, now you can see why storing the result of an integer overflow
in a global overflow bit would be bad for a deep pipe machine.
Any test on such a bit would amount to a conditional branch in
the middle of this very deep pipe. This means that the thread
of execution must be brought to a halt & a new thread of
instructions loaded & executed. Given that many instructions
are in a state of 'partial' or 'speculative' execution, the
partial results of those instructions must be backed out &
flushed to make way for the new list of instructions. If (as
is likely) there are multiple branches in the pipe, there are
multiple possible threads that have to be managed.
More complexity. More expense. More design time. Less
performance.
In the days of global state bits & test & branch style code,
the typical number of instructions executed before a conditional
branch was encountered was 4. All the expense & bother of very
deep pipes are mostly wasted in this sort of test & branch
architecture.
And this cost is paid even if there are no tests or branches.
You see, at some point in the execution of even branchless code,
the pipe might have to be stopped IN ANTICIPATION of a branch in
the next instruction. Performance suffers & the effects of a
deep pipe are lost long before the end of the pipe.
So modern machines have found innovative ways of attaching the
effects of a possible exception to the instruction itself. So,
for example, one might do
ADD r5,r6,r7
when one is unconcerned with the results of an overflow and
ADDOVF r5,r6,r7,c3
where the result of the overflow is stored in a condition bit
(c3) and subsequent dependent operations are tagged with the
moral equivalent of (do this instruction if c3 is (set/clear)).
No branches. No interlocks. Some expense for the dedicated
condition registers but much less cost in the complexity of the
pipe. We now need smart compilers to mitigate the effects of
executing too many instructions that produce unneeded results.
But moving this issue to the compiler is a big win. Less global
state. Faster & less complicated hardware.
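What "do this instruction if c3 is (set/clear)" amounts to in software is a branch-free select. A toy model (function and parameter names invented for illustration, not any real ISA):

```c
#include <stdint.h>

/* Predicated execution without a branch: the condition bit c3 is
   widened into an all-ones/all-zeros mask, so the deep pipe never
   sees a conditional branch. */
int32_t select_if(int c3, int32_t if_set, int32_t if_clear)
{
    int32_t mask = -(int32_t)(c3 != 0);  /* all ones if set, zero if clear */
    return (if_set & mask) | (if_clear & ~mask);
}
```

Hardware does the equivalent with a conditional-move or predicated write-back into the register file.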
OK, that's the preamble.
But having laid that ground, the issue for dedicated interval
hardware is no different. We will likely have:
LDII addr1,iir5
LDII addr2,iir6
IIADD iir5,iir6,iir7
STII iir7,addr3
in the typical case. With decorations this implies a load
which loads both the upper & lower bounds into dedicated
registers as well as the associated decorations in THEIR OWN
dedicated registers. It implies a multi-port register file
with dedicated busses to dedicated interval CPU hardware.
As well as dedicated busses & decoration propagation circuits
to handle the result decorations. And, of course, a store
that returns the decoration along with the interval result.
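A software model of that IIADD datapath might look like the following. The decoration enum, its ordering, and all the names here are illustrative only (the actual P1788 decoration set is exactly what is still under debate), but the structural point stands: the result decoration is a cheap combining function computed alongside the bounds.

```c
/* Toy interval-with-decoration model; names & ordering are invented. */
typedef enum { DEC_INVALID = 0, DEC_UNDEFINED = 1, DEC_DEFINED = 2 } dec_t;

typedef struct { double lo, hi; dec_t dec; } interval_t;

static dec_t dec_min(dec_t a, dec_t b) { return a < b ? a : b; }

interval_t ii_add(interval_t x, interval_t y)
{
    interval_t r;
    r.lo = x.lo + y.lo;   /* hardware would round this bound down */
    r.hi = x.hi + y.hi;   /* ...and this bound up */
    /* Decoration propagation circuit: a 2-input min over a small enum. */
    r.dec = dec_min(x.dec, y.dec);
    return r;
}
```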
But how would we handle a test on a decoration? Say we are
concerned that the result interval might be undefined (D-).
If we were to test & branch on the decoration part of result
register iir7, we would be in the same boat as the folks who
used to test for integer overflow. And the same sins would
befall us.
(This is the REAL global state issue that we need to avoid.)
But, like the conditional execution model, we could replace
IIADD iir5,iir6,iir7
with
IIADDD- iir5,iir6,iir7,iic3
storing the results of an undefined condition in a condition
bit to be used by subsequent instructions to predicate the
results of their execution without the bother of a test &
branch. Or, if we consider the decoration register as if it
were already a condition register, one might predicate
subsequent operations on the state of the decoration of iir7.
So
IISUBD- (iir7),iir8,iir9,iir10
That is, subtract iir8 from iir9, placing the result in iir10,
IF the decoration state of iir7 is D-.
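In software terms, that predicated IISUBD- is: compute unconditionally, then commit the result only if the predicate holds. A self-contained toy model (all types, names, & the D- encoding are invented for illustration):

```c
/* Toy model of a predicated interval subtract. */
typedef struct { double lo, hi; int dec; } ival;

#define DEC_D_MINUS 1   /* invented encoding for the D- state */

ival ii_sub_pred(int pred_dec, ival old, ival a, ival b)
{
    /* Compute the subtraction & its propagated decoration regardless. */
    ival r = { a.lo - b.hi, a.hi - b.lo,
               a.dec < b.dec ? a.dec : b.dec };
    /* Commit r only if the predicate is D-; else keep the old register
       contents. A compiler lowers this select to a conditional move,
       so no branch reaches the pipe. */
    return (pred_dec == DEC_D_MINUS) ? r : old;
}
```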
So, even though the state is not global, there is a cost to
it. A large non-zero cost. Interval registers will be
expensive. More expensive than integer registers but
perhaps less expensive than dedicated parallel floating-point
registers. Interval hardware will also be more expensive
than integer hardware but less expensive than 4x parallel
floating-point hardware.
And, from a standards perspective, we must design our decoration
state in such a way that dedicated decoration propagation
hardware is easy to design & implement.
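One purely hypothetical illustration of "easy to implement": if each decoration state were encoded as a thermometer code (each state's bit pattern a prefix of the next), propagation across two operands would reduce to a plain bitwise AND, one gate per bit. The states & encodings below are invented, not P1788's:

```c
#include <stdint.h>

/* Hypothetical thermometer-coded decoration states: each stronger
   state sets a superset of the weaker state's bits. */
enum { D_INVALID = 0x0, D_UNDEFINED = 0x1, D_DEFINED = 0x3, D_CONTINUOUS = 0x7 };

/* Propagation = min over the chain, computed as one AND per bit. */
static inline uint8_t dec_propagate(uint8_t a, uint8_t b)
{
    return a & b;
}
```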
(This last is the issue I have been thinking about but avoiding
talking about for so long. But I promise to talk to you guys
offline in the near future & to the group as a whole after the
idea has been thoroughly shat upon. :-)
Now, let me repeat your points:
>
> I am not a hardware guy so maybe I have the wrong end of the stick.
> But didn't Nate argue for decorations, in contrast to global flags,
> precisely on the grounds that
> - they are fast because they are attached to the individual data item,
> hence
> - they don't need to be in state bits
> - they don't need interlocks or busses
> - they are NOT part of a context switch
> - they don't affect parallelism & cache coherency more than any other
> kind of data
>
I hope you can see now that attaching decorations to the interval
itself is only the first part (necessary but not sufficient) in
making them 'local' or at least not global. How the hardware
interacts with them is also of critical importance.
And, point by point,
- They can be made fast partially because they are attached to
individual data items but mostly because we can interact with
them in ways that do not disturb a deep pipe. Hence,
- They need not be GLOBAL state bits but they are part of any
interval register contents & hence must be swapped on context
switch.
- They can be constructed in such a way that the instruction
stream interacts with them in a manner that does not generate
needless conditional branches.
- They ARE part of the context switch but not necessarily part
of a GLOBAL context.
- And they can be made to operate in parallel, with all the cache
coherency issues related to that.
This last point is much like that of the Intel parallel
instructions: since the instruction assumes a stride of 1
through memory, any load/store can be assumed independent of
any other load/store within the same instruction.
Other cache coherency issues (as between multiple independent
processors) still exist if these processors operate out of the
same memory space. The Cell processor solves this problem by
giving each processor element its own memory space. Other
machines still have cache coherency hardware to contend with.
Well, that's it mostly.
I think I've gone over my $0.02 worth on this issue.
You may keep the change. :-)
Dan