
Re: [Stds-754] Those horrible global flags



Posix threads (or other user threading explicit in the program) are only part of the problem. Another part is implicit, very lightweight threads that have been generated by the compiler as an optimization. Your chip has several CPU cores on it, and the compiler partitions the application into threadlets to get them all working on your problem. Or not, if it cannot prove that the transformed program is semantically equivalent to the sequential original. Guess what modes and flags do to such a proof?

A still different problem not mentioned by Maclaren is speculation. Modern hardware will aggressively speculate down multiple lines of control before being able to resolve which line is actually the real one. Yet the global flags cannot reflect irreal speculative computation. The Itanium supplies multiple flag status registers for this - the compiler gives each speculative path its own flag registers, and has to include code to merge the true ones when it knows what really happened. Consequently at the resolve point there may be an arbitrary set of pending flags waiting to merge into the "real" state. This has interesting implications for alternative exception handling, especially as there is no temporal ordering information recorded by the hardware.

I agree with Maclaren that the only semantically sensible resolution to this mess is to let each computed value attach the set of flags that arose as a result of all computation that went into it, as if the flag bits were replicated in every number. This could be done easily by defining that any flag-raising operation generates a NaN, and encoding the flags into the NaN payloads. To avoid needless noise, each flag (especially inexact which is the real culprit) must have its own disable mode, using scoped modes just like the rounding modes. The default should be that inexact is disabled and everything else produces a NaN.
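
A rough sketch of what "flags in the NaN payload" might look like, done in software in C. Everything named here (the VF_* bits, flags_to_nan, nan_to_flags, flagged_add) is made up for illustration and is not part of IEEE 754 or any library. It assumes binary64 doubles with the usual convention that the top fraction bit marks a quiet NaN. Note that the merge has to be done explicitly: existing hardware propagates the payload of one operand rather than OR-ing the two.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits kept in the low bits of a NaN payload. */
enum {
    VF_INVALID   = 1 << 0,
    VF_DIVBYZERO = 1 << 1,
    VF_OVERFLOW  = 1 << 2,
    VF_UNDERFLOW = 1 << 3,
    VF_INEXACT   = 1 << 4,
    VF_ALL       = 0x1F
};

/* Assumption: IEEE 754 binary64, quiet NaN = all-ones exponent + top fraction bit. */
#define QNAN_BITS UINT64_C(0x7FF8000000000000)

/* Build a quiet NaN whose payload records the given flag set. */
static double flags_to_nan(unsigned flags)
{
    uint64_t bits;
    double d;
    assert((flags & ~(unsigned)VF_ALL) == 0);
    bits = QNAN_BITS | (uint64_t)flags;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Recover the flag set from a value; ordinary numbers carry no flags. */
static unsigned nan_to_flags(double d)
{
    uint64_t bits;
    if (!isnan(d))
        return 0u;
    memcpy(&bits, &d, sizeof bits);
    return (unsigned)(bits & (uint64_t)VF_ALL);
}

/* Software stand-in for an add whose result carries the union of the
   flags attached to its operands. */
static double flagged_add(double a, double b)
{
    unsigned f = nan_to_flags(a) | nan_to_flags(b);
    return f ? flags_to_nan(f) : a + b;
}
```

With this, flagged_add(flags_to_nan(VF_OVERFLOW), flags_to_nan(VF_DIVBYZERO)) yields a NaN carrying both bits, and the result does not depend on what any unrelated computation did.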

Flag reduction (to a single "did anything happen") must be explicit in the source in such a scheme.

Ivan

Nick Maclaren wrote:
Gug.  Well, I am not the world's greatest expert but, as Michel says,
I do have quite a lot of experience in this area, and have spent as
much time looking at technologies and thinking about it over the past
three decades.  Here are some notes, many of which I have not posted
in the past because they are too radical.


From:  "Craig Nelson" <scraig46@xxxxxxxxx>
|
|I'm pig ignorant about parallel processing, but following David Hough's
|separation of exceptions into trapped exceptions and untrapped exceptions
|which raise a 754 flag for possible later programmatic view, I wonder what
|the problem is.

Well, I can answer that one.

|In my ignorance, I would say that trapped 754 exceptions are no different
|from any other exceptions handled by parallel systems and don't require any
|"global" recognition.

Yes and no.  You may have missed the point that ALL exceptions in parallel
code are doubleplus ungood, and are generally mishandled.  I won't start
on POSIX's failings, as it would take me hours.

|The global examination of flags raised by untrapped exceptions in multiple
|subprocesses which requires a N->1 reduction is no different from any other
|synchronization since the program is only looking for an OR of all trap
|results (equivalent in complexity to an ANDing of "done" for typical
|synchronization).

I am afraid not.  That would be true in a clean model, but isn't
completely so in IEEE 754R and really, but REALLY, isn't in C99.  There
are two related killers here: values are not complete exception
indicators in themselves (see Java Hurts and a later section in this
posting) and false positives are allowed (nay, encouraged) in C99.
So are false negatives, undetectable failures and wrong answers,
but let's ignore them.

I will come back to the strict IEEE 754R issues later.

The issue is that you can OR/AND automatically ONLY if the flags are
never cleared or reset in the parallel code - but, to use C99 correctly,
the code has to be SOLID with such clearing and resetting.  Just like
errno.  For example, '1.0/abs(N)' may set divide-by-zero if N is 42.
The only correct code (for both errno and IEEE flags) is to
clear/preserve/reset the flags across EVERY boundary between semantics -
and C99 has 4 major classes of flag semantics: supported (most language
FP, math.h and fenv.h), need not set even on failure (remainder of
those), may set spuriously (rest of library) and you don't want to know
(complex.h).  I could post a document that I wrote for C9X discussions.



From:  Michel Hack (1-914-784-7648)                <hack@xxxxxxxxxxxxxx>
|
|This is the real problem, I think.  Compiler parallelisation often starts
|with a sequential program and involves loop scattering.  ANY test of a global
|flag would then imply serialisation and effectively break parallelisation.
|If the language supports parallelism directly, it might offer means to deal
|with this, but the current architectural MODEL of IEEE flags is that of one
|global flag.  Well, there is actually disagreement on whether that model is
|explicit or implicit, and there have been efforts to try to clarify this.

Yes, indeed.  Spot on.  Let's start with a digression on parallel models.

POSIX threads are NOT lightweight threads, but full-blown Unix processes
with shared authority, memory map and a few other resources (e.g. file
descriptors).  They are scheduled and used just like separate processes.
So the common assumption of independence is "good enough", though there
are major problems in the language arenas about what you do with flags
and exceptions when a thread is joined.  And, because they are separate
processes, they can be detached and 'join' later, asynchronously ....

OpenMP and similar models are hierarchical and much more close-coupled
and, as you say, automatic parallelisation is semantically just like
code reordering.  A chunk of code is divided into semi-independent
sections and they are run and merged in an unspecified fashion.  Now,
I and many others think that this is the only way that the parallelism
of the imminent hardware will be usable on the serial applications of
today and tomorrow.  Let's ignore scout threading, as irrelevant in
this context, and not scaling beyond 2 cores anyway.

It is the latter models that are the problem, and that I am talking about.



So what could be done?
----------------------

Notice that I say COULD because I don't believe that there is the will
to be really radical, and I don't believe that anything CAN be done
by minor tweaks.  I have written some notes on what could be done,
and will post if people are interested.

The key to solutions is to realise that the result of a computation
must be self-contained and independent of the results of independent
computations.  That is what is not true with the global flag model.
Now, effectively all existing languages and compilers use the model
that a self-contained result must be a value - exceptions are not
self-contained.
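
As an illustration of a self-contained result, the following C sketch pairs each value with its own flag set, so that partial results from independent sections can be merged in any order. It is only a sketch under stated assumptions: the type 'flagged', the F_* bits and the functions checked_add, flagged_sum and combine are hypothetical, and the flags are detected in software from the operands and result rather than read from the hardware status register.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical per-value flag bits. */
enum { F_INVALID = 1, F_OVERFLOW = 2 };

/* A result that carries its own exception record. */
typedef struct {
    double   value;
    unsigned flags;
} flagged;

/* Add two doubles and record what went wrong in this one operation. */
static flagged checked_add(double a, double b)
{
    flagged r = { a + b, 0u };
    if (isnan(r.value) && !isnan(a) && !isnan(b))
        r.flags |= F_INVALID;            /* e.g. inf + (-inf) */
    if (isinf(r.value) && isfinite(a) && isfinite(b))
        r.flags |= F_OVERFLOW;           /* finite operands, infinite sum */
    return r;
}

/* Sum one section of an array; touches no global state. */
static flagged flagged_sum(const double *x, size_t n)
{
    flagged acc = { 0.0, 0u };
    size_t i;
    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++) {
        flagged t = checked_add(acc.value, x[i]);
        acc.value = t.value;
        acc.flags |= t.flags;
    }
    return acc;
}

/* Merge two partial results; the flag part is a plain OR. */
static flagged combine(flagged a, flagged b)
{
    flagged r = checked_add(a.value, b.value);
    r.flags |= a.flags | b.flags;
    return r;
}
```

The flags of one section cannot leak into another, and the merged flag set is the same whichever way round the sections are combined.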

There is a slight niggle here - trap-diagnose-and-terminate is entirely
self-contained, but it treats the whole program as a unit.  I have no
idea what the objection is to making it a mandatory mode, as it is
trivial to implement on any system that can detect the exceptions and do
anything with them in the first place!  It may not be flexible, but is
safe and good for debugging.

Now, returning to our muttons, what needs to be done is to ensure the
concept of a value is complete - yes, Java got that bit right, even if
it got everything else wrong.  How?

Note that everything I am describing could be done either in IEEE 754R
or in the language layer, but I can assure you that 95% of vendors will
not do anything that isn't mandated by IEEE 754R as they will assume and
claim that it is forbidden, however much evidence is provided to the
contrary.  I have beaten my head against that one hundreds of times.


Aside:
------

Now, there is another issue, which is that modern CPU architectures
make a complete pig's ear of interrupts.  In particular, a great deal
of IEEE 754's edge cases are handled in software, by interrupt, and
there is only a single interrupt mechanism that handles machine checks,
IEEE 754 stuff and device interrupts.  The code necessary to sort out
that chaos is a nightmare, and is a fruitful source of really foul
bugs - not least because it is run in Ultimate Privilege, Every Check
Disabled mode.  On ancient systems, interrupts like IEEE 754 ones were
often handled in the thread context, just like subroutines.

I don't know on how many systems taking such an interrupt stops all
of the cores of a CPU, as it is invariably undocumented in the public
architecture.  What I do know is that the handler often stops all of
the other cores, or otherwise serialises the system, while it works
out what to do with the interrupt.  The two most extreme cases were
serialisation over all 100 CPUs of a large SMP and when I brought
another large SMP down by having almost all CPUs generate IEEE 754
exceptions and having only one CPU allowed to handle interrupts.

Personally, I think that IEEE 754R should take the attitude that the
hardware and operating system designers need to employ a few old fogies
to teach them how things were done better in the 1960s, and ignore
such egregious misdesign.



Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1@xxxxxxxxx
Tel.:  +44 1223 334761    Fax:  +44 1223 334679
_______________________________________________
Stds-754 mailing list
Stds-754@xxxxxxxxxxxx
http://mailman.oakapple.net/mailman/listinfo/stds-754

