Re: Motion P1788/M0008.01 SIMD challenge
Nate, Juergen, P1788 members
I wish to restate my "SIMD challenge" (which I first posed on 20
September - I copy relevant passage below and has been discussed by
Nate, Juergen and others since). Apologies for slowness.
I am concerned about several issues.
A. That whatever mechanisms P1788 chooses for exception handling,
should work well in a parallel environment.
B. That these mechanisms have a mathematical foundation that is
specified with enough precision, and simple enough, to enable one to
prove the correctness -- or incorrectness, as the case may be -- of
code that uses these mechanisms.
C. That, in particular, it be possible to prove Nate's assertion that
a decoration model and a flagspace model are equivalent at a certain
level of abstraction.
Issue A is just one of many design issues that are being discussed. I
think it is a high priority but the P1788 group may put its priority
lower than I do. Issue B is absolutely crucial -- what is the point
of a standard, for which one can only prove correctness in the case
that nothing went wrong? Issue C is not fundamental, but we shall
make wrong decisions if we do not have clarity about the relation
between these two models.
About A.
--------
Nate gave some C++ code that showed how what looks like "decorations
on each individual interval" can be implemented to access a single
flag held in a static class variable. It took me a while to get it to
compile and run -- my ignorance of C++ memory allocation rules was to
blame. But it works. Ned Nedialkov suggested a simpler scheme. So far
so good.
But it doesn't solve my SIMD challenge. Briefly stated, this is that
you have an interval extension yy=ff(xx) of function y=f(x). It could
have several input and output arguments, equivalently xx and yy can
be vectors. We want to evaluate at many xx's "simultaneously" using
vectorised elementary functions. Possibly this is to exploit a vector
architecture, but even if not, it is expected to run fast. For
instance this is the central part of the vector interval add in
BIAS, namely BiasAddVIVI in Bias1.c:
for (i = 0; i < P; i++) (pR++)->inf = (pA++)->inf + (pB++)->inf;
for (i = 0; i < P; i++) (pR++)->sup = (pA++)->sup + (pB++)->sup;
where P is the vector length. (Loop initialisations and rounding mode
changes omitted.) Even on a single core, non-vector architecture I
believe this could generate very fast code.
SIMD challenge:
Arrange things so that there is a _flag vector_ of length P, to
record exceptions in each individual yy=ff(xx) evaluation.
My rationale for wanting this is that
- A single flag gives grossly inadequate information.
- The decoration solution is a bad design: whether "something went
wrong" in the p'th "thread", p=1 to P is not held in one place but is
scattered over the components of the p'th output yy(:,p).
My take on this challenge is that it cannot be solved with the
mechanisms Nate has described so far, because there is a missing
piece of information that one MUST give the run time system, so it
knows what to do. Namely, the value of P, the vector length.
Thus -- ignoring, for now, any issues of how this is bound to a
language -- at an abstract level we need a structure of the form
(1) User program tells run time system "I am starting a SIMD
calculation of length P".
(2) User program executes SIMD code for yy=ff(xx) on vectors of
length P.
(3) User program tells run time system "End of SIMD calculation".
In response to (1), run time system must do several things including:
(1a) Allocate memory for a flag vector F of length P.
(1b) Tell the vector elementary function library (e.g. the Bias1.c
functions of BIAS) that F is where exceptions are to be recorded.
(1c) Check that all input vectors to (2), supplied by the user
program, are indeed of length P.
In response to (3), (1b) must be negated somehow. Also, there must be
some way for the user program to access the information in F after
(3) was done. For instance, F may be created on the heap; and, when
told (3), the run time system passes a pointer to F back to the user
program. There are many other possible mechanisms.
Clearly, (1) and (3) are an abstract description of a _scope_.
- Could such a scheme be implemented by a version of decorations, in
a C++ class or a Fortran module? I think I see how to do so. Nate,
what think you?
- Does such a scheme fit well with the "flagspace" language
extension? Flagspace (so far) is less well specified than
decorations, but I believe the fit is excellent.
* Flagspace is a scope.
* If one allows an (optional) argument of the flagspace to
give the value P, then the runtime system has what it
needs to perform (1a,b,c). Possible syntax
flagspace FS1 { ... body ... }// without argument
flagspace FS2[1000] { ... body ... }// with argument
// This causes a flag vector of length 1000 to be created.
* The flagspace authors propose methods to pass the flag
values back to the enclosing scope on exit. Just what is
needed! However one generally expects no connection
between flag vector lengths (if any) in one scope and in
the surrounding scope. This needs to be considered.
* Flagspace scopes can be nested. Yes, I think this is
potentially an important feature to give to my SIMD
concept. I can certainly see it being useful in code
for PDEs, multidimensional FFTs, etc.
Various semantic issues must be decided, such as: Is the vector
interval library (e.g. Bias1.c) ONLY available for use within a scope
delimited by (1) and (3)? But these are secondary.
About B. (mathematical foundation, leading to provable correctness)
--------
For my favourite example, "if certain exceptions are NOT raised after
evaluating yy=ff(xx) then a fixed point theorem can be applied", this
doesn't seem too scary. It looks straightforward to prove using
either a decoration or a flagspace model -- though the devil (or God)
will be in the details.
What about other people's favourite examples?
About C. (decoration & flagspace are essentially equivalent)
--------
Juergen and Nate seem to agree this is so, but what is the level of
abstraction at which this theorem -- and its proof -- must be stated?
- Implementing flagspace via decoration seems to rely _essentially_
on memory aliasing, which looks like a level 3 or 4 notion. I fear a
proof at this level will be language-dependent and _very_ messy.
- Implementing decoration via flagspace? At present I do not see how
this should be done.
Comments please.
Best wishes
John
-----------------------------------------------
John Pryce wrote on 20 Sep 2009
Really, I would like to handle the scenario where "global flag" and
"huge SIMD calculation" are combined. Namely, I have
- a vector function f: D \subseteq R^n -> R^n
- a box xx in R^n
and you want to check that
- f is defined and continuous on the whole of xx, and
- f maps xx into itself,
which proves that a fixed-point theorem can be applied.
With a "FLAG" approach it works very simply:
- Assume f has been coded up as a function, as this is the typical
case.
- Evaluate the interval version ff of f:
yy = ff(xx) where xx, yy are interval vectors of length n.
But I now have P = a million xx vectors at once, and I wish to
evaluate the P output vectors yy in SIMD mode, assuming that the
code for ff can indeed be vectorised in that way (which is
language- and compiler-dependent). And I want the FLAG reported
individually for each component. A picture (using indexing from 1,
not 0)
1 2 ... P
xx(1) x x x ... x x x }
... ... ... ... } inputs, an n by P array
xx(n) x x x ... x x x }
v1 x x x ... x x x }
v2 x x x ... x x x } intermediate variables in ff,
... } each is a P-vector
yy(1) x x x ... x x x }
... ... ... ... } outputs, an n by P array
yy(n) x x x ... x x x }
FLAG x x x ... x x x } a P-vector
Column p (for parallel) of this picture represents calculating (in
Matlab notation)
yy(:,p) = ff(xx(:,p)), for p=1:P.
This is done a row at a time, e.g.
v1 = xx(1,:) .* xx(3,:);
v2 = sin(v1);
...
This needs a vectorised library of interval arithmetic operations
(and elementary functions) like the level-1 functions in BIAS.
These operations set FLAG componentwise where appropriate.
I continue to believe that for some applications, this gives a
cleaner conceptual scheme than does leaving FLAG information in the
decorations of individual output components yy(i,p).
In discussion at PPAM we seem to have agreed this FLAG scheme is
efficiently implementable, with a "regional" FLAG vector.
(Contributions from Juergen W v Gudenberg, Vladik Kreinovich).
Snag: I don't at present see how, using standard Fortran or C++
constructs, one can automatically ensure FLAG is defined, and is a
vector of the same length as the numeric vectors -- if not, one
risks indexing outside array bounds. However Juergen is thinking
about a possible new C++ construct he calls "flagspace",
syntactically similar to "namespace". Juergen, could it solve this
problem?