Thread Links Date Links
Thread Prev Thread Next Thread Index Date Prev Date Next Date Index

Re: Motion P1788/M0008.01 SIMD challenge



Nate, Juergen, P1788 members

I wish to restate my "SIMD challenge" (which I first posed on 20 September - I copy relevant passage below and has been discussed by Nate, Juergen and others since). Apologies for slowness.

I am concerned about several issues.
A. That whatever mechanisms P1788 chooses for exception handling, should work well in a parallel environment. B. That these mechanisms have a mathematical foundation that is specified with enough precision, and simple enough, to enable one to prove the correctness -- or incorrectness, as the case may be -- of code that uses these mechanisms. C. That, in particular, it be possible to prove Nate's assertion that a decoration model and a flagspace model are equivalent at a certain level of abstraction.

Issue A is just one of many design issues that are being discussed. I think it is a high priority but the P1788 group may put its priority lower than I do. Issue B is absolutely crucial -- what is the point of a standard, for which one can only prove correctness in the case that nothing went wrong? Issue C is not fundamental, but we shall make wrong decisions if we do not have clarity about the relation between these two models.


About A.
--------
Nate gave some C++ code that showed how what looks like "decorations on each individual interval" can be implemented to access a single flag held in a static class variable. It took me a while to get it to compile and run -- my ignorance of C++ memory allocation rules was to blame. But it works. Ned Nedialkov suggested a simpler scheme. So far so good.

But it doesn't solve my SIMD challenge. Briefly stated, this is that you have an interval extension yy=ff(xx) of function y=f(x). It could have several input and output arguments, equivalently xx and yy can be vectors. We want to evaluate at many xx's "simultaneously" using vectorised elementary functions. Possibly this is to exploit a vector architecture, but even if not, it is expected to run fast. For instance this is the central part of the vector interval add in BIAS, namely BiasAddVIVI in Bias1.c:
  for (i = 0; i < P; i++) (pR++)->inf = (pA++)->inf + (pB++)->inf;
  for (i = 0; i < P; i++) (pR++)->sup = (pA++)->sup + (pB++)->sup;
where P is the vector length. (Loop initialisations and rounding mode changes omitted.) Even on a single core, non-vector architecture I believe this could generate very fast code.

SIMD challenge:
Arrange things so that there is a _flag vector_ of length P, to record exceptions in each individual yy=ff(xx) evaluation.

My rationale for wanting this is that
- A single flag gives grossly inadequate information.
- The decoration solution is a bad design: whether "something went wrong" in the p'th "thread", p=1 to P is not held in one place but is scattered over the components of the p'th output yy(:,p).

My take on this challenge is that it cannot be solved with the mechanisms Nate has described so far, because there is a missing piece of information that one MUST give the run time system, so it knows what to do. Namely, the value of P, the vector length.

Thus -- ignoring, for now, any issues of how this is bound to a language -- at an abstract level we need a structure of the form (1) User program tells run time system "I am starting a SIMD calculation of length P". (2) User program executes SIMD code for yy=ff(xx) on vectors of length P.
(3) User program tells run time system "End of SIMD calculation".

In response to (1), run time system must do several things including:
 (1a) Allocate memory for a flag vector F of length P.
(1b) Tell the vector elementary function library (e.g. the Bias1.c functions of BIAS) that F is where exceptions are to be recorded. (1c) Check that all input vectors to (2), supplied by the user program, are indeed of length P.

In response to (3), (1b) must be negated somehow. Also, there must be some way for the user program to access the information in F after (3) was done. For instance, F may be created on the heap; and, when told (3), the run time system passes a pointer to F back to the user program. There are many other possible mechanisms.

Clearly, (1) and (3) are an abstract description of a _scope_.

- Could such a scheme be implemented by a version of decorations, in a C++ class or a Fortran module? I think I see how to do so. Nate, what think you?

- Does such a scheme fit well with the "flagspace" language extension? Flagspace (so far) is less well specified than decorations, but I believe the fit is excellent.
  * Flagspace is a scope.
  * If one allows an (optional) argument of the flagspace to
    give the value P, then the runtime system has what it
    needs to perform (1a,b,c). Possible syntax
      flagspace FS1 { ... body ... }// without argument
      flagspace FS2[1000] { ... body ... }// with argument
      // This causes a flag vector of length 1000 to be created.
  * The flagspace authors propose methods to pass the flag
    values back to the enclosing scope on exit. Just what is
    needed! However one generally expects no connection
    between flag vector lengths (if any) in one scope and in
    the surrounding scope. This needs to be considered.
  * Flagspace scopes can be nested. Yes, I think this is
    potentially an important feature to give to my SIMD
    concept. I can certainly see it being useful in code
    for PDEs, multidimensional FFTs, etc.

Various semantic issues must be decided, such as: Is the vector interval library (e.g. Bias1.c) ONLY available for use within a scope delimited by (1) and (3)? But these are secondary.


About B. (mathematical foundation, leading to provable correctness)
--------
For my favourite example, "if certain exceptions are NOT raised after evaluating yy=ff(xx) then a fixed point theorem can be applied", this doesn't seem too scary. It looks straightforward to prove using either a decoration or a flagspace model -- though the devil (or God) will be in the details.

What about other people's favourite examples?


About C. (decoration & flagspace are essentially equivalent)
--------
Juergen and Nate seem to agree this is so, but what is the level of abstraction at which this theorem -- and its proof -- must be stated?

- Implementing flagspace via decoration seems to rely _essentially_ on memory aliasing, which looks like a level 3 or 4 notion. I fear a proof at this level will be language-dependent and _very_ messy.

- Implementing decoration via flagspace? At present I do not see how this should be done.

Comments please.

Best wishes

John
-----------------------------------------------
John Pryce wrote on 20 Sep 2009
Really, I would like to handle the scenario where "global flag" and "huge SIMD calculation" are combined. Namely, I have

-  a vector function f: D \subseteq R^n -> R^n
-  a box xx in R^n
and you want to check that
-  f is defined and continuous on the whole of xx, and
-  f maps xx into itself,
which proves that a fixed-point theorem can be applied.
With a "FLAG" approach it works very simply:
- Assume f has been coded up as a function, as this is the typical case.
- Evaluate the interval version ff of f:
     yy = ff(xx)   where xx, yy are interval vectors of length n.


But I now have P = a million xx vectors at once, and I wish to evaluate the P output vectors yy in SIMD mode, assuming that the code for ff can indeed be vectorised in that way (which is language- and compiler-dependent). And I want the FLAG reported individually for each component. A picture (using indexing from 1, not 0)

        1  2     ...        P
xx(1)   x  x  x  ...  x  x  x    }
...       ...    ...    ...      }  inputs, an n by P array
xx(n)   x  x  x  ...  x  x  x    }

v1      x  x  x  ...  x  x  x    }
v2      x  x  x  ...  x  x  x    }  intermediate variables in ff,
...                              }  each is a P-vector

yy(1)   x  x  x  ...  x  x  x    }
...       ...    ...    ...      }  outputs, an n by P array
yy(n)   x  x  x  ...  x  x  x    }

FLAG    x  x  x  ...  x  x  x    }  a P-vector

Column p (for parallel) of this picture represents calculating (in Matlab notation)
   yy(:,p) = ff(xx(:,p)),    for p=1:P.

This is done a row at a time, e.g.
  v1 = xx(1,:) .* xx(3,:);
  v2 = sin(v1);
  ...
This needs a vectorised library of interval arithmetic operations (and elementary functions) like the level-1 functions in BIAS. These operations set FLAG componentwise where appropriate.

I continue to believe that for some applications, this gives a cleaner conceptual scheme than does leaving FLAG information in the decorations of individual output components yy(i,p).

In discussion at PPAM we seem to have agreed this FLAG scheme is efficiently implementable, with a "regional" FLAG vector. (Contributions from Juergen W v Gudenberg, Vladik Kreinovich).

Snag: I don't at present see how, using standard Fortran or C++ constructs, one can automatically ensure FLAG is defined, and is a vector of the same length as the numeric vectors -- if not, one risks indexing outside array bounds. However Juergen is thinking about a possible new C++ construct he calls "flagspace", syntactically similar to "namespace". Juergen, could it solve this problem?