Thread Links Date Links
Thread Prev Thread Next Thread Index Date Prev Date Next Date Index

Re: Revised Motion 26 decoration scheme - splitting data



Nate Hayes <nh@xxxxxxxxxxxxxxxxx> wrote on 07/20/2011 05:45:06 PM:

> [image removed]

>
> Re: Revised Motion 26 decoration scheme

>
> Nate Hayes

>
> to:

>
> Ian McIntosh

>
> 07/20/2011 05:47 PM

>
> Please respond to Nate Hayes

>
> Vincent Lefevre wrote:
> > On 2011-07-20 09:42:40 -0500, Nate Hayes wrote:
> >> Vincent Lefevre wrote:
> >> >As long as you don't need a format for data interchange between two
> >> >different applications, this is not a problem. The compiler could
> >> >optimize to store data in some more optimized way than taking a
> >> >full 32-bit or 64-bit word for the decoration (e.g. by packing
> >> >decorations in a word).
> >>
> >> It doesn't matter. In fact, now the problem might be worse, since you are
> >> breaking the interval data into different pieces with different memory
> >> alignments: there is more memory locations to be accessed and some of
> >> those
> >> memory locations are now more likely not to be aligned.
> >>
> >> So memory traffic on the bus increases, the cache is more heavily taxed,
> >> and
> >> many of the moves may be unaligned (which as Ian points out are even more
> >> detrimental to performance).
> >
> > That's the goal of the compiler to ensure that data are aligned. For
> > instance, if a 64-bit alignment is OK, then 8 decorated intervals
> > (I1,D1), ..., (I8,D8) could be stored in the following way:
> >
> > I1 I2 ... I8 [D1 D2 ... D8]
> >
> > where the 8 decorations are packed in a 64-bit word (4 would be
> > sufficient for a 32-bit alignment).
>
> I understand.
>
> But think about the problem now of "putting back together", say (I1,D1) and
> (I7,D7) at the processor. You have to do the following:
>
>     -- Load I1
>     -- Load I7
>     -- Load D1
>     -- Load D7 (likely an unaligned memory access)
>
> Thats a total of 4 memory moves, and one of them likely will be unaligned.
>
>
> >
> > Of course, packing in such a way should be done with data locality
> > in mind. There may be compromises between the efficiency of packing
> > and the locality.
>
> If the data is accessed in the same sequential pattern as the packing order,
> this will be the best case. But it still requires extra memory moves for the
> decorations.
>
> If the data is accessed in any other pattern, then actual performance can be
> worse, of course.
. . .
> Nate

Alignment of floating point matters. I'm not worried about the alignment of a decoration though, since on most systems you would use single-byte loads that don't care about alignment.

Data can be stored separately and the compiler can split it apart and put it back together as needed. IBM's XL compilers split structures with hot and cold data in some cases at high optimization levels.

It's not simple, though. For something like an interval:

- To pass a split variable or array element as a value parameter or to return it as a result, you have to reassemble it.
- To pass it by reference, you have to reassemble it, store it contiguously in a temporary location, pass a reference to that, then after the function returns load the possibly modified value from the temporary, split it and store the parts in their correct locations.
- To pass an entire array, either the caller and callee have to agree on how it has been split, or you have to reassemble every element of the array into a temporary array, pass that, and after the return split every element and store each part where it belongs. The called function might also have split it apart for processing then reassembled it for the caller.
- To access array elements you have to multiply the subscript by different amounts (x16 and x1) to access the separate parts, in addition to adding different offsets. (That's an easy one.)
- If a dynamically allocated array or structure should be split, the parts are likely allocated and freed separately which costs more and is harder to keep track of.
- To create a pointer or reference to a split variable (in languages like C, C++, Java and Fortran that have pointers or references) is a problem. Which part do you point to? You could point to one part that contains data plus a compiler-invented pointer to the other part, but for intervals that's a lot of overhead. The other answer is that pointers to split parts have to be "fat" pointers, containing two addresses, and that has all kinds of implications.
- A special case of pointers or references is the "this" parameter in C++ or Java class methods, where the compiler passes the address of the object the method (function) will work on. Either the entire compiler has to know that pointers to the split class are fat or that one part points at the other, or the initial phases assume they're regular pointers and later phases try to patch up. Of course if you managed to inline all calls to all methods it would become easier but that's unlikely and the general case has to work correctly.
- If any global data will be split, all separately compiled parts of the program must agree exactly on how it is split. That may apply even if the parts are compiled by separate compilers from separate vendors (eg, you get a library from somebody compiled with the XXX compiler, and you use the YYY compiler).
- I know of no language that makes it easy for a user to declare that a variable should be stored in a split way. Unless you want to add some new kind of type attribute to the language, or add an advisory pragma, it would be entirely the compiler's responsibility to detect that this type / class is profitable to split.

All of that is doable. Two problems are that it costs compile time and it costs execution time. The bigger problem is that it costs lots of development time, and until intervals are very popular it will be hard to persuade compiler vendors that this is how they should spend their time and money when they have long lists of other needs. It's unlikely to happen soon enough for any of us to implement or use.

We should assume that decorations double the amount of memory used, and double the amount of time the load and store instructions take. The time for the actual operations may dwarf that anyway.

- Ian McIntosh IBM Canada Lab Compiler Back End Support and Development