
Re: Decorated intervals ARE intervals...



Another issue is that on most computers the performance is also determined by the data alignment.


Assume a 16 byte wide data bus, to transfer two double precision values at once, and keep it simple by only considering loads not the more complicated stores.

The following depends on the processor, but is true on many: generally the fastest transfer time will be when the address is an integral multiple of 16. You may usually get the same time when the address is only a multiple of 8 (eg, address 24), except when crossing cache line and page boundaries; otherwise you might get something at best twice as slow. On some systems you'll also get the maximum performance when the address is only a multiple of 4, but you're more likely to get a slowdown. When it's only a multiple of 2, a slowdown is almost certain on high performance systems, and when it's an odd address it's close to guaranteed.

How much is the slowdown? In some cases it's only 2x, with circuits detecting that two adjacent 16 byte transfers are needed and more circuits to extract the appropriate parts of each and stitch them together. On some CPUs that is done by microcode instead of circuits, and you may find it more like 10x slower. On some it involves flushing the pipeline and restarting it differently, so maybe 15x slower. If you're crossing a cache line boundary you may have an extra slowdown, especially if part of the data is in cache and part not. Loading something that's seriously misaligned, or crossing a page boundary, may cause a trap to the operating system kernel and software emulation of the instruction, say 1000x or 2000x slower.


An analogy may help. Picture a 16 car long subway station platform, with a track on each side. A 16 car empty train arrives on one side, and a full 16 car train on the other. The passengers have to transfer from one to the other. In the best case, all the doors open, all the passengers walk straight across, all the doors close, and the trains leave.

Now picture the same platform and trains, except that the first 12 cars worth of passengers stay put and the last 4 cars worth come out and walk up the platform to corresponding seats in the first 4 of the destination train. Then the first source train leaves while the destination train sits. After a while, another source train arrives, and the doors of the first 12 cars open. The passengers walk down the platform into the last 12 cars of the destination train. Now the doors close and the train can leave. It takes two trains, meaning twice the time, and the extra walking up and down the platform adds more time (just as shuffling bytes would take another pipeline cycle).

Once in a while (just as once in a while 16 unaligned bytes would cross a page boundary) you cross a day boundary. The subway shuts down 4 hours, with the destination train partly full, until morning. Those passengers will wait a long long time.

That's without length mismatches. It's why compilers and programmers try to get data nicely aligned.


Next picture the destination train being 17 cars long. The source train stays 16 long. (Why? Because that's the way memory hardware is designed and addressed, and that won't change unless you want to put divide-by-17 hardware in the critical path of address generation, and unless every access of every kind of data or instruction is for 17 bytes instead of 16.) Filling the 17 cars will always require passengers from 2 trains, so you're guaranteed to be a little worse than 2x as slow as aligned would have been.


If you're on a CPU whose hardware circuits can only handle misalignment of 8 or 4, and gets really slow for 2 or 1, imagine that 1/8 of the time all the passengers have to go back up to street level, line up and buy new tickets, and you'll have an idea of the performance impact. (Obviously if 17 byte performance matters the CPU designers would pay the general performance cost to ensure 2 and 1 byte alignment was handled well.)


Note that this is just the hardware slowdown. If you start storing arrays of 17 byte data, you'll also need to multiply subscripts by 17 instead of by 16 (where multiplying by 16 is just a fast shift left by 4 bits), and a general multiply takes at least 3x as long as a shift. Unless you have a totally hardware interval implementation, calculating decorations will also increase the time each operation takes.


You can mitigate some of the worst problems by storing each 17 byte value in 32 bytes, so everything is correctly aligned. You still need to move 32 bytes when a bare interval would only move 16, so you still have a 2x slowdown along with a 2x memory increase (increasing cache misses).

You could also build a system moving 32 bytes at a time, at higher cost. Or you could go for 8 bytes, or 4 bytes, reducing your slowdown by giving up performance in the first place. Whatever you do, you lose.


Yes, compilers can help by sometimes proving that the 17th byte is unneeded. I work on a compiler optimizer, and I bet that's harder for real programs than you would expect. Take code of your choice with if statements, loops, function calls and interval arrays, and try to figure out when a particular value's decoration can't possibly affect anything. If a value's decoration is used (eg, printed) then the input decorations are used to calculate it. If it isn't used, why would the programmer pay at least double to use a decorated interval instead of a bare one?

Decorated intervals are needed for many applications and for parts of others, but not for everything, and not every application should have to pay the price. Not everybody wants to wear seatbelts 24/7.

- Ian McIntosh IBM Canada Lab Compiler Back End Support and Development




From:

Nate Hayes <nh@xxxxxxxxxxxxxxxxx>

To:

Ian McIntosh/Toronto/IBM@IBMCA

Date:

07/04/2011 01:07 AM

Subject:

Re: Decorated intervals ARE intervals...





Baker,

In response to your question, here is some of my understanding and personal
experience on the issue:

Modern computer architectures usually have several levels of cache. This is
because processing the data once it arrives at the CPU is usually much, much
faster than the time it takes to read the data from memory (sometimes by as
much as an order of magnitude or more, I believe). So the performance of
most HPC software applications these days is determined almost entirely by
how efficiently data can be moved from memory into the CPU.

Memory busses (to my knowledge) usually carry a number of bits that is a
power of two. Moving a non-power of two datatype (like a 17-byte decorated
interval) over such a bus either requires a) changing the hardware bus so it
can carry more bits or b) breaking the datatype into smaller pieces that can
effectively be sent over the bus in multiple moves, which of course takes
more time.

Buffering a 17-byte datatype into a 32-byte window within the various levels
of cache means less cache will be available to store additional data. This
may cause excessive reading and writing of data to and from main memory, a
situation that is often referred to as "thrashing" the cache. Depending
on the memory-access patterns of the HPC application, this can cause the CPU
to become starved for data as the memory latencies increase and the
efficiency of the cache decreases.

The semiconductor industry is seeing a trend away from single-core
processors that seemed to double in speed every year. Instead, the trend is
towards multi-core processors and parallel computing. In our own laboratory,
we have a 48-core machine, and Intel has predicted that by 2015 desktop
computers may have more than a hundred cores.

Both the issues of memory latency and cache "thrashing" mentioned above are
the death-knell of massively parallel software. The problem is that reading
data into and out of the processor (via the various cache levels and
hardware busses) is not a parallel activity, i.e., it is mostly sequential.

The maximum theoretical speedup of a parallel software program on a
given number of processors is bounded by Amdahl's Law:

http://en.wikipedia.org/wiki/Amdahl's_law

Since Amdahl's Law is a function of the ratio of the parallel and
sequential portions of a program, even a tiny increase in the sequential
portion of a program will decrease the overall performance of the HPC
application in a nonlinear fashion.

For example, a program that is 99% parallel and only 1% sequential has a
maximum theoretical speedup potential of 100X, even if the number of
available processors is unlimited. If the sequential portion of the program
increases even by a very small amount to, say, 2% (so the program is now
98% parallel), the maximum theoretical speedup potential drops to 50X!

In other words, it takes only a very small increase of the sequential
portion of the program to quickly lose an order of magnitude (or more) in
theoretical speedup potential. So if you are speaking about 12% or 34%, this
could easily translate into several orders of magnitude of actual loss of
performance.

The good news is that interval branch-and-bound algorithms are naturally
parallel. This is one of the reasons I believe so strongly that interval
arithmetic is the future of the HPC industry.

However, keep in mind another very important fact about branch-and-bound:
decorated intervals are *useless* to these algorithms. The reason is
that as long as the computation is "safe", only the bare interval is
required. On the other hand, once an "unsafe" operation occurs in the middle
of a lengthy computation, only the bare decoration needs to propagate.
Nowhere are both the interval and decoration portions of the decorated
interval needed simultaneously.

Taking all the above into consideration, these are a few reasons why I
believe computing with bare intervals and bare decorations will be very
important for a future IEEE 1788 standard.

Sincerely,

Nate

BTW, attached is a neat article about Amdahl's Law in the Multi-Core Era:
   
http://www.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf



----- Original Message -----
From: "Ralph Baker Kearfott" <rbk@xxxxxxxxxxxx>
To: "John Pryce" <j.d.pryce@xxxxxxxxxxxx>
Cc: "stds-1788" <stds-1788@xxxxxxxxxxxxxxxxx>
Sent: Sunday, July 03, 2011 3:20 PM
Subject: Re: Decorated intervals ARE intervals...


> John, P-1788,
>
> Can someone clarify what "severe storage and communication penalties"
> means?  For example, my naive software implementation of decorations
> would be to tack on an integer (in whatever format is most efficient
> for the machine) to the two doubles.  If it is a 2-byte integer, the
> storage penalty would be 2/(16+2) < 12 percent.  Tacking on an extra
> 8 bytes (wasteful from the point of view of storage, but conceivably
> logical to do in some circumstances), the storage penalty would be
> < 34 percent.
>
> In my own work, reliability and predictability of the interval
> operations has usually (but not necessarily always)
> trumped speed of the raw arithmetic.  What is the experience of
> others?
>
> Baker
>
> On 07/02/2011 08:49 AM, John Pryce wrote:
>> Dan&  all
>>
>> On 2 Jul 2011, at 01:04, Dan Zuras Intervals wrote:
>>> I have stated this before&  the current discussion only
>>> strengthens my opinion that decorated intervals should
>>> BE intervals.  Full stop.  Certainly at level 1 but as
>>> far down as we can push it as well.
>>
>> It would be nice if the standard could take that view, but what about the
>> "17-byte problem"?
>> I.e., decorating the most popular kind of interval (a pair of doubles)
>> looks as if it incurs severe storage and communication penalties.
>>
>> That's why the decoration system includes the Hayes/Neumaier "bare object
>> arithmetic"
>> scheme, which otherwise would be pointless.
>>
>> As someone who has long experience of how computer systems evolve, will
>> you hazard
>> to predict whether the 17-byte problem will be solved by hardware advances
>> or software advances (so we needn't worry about it much); or will continue
>> to be a problem?
>>
>> John
>>
>
>
> --
>
> ---------------------------------------------------------------
> R. Baker Kearfott,    rbk@xxxxxxxxxxxxx   (337) 482-5346 (fax)
> (337) 482-5270 (work)                     (337) 993-1827 (home)
> URL: http://interval.louisiana.edu/kearfott.html
> Department of Mathematics, University of Louisiana at Lafayette
> (Room 217 Maxim D. Doucet Hall, 1403 Johnston Street)
> Box 4-1010, Lafayette, LA 70504-1010, USA
> ---------------------------------------------------------------
>