Re: (long) 'severe' problems, was Re: Decorated intervals ARE intervals...
> Date: 04 Jul 2011 17:14:27 +0100
> From: "N.M. Maclaren" <nmm1@xxxxxxxxx>
> To: stds-1788@xxxxxxxxxxxxxxxxx
> Subject: Re: (long) 'severe' problems, was Re: Decorated intervals ARE intervals...
>
> On Jul 4 2011, Dan Zuras Intervals wrote:
>
> I shall merely pick up on a couple of points that I think are less
> clear than you imply.
It's a fair cop. I am often less clear than I imply. :-)
Especially when I wax verbose.
>
> > Machines today get the performance they do by moving data
> > around quickly. From main memory to cache, from cache to
> > registers, from registers to ALUs. Each machine architecture
> > approaches the problem slightly differently but, as my old
> > friend Willy used to say, when you want to increase your
> > bandwidth you have to either increase your band or increase
> > your width. There are no other choices & you cannot 'clever'
> > your way around the problem.
>
> Yes, but .... Some applications are bandwidth-limited, but others
> are latency-limited, and that's MUCH harder. Increasing the width
> may increase latency, and increasing the band may hamper scalability,
> or even prevent it when the speed of light starts to bite.
I completely agree.
We hit a frequency wall around 2001 & have actually had to
back off since then for power reasons.
We still have 2 or 3 orders of magnitude to go in both speed
& size before Einstein & Bohr get in our way. But we have at
least 15 orders of magnitude to go in power before Boltzmann
cuts us off. So we will get there in all of them eventually.
Signals move at around 1/3 the speed of light when all is
said & done. (Oddly enough, on the scale of nanometers &
picoseconds, HEAT can move faster than coherent electrical
signals. But I've never been able to figure out how to use
that.) Since increases in frequency are directly related
to decreases in size, the speed of light won't hit us any
time soon.
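A quick back-of-the-envelope check on that (my numbers, assuming
the 1/3 c figure above & a nominal 3 GHz clock, neither of which
is meant to be exact):

    v \approx c/3 \approx 1\times 10^{8}\ \mathrm{m/s}, \qquad
    t_{\mathrm{cycle}} \approx 1/(3\ \mathrm{GHz}) \approx 0.33\ \mathrm{ns}, \qquad
    d = v\,t_{\mathrm{cycle}} \approx 33\ \mathrm{mm}

A signal can cross something on the order of a large die in a
single cycle, & since the die shrinks along with the cycle time,
that ratio does not get worse as we scale.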
And while it is true that increasing the width might increase
the latency, again you probably won't do it unless it comes
with a decreased scale.
We are not that far from the smoking hairy ball that was
predicted in the 60s & 70s. Smoking because it's hot. Hairy
because of all the wires leading to it. And a ball for speed
of light reasons. Smoking & hairy are already here. But 3D
chips have always been a problem.
>
> > Typical machines today are somewhere in between &, sad to
> > say, closer to the former than the latter. But this is
> > mostly for reasons of cost & as the cost comes down things
> > get more parallel. I think we are about to enter an era
> > of very wide & very parallel machines with truly obscene
> > performance.
>
> I don't, for the reasons I hint at above. I have been proclaiming
> heresy in this area for several decades, and pointing out that the
> low-hanging fruit for reducing latency-dependency (vectorisable codes)
> was picked in the 1970s and there has been limited success since then.
> Indeed, there are some problems that are provably latency-dependent.
This is also true.
In the short term, until we solve the power problem, the
path to greater performance is parallelism. And that path
is demonstrably less useful than raw speed.
Still, take primality proving as an example of a largely
latency-dependent problem. While some small parallelism is
possible in the fundamental arithmetic, the best known methods
for proving the primality of a number all involve a sequence
of operations each of which depends on the previous
result. So long as that is true, answering the question
IsPrime(n) for large n will always be latency-limited.
However, like most latency-driven problems, one wants to
solve many of them. So handing out the task of computing
IsPrime(n) for many n will always benefit from parallelism.
Indeed, the benefit will even be O(number-of-processors),
unlike those parallel problems which must communicate among
the processors.
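A minimal sketch of that split (illustrative only: trial division
stands in for a real prover such as ECPP, the candidate range is
arbitrary, & the OpenMP pragma is just one convenient way to hand
the candidates out):

    /* Compile with e.g.  cc -O2 -fopenmp isprime.c */
    #include <stdio.h>
    #include <stdint.h>

    /* The inner task: a strictly serial chain of dependent steps,
       so each individual call is latency-limited no matter how
       many processors we have. */
    static int is_prime(uint64_t n)
    {
        if (n < 2) return 0;
        if (n % 2 == 0) return n == 2;
        for (uint64_t d = 3; d * d <= n; d += 2)
            if (n % d == 0) return 0;
        return 1;
    }

    int main(void)
    {
        const uint64_t base = 1000000000ULL; /* start of candidate range */
        const long count = 1000000;          /* candidates to test       */
        long primes = 0;

        /* The outer loop has no cross-iteration dependencies & no
           communication, so the speedup is essentially
           O(number-of-processors). */
        #pragma omp parallel for reduction(+:primes) schedule(dynamic, 1024)
        for (long i = 0; i < count; i++)
            primes += is_prime(base + (uint64_t)i);

        printf("%ld primes in [%llu, %llu)\n", primes,
               (unsigned long long)base,
               (unsigned long long)(base + (uint64_t)count));
        return 0;
    }

Each call to is_prime() is still a dependent chain, so no amount
of hardware shortens it; the win comes entirely from the
independence of the outer loop.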
>
> That is, of course, closely related to the frequent communication and
> synchronisation requirements of many parallel codes, which is the main
> reason that they don't give better performance even though they are
> highly parallel. And reducing those requirements is HARD.
Yes, this is the main deterrent to fine-structure parallelism
as a path to performance.
As the RISC era gave way to deep pipe machines & even wide
instruction word machines, it was discovered that the natural
parallelism available in current computer languages like C &
Fortran is limited to factors like 3 or 4 & often stuck at
1 + epsilon. It is a corollary to Amdahl's law & directly
related to Nick's observations.
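For the record, the corollary falls straight out of Amdahl's law:
if a fraction p of the work can go in parallel across N processors,

    S(N) = \frac{1}{(1-p) + p/N}, \qquad
    \lim_{N \to \infty} S(N) = \frac{1}{1-p}

so a ceiling of 3 or 4 corresponds to p somewhere around 0.67 to
0.75, & the 1 + epsilon cases are the ones where almost nothing in
the instruction stream is independent.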
I just bought a 4-processor laptop & there are 6- & 8-CPU
chips out there. I have a friend who is working on a
10-processor chip & I know that 90-processor chips have been
made as experiments.
So we are going to have to find ways of sucking up all that
performance. As not much can be gained by fine-structure
parallelism, something else must be tried. New programming
languages along the lines of the old APL might get people
thinking about parallelism when they write their programs,
but it will not do us much good unless we also leave the
x86 architecture behind.
Then there is the classic debugging problem of not knowing
enough linear algebra to know why you shot yourself in the
foot. :-)
Dan
>
> Of course, increasing the transfer requirements by a factor of two
> makes a negligible difference to the latency, which is a wrinkle that
> many people miss!
>
> > OK, about that word 'severe'. In any case the worst hit,
> > either in space or time, is bounded on the upside by a
> > factor of 2. And our user has already agreed to a loss
> > of at least a factor of two because she wants assured
> > computing over speed or capacity. Choosing from among
> > the 3rd & 4th alternatives avoids that second hit & is
> > about as good as it gets.
>
> One can argue about the factor, but I agree that it's not a major
> issue. There are many scientific codes where people already accept
> larger performance hits by using C++ instead of Fortran. When one
> is talking about future parallelism, one is talking about 16-way
> and up - and 256-way and up in the performance context. A competent
> interval implementation is no more a hit than a competent complex
> (versus real) one - a fixed, small factor not above 4 in the worst
> case (for CPU) and 2 (in memory). Where latency is the limit, the
> factor may be only a little above 1. Not a big deal.
>
>
> Regards,
> Nick Maclaren.