
Re: (long) 'severe' problems, was Re: Decorated intervals ARE intervals...



> Date: 04 Jul 2011 17:14:27 +0100
> From: "N.M. Maclaren" <nmm1@xxxxxxxxx>
> To: stds-1788@xxxxxxxxxxxxxxxxx
> Subject: Re: (long) 'severe' problems, was Re: Decorated intervals ARE intervals...
> 
> On Jul 4 2011, Dan Zuras Intervals wrote:
> 
> I shall merely pick up on a couple of points that I think are less
> clear than you imply.

	It's a fair cop.  I am often less clear than I imply. :-)
	Especially when I wax verbose.

> 
> >	Machines today get the performance they do by moving data
> >	around quickly.  From main memory to cache, from cache to
> >	registers, from registers to ALUs.  Each machine architecture
> >	approaches the problem slightly differently but, as my old
> >	friend Willy used to say, when you want to increase your
> >	bandwidth you have to either increase your band or increase
> >	your width.  There are no other choices & you cannot 'clever'
> >	your way around the problem.
> 
> Yes, but ....  Some applications are bandwidth-limited, but others
> are latency-limited, and that's MUCH harder.  Increasing the width
> may increase latency, and increasing the band may hamper scalability,
> or even prevent it when the speed of light starts to bite.

	I completely agree.

	We hit a frequency wall around 2001 & have actually had to
	back off since then for power reasons.

	We still have 2 or 3 orders of magnitude to go in both speed
	& size before Einstein & Bohr get in our way.  But we have at
	least 15 orders of magnitude to go in power before Boltzmann
	cuts us off.  So we will get there in all of them eventually.

	Signals move at around 1/3 the speed of light when all is
	said & done.  (Oddly enough, on the scale of nanometers &
	picoseconds, HEAT can move faster than coherent electrical
	signals.  But I've never been able to figure out how to use
	that.)  Since increases in frequency are directly related
	to decreases in size, the speed of light won't hit us any
	time soon.

	And while it is true that increasing the width might increase
	the latency, again you probably won't do it unless it comes
	with a decreased scale.

	We are not that far from the smoking hairy ball that was
	predicted in the 60s & 70s.  Smoking because it's hot.  Hairy
	because of all the wires leading to it.  And a ball for speed
	of light reasons.  Smoking & hairy are already here.  But 3D
	chips have always been a problem.

> 
> >	Typical machines today are somewhere in between &, sad to
> >	say, closer to the former than the latter.  But this is
> >	mostly for reasons of cost & as the cost comes down things
> >	get more parallel.  I think we are about to enter an era
> >	of very wide & very parallel machines with truly obscene
> >	performance.
> 
> I don't, for the reasons I hint at above.  I have been proclaiming
> heresy in this area for several decades, and pointing out that the
> low-hanging fruit for reducing latency-dependency (vectorisable codes)
> was picked in the 1970s and there has been limited success since then.
> Indeed, there are some problems that are provably latency-dependent.

	This is also true.

	In the short term, until we solve the power problem, the
	path to greater performance is parallelism.  And that path
	is demonstrably less useful than raw speed.

	Still, take primality proving as an example of a largely
	latency dependent problem.  While some small parallelism is
	possible in the fundamental arithmetic, the best known methods
	for proving the primality of a number all involve a sequence
	of operations each of which is dependent on the previous
	result.  So long as that is true, answering the question
	IsPrime(n) for large n will always be latency limited.

	However, like most latency driven problems, one wants to
	solve many of them.  So handing out the task of computing
	IsPrime(n) for many n will always benefit from parallelism.

	Indeed, the benefit will even be O(number-of-processors),
	unlike those parallel problems which must communicate among
	the processors.
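
	To make that concrete, here is a minimal sketch of my own
	in C with OpenMP.  Trial division stands in for a real
	primality proof (ECPP & friends are far more involved) but
	it has the same shape: each call is a chain of dependent
	steps & hence latency bound, while the calls for different
	n never talk to each other.

	/* Compile with:  cc -O2 -fopenmp isprime_par.c */
	#include <stdio.h>
	#include <stdbool.h>

	/* Hypothetical stand-in for a real primality proof.  The
	   point is only its shape: a strictly sequential chain of
	   steps, each depending on the previous one, so a single
	   call is latency bound. */
	static bool is_prime(long n)
	{
	    if (n < 2) return false;
	    for (long d = 2; d * d <= n; d++)
	        if (n % d == 0) return false;
	    return true;
	}

	int main(void)
	{
	    enum { COUNT = 1000000 };
	    static bool prime[COUNT];

	    /* The calls for different n never communicate, so handing
	       one n to each processor scales as
	       O(number-of-processors). */
	    #pragma omp parallel for schedule(dynamic, 1024)
	    for (long n = 0; n < COUNT; n++)
	        prime[n] = is_prime(n);

	    long total = 0;
	    for (long n = 0; n < COUNT; n++)
	        total += prime[n];
	    printf("%ld primes below %d\n", total, COUNT);
	    return 0;
	}

	The dynamic schedule is only there to keep the processors
	busy when some n take much longer than others; the speedup
	stays near O(number-of-processors) because nothing is
	shared between iterations.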

> 
> That is, of course, closely related to the frequent communication and
> synchronisation requirements of many parallel codes, which is the main
> reason that they don't give better performance even though they are
> highly parallel.  And reducing those requirements is HARD.

	Yes, this is the main deterrent to fine structure parallelism
	as a path to performance.

	As the RISC era gave way to deep pipe machines & even wide
	instruction word machines, it was discovered that the natural
	parallelism available in current computer languages like C &
	Fortran is limited to factors of 3 or 4 & often stuck at
	1 + epsilon.  It is a corollary to Amdahl's law & directly
	related to Nick's observations.
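
	For reference, & only as a back-of-the-envelope sketch of
	my own, Amdahl's law puts the ceiling at
	1 / ((1 - p) + p / N) for a parallel fraction p spread over
	N processors.  With the small p that ordinary C or Fortran
	exposes, the ceiling sits barely above 1 no matter how big
	N gets:

	#include <stdio.h>

	/* Amdahl's law: speedup = 1 / ((1 - p) + p / N) for a
	   parallel fraction p run on N processors. */
	static double amdahl(double p, int n)
	{
	    return 1.0 / ((1.0 - p) + p / n);
	}

	int main(void)
	{
	    printf("p = 0.75, N = 4:    %.2f\n", amdahl(0.75, 4));    /* about 2.3 */
	    printf("p = 0.10, N = 1000: %.2f\n", amdahl(0.10, 1000)); /* about 1.1 */
	    return 0;
	}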

	I just bought a 4-processor laptop & there are 6- & 8-CPU
	chips out there.  I have a friend who is working on a
	10-processor chip & I know that 90-processor chips have
	been made as experiments.

	So we are going to have to find ways of sucking up all that
	performance.  As not much can be gained by fine structure
	parallelism, something else must be tried.  New programming
	languages along the lines of the old APL might get people
	thinking about parallelism when they write their programs
	but it will not do us much good unless we also leave the
	x86 architecture behind.

	Then there is the classic debugging problem of not knowing
	enough linear algebra to know why you shot yourself in the
	foot. :-)


				Dan

> 
> Of course, increasing the transfer requirements by a factor of two
> makes a negligible difference to the latency, which is a wrinkle that
> many people miss!
> 
> >	OK, about that word 'severe'.  In any case the worst hit,
> >	either in space or time, is bounded on the upside by a
> >	factor of 2.  And our user has already agreed to a loss
> >	of at least a factor of two because she wants assured
> >	computing over speed or capacity.  Choosing from among
> >	the 3rd & 4th alternatives avoids that second hit & is
> >	about as good as it gets.
> 
> One can argue about the factor, but I agree that it's not a major
> issue.  There are many scientific codes where people already accept
> larger performance hits by using C++ instead of Fortran.  When one
> is talking about future parallelism, one is talking about 16-way
> and up - and 256-way and up in the performance context.  A competent
> interval implementation is no more a hit than a competent complex
> (versus real) one - a fixed, small factor not above 4 in the worst
> case (for CPU) and 2 (in memory).  Where latency is the limit, the
> factor may be only a little above 1.  Not a big deal.
> 
> 
> Regards,
> Nick Maclaren.