
fwd from Jim Demmel: More on repeatability

To: "N.M. Maclaren" <nmm1@xxxxxxxxx>, Hong Diep Nguyen <hdnguyen@xxxxxxxxxxxxxxxxx>, James Demmel <demmel@xxxxxxxxxxxxxxx>
Subject: fwd from Jim Demmel: More on repeatability
From: Dan Zuras Intervals <intervals08@xxxxxxxxxxxxxx>
Date: Thu, 04 Aug 2011 14:51:31 -0700
Cc: stds-1788@xxxxxxxxxxxxxxxxx, Dan Zuras Intervals <intervals08@xxxxxxxxxxxxxx>

	Jim asked me to forward this to you all.  - Dan

> Date: Thu, 04 Aug 2011 12:57:07 -0700
> From: James Demmel <demmel@xxxxxxxxxxxxxxxxx>
> To: Dan Zuras Intervals <intervals08@xxxxxxxxxxxxxx>
> CC: "N.M. Maclaren" <nmm1@xxxxxxxxx>, stds-1788@xxxxxxxxxxxxxxxxx, 
>  Hong Diep Nguyen <hdnguyen@xxxxxxxxxxxxxxxxx>,
>  James Demmel <demmel@xxxxxxxxxxxxxxx>
> Subject: More on repeatability
> 
> Just to supply a little more background on the need for being able
> to get the same answer when you run your program more than once:
> 
> Many of you may recall a post to NA-Digest a couple of years ago,
> in which a commercial FEM software developer was asking whether
> anyone knew of a "repeatable" parallel sparse linear system solver;
> here "repeatable" means getting the same answer when you type
> a.out twice on the same machine, not the harder problem of the
> same answer on different machines. His motivation was a number
> of his customers (civil engineers) who had contractual obligations to
> their customers to get repeatable answers, i.e. "the bridge is safe"
> will not change to "the bridge is not safe" if you run the code again.
> 
> This motivated me to send email to the ~110 faculty in our
> graduate program in Computational Science and Engineering
> here at Berkeley, to ask how important repeatability was to them,
> given that nondeterministic scheduling, nonassociative floating-point
> arithmetic, etc. made it likely not to hold. The most common response was:
> (1)   "What, repeatability is going away? How will I debug?"
> followed by
> (2)   "I know better than to expect repeatability; I do error analysis."
> The two most interesting responses were the following:
> One colleague, a civil engineer who studies crack propagation,
> initially responded (2), and then said "Wait! In my simulations,
> when a certain event occurs, I go back to the initial conditions
> and resimulate the crack and collect extra information. That
> won't work anymore, especially since crack propagation is so
> forward unstable!"
> Another colleague, who has United Nations funding to analyze
> data to detect secret underground nuclear testing, said it would be
> impolitic to have his code change its mind about whether a blast
> occurred or not.
> 
> This led to further conversations with our industrial collaborators
> at Intel, on the MKL library team, and at MathWorks, about repeatability.
> The MathWorks folks said their customers certainly expected
> repeatability. The MKL team said that a future release would only
> guarantee repeatability under certain conditions: the user guarantees
> that the same number of threads is used, and that data is aligned
> identically, from call to call.
> 
> In the meantime we have hired a postdoc, Diep Nguyen (cc-ed), to work
> on this problem, basically asking how much performance you have to
> sacrifice to guarantee a repeatable answer, initially by guaranteeing
> that the same reduction trees are always used, independent of
> the number of threads and layout. Initial experiments with long
> dot products show it costs about 20% more than MKL's parallel ddot
> to guarantee reproducibility (more details available on request).
> 
> The point of all this is to say that repeatability on the same machine
> (let alone on different machines) is widely expected and desired but
> hard to attain, and its loss is likely to be an unpleasant surprise to
> many users if and when they notice it. This is true with or without intervals.
> Of course interval bounds that are reliably narrow, if not repeatable,
> will mitigate the problem.
> 
> Jim
>
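
To make the reduction-tree point in Jim's note concrete, here is a minimal
Python sketch. It is an illustration only, not the method used in the
Berkeley experiments or in MKL. The names chunked_sum, fixed_tree_sum and
fixed_tree_dot are hypothetical, and the "threads" are simulated by
splitting the input into contiguous chunks. The first function mimics a
parallel sum whose grouping follows the chunk count, so its rounding can
change from run to run; the second fixes the tree shape from the input
length alone.

```python
def chunked_sum(xs, nchunks):
    """Sum xs as nchunks contiguous partial sums, then add the partials.

    Simulates a parallel reduction whose grouping depends on the number
    of workers (an assumption made for illustration, no real threads).
    """
    n = len(xs)
    bounds = [n * i // nchunks for i in range(nchunks + 1)]
    total = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        partial = 0.0
        for x in xs[lo:hi]:
            partial += x
        total += partial
    return total


def fixed_tree_sum(xs):
    """Pairwise sum whose tree shape depends only on len(xs)."""
    n = len(xs)
    if n == 0:
        return 0.0
    if n == 1:
        return xs[0]
    mid = n // 2
    return fixed_tree_sum(xs[:mid]) + fixed_tree_sum(xs[mid:])


def fixed_tree_dot(x, y):
    """Dot product reduced with the fixed tree above."""
    return fixed_tree_sum([a * b for a, b in zip(x, y)])


if __name__ == "__main__":
    # Floating-point addition is not associative, so grouping matters.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

    xs = [1e16, 1.0, -1e16, 1.0]
    # Same data, different simulated worker counts.
    print(chunked_sum(xs, 1), chunked_sum(xs, 2))
    # The fixed tree ignores the worker count entirely.
    print(fixed_tree_sum(xs))
```

Note that the fixed tree makes the result repeatable, not more accurate:
a real implementation would still hand whole subtrees to different
threads, and the extra cost Jim mentions comes from keeping that shape
independent of thread count and data layout.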
