fwd from Jim Demmel: More on repeatability
Jim asked me to forward this to you all. - Dan
> Date: Thu, 04 Aug 2011 12:57:07 -0700
> From: James Demmel <demmel@xxxxxxxxxxxxxxxxx>
> To: Dan Zuras Intervals <intervals08@xxxxxxxxxxxxxx>
> CC: "N.M. Maclaren" <nmm1@xxxxxxxxx>, stds-1788@xxxxxxxxxxxxxxxxx,
> Hong Diep Nguyen <hdnguyen@xxxxxxxxxxxxxxxxx>,
> James Demmel <demmel@xxxxxxxxxxxxxxx>
> Subject: More on repeatability
>
> Just to supply a little more background on the need for being able
> to get the same answer when you run your program more than once:
>
> Many of you may recall a post to NA-Digest a couple of years ago,
> in which a commercial FEM software developer was asking whether
> anyone knew of a "repeatable" parallel sparse linear system solver;
> here "repeatable" means getting the same answer when you type
> a.out twice on the same machine, not the harder problem of the
> same answer on different machines. His motivation was a number
> of his customers (civil engineers) who had contractual obligations to
> their customers to get repeatable answers, i.e. "the bridge is safe"
> will not change to "the bridge is not safe" if you run the code again.
>
> This motivated me to send email to the ~110 faculty in our
> graduate program in Computational Science and Engineering
> here at Berkeley, to ask how important repeatability was to them,
> given that nondeterministic scheduling, nonassociative floating-point
> arithmetic, etc. make it unlikely to hold. The most common response was:
> (1) "What, repeatability is going away? How will I debug?"
> followed by
> (2) "I know better than to expect repeatability; I do error analysis."
> The two most interesting responses were the following:
> One colleague, a civil engineer who studies crack propagation,
> initially responded (2), and then said "Wait! In my simulations,
> when a certain event occurs, I go back to the initial conditions
> and resimulate the crack and collect extra information. That
> won't work anymore, especially since crack propagation is so
> forward unstable!"
> Another colleague, who has United Nations funding to analyze
> data to detect secret underground nuclear testing, said it would be
> impolitic to have his code change its mind about whether a blast
> occurred or not.
>
> This led to further conversations with our industrial collaborators
> at Intel, on the MKL library team, and at MathWorks, about repeatability.
> The MathWorks folks said their customers certainly expected
> repeatability. The MKL team said that a future release would only
> guarantee repeatability under certain conditions: the user guarantees
> that the same number of threads is used, and that data is aligned
> identically, from call to call.
>
> In the meantime we have hired a postdoc, Diep Nguyen (cc-ed), to work
> on this problem, basically asking how much performance you have to
> sacrifice to guarantee a repeatable answer, initially by guaranteeing
> that the same reduction trees are always used, independent of
> the number of threads and layout. Initial experiments with long
> dot products show it costs about 20% more than MKL's parallel ddot
> to guarantee reproducibility (more details available on request).
>
> The point of all this is to say that repeatability on the same machine
> (let alone on different machines) is widely expected and desired but hard
> to attain, and that its absence is likely to be an unpleasant surprise to
> many users if and when they notice it. This is true with or without intervals.
> Of course interval bounds that are reliably narrow, if not repeatable,
> will mitigate the problem.
>
> Jim
>
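
To make the reduction-tree point in Jim's message concrete, here is a minimal
Python sketch. It is an illustration only, not the Berkeley or MKL code: the
function names, the simulated "workers", and the block size are all assumptions
made for this example. The first function sums with a tree whose shape depends
on the worker count, so the answer can change from run to run if the thread
count changes. The second fixes the tree from the data length alone, so every
worker count performs exactly the same additions.

```python
def chunked_sum(xs, num_workers):
    """Simulated parallel sum: the reduction tree depends on num_workers."""
    n = len(xs)
    bounds = [n * w // num_workers for w in range(num_workers + 1)]
    total = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        partial = 0.0
        for x in xs[lo:hi]:          # each worker sums its own contiguous chunk
            partial += x
        total += partial             # partial sums are combined in worker order
    return total


def fixed_tree_sum(xs, num_workers, block=2):
    """Simulated parallel sum over a tree fixed by len(xs) and block only.

    block is a hypothetical tuning parameter; it must stay the same
    from run to run for the result to be repeatable.
    """
    num_blocks = (len(xs) + block - 1) // block
    block_sums = [0.0] * num_blocks
    # Worker w takes blocks w, w + num_workers, ... The assignment changes
    # who computes a block, never how the block is computed.
    for w in range(num_workers):
        for b in range(w, num_blocks, num_workers):
            s = 0.0
            for x in xs[b * block:(b + 1) * block]:
                s += x
            block_sums[b] = s
    # Combine block sums pairwise; the pairing depends only on num_blocks.
    level = block_sums
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])    # odd element is carried up unchanged
        level = nxt
    return level[0] if level else 0.0


xs = [1e16, 1.0, -1e16, 1.0]         # exact sum is 2.0
print(chunked_sum(xs, 1), chunked_sum(xs, 2))        # 1.0 0.0
print(fixed_tree_sum(xs, 1), fixed_tree_sum(xs, 2))  # 0.0 0.0
```

Note that the fixed tree returns 0.0 for every worker count even though the
exact sum is 2.0: it buys repeatability, not accuracy, which is a separate
question. A dot product can be handled the same way by summing the elementwise
products.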