
Re: discussion period begins, until Jan. 26: "natural interval extension": friendly amendment to M001.02



The memory model here (which is in the process of being written up
in more detail, and is also described in the references below) deals
with general memory hierarchies as follows: Suppose there
are multiple layers, such as L1 cache, L2, L3, DRAM and disk.
Pick any number of top layers, say L1+L2+L3, and call that
"fast memory". The remaining layers, DRAM+disk in this
case, is "slow memory". Let M be the total size of the fast memory.
The the communication lower bounds are for the number of
words moved between fast and slow memories, L3 and DRAM
in this case, if data only moves between adjacent levels.
The same theory can describe how much data has to move
from one processor ("fast memory") to another processor
("slow memory") over a network, in a parallel implementation.

Now suppose we are doing matmul: C=A*B, and want to use
a long accumulator (of length R words) to compute each
entry of C to high accuracy. Then each entry of C must be
represented by a long accumulator, until it is completely
computed and rounded back to 1 word. So except for the
final n^2 writes of C back to slow memory (which is a lower
order term compared to the lower bound), each entry of
C is represented by an R word quantity, whether it happens
to live in a long register, L1 cache, L2 cache, etc. None of
the information in these R words can be thrown away
until the final rounding. Thus we are also assuming that
these R words are not somehow losslessly compressed
before they are moved to slow memory.
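
A minimal sketch of the storage cost (illustration only; the value
R = 64 and the 64-bit word size are assumptions, and no particular
accumulator layout is implied):

    #include <stdint.h>
    #include <stdio.h>

    /* Accumulator length in words; R = 64 is only an assumed example
       value (the exact dot product discussed below needs R >= 64). */
    enum { R = 64 };

    /* One long accumulator per entry of C.  None of its R words may be
       discarded (or losslessly compressed) until the entry is completely
       computed and rounded back to a single word. */
    typedef struct { uint64_t w[R]; } long_acc;

    int main(void) {
        printf("one entry of C: %zu bytes as a long accumulator, %zu as a double\n",
               sizeof(long_acc), sizeof(double));
        return 0;
    }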

Think of the ubiquitous "blocking" optimization of matmul:
each matrix is broken into square b-by-b subblocks,
where 3 such blocks fit in fast memory. Once they are
read into fast memory (#words moved = 3*b^2),
the blocks of A and B are multiplied and added to the
block of C, so b^3 multiply-adds can be performed.
If we wanted to compute C to high accuracy,
all b^2 entries of this block of C would have to be
represented by R words. It turns out that to attain the
lower bound below, the blocks can no longer be square
but must be rectangular, with shapes that depend on R. Once
my colleagues and I go over the details tomorrow,
I'll distribute them.
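
For reference, the square-blocked loop described above looks roughly
like this (a sketch of the conventional 1-word-per-entry case only;
the rectangular, R-dependent block shapes are not shown, since those
are the details still to be worked out):

    /* Conventional blocked matmul C += A*B on n-by-n row-major matrices.
       The block size b is chosen so that three b-by-b blocks of 1-word
       entries fit in fast memory (3*b*b <= M).  Each block triple moves
       about 3*b^2 words and performs b^3 multiply-adds. */
    void blocked_matmul(int n, int b, const double *A, const double *B, double *C)
    {
        for (int i0 = 0; i0 < n; i0 += b)
            for (int j0 = 0; j0 < n; j0 += b)
                for (int k0 = 0; k0 < n; k0 += b)
                    /* one b-by-b block each of A, B and C in fast memory */
                    for (int i = i0; i < i0 + b && i < n; i++)
                        for (int j = j0; j < j0 + b && j < n; j++)
                            for (int k = k0; k < k0 + b && k < n; k++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

With R-word entries, the block of C alone would occupy R*b^2 words of
fast memory instead of b^2, which is why the block shapes have to change.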

Jim


On 1/24/16 5:49 PM, Van Snyder wrote:
Does this result assume only one long accumulator?

What are the results for more than one long accumulator?

With hundreds of millions of transistors on one chip, having more than
one long accumulator doesn't seem like much of a difficulty.

On Sun, 2016-01-24 at 16:50 -0800, James Demmel wrote:
Regarding the memory cost of using a long accumulator,
there is a simple extension of previous communication lower
bounds for linear algebra, and other algorithms that
"resemble 3-nested loops" (in a mathematically well-defined way),
with which one can prove the following:

Recall that in the conventional case, where all matrix entries
and "temporary accumulators" occupy 1 word of memory,
a lower bound on the amount of memory traffic between
"fast memory" (say cache) and "slow memory" (say DRAM) is
     Omega(#flops / sqrt(M) )  words
where M is the fast memory size. And this lower bound can
be attained for many algorithms by well-known tiling or
divide-and-conquer techniques (for a survey see our
2014 Acta Numerica paper by Ballard/Carson/Demmel/Hoemmen/
Knight/Schwartz).
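
As a quick sanity check on how blocking attains this bound for matmul
(my own rough count; constants and lower-order terms are ignored):

    #include <math.h>

    /* Rough word-traffic estimate for square-blocked n-by-n matmul:
       (n/b)^3 block triples, each moving about 3*b^2 words, with the
       largest b such that 3*b^2 <= M.  The result is ~ 3*n^3/b, i.e.
       O(n^3/sqrt(M)) = O(#flops/sqrt(M)), matching the lower bound. */
    double blocked_traffic_words(double n, double M)
    {
        double b = sqrt(M / 3.0);                  /* block size: 3*b^2 <= M */
        double block_triples = (n / b) * (n / b) * (n / b);
        return block_triples * 3.0 * b * b;        /* ~ 3*n^3 / b words      */
    }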

Now suppose that we want all the dot products accumulated
by these algorithms to be accumulated in a long accumulator
of length R words. For simplicity consider the case of matrix
multiply C=A*B:  we need a different accumulator for each
entry of a tile of C, in order to reuse the entries of the tiles
of A and B in fast memory. So the lower bound analysis may be
modified to prove that the lower bound increases to
    Omega(sqrt(R)*#flops/sqrt(M))
i.e. the memory traffic increases by a factor of sqrt(R). And
this may be attained by a different choice of tile sizes.

So for an Exact Dot Product, where R >= 64, that means
at least an 8-fold (sqrt(R) >= 8) increase in memory traffic.
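
Put side by side (a rough restatement of the two bounds above, not the
proof; constants are omitted and the example sizes in main are arbitrary):

    #include <math.h>
    #include <stdio.h>

    /* Conventional bound: traffic ~ #flops / sqrt(M) words. */
    static double conventional_bound(double flops, double M)
    {
        return flops / sqrt(M);
    }

    /* With R-word accumulators: traffic ~ sqrt(R) * #flops / sqrt(M) words,
       i.e. a factor of sqrt(R) more; for R >= 64 that factor is at least 8. */
    static double long_accumulator_bound(double flops, double M, double R)
    {
        return sqrt(R) * flops / sqrt(M);
    }

    int main(void) {
        double flops = 2e12, M = 1e6, R = 64;   /* example sizes (assumed) */
        printf("traffic grows by a factor of %g\n",
               long_accumulator_bound(flops, M, R) / conventional_bound(flops, M));
        return 0;
    }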

For other "nested loop" computations, there is a similar
analysis, with the memory traffic increasing by R to some
power, where the exponent depends on what the
nested loops look like. For example, for the direct N-body
problem, the memory traffic would increase by a factor of R,
so even worse than linear algebra.

I wrote this up informally recently, with a related but different
motivation, and am meeting with some collaborators to go over
the proof tomorrow morning. I had hoped to have this meeting
before sharing with the 1788 committee, but since Baker posed
the question, I thought I'd respond. I will let you know whether
or not the proof survives tomorrow's conversation :).

Jim

On 1/24/16 1:26 PM, Ralph Baker Kearfott wrote:
HMMM. What about memory access bandwidth and speed, and
their impact on the overall speed of
single versus double precision?

Baker

On 01/24/2016 12:05 PM, Richard Fateman wrote:
The (re) submission of these letters to the discussion
raised a question in my mind.

    I am puzzled by the WG 2.5 vote urging the IEEE 754 committee to
incorporate two facilities:

* [double precision interval arithmetic] at the speed of simple
  floating-point arithmetic
* provide high-speed arithmetic [for dynamic precision intervals]

What puzzles me is the statement of speed requirements in a standard of
this nature. Was WG 2.5 thinking that the IEEE committee was derelict in
its duties in allowing some or all of the 754 arithmetic operations to be
provided at slow speed?