Rounded operations: test

To: "Ralph Baker Kearfott" <rbk@xxxxxxxxxxxx>, "Hossam A. H. Fahmy" <hfahmy@xxxxxxxxxxxxxxxxxxxxxxx>
Subject: Rounded operations: test
From: "Nate Hayes" <nh@xxxxxxxxxxxxxxxxx>
Date: Sun, 10 Jul 2011 12:38:51 -0500
Cc: "stds-1788" <stds-1788@xxxxxxxxxxxxxxxxx>
Delivered-to: mhonarc@xxxxxxxxxxxxxxxx
In-reply-to: <4E19AB1A.1060303@xxxxxxxxxxxx>
List-help: <http://listserv.ieee.org/cgi-bin/wa?LIST=STDS-1788>, <mailto:LISTSERV@LISTSERV.IEEE.ORG?body=INFO%20STDS-1788>
List-owner: <mailto:STDS-1788-request@LISTSERV.IEEE.ORG>
List-subscribe: <mailto:STDS-1788-subscribe-request@LISTSERV.IEEE.ORG>
List-unsubscribe: <mailto:STDS-1788-unsubscribe-request@LISTSERV.IEEE.ORG>
Organization: Sunfish Studio, LLC
References: <4E16AC1B.1090006@xxxxxxxxxxxxxx> <CAA9By-3R3uep1PVFZ5sLU-V2La7KjZ5jBnr5pJzxx3=3RCc-eQ@xxxxxxxxxxxxxx> <4E19AB1A.1060303@xxxxxxxxxxxx>
Reply-to: "Nate Hayes" <nh@xxxxxxxxxxxxxxxxx>
Sender: stds-1788@xxxxxxxx

Baker, P1788,

I ran a few tests this morning.

The first test performs 1 billion additions by explicitly:
   -- saving the state of the processor rounding mode
   -- setting rounding mode to "round up"
   -- performing the floating-point addition
   -- restoring the processor rounding mode to the saved state
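As a rough, portable sketch of the four steps above (not the attached code, which uses x87 inline assembly), the standard <cfenv> functions can be used. The helper name add_round_up is hypothetical, and the volatile variables are an assumption made here to keep the compiler from folding the addition or moving it across the rounding-mode changes; some compilers also need an option such as a floating-point-environment pragma or flag for this to be honored.

```cpp
#include <cfenv>

// Hypothetical helper (not from the attached source):
// returns x + y rounded toward +infinity.
double add_round_up(double x, double y) {
    // volatile copies: an assumption to stop constant folding and reordering
    volatile double a = x;
    volatile double b = y;

    const int saved = std::fegetround();  // save the current rounding mode
    std::fesetround(FE_UPWARD);           // set rounding mode to "round up"
    volatile double sum = a + b;          // perform the floating-point addition
    std::fesetround(saved);               // restore the saved rounding mode
    return sum;
}
```

For example, add_round_up(1.0, 1e-30) should come out strictly greater than 1.0, whereas the same sum in "round to nearest" mode is exactly 1.0.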

The second test performs 1 billion additions in "round to nearest" mode by:
   -- performing the floating-point addition
   -- adjusting the result of the addition by incrementing it by 1 ULP

In both cases, 10 iterations are explicitly unrolled within the loop. I also
did a dump of the compiler-generated assembler code to verify the test was
fair. In both tests, the compiler properly inlined all 10 explicit
iterations, so there is absolutely no function-call overhead within the
loops for either test.

The first test took about 15 seconds, and the second test took about 14
seconds.

A few notes: in the second test, I used the nextup() function to increment
an IEEE 754 double by 1 ULP. Admittedly, this function has lots of
branching, and the compiler faithfully unrolled all of these branches 10
times within the main loop. There may be faster ways to simulate upward
rounding, but this particular nextup() method is at least guaranteed to
always be accurate to within 1 ULP.
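A minimal sketch of the same idea using the standard library instead of the hand-written nextup() in the attachment: std::nextafter steps the round-to-nearest sum one ULP toward +infinity. The helper name add_up_simulated is hypothetical. Note that std::nextafter differs from the attached nextup() in at least one corner case (for -0 it returns the smallest positive subnormal rather than 0), and that this approach overshoots by one ULP whenever the sum was already exact.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: simulate an upward-rounded addition while the
// processor stays in "round to nearest" mode.
double add_up_simulated(double x, double y) {
    const double nearest = x + y;  // rounded to nearest
    // Step one ULP toward +infinity; NaN and +infinity pass through unchanged.
    return std::nextafter(nearest, std::numeric_limits<double>::infinity());
}
```

Because the round-to-nearest result is either the round-down or the round-up neighbor of the exact sum, the value returned here is never below the exact sum.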

In the first test, I think a key to getting decent performance is to make
sure the addition operation is immediately preceded and followed by the
operations to change the rounding mode, i.e.:

 fldcw  rw    ; change rounding mode to "round up"
 fadd         ; perform the addition operation
 fldcw  cw    ; restore the saved rounding mode

I *believe* I remember reading in the optimization manuals that both Intel
and AMD processors can recognize this pattern of assembly operations and
avoid flushing the floating-point pipeline... but I can't be sure (and may
be wrong).

Both tests were run on
   AMD Athlon 64 FX-60 Dual Core Processor 2.60 GHz
on 64-bit Windows 7, and compiled with the Microsoft Visual Studio 2010
Ultimate optimizing compiler.

Attached is the source code so people can try running it on different
platforms, if they wish.

Nate

----- Original Message -----
From: "Ralph Baker Kearfott" <rbk@xxxxxxxxxxxx>
To: "Hossam A. H. Fahmy" <hfahmy@xxxxxxxxxxxxxxxxxxxxxxx>
Cc: "stds-1788" <stds-1788@xxxxxxxxxxxxxxxxx>
Sent: Sunday, July 10, 2011 8:37 AM
Subject: Re: Motion 24.03: NO

Hossam, P-1788,

My understanding of the difference between 754 and 24.03
is the distinction between "rounding attribute" and
"operation."  The "rounding attribute" typically
(but not always) has been implemented as a processor
state, with associated efficiency issues when the
rounding mode is changed.  I believe (and correct me
if I am wrong) the proposers of Motion 24.03 feel
efficiency can be gained with an implementation
of separate opcodes (addup, adddown, addnear, etc.) rather than
forcing a processor mode change every time a different rounding
direction from the previous operation is required.

Of course, within the standard wording itself, the operations
can be implemented, in software, on top of mode changes,
or the mode changes can be implemented on top of the operations, so
there really is little difference.  However, language does affect how
people think about things, and, I think, the hope of the
proponents of 24.03 is chip designers might seriously consider
separate opcodes rather than modes.  I believe, however, that
would be a significant change for, say, the Intel line.  (People,
please correct me.)

I did some experiments in timing interval arithmetic on desktops
of about 15 years ago.  I found, in a loop, say, with 10^6 or so
additions (unrolled to limit loop overhead), changing the rounding
mode was MUCH slower than sacrificing accuracy by simulating
directed rounding.  In my simulations, I would, say, for an addition,
do the addition with rounding to nearest in effect, then
multiply the result by (1+\epsilon) for simulated upward rounding.
The difference was so large that I used the simulated rounding
in my work.
I haven't done such experiments with more recent chips and compilers.
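The simulated rounding described above might be sketched as follows. This is only an illustration: the choice of machine epsilon (2^-52 for doubles), the separate branch for negative sums, and the helper name add_up_inflated are all assumptions made here, not details given in the message.

```cpp
#include <limits>

// Hypothetical helper: simulate upward rounding of x + y by inflating the
// round-to-nearest result by a relative factor.
double add_up_inflated(double x, double y) {
    const double eps = std::numeric_limits<double>::epsilon();  // 2^-52
    const double s = x + y;  // addition with rounding to nearest in effect
    // Positive sums are pushed up with (1 + eps); negative sums are pulled
    // toward zero with (1 - eps), which also moves them upward.
    return (s >= 0.0) ? s * (1.0 + eps) : s * (1.0 - eps);
}
```

The result can be one or two ULPs above the round-to-nearest sum, which is the accuracy sacrificed in exchange for leaving the rounding mode untouched.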

You may know that interval arithmetic may be done on mode-based hardware
without changing the rounding mode, by storing [a,b], say, as
<-a,b>, but that has its own issues ...
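To illustrate the <-a,b> storage idea for addition only, here is a small sketch. The type and function names are hypothetical, and it assumes the caller has already switched the rounding mode to "round up" once (for example with std::fesetround(FE_UPWARD)). Rounding (-a1) + (-a2) upward gives the negated, downward-rounded lower bound, so both endpoints need only upward rounding.

```cpp
#include <cfenv>

// Hypothetical type: the interval [lo, hi] stored as the pair <-lo, hi>.
struct NegLoInterval {
    double neg_lo;  // holds -lo
    double hi;      // holds hi
};

// Assumes the rounding mode is already FE_UPWARD when this is called.
NegLoInterval add_intervals(const NegLoInterval& p, const NegLoInterval& q) {
    // volatile: an assumption to keep the compiler from folding the sums
    volatile double a = p.neg_lo, b = q.neg_lo;
    volatile double c = p.hi, d = q.hi;
    volatile double neg_lo = a + b;  // rounded up, so -neg_lo is rounded down
    volatile double hi = c + d;      // rounded up
    NegLoInterval r;
    r.neg_lo = neg_lo;
    r.hi = hi;
    return r;
}
```

The lower bound of the result is recovered as -r.neg_lo; negation is exact, so no further rounding is involved.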

Baker

On 07/10/2011 07:51 AM, Hossam A. H. Fahmy wrote:

Dear 1788 members,

The motion says:
"   Every IEEE 1788 compliant system shall provide the four basic arithmetic
operations addition, subtraction, multiplication, and division with rounding
downwards and upwards. Type conversions with directed roundings shall also be
provided."

If we decide to base 1788 on 754 then I do not see any need for this
specific motion. Any 754 compliant system MUST provide what this motion
requires. The relevant part from 754-2008 is on page 16:

"An implementation of this standard shall provide roundTiesToEven and the three
directed rounding attributes. A decimal format implementation of this standard
shall provide roundTiesToAway as a user-selectable rounding-direction
attribute. The rounding attribute roundTiesToAway is not required for a binary
format implementation."

In order to correctly keep what is 754 within 754 and what is 1788
within 1788 I vote NO on motion 24.03.

--
Hossam A. H. Fahmy
Associate Professor
Electronics and Communications Department
Cairo University
Egypt



--

---------------------------------------------------------------
R. Baker Kearfott,    rbk@xxxxxxxxxxxxx   (337) 482-5346 (fax)
(337) 482-5270 (work)                     (337) 993-1827 (home)
URL: http://interval.louisiana.edu/kearfott.html
Department of Mathematics, University of Louisiana at Lafayette
(Room 217 Maxim D. Doucet Hall, 1403 Johnston Street)
Box 4-1010, Lafayette, LA 70504-1010, USA
---------------------------------------------------------------


#include <stdio.h>	// printf
#include <time.h>	// time, time_t

typedef short		Short;   // 16-bit signed integer
typedef int		Int;     // 32-bit signed integer
typedef long long	Long;    // 64-bit signed integer
typedef double		Double;  // 64-bit IEEE 754 floating-point value

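Double nextup( Double x ) {

	// Reinterpret the bits of x as a 64-bit signed integer. This type pun
	// relies on the compiler tolerating the aliasing.
	Long e = *reinterpret_cast< Long* >( &x );

	if ( 0x7FF0000000000000 <= e ) {
		// +infinity, or a NaN with the sign bit clear: return unchanged
		return x;
	} else if ( 0x0000000000000000 <= e ) {
		// +0 or positive finite: the next bit pattern is the next double up
		e++;
		return *reinterpret_cast< Double* >( &e );
	} else if ( 0xFFF0000000000000 < e ) {
		// the literal is unsigned, so e is compared as unsigned here:
		// this is a NaN with the sign bit set
		return x;
	} else if ( 0x8000000000000000 < e ) {
		// negative and nonzero (including -infinity): decrementing the
		// bit pattern moves the value toward zero, i.e. upward
		e--;
		return *reinterpret_cast< Double* >( &e );
	} else {
		// -0
		return 0;
	}
}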

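Double add( Double x, Double y ) {

	Short cw;	// saved x87 control word
	Short rw;	// control word with rounding control set to "round up"

	// No explicit return statement: the sum is left in ST(0), the x87
	// register in which 32-bit x86 calling conventions return a double.
	__asm {

		fld			x
		fld			y
		fnstcw		cw				; save the current control word
		mov			ax, cw
		and			ax, 0F3FFh		; clear rounding-control bits 10-11
		or			ax, 00800h		; RC = 10b: round toward +infinity
		mov			rw, ax
		fldcw		rw				; change rounding mode to "round up"
		fadd						; add ST(1) and ST(0), pop; sum in ST(0)
		fldcw		cw				; restore the saved rounding mode
	}
}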

int main() {
	
	const Int N = 100000000;

	time_t t0 = time(0);
	
	Double x1 = 0.123e-4;
	Double y1 = 0.214e-4;

	for ( Int i = 0; i < N; i++ ) {
		x1 = add( x1, y1 );
		x1 = add( x1, y1 );
		x1 = add( x1, y1 );
		x1 = add( x1, y1 );
		x1 = add( x1, y1 );
		x1 = add( x1, y1 );
		x1 = add( x1, y1 );
		x1 = add( x1, y1 );
		x1 = add( x1, y1 );
		x1 = add( x1, y1 );
	}

	time_t t1 = time(0);
	
	Double x2 = 0.123e-4;
	Double y2 = 0.214e-4;

	for ( Int i = 0; i < N; i++ ) {
		x2 = nextup( x2 + y2 );
		x2 = nextup( x2 + y2 );
		x2 = nextup( x2 + y2 );
		x2 = nextup( x2 + y2 );
		x2 = nextup( x2 + y2 );
		x2 = nextup( x2 + y2 );
		x2 = nextup( x2 + y2 );
		x2 = nextup( x2 + y2 );
		x2 = nextup( x2 + y2 );
		x2 = nextup( x2 + y2 );
	}

	time_t t2 = time(0);
	
	printf( "%g, %g\n", x1, y1 );
	printf( "%g, %g\n", x2, y2 );
	printf( "%d additions:\n", N*10 );
	// time_t may be wider than int, so cast before printing with %d
	printf( "Changing rounding mode: %d seconds\n", (int)( t1-t0 ) );
	printf( "Nextup method: %d seconds\n", (int)( t2-t1 ) );
}



Program output:

21400, 2.14e-005
21400, 2.14e-005
1000000000 additions:
Changing rounding mode: 15 seconds
Nextup method: 14 seconds
Press any key to continue . . .
