
Re: Prolegomena to any future decimal discussion



The problem (well, a problem) for hardware is rounding and normalization. Much of the discussion here begs yet another architectural question, namely the assumption of dynamic scheduling. While the widespread legacy desktop architectures are dynamically scheduled, high-performance architectures have in general been statically scheduled, and most predictions of architectural trends suggest more static scheduling in the future. In a statically scheduled architecture, it is irrelevant that the long-latency path is rare, because every operation takes the time of the longest possible path. The worst-case rounding and normalization for non-DPD is rather nasty.

To deal with this, I expect that general-case performance computation (hardware) will be BCD internally to the functional unit. So the choice for hardware boils down to the cost of the BCD<->format conversion. However, if you assume that data will be naturally aligned in and out (i.e. scaled rational rather than FP), then a binary internal representation can work well for this special case and the conversion to internal BCD goes away. This assumption (of aligned data) is quite plausible to the naive user being bombarded with cleverly constructed benchmarks. Personally, I suspect that there will be enough language booby-traps that we would see a cottage industry develop to provide tricks that ensure the assumption holds. Of course, if that proves true, then we might as well have standardized scaled rational from the beginning and saved the subterfuge and user cost.

Ivan

John R Harrison wrote:
Hi Ivan,

| This comment begs *major* architectural questions, because it assumes
| the existence of dedicated non-memory-format FP registers and indeed
| of a load/store architecture in general. This assumption precludes most
| of the current research architectures for high-performance computation,
| including flow machines, computation-in-memory approaches, bit-serial
| computation, most grid machines, and so on, and even would cause trouble
| for such traditional architectures as vector and VLIW.

Discount the aside about restricting conversions to loads and stores if
you wish. The underlying point is simply that conversion between DPD
declets and binary millennial digits is not very costly with a piece of
combinational hardware. Such hardware can be replicated across the
necessary number of declets and put wherever conversions are needed.
Conversions could be wrapped around memory loads and stores as I was
suggesting, or around register accesses, or functional units, or
whatever. (OK, I admit that I have not considered the implications for
bit-serial computation.)
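
To make the declet conversion concrete, here is a minimal Python sketch of decoding one 10-bit DPD declet into a binary value in 0..999 (one "millennial digit"). The function name and bit labels are illustrative only, and the case table is the commonly published DPD decoding as I understand it, so check it against the format specification before relying on it. Each branch only tests a few indicator bits and rewires the rest, which is why this maps onto a small block of combinational logic.

```python
def dpd_declet_to_int(declet: int) -> int:
    """Decode one 10-bit DPD declet to an integer in 0..999 (illustrative sketch)."""
    if not 0 <= declet < 1024:
        raise ValueError("a declet must fit in 10 bits")
    # Name the bits from most significant (p) to least significant (y).
    p, q, r, s, t, u, v, w, x, y = ((declet >> i) & 1 for i in range(9, -1, -1))

    def small(hi, mid, lo):
        return 4 * hi + 2 * mid + lo      # a digit in 0..7

    def large(lo):
        return 8 + lo                     # a digit that is 8 or 9

    if v == 0:                            # all three digits are small
        d2, d1, d0 = small(p, q, r), small(s, t, u), small(w, x, y)
    elif (w, x) == (0, 0):                # only the low digit is large
        d2, d1, d0 = small(p, q, r), small(s, t, u), large(y)
    elif (w, x) == (0, 1):                # only the middle digit is large
        d2, d1, d0 = small(p, q, r), large(u), small(s, t, y)
    elif (w, x) == (1, 0):                # only the high digit is large
        d2, d1, d0 = large(r), small(s, t, u), small(p, q, y)
    elif (s, t) == (0, 0):                # high and middle digits are large
        d2, d1, d0 = large(r), large(u), small(p, q, y)
    elif (s, t) == (0, 1):                # high and low digits are large
        d2, d1, d0 = large(r), small(p, q, u), large(y)
    elif (s, t) == (1, 0):                # middle and low digits are large
        d2, d1, d0 = small(p, q, r), large(u), large(y)
    else:                                 # all three large; p and q are ignored
        d2, d1, d0 = large(r), large(u), large(y)
    return 100 * d2 + 10 * d1 + d0
```

The final multiply-and-add by 100 and 10 is the only arithmetic; the digit selection above it is pure bit steering.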

Given this, there is a fairly low upper bound on the likely performance
difference in hardware of DPD and ZPD. In software, on the other hand,
such conversion seems to necessitate table lookups and is hard to
parallelize, so the cost of DPD is higher.
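
As a rough illustration of the software side, the sketch below converts a DPD coefficient to a binary integer with one table lookup per declet. All names here are hypothetical, the encoder follows the commonly published DPD table as I understand it, and the loop is just one straightforward way to do it, not a tuned implementation. Note that each step depends on the previous partial result, which is the serial dependency that makes this awkward to parallelize.

```python
def int_to_dpd_declet(n: int) -> int:
    """Encode an integer 0..999 as a canonical 10-bit DPD declet (sketch)."""
    if not 0 <= n <= 999:
        raise ValueError("value must be in 0..999")
    d2, d1, d0 = n // 100, (n // 10) % 10, n % 10
    # Low three bits of each digit; for a digit of 8 or 9 only the last matters.
    a, b, c = (d2 >> 2) & 1, (d2 >> 1) & 1, d2 & 1
    d, e, f = (d1 >> 2) & 1, (d1 >> 1) & 1, d1 & 1
    g, h, i = (d0 >> 2) & 1, (d0 >> 1) & 1, d0 & 1
    layouts = {
        (False, False, False): (a, b, c, d, e, f, 0, g, h, i),
        (False, False, True):  (a, b, c, d, e, f, 1, 0, 0, i),
        (False, True,  False): (a, b, c, g, h, f, 1, 0, 1, i),
        (True,  False, False): (g, h, c, d, e, f, 1, 1, 0, i),
        (True,  True,  False): (g, h, c, 0, 0, f, 1, 1, 1, i),
        (True,  False, True):  (d, e, c, 0, 1, f, 1, 1, 1, i),
        (False, True,  True):  (a, b, c, 1, 0, f, 1, 1, 1, i),
        (True,  True,  True):  (0, 0, c, 1, 1, f, 1, 1, 1, i),
    }
    code = 0
    for bit in layouts[(d2 >= 8, d1 >= 8, d0 >= 8)]:
        code = (code << 1) | bit
    return code


# 1024-entry lookup table, built once by inverting the encoder.
DPD_TO_BIN = [None] * 1024
for value in range(1000):
    DPD_TO_BIN[int_to_dpd_declet(value)] = value
# The 24 non-canonical patterns differ from a canonical one only in the top two bits.
for code in range(1024):
    if DPD_TO_BIN[code] is None:
        DPD_TO_BIN[code] = DPD_TO_BIN[code & 0x0FF]


def dpd_coefficient_to_int(declets):
    """Convert declets (most significant first) to a binary integer."""
    result = 0
    for code in declets:
        # One lookup, one multiply, one add per declet, each waiting on the last.
        result = result * 1000 + DPD_TO_BIN[code]
    return result
```

For example, `dpd_coefficient_to_int([0x0FF, 0x079, 0x00A])` combines the declets for 999, 079 and 080 into a single binary integer.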

John.

