Re: Motion 46 -- Vincent's objections



On 2013-07-19 15:04:07 -0400, Michel Hack wrote:
> In a world with widely different conventions it is useful to have the
> concept of locale-dependency, but one must also be able to be universally
> precise.  I don't have much experience with the locale issues of atof(),
> say, but I did trip over such issues in the AIX libc implementation, and
> I find it troublesome that there is no guaranteed way to recognize a
> floating-point literal.  I would have preferred if the so-called C-locale
> notation was ALWAYS accepted, and others might be accepted too, especially

This is what we do in MPFR:

     Parsing follows the standard C `strtod' function with some
     extensions.  After optional leading whitespace, one has a subject
     sequence consisting of an optional sign (`+' or `-'), and either
     numeric data or special data. The subject sequence is defined as
     the longest initial subsequence of the input string, starting with
     the first non-whitespace character, that is of the expected form.

     The form of numeric data is a non-empty sequence of significand
     digits with an optional decimal point, and an optional exponent
     consisting of an exponent prefix followed by an optional sign and
     a non-empty sequence of decimal digits. A significand digit is
     either a decimal digit or a Latin letter (62 possible characters),
     with `A' = 10, `B' = 11, ..., `Z' = 35; case is ignored in bases
     less or equal to 36, in bases larger than 36, `a' = 36, `b' = 37,
     ..., `z' = 61.  The value of a significand digit must be strictly
     less than the base.  The decimal point can be either the one
     defined by the current locale or the period (the first one is
     accepted for consistency with the C standard and the practice, the
     second one is accepted to allow the programmer to provide MPFR
     numbers from strings in a way that does not depend on the current
     locale).  The exponent prefix can be `e' or `E' for bases up to
     10, or `@' in any base; it indicates a multiplication by a power
     of the base. In bases 2 and 16, the exponent prefix can also be
     `p' or `P', in which case the exponent, called _binary exponent_,
     indicates a multiplication by a power of 2 instead of the base
     (there is a difference only for base 16); in base 16 for example
     `1p2' represents 4 whereas `1@2' represents 256. The value of an
     exponent is always written in base 10.
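
To illustrate the digit rule quoted above, here is a minimal sketch in C.
It is not MPFR's actual code: the name digit_value and the accepted base
range of 2 to 62 are assumptions made for this example only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of MPFR): returns the value of the
   significand digit c in the given base, or -1 if c is not a valid
   digit in that base.  The base range 2..62 is an assumption. */
int digit_value(char c, int base)
{
    /* Position in this table is the digit value when the base is
       larger than 36: '0'..'9' = 0..9, 'A'..'Z' = 10..35,
       'a'..'z' = 36..61. */
    static const char digits[] =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    const char *p;
    int v;

    assert(base >= 2 && base <= 62);
    if (c == '\0')
        return -1;          /* strchr would match the terminator */
    p = strchr(digits, c);
    if (p == NULL)
        return -1;          /* not a digit or Latin letter */
    v = (int)(p - digits);
    if (base <= 36 && v >= 36)
        v -= 26;            /* case is ignored: 'a' counts as 'A' */
    /* The value must be strictly less than the base. */
    return v < base ? v : -1;
}
```

With this sketch, digit_value('a', 16) and digit_value('A', 16) both give
10, whereas digit_value('a', 62) gives 36. The table lookup avoids assuming
that letters are contiguous in the execution character set.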

In C, strtod is required to accept the decimal-point character of the
current locale (well, this is how the glibc developers have interpreted
the standard), but it may accept more in locales other than "C":

    In other than the "C" locale, additional locale-specific subject
    sequence forms may be accepted.
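
As a point of comparison, the baseline behaviour of strtod under the "C"
locale can be sketched as follows. The helper name parse_c_locale is
hypothetical, and the sketch changes LC_NUMERIC globally without restoring
it, which real code would probably want to avoid.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parses s with strtod under the "C" locale and
   stores the length of the consumed subject sequence (including any
   leading whitespace) in *consumed.
   Caveat: setlocale changes global state and is not restored here. */
double parse_c_locale(const char *s, size_t *consumed)
{
    char *end;
    double x;

    assert(s != NULL && consumed != NULL);
    setlocale(LC_NUMERIC, "C");   /* decimal point is the period */
    x = strtod(s, &end);          /* end points past the subject sequence */
    *consumed = (size_t)(end - s);
    return x;
}
```

Under these assumptions, "1234.56" is consumed entirely, "1,234.56" yields
1 with a single character consumed, and "1234,56" yields 1234 with four
characters consumed: the comma ends the subject sequence in each case.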

> since there is no conflict with digit-separators which, though possible on
> output, are (as far as I recall) NOT accepted on input:  1,234.56 for
> example is recognized as 1 (the rest is junk).  So in a European locale,
> where one would write 1234,56 there should be no ambiguity if 1234.56 is
> entered, because 1.234,56 would not be acceptable as equivalent to 1234,56
> and would (in the absence of additional C-locale interpretation) be taken
> to be 1.
> 
> So I think we should define a standard format for text2interval() arguments,
> and we could compare them to the C locale for clarity.  We would obviously
> not refuse a locale-dependent variant -- it is always permissible to provide
> additional functions, or perhaps even the same function which ALSO allows
> locale-dependent forms, together with whatever might be required to avoid
> conflict with the standard syntax.
> 
> Turkish locales might have difficulty with Inf vs inf, but I can imagine
> worse: what if a locale specifies right-to-left strings?

AFAIK, this notion is only for *display* purposes. "Inf" in such a locale
would still be the "Inf" string. There is no difference from finite
numeric literals.

> I know that numeric literals are written in the same order as for
> left-to-right scripts, but what about the order of bounds in the
> pair that denotes an interval? Is the low bound the first or the
> leftmost element?

The first, but I would say that the full interval literal would be
written left-to-right.

Otherwise the first character would be "]", not "[".

One may also wonder whether a BOM would be accepted as the first character
of a literal. Possibly as an implementation-dependent variant. Or
perhaps we should introduce a notion of canonicalization.

> Getting back to comma vs point. C syntax uses comma for separating
> elements of an initializer, among other things. How would that
> interact with numeric literals containing a comma?

A C source is always interpreted under the C locale (or similar,
e.g. to allow non-ASCII characters). This means that the decimal
point character in a C source is always a point.

> That's why locale has no impact on C literals, even though it does
> affect some common library functions.

This is for a practical reason: a C source needs to have the same
interpretation whatever the locale. Well, the character encoding
matters, and a bit more (for instance, GCC removes spaces at the
end of each line). But that's all, even though this initial
transformation is unspecified by the C standard.

-- 
Vincent Lefèvre <vincent@xxxxxxxxxx> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)