
Re: [P1619-3] I18n of SO_GUID



On 2008-May-28, at 15:17, Luther Martin wrote:

> OK, here's an attempt at the ABNF definition of the URL SOGUID that  
> allows UTF-8.
>
> I've validated the ABNF syntax.
> <abnf with utf8.txt>

I do not recommend mixing characters with octet concepts as it breaks  
size constraints.

Do we really need the complexity of UTF-8 character validation given that  
handle names, as presented to the API, may contain octets filled with any  
binary pattern?  I don't think we do.

A simpler approach may be to avoid the raw encoding of the UTF-8 shift  
characters.  I.e. (revised from my earlier EMail):

	# UFT8-SAFE non-dot octet avoids NUL (%x00)
	# UFT8-SAFE non-dot octet avoids space (%x20) thru / (%x2F)
	# UFT8-SAFE non-dot octet avoids : (%x3A) thru @ (%x40)
	# UFT8-SAFE non-dot octet avoids [ (%x5B) thru ` (%x60)
	# UFT8-SAFE non-dot octet avoids { (%x7B) thru del (%x7F)
	# UFT8-SAFE non-dot octet avoids %x80-%xFF

	# UFT8-SAFE non-dot octet allows %x01 thru %x1F
	# UFT8-SAFE non-dot octet allows 0-9 (%x30-%x39)
	# UFT8-SAFE non-dot octet allows A-Z (%x41-%x5A)
	# UFT8-SAFE non-dot octet allows a-z (%x61-%x7A)

	o <UFT8-SAFE non-dot octet> = (%x01-%x1F / %x30-%x39 / %x41-%x5A / %x61-%x7A)

	o <UFT8-SAFE octet> = <UFT8-SAFE non-dot octet> / <dot>
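As a rough illustration only, the two octet classes above can be written as predicates.  This is a sketch, not proposed text: the function names are made up for the example, and it assumes <dot> means the single octet %x2E, which this message does not define.

```python
# Hypothetical helper names; the ranges are copied from the rules above.
DOT = 0x2E  # assumption: <dot> is the "." octet


def is_safe_non_dot_octet(b: int) -> bool:
    """True if octet b matches <UFT8-SAFE non-dot octet>."""
    return (
        0x01 <= b <= 0x1F     # %x01 thru %x1F, allowed by the rule as written
        or 0x30 <= b <= 0x39  # 0-9
        or 0x41 <= b <= 0x5A  # A-Z
        or 0x61 <= b <= 0x7A  # a-z
    )


def is_safe_octet(b: int) -> bool:
    """True if octet b matches <UFT8-SAFE octet> (non-dot octet or dot)."""
    return is_safe_non_dot_octet(b) or b == DOT
```

Every octet with the high bit set (%x80-%xFF) falls outside both classes, so UTF-8 lead and continuation octets never match.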

	...

	• <SO_Handle> = <UFT8-SAFE handle> / <non-UFT8-SAFE encoded handle>

	o <UFT8-SAFE handle> = (ALPHA / DIGIT) 0*254 <UFT8-SAFE octet>

	o <non-UFT8-SAFE encoded handle> = <handle first octet> 0*254 <UFT8-SAFE next octet>

	o <handle first octet> = ALPHA / DIGIT / <non-alphanumeric encoded>

	o <non-alphanumeric encoded> = “%” (“0” / “1” / “2”) <hex>
		o <non-alphanumeric encoded> =/ “%” “3” (“A” / “B” / “C” / “D” / “E” / “F”)
		o <non-alphanumeric encoded> =/ “%” “4” “0”
		o <non-alphanumeric encoded> =/ “%” “5” (“B” / “C” / “D” / “E” / “F”)
		o <non-alphanumeric encoded> =/ “%” “6” “0”
		o <non-alphanumeric encoded> =/ “%” “7” (“B” / “C” / “D” / “E” / “F”)
		o <non-alphanumeric encoded> =/ “%” (“8” / “9” / “A” / “B” / “C” / “D” / “E” / “F”) <hex>

	o <UFT8-SAFE next octet> = <UFT8-SAFE octet> / <dash> / <underscore> / <UFT8-UNSAFE encoded octet>

	# UFT8-UNSAFE encoded octet encodes any octet that is not
	# <UFT8-SAFE octet> nor <dash> nor <underscore>

	o <UFT8-UNSAFE encoded octet> = “%” “0” “0”
		o <UFT8-UNSAFE encoded octet> =/ “%” “2” <digit>
		o <UFT8-UNSAFE encoded octet> =/ “%” “2” (“A” / “B” / “C” / “F”)
		o <UFT8-UNSAFE encoded octet> =/ “%” “3” (“A” / “B” / “C” / “D” / “E” / “F”)
		o <UFT8-UNSAFE encoded octet> =/ “%” “4” “0”
		o <UFT8-UNSAFE encoded octet> =/ “%” “5” (“B” / “C” / “D” / “E”)
		o <UFT8-UNSAFE encoded octet> =/ “%” “6” “0”
		o <UFT8-UNSAFE encoded octet> =/ “%” “7” (“B” / “C” / “D” / “E” / “F”)
		o <UFT8-UNSAFE encoded octet> =/ “%” (“8” / “9” / “A” / “B” / “C” / “D” / “E” / “F”) <hex>
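To make the intent of the rules above concrete, here is a minimal sketch of an encoder that turns a raw handle (arbitrary octets) into the percent-encoded form.  It is an illustration under assumptions, not proposed text: encode_handle and the helpers are hypothetical names, <dot>, <dash> and <underscore> are assumed to be %x2E, %x2D and %x5F, and <hex> is assumed to use upper-case digits.

```python
DOT, DASH, UNDERSCORE = 0x2E, 0x2D, 0x5F  # assumed values of <dot>, <dash>, <underscore>


def is_alnum(b: int) -> bool:
    """ALPHA / DIGIT."""
    return 0x30 <= b <= 0x39 or 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A


def is_safe_octet(b: int) -> bool:
    """<UFT8-SAFE octet>: %x01-%x1F, alphanumerics, or dot."""
    return 0x01 <= b <= 0x1F or is_alnum(b) or b == DOT


def encode_handle(raw: bytes) -> str:
    """Encode 1..255 raw octets; each octet becomes one literal or one %XX triple."""
    if not 1 <= len(raw) <= 255:
        raise ValueError("handle must be 1 to 255 octets")
    out = []
    for i, b in enumerate(raw):
        if i == 0:
            literal = is_alnum(b)  # <handle first octet>
        else:
            literal = is_safe_octet(b) or b in (DASH, UNDERSCORE)  # next octet
        out.append(chr(b) if literal else f"%{b:02X}")
    return "".join(out)
```

For example, the two UTF-8 octets of "é" (%xC3 %xA9) come out as "%C3%A9", and a leading dash comes out as "%2D" because the first octet must be alphanumeric or encoded.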

The above avoids the non-POSIX issues, avoids conflicts with the  
reserved namespaces, preserves size limits and remains UTF-8 safe  
because the UTF-8 shift characters are encoded.
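The size-limit and UTF-8-safety claims can be sketched as a check on the receiving side.  This is only an illustration: decode_handle and MAX_HANDLE_OCTETS are hypothetical names, the 255 figure is read off the grammar (one first octet plus 0*254 following), and the code is not a full validator of the grammar (for instance, it does not reject a needlessly encoded alphanumeric or a malformed "%" sequence).

```python
from urllib.parse import unquote_to_bytes

MAX_HANDLE_OCTETS = 255  # assumption: 1 first octet + up to 254 next octets


def decode_handle(encoded: str) -> bytes:
    """Decode a percent-encoded handle and enforce the decoded octet limit."""
    # Raw octets %x80-%xFF are never allowed; they must arrive as %XX.
    if any(ord(ch) > 0x7F for ch in encoded):
        raise ValueError("octets %x80-%xFF must be percent-encoded")
    raw = unquote_to_bytes(encoded)  # each %XX triple counts as ONE octet
    if not 1 <= len(raw) <= MAX_HANDLE_OCTETS:
        raise ValueError("handle must decode to 1..255 octets")
    return raw
```

Because the limit is counted in decoded octets, a handle made entirely of encoded octets is three times longer on the wire but still subject to the same 255-octet bound.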

chongo () /\oo/\