Age | Commit message (Collapse) | Author |
|
As noted by ECMA-402, if a supported locale contains all of a language,
script, and region subtag, then the implementation must also support the
locale without the script subtag. The most complicated example of this
is the zh-TW locale.
The list of locales in the CLDR database does not include zh-TW or its
maximized zh-Hant-TW variant. Instead, it inlcudes the zh-Hant locale.
However, zh-Hant-TW is listed in the default-content locale list in the
cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We
must then also support the zh-Hant-TW alias without the script subtag:
zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite
heavily tested by test262.
|
|
This file contains the list of locales which default to their parent
locale's values. In the core CLDR dataset, these locales have their own
files, but they are empty (except for identity data). For example:
https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml
In the JSON export, these files are excluded, so we currently are not
recognizing these locales just by iterating the locale files.
This is a prerequisite for upgrading to CLDR version 40. One of these
default-content locales is the popular "en-US" locale, which defaults to
"en" values. We were previously inferring the existence of this locale
from the "en-US-POSIX" locale (many implementations, including ours,
strip variants such as POSIX). However, v40 removes the "en-US-POSIX"
locale entirely, meaning that without this change, we wouldn't know that
"en-US" exists (we would default to "en").
For more detail on this and other v40 changes, see:
https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba
|
|
Previously, LibUnicode would store the values of a keyword as a Vector.
For example, the locale "en-u-ca-abc-def" would have its keyword "ca"
stored as {"abc, "def"}. Then, canonicalization would occur on each of
the elements in that Vector.
This is incorrect because, for example, the keyword value "true" should
only be dropped if that is the entire value. That is, the canonical form
of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" does not change
for canonicalization. However, we would canonicalize that locale as
"en-u-kb-abc".
|
|
|
|
|
|
|
|
Calendar subtags are a bit of an odd-man-out in that we must match the
variants "ethiopic-amete-alem" in that order, without any other variant
in the locale. So a separate method is needed for this, and we now defer
sorting the variant list until after other canonicalization is done.
|
|
|
|
|
|
|
|
|
|
|
|
Unicode TR35 defines how locale subtag aliases should be emplaced when
converting a locale to canonical form. For most subtags, it is a simple
substitution. Language subtags depend on context; for example, the
language "sh" should become "sr-Latn", but if the original locale has a
script subtag already ("sh-Cyrl"), then only the language subtag of the
alias should be taken ("sr-Latn").
To facilitate this, we now make two passes when canonicalizing a locale.
In the first pass, we convert the LocaleID structure to canonical syntax
(where the conversions all happen in-place). In the second pass, we form
the canonical string based on the canonical syntax.
|
|
Originally, it was convenient to store the parsed Unicode locale data as
views into the original string being parsed. But to implement locale
aliases will require mutating the data that was parsed. To prepare for
that, store the parsed data as proper strings.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
ECMA-402 requires validating user input against the EBNF grammar for
Unicode locales described in TR-35: https://www.unicode.org/reports/tr35
This commit adds validators for that grammar, as well as other helper to
e.g. canonicalize a locale string.
|