summaryrefslogtreecommitdiff
path: root/Tests/LibUnicode/TestUnicodeLocale.cpp
AgeCommit message (Collapse)Author
2021-11-19LibUnicode: Support locales-without-script aliases for ECMA-402Timothy Flynn
As noted by ECMA-402, if a supported locale contains all of a language, script, and region subtag, then the implementation must also support the locale without the script subtag. The most complicated example of this is the zh-TW locale. The list of locales in the CLDR database does not include zh-TW or its maximized zh-Hant-TW variant. Instead, it inlcudes the zh-Hant locale. However, zh-Hant-TW is listed in the default-content locale list in the cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We must then also support the zh-Hant-TW alias without the script subtag: zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite heavily tested by test262.
2021-11-09LibUnicode: Parse the CLDR's defaultContent.json locale listTimothy Flynn
This file contains the list of locales which default to their parent locale's values. In the core CLDR dataset, these locales have their own files, but they are empty (except for identity data). For example: https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml In the JSON export, these files are excluded, so we currently are not recognizing these locales just by iterating the locale files. This is a prerequisite for upgrading to CLDR version 40. One of these default-content locales is the popular "en-US" locale, which defaults to "en" values. We were previously inferring the existence of this locale from the "en-US-POSIX" locale (many implementations, including ours, strip variants such as POSIX). However, v40 removes the "en-US-POSIX" locale entirely, meaning that without this change, we wouldn't know that "en-US" exists (we would default to "en"). For more detail on this and other v40 changes, see: https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba
2021-09-08LibUnicode+LibJS: Store locale keyword values as a single stringTimothy Flynn
Previously, LibUnicode would store the values of a keyword as a Vector. For example, the locale "en-u-ca-abc-def" would have its keyword "ca" stored as {"abc, "def"}. Then, canonicalization would occur on each of the elements in that Vector. This is incorrect because, for example, the keyword value "true" should only be dropped if that is the entire value. That is, the canonical form of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" does not change for canonicalization. However, we would canonicalize that locale as "en-u-kb-abc".
2021-09-02LibUnicode: Add lexer to test if a string matches the "type" productionTimothy Flynn
2021-09-01LibUnicode: Resolve the most likely territory alias when there are manyTimothy Flynn
2021-09-01LibUnicode: Perform complex Unicode locale alias substitutionTimothy Flynn
2021-09-01LibUnicode: Canonicalize calendar subtagsTimothy Flynn
Calendar subtags are a bit of an odd-man-out in that we must match the variants "ethiopic-amete-alem" in that order, without any other variant in the locale. So a separate method is needed for this, and we now defer sorting the variant list until after other canonicalization is done.
2021-09-01LibUnicode: Canonicalize timezone subtagsTimothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "imperial" to "uksystem"Timothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN"Timothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "names" to "prprname"Timothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "yes" to "true"Timothy Flynn
2021-09-01LibUnicode: Substitute Unicode locale aliases during canonicalizationTimothy Flynn
Unicode TR35 defines how locale subtag aliases should be emplaced when converting a locale to canonical form. For most subtags, it is a simple substitution. Language subtags depend on context; for example, the language "sh" should become "sr-Latn", but if the original locale has a script subtag already ("sh-Cyrl"), then only the language subtag of the alias should be taken ("sr-Latn"). To facilitate this, we now make two passes when canonicalizing a locale. In the first pass, we convert the LocaleID structure to canonical syntax (where the conversions all happen in-place). In the second pass, we form the canonical string based on the canonical syntax.
2021-09-01LibJS+LibUnicode: Store parsed Unicode locale data as full stringsTimothy Flynn
Originally, it was convenient to store the parsed Unicode locale data as views into the original string being parsed. But to implement locale aliases will require mutating the data that was parsed. To prepare for that, store the parsed data as proper strings.
2021-08-30LibUnicode: Canonicalize locale private use extensionsTimothy Flynn
2021-08-30LibUnicode: Canonicalize locale extensionsTimothy Flynn
2021-08-30LibUnicode: Parse locale private use extensionsTimothy Flynn
2021-08-30LibUnicode: Parse locale extensions of the other extension formTimothy Flynn
2021-08-30LibUnicode: Parse locale extensions of the transformed extension formTimothy Flynn
2021-08-30LibUnicode: Parse locale extensions of the Unicode locale extension formTimothy Flynn
2021-08-26LibUnicode: Implement grammar validators for Unicode TR-35Timothy Flynn
ECMA-402 requires validating user input against the EBNF grammar for Unicode locales described in TR-35: https://www.unicode.org/reports/tr35 This commit adds validators for that grammar, as well as other helper to e.g. canonicalize a locale string.