summaryrefslogtreecommitdiff
path: root/Tests/LibUnicode
AgeCommit message (Collapse)Author
2021-11-19LibUnicode: Support locales-without-script aliases for ECMA-402Timothy Flynn
As noted by ECMA-402, if a supported locale contains all of a language, script, and region subtag, then the implementation must also support the locale without the script subtag. The most complicated example of this is the zh-TW locale. The list of locales in the CLDR database does not include zh-TW or its maximized zh-Hant-TW variant. Instead, it inlcudes the zh-Hant locale. However, zh-Hant-TW is listed in the default-content locale list in the cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We must then also support the zh-Hant-TW alias without the script subtag: zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite heavily tested by test262.
2021-11-09LibUnicode: Parse the CLDR's defaultContent.json locale listTimothy Flynn
This file contains the list of locales which default to their parent locale's values. In the core CLDR dataset, these locales have their own files, but they are empty (except for identity data). For example: https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml In the JSON export, these files are excluded, so we currently are not recognizing these locales just by iterating the locale files. This is a prerequisite for upgrading to CLDR version 40. One of these default-content locales is the popular "en-US" locale, which defaults to "en" values. We were previously inferring the existence of this locale from the "en-US-POSIX" locale (many implementations, including ours, strip variants such as POSIX). However, v40 removes the "en-US-POSIX" locale entirely, meaning that without this change, we wouldn't know that "en-US" exists (we would default to "en"). For more detail on this and other v40 changes, see: https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba
2021-09-08LibUnicode+LibJS: Store locale keyword values as a single stringTimothy Flynn
Previously, LibUnicode would store the values of a keyword as a Vector. For example, the locale "en-u-ca-abc-def" would have its keyword "ca" stored as {"abc, "def"}. Then, canonicalization would occur on each of the elements in that Vector. This is incorrect because, for example, the keyword value "true" should only be dropped if that is the entire value. That is, the canonical form of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" does not change for canonicalization. However, we would canonicalize that locale as "en-u-kb-abc".
2021-09-06LibUnicode: Implement locale-aware BEFORE_DOT special casingTimothy Flynn
Note that the algorithm in the Unicode spec is for checking that a code point precedes U+0307, but the special casing condition NotBeforeDot is interested in the inverse of this rule.
2021-09-06LibUnicode: Implement locale-aware MORE_ABOVE special casingTimothy Flynn
2021-09-06LibUnicode: Implement locale-aware AFTER_SOFT_DOTTED special casingTimothy Flynn
2021-09-06LibUnicode: Implement locale-aware AFTER_I special casingTimothy Flynn
2021-09-02LibUnicode: Add lexer to test if a string matches the "type" productionTimothy Flynn
2021-09-02Tests: Remove all file(GLOB) from CMakeLists in TestsAndrew Kaster
Using a file(GLOB) to find all the test files in a directory is an easy hack to get things started, but has some drawbacks. Namely, if you add a test, it won't be found again without re-running CMake. `ninja` seems to do this automatically, but it would be nice to one day stop seeing it rechecking our globbed directories.
2021-09-01LibUnicode: Resolve the most likely territory alias when there are manyTimothy Flynn
2021-09-01LibUnicode: Perform complex Unicode locale alias substitutionTimothy Flynn
2021-09-01LibUnicode: Canonicalize calendar subtagsTimothy Flynn
Calendar subtags are a bit of an odd-man-out in that we must match the variants "ethiopic-amete-alem" in that order, without any other variant in the locale. So a separate method is needed for this, and we now defer sorting the variant list until after other canonicalization is done.
2021-09-01LibUnicode: Canonicalize timezone subtagsTimothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "imperial" to "uksystem"Timothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN"Timothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "names" to "prprname"Timothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "yes" to "true"Timothy Flynn
2021-09-01LibUnicode: Substitute Unicode locale aliases during canonicalizationTimothy Flynn
Unicode TR35 defines how locale subtag aliases should be emplaced when converting a locale to canonical form. For most subtags, it is a simple substitution. Language subtags depend on context; for example, the language "sh" should become "sr-Latn", but if the original locale has a script subtag already ("sh-Cyrl"), then only the language subtag of the alias should be taken ("sr-Latn"). To facilitate this, we now make two passes when canonicalizing a locale. In the first pass, we convert the LocaleID structure to canonical syntax (where the conversions all happen in-place). In the second pass, we form the canonical string based on the canonical syntax.
2021-09-01LibJS+LibUnicode: Store parsed Unicode locale data as full stringsTimothy Flynn
Originally, it was convenient to store the parsed Unicode locale data as views into the original string being parsed. But to implement locale aliases will require mutating the data that was parsed. To prepare for that, store the parsed data as proper strings.
2021-08-30LibUnicode: Canonicalize locale private use extensionsTimothy Flynn
2021-08-30LibUnicode: Canonicalize locale extensionsTimothy Flynn
2021-08-30LibUnicode: Parse locale private use extensionsTimothy Flynn
2021-08-30LibUnicode: Parse locale extensions of the other extension formTimothy Flynn
2021-08-30LibUnicode: Parse locale extensions of the transformed extension formTimothy Flynn
2021-08-30LibUnicode: Parse locale extensions of the Unicode locale extension formTimothy Flynn
2021-08-26LibUnicode: Implement grammar validators for Unicode TR-35Timothy Flynn
ECMA-402 requires validating user input against the EBNF grammar for Unicode locales described in TR-35: https://www.unicode.org/reports/tr35 This commit adds validators for that grammar, as well as other helper to e.g. canonicalize a locale string.
2021-08-11LibUnicode: Handle edge-case script extensions, Common and InheritedTimothy Flynn
These script extensions have some peculiar behavior in the Unicode spec. The UCD ScriptExtension file does not contain these scripts. Rather, it is implied the code points which have these scripts as an extension are the code points that both: 1. Have Common or Inherited as their primary script value 2. Do not have any other script value in their script extension lists Because these are not explictly listed in the UCD, we must manually form these script extensions.
2021-08-11LibUnicode: Generate separate tables for scripts and script extensionsTimothy Flynn
Notice that unlike the note in populate_general_category_unions(), script extension do indeed have code point ranges which overlap. Thus, this commit adds code to handle that, and hooks it into the GC unions.
2021-08-11LibUnicode: Generate separate tables for Unicode propertiesTimothy Flynn
Similar to General Categories, this generates separate tables for the Property list.
2021-08-11LibUnicode: Include Unassigned code points in the Other General CategoryTimothy Flynn
Now that the generator parses unassigned General Category properties, it can include Unassigned (Cn) in the Other (C) category.
2021-08-11LibUnicode: Generate separate tables for General Category propertiesTimothy Flynn
Previously, each code point's General Category was part of the generated UnicodeData structure. This ultimately presented two problems, one functional and one performance related: * Some General Categories are applied to unassigned code points, for example the Unassigned (Cn) category. Unassigned code points are strictly excluded from UnicodeData.txt, so by relying on that file, the generator is unable to handle these categories. * Lookups for General Categories are slower when searching through the large UnicodeData hash map. Even though lookups are O(1), the hash function turned out to be slower than binary searching through a category-specific table. So, now a table is generated for each General Category. When querying a code point for a category, a binary search is done on each code point range in that category's table to check if code point has that category. Further, General Categories are now parsed from the UCD file DerivedGeneralCategory.txt. This file is a normal "prop list" file and contains the categories for unassigned code points.
2021-07-28LibUnicode: Handle code points that are both cased and case-ignorableTimothy Flynn
Apparently, some code points fit both categories, for example U+0345 (COMBINING GREEK YPOGEGRAMMENI). Handle this fact when determining if a code point is a final code point in a string.
2021-07-28LibUnicode: Check word break when deciding on case-ignorable code pointsTimothy Flynn
2021-07-28LibUnicode: Check property list when deciding if a code point is casedTimothy Flynn
2021-07-27LibUnicode: Begin implementing special Unicode case foldingTimothy Flynn
This implements unconditional special case folding, and conditional folding for non-locale cases. Worth noting that the only conditional, non-locale special case is for converting an uppercase sigma to lowercase.
2021-07-26LibUnicode: Introduce a Unicode library for interacting with UCD filesTimothy Flynn
The Unicode standard publishes the Unicode Character Database (UCD) with information about every code point, such as each code point's upper case mapping. LibUnicode exists to download and parse UCD files at build time and to provide accessors to that data. As a start, LibUnicode includes upper- and lower-case code point converters.