serenity - The Serenity Operating System 🐞

Age	Commit message	Author
2021-11-19	LibUnicode: Support locales-without-script aliases for ECMA-402	Timothy Flynn
	As noted by ECMA-402, if a supported locale contains all of a language, script, and region subtag, then the implementation must also support the locale without the script subtag. The most complicated example of this is the zh-TW locale. The list of locales in the CLDR database does not include zh-TW or its maximized zh-Hant-TW variant. Instead, it includes the zh-Hant locale. However, zh-Hant-TW is listed in the default-content locale list in the cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We must then also support the zh-Hant-TW alias without the script subtag: zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite heavily tested by test262.
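The alias derivation described above can be sketched roughly as follows. This is a minimal illustration, not LibUnicode's actual code: the names `split_subtags` and `add_scriptless_aliases` are hypothetical, and the sketch assumes well-formed input in which a script subtag is recognized purely by being the four-letter middle subtag of three.

```cpp
#include <map>
#include <string>
#include <vector>

// Splits a locale such as "zh-Hant-TW" on '-' into its subtags.
static std::vector<std::string> split_subtags(std::string const& locale)
{
    std::vector<std::string> subtags;
    std::string current;
    for (char ch : locale) {
        if (ch == '-') {
            subtags.push_back(current);
            current.clear();
        } else {
            current += ch;
        }
    }
    subtags.push_back(current);
    return subtags;
}

// For every alias of the form language-Script-REGION, also registers
// language-REGION with the same target. Existing entries are never overwritten.
// (Hypothetical helper; assumes a script subtag is the 4-letter middle subtag.)
std::map<std::string, std::string> add_scriptless_aliases(std::map<std::string, std::string> const& aliases)
{
    std::map<std::string, std::string> result = aliases;
    for (auto const& entry : aliases) {
        auto subtags = split_subtags(entry.first);
        if (subtags.size() == 3 && subtags[1].size() == 4)
            result.emplace(subtags[0] + "-" + subtags[2], entry.second);
    }
    return result;
}
```

Given the single default-content alias zh-Hant-TW to zh-Hant, the returned map also contains zh-TW mapped to zh-Hant.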
2021-11-09	LibUnicode: Parse the CLDR's defaultContent.json locale list	Timothy Flynn
	This file contains the list of locales which default to their parent locale's values. In the core CLDR dataset, these locales have their own files, but they are empty (except for identity data). For example: https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml In the JSON export, these files are excluded, so we currently are not recognizing these locales just by iterating the locale files. This is a prerequisite for upgrading to CLDR version 40. One of these default-content locales is the popular "en-US" locale, which defaults to "en" values. We were previously inferring the existence of this locale from the "en-US-POSIX" locale (many implementations, including ours, strip variants such as POSIX). However, v40 removes the "en-US-POSIX" locale entirely, meaning that without this change, we wouldn't know that "en-US" exists (we would default to "en"). For more detail on this and other v40 changes, see: https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba
2021-09-08	LibUnicode+LibJS: Store locale keyword values as a single string	Timothy Flynn
	Previously, LibUnicode would store the values of a keyword as a Vector. For example, the locale "en-u-ca-abc-def" would have its keyword "ca" stored as {"abc", "def"}. Then, canonicalization would occur on each of the elements in that Vector. This is incorrect because, for example, the keyword value "true" should only be dropped if that is the entire value. That is, the canonical form of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" is left unchanged by canonicalization. However, we would canonicalize that locale as "en-u-kb-abc".
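To illustrate why the value has to be treated as one string, here is a small sketch. The helper names `canonicalize_keyword_value` and `format_keyword` are hypothetical and model only the "true" rule from the message above, not the rest of keyword canonicalization.

```cpp
#include <string>

// Canonicalizes the complete value of a Unicode locale keyword. The value is the
// whole '-'-joined string, e.g. "abc-true" for the locale "en-u-kb-abc-true".
// (Hypothetical helper; only the "true" rule is modeled.)
std::string canonicalize_keyword_value(std::string const& value)
{
    // "true" is dropped only when it is the entire value.
    if (value == "true")
        return "";
    return value;
}

// Re-forms the "key-value" portion of the locale string after canonicalization.
std::string format_keyword(std::string const& key, std::string const& value)
{
    std::string canonical = canonicalize_keyword_value(value);
    return canonical.empty() ? key : key + "-" + canonical;
}
```

With the value kept whole, `format_keyword("kb", "true")` yields "kb", while `format_keyword("kb", "abc-true")` stays "kb-abc-true". Canonicalizing each element of a split value separately would wrongly drop the trailing "true" in the second case.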
2021-09-06	LibUnicode: Implement locale-aware BEFORE_DOT special casing	Timothy Flynn
	Note that the algorithm in the Unicode spec is for checking that a code point precedes U+0307, but the special casing condition NotBeforeDot is interested in the inverse of this rule.
2021-09-06	LibUnicode: Implement locale-aware MORE_ABOVE special casing	Timothy Flynn

2021-09-06	LibUnicode: Implement locale-aware AFTER_SOFT_DOTTED special casing	Timothy Flynn

2021-09-06	LibUnicode: Implement locale-aware AFTER_I special casing	Timothy Flynn

2021-09-02	LibUnicode: Add lexer to test if a string matches the "type" production	Timothy Flynn

2021-09-02	Tests: Remove all file(GLOB) from CMakeLists in Tests	Andrew Kaster
	Using a file(GLOB) to find all the test files in a directory is an easy hack to get things started, but has some drawbacks. Namely, if you add a test, it won't be found again without re-running CMake. `ninja` seems to do this automatically, but it would be nice to one day stop seeing it rechecking our globbed directories.
2021-09-01	LibUnicode: Resolve the most likely territory alias when there are many	Timothy Flynn

2021-09-01	LibUnicode: Perform complex Unicode locale alias substitution	Timothy Flynn

2021-09-01	LibUnicode: Canonicalize calendar subtags	Timothy Flynn
	Calendar subtags are a bit of an odd-man-out in that we must match the variants "ethiopic-amete-alem" in that order, without any other variant in the locale. So a separate method is needed for this, and we now defer sorting the variant list until after other canonicalization is done.
2021-09-01	LibUnicode: Canonicalize timezone subtags	Timothy Flynn

2021-09-01	LibUnicode: Canonicalize the subtag "imperial" to "uksystem"	Timothy Flynn

2021-09-01	LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN"	Timothy Flynn

2021-09-01	LibUnicode: Canonicalize the subtag "names" to "prprname"	Timothy Flynn

2021-09-01	LibUnicode: Canonicalize the subtag "yes" to "true"	Timothy Flynn

2021-09-01	LibUnicode: Substitute Unicode locale aliases during canonicalization	Timothy Flynn
	Unicode TR35 defines how locale subtag aliases should be emplaced when converting a locale to canonical form. For most subtags, it is a simple substitution. Language subtags depend on context; for example, the language "sh" should become "sr-Latn", but if the original locale has a script subtag already ("sh-Cyrl"), then only the language subtag of the alias should be taken ("sr-Cyrl"). To facilitate this, we now make two passes when canonicalizing a locale. In the first pass, we convert the LocaleID structure to canonical syntax (where the conversions all happen in-place). In the second pass, we form the canonical string based on the canonical syntax.
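The two-pass approach can be sketched as follows. `SimpleLocaleID` is a hypothetical, pared-down stand-in for a parsed locale (not the real LocaleID structure), and only the "sh" alias mentioned above is modeled; a real implementation would consult the CLDR alias data.

```cpp
#include <string>

// Hypothetical stand-in for a parsed locale. An empty script means the locale
// has no script subtag.
struct SimpleLocaleID {
    std::string language;
    std::string script;
};

// First pass: rewrite the structure in place. Only the sh -> sr-Latn alias is
// modeled here (assumption for illustration).
void substitute_language_alias(SimpleLocaleID& locale)
{
    if (locale.language != "sh")
        return;
    locale.language = "sr";
    // Take the alias's script only if the original locale did not specify one.
    if (locale.script.empty())
        locale.script = "Latn";
}

// Second pass: form the canonical string from the already-canonical structure.
std::string to_canonical_string(SimpleLocaleID const& locale)
{
    return locale.script.empty() ? locale.language : locale.language + "-" + locale.script;
}
```

Here a bare "sh" becomes "sr-Latn", whereas "sh-Cyrl" keeps its own script and becomes "sr-Cyrl".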
2021-09-01	LibJS+LibUnicode: Store parsed Unicode locale data as full strings	Timothy Flynn
	Originally, it was convenient to store the parsed Unicode locale data as views into the original string being parsed. But implementing locale aliases will require mutating the data that was parsed. To prepare for that, store the parsed data as proper strings.
2021-08-30	LibUnicode: Canonicalize locale private use extensions	Timothy Flynn

2021-08-30	LibUnicode: Canonicalize locale extensions	Timothy Flynn

2021-08-30	LibUnicode: Parse locale private use extensions	Timothy Flynn

2021-08-30	LibUnicode: Parse locale extensions of the other extension form	Timothy Flynn

2021-08-30	LibUnicode: Parse locale extensions of the transformed extension form	Timothy Flynn

2021-08-30	LibUnicode: Parse locale extensions of the Unicode locale extension form	Timothy Flynn

2021-08-26	LibUnicode: Implement grammar validators for Unicode TR-35	Timothy Flynn
	ECMA-402 requires validating user input against the EBNF grammar for Unicode locales described in TR-35: https://www.unicode.org/reports/tr35 This commit adds validators for that grammar, as well as other helpers, e.g. to canonicalize a locale string.
2021-08-11	LibUnicode: Handle edge-case script extensions, Common and Inherited	Timothy Flynn
	These script extensions have some peculiar behavior in the Unicode spec. The UCD ScriptExtension file does not contain these scripts. Rather, it is implied that the code points which have these scripts as an extension are the code points that both: 1. Have Common or Inherited as their primary script value 2. Do not have any other script value in their script extension lists Because these are not explicitly listed in the UCD, we must manually form these script extensions.
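The two conditions above amount to a simple predicate. The following is a sketch only: `has_implied_script_extension` and its parameters are hypothetical names, and scripts are represented as plain strings for brevity.

```cpp
#include <string>
#include <vector>

// `primary_script` is the code point's primary script value; `listed_extensions`
// holds the scripts listed for it in the UCD script extension file (empty if the
// code point does not appear there). Hypothetical helper for illustration.
bool has_implied_script_extension(
    std::string const& script,
    std::string const& primary_script,
    std::vector<std::string> const& listed_extensions)
{
    // Only Common and Inherited are implied rather than listed.
    if (script != "Common" && script != "Inherited")
        return false;
    // Both conditions must hold: matching primary script, no other listed scripts.
    return primary_script == script && listed_extensions.empty();
}
```

A code point whose primary script is Common but which lists other scripts as extensions therefore does not receive the Common script extension.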
2021-08-11	LibUnicode: Generate separate tables for scripts and script extensions	Timothy Flynn
	Notice that unlike the note in populate_general_category_unions(), script extensions do indeed have code point ranges which overlap. Thus, this commit adds code to handle that, and hooks it into the GC unions.
2021-08-11	LibUnicode: Generate separate tables for Unicode properties	Timothy Flynn
	Similar to General Categories, this generates separate tables for the Property list.
2021-08-11	LibUnicode: Include Unassigned code points in the Other General Category	Timothy Flynn
	Now that the generator parses unassigned General Category properties, it can include Unassigned (Cn) in the Other (C) category.
2021-08-11	LibUnicode: Generate separate tables for General Category properties	Timothy Flynn
	Previously, each code point's General Category was part of the generated UnicodeData structure. This ultimately presented two problems, one functional and one performance related: * Some General Categories are applied to unassigned code points, for example the Unassigned (Cn) category. Unassigned code points are strictly excluded from UnicodeData.txt, so by relying on that file, the generator is unable to handle these categories. * Lookups for General Categories are slower when searching through the large UnicodeData hash map. Even though lookups are O(1), the hash function turned out to be slower than binary searching through a category-specific table. So, now a table is generated for each General Category. When querying a code point for a category, a binary search is done over the code point ranges in that category's table to check if the code point has that category. Further, General Categories are now parsed from the UCD file DerivedGeneralCategory.txt. This file is a normal "prop list" file and contains the categories for unassigned code points.
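A per-category range lookup of this kind might look like the sketch below. It uses the standard library rather than Serenity's AK containers, the names `CodePointRange` and `code_point_in_ranges` are hypothetical, and it assumes the table is sorted by starting code point with no overlapping ranges.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// An inclusive range of code points, [first, last].
struct CodePointRange {
    uint32_t first;
    uint32_t last;
};

// Returns true if `code_point` falls inside one of `ranges`.
// Assumes `ranges` is sorted by `first` and the ranges do not overlap.
bool code_point_in_ranges(std::vector<CodePointRange> const& ranges, uint32_t code_point)
{
    // Binary search for the first range that ends at or after the code point.
    auto it = std::lower_bound(ranges.begin(), ranges.end(), code_point,
        [](CodePointRange const& range, uint32_t value) { return range.last < value; });
    // That range contains the code point only if it also starts at or before it.
    return it != ranges.end() && it->first <= code_point;
}
```

For a table holding the ranges U+0041..U+005A, U+00C0..U+00D6, and U+00D8..U+00DE, the lookup reports U+0041 and U+00DE as members and U+00D7 as a non-member, in O(log n) comparisons and without hashing.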
2021-07-28	LibUnicode: Handle code points that are both cased and case-ignorable	Timothy Flynn
	Apparently, some code points fit both categories, for example U+0345 (COMBINING GREEK YPOGEGRAMMENI). Handle this fact when determining if a code point is a final code point in a string.
2021-07-28	LibUnicode: Check word break when deciding on case-ignorable code points	Timothy Flynn

2021-07-28	LibUnicode: Check property list when deciding if a code point is cased	Timothy Flynn

2021-07-27	LibUnicode: Begin implementing special Unicode case folding	Timothy Flynn
	This implements unconditional special case folding, and conditional folding for non-locale cases. Worth noting that the only conditional, non-locale special case is for converting an uppercase sigma to lowercase.
2021-07-26	LibUnicode: Introduce a Unicode library for interacting with UCD files	Timothy Flynn
	The Unicode standard publishes the Unicode Character Database (UCD) with information about every code point, such as each code point's upper case mapping. LibUnicode exists to download and parse UCD files at build time and to provide accessors to that data. As a start, LibUnicode includes upper- and lower-case code point converters.