summaryrefslogtreecommitdiff
path: root/Tests/LibUnicode
AgeCommit message (Collapse)Author
2023-03-05LibUnicode: Detect ZWJ sequences when filtering by emoji presentationTimothy Flynn
This was preventing some unqualified emoji sequences from rendering properly, such as the custom SerenityOS flag. We rendered the flag correctly when given the fully qualified sequence: U+1F3F3 U+FEOF U+200D U+1F41E But were not detecting the unqualified sequence as an emoji when also filtering for emoji-presentation sequences: U+1F3F3 U+200D U+1F41E
2023-02-25LibUnicode: Add a unit test for Unicode grapheme and word segmentationTimothy Flynn
These include tests for previously broken boundary conditions.
2023-02-24LibUnicode: Add a method to check if a code point could start an emojiTimothy Flynn
2023-02-15LibUnicode: Fix typos causing text segmentation on mid-word punctuationTimothy Flynn
For example the words "can't" and "32.3" should not have boundaries detected on the "'" and "." code points, respectively. The String test cases fixed here are because "b'ar" is now considered one word.
2023-01-18LibUnicode: Parse and generate case folding code point dataTimothy Flynn
Case folding rules have a similar mapping style as special casing rules, where one code point may map to zero or more case folding rules. These will be used for case-insensitive string comparisons. To see how case folding can differ from other casing rules, consider "ß" (U+00DF): >>> "ß".lower() 'ß' >>> "ß".upper() 'SS' >>> "ß".title() 'Ss' >>> "ß".casefold() 'ss'
2023-01-16LibUnicode: Support full case folding for titlecasing a stringTimothy Flynn
Unicode declares that to titlecase a string, the first cased code point after each word boundary should be transformed to its titlecase mapping. All other codepoints are transformed to their lowercase mapping.
2023-01-16LibUnicode: Generate simple case folding mappings for titlecaseTimothy Flynn
Note we already generate the special case foldings for titlecase.
2023-01-09LibUnicode+LibJS: Propagate OOM from Unicode normalizationTimothy Flynn
2023-01-09LibUnicode+LibJS+LibWeb: Propagate OOM from Unicode case transformationsTimothy Flynn
2022-10-07LibUnicode: Update code point ideographic replacements for Unicode 15Timothy Flynn
2022-10-07LibUnicode: Fix Hangul syllable composition for specific casesmatcool
This fixes `combine_hangul_code_points` which would try to combine a LVT syllable with a trailing consonant, resulting in a wrong character. Also added a test for this specific case.
2022-10-06Tests: Add tests for LibUnicode's normalizematcool
2022-09-05LibLocale: Move locale source files to the LibLocale libraryTimothy Flynn
Everything is now setup to create the LibLocale library and link it where needed.
2022-09-05LibLocale: Move locale test files to the LibLocale folderTimothy Flynn
2022-09-05LibLocale: Move locale source files to the LibLocale folderTimothy Flynn
These are still included in LibUnicode, but this updates their location and the include paths of other files which include them.
2022-09-05Userland: Move files destined for LibLocale to the Locale namespaceTimothy Flynn
2022-09-05LibUnicode+Userland: Migrate generated CLDR data to LibLocaleDataTimothy Flynn
Currently, LibUnicodeData contains the generated UCD and CLDR data. Move the UCD data to the main LibUnicode library, and rename LibUnicodeData to LibLocaleData. This is another prepatory change to migrate to LibLocale.
2022-07-12Everywhere: Add sv suffix to strings relying on StringView(char const*)sin-ack
Each of these strings would previously rely on StringView's char const* constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.
2022-02-15Tests: Add Unicode tests for CharacterType block propertiesthankyouverycool
2022-01-31Everywhere: Update copyrights with my new serenityos.org e-mail :^)Timothy Flynn
2022-01-19LibJS+LibUnicode: Return the appropriate time zone name depending on DSTTimothy Flynn
2022-01-12LibUnicode: Swap the preferred order of standard time zone display namesTimothy Flynn
Our generator is currently preferring the DST variant of the time zone display names over the non-DST variant. LibTimeZone currently does not have DST support, and operates in a mode that basically assumes DST does not exist. Swap the display names for now just to be consistent until we have DST support. Note we will need to generate both of these variants and select the appropriate one at runtime once we have DST support.
2022-01-11LibUnicode: Parse and generate long and short generic time zone namesTimothy Flynn
This implements the CalendarPatternStyle::{Long,Short}Generic styles of time zone name formatting.
2022-01-11LibUnicode: Fall back to GMT offset when a time zone name is unavailableTimothy Flynn
The following table in TR-35 includes a web of fall back rules when the requested time zone style is unavailable: https://unicode.org/reports/tr35/tr35-dates.html#dfst-zone Conveniently, the subset of styles supported by ECMA-402 (and therefore LibUnicode) all either fall back to GMT offset or to a style that is unsupported but itself falls back to GMT offset.
2022-01-11LibUnicode: Implement TR-35's localized GMT offset formattingTimothy Flynn
This adds an API to use LibTimeZone to convert a time zone such as "America/New_York" to a GMT offset string like "GMT-5" (short form) or "GMT-05:00" (long form).
2022-01-06LibUnicode: Do not assume time zones & meta zones have a 1-to-1 mappingTimothy Flynn
The generator parses metaZones.json to form a mapping of meta zones to time zones (AKA "golden zone" in TR-35). This parser errantly assumed this was a 1-to-1 mapping.
2022-01-04Tests: Link some tests directly against LibUnicodeDataTimothy Flynn
These were missed in 565a880ce5a14bac817c73916e91ebfa04c8b99b. This wasn't an issue because these tests don't pledge/unveil anything, so they could happily dlopen() the library at runtime. But this is now needed in order to migrate LibUnicode towards weak symbols instead.
2021-11-30LibUnicode: Support code point names that apply to ranges of code pointsTimothy Flynn
For example, consider the following adjacent entries in UnicodeData.txt: 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;; 4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;; Our current implementation would assign the display name "CJK Ideograph Extension A" to code points U+3400 & U+4DBF, but not to the code points in between. Not only should those code points be assigned a name, but the Unicode spec also has formatting rules on what the names should be (the names for these ranged code points are not as they appear in UnicodeData.txt). The spec also defines names for code point ranges that actually are listed individually in UnicodeData.txt. For example: 2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;; 2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;; 2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;; Code points are only coalesced into a range if all fields after the name are equivalent. Our parser will insert the range and its name formatting pattern when it comes across the first code point in that range, then ignore other code points in that range. This reduces the number of names we generated by nearly 2,000.
2021-11-19LibUnicode: Support locales-without-script aliases for ECMA-402Timothy Flynn
As noted by ECMA-402, if a supported locale contains all of a language, script, and region subtag, then the implementation must also support the locale without the script subtag. The most complicated example of this is the zh-TW locale. The list of locales in the CLDR database does not include zh-TW or its maximized zh-Hant-TW variant. Instead, it inlcudes the zh-Hant locale. However, zh-Hant-TW is listed in the default-content locale list in the cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We must then also support the zh-Hant-TW alias without the script subtag: zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite heavily tested by test262.
2021-11-09LibUnicode: Parse the CLDR's defaultContent.json locale listTimothy Flynn
This file contains the list of locales which default to their parent locale's values. In the core CLDR dataset, these locales have their own files, but they are empty (except for identity data). For example: https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml In the JSON export, these files are excluded, so we currently are not recognizing these locales just by iterating the locale files. This is a prerequisite for upgrading to CLDR version 40. One of these default-content locales is the popular "en-US" locale, which defaults to "en" values. We were previously inferring the existence of this locale from the "en-US-POSIX" locale (many implementations, including ours, strip variants such as POSIX). However, v40 removes the "en-US-POSIX" locale entirely, meaning that without this change, we wouldn't know that "en-US" exists (we would default to "en"). For more detail on this and other v40 changes, see: https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba
2021-09-08LibUnicode+LibJS: Store locale keyword values as a single stringTimothy Flynn
Previously, LibUnicode would store the values of a keyword as a Vector. For example, the locale "en-u-ca-abc-def" would have its keyword "ca" stored as {"abc, "def"}. Then, canonicalization would occur on each of the elements in that Vector. This is incorrect because, for example, the keyword value "true" should only be dropped if that is the entire value. That is, the canonical form of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" does not change for canonicalization. However, we would canonicalize that locale as "en-u-kb-abc".
2021-09-06LibUnicode: Implement locale-aware BEFORE_DOT special casingTimothy Flynn
Note that the algorithm in the Unicode spec is for checking that a code point precedes U+0307, but the special casing condition NotBeforeDot is interested in the inverse of this rule.
2021-09-06LibUnicode: Implement locale-aware MORE_ABOVE special casingTimothy Flynn
2021-09-06LibUnicode: Implement locale-aware AFTER_SOFT_DOTTED special casingTimothy Flynn
2021-09-06LibUnicode: Implement locale-aware AFTER_I special casingTimothy Flynn
2021-09-02LibUnicode: Add lexer to test if a string matches the "type" productionTimothy Flynn
2021-09-02Tests: Remove all file(GLOB) from CMakeLists in TestsAndrew Kaster
Using a file(GLOB) to find all the test files in a directory is an easy hack to get things started, but has some drawbacks. Namely, if you add a test, it won't be found again without re-running CMake. `ninja` seems to do this automatically, but it would be nice to one day stop seeing it rechecking our globbed directories.
2021-09-01LibUnicode: Resolve the most likely territory alias when there are manyTimothy Flynn
2021-09-01LibUnicode: Perform complex Unicode locale alias substitutionTimothy Flynn
2021-09-01LibUnicode: Canonicalize calendar subtagsTimothy Flynn
Calendar subtags are a bit of an odd-man-out in that we must match the variants "ethiopic-amete-alem" in that order, without any other variant in the locale. So a separate method is needed for this, and we now defer sorting the variant list until after other canonicalization is done.
2021-09-01LibUnicode: Canonicalize timezone subtagsTimothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "imperial" to "uksystem"Timothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN"Timothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "names" to "prprname"Timothy Flynn
2021-09-01LibUnicode: Canonicalize the subtag "yes" to "true"Timothy Flynn
2021-09-01LibUnicode: Substitute Unicode locale aliases during canonicalizationTimothy Flynn
Unicode TR35 defines how locale subtag aliases should be emplaced when converting a locale to canonical form. For most subtags, it is a simple substitution. Language subtags depend on context; for example, the language "sh" should become "sr-Latn", but if the original locale has a script subtag already ("sh-Cyrl"), then only the language subtag of the alias should be taken ("sr-Latn"). To facilitate this, we now make two passes when canonicalizing a locale. In the first pass, we convert the LocaleID structure to canonical syntax (where the conversions all happen in-place). In the second pass, we form the canonical string based on the canonical syntax.
2021-09-01LibJS+LibUnicode: Store parsed Unicode locale data as full stringsTimothy Flynn
Originally, it was convenient to store the parsed Unicode locale data as views into the original string being parsed. But to implement locale aliases will require mutating the data that was parsed. To prepare for that, store the parsed data as proper strings.
2021-08-30LibUnicode: Canonicalize locale private use extensionsTimothy Flynn
2021-08-30LibUnicode: Canonicalize locale extensionsTimothy Flynn
2021-08-30LibUnicode: Parse locale private use extensionsTimothy Flynn