path: root/Userland/Libraries/LibUnicode
Age | Commit message | Author
2022-12-14 | LibUnicode: Fix compilation when the UCD download is disabled | Timothy Flynn
2022-12-06 | Everywhere: Rename to_{string => deprecated_string}() where applicable | Linus Groh
This will make it easier to support both string types at the same time while we convert code, and tracking down remaining uses. One big exception is Value::to_string() in LibJS, where the name is dictated by the ToString AO.
2022-12-06 | AK+Everywhere: Rename String to DeprecatedString | Linus Groh
We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)
2022-11-06 | Meta+LibUnicode: Avoid relocations for static unicode data | Gunnar Beutner
Previously the s_decomposition_mappings variable would refer to other data in s_decomposition_mappings_data. This would cause thousands of avoidable relocations at load time. This saves about 128kB RAM for each process which uses LibUnicode.
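A minimal sketch of the general technique, not the actual generated code: instead of each table entry holding a pointer or span into a second array (which the dynamic loader has to relocate at load time), each entry stores an offset and a length into one flat array. Only the two variable names come from the commit message; the `Mapping` layout, `CodePointView`, and `decomposition_of` are hypothetical and exist purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Flat storage for every decomposition, back to back. No pointers are stored,
// so this array needs no relocations.
static constexpr uint32_t s_decomposition_mappings_data[] = {
    0x0041, 0x0300, // U+00C0 decomposes to U+0041 U+0300
    0x0041, 0x030A, // U+00C5 decomposes to U+0041 U+030A
};

// Hypothetical entry layout: plain integers only (an offset and a length),
// rather than a pointer into the data array.
struct Mapping {
    uint32_t code_point;
    uint32_t offset; // index of the first code point in s_decomposition_mappings_data
    uint32_t length; // number of code points in the decomposition
};

static constexpr Mapping s_decomposition_mappings[] = {
    { 0x00C0, 0, 2 },
    { 0x00C5, 2, 2 },
};

// Hypothetical view type; the pointer is computed at lookup time, not stored.
struct CodePointView {
    uint32_t const* data;
    size_t size;
};

CodePointView decomposition_of(uint32_t code_point)
{
    for (auto const& mapping : s_decomposition_mappings) {
        if (mapping.code_point == code_point)
            return { s_decomposition_mappings_data + mapping.offset, mapping.length };
    }
    return { nullptr, 0 };
}
```

Calling `decomposition_of(0x00C5)` in this sketch yields a view of two code points, while an unmapped code point yields an empty view. A real table would use a sorted array and binary search rather than a linear scan.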
2022-11-01 | Everywhere: Mark dependencies of most targets as PRIVATE | Tim Schumacher
Otherwise, we end up propagating those dependencies into targets that link against that library, which creates unnecessary link-time dependencies. Also included are changes to re-add now-missing dependencies to tools that actually need them.
2022-10-17 | Lagom+CMake: Propagate dependencies for generated custom targets | Andrew Kaster
We have logic for serenity_generated_sources which works well for source files that are specified in GENERATED_SOURCES prior to calling serenity_lib or serenity_bin. However, code generated with invoke_generator, and the LibWeb generators do not always follow the pattern of the IDL and GML files. For the LibWeb generators, we can just add_dependencies to LibWeb at the time we declare the generate_Foo custom target. However for LibLocale, LibTimeZone, and LibUnicode, we don't have the name of the target available, so export the name in a variable to set into GENERATED_SOURCES. To make this work for Lagom, we need to make sure that lagom_lib and serenity_bin in Lagom/CMakeLists.txt call serenity_generated_sources on the target. This enables the Xcode generator on macOS hosts, at least for Lagom.
2022-10-07 | LibUnicode: Fix Hangul syllable composition for specific cases | matcool
This fixes `combine_hangul_code_points`, which would try to combine an LVT syllable with a trailing consonant, resulting in a wrong character. Also added a test for this specific case.
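For context, the rule involved can be sketched with the arithmetic Hangul composition algorithm from the Unicode Standard (section 3.12). This is not the LibUnicode implementation; `compose_hangul` is a hypothetical name, and only the constants and the LV/LVT distinction come from the standard. The key point is that a trailing consonant may only be added to an LV syllable, that is, one whose trailing-consonant index is still zero.

```cpp
#include <cstdint>

// Constants from the Unicode Standard's Hangul syllable composition algorithm.
constexpr uint32_t SBase = 0xAC00; // first precomposed syllable
constexpr uint32_t LBase = 0x1100; // first leading consonant
constexpr uint32_t VBase = 0x1161; // first vowel
constexpr uint32_t TBase = 0x11A7; // one before the first trailing consonant
constexpr uint32_t LCount = 19;
constexpr uint32_t VCount = 21;
constexpr uint32_t TCount = 28;
constexpr uint32_t SCount = LCount * VCount * TCount; // 11172

// Tries to compose two adjacent code points; returns false if they do not combine.
bool compose_hangul(uint32_t a, uint32_t b, uint32_t& out)
{
    // Leading consonant + vowel -> LV syllable.
    if (a >= LBase && a < LBase + LCount && b >= VBase && b < VBase + VCount) {
        out = SBase + ((a - LBase) * VCount + (b - VBase)) * TCount;
        return true;
    }

    // LV syllable + trailing consonant -> LVT syllable.
    // The modulo check rejects syllables that already carry a trailing
    // consonant (LVT), which must not absorb another one.
    if (a >= SBase && a < SBase + SCount && (a - SBase) % TCount == 0
        && b > TBase && b < TBase + TCount) {
        out = a + (b - TBase);
        return true;
    }

    return false;
}
```

With this sketch, U+1100 followed by U+1161 composes to U+AC00, and U+AC00 followed by U+11A8 composes to U+AC01, but U+AC01 followed by another U+11A8 is left alone.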
2022-10-06 | LibUnicode: Add to-and-from string converters for NormalizationForm | Timothy Flynn
2022-10-06 | LibUnicode: Add decomposition mappings and Unicode normalization | matcool
The mappings are exposed via `Unicode::code_point_decomposition(u32)` and `Unicode::code_point_decompositions()`, the latter being useful for reverse searching a code point from its decomposition. The normalization code does not make use of `Quick_Check` props (https://www.unicode.org/reports/tr44/#Decompositions_and_Normalization), meaning no quick check optimizations.
2022-09-11 | LibUnicode: Parse and generate custom emoji added for SerenityOS | Timothy Flynn
Parse emoji from emoji-serenity.txt to allow displaying their names and grouping them together in the EmojiInputDialog. This also adds an "Unknown" value to the EmojiGroup enum. This will be useful for emoji that aren't found in the UCD, or for when UCD downloads are disabled.
2022-09-08 | LibUnicode: Parse and generate emoji code point data | Timothy Flynn
According to TR #51, the "best definition of the full set [of emojis] is in the emoji-test.txt file". This defines not only the emoji themselves, but the order in which they should be displayed, and what "group" of emojis they belong to.
2022-09-05 | LibLocale: Move locale source files to the LibLocale library | Timothy Flynn
Everything is now set up to create the LibLocale library and link it where needed.
2022-09-05 | LibUnicode: Generate a separate Locale enumeration for special casing | Timothy Flynn
The UCD only cares about a few locales for special casing rules (az, lt, and tr). Unfortunately, LibUnicode cannot use LibLocale once the libraries are separate because LibLocale will need to use LibUnicode for many more things; thus there would be a circular dependency. Instead, just generate the small enum needed for this one use case.
2022-09-05 | LibLocale: Move locale source files to the LibLocale folder | Timothy Flynn
These are still included in LibUnicode, but this updates their location and the include paths of other files which include them.
2022-09-05 | Userland: Move files destined for LibLocale to the Locale namespace | Timothy Flynn
2022-09-05 | LibUnicode+LibJS: Move Unicode::get_available_currencies() to Locale.h | Timothy Flynn
This is generated by GenerateLocaleData, which will soon be in the Locale namespace. Move it out of CurrencyCode.h, as that will continue to live in the Unicode namespace.
2022-09-05 | LibLocale+LibUnicode: Move generated CLDR data files to LibLocale folder | Timothy Flynn
They are still included into LibUnicode, but this moves their generated location to be under LibLocale.
2022-09-05 | LibUnicode+Userland: Migrate generated CLDR data to LibLocaleData | Timothy Flynn
Currently, LibUnicodeData contains the generated UCD and CLDR data. Move the UCD data to the main LibUnicode library, and rename LibUnicodeData to LibLocaleData. This is another preparatory change to migrate to LibLocale.
2022-09-05 | LibUnicode: Move CLDR data generators to a LibLocale subfolder | Timothy Flynn
To prepare for placing all CLDR generated data in a new library, LibLocale, this moves the code generators for the CLDR data to the LibLocale subfolder.
2022-09-05 | LibUnicode: Fully qualify use of AK::Variant in Locale.h | Timothy Flynn
The generated locale data contains an enum also named Variant, as variants are part of locale strings. This hasn't been an issue, but as includes are reordered, the order in which the enum and AK::Variant are included may cause an ambiguity error.
2022-08-25 | LibUnicode: Fix compilation when ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF | Timothy Flynn
2022-07-21 | LibUnicode: Generate per-locale data for the "noon" fixed day period | Timothy Flynn
Note that not all locales have this day period.
2022-07-20 | LibUnicode: Implement the range pattern processing algorithm | Timothy Flynn
This algorithm is to inject spacing around the range separator under certain conditions. For example, in en-US, the range [3, 5] should be formatted as "3–5" if unitless, but as "$3 – $5" for currency.
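The spacing decision can be sketched as follows. This is an illustration under stated assumptions, not the LibUnicode code: the condition used here (add spaces when the lower value does not end in a digit or the upper value does not start with one) is not spelled out in the commit message and is assumed from the range pattern processing rules in TR-35; the sketch also only recognizes ASCII digits and assumes the locale's range pattern has no whitespace of its own. `format_range` is a hypothetical name.

```cpp
#include <string>

static bool is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

// Joins two already-formatted values with the locale's range separator.
// Assumption: spaces are injected around the separator when the character
// adjacent to it on either side is not a digit (e.g. a currency symbol).
std::string format_range(std::string const& lower, std::string const& upper, std::string const& separator)
{
    bool needs_spacing = !lower.empty() && !upper.empty()
        && (!is_ascii_digit(lower.back()) || !is_ascii_digit(upper.front()));

    if (needs_spacing)
        return lower + " " + separator + " " + upper;
    return lower + separator + upper;
}
```

Under these assumptions, joining "3" and "5" with an en dash gives "3–5", while joining "$3" and "$5" gives "$3 – $5", because the upper value begins with a currency symbol rather than a digit.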
2022-07-20 | LibUnicode: Generate per-locale approximately & range separator symbols | Timothy Flynn
2022-07-15 | LibUnicode: Remove obsolete Unicode::get_default_number_system | Timothy Flynn
This has been superseded by get_preferred_keyword_value_for_locale, which doesn't require allocating a Vector just to return its first element.
2022-07-15 | LibUnicode: Generate a method to lookup locale-preferred keyword values | Timothy Flynn
2022-07-15 | LibUnicode: Generate a method to lookup available keyword values | Timothy Flynn
2022-07-15 | LibUnicode: Generate available values for the keywords co, kf, kn, hc | Timothy Flynn
This also ensures we only include values we actually support in the generated list of available values.
2022-07-12 | LibUnicode: Parse and generate per-locale plural ranges | Timothy Flynn
2022-07-08 | LibUnicode: Remove now-unused Unicode::select_pattern_with_plurality | Timothy Flynn
2022-07-08 | LibUnicode: Replace NumberFormat::Plurality with Unicode::PluralCategory | Timothy Flynn
To prepare for using plural rules within number & duration format, this removes the NumberFormat::Plurality enumeration. This also adds PluralCategory::ExactlyZero & PluralCategory::ExactlyOne. These are used in locales like French, where PluralCategory::One really means any value from 0.00 to 1.99. PluralCategory::ExactlyOne means only the value 1, as the name implies. These exact rules are not known by the general plural rules, they are explicitly for number / currency format.
2022-07-08 | LibJS+LibUnicode: Do not generate the PluralCategory enum | Timothy Flynn
The PluralCategory enum is currently generated for plural rules. Instead of generating it, this moves the enum to the public LibUnicode header. While it was nice to auto-discover these values, they are well defined by TR-35, and we will need their values from within the number format code generator (which can't rely on the plural rules generator having run yet). Further, number format will require additional values in the enum that plural rules doesn't know about.
2022-07-08 | LibJS: Use Intl.PluralRules within Intl.RelativeTimeFormat | Timothy Flynn
The Polish test cases added here cover previous failures from test262, due to the way that 0 is specified to be "many" in Polish.
2022-07-08 | LibUnicode: Generate a list of available plural categories per locale | Timothy Flynn
Separate lists are generated for cardinal and ordinal form.
2022-07-08 | LibUnicode: Parse and generate per-locale plural rules from the CLDR | Timothy Flynn
Plural rules in the CLDR are of the form:
    "cs": {
        "pluralRule-count-one": "i = 1 and v = 0 @integer 1",
        "pluralRule-count-few": "i = 2..4 and v = 0 @integer 2~4",
        "pluralRule-count-many": "v != 0 @decimal 0.0~1.5, 10.0, 100.0 ...",
        "pluralRule-count-other": "@integer 0, 5~19, 100, 1000, 10000 ..."
    }
The syntax is described here: https://unicode.org/reports/tr35/tr35-numbers.html#Plural_rules_syntax
There are up to 2 sets of rules for each locale, a cardinal set and an ordinal set. The approach here is to generate a C++ function for each set of rules. Each condition in the rules (e.g. "i = 1 and v = 0") is transpiled to a C++ if-statement within its function. Then lookup tables are generated to match locales to their generated functions.
NOTE: -Wno-parentheses-equality is added to the LibUnicodeData compile flags because the generated plural rules have lots of extra parentheses (because e.g. we need to selectively negate and combine rules). The code to generate only exactly the right number of parentheses is quite hairy, so this just tells the compiler to ignore the extras.
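As a rough illustration of what such a transpiled function could look like for the Czech cardinal rules quoted above: this is a hand-written sketch, not the generator's actual output. The `Operands` struct, the reduced `PluralCategory` enum, and the function name `cs_cardinal` are all hypothetical; only the rule conditions and the operand meanings (`i` is the integer part of the number, `v` is the number of visible fraction digits) come from the CLDR data and TR-35.

```cpp
#include <cstdint>

// Hypothetical, reduced enum; the real PluralCategory has more values.
enum class PluralCategory {
    Other,
    One,
    Few,
    Many,
};

// Hypothetical operand bundle using the TR-35 operand names.
struct Operands {
    uint64_t i; // integer digits of the number
    uint32_t v; // count of visible fraction digits
};

// One if-statement per rule, checked in the order the rules are listed.
// The redundant parentheses mirror the style described in the NOTE above.
PluralCategory cs_cardinal(Operands ops)
{
    if ((ops.i == 1) && (ops.v == 0)) // "i = 1 and v = 0"
        return PluralCategory::One;
    if (((ops.i >= 2) && (ops.i <= 4)) && (ops.v == 0)) // "i = 2..4 and v = 0"
        return PluralCategory::Few;
    if ((ops.v != 0)) // "v != 0"
        return PluralCategory::Many;
    return PluralCategory::Other; // the "other" rule has no condition
}
```

For instance, the number 3 has operands `{ 3, 0 }` and falls into the "few" category, while 1.5 has operands `{ 1, 1 }` and falls into "many".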
2022-07-06 | LibUnicode: Generate per-region week data | Timothy Flynn
This includes:
* The minimum number of days in a week for that week to count as the first week of a new year.
* The day to be shown as the first day of the week in a calendar.
* The start/end days of the weekend.
Like the existing hour cycle data, week data is presented per-region in the CLDR, rather than per-locale. The method to add likely subtags to a locale to perform region lookups is the same.
The list of regions in the CLDR for hour cycle, minimum days, first day, and weekend days are quite different. So rather than changing the existing HourCycleRegion enum to a generic Region enum, we generate separate enums for each of the week data fields. This allows each lookup into these fields to remain simple array-based index access, without any "jumps" for regions that don't have CLDR data for a field.
2022-07-06 | LibUnicode: Generate per-locale text layout information | Timothy Flynn
Currently contains just each locale's character order, but is set up to easily add other text layout fields from the CLDR if ECMA-402 eventually requires them.
2022-07-06 | AK: Use an enum instead of a bool for String::replace(all_occurences) | DexesTTP
This commit has no behavior changes. In particular, this does not fix any of the wrong uses of the previous default parameter (which used to be 'false', meaning "only replace the first occurrence in the string"). It simply replaces the default uses with String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
2022-07-01 | LibUnicode: Extract the timeSeparator numeric symbol from CLDR | Idan Horowitz
This will be used by Intl.DurationFormat
2022-04-07 | LibUnicode: Upgrade to CLDR version 41.0.0 | Timothy Flynn
Release notes: https://cldr.unicode.org/index/downloads/cldr-41
Note that the HourCycleRegion enum now contains 272 entries, and thus needs to be bumped from u8 to u16.
2022-02-16 | LibUnicode: Use BCP 47 data to filter valid calendar names | Timothy Flynn
2022-02-16 | LibUnicode: Use BCP 47 data to filter valid numbering system names | Timothy Flynn
There isn't too much of an effective difference here other than that the BCP 47 data contains some aliases we would otherwise not handle.
2022-02-16 | LibUnicode: Use BCP 47 data to generate available calendars and numbers | Timothy Flynn
BCP 47 will be the single source of truth for known calendar and number system keywords, and their aliases (e.g. "gregory" is an alias for "gregorian"). Move the generation of available keywords to where we parse the BCP 47 data, so that hard-coded aliases may be removed from other generators.
2022-02-16 | LibJS+LibUnicode: Parse Unicode keywords from the BCP 47 CLDR package | Timothy Flynn
We have a fair amount of hard-coded keywords / aliases that can now be replaced with real data from BCP 47. As a result, this also changes the awkward way we were previously generating keys. Before, we were more or less generating keywords as a CSV list of keys, e.g. for the "nu" key, we'd generate "latn,arab,grek" (ordered by locale preference). Then at runtime, we'd split on the comma. We now just generate spans of keywords directly.
2022-02-15 | Meta+LibUnicode: Download and parse Unicode block properties | thankyouverycool
This parses Blocks.txt for CharacterType properties and creates a global display array for use in apps.
2022-01-31 | LibUnicode: Implement sentence segmentation | Idan Horowitz
2022-01-31 | LibUnicode: Implement word segmentation | Idan Horowitz
2022-01-31 | LibUnicode: Implement grapheme segmentation | Idan Horowitz
2022-01-31 | LibUnicode: Download and parse {Grapheme,Word,Sentence} break props | Idan Horowitz
2022-01-31 | Everywhere: Update copyrights with my new serenityos.org e-mail :^) | Timothy Flynn