path: root/Userland/Libraries/LibUnicode/Locale.cpp
Age | Commit message | Author
2022-02-16 | LibUnicode: Use BCP 47 data to generate available calendars and numbers | Timothy Flynn
BCP 47 will be the single source of truth for known calendar and number system keywords, and their aliases (e.g. "gregory" is an alias for "gregorian"). Move the generation of available keywords to where we parse the BCP 47 data, so that hard-coded aliases may be removed from other generators.
2022-02-16 | LibJS+LibUnicode: Parse Unicode keywords from the BCP 47 CLDR package | Timothy Flynn
We have a fair amount of hard-coded keywords / aliases that can now be replaced with real data from BCP 47. As a result, this also changes the awkward way we were previously generating keys. Before, we were more or less generating keywords as a CSV list of keys, e.g. for the "nu" key, we'd generate "latn,arab,grek" (ordered by locale preference). Then at runtime, we'd split on the comma. We now just generate spans of keywords directly.
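The difference between the two approaches described above can be pictured with a minimal sketch. Everything named here (`split_keywords`, `KeywordSpan`, `s_nu_keywords`, `keywords_for_nu`) is hypothetical and uses the standard library rather than AK types; the generated LibUnicode code is not shown in this log.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Old approach (hypothetical): one CSV string per key, split at runtime.
std::vector<std::string_view> split_keywords(std::string_view csv)
{
    std::vector<std::string_view> result;
    while (!csv.empty()) {
        auto comma = csv.find(',');
        result.push_back(csv.substr(0, comma));
        if (comma == std::string_view::npos)
            break;
        csv.remove_prefix(comma + 1);
    }
    return result;
}

// Minimal stand-in for a span type: a pointer plus a length.
struct KeywordSpan {
    std::string_view const* data;
    std::size_t size;
};

// New approach (hypothetical): keywords are generated as an array and
// handed out as a span, so nothing needs to be split at runtime.
static constexpr std::string_view s_nu_keywords[] = { "latn", "arab", "grek" };

KeywordSpan keywords_for_nu()
{
    return { s_nu_keywords, sizeof(s_nu_keywords) / sizeof(s_nu_keywords[0]) };
}
```

In this sketch the span-based accessor does no parsing or allocation at lookup time, which is presumably part of what made the CSV form awkward.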
2022-01-31 | Everywhere: Update copyrights with my new serenityos.org e-mail :^) | Timothy Flynn
2022-01-25 | LibJS+LibUnicode: Convert Intl.ListFormat to use Unicode::Style | Timothy Flynn
Remove ListFormat's own definition of the Style enum, which was further duplicated by a generated ListPatternStyle enum with the same values.
2022-01-25 | LibUnicode: Add helper methods to convert a Style to and from a string | Timothy Flynn
This conversion is duplicated a few times in our Intl implementation, so let's just define these once and be done with it.
2022-01-13 | LibUnicode: Add a method to combine locale subtags into a display string | Timothy Flynn
This is just a convenience wrapper around the underlying generated APIs.
2022-01-13 | LibUnicode: Parse and generate locale display patterns | Timothy Flynn
These patterns indicate how to display locale strings when that locale contains multiple subtags. For example, "en-US" would be displayed as "English (United States)".
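As a rough illustration of how such a pattern might be applied, the sketch below fills a "{0} ({1})" pattern with already-localized names. The pattern string and both function names are assumptions made for this example; the real patterns come from the CLDR and the real accessors are generated.

```cpp
#include <string>
#include <string_view>
#include <utility>

// Replace the first occurrence of `key` in `pattern` with `value`.
std::string replace_key(std::string pattern, std::string_view key, std::string_view value)
{
    auto index = pattern.find(key);
    if (index != std::string::npos)
        pattern.replace(index, key.size(), value);
    return pattern;
}

// Assumed display pattern: {0} is the language name, {1} the territory name.
std::string format_locale_display(std::string_view language, std::string_view territory)
{
    std::string pattern = "{0} ({1})";
    pattern = replace_key(std::move(pattern), "{0}", language);
    return replace_key(std::move(pattern), "{1}", territory);
}
```

The sketch assumes the language and territory names do not themselves contain pattern keys.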
2022-01-13 | LibJS+LibUnicode: Remove unnecessary locale currency mapping wrapper | Timothy Flynn
Before LibUnicode generated methods were weakly linked, we had a public method (get_locale_currency_mapping) for retrieving currency mappings. That method invoked one of several style-specific methods that only existed in the generated UnicodeLocale. One caveat of weakly linked functions is that every such function must have a public declaration. The result is that each of those styled methods is declared publicly, which makes the wrapper redundant because it is just as easy to invoke the method for the desired style.
2022-01-13 | LibUnicode: Parse and generate locale display names for date fields | Timothy Flynn
2022-01-13 | LibUnicode: Parse and generate locale display names for calendars | Timothy Flynn
Note there's a bit of an unfortunate duplication in the calendar enum generated by UnicodeLocale and the existing enum generated by UnicodeDateTimeFormat. The former contains every calendar known to the CLDR, whereas the latter contains the calendars we've actually parsed for DateTimeFormat (currently only Gregorian). The new enum generated here can be removed once DateTimeFormat knows about all calendars.
2022-01-04 | LibJS+LibUnicode: Convert UnicodeLocale to link with weak symbols | Timothy Flynn
2022-01-04 | LibUnicode: Convert UnicodeDateTimeFormat to link with weak symbols | Timothy Flynn
2021-12-21 | LibUnicode: Dynamically load the generated UnicodeLocale symbols | Timothy Flynn
2021-11-29 | LibUnicode: Add special handling of hour cycle (hc) Unicode keywords | Timothy Flynn
For other keywords, allowed values per locale are generated at compile time. But since the CLDR doesn't present hour cycles on a per-locale basis, and hour cycle lookups depend on runtime data, we must handle hour cycle keyword lookups differently than other keywords.
2021-11-29 | LibJS+LibUnicode: Separate number formatting methods from Locale.h | Timothy Flynn
Currently, we generate separate data files for locale and number format related tables/methods, but provide public accessors for all of the data in one Locale.h file. Rather than continuing this trend for date-time, relative time, etc. formatting, it's a bit easier to reason about if the public accessors are also in separate files.
2021-11-16 | LibUnicode: Parse and generate CLDR unit data for Intl.NumberFormat | Timothy Flynn
The units data is in another CLDR package, cldr-units.
2021-11-16 | LibUnicode: Tweak the definition of the plurality "many" | Timothy Flynn
As noted at the top of this method, this is a naive implementation of the Unicode plurality specification. But for now, we should tweak the definition of "many" to be "more than 2" (which is what I had in mind when I wrote this, but forgot about fractions).
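The method body is not part of this log, so the following is only a guess at the shape of such a naive selection, with "many" defined as "more than 2". The function name and the returned category strings are assumptions for illustration.

```cpp
#include <string_view>

// Naive plural category selection (hypothetical sketch); this is not the
// full set of Unicode plural rules, which vary by locale.
std::string_view select_plurality(double number)
{
    if (number == 0)
        return "zero";
    if (number == 1)
        return "one";
    if (number == 2)
        return "two";
    // "More than 2" covers fractional values such as 2.5 as well.
    if (number > 2)
        return "many";
    return "other";
}
```

With a comparison against 2 rather than a check for whole numbers of 3 or more, a value like 2.5 lands in "many" instead of falling through.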
2021-11-16 | LibJS+LibUnicode: Rename method to select a NumberFormat plurality | Timothy Flynn
Instead of currency pattern lookups within select_currency_unit_pattern, rename the method to select_pattern_with_plurality and accept any list of patterns. This method will be needed for units.
2021-11-14 | LibUnicode: Generate primary and secondary number grouping sizes | Timothy Flynn
Most locales have a single grouping size (the number of integer digits to be written before inserting a grouping separator). However, some have a primary and a secondary size. We parse the primary size as the size used for the least significant integer digits, and the secondary size for the most significant.
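To make the primary/secondary distinction concrete, here is a small sketch that groups a string of integer digits. The function name, its signature and the default separator are assumptions for this example and do not reflect how LibJS actually consumes the generated sizes.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Insert grouping separators into a string of integer digits. `primary` is
// the size of the least significant group; `secondary` is the size of every
// group to its left.
std::string group_integer_digits(std::string_view digits, std::size_t primary, std::size_t secondary, char separator = ',')
{
    if (primary == 0 || secondary == 0 || digits.size() <= primary)
        return std::string(digits);

    // The least significant `primary` digits form the last group.
    std::size_t split = digits.size() - primary;
    std::string result(digits.substr(split));

    // The remaining digits are grouped by `secondary`, right to left.
    while (split > 0) {
        std::size_t size = split < secondary ? split : secondary;
        split -= size;
        result.insert(0, 1, separator);
        result.insert(0, digits.substr(split, size));
    }
    return result;
}
```

With a primary size of 3 and a secondary size of 2, the digits "1234567" come out as "12,34,567"; with both sizes set to 3 they come out as "1,234,567".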
2021-11-13 | LibJS+LibUnicode: Don't remove {currency} keys in GetNumberFormatPattern | Timothy Flynn
In order to implement Intl.NumberFormat.prototype.formatToParts, do not replace {currency} keys in the format pattern before ECMA-402 tells us to. Otherwise, the array returned by formatToParts will not contain the expected currency key. Early replacement was done to avoid resolving the currency display more than once, as it involves a couple of round trips to search through LibUnicode data. So this adds a non-standard method to NumberFormat to do this resolution and cache the result. Another side effect of this change is that LibUnicode must replace unit format patterns of the form "{0} {1}" during code generation. These were previously skipped during code generation because LibJS would just replace the keys with the currency display at runtime. But now that the currency display injection is delayed, any {0} or {1} keys in the format pattern will cause PartitionNumberPattern to abort.
2021-11-13 | LibUnicode: Handle all space code points when creating currency patterns | Timothy Flynn
Previously, we were checking if the code point immediately before/after the {currency} key was U+00A0 (non-breaking space). Instead, to handle other spacing code points, we must check if the surrounding code point has the separator general category.
2021-11-13 | LibUnicode: Remove GeneralCategory::Symbol string lookup | Timothy Flynn
When I originally wrote this method, I had it in LibJS, where we can't refer to the GeneralCategory enumeration directly. This is a big TODO: anyone outside of LibUnicode can't assume the generated enumerations exist and must get these values by string lookup. But this function ended up living in LibUnicode, which can reference the enumeration.
2021-11-13 | LibJS+LibUnicode: Fully implement currency number formatting | Timothy Flynn
Currencies are a bit strange; the layout of currency data in the CLDR is not particularly compatible with what ECMA-402 expects. For example, the currency formats in the "en" and "ar" locales for the Latin script are: en: "¤#,##0.00"; ar: "¤\u00A0#,##0.00". Note how the "ar" locale has a non-breaking space after the currency symbol (¤), but "en" does not. This does not mean that this space will appear in the "ar"-formatted string, nor does it mean that a space won't appear in the "en"-formatted string. This is a runtime decision based on the currency display chosen by the user ("$" vs. "USD" vs. "US dollar") and other rules in the Unicode TR-35 spec. ECMA-402 shies away from the nuances here with "implementation-defined" steps. LibUnicode will store the data parsed from the CLDR however it is presented; making decisions about spacing, etc. will occur at runtime based on user input.
2021-11-13 | LibJS+LibUnicode: Generate all styles of currency localizations | Timothy Flynn
Currently, LibUnicode is only parsing and generating the "long" style of currency display names. However, the CLDR contains "short" and "narrow" forms as well that need to be handled. Parse these, and update LibJS to actually respect the "style" option provided by the user for displaying currencies with Intl.DisplayNames. Note: There are some discrepancies between the engines on how style is handled. In particular, running new Intl.DisplayNames('en', {type:'currency', style:'narrow'}).of('usd') gives: SpiderMonkey: "USD"; V8: "US Dollar"; LibJS: "$". And running new Intl.DisplayNames('en', {type:'currency', style:'short'}).of('usd') gives: SpiderMonkey: "$"; V8: "US Dollar"; LibJS: "$". My best guess is V8 isn't handling style, and just returning the long form (which is what LibJS did before this commit). And SpiderMonkey can handle some styles, but if they don't have a value for the requested style, they fall back to the canonicalized code passed into of().
2021-11-12 | LibUnicode: Move number formatting code generator to UnicodeNumberFormat | Timothy Flynn
2021-11-12 | LibUnicode: Parse and generate standard percentage formatting rules | Timothy Flynn
2021-11-12 | LibUnicode: Parse and generate compact decimal formatting rules | Timothy Flynn
2021-11-12 | LibUnicode: Begin parsing and generating locale number systems | Timothy Flynn
The number system data in the CLDR contains information on how to format numbers in a locale-dependent manner. Start parsing this data, beginning with numeric symbol strings. For example, the symbol NaN maps to "NaN" in the en-US locale, and "非數值" in the zh-Hant locale.
2021-09-11 | LibUnicode: Extract canonicalization of Unicode extension values | Timothy Flynn
LibJS will need to canonicalize Unicode extension values, so extract the lambda that was doing this work to its own function. This also changes the helpers it invokes to take the provided key as a StringView because we don't need (and won't always have) full String objects here.
2021-09-11 | LibUnicode: Generate numeric keyword values for each locale | Timothy Flynn
This is needed for Intl.NumberFormat's usage of the ResolveLocale AO, where the [[RelevantExtensionKeys]] internal slot will be "nu".
2021-09-08 | LibUnicode+LibJS: Store locale keyword values as a single string | Timothy Flynn
Previously, LibUnicode would store the values of a keyword as a Vector. For example, the locale "en-u-ca-abc-def" would have its keyword "ca" stored as {"abc", "def"}. Then, canonicalization would occur on each of the elements in that Vector. This is incorrect because, for example, the keyword value "true" should only be dropped if that is the entire value. That is, the canonical form of "en-u-kb-true" is "en-u-kb", but "en-u-kb-abc-true" is unchanged by canonicalization. However, we would canonicalize that locale as "en-u-kb-abc".
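The rule about "true" can be sketched as follows, treating the keyword value as one string. Both helper names are hypothetical, and only the "true" rule from the message above is modeled; other canonicalization steps are left out.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: canonicalize the full value of a Unicode keyword.
// A value of "true" is dropped only when it is the entire value.
std::string canonicalize_keyword_value(std::string_view value)
{
    if (value == "true")
        return {};
    return std::string(value);
}

// Build the "-key[-value]" fragment of a locale's Unicode extension.
std::string serialize_keyword(std::string_view key, std::string_view value)
{
    std::string result = "-" + std::string(key);
    auto canonical = canonicalize_keyword_value(value);
    if (!canonical.empty())
        result += "-" + canonical;
    return result;
}
```

Because the comparison is made against the whole value, "abc-true" is left alone, whereas canonicalizing each element of a split value would have dropped the trailing "true".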
2021-09-08 | LibUnicode: Update comment with link to related upstream issue | Timothy Flynn
LibUnicode has to hard-code some aliases because the related data is not available in the JSON export of CLDR. Turns out there is a ticket to add this data in an upcoming CLDR release. Add a link to that ticket for reference.
2021-09-06 | LibUnicode: Parse and generate the Unicode locale list patterns dataset | Timothy Flynn
This data informs consumers how to join lists of values. For example, in en-US, the list ["a", "b", "c"] formatted to a string should become "a, b, and c".
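A consumer of this data might join a list roughly as sketched below, following the general start/middle/end/pair scheme of CLDR list patterns. The struct, the function names and the English pattern strings are assumptions for illustration; real values are parsed from the CLDR.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct ListPatterns {
    std::string start;
    std::string middle;
    std::string end;
    std::string pair;
};

// Substitute {0} and {1} in a list pattern.
std::string apply_pattern(std::string_view pattern, std::string_view first, std::string_view second)
{
    std::string result;
    for (std::size_t i = 0; i < pattern.size();) {
        if (pattern.substr(i, 3) == "{0}") {
            result += first;
            i += 3;
        } else if (pattern.substr(i, 3) == "{1}") {
            result += second;
            i += 3;
        } else {
            result += pattern[i++];
        }
    }
    return result;
}

std::string format_list(std::vector<std::string> const& items, ListPatterns const& patterns)
{
    if (items.empty())
        return {};
    if (items.size() == 1)
        return items[0];
    if (items.size() == 2)
        return apply_pattern(patterns.pair, items[0], items[1]);

    // Join the last two items, then work toward the front of the list.
    std::size_t last = items.size() - 1;
    std::string result = apply_pattern(patterns.end, items[last - 1], items[last]);
    for (std::size_t i = last - 1; i-- > 1;)
        result = apply_pattern(patterns.middle, items[i], result);
    return apply_pattern(patterns.start, items[0], result);
}

// Assumed English patterns, hard-coded here only for the example.
ListPatterns english_patterns()
{
    return { "{0}, {1}", "{0}, {1}", "{0}, and {1}", "{0} and {1}" };
}
```

Keeping the patterns as data rather than code is what lets the same joining routine serve locales with different conjunctions and punctuation.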
2021-09-06 | LibUnicode: Add public wrapper for the generated locale_from_string | Timothy Flynn
2021-09-04 | LibUnicode: Implement the Remove Likely Subtags method | Timothy Flynn
Unlike Add Likely Subtags, this method doesn't require generated data. Instead, it is defined in terms of Add Likely Subtags.
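The relationship between the two methods can be sketched as below: maximize the locale, then return the shortest candidate that maximizes to the same result. The `LocaleSubtags` struct and the two-entry likely subtags table are stand-ins invented for this example; the real Add Likely Subtags implementation is generated from CLDR data and is more involved.

```cpp
#include <string>
#include <string_view>

struct LocaleSubtags {
    std::string language;
    std::string script;
    std::string region;

    bool operator==(LocaleSubtags const& other) const
    {
        return language == other.language && script == other.script && region == other.region;
    }
};

// Tiny stand-in for the generated likely subtags data (two entries only).
LocaleSubtags add_likely_subtags(LocaleSubtags locale)
{
    struct Entry {
        std::string_view language, script, region;
    };
    static constexpr Entry likely[] = { { "en", "Latn", "US" }, { "sr", "Cyrl", "RS" } };

    for (auto const& entry : likely) {
        if (locale.language != entry.language)
            continue;
        if (locale.script.empty())
            locale.script = entry.script;
        if (locale.region.empty())
            locale.region = entry.region;
    }
    return locale;
}

// Defined purely in terms of add_likely_subtags; needs no data of its own.
LocaleSubtags remove_likely_subtags(LocaleSubtags const& locale)
{
    auto const maximized = add_likely_subtags(locale);

    // Candidates, tried in order: language, language-region, language-script.
    LocaleSubtags const trials[] = {
        { maximized.language, "", "" },
        { maximized.language, "", maximized.region },
        { maximized.language, maximized.script, "" },
    };
    for (auto const& trial : trials) {
        if (add_likely_subtags(trial) == maximized)
            return trial;
    }
    return maximized;
}
```

Under this toy table, "en-Latn-US" minimizes to "en", while "sr-Latn-RS" minimizes to "sr-Latn" because "sr" alone would maximize to the Cyrillic script.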
2021-09-04 | LibUnicode: Generate an implementation of the Add Likely Subtags method | Timothy Flynn
2021-09-04 | LibUnicode: Define is_unicode_*_subtag helpers inline in their header | Timothy Flynn
The UnicodeLocale generator will need to parse canonicalized locale strings, and will require using these methods. However, the generator cannot depend on LibUnicode because Locale.cpp within LibUnicode already depends on the generated file. Instead, defining the methods that the generator needs inline allows the generator to use them without linking against LibUnicode.
2021-09-02 | LibUnicode: Add helper methods to LocaleID and LanguageID for LibJS | Timothy Flynn
Add a method to remove an extension type from the locale's extension set and methods to convert a locale and language to a string without canonicalization. Each of these will be used by LibJS.
2021-09-02 | LibUnicode: Add lexer to test if a string matches the "type" production | Timothy Flynn
2021-09-01 | LibUnicode: Resolve the most likely territory alias when there are many | Timothy Flynn
2021-09-01 | LibUnicode: Generate Unicode locale likely subtag data | Timothy Flynn
CLDR contains a set of likely subtag data where, given a locale, you can resolve what is the most likely language, script, or territory of that locale. This data is needed for resolving territory aliases. These aliases might contain multiple territories, and we need to resolve which of those territories is most likely correct for a locale. Note that the likely subtag data is quite huge (a few thousand entries). As an optimization encouraged by the spec, we only generate the smallest subset of this data that we actually need (about 150 entries).
2021-09-01 | LibUnicode: Perform complex Unicode locale alias substitution | Timothy Flynn
2021-09-01 | LibUnicode: Canonicalize calendar subtags | Timothy Flynn
Calendar subtags are a bit of an odd-man-out in that we must match the variants "ethiopic-amete-alem" in that order, without any other variant in the locale. So a separate method is needed for this, and we now defer sorting the variant list until after other canonicalization is done.
2021-09-01 | LibUnicode: Canonicalize timezone subtags | Timothy Flynn
2021-09-01 | LibUnicode: Canonicalize the subtag "imperial" to "uksystem" | Timothy Flynn
2021-09-01 | LibUnicode: Canonicalize the subtag "primary" and "tertiary" to "levelN" | Timothy Flynn
2021-09-01 | LibUnicode: Canonicalize the subtag "names" to "prprname" | Timothy Flynn
2021-09-01 | LibUnicode: Canonicalize the subtag "yes" to "true" | Timothy Flynn
2021-09-01 | LibUnicode: Substitute Unicode locale aliases during canonicalization | Timothy Flynn
Unicode TR35 defines how locale subtag aliases should be applied when converting a locale to canonical form. For most subtags, it is a simple substitution. Language subtags depend on context; for example, the language "sh" should become "sr-Latn", but if the original locale already has a script subtag ("sh-Cyrl"), then only the language subtag of the alias ("sr-Latn") should be taken, giving "sr-Cyrl". To facilitate this, we now make two passes when canonicalizing a locale. In the first pass, we convert the LocaleID structure to canonical syntax (where the conversions all happen in-place). In the second pass, we form the canonical string based on the canonical syntax.
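The context-dependent language substitution described above might look like the following sketch. The `LanguageSubtags` struct and the hard-coded "sh" alias are assumptions for the example; in LibUnicode the alias data is generated from the CLDR.

```cpp
#include <string>

struct LanguageSubtags {
    std::string language;
    std::string script; // Empty when the locale has no script subtag.
};

// Apply the single alias "sh" -> "sr-Latn" (hard-coded for illustration).
LanguageSubtags substitute_language_alias(LanguageSubtags id)
{
    if (id.language != "sh")
        return id;

    LanguageSubtags const alias { "sr", "Latn" };
    id.language = alias.language;
    // Only take the alias's script if the original locale did not have one.
    if (id.script.empty())
        id.script = alias.script;
    return id;
}

// Second pass: form the string from the already-canonical structure.
std::string to_string(LanguageSubtags const& id)
{
    return id.script.empty() ? id.language : id.language + "-" + id.script;
}
```

Doing the substitution on the structure first and forming the string afterwards mirrors the two-pass approach the message describes.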
2021-09-01 | LibUnicode: Generate Unicode locale alias data | Timothy Flynn
CLDR contains a set of aliases for languages, territories, etc. that no longer are meant to be used (e.g. due to deprecation). For example, the language "aam" is deprecated and should be canonicalized as "aas".