path: root/Meta/Lagom/Tools/CodeGenerators/LibUnicode
Age | Commit message | Author
2021-11-11 | Everywhere: Pass AK::StringView by value | Andreas Kling
2021-11-09 | LibUnicode: Parse the CLDR's defaultContent.json locale list | Timothy Flynn
This file contains the list of locales which default to their parent locale's values. In the core CLDR dataset, these locales have their own files, but they are empty (except for identity data). For example: https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml In the JSON export, these files are excluded, so we currently are not recognizing these locales just by iterating the locale files. This is a prerequisite for upgrading to CLDR version 40. One of these default-content locales is the popular "en-US" locale, which defaults to "en" values. We were previously inferring the existence of this locale from the "en-US-POSIX" locale (many implementations, including ours, strip variants such as POSIX). However, v40 removes the "en-US-POSIX" locale entirely, meaning that without this change, we wouldn't know that "en-US" exists (we would default to "en"). For more detail on this and other v40 changes, see: https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba
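As a rough illustration of the idea, a default-content locale can be registered by pointing it at the data of its nearest known ancestor. This is a minimal sketch, not the generator's code: `parent_locale`, `LocaleData`, and `add_default_content_locales` are hypothetical names, and deriving the parent by truncating the last subtag is an assumption.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical helper: "en-US" -> "en", "en" -> "" (assumes parent = truncation).
inline std::string_view parent_locale(std::string_view locale)
{
    auto dash = locale.rfind('-');
    if (dash == std::string_view::npos)
        return {};
    return locale.substr(0, dash);
}

// Hypothetical map: locale name -> name of the locale whose data it uses.
using LocaleData = std::unordered_map<std::string, std::string>;

// Registers each default-content locale under its nearest known ancestor's data.
inline void add_default_content_locales(LocaleData& locales, std::vector<std::string> const& default_content)
{
    for (auto const& locale : default_content) {
        assert(!locale.empty());
        if (locales.count(locale) != 0)
            continue; // Already known from the locale files.

        for (auto parent = parent_locale(locale); !parent.empty(); parent = parent_locale(parent)) {
            auto it = locales.find(std::string(parent));
            if (it == locales.end())
                continue;
            std::string data = it->second; // Copy before inserting into the map.
            locales.emplace(locale, std::move(data));
            break;
        }
    }
}
```

With only "en" known from the locale files, passing "en-US" in the default-content list makes "en-US" resolve to the "en" data, while a locale with no known ancestor is left unregistered.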
2021-10-15 | LibUnicode: Use u16 for unique string indices instead of size_t | Timothy Flynn
Typically size_t is used for indices, but we can take advantage of the knowledge that there are only approximately 46K unique strings in the generated UnicodeLocale.cpp file. Therefore, we can get away with using u16 to store indices. There is a VERIFY that will fail if we ever exceed the limits of u16. On x86_64 builds, this reduces libunicode.so from 9.2 MiB to 7.3 MiB. On i686 builds, this reduces libunicode.so from 3.9 MiB to 3.3 MiB. These savings are entirely in the .rodata section of the shared library.
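The space saving can be sketched as follows. The struct layouts and the `to_string_index` helper are hypothetical, and `assert` stands in for the VERIFY mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical index type standing in for u16.
using StringIndex = std::uint16_t;

// Hypothetical table entry storing indices as size_t.
struct WideEntry {
    std::size_t language;
    std::size_t territory;
    std::size_t script;
};

// The same entry with 16-bit indices.
struct NarrowEntry {
    StringIndex language;
    StringIndex territory;
    StringIndex script;
};

// Narrows a table position, failing loudly if it does not fit in 16 bits.
inline StringIndex to_string_index(std::size_t position)
{
    assert(position <= std::numeric_limits<StringIndex>::max());
    return static_cast<StringIndex>(position);
}
```

Because every entry in the generated arrays shrinks, the saving scales with the number of entries, which fits the reported reduction being confined to .rodata.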
2021-10-13 | LibUnicode: Generate enum/alias from-string methods without a HashMap | Timothy Flynn
The *_from_string() and resolve_*_alias() generated methods are the last remaining users of HashMap in the LibUnicode generated files (read: the last methods not using compile-time structures). This converts these methods to use an array containing pairs of hash values to the desired lookup value. Because this code generation is the same between GenerateUnicodeData.cpp and GenerateUnicodeLocale.cpp, this adds a GeneratorUtil.h header to the LibUnicode generators to contain the method that generates the methods.
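A hash-pair lookup of this kind might look like the sketch below. The `Script` enum, the FNV-1a hash, and the binary search are all assumptions chosen for illustration; the commit message only states that an array of hash/value pairs replaces the HashMap. A real generator would presumably emit the hashes already sorted rather than sorting on first use.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical enum standing in for a generated one.
enum class Script : std::uint8_t { Arab, Cyrl, Latn };

// FNV-1a, used here only as a stand-in string hash.
constexpr std::uint32_t hash_string(std::string_view string)
{
    std::uint32_t hash = 2166136261u;
    for (char ch : string) {
        hash ^= static_cast<unsigned char>(ch);
        hash *= 16777619u;
    }
    return hash;
}

struct HashValuePair {
    std::uint32_t hash;
    Script value;
};

// Looks up an enum value by hashing the input and binary-searching the pairs.
inline std::optional<Script> script_from_string(std::string_view string)
{
    static auto const table = [] {
        std::array<HashValuePair, 3> pairs { {
            { hash_string("Arab"), Script::Arab },
            { hash_string("Cyrl"), Script::Cyrl },
            { hash_string("Latn"), Script::Latn },
        } };
        std::sort(pairs.begin(), pairs.end(),
            [](HashValuePair const& a, HashValuePair const& b) { return a.hash < b.hash; });
        // The scheme only works if no two keys share a hash.
        for (std::size_t i = 1; i < pairs.size(); ++i)
            assert(pairs[i - 1].hash != pairs[i].hash);
        return pairs;
    }();

    auto hash = hash_string(string);
    auto it = std::lower_bound(table.begin(), table.end(), hash,
        [](HashValuePair const& pair, std::uint32_t value) { return pair.hash < value; });
    if (it == table.end() || it->hash != hash)
        return std::nullopt;
    return it->value;
}
```

Note that this sketch compares hashes only, so an unknown input that happens to collide with a known key would be misidentified.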
2021-10-10 | LibUnicode: Generate and use unique locale-related alias strings | Timothy Flynn
Almost all of these are already in the unique string list.
2021-10-10 | LibUnicode: Generate and use unique subtag and complex alias strings | Timothy Flynn
2021-10-10 | LibUnicode: Generate and use unique list-format strings | Timothy Flynn
The list-format strings used for Intl.ListFormat are small, but quite heavily duplicated. For example, the string "{0}, {1}" appears 6,519 times. Generate unique strings for this data to avoid duplication.
2021-10-10 | LibUnicode: Generate and use a set of unique locale-related strings | Timothy Flynn
In the generated UnicodeLocale.cpp file, there are 296,408 strings for localizations of languages, territories, scripts, currencies & keywords. Of these, only 43,848 (14.8%) are actually unique, so there are quite a large number of duplicated strings. This generates a single compile-time array to store these strings. The arrays for the localizations now store an index into this single array rather than duplicating any strings.
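The deduplication step can be sketched with a small interning table. `UniqueStringStorage` and its members are hypothetical names for illustration, not necessarily what the generator uses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interning table: each distinct string is stored once.
class UniqueStringStorage {
public:
    // Returns the index of `string`, appending it only if it has not been seen.
    std::size_t ensure(std::string const& string)
    {
        auto [it, inserted] = m_indices.try_emplace(string, m_strings.size());
        if (inserted)
            m_strings.push_back(string);
        return it->second;
    }

    std::string const& get(std::size_t index) const
    {
        assert(index < m_strings.size());
        return m_strings[index];
    }

    std::size_t size() const { return m_strings.size(); }

private:
    std::vector<std::string> m_strings;                     // Emitted as the single array.
    std::unordered_map<std::string, std::size_t> m_indices; // Only needed while generating.
};
```

The localization arrays would then hold the indices returned by `ensure()` instead of the strings themselves.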
2021-10-10 | LibUnicode: Skip unknown languages and territories | Timothy Flynn
Some CLDR languages.json / territories.json files contain localizations for some languages/territories that are otherwise not present in the CLDR database. We already don't generate anything in UnicodeLocale.cpp for these anomalies, but this will stop us from even storing that data in the generator's memory. This doesn't affect the output of the generator, but will have an effect after an upcoming commit to unique-ify all of the strings in the CLDR.
2021-10-10 | LibUnicode: Stop generating large UnicodeData hash map | Timothy Flynn
The data in this hash map is now available by way of much smaller arrays and is now unused.
2021-10-10 | LibUnicode: Generate standalone compile-time array for combining class | Timothy Flynn
2021-10-10 | LibUnicode: Generate standalone compile-time array for special casing | Timothy Flynn
There are only 112 code points with special casing rules, so this array is quite small (compared to the size 34,626 UnicodeData hash map that is also storing this data). Removing all casing rules from UnicodeData will happen in a subsequent commit.
2021-10-10 | LibUnicode: Generate standalone compile-time arrays for simple casing | Timothy Flynn
Currently, all casing information (simple and special) is stored in a compile-time array of size 34,626, then statically copied to a hash map at runtime. In an effort to reduce the resulting memory usage, store the simple casing rules in standalone compile-time arrays. The uppercase map is size 1,450 and the lowercase map is size 1,433. Any code point not in a map will implicitly have an identity mapping.
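A lookup with an implicit identity mapping could look like the sketch below. The four-entry table is an illustrative subset, and the names and the binary search are assumptions rather than the generated code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct CaseMapping {
    std::uint32_t code_point;
    std::uint32_t mapped;
};

// Tiny illustrative subset, sorted by code point (the real map has 1,450 entries).
constexpr std::array<CaseMapping, 4> s_uppercase_sample { {
    { 0x0061, 0x0041 }, // a -> A
    { 0x0062, 0x0042 }, // b -> B
    { 0x00E9, 0x00C9 }, // e with acute -> E with acute
    { 0x03B1, 0x0391 }, // Greek alpha -> Alpha
} };

// Binary search; code points absent from the table map to themselves.
constexpr std::uint32_t to_simple_uppercase(std::uint32_t code_point)
{
    std::size_t low = 0;
    std::size_t high = s_uppercase_sample.size();
    while (low < high) {
        std::size_t mid = low + (high - low) / 2;
        if (s_uppercase_sample[mid].code_point < code_point)
            low = mid + 1;
        else
            high = mid;
    }
    if (low < s_uppercase_sample.size() && s_uppercase_sample[low].code_point == code_point)
        return s_uppercase_sample[low].mapped;
    return code_point;
}
```

Only code points that actually change case need an entry, which is why the standalone maps are so much smaller than the full 34,626-entry table.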
2021-10-01 | Meta: Fix typos | Nico Weber
2021-09-30 | LibUnicode: Do not compare generated file contents before writing | Timothy Flynn
This is now covered by unicode_data.cmake after the superbuild changes.
2021-09-15 | Meta: Define and use lagom_tool() CMake helper function for all Tools | Andrew Kaster
We'll use this to prevent repeating common tool dependencies. They all depend on LibCore and AK only. We also want to encapsulate common install rules for them.
2021-09-11 | AK: Replace the mutable String::replace API with an immutable version | Idan Horowitz
This removes the awkward String::replace API, which was the only String API that mutated the String, and replaces it with a new immutable version that returns a new String with the replacements applied. This also fixes a couple of UAFs that were caused by the use of this API. As an optimization, an equivalent StringView::replace API was also added to remove unnecessary String allocations of the form: `String { view }.replace(...);`
2021-09-11 | LibUnicode: Generate numeric keyword values for each locale | Timothy Flynn
This is needed for Intl.NumberFormat's usage of the ResolveLocale AO, where the [[RelevantExtensionKeys]] internal slot will be "nu".
2021-09-08 | LibUnicode: Fix typo in listPatterns.json parsing method | Timothy Flynn
2021-09-06 | LibUnicode: Remove Unicode locale variants from CLDR path names | Timothy Flynn
There are only a couple of cases like this, but some locale paths in the CLDR contain variants. For example, there isn't an en-US path, but there is an en-US-POSIX path. This interferes with the operation to search for locales by name. The algorithm is such that searching for en-US will not result in en-US-POSIX being found. To resolve this, we should remove variants from the locale name.
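One way to strip variants from a path name is sketched below. It assumes the subtag shapes from the Unicode locale identifier grammar (a variant is 5 to 8 alphanumerics, or a digit followed by 3 alphanumerics) and is not taken from the commit itself; `is_variant_subtag` and `remove_variants` are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

// Assumed variant shape: 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
inline bool is_variant_subtag(std::string_view subtag)
{
    for (char ch : subtag) {
        if (std::isalnum(static_cast<unsigned char>(ch)) == 0)
            return false;
    }
    if (subtag.size() >= 5 && subtag.size() <= 8)
        return true;
    return subtag.size() == 4 && std::isdigit(static_cast<unsigned char>(subtag[0])) != 0;
}

// Rebuilds the locale name without variant subtags; the first subtag is always kept.
inline std::string remove_variants(std::string_view locale)
{
    assert(!locale.empty());
    std::string result;
    bool is_first_subtag = true;

    while (true) {
        auto dash = locale.find('-');
        auto subtag = locale.substr(0, dash);

        if (is_first_subtag || !is_variant_subtag(subtag)) {
            if (!result.empty())
                result += '-';
            result.append(subtag.data(), subtag.size());
        }
        is_first_subtag = false;

        if (dash == std::string_view::npos)
            break;
        locale.remove_prefix(dash + 1);
    }
    return result;
}
```

Under these assumptions "en-US-POSIX" becomes "en-US", while a script subtag such as "Cyrl" in "sr-Cyrl-RS" is preserved because it is four letters and does not start with a digit.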
2021-09-06 | LibUnicode: Parse and generate the Unicode locale list patterns dataset | Timothy Flynn
This data informs consumers how to join lists of values. For example, in en-US, the list ["a", "b", "c"] formatted to a string should become "a, b, and c".
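The joining procedure can be sketched as follows, assuming the four CLDR pattern kinds (start, middle, end, pair) and the usual right-to-left application. The struct and function names are hypothetical, and the English pattern values in the usage note are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical holder for one locale's list patterns.
struct ListPatterns {
    std::string_view start;
    std::string_view middle;
    std::string_view end;
    std::string_view pair;
};

// Substitutes {0} and {1} in `pattern` with `first` and `second`.
inline std::string apply_pattern(std::string_view pattern, std::string_view first, std::string_view second)
{
    std::string result;
    std::size_t i = 0;
    while (i < pattern.size()) {
        if (pattern.substr(i, 3) == "{0}") {
            result.append(first.data(), first.size());
            i += 3;
        } else if (pattern.substr(i, 3) == "{1}") {
            result.append(second.data(), second.size());
            i += 3;
        } else {
            result += pattern[i++];
        }
    }
    return result;
}

// Joins `items`: pair pattern for two items, otherwise end, then middle, then start.
inline std::string format_list(ListPatterns const& patterns, std::vector<std::string> const& items)
{
    if (items.empty())
        return {};
    if (items.size() == 1)
        return items[0];

    assert(!patterns.pair.empty() && !patterns.end.empty());
    if (items.size() == 2)
        return apply_pattern(patterns.pair, items[0], items[1]);

    std::size_t count = items.size();
    std::string result = apply_pattern(patterns.end, items[count - 2], items[count - 1]);
    for (std::size_t i = count - 2; i-- > 1;) // Visits count-3 down to 1.
        result = apply_pattern(patterns.middle, items[i], result);
    return apply_pattern(patterns.start, items[0], result);
}
```

With start and middle set to "{0}, {1}", end set to "{0}, and {1}", and pair set to "{0} and {1}", the list ["a", "b", "c"] formats to "a, b, and c".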
2021-09-06 | LibUnicode: Extract cldr-misc dataset from CLDR database | Timothy Flynn
2021-09-06 | LibUnicode: Sort special casing array by locale specificity | Timothy Flynn
This is to simplify the Default Case Conversion implementation. Otherwise, the implementation would need to determine which special casing rule to apply, instead of just picking the first match.
2021-09-06 | LibUnicode: Generate canonical combining class in Unicode data | Timothy Flynn
Will be used by special casing rules.
2021-09-04 | LibUnicode: Generate an implementation of the Add Likely Subtags method | Timothy Flynn
2021-09-04 | LibUnicode: Generate the entire locale likely-subtags dataset | Timothy Flynn
The number of aliases in the likely-subtags dataset is quite large, so this also needed to change the way the data is generated. Otherwise, the compiler would complain about the size of the generated code. Previously, a static method was generated that would effectively parse the dataset into a HashMap of Unicode::LanguageID at runtime. We now perform that parsing at generation-time, and instead generate an Array of a structure similar to Unicode::LanguageID (we cannot use the same structure because it contains String and Optional, which cannot be used at compile-time).
2021-09-01 | LibUnicode: Generate Unicode locale likely subtag data | Timothy Flynn
CLDR contains a set of likely subtag data where, given a locale, you can resolve what is the most likely language, script, or territory of that locale. This data is needed for resolving territory aliases. These aliases might contain multiple territories, and we need to resolve which of those territories is most likely correct for a locale. Note that the likely subtag data is quite huge (a few thousand entries). As an optimization encouraged by the spec, we only generate the smallest subset of this data that we actually need (about 150 entries).
2021-09-01 | LibUnicode: Generate complex Unicode locale alias matching | Timothy Flynn
Most alias substitutions are "simple", meaning that alias matching is done by examining a single locale subtag. However, there are a handful of "complex" aliases where matching is done by examining multiple subtags. For example, the variant subtag "lojban" causes the locale "art-lojban" to be canonicalized to "jbo", but only when the language subtag is "art" (i.e. this should not occur for the locale "en-lojban"). This generates a method to perform complex alias matching.
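The "art-lojban" case can be sketched as a rule that fires only when both the language and the variant match. The `LanguageID` and `ComplexAlias` structs here are simplified, hypothetical stand-ins, not the generated types.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for a parsed locale.
struct LanguageID {
    std::string language;
    std::vector<std::string> variants;
};

// Hypothetical rule: both `language` and `variant` must match for it to apply.
struct ComplexAlias {
    std::string language;
    std::string variant;
    std::string replacement_language;
};

// Applies the alias in place and returns true if it matched.
inline bool apply_complex_alias(LanguageID& id, ComplexAlias const& alias)
{
    assert(!alias.replacement_language.empty());
    if (id.language != alias.language)
        return false;

    auto it = std::find(id.variants.begin(), id.variants.end(), alias.variant);
    if (it == id.variants.end())
        return false;

    // The matched variant is consumed by the replacement.
    id.variants.erase(it);
    id.language = alias.replacement_language;
    return true;
}
```

Applied to "art-lojban" the rule yields "jbo" with no variants, whereas "en-lojban" is left unchanged because its language subtag does not match.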
2021-09-01 | LibUnicode: Generate Unicode locale alias data | Timothy Flynn
CLDR contains a set of aliases for languages, territories, etc. that no longer are meant to be used (e.g. due to deprecation). For example, the language "aam" is deprecated and should be canonicalized as "aas".
2021-09-01 | LibUnicode: Extract cldr-core dataset from CLDR database | Timothy Flynn
2021-08-28 | Everywhere: Move all host tools into the Lagom/Tools subdirectory | Andrew Kaster
This allows us to remove all the add_subdirectory calls from the top level CMakeLists.txt that referred to targets linking LagomCore. Segregating the host tools and Serenity targets helps us get to a place where the main Serenity build can simply use a CMake toolchain file rather than swapping all the compiler/sysroot variables after building host libraries and tools.