serenity - The Serenity Operating System 🐞

Age	Commit message (Collapse)	Author
2022-02-05	LibRegex: Support non-ASCII whitespace characters when matching \s or \S	Timothy Flynn
	ECMA-262 defines \s as: Return the CharSet containing all characters corresponding to a code point on the right-hand side of the WhiteSpace or LineTerminator productions. The LineTerminator production is simply: U+000A, U+000D, U+2028, or U+2029. Unfortunately there isn't a Unicode property that covers just those code points. The WhiteSpace production is: U+0009, U+000B, U+000C, U+FEFF, or any code point with the Space_Separator general category. If the Unicode generators are disabled, this will fall back to ASCII space code points.
2022-01-26	LibRegex: Implement ECMA262 multiline matching without splitting lines	Ali Mohammad Pur
	As ECMA262 regex allows `[^]` and literal newlines to match newlines in the input string, we shouldn't split the input string into lines, rather simply make boundaries and catchall patterns capable of checking for these conditions specifically.
2022-01-22	LibRegex: Add some more information to Compare::Reference debug output	Ali Mohammad Pur

2022-01-22	LibRegex: Allow ClearCaptureGroup to create new groups	Ali Mohammad Pur
	Instead of leaking all capture groups and selectively clearing some, simply avoid leaking things and only "define" the ones that need to exist. This actually implements the capture groups ECMA262 quirk. Also adds the test removed in the previous commit (to avoid messing up test runs across bisects).
2021-12-25	LibRegex: Make FailForks fail all forks up to the last save point	Ali Mohammad Pur
	This makes negative lookarounds with more than one fork behave correctly. Fixes #11350.
2021-12-21	LibRegex: Remove some meaningless/useless const-qualifiers	Hendiadyoin1
	Also replace String creation from `""` with `String::empty()`
2021-12-21	LibRegex: Collapse some `if(...) return true; else return false;` blocks	Hendiadyoin1

2021-12-21	LibRegex: Add some implied auto qualifiers	Hendiadyoin1

2021-12-21	LibRegex: Avoid calling DisjointChunks::size() in get_opcode()	Ali Mohammad Pur
	That method is O(n), and should generally be avoided.
2021-11-11	Everywhere: Pass AK::StringView by value	Andreas Kling

2021-10-03	LibRegex: Use a match table for character classes	Ali Mohammad Pur
	Generate a sorted, compressed series of ranges in a match table for character classes, and use a binary search to find the matches. This is about a 3-4x speedup for character class match performance. :^)
2021-10-03	LibRegex: Avoid creating a new temporary RegexStringView in Char compare	Ali Mohammad Pur
	Instead of making a new string to compare against, simply grab the first code unit of the input and compare against that.
2021-10-02	LibJS+AK: Use Vector<u16, 1> for UTF-16 string storage	Andreas Kling
	It's very common to encounter single-character strings in JavaScript on the web. We can make such strings significantly lighter by having a 1-character inline capacity on the Vectors.
2021-09-13	LibRegex: Add a basic optimization pass	Ali Mohammad Pur
	This currently tries to convert forking loops to atomic groups, and unify the left side of alternations.
2021-09-06	LibRegex: Avoid keeping track of checkpoints across forks	Ali Mohammad Pur
	Doing so would increase memory consumption by quite a bit, since many useless copies of the checkpoints hashmap would be created and later thrown away.
2021-09-06	LibRegex: Make infinite repetitions short-circuit on empty matches	Ali Mohammad Pur
	This makes (addmittedly weird) patterns like `(a)` work correctly without going into an infinite fork loop.
2021-09-01	LibRegex: Correctly advance string positions in Compare::compare_string	Ali Mohammad Pur
	Fixes a bug where backreferences could cause desync between the code point index and regular index, making comparison off-by-N.
2021-09-01	LibRegex: Correctly handle failing in the middle of explicit repeats	Ali Mohammad Pur
	- Make sure that all the Repeat ops are reset (otherwise the operation would not be correct when going over the Repeat op a second time) - Make sure that all matches that are allowed to fail are backed by a fork, otherwise the last failing fork would not have anywhere to return to. Fixes #9707.
2021-08-18	LibRegex: Ensure the GoBack operation decrements the code unit index	Timothy Flynn
	This was missed in commit 27d555bab0d84913599cea3c4a6b0a0ed2a15b66.
2021-08-15	LibRegex: Implement and use a REPEAT operation for bytecode repetition	Timothy Flynn
	Currently, when we need to repeat an instruction N times, we simply add that instruction N times in a for-loop. This doesn't scale well with extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1. Instead, add a new REPEAT bytecode operation to defer this loop from the parser to the runtime executor. This allows the parser to complete sans any loops (for this instruction), and allows the executor to bail early if the repeated bytecode fails. Note: The templated ByteCode methods are to allow the Posix parsers to continue using u32 because they are limited to N = 2^20.
2021-08-15	LibRegex: Remove (mostly) unused regex::MatchOutput	Timothy Flynn
	This struct holds a counter for the number of executed operations, and vectors for matches, captures groups, and named capture groups. Each of the vectors is unused. Remove the struct and just keep a separate counter for the executed operations.
2021-08-15	LibRegex+LibJS: Combine named and unnamed capture groups in MatchState	Timothy Flynn
	Combining these into one list helps reduce the size of MatchState, and as a result, reduces the amount of memory consumed during execution of very large regex matches. Doing this also allows us to remove a few regex byte code instructions: ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference. Named groups now behave the same as unnamed groups for these operations. Note that SaveRightNamedCaptureGroup still exists to cache the matched group name. This also removes the recursion level from the MatchState, as it can exist as a local variable in Matcher::execute instead.
2021-08-04	LibRegex: Support property escapes of Unicode script extensions	Timothy Flynn

2021-08-04	LibRegex: Support property escapes of the Unicode script property	Timothy Flynn
	Note that unlike binary properties and general categories, scripts must be specified in the non-binary (Script=Value) form.
2021-08-04	LibRegex: Track string position in both code units and code points	Timothy Flynn
	In non-Unicode mode, the existing MatchState::string_position is tracked in code units; in Unicode mode, it is tracked in code points. In order for some RegexStringView operations to be performant, it is useful for the MatchState to have a field to always track the position in code units. This will allow RegexStringView methods (e.g. operator[]) to perform lookups based on code unit offsets, rather than needing to iterate over the entire string to find a code point offset.
2021-08-02	LibRegex: Support property escapes of Unicode General Categories	Timothy Flynn
	This changes LibRegex to parse the property escape as a Variant of Unicode Property & General Category values. A byte code instruction is added to perform matching based on General Category values.
2021-07-30	LibRegex+LibUnicode: Begin implementing Unicode property escapes	Timothy Flynn
	This supports some binary property matching. It does not support any properties not yet parsed by LibUnicode, nor does it support value matching (such as Script_Extensions=Latin).
2021-07-24	LibRegex: Preserve the type of the match when clearing capture groups	Ali Mohammad Pur
	Even though the contents are supposed to be reset, the type should stay unchanged, as that's an assumption the engine is making.
2021-07-23	LibRegex: Support UTF-16 RegexStringView and improve Unicode matching	Timothy Flynn
	When the Unicode option is not set, regular expressions should match based on code units; when it is set, they should match based on code points. To do so, the regex parser must combine surrogate pairs when the Unicode option is set. Further, RegexStringView needs to know if the flag is set in order to return code point vs. code unit based string lengths and substrings.
2021-07-23	LibRegex: Switch to east-const style	Ali Mohammad Pur

2021-07-23	LibRegex: Clear previous capture group contents in ECMA262 mode	Ali Mohammad Pur
	ECMA262 requires that the capture groups only contain the values from the last iteration, e.g. `((c)(a)?(b))` should _not_ contain 'a' in the second capture group when matching "cabcb".
2021-07-18	LibRegex+Everywhere: Make LibRegex more unicode-aware	Ali Mohammad Pur
	This commit makes LibRegex (mostly) capable of operating on any of the three main string views: - StringView for raw strings - Utf8View for utf-8 encoded strings - Utf32View for raw unicode strings As a result, regexps with unicode strings should be able to properly handle utf-8 and not stop in the middle of a code point. A future commit will update LibJS to use the correct type of string depending on the flags.
2021-07-18	LibRegex: Partially implement string compare for Utf32View	Ali Mohammad Pur

2021-06-16	LibRegex: Display correct position for Compare in REGEX_DEBUG	sin-ack
	When REGEX_DEBUG is enabled, LibRegex dumps a table of information regarding the state of the regex bytecode execution. The Compare opcode manipulates state.string_position directly, so the string_position value cannot be used to display where the comparison started; therefore, this patch introduces a new variable to keep track of where we were before the comparison happened.
2021-06-16	LibRegex: Fix incorrect case-sensitive comparisons	sin-ack
	A tiny typo was introduced in bc8d16ad which caused all case insensitive comparisons to fail.
2021-06-14	LibRegex: Avoid initialization checks in get_opcode_by_id()	Gunnar Beutner

2021-06-14	LibRegex: Avoid making unnecessary string copies	Gunnar Beutner

2021-06-14	LibRegex: Make get_opcode() return a reference	Gunnar Beutner
	Previously this would return a pointer which could be null if the requested opcode was invalid. This should never be the case though so let's VERIFY() that instead.
2021-06-14	LibRegex: Remove return value for setters	Gunnar Beutner

2021-06-14	LibRegex: Use a plain array to store opcodes	Gunnar Beutner
	Using a hash map is unnecessary because the number of opcodes and their IDs never change.
2021-06-03	Everywhere: Replace ctype.h to avoid narrowing conversions	Max Wipfli
	This replaces ctype.h with CharacterType.h everywhere I could find issues with narrowing conversions. While using it will probably make sense almost everywhere in the future, the most critical places should have been addressed.
2021-05-31	LibRegex: Replace fprintf()/printf() with warnln()/outln()/dbgln()	Linus Groh

2021-04-25	Everywhere: Remove empty line after function body opening curly brace	Linus Groh

2021-04-22	Everything: Move to SPDX license identifiers in all files.	Brian Gianforcaro
	SPDX License Identifiers are a more compact / standardized way of representing file license information. See: https://spdx.dev/resources/use/#identifiers This was done with the `ambr` search and replace tool. ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
2021-04-21	LibRegex: Convert String::format() => String::formatted()	Andreas Kling

2021-04-01	LibRegex: Allow references to capture groups that aren't parsed yet	AnotherTest
	This only applies to the ECMA262 parser. This behaviour is an ECMA262-specific quirk, such references always generate zero-length matches (even on subsequent passes). Also adds a test in LibJS's test suite. Fixes #6039.
2021-02-27	LibRegex: Implement section B.1.4. of the ECMA262 spec	AnotherTest
	This allows the parser to deal with crazy patterns like the one in #5517.
2021-02-23	Everywhere: Rename ASSERT => VERIFY	Andreas Kling
	(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED) Since all of these checks are done in release builds as well, let's rename them to VERIFY to prevent confusion, as everyone is used to assertions being compiled out in release. We can introduce a new ASSERT macro that is specifically for debug checks, but I'm doing this wholesale conversion first since we've accumulated thousands of these already, and it's not immediately obvious which ones are suitable for ASSERT.
2021-01-25	Everywhere: Debug macros instead of constexpr.	asynts
	This was done with the following script: find . \( -name '.cpp' -o -name '.h' -o -name '.in' \) -not -path './Toolchain/' -not -path './Build/' -exec sed -i -E 's/dbgln<debug_([a-z_]+)>/dbgln<\U\1_DEBUG>/' {} \; find . \( -name '.cpp' -o -name '.h' -o -name '.in' \) -not -path './Toolchain/' -not -path './Build/' -exec sed -i -E 's/if constexpr \(debug_([a-z0-9_]+)/if constexpr \(\U\1_DEBUG/' {} \;
2021-01-22	Everywhere: Replace a bundle of dbg with dbgln.	asynts
	These changes are arbitrarily divided into multiple commits to make it easier to find potentially introduced bugs with git bisect.