summaryrefslogtreecommitdiff
path: root/Userland/Libraries/LibRegex/RegexByteCode.cpp
AgeCommit message (Collapse)Author
2022-02-05LibRegex: Support non-ASCII whitespace characters when matching \s or \STimothy Flynn
ECMA-262 defines \s as: Return the CharSet containing all characters corresponding to a code point on the right-hand side of the WhiteSpace or LineTerminator productions. The LineTerminator production is simply: U+000A, U+000D, U+2028, or U+2029. Unfortunately there isn't a Unicode property that covers just those code points. The WhiteSpace production is: U+0009, U+000B, U+000C, U+FEFF, or any code point with the Space_Separator general category. If the Unicode generators are disabled, this will fall back to ASCII space code points.
2022-01-26LibRegex: Implement ECMA262 multiline matching without splitting linesAli Mohammad Pur
As ECMA262 regex allows `[^]` and literal newlines to match newlines in the input string, we shouldn't split the input string into lines, rather simply make boundaries and catchall patterns capable of checking for these conditions specifically.
2022-01-22LibRegex: Add some more information to Compare::Reference debug outputAli Mohammad Pur
2022-01-22LibRegex: Allow ClearCaptureGroup to create new groupsAli Mohammad Pur
Instead of leaking all capture groups and selectively clearing some, simply avoid leaking things and only "define" the ones that need to exist. This *actually* implements the capture groups ECMA262 quirk. Also adds the test removed in the previous commit (to avoid messing up test runs across bisects).
2021-12-25LibRegex: Make FailForks fail all forks up to the last save pointAli Mohammad Pur
This makes negative lookarounds with more than one fork behave correctly. Fixes #11350.
2021-12-21LibRegex: Remove some meaningless/useless const-qualifiersHendiadyoin1
Also replace String creation from `""` with `String::empty()`
2021-12-21LibRegex: Collapse some `if(...) return true; else return false;` blocksHendiadyoin1
2021-12-21LibRegex: Add some implied auto qualifiersHendiadyoin1
2021-12-21LibRegex: Avoid calling DisjointChunks::size() in get_opcode()Ali Mohammad Pur
That method is O(n), and should generally be avoided.
2021-11-11Everywhere: Pass AK::StringView by valueAndreas Kling
2021-10-03LibRegex: Use a match table for character classesAli Mohammad Pur
Generate a sorted, compressed series of ranges in a match table for character classes, and use a binary search to find the matches. This is about a 3-4x speedup for character class match performance. :^)
2021-10-03LibRegex: Avoid creating a new temporary RegexStringView in Char compareAli Mohammad Pur
Instead of making a new string to compare against, simply grab the first code unit of the input and compare against that.
2021-10-02LibJS+AK: Use Vector<u16, 1> for UTF-16 string storageAndreas Kling
It's very common to encounter single-character strings in JavaScript on the web. We can make such strings significantly lighter by having a 1-character inline capacity on the Vectors.
2021-09-13LibRegex: Add a basic optimization passAli Mohammad Pur
This currently tries to convert forking loops to atomic groups, and unify the left side of alternations.
2021-09-06LibRegex: Avoid keeping track of checkpoints across forksAli Mohammad Pur
Doing so would increase memory consumption by quite a bit, since many useless copies of the checkpoints hashmap would be created and later thrown away.
2021-09-06LibRegex: Make infinite repetitions short-circuit on empty matchesAli Mohammad Pur
This makes (addmittedly weird) patterns like `(a*)*` work correctly without going into an infinite fork loop.
2021-09-01LibRegex: Correctly advance string positions in Compare::compare_stringAli Mohammad Pur
Fixes a bug where backreferences could cause desync between the code point index and regular index, making comparison off-by-N.
2021-09-01LibRegex: Correctly handle failing in the middle of explicit repeatsAli Mohammad Pur
- Make sure that all the Repeat ops are reset (otherwise the operation would not be correct when going over the Repeat op a second time) - Make sure that all matches that are allowed to fail are backed by a fork, otherwise the last failing fork would not have anywhere to return to. Fixes #9707.
2021-08-18LibRegex: Ensure the GoBack operation decrements the code unit indexTimothy Flynn
This was missed in commit 27d555bab0d84913599cea3c4a6b0a0ed2a15b66.
2021-08-15LibRegex: Implement and use a REPEAT operation for bytecode repetitionTimothy Flynn
Currently, when we need to repeat an instruction N times, we simply add that instruction N times in a for-loop. This doesn't scale well with extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1. Instead, add a new REPEAT bytecode operation to defer this loop from the parser to the runtime executor. This allows the parser to complete sans any loops (for this instruction), and allows the executor to bail early if the repeated bytecode fails. Note: The templated ByteCode methods are to allow the Posix parsers to continue using u32 because they are limited to N = 2^20.
2021-08-15LibRegex: Remove (mostly) unused regex::MatchOutputTimothy Flynn
This struct holds a counter for the number of executed operations, and vectors for matches, captures groups, and named capture groups. Each of the vectors is unused. Remove the struct and just keep a separate counter for the executed operations.
2021-08-15LibRegex+LibJS: Combine named and unnamed capture groups in MatchStateTimothy Flynn
Combining these into one list helps reduce the size of MatchState, and as a result, reduces the amount of memory consumed during execution of very large regex matches. Doing this also allows us to remove a few regex byte code instructions: ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference. Named groups now behave the same as unnamed groups for these operations. Note that SaveRightNamedCaptureGroup still exists to cache the matched group name. This also removes the recursion level from the MatchState, as it can exist as a local variable in Matcher::execute instead.
2021-08-04LibRegex: Support property escapes of Unicode script extensionsTimothy Flynn
2021-08-04LibRegex: Support property escapes of the Unicode script propertyTimothy Flynn
Note that unlike binary properties and general categories, scripts must be specified in the non-binary (Script=Value) form.
2021-08-04LibRegex: Track string position in both code units and code pointsTimothy Flynn
In non-Unicode mode, the existing MatchState::string_position is tracked in code units; in Unicode mode, it is tracked in code points. In order for some RegexStringView operations to be performant, it is useful for the MatchState to have a field to always track the position in code units. This will allow RegexStringView methods (e.g. operator[]) to perform lookups based on code unit offsets, rather than needing to iterate over the entire string to find a code point offset.
2021-08-02LibRegex: Support property escapes of Unicode General CategoriesTimothy Flynn
This changes LibRegex to parse the property escape as a Variant of Unicode Property & General Category values. A byte code instruction is added to perform matching based on General Category values.
2021-07-30LibRegex+LibUnicode: Begin implementing Unicode property escapesTimothy Flynn
This supports some binary property matching. It does not support any properties not yet parsed by LibUnicode, nor does it support value matching (such as Script_Extensions=Latin).
2021-07-24LibRegex: Preserve the type of the match when clearing capture groupsAli Mohammad Pur
Even though the contents are supposed to be reset, the type should stay unchanged, as that's an assumption the engine is making.
2021-07-23LibRegex: Support UTF-16 RegexStringView and improve Unicode matchingTimothy Flynn
When the Unicode option is not set, regular expressions should match based on code units; when it is set, they should match based on code points. To do so, the regex parser must combine surrogate pairs when the Unicode option is set. Further, RegexStringView needs to know if the flag is set in order to return code point vs. code unit based string lengths and substrings.
2021-07-23LibRegex: Switch to east-const styleAli Mohammad Pur
2021-07-23LibRegex: Clear previous capture group contents in ECMA262 modeAli Mohammad Pur
ECMA262 requires that the capture groups only contain the values from the last iteration, e.g. `((c)(a)?(b))` should _not_ contain 'a' in the second capture group when matching "cabcb".
2021-07-18LibRegex+Everywhere: Make LibRegex more unicode-awareAli Mohammad Pur
This commit makes LibRegex (mostly) capable of operating on any of the three main string views: - StringView for raw strings - Utf8View for utf-8 encoded strings - Utf32View for raw unicode strings As a result, regexps with unicode strings should be able to properly handle utf-8 and not stop in the middle of a code point. A future commit will update LibJS to use the correct type of string depending on the flags.
2021-07-18LibRegex: Partially implement string compare for Utf32ViewAli Mohammad Pur
2021-06-16LibRegex: Display correct position for Compare in REGEX_DEBUGsin-ack
When REGEX_DEBUG is enabled, LibRegex dumps a table of information regarding the state of the regex bytecode execution. The Compare opcode manipulates state.string_position directly, so the string_position value cannot be used to display where the comparison started; therefore, this patch introduces a new variable to keep track of where we were before the comparison happened.
2021-06-16LibRegex: Fix incorrect case-sensitive comparisonssin-ack
A tiny typo was introduced in bc8d16ad which caused all case insensitive comparisons to fail.
2021-06-14LibRegex: Avoid initialization checks in get_opcode_by_id()Gunnar Beutner
2021-06-14LibRegex: Avoid making unnecessary string copiesGunnar Beutner
2021-06-14LibRegex: Make get_opcode() return a referenceGunnar Beutner
Previously this would return a pointer which could be null if the requested opcode was invalid. This should never be the case though so let's VERIFY() that instead.
2021-06-14LibRegex: Remove return value for settersGunnar Beutner
2021-06-14LibRegex: Use a plain array to store opcodesGunnar Beutner
Using a hash map is unnecessary because the number of opcodes and their IDs never change.
2021-06-03Everywhere: Replace ctype.h to avoid narrowing conversionsMax Wipfli
This replaces ctype.h with CharacterType.h everywhere I could find issues with narrowing conversions. While using it will probably make sense almost everywhere in the future, the most critical places should have been addressed.
2021-05-31LibRegex: Replace fprintf()/printf() with warnln()/outln()/dbgln()Linus Groh
2021-04-25Everywhere: Remove empty line after function body opening curly braceLinus Groh
2021-04-22Everything: Move to SPDX license identifiers in all files.Brian Gianforcaro
SPDX License Identifiers are a more compact / standardized way of representing file license information. See: https://spdx.dev/resources/use/#identifiers This was done with the `ambr` search and replace tool. ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
2021-04-21LibRegex: Convert String::format() => String::formatted()Andreas Kling
2021-04-01LibRegex: Allow references to capture groups that aren't parsed yetAnotherTest
This only applies to the ECMA262 parser. This behaviour is an ECMA262-specific quirk, such references always generate zero-length matches (even on subsequent passes). Also adds a test in LibJS's test suite. Fixes #6039.
2021-02-27LibRegex: Implement section B.1.4. of the ECMA262 specAnotherTest
This allows the parser to deal with crazy patterns like the one in #5517.
2021-02-23Everywhere: Rename ASSERT => VERIFYAndreas Kling
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED) Since all of these checks are done in release builds as well, let's rename them to VERIFY to prevent confusion, as everyone is used to assertions being compiled out in release. We can introduce a new ASSERT macro that is specifically for debug checks, but I'm doing this wholesale conversion first since we've accumulated thousands of these already, and it's not immediately obvious which ones are suitable for ASSERT.
2021-01-25Everywhere: Debug macros instead of constexpr.asynts
This was done with the following script: find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/dbgln<debug_([a-z_]+)>/dbgln<\U\1_DEBUG>/' {} \; find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/if constexpr \(debug_([a-z0-9_]+)/if constexpr \(\U\1_DEBUG/' {} \;
2021-01-22Everywhere: Replace a bundle of dbg with dbgln.asynts
These changes are arbitrarily divided into multiple commits to make it easier to find potentially introduced bugs with git bisect.