Age | Commit message (Collapse) | Author |
|
|
|
Equivalent to std::ranges::any_of, which clang-tidy suggests.
|
|
This stops clang-tidy from suggesting that this function can be made
static, although accessing `this->operator==` in the lambda function.
|
|
Also replace String creation from `""` with `String::empty()`
|
|
|
|
|
|
|
|
...by flattening the underlying bytecode chunks first.
Also avoid calling DisjointChunks::size() inside a loop.
This is a very significant improvement in performance, making the
compilation of a large regex with lots of alternatives take only ~100ms
instead of many minutes (I ran out of patience waiting for it) :^)
|
|
That method is O(n), and should generally be avoided.
|
|
|
|
|
|
The instructions can have dependencies (e.g. Repeat), so only unify
equal blocks instead of consecutive instructions.
Fixes #11247.
Also adds the minimal test case(s) from that issue.
|
|
The initial `ForkStay` is only needed if the looping block has a
following block, if there's no following block or the following block
does not attempt to match anything, we should not insert the ForkStay,
otherwise we would be rewriting `a+` as `a*` by allowing the 'end' to be
executed.
Fixes #10952.
|
|
This isn't a complete conversion to ErrorOr<void>, but a good chunk.
The end goal here is to propagate buffer allocation failures to the
caller, and allow the use of TRY() with formatting functions.
|
|
Previously we were always choosing the "nothing special" code path, even
if the dollar symbol was at the end of the pattern (and therefore should
have been considered special).
Fix that by actually checking if the pattern end follows, and emitting
the correct instruction if necessary.
|
|
|
|
|
|
Doing so would cause patterns like `(a|)` to not match the empty string.
|
|
In particular, we implicitly required that the caller initializes the
returned instances themselves (solved by making
UniformBumpAllocator::allocate call the constructor), and BumpAllocator
itself cannot handle classes that are not trivially deconstructible
(solved by deleting the method).
Co-authored-by: Ali Mohammad Pur <ali.mpfard@gmail.com>
|
|
The implementation of transform_bytecode_repetition_min_max expects at
least min=1, so let's also optimise on our way out of a bug where
'x{0,}' would cause a crash.
|
|
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
|
|
Instead of making a new string to compare against, simply grab the first
code unit of the input and compare against that.
|
|
It's very common to encounter single-character strings in JavaScript on
the web. We can make such strings significantly lighter by having a
1-character inline capacity on the Vectors.
|
|
The time we were spending on these signposts was adding up to way too
much, so let's not do it automatically.
|
|
|
|
This avoids doing DisjointChunks traversal for every bytecode access,
significantly reducing startup time for large regular expressions.
|
|
|
|
Flagged by pvs-studio, it looks like these were intended to be passed by
reference originally, but it was missed. This avoids excessive argument
copy when searching / matching in the regex API.
Before:
Command: /usr/Tests/LibRegex/Regex --bench
Average time: 5998.29 ms (median: 5991, stddev: 102.18)
After:
Command: /usr/Tests/LibRegex/Regex --bench
Average time: 5623.2 ms (median: 5623, stddev: 86.25)
|
|
Otherwise the fork in patterns like `(1+)\1` would be (incorrectly)
optimized away.
|
|
Fixes #10027.
|
|
Previously we would've copied the bytecode instead of moving the chunks
around, use the fancy new DisjointChunks<T> abstraction to make that
happen automagically.
This decreases vector copies and uses of memmove() by nearly 10x :^)
|
|
|
|
This currently tries to convert forking loops to atomic groups, and
unify the left side of alternations.
|
|
Otherwise the left and right capture instructions wouldn't point to the
same capture group if there was another nested group there.
|
|
|
|
Doing so would increase memory consumption by quite a bit, since many
useless copies of the checkpoints hashmap would be created and later
thrown away.
|
|
This makes (addmittedly weird) patterns like `(a*)*` work correctly
without going into an infinite fork loop.
|
|
|
|
Fixes a bug where backreferences could cause desync between the
code point index and regular index, making comparison off-by-N.
|
|
- Make sure that all the Repeat ops are reset (otherwise the operation
would not be correct when going over the Repeat op a second time)
- Make sure that all matches that are allowed to fail are backed by a
fork, otherwise the last failing fork would not have anywhere to
return to.
Fixes #9707.
|
|
This makes repetitions with large max bounds work correctly.
Also fixes an OOM issue found by OSS-Fuzz:
https://oss-fuzz.com/testcase?key=4725721980338176
|
|
Found by OSS-Fuzz: https://oss-fuzz.com/testcase?key=4869334212673536
|
|
That check was rather pointless as the input is a StringView which knows
its own bounds.
Fixes #9686.
|
|
For example, consider the following pattern:
new RegExp('\ud834\udf06', 'u')
With this pattern, the regex parser should insert the UTF-8 encoded
bytes 0xf0, 0x9d, 0x8c, and 0x86. However, because these characters are
currently treated as normal char types, they have a negative value since
they are all > 0x7f. Then, due to sign extension, when these characters
are cast to u64, the sign bit is preserved. The result is that these
bytes are inserted as 0xfffffffffffffff0, 0xffffffffffffff9d, etc.
Fortunately, there are only a few places where we insert bytecode with
the raw characters. In these places, be sure to treat the bytes as u8
before they are cast to u64.
|
|
The parser now stores this as a FlyString everywhere, so consumers can
also use it as a FlyString.
|
|
Unfortunately, this requires a slight divergence in the way the capture
group names are stored. Previously, the generated byte code would simply
store a view into the regex pattern string, so no string copying was
required.
Now, the escape sequences are decoded into a new string, and a vector
of all parsed capture group names are stored in a vector in the parser
result structure. The byte code then stores a view into the
corresponding string in that vector.
|
|
|
|
This will allow regex::Lexer users to invoke GenericLexer consumption
methods, such as GenericLexer::consume_escaped_codepoint().
This also allows for de-duplicating common methods between the lexers.
|
|
This is primarily to be able to remove the GenericLexer include out of
Format.h as well. A subsequent commit will add AK::Result to
GenericLexer, which will cause naming conflicts with other structures
named Result. This can be avoided (for now) by preventing nearly every
file in the system from implicitly including GenericLexer.
Other changes in this commit are to add the GenericLexer include to
files where it is missing.
|
|
This was missed in commit 27d555bab0d84913599cea3c4a6b0a0ed2a15b66.
|