Age | Commit message (Collapse) | Author |
|
In particular, we implicitly required that the caller initializes the
returned instances themselves (solved by making
UniformBumpAllocator::allocate call the constructor), and BumpAllocator
itself cannot handle classes that are not trivially deconstructible
(solved by deleting the method).
Co-authored-by: Ali Mohammad Pur <ali.mpfard@gmail.com>
|
|
The implementation of transform_bytecode_repetition_min_max expects at
least min=1, so let's also optimise on our way out of a bug where
'x{0,}' would cause a crash.
|
|
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
|
|
Instead of making a new string to compare against, simply grab the first
code unit of the input and compare against that.
|
|
It's very common to encounter single-character strings in JavaScript on
the web. We can make such strings significantly lighter by having a
1-character inline capacity on the Vectors.
|
|
The time we were spending on these signposts was adding up to way too
much, so let's not do it automatically.
|
|
|
|
This avoids doing DisjointChunks traversal for every bytecode access,
significantly reducing startup time for large regular expressions.
|
|
|
|
Flagged by pvs-studio, it looks like these were intended to be passed by
reference originally, but it was missed. This avoids excessive argument
copy when searching / matching in the regex API.
Before:
Command: /usr/Tests/LibRegex/Regex --bench
Average time: 5998.29 ms (median: 5991, stddev: 102.18)
After:
Command: /usr/Tests/LibRegex/Regex --bench
Average time: 5623.2 ms (median: 5623, stddev: 86.25)
|
|
Otherwise the fork in patterns like `(1+)\1` would be (incorrectly)
optimized away.
|
|
Fixes #10027.
|
|
Previously we would've copied the bytecode instead of moving the chunks
around, use the fancy new DisjointChunks<T> abstraction to make that
happen automagically.
This decreases vector copies and uses of memmove() by nearly 10x :^)
|
|
|
|
This currently tries to convert forking loops to atomic groups, and
unify the left side of alternations.
|
|
Otherwise the left and right capture instructions wouldn't point to the
same capture group if there was another nested group there.
|
|
|
|
Doing so would increase memory consumption by quite a bit, since many
useless copies of the checkpoints hashmap would be created and later
thrown away.
|
|
This makes (addmittedly weird) patterns like `(a*)*` work correctly
without going into an infinite fork loop.
|
|
|
|
Fixes a bug where backreferences could cause desync between the
code point index and regular index, making comparison off-by-N.
|
|
- Make sure that all the Repeat ops are reset (otherwise the operation
would not be correct when going over the Repeat op a second time)
- Make sure that all matches that are allowed to fail are backed by a
fork, otherwise the last failing fork would not have anywhere to
return to.
Fixes #9707.
|
|
This makes repetitions with large max bounds work correctly.
Also fixes an OOM issue found by OSS-Fuzz:
https://oss-fuzz.com/testcase?key=4725721980338176
|
|
Found by OSS-Fuzz: https://oss-fuzz.com/testcase?key=4869334212673536
|
|
That check was rather pointless as the input is a StringView which knows
its own bounds.
Fixes #9686.
|
|
For example, consider the following pattern:
new RegExp('\ud834\udf06', 'u')
With this pattern, the regex parser should insert the UTF-8 encoded
bytes 0xf0, 0x9d, 0x8c, and 0x86. However, because these characters are
currently treated as normal char types, they have a negative value since
they are all > 0x7f. Then, due to sign extension, when these characters
are cast to u64, the sign bit is preserved. The result is that these
bytes are inserted as 0xfffffffffffffff0, 0xffffffffffffff9d, etc.
Fortunately, there are only a few places where we insert bytecode with
the raw characters. In these places, be sure to treat the bytes as u8
before they are cast to u64.
|
|
The parser now stores this as a FlyString everywhere, so consumers can
also use it as a FlyString.
|
|
Unfortunately, this requires a slight divergence in the way the capture
group names are stored. Previously, the generated byte code would simply
store a view into the regex pattern string, so no string copying was
required.
Now, the escape sequences are decoded into a new string, and a vector
of all parsed capture group names are stored in a vector in the parser
result structure. The byte code then stores a view into the
corresponding string in that vector.
|
|
|
|
This will allow regex::Lexer users to invoke GenericLexer consumption
methods, such as GenericLexer::consume_escaped_codepoint().
This also allows for de-duplicating common methods between the lexers.
|
|
This is primarily to be able to remove the GenericLexer include out of
Format.h as well. A subsequent commit will add AK::Result to
GenericLexer, which will cause naming conflicts with other structures
named Result. This can be avoided (for now) by preventing nearly every
file in the system from implicitly including GenericLexer.
Other changes in this commit are to add the GenericLexer include to
files where it is missing.
|
|
This was missed in commit 27d555bab0d84913599cea3c4a6b0a0ed2a15b66.
|
|
|
|
Currently, when we need to repeat an instruction N times, we simply add
that instruction N times in a for-loop. This doesn't scale well with
extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1.
Instead, add a new REPEAT bytecode operation to defer this loop from the
parser to the runtime executor. This allows the parser to complete sans
any loops (for this instruction), and allows the executor to bail early
if the repeated bytecode fails.
Note: The templated ByteCode methods are to allow the Posix parsers to
continue using u32 because they are limited to N = 2^20.
|
|
This struct holds a counter for the number of executed operations, and
vectors for matches, captures groups, and named capture groups. Each of
the vectors is unused. Remove the struct and just keep a separate
counter for the executed operations.
|
|
Combining these into one list helps reduce the size of MatchState, and
as a result, reduces the amount of memory consumed during execution of
very large regex matches.
Doing this also allows us to remove a few regex byte code instructions:
ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference.
Named groups now behave the same as unnamed groups for these operations.
Note that SaveRightNamedCaptureGroup still exists to cache the matched
group name.
This also removes the recursion level from the MatchState, as it can
exist as a local variable in Matcher::execute instead.
|
|
Before the BumpAllocator OOB access issue was understood and fixed, the
chunk size was increased to 8MiB as a workaround in commit:
27d555bab0d84913599cea3c4a6b0a0ed2a15b66.
The issue is now resolved by: 0f1425c895ace40fbb10d68a55eeb3a6354479d3.
We can reduce the chunk size to 2MiB, which has the added benefit of
reducing runtime of the RegExp.prototype.exec test.
|
|
|
|
|
|
The grammar for the ECMA-262 CharacterEscape is:
CharacterEscape[U, N] ::
ControlEscape
c ControlLetter
0 [lookahead ∉ DecimalDigit]
HexEscapeSequence
RegExpUnicodeEscapeSequence[?U]
[~U]LegacyOctalEscapeSequence
IdentityEscape[?U, ?N]
It's important to parse the standalone "\0 [lookahead ∉ DecimalDigit]"
before parsing LegacyOctalEscapeSequence. Otherwise, all standalone "\0"
patterns are parsed as octal, which are disallowed in Unicode mode.
Further, LegacyOctalEscapeSequence should also be parsed while parsing
character classes.
|
|
|
|
|
|
* Only alphabetic (A-Z, a-z) characters may be escaped with \c. The loop
currently parsing \c includes code points between the upper/lower case
groups.
* In Unicode mode, all invalid identity escapes should cause a parser
error, even in browser-extended mode.
* Avoid an infinite loop when parsing the pattern "\c" on its own.
|
|
This was exposed to the user by mistake, and even accumulated a bunch of
users that didn't blow up out of sheer luck.
|
|
Fixes all remaining 'built-ins/RegExp/property-escapes' test262 tests.
|
|
|
|
Note that unlike binary properties and general categories, scripts must
be specified in the non-binary (Script=Value) form.
|
|
In non-Unicode mode, the existing MatchState::string_position is tracked
in code units; in Unicode mode, it is tracked in code points.
In order for some RegexStringView operations to be performant, it is
useful for the MatchState to have a field to always track the position
in code units. This will allow RegexStringView methods (e.g. operator[])
to perform lookups based on code unit offsets, rather than needing to
iterate over the entire string to find a code point offset.
|
|
The current method of iterating through the string to access a code
point hurts performance quite badly for very large strings. The test262
test "RegExp/property-escapes/generated/Any.js" previously took 3 hours
to complete; this one change brings it down to under 10 seconds.
|
|
These were previously generated as two instructions, Compare [Inverse]
and Compare [Property].
|