Age | Commit message (Collapse) | Author |
|
Fixes #13755.
Co-Authored-By: Damien Firmenich <fir.damien@gmail.com>
|
|
|
|
Just a little thinking outside the box, and we can now parse and
optimise a million copies of "a|" chained together in just a second :^)
|
|
This helps us not blow up when too many disjunctions are chained together
in the regex we're parsing.
Fixes #12615.
|
|
While quantifying assertions is very much meaningless, the specification
allows them with annex B's extended grammar for browsers, so read and
apply the quantifiers.
Fixes #12373.
|
|
Previously we were compiling `/a|/` into what effectively would be
`/|a/`, which is clearly incorrect.
|
|
It makes no sense to skip half of an instruction, so make sure to skip
only full instructions!
|
|
ECMA-262 defines \s as:
Return the CharSet containing all characters corresponding to a code
point on the right-hand side of the WhiteSpace or LineTerminator
productions.
The LineTerminator production is simply: U+000A, U+000D, U+2028, or
U+2029. Unfortunately there isn't a Unicode property that covers just
those code points.
The WhiteSpace production is: U+0009, U+000B, U+000C, U+FEFF, or any
code point with the Space_Separator general category.
If the Unicode generators are disabled, this will fall back to ASCII
space code points.
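
As a rough illustration of the definition above, the `\s` set can be sketched as a membership test. This is Python rather than the engine's own code, and `is_ecma_whitespace` is a hypothetical name. It leans on Python's `unicodedata` module for the Space_Separator (`Zs`) general category, so the exact result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# LineTerminator production: U+000A, U+000D, U+2028, U+2029.
LINE_TERMINATORS = {0x000A, 0x000D, 0x2028, 0x2029}
# Code points listed explicitly by the WhiteSpace production.
EXPLICIT_WHITESPACE = {0x0009, 0x000B, 0x000C, 0xFEFF}


def is_ecma_whitespace(code_point: int) -> bool:
    """Return True if code_point belongs to the ECMA-262 \\s character set."""
    if code_point in LINE_TERMINATORS or code_point in EXPLICIT_WHITESPACE:
        return True
    # Remaining WhiteSpace members: anything in general category Zs.
    return unicodedata.category(chr(code_point)) == "Zs"
```

Note that U+0085 (NEL) is a control character, not a Space_Separator, so it is not part of `\s` under this definition.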
|
|
LibRegex already implements this loop in a more performant way, so all
LibJS has to do here is to return things in the right shape, and not
loop over the input string.
Previously this was a quadratic operation on string length, which led
to crazy execution times on failing regexps - now it's nice and fast :^)
Note that a Regex test has to be updated to remove the stateful flag as
it repeats matching on multiple strings.
|
|
All of JS's regular expression APIs only want a single match, so avoid
trying to produce more (which will be discarded anyway).
|
|
As ECMA262 regex allows `[^]` and literal newlines to match newlines in
the input string, we shouldn't split the input string into lines, rather
simply make boundaries and catchall patterns capable of checking for
these conditions specifically.
|
|
Instead of leaking all capture groups and selectively clearing some,
simply avoid leaking things and only "define" the ones that need to
exist.
This *actually* implements the capture groups ECMA262 quirk.
Also adds the test removed in the previous commit (to avoid messing up
test runs across bisects).
|
|
This partially reverts commit c11be92e23d899e28d45f67be24e47b2e5114d3a.
That commit fixes one thing and breaks many more; a subsequent commit
will implement this quirk in a saner way.
|
|
...only if Multiline is not enabled.
Fixes #11940.
|
|
This implements the quirk defined by "Note 3" in section "Canonicalize"
(https://tc39.es/ecma262/#sec-runtime-semantics-canonicalize-ch).
Crosses off another quirk from #6042.
|
|
Previously we were jumping to the new end of the previous block (created
by the newly inserted ForkStay); correct the offset to jump to the
correct block as shown in the comments.
Fixes #12033.
|
|
|
|
These were missed in 565a880ce5a14bac817c73916e91ebfa04c8b99b.
This wasn't an issue because these tests don't pledge/unveil anything,
so they could happily dlopen() the library at runtime. But this is now
needed in order to migrate LibUnicode towards weak symbols instead.
|
|
This makes negative lookarounds with more than one fork behave
correctly.
Fixes #11350.
|
|
|
|
|
|
The instructions can have dependencies (e.g. Repeat), so only unify
equal blocks instead of consecutive instructions.
Fixes #11247.
Also adds the minimal test case(s) from that issue.
|
|
The initial `ForkStay` is only needed if the looping block has a
following block. If there's no following block, or the following block
does not attempt to match anything, we should not insert the ForkStay;
otherwise we would be rewriting `a+` as `a*` by allowing the 'end' to be
executed.
Fixes #10952.
|
|
|
|
Preparation for using Error.h from Vector.h. This required moving some
things out of line.
|
|
|
|
Doing so would cause patterns like `(a|)` to not match the empty string.
|
|
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
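
A minimal sketch of the idea, in Python rather than the engine's own code. The function names and the tuple representation are assumptions made for illustration and are not taken from the commit.

```python
from bisect import bisect_right


def compress_ranges(ranges):
    """Sort inclusive (lo, hi) ranges and merge overlapping or adjacent ones."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def matches(table, code_point):
    """Binary-search the sorted table for a range containing code_point."""
    # Index of the last range whose start is <= code_point.
    index = bisect_right(table, (code_point, float("inf"))) - 1
    return index >= 0 and table[index][0] <= code_point <= table[index][1]


# Ranges for a class like [a-fd-z0-9], given as code points.
table = compress_ranges([(97, 102), (100, 122), (48, 57)])
```

Here the three input ranges collapse to two, and each lookup costs a logarithmic number of comparisons instead of a linear scan over every class member.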
|
|
Using StringView instead of C strings is basically always preferable.
The only reason to use a C string is because you are calling a C API.
|
|
Otherwise the fork in patterns like `(1+)\1` would be (incorrectly)
optimized away.
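
To see why the fork matters, here is a small demonstration using Python's `re` module purely as a stand-in backtracking engine (it is not the engine this commit touches).

```python
import re

# Greedy `1+` first consumes all four characters, leaving nothing for `\1`.
# Only by backtracking to "11" can the backreference match the rest.
match = re.fullmatch(r"(1+)\1", "1111")
captured = match.group(1) if match else None
```

If the loop were turned into an atomic group, the engine could not give characters back, and this match would fail.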
|
|
This currently tries to convert forking loops to atomic groups, and
unify the left side of alternations.
|
|
Otherwise the left and right capture instructions wouldn't point to the
same capture group if there was another nested group there.
|
|
|
|
This makes (admittedly weird) patterns like `(a*)*` work correctly
without going into an infinite fork loop.
|
|
Using a file(GLOB) to find all the test files in a directory is an easy
hack to get things started, but has some drawbacks. Namely, if you add
a test, it won't be found without re-running CMake. `ninja` seems
to do this automatically, but it would be nice to one day stop seeing it
rechecking our globbed directories.
|
|
That check was rather pointless as the input is a StringView which knows
its own bounds.
Fixes #9686.
|
|
For example, consider the following pattern:
new RegExp('\ud834\udf06', 'u')
With this pattern, the regex parser should insert the UTF-8 encoded
bytes 0xf0, 0x9d, 0x8c, and 0x86. However, because these characters are
currently treated as normal char types, they have a negative value since
they are all > 0x7f. Then, due to sign extension, when these characters
are cast to u64, the sign bit is preserved. The result is that these
bytes are inserted as 0xfffffffffffffff0, 0xffffffffffffff9d, etc.
Fortunately, there are only a few places where we insert bytecode with
the raw characters. In these places, be sure to treat the bytes as u8
before they are cast to u64.
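
The arithmetic can be reproduced with a short simulation. Python has no fixed-width integer types, so the two helpers below (hypothetical names) only mimic what a cast from a signed 8-bit char, or from u8, to u64 produces.

```python
U64_MASK = 0xFFFFFFFFFFFFFFFF


def widen_as_signed_char(byte: int) -> int:
    """Mimic casting a signed 8-bit char to u64: the sign bit is extended."""
    signed = byte - 0x100 if byte > 0x7F else byte
    return signed & U64_MASK


def widen_as_u8(byte: int) -> int:
    """Mimic casting through u8 first: the value stays in 0..255."""
    return byte & 0xFF


# The surrogate pair \ud834\udf06 denotes U+1D306.
encoded = "\U0001D306".encode("utf-8")
broken = [hex(widen_as_signed_char(b)) for b in encoded]
fixed = [hex(widen_as_u8(b)) for b in encoded]
```

`broken` starts with `0xfffffffffffffff0` and `0xffffffffffffff9d`, matching the values described above, while `fixed` holds the intended `0xf0`, `0x9d`, `0x8c`, `0x86`.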
|
|
Unfortunately, this requires a slight divergence in the way the capture
group names are stored. Previously, the generated byte code would simply
store a view into the regex pattern string, so no string copying was
required.
Now, the escape sequences are decoded into a new string, and all parsed
capture group names are stored in a vector in the parser result
structure. The byte code then stores a view into the corresponding
string in that vector.
|
|
This was missed in commit 27d555bab0d84913599cea3c4a6b0a0ed2a15b66.
|
|
|
|
Currently, when we need to repeat an instruction N times, we simply add
that instruction N times in a for-loop. This doesn't scale well with
extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1.
Instead, add a new REPEAT bytecode operation to defer this loop from the
parser to the runtime executor. This allows the parser to complete sans
any loops (for this instruction), and allows the executor to bail early
if the repeated bytecode fails.
Note: The templated ByteCode methods are to allow the Posix parsers to
continue using u32 because they are limited to N = 2^20.
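
The difference between the two strategies can be sketched with a toy bytecode. The instruction tuples and the `run` executor below are hypothetical and far simpler than the real bytecode; they only show how a single repeat instruction with a runtime counter replaces N emitted copies.

```python
def compile_unrolled(char, count):
    # Old approach: emit the instruction `count` times.
    return [("CHAR", char)] * count


def compile_with_repeat(char, count):
    # New approach: emit the body once, followed by one REPEAT instruction
    # that names the body length and the total number of iterations.
    return [("CHAR", char), ("REPEAT", 1, count)]


def run(program, text):
    """Execute the toy program against text; True on a full match."""
    ip, pos, counters = 0, 0, {}
    while ip < len(program):
        op = program[ip]
        if op[0] == "CHAR":
            if pos >= len(text) or text[pos] != op[1]:
                return False  # Bail out early; no further iterations run.
            pos += 1
            ip += 1
        else:  # REPEAT
            _, body_len, count = op
            done = counters.get(ip, 0) + 1
            if done < count:
                counters[ip] = done
                ip -= body_len  # Jump back to the start of the body.
            else:
                counters.pop(ip, None)
                ip += 1
    return pos == len(text)
```

With this shape, `compile_with_repeat("a", 2**20)` is still two instructions long, whereas the unrolled form grows linearly with the count.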
|
|
Combining these into one list helps reduce the size of MatchState, and
as a result, reduces the amount of memory consumed during execution of
very large regex matches.
Doing this also allows us to remove a few regex byte code instructions:
ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference.
Named groups now behave the same as unnamed groups for these operations.
Note that SaveRightNamedCaptureGroup still exists to cache the matched
group name.
This also removes the recursion level from the MatchState, as it can
exist as a local variable in Matcher::execute instead.
|
|
|
|
|
|
The grammar for the ECMA-262 CharacterEscape is:
CharacterEscape[U, N] ::
ControlEscape
c ControlLetter
0 [lookahead ∉ DecimalDigit]
HexEscapeSequence
RegExpUnicodeEscapeSequence[?U]
[~U]LegacyOctalEscapeSequence
IdentityEscape[?U, ?N]
It's important to parse the standalone "\0 [lookahead ∉ DecimalDigit]"
before parsing LegacyOctalEscapeSequence. Otherwise, all standalone "\0"
patterns are parsed as octal, which is disallowed in Unicode mode.
Further, LegacyOctalEscapeSequence should also be parsed while parsing
character classes.
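
The ordering can be sketched as follows. This is a simplified Python illustration with a hypothetical function name; it ignores backreferences and the other CharacterEscape alternatives, and approximates LegacyOctalEscapeSequence as "up to three octal digits with a value of at most 0o377".

```python
OCTAL_DIGITS = "01234567"
DECIMAL_DIGITS = "0123456789"


def parse_zero_or_octal_escape(pattern, pos, unicode_mode):
    """Parse the escape whose first character (after the backslash) is
    pattern[pos]. Returns (kind, value, next_pos)."""
    next_char = pattern[pos + 1] if pos + 1 < len(pattern) else ""
    # Step 1: standalone "\0" not followed by a decimal digit.
    if pattern[pos] == "0" and (next_char == "" or next_char not in DECIMAL_DIGITS):
        return ("null", 0, pos + 1)
    # Step 2: legacy octal, which is only permitted outside Unicode mode.
    if pattern[pos] in OCTAL_DIGITS:
        if unicode_mode:
            raise ValueError("octal escapes are not allowed in Unicode mode")
        value, end = 0, pos
        while end < len(pattern) and end - pos < 3 and pattern[end] in OCTAL_DIGITS:
            candidate = value * 8 + int(pattern[end])
            if candidate > 0o377:
                break
            value, end = candidate, end + 1
        return ("octal", value, end)
    raise ValueError("not a \\0 or octal escape")
```

Because step 1 runs first, a lone `\0` is accepted in Unicode mode, while `\012` still reaches the octal branch and is rejected there.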
|
|
A subsequent commit will add tests that require a string containing only
"\0". As a C-string, this will be interpreted as the null terminator. To
make the diff for that commit easier to grok, this commit converts all
tests to use StringView without any other functional changes.
|
|
|
|
|
|
* Only alphabetic (A-Z, a-z) characters may be escaped with \c. The loop
currently parsing \c includes code points between the upper/lower case
groups.
* In Unicode mode, all invalid identity escapes should cause a parser
error, even in browser-extended mode.
* Avoid an infinite loop when parsing the pattern "\c" on its own.
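
A small sketch of the first and third points, in Python with hypothetical names. The "buggy" check is only a guess at the kind of range test that would admit the code points between `Z` and `a`; it is not copied from the parser.

```python
def buggy_is_control_letter(ch: str) -> bool:
    # A single range test also admits [ \ ] ^ _ ` between 'Z' and 'a'.
    return "A" <= ch <= "z"


def is_control_letter(ch: str) -> bool:
    # Only A-Z and a-z are valid after \c.
    return ("A" <= ch <= "Z") or ("a" <= ch <= "z")


def parse_control_escape(pattern, pos):
    """pattern[pos] is the 'c' after the backslash.
    Returns (code_point, next_pos), or None if no ControlLetter follows."""
    letter_pos = pos + 1
    if letter_pos >= len(pattern) or not is_control_letter(pattern[letter_pos]):
        # Covers "\c" on its own: report failure instead of looping.
        return None
    # ECMA-262 maps the letter to its code point value modulo 32.
    return (ord(pattern[letter_pos]) % 32, letter_pos + 1)
```

For example, `\cJ` yields code point 10 (line feed), while `\c_` and a bare `\c` are not control escapes.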
|
|
Fixes all remaining 'built-ins/RegExp/property-escapes' test262 tests.
|