Age | Commit message (Collapse) | Author |
|
|
|
This test file had #ifdef macros at the top that caused none of the
content to be compiled unless a developer manually wanted to run the
specific benchmarks within. As such, it has become stale. Remove it for
now, if someone wants to restore it in an always-runnable state, we can
restore the specific tests it's trying to benchmark.
|
|
|
|
Currently, LibUnicodeData contains the generated UCD and CLDR data. Move
the UCD data to the main LibUnicode library, and rename LibUnicodeData
to LibLocaleData. This is another prepatory change to migrate to
LibLocale.
|
|
Previously, for a regex such as /[a-sy-z]/i, we would incorrectly think
the character "u" fell into the range "a-s" because neither of the
conditions "u > s && U > s" or "u < a && U < a" would be true, resulting
in the lookup falling back to assuming the character is in the range.
Instead, first explicitly check if the character falls into the range,
rather than checking if it falls outside the range. If the explicit
checks fail, then we know the character is outside the range.
|
|
This skips the new string unicode properties additions, along with \q{}.
|
|
This prevents us from needing a sv suffix, and potentially reduces the
need to run generic code for a single character (as contains,
starts_with, ends_with etc. for a char will be just a length and
equality check).
No functional changes.
|
|
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
|
|
[^XYZ] is not(X | Y | Z), we used to translate this to
not(X) | not(Y) | not(Z), this commit makes LibRegex interpret this
pattern as not(X) & not(Y) & not(Z).
|
|
The lowercase version of a range is not required to be a valid range,
instead of casefolding the range and making it invalid, check twice with
both cases of the input character (which are the same as the input if
not insensitive).
This time includes an actual test :^)
|
|
Otherwise the range order would be inverted.
|
|
We had a really naive and simplistic implementation, which lead to
various issues where the optimiser incorrectly rewrote the regex to use
atomic groups; this commit fixes that.
|
|
Fixes #13755.
Co-Authored-By: Damien Firmenich <fir.damien@gmail.com>
|
|
|
|
Just a little thinking outside the box, and we can now parse and
optimise a million copies of "a|" chained together in just a second :^)
|
|
This helps us not blow up when too many disjunctions are chained togther
in the regex we're parsing.
Fixes #12615.
|
|
While quantifying assertions is very much meaningless, the specification
allows them with annex B's extended grammar for browsers, so read and
apply the quantifiers.
Fixes #12373.
|
|
Previously we were compiling `/a|/` into what effectively would be
`/|a`, which is clearly incorrect.
|
|
It makes no sense to skip half of an instruction, so make sure to skip
only full instructions!
|
|
ECMA-262 defines \s as:
Return the CharSet containing all characters corresponding to a code
point on the right-hand side of the WhiteSpace or LineTerminator
productions.
The LineTerminator production is simply: U+000A, U+000D, U+2028, or
U+2029. Unfortunately there isn't a Unicode property that covers just
those code points.
The WhiteSpace production is: U+0009, U+000B, U+000C, U+FEFF, or any
code point with the Space_Separator general category.
If the Unicode generators are disabled, this will fall back to ASCII
space code points.
|
|
LibRegex already implements this loop in a more performant way, so all
LibJS has to do here is to return things in the right shape, and not
loop over the input string.
Previously this was a quadratic operation on string length, which lead
to crazy execution times on failing regexps - now it's nice and fast :^)
Note that a Regex test has to be updated to remove the stateful flag as
it repeats matching on multiple strings.
|
|
All of JS's regular expression APIs only want a single match, so avoid
trying to produce more (which will be discarded anyway).
|
|
As ECMA262 regex allows `[^]` and literal newlines to match newlines in
the input string, we shouldn't split the input string into lines, rather
simply make boundaries and catchall patterns capable of checking for
these conditions specifically.
|
|
Instead of leaking all capture groups and selectively clearing some,
simply avoid leaking things and only "define" the ones that need to
exist.
This *actually* implements the capture groups ECMA262 quirk.
Also adds the test removed in the previous commit (to avoid messing up
test runs across bisects).
|
|
This partially reverts commit c11be92e23d899e28d45f67be24e47b2e5114d3a.
That commit fixes one thing and breaks many more, a next commit will
implement this quirk in a more sane way.
|
|
...only if Multiline is not enabled.
Fixes #11940.
|
|
This implements the quirk defined by "Note 3" in section "Canonicalize"
(https://tc39.es/ecma262/#sec-runtime-semantics-canonicalize-ch).
Crosses off another quirk from #6042.
|
|
Previously we were jumping to the new end of the previous block (created
by the newly inserted ForkStay), correct the offset to jump to the
correct block as shown in the comments.
Fixes #12033.
|
|
|
|
These were missed in 565a880ce5a14bac817c73916e91ebfa04c8b99b.
This wasn't an issue because these tests don't pledge/unveil anything,
so they could happily dlopen() the library at runtime. But this is now
needed in order to migrate LibUnicode towards weak symbols instead.
|
|
This makes negative lookarounds with more than one fork behave
correctly.
Fixes #11350.
|
|
|
|
|
|
The instructions can have dependencies (e.g. Repeat), so only unify
equal blocks instead of consecutive instructions.
Fixes #11247.
Also adds the minimal test case(s) from that issue.
|
|
The initial `ForkStay` is only needed if the looping block has a
following block, if there's no following block or the following block
does not attempt to match anything, we should not insert the ForkStay,
otherwise we would be rewriting `a+` as `a*` by allowing the 'end' to be
executed.
Fixes #10952.
|
|
|
|
Preparation for using Error.h from Vector.h. This required moving some
things out of line.
|
|
|
|
Doing so would cause patterns like `(a|)` to not match the empty string.
|
|
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
|
|
Using StringView instead of C strings is basically always preferable.
The only reason to use a C string is because you are calling a C API.
|
|
Otherwise the fork in patterns like `(1+)\1` would be (incorrectly)
optimized away.
|
|
This currently tries to convert forking loops to atomic groups, and
unify the left side of alternations.
|
|
Otherwise the left and right capture instructions wouldn't point to the
same capture group if there was another nested group there.
|
|
|
|
This makes (addmittedly weird) patterns like `(a*)*` work correctly
without going into an infinite fork loop.
|
|
Using a file(GLOB) to find all the test files in a directory is an easy
hack to get things started, but has some drawbacks. Namely, if you add
a test, it won't be found again without re-running CMake. `ninja` seems
to do this automatically, but it would be nice to one day stop seeing it
rechecking our globbed directories.
|
|
That check was rather pointless as the input is a StringView which knows
its own bounds.
Fixes #9686.
|
|
For example, consider the following pattern:
new RegExp('\ud834\udf06', 'u')
With this pattern, the regex parser should insert the UTF-8 encoded
bytes 0xf0, 0x9d, 0x8c, and 0x86. However, because these characters are
currently treated as normal char types, they have a negative value since
they are all > 0x7f. Then, due to sign extension, when these characters
are cast to u64, the sign bit is preserved. The result is that these
bytes are inserted as 0xfffffffffffffff0, 0xffffffffffffff9d, etc.
Fortunately, there are only a few places where we insert bytecode with
the raw characters. In these places, be sure to treat the bytes as u8
before they are cast to u64.
|
|
Unfortunately, this requires a slight divergence in the way the capture
group names are stored. Previously, the generated byte code would simply
store a view into the regex pattern string, so no string copying was
required.
Now, the escape sequences are decoded into a new string, and a vector
of all parsed capture group names are stored in a vector in the parser
result structure. The byte code then stores a view into the
corresponding string in that vector.
|