Age | Commit message (Collapse) | Author |
|
This was exposed to the user by mistake, and even accumulated a bunch of
users that didn't blow up out of sheer luck.
|
|
Fixes all remaining 'built-ins/RegExp/property-escapes' test262 tests.
|
|
|
|
Note that unlike binary properties and general categories, scripts must
be specified in the non-binary (Script=Value) form.
|
|
In non-Unicode mode, the existing MatchState::string_position is tracked
in code units; in Unicode mode, it is tracked in code points.
In order for some RegexStringView operations to be performant, it is
useful for the MatchState to have a field to always track the position
in code units. This will allow RegexStringView methods (e.g. operator[])
to perform lookups based on code unit offsets, rather than needing to
iterate over the entire string to find a code point offset.
|
|
The current method of iterating through the string to access a code
point hurts performance quite badly for very large strings. The test262
test "RegExp/property-escapes/generated/Any.js" previously took 3 hours
to complete; this one change brings it down to under 10 seconds.
|
|
These were previously generated as two instructions, Compare [Inverse]
and Compare [Property].
|
|
Before now, only binary properties could be parsed. Non-binary props are
of the form "Type=Value", where "Type" may be General_Category, Script,
or Script_Extension (or their aliases). Of these, LibUnicode currently
supports General_Category, so LibRegex can parse only that type.
|
|
This changes LibRegex to parse the property escape as a Variant of
Unicode Property & General Category values. A byte code instruction is
added to perform matching based on General Category values.
|
|
It was previously copying the entire vector every time, which is not a
nice thing to do. :^)
|
|
This makes it avoid the excessively high malloc() traffic.
|
|
This makes very fork-heavy expressions (like `(aa)*`) not run out of
stack space when matching very long strings.
|
|
|
|
This supports some binary property matching. It does not support any
properties not yet parsed by LibUnicode, nor does it support value
matching (such as Script_Extensions=Latin).
|
|
Adds a static method to parse a regex pattern and return the result, and
a constructor to accept a parse result. This is to allow LibJS to parse
the pattern string of a RegExpLiteral once and hand off regex objects
any number of times thereafter.
|
|
The Regex object created a copy of the pattern string anyways, so tweak
the constructor to allow callers to move() pattern strings into the
regex.
The Regex move constructor and assignment operator currently result in
memory corruption. The Regex object stores a Matcher object, which holds
a reference to the Regex object. So when the Regex object is moved, that
reference is no longer valid. To fix this, the reference stored in the
Matcher must be updated when the Regex is moved.
|
|
|
|
Otherwise we'd just loop trying to parse it over and over again, for
instance in `/a{/` or `/a{1,/`.
Unless we're parsing in Annex B mode, which allows `{` as a normal
ExtendedSourceCharacter.
|
|
Even though the contents are supposed to be reset, the type should stay
unchanged, as that's an assumption the engine is making.
|
|
When the Unicode flag is set, regular expressions may escape code points
by surrounding the hexadecimal code point with curly braces, e.g. \u{41}
is the character "A".
When the Unicode flag is not set, this should be considered a repetition
symbol - \u{41} is the character "u" repeated 41 times. This is left as
a TODO for now.
|
|
This will work for ASCII code points. Unicode case folding will be
needed for non-ASCII.
|
|
When the Unicode option is not set, regular expressions should match
based on code units; when it is set, they should match based on code
points. To do so, the regex parser must combine surrogate pairs when
the Unicode option is set. Further, RegexStringView needs to know if
the flag is set in order to return code point vs. code unit based
string lengths and substrings.
|
|
|
|
ECMA262 requires that the capture groups only contain the values from
the last iteration, e.g. `((c)(a)?(b))` should _not_ contain 'a' in the
second capture group when matching "cabcb".
|
|
We have a dedicated format specifier which adds the "0x" prefix, so
let's use that instead of adding it manually.
|
|
This commit makes LibRegex (mostly) capable of operating on any of
the three main string views:
- StringView for raw strings
- Utf8View for utf-8 encoded strings
- Utf32View for raw unicode strings
As a result, regexps with unicode strings should be able to properly
handle utf-8 and not stop in the middle of a code point.
A future commit will update LibJS to use the correct type of string
depending on the flags.
|
|
|
|
Otherwise the new debug line would be printed right after the previous
one without a newline.
|
|
|
|
Co-authored-by: Timothy Flynn <trflynn89@pm.me>
|
|
|
|
|
|
|
|
|
|
|
|
To comply with Dr.POSIX, this field has to be user-accessible.
|
|
Otherwise the users won't know how many capture groups are in the
parsed regular expression.
|
|
Commonly, bracket expressions are in fact, enclosed in brackets.
|
|
Fixes part of #8506.
|
|
This implements the internal regex stuff for #8506.
|
|
They were pretty confusing when compared with other non-transforming
functions.
|
|
If the sticky flag is set, the regex execution loop should break
immediately even if the execution was a failure. The specification for
several RegExp.prototype methods (e.g. exec and @@split) rely on this
behavior.
|
|
Fixes 1 test262 test.
|
|
For some reason the default move constructor and default move-assign
operator were deleted, so we explicitly default them instead.
|
|
Instead of doing a VERIFY(is<T>(x)) and *then* casting it to T, we can
just do the cast right away with verify_cast<T>. :^)
|
|
When REGEX_DEBUG is enabled, LibRegex dumps a table of information
regarding the state of the regex bytecode execution. The Compare opcode
manipulates state.string_position directly, so the string_position value
cannot be used to display where the comparison started; therefore, this
patch introduces a new variable to keep track of where we were before
the comparison happened.
|
|
A tiny typo was introduced in bc8d16ad which caused all case insensitive
comparisons to fail.
|
|
|
|
|
|
|