path: root/Userland/Libraries/LibWeb/HTML/Parser
Age | Commit message | Author
2022-03-28 | LibWeb: Load X(HT)ML documents and transform them into HTML DOM | Ali Mohammad Pur
2022-03-26 | LibWeb: Move HTML dimension value parsing from CSS to HTML namespace | Andreas Kling
These are part of HTML, not CSS, so let's not confuse things.
2022-03-24 | LibWeb: Rename PARSER_DEBUG => HTML_PARSER_DEBUG | Idan Horowitz
Since this macro was created we gained a couple more parsers in the system :^)
2022-03-24 | LibWeb: Remove inheritance of FormAssociatedElement from HTMLElement | Timothy Flynn
HTMLObjectElement will need to be both a FormAssociatedElement and a BrowsingContextContainer. Currently, both of these classes inherit from HTMLElement. This can work in C++, but is generally frowned upon, and doesn't play particularly well with the rest of LibWeb.
Instead, we can essentially revert commit 3bb5c62 to remove HTMLElement from FormAssociatedElement's hierarchy. This means that objects such as HTMLObjectElement individually inherit from FormAssociatedElement and HTMLElement now.
Some caveats are:
* FormAssociatedElement still needs to know when the HTMLElement is inserted into and removed from the DOM. This hook is automatically injected via a macro now, while still allowing classes like HTMLInputElement to also know when the element is inserted.
* Casting from a DOM::Element to a FormAssociatedElement is now a sideways cast, rather than directly following an inheritance chain. This means static_cast cannot be used here; but we can safely use dynamic_cast since the only 2 instances of this already use RTTI to verify the cast.
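The sideways cast described above can be illustrated with a minimal sketch. The class names mirror the ones in the commit message, but the bodies are hypothetical stand-ins, not the LibWeb definitions, and the helper `as_form_associated` is invented for illustration.

```cpp
// Hypothetical stand-ins for the LibWeb classes named above.
struct Element {
    virtual ~Element() = default; // polymorphic base, required for dynamic_cast
};

struct HTMLElement : Element { };

// In this sketch FormAssociatedElement is unrelated to Element/HTMLElement.
struct FormAssociatedElement {
    virtual ~FormAssociatedElement() = default;
};

// The concrete element inherits from both bases individually.
struct HTMLObjectElement : HTMLElement, FormAssociatedElement { };

// Element and FormAssociatedElement do not share an inheritance chain, so
// static_cast<FormAssociatedElement*>(element) would be rejected by the
// compiler. dynamic_cast checks the dynamic type at run time and performs
// the cross-cast, yielding nullptr if the object is not form-associated.
FormAssociatedElement* as_form_associated(Element* element)
{
    return dynamic_cast<FormAssociatedElement*>(element);
}
```

Calling `as_form_associated` on an `HTMLObjectElement` yields a pointer to its `FormAssociatedElement` subobject, while a plain `HTMLElement` yields `nullptr`.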
2022-03-21 | LibTextCodec: Don't allocate Strings on encoding normalisation | Hendiadyoin1
This ripples down to LibWeb's HTML and XHR decoders, which therefore become less allocation heavy.
2022-03-21 | LibWeb: Implement "has element in select scope" per-spec | Simon Wanner
The HTML Specification is quite tricky in this case. Usually "have a particular element in <x> scope" mentions "consisting of the following element types:", but in this case it's "consisting of all element types except the following:". Thanks to @AtkinsSJ for spotting this difference.
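A minimal sketch of the inverted scope check, assuming the stack of open elements is modelled as a vector of tag names (the real parser stores element objects) and that the excluded types are `optgroup` and `option`, as I read the spec. The function name is illustrative only.

```cpp
#include <string>
#include <vector>

// The stack of open elements, as tag names; back() is the current node.
// Walks from the current node upwards. For select scope, every element type
// EXCEPT optgroup and option terminates the search with a failure.
bool has_element_in_select_scope(std::vector<std::string> const& stack, std::string const& target)
{
    for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
        if (*it == target)
            return true; // match state
        if (*it != "optgroup" && *it != "option")
            return false; // any other element type ends the scope
    }
    return false; // empty stack: nothing is in scope
}
```

With this inversion, an intervening `div` hides a `select` further up the stack, whereas intervening `optgroup` and `option` elements do not.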
2022-03-20 | LibWeb: Implement the rest of the Adoption Agency Algorithm | Simon Wanner
This gets us 2 points on html5test.com :^)
- Before: https://html5te.st/4cf57659bc08272e (208)
- After: https://html5te.st/fb8a9259bda1c115 (210)
2022-03-19 | LibWeb: Only delay "load" event for script elements that load something | Andreas Kling
We shouldn't delay the load event for scripts that we're completely refusing to run anyway. Also, for scripts that have inline text content, we don't need to delay them either, as they will become ready before returning from "prepare script". This makes the "load" event finally fire on lots of websites, including Wikipedia. :^)
2022-03-19 | LibWeb: Don't delay document "load" event for unclosed script tags | Andreas Kling
We previously had a bug where markup with unclosed script tags caused the document load event to be delayed indefinitely. Fix this by only marking script elements as delaying the load event once we encounter the script end tag.
2022-03-17 | Libraries: Use default constructors/destructors in LibWeb | Lenny Maiorani
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#cother-other-default-operation-rules "The compiler is more likely to get the default semantics right and you cannot implement these functions better than the compiler."
2022-03-14 | LibWeb: Use inline script tag source line as javascript line offset | Idan Horowitz
This makes JS exception line numbers meaningful for inline script tags.
2022-03-08 | LibWeb: Move Window from DOM directory & namespace to HTML | Linus Groh
The Window object is part of the HTML spec. :^) https://html.spec.whatwg.org/multipage/window-object.html
2022-03-02 | LibWeb: Fix issue where double-quoted doctype system ID was not captured | Andreas Kling
We were storing double-quoted system IDs in the public ID field. 1% progression on ACID3. :^)
2022-03-01 | LibWeb: Associate form elements with a form in parsing and dynamically | Luke Wilde
This makes it available for all form associated elements and not just select and input elements. It also makes it more spec compliant, especially around the form attribute. The main thing missing is re-associating form elements with a form attribute when the form attribute changes or an element with an ID is inserted/removed or has its ID changed.
2022-02-21 | LibWeb: Make document.write() work while document is parsing | Andreas Kling
This necessitated making HTMLParser ref-counted, and having it register itself with Document when created. That makes it possible for scripts to add new input at the current parser insertion point. There is now a reference cycle between Document and HTMLParser. This cycle is explicitly broken by calling Document::detach_parser() at the end of HTMLParser::run(). This is a huge progression on ACID3, from 31% to 49%! :^)
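The reference cycle and its explicit break can be sketched with `std::shared_ptr` standing in for LibWeb's own ref-counting types. The struct names and members below are hypothetical; only `detach_parser()` and `run()` echo names from the commit message.

```cpp
#include <memory>
#include <utility>

struct ParserSketch;

struct DocumentSketch {
    std::shared_ptr<ParserSketch> parser; // strong reference: document -> parser

    // Dropping this reference is what breaks the cycle.
    void detach_parser() { parser.reset(); }
};

struct ParserSketch {
    std::shared_ptr<DocumentSketch> document; // strong reference: parser -> document

    explicit ParserSketch(std::shared_ptr<DocumentSketch> doc)
        : document(std::move(doc))
    {
    }

    // The caller is assumed to hold its own reference to the parser for the
    // duration of run(), so detaching cannot destroy `this` mid-call.
    void run()
    {
        // ... tokenization and tree construction would happen here ...
        document->detach_parser();
    }
};
```

Without the `detach_parser()` call, the two objects would keep each other alive after the last outside reference is dropped, because neither reference count could reach zero.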
2022-02-21 | LibWeb: Add basic support for dynamic markup insertion | Lorenz Steinert
This implements basic support for dynamic markup insertion, adding:
* Document::open()
* Document::write(Vector<String> const&)
* Document::writeln(Vector<String> const&)
* Document::close()
The HTMLParser is modified to make it possible to create a script-created parser which initially only contains an HTMLTokenizer without any data. Additionally, the HTMLParser::run method gains an overload which does not modify the Document and does not run HTMLParser::the_end(), so that we can reenter the parser at a later time. Furthermore, all FIXMEs that concern the insertion point, which is defined in the HTMLTokenizer, are implemented. Additionally, the following member variables of the HTMLParser are now exposed by getter functions:
* m_tokenizer
* m_aborted
* m_script_nesting_level
The HTMLTokenizer is modified so that it contains an insertion point which keeps track of where the next input from the Document::write functions will be inserted. The insertion point is implemented as the character offset into m_decoded_input and a boolean describing if the insertion point is defined. Functions to update, check and {re}store the insertion point are also added. The function HTMLTokenizer::insert_eof is added to tell a script-created parser that document::close was called and HTMLParser::the_end() should be called. Lastly, an explicit default constructor is added to HTMLTokenizer to create an empty HTMLTokenizer into which data can be inserted.
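The insertion point described above, an offset plus a "defined" flag, could look roughly like the following sketch. All names here are hypothetical, and the sketch counts bytes in a `std::string` rather than code points, which the real tokenizer may not do.

```cpp
#include <cstddef>
#include <string>

struct TokenizerInputSketch {
    std::string decoded_input;
    std::size_t position = 0; // hypothetical: offset of the next character to tokenize

    struct InsertionPoint {
        std::size_t offset = 0;
        bool defined = false;
    } insertion_point;

    // Place the insertion point at the current tokenizer position.
    void update_insertion_point()
    {
        insertion_point.offset = position;
        insertion_point.defined = true;
    }

    void undefine_insertion_point() { insertion_point.defined = false; }

    // Text written by a script is spliced in at the insertion point, which
    // then moves past the new text so later writes land after earlier ones.
    bool insert_at_insertion_point(std::string const& text)
    {
        if (!insertion_point.defined)
            return false;
        decoded_input.insert(insertion_point.offset, text);
        insertion_point.offset += text.size();
        return true;
    }
};
```

Keeping the flag separate from the offset lets the tokenizer distinguish "insert at offset 0" from "no insertion point at all".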
2022-02-21 | LibWeb: Fix 'Comment end state' in HTML Tokenizer | Adam Hodgen
Also, update the expected hash in the LibWeb TestHTMLTokenizer regression test. This is due to the "This comment has a few too many dashes." comment token being updated.
2022-02-21 | LibWeb: Implement tokenization newline preprocessing | Adam Hodgen
Newline normalization will replace \r and \r\n with \n. The spec specifically states:
> Before the tokenization stage, the input stream must be preprocessed
> by normalizing newlines.
whereas this implementation performs the normalization during tokenization itself. This should still exhibit the same behaviour, while keeping the tokenization logic in the same place.
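The normalization rule itself is small enough to sketch as a standalone function. This is an illustration of the rule on a whole string, not the in-tokenizer implementation the commit describes; the function name is made up.

```cpp
#include <cstddef>
#include <string>

// Replaces every \r\n pair and every lone \r with a single \n.
std::string normalize_newlines(std::string const& input)
{
    std::string output;
    output.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (input[i] == '\r') {
            output.push_back('\n');
            // Swallow the \n of a \r\n pair so it is not emitted twice.
            if (i + 1 < input.size() && input[i + 1] == '\n')
                ++i;
        } else {
            output.push_back(input[i]);
        }
    }
    return output;
}
```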
2022-02-21 | LibWeb: Fix off by one error in HTML Tokenizer | Adam Hodgen
In 'NamedCharacterReference' we attempt to look up the code point by an identifier, e.g. apos; becomes '. This is done by passing the entire rest of the document to the `HTML::code_points_from_entity` function. However, before this change we didn't send the final character, which meant that if the document ended in a named character reference the lookup would fail.
2022-02-20 | LibWeb: Handle markers when reconstructing active formatting elements | Luke Wilde
The entry we get from the active formatting elements list during the Rewind step of "reconstruct the active formatting elements" can be a marker. Previously we assumed it was not a marker, which can trigger an assertion failure with certain malformed HTML. If the entry in this step is a marker, the spec simply ignores it. This is step 6 of the algorithm. This also makes the index unsigned, as this algorithm is a no-op if the list is empty. Additionally, this also adds spec comments to this algorithm. Fixes #12668.
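The Rewind and Advance steps can be sketched as an index search, assuming each list entry is reduced to two flags. The `Entry` type and function name are hypothetical, and the step structure follows my reading of the spec rather than the LibWeb code.

```cpp
#include <cstddef>
#include <vector>

struct Entry {
    bool is_marker; // a scope marker rather than an element
    bool is_open;   // the element is also on the stack of open elements
};

// Returns the index of the first entry that would need to be recreated,
// or list.size() if nothing needs reconstructing.
std::size_t first_entry_to_reconstruct(std::vector<Entry> const& list)
{
    if (list.empty())
        return 0; // equals list.size(): no-op, so an unsigned index never underflows
    std::size_t index = list.size() - 1;
    if (list[index].is_marker || list[index].is_open)
        return list.size();
    // Rewind: step back until we hit a marker, an open element, or the start.
    while (index > 0) {
        --index;
        if (list[index].is_marker || list[index].is_open) {
            ++index; // Advance: resume from the entry after the stopping point
            break;
        }
    }
    return index;
}
```

The marker check inside the loop is the case the commit addresses: a marker stops the rewind instead of being treated as an element.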
2022-02-19 | LibWeb: Use Vector::clear_with_capacity() in HTMLTokenizer | Andreas Kling
This avoids constantly reallocating the Vector<HTMLToken>.
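The idea of emptying a container while keeping its storage can be shown with `std::vector` as a stand-in for AK::Vector, which is an assumption of this sketch. `std::vector::clear()` removes the elements without releasing the allocation. The `TokenQueueSketch` type is hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct TokenQueueSketch {
    std::vector<int> tokens; // int stands in for HTMLToken in this sketch

    // Hands back how many tokens were pending and empties the queue.
    // The allocated storage is retained, so refilling the queue up to the
    // same size does not need a fresh allocation.
    std::size_t drain()
    {
        std::size_t count = tokens.size();
        tokens.clear();
        return count;
    }
};
```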
2022-02-15 | LibWeb: Fail gracefully when reaching the unimplemented part of the AAA | Linus Groh
Pages such as https://html5test.com are testing all sorts of weird, incomplete, and wrong HTML but can be useful or at least interesting for development - let's try to avoid crashing the process.
2022-02-15 | LibWeb: Implement state switch for "[CDATA[" in HTML parser | Linus Groh
2022-02-15 | LibWeb: Add an optional pointer to an HTMLParser to the HTMLTokenizer | Linus Groh
This is needed to access the 'adjusted current node' in the 'Markup declaration open state'. We don't want to create a full parser for something like syntax highlighting, so it's optional (null) by default.
2022-02-15 | LibWeb: Remove unused HTMLParser function declaration | Linus Groh
There is no implementation of this function: HTMLParser::stack_of_open_elements_has_element_with_tag_name_in_scope
2022-02-15 | LibWeb: Add spec links to each HTML tokenizer state section | Linus Groh
I didn't add full spec comments this time, but this is better than nothing :^)
2022-02-15 | LibWeb: Add spec comments to the StackOfOpenElements class | Andreas Kling
2022-02-15 | LibWeb: Rename element_before() => element_immediately_above() | Andreas Kling
This matches the spec terminology around the "stack of open elements".
2022-02-15 | LibWeb: Add spec comments to find_appropriate_place_for_inserting_node() | Andreas Kling
2022-02-14 | LibWeb: Don't emit current token on EOF in HTML Tokenizer | Karol Kosek
Emitting tokens on EOF caused an infinite loop, freezing the app, which could be a bit annoying when writing an HTML comment at the end of the file in Text Editor. :^)
2022-02-14 | LibWeb: Fix highlighting HTML comments | Karol Kosek
Commit b193351a99 caused the HTML comments to flash when changing the text cursor. Also, when double-clicking on a comment, the selection started from the beginning of the file instead. The following message was displaying when `TOKENIZER_TRACE_DEBUG` was enabled: (Tokenizer::nth_last_position) Invalid position requested: 4th-last of 4. Returning (0-0). Changing the `nth_last_position` to 3 fixes this. I'm guessing that's because the parser is at that moment on the second hyphen of the `<!--` string, so it has to go back only by three characters.
2022-02-13 | LibWeb: Fix off-by-one in HTMLTokenizer::restore_to() | MacDue
The difference should be between m_utf8_iterator and the new position; if m_prev_utf8_iterator is used, one fewer source position is popped than required. This issue was not apparent on most pages since restore_to is used for tokens such as <!doctype> that are normally followed by a newline that resets the column to zero, but it can be seen on pages with minified HTML.
2022-02-08 | LibWeb: Introduce the Environment Settings Object | Luke Wilde
The environment settings object is effectively the context a piece of script is running under, for example, it contains the origin, responsible document, realm, global object and event loop for the current context. This effectively replaces ScriptExecutionContext, but it cannot be removed in this commit as EventTarget still depends on it. https://html.spec.whatwg.org/multipage/webappapis.html#environment-settings-object
2021-12-10 | LibWeb: Fix off-by-one error when highlighting unquoted HTML attributes | Sam Atkins
This fixes #11166
2021-12-05 | LibWeb: Cast unused smart-pointer return values to void | Sam Atkins
2021-11-11 | Everywhere: Pass AK::StringView by value | Andreas Kling
2021-10-17 | LibWeb: Implement Attribute closer to the spec and with an IDL file | Timothy Flynn
Note our Attribute class is what the spec refers to as just "Attr". The main differences between the existing implementation and the spec are just that the spec defines more fields. Attributes can contain namespace URIs and prefixes. However, note that these are not parsed in HTML documents unless the document content-type is XML. So for now, these are initialized to null. Web pages are able to set the namespace via JavaScript (setAttributeNS), so these fields may be filled in when the corresponding APIs are implemented. The main change to be aware of is that an attribute is a node. This has implications on how attributes are stored in the Element class. Nodes are non-copyable and non-movable because these constructors are deleted by the EventTarget base class. This means attributes cannot be stored in a Vector or HashMap as these containers assume copyability / movability. So for now, the Vector holding attributes is changed to hold RefPtrs to attributes instead. This might change when attribute storage is implemented according to the spec (by way of NamedNodeMap).
2021-10-10 | LibWeb: Remove dead "outer loop" code in adoption agency algorithm | Brian Gianforcaro
2021-10-01 | LibWeb: Check for HTML integration points in the tree constructor | Luke Wilde
This particularly implements these two points: - "If the adjusted current node is an HTML integration point and the token is a start tag" - "If the adjusted current node is an HTML integration point and the token is a character token" This also adds spec comments to the tree constructor.
2021-09-26 | LibWeb: Add the PageTransitionEvent interface and fire "pageshow" events | Andreas Kling
We now fire "pageshow" events at the appropriate time during document loading (done by the parser.) Note that there are no corresponding "pagehide" events yet.
2021-09-26 | LibWeb: Add a "page showing" flag to documents | Andreas Kling
This will be used to determine whether "pageshow" and "pagehide" events are appropriate. We won't actually make use of it until we implement more of history traversal and document unloading.
2021-09-26 | LibWeb: Implement "update the current document readiness" from spec | Andreas Kling
The only difference from what we were already doing is that setting the same ready state twice no longer fires a "readystatechange" event. I don't think that could happen in practice though.
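The behavioural difference mentioned above, no event when the state does not change, can be sketched as follows. The type, member, and counter names are hypothetical, and the counter stands in for actually dispatching a "readystatechange" event.

```cpp
enum class ReadyStateSketch {
    Loading,
    Interactive,
    Complete,
};

struct ReadinessSketch {
    ReadyStateSketch ready_state = ReadyStateSketch::Loading;
    int readystatechange_events_fired = 0; // stand-in for event dispatch

    void update_readiness(ReadyStateSketch new_state)
    {
        // Setting the same state again is a no-op and fires nothing.
        if (ready_state == new_state)
            return;
        ready_state = new_state;
        ++readystatechange_events_fired;
    }
};
```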
2021-09-26 | LibWeb: Store HTML document ready state as an enum | Andreas Kling
2021-09-26 | LibWeb: Allow HTML parser to delay delivery of the document "load" event | Andreas Kling
We will now spin in "the end" until there are no more "things delaying the load event". Of course, nothing actually uses this yet, and there are a lot of things that need to.
2021-09-26 | LibWeb: Implement more of HTMLParser::the_end() and bring closer to spec | Andreas Kling
2021-09-26 | LibWeb: Split out "The end" from the HTML parsing spec to a function | Andreas Kling
Also add a spec link and some comments.
2021-09-25 | LibWeb: Rename HTMLDocumentParser => HTMLParser | Andreas Kling
2021-09-21 | Libraries: Use AK::Variant default initialization where appropriate | Ben Wiederhake
2021-09-20 | LibWeb: Make <script src> loads partially async (by following the spec) | Andreas Kling
Instead of firing up a network request and synchronously blocking for it to finish via a nested event loop, we now start an asynchronous request when encountering <script src>. Once the script load finishes (or fails), it gets executed at one of the synchronization points in the HTML parser. This solves some long-standing issues with random unexpected events getting dispatched in the middle of parsing.
2021-09-20 | LibWeb: Pop entire stack of open elements at the end of parsing | Andreas Kling