path: root/Userland/Libraries/LibPDF/Parser.cpp
Age | Commit message | Author
2022-10-16 | LibPDF: Allow text operator sequences to start with whitespace | Julian Offenhäuser
2022-09-17 | LibPDF: Allow whitespace other than EOL after an object marker | Julian Offenhäuser
2022-09-17 | LibPDF: Disallow parsing indirect values as operands | Julian Offenhäuser
An operation like 0 0 0 RG would have been confused for [ 0, 0 0 R ] G
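The ambiguity can be reproduced with a toy model. The sketch below is not LibPDF's code: `parse_operation`, `Operation` and the `allow_indirect_refs` flag are hypothetical names, and operands are kept as plain strings. It assumes a generic value parser that, after two integers, treats a following `R` character as the end of an indirect reference.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct Operation {
    std::vector<std::string> operands;
    std::string op;
};

static bool is_integer(std::string const& token)
{
    if (token.empty())
        return false;
    for (unsigned char c : token) {
        if (!std::isdigit(c))
            return false;
    }
    return true;
}

// Splits the input on whitespace, then collects integer operands until the
// first non-integer token, which becomes the operator name.
// With allow_indirect_refs set, "a b R..." is greedily folded into a single
// reference operand, which is the behaviour the commit disables for operands.
Operation parse_operation(std::string const& input, bool allow_indirect_refs)
{
    std::istringstream stream(input);
    std::vector<std::string> tokens;
    for (std::string token; stream >> token;)
        tokens.push_back(token);

    Operation result;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        bool is_ref = allow_indirect_refs && i + 2 < tokens.size()
            && is_integer(tokens[i]) && is_integer(tokens[i + 1])
            && tokens[i + 2][0] == 'R';
        if (is_ref) {
            result.operands.push_back(tokens[i] + " " + tokens[i + 1] + " R");
            // Whatever follows the 'R' is left over as the operator name.
            result.op = tokens[i + 2].substr(1);
            return result;
        }
        if (!is_integer(tokens[i])) {
            result.op = tokens[i];
            return result;
        }
        result.operands.push_back(tokens[i]);
    }
    return result;
}
```

In this model, `parse_operation("0 0 0 RG", true)` yields the operands `0` and `0 0 R` with operator `G`, while passing `false` yields three integer operands and the operator `RG`.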
2022-09-17 | LibPDF: Move document-specific parsing functionality into its own class | Julian Offenhäuser
The Parser class is now a generic PDF object parser, from which the new DocumentParser class derives. DocumentParser now takes over all functions relating to linearization, pages, xref and trailer handling. This allows the use of multiple parsers in the same document's context, which will be needed in order to handle PDF object streams.
2022-09-17 | LibPDF: Move consume and match helper functions to the Reader class | Julian Offenhäuser
2022-07-12 | Everywhere: Add sv suffix to strings relying on StringView(char const*) | sin-ack
Each of these strings would previously rely on StringView's char const* constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.
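The same distinction exists in the C++ standard library, which makes for a rough analogy. The sketch below uses `std::string_view` and the standard `""sv` literal, not AK's `StringView`, and the function names are made up for illustration. The pointer constructor has to scan for the terminating NUL, whereas the literal operator receives the length from the compiler.

```cpp
#include <cstddef>
#include <string_view>

using namespace std::string_view_literals;

// Length is found by scanning for the terminating NUL, like strlen().
std::size_t length_via_pointer()
{
    return std::string_view("trailer").size();
}

// Length is passed to the literal operator by the compiler.
std::size_t length_via_literal()
{
    return "trailer"sv.size();
}

// The two only disagree when the literal contains an embedded NUL:
// the pointer constructor stops at it, the literal operator does not.
std::size_t pointer_length_with_embedded_nul()
{
    return std::string_view("ab\0cd").size();
}

std::size_t literal_length_with_embedded_nul()
{
    return "ab\0cd"sv.size();
}

static_assert("trailer"sv.size() == 7, "length is known at compile time");
```

For ordinary strings without embedded NULs both forms give the same view, which matches the commit's "no functional changes" note.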
2022-04-01 | Everywhere: Run clang-format | Idan Horowitz
2022-03-31 | LibPDF: Rename Command to Operator | Matthew Olsson
This is the correct name, according to the spec
2022-03-29 | LibPDF: Attempt to unecrypt strings and streams | Matthew Olsson
2022-03-29 | LibPDF: Require Document* in Parser constructor | Matthew Olsson
This makes it a bit easier to avoid calling parser->set_document, an issue which cost me ~30 minutes to find.
2022-03-29 | LibPDF: Keep track of the current object index/generation while Parsing | Matthew Olsson
This information is required to decrypt encrypted strings/streams.
2022-03-29 | LibPDF: Get rid of PlainText/Encoded StreamObject | Matthew Olsson
This was a small optimization to allow a stream object to simply hold a reference to the bytes in a PDF document rather than duplicating them. However, as we move into features such as encryption, this optimization does more harm than good. This can be revisited in the future if necessary.
2022-03-07 | LibPDF: Allow newlines between xref table and "trailer" keyword | Matthew Olsson
2022-03-07 | LibPDF: Fix bad hex string parsing logic | Matthew Olsson
2022-03-07 | LibPDF: Remove useless hex string substring call | Matthew Olsson
2022-03-07 | LibPDF: Propagate errors in Parser and Document | Matthew Olsson
2022-03-07 | LibPDF: Remove unused function in Parser | Matthew Olsson
2022-01-24 | LibPDF: Make Filter::decode() return ErrorOr | Sam Atkins
2022-01-24 | Everywhere: Convert ByteBuffer factory methods from Optional -> ErrorOr | Sam Atkins
Apologies for the enormous commit, but I don't see a way to split this up nicely. In the vast majority of cases it's a simple change. A few extra places can use TRY instead of manual error checking though. :^)
2022-01-08 | LibPDF: Convert `PDF::Parser::m_document` from `RefPtr` to `WeakPtr` | Simon Woertz
Otherwise both `PDF::Document` and `PDF::Parser` have a `RefPtr` pointing to each other which leads to a memory leak due to a circular dependency.
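A minimal sketch of the same fix using the standard library's `std::shared_ptr` and `std::weak_ptr` as stand-ins for AK's `RefPtr` and `WeakPtr`. The `Document` and `Parser` structs here are simplified placeholders, not the LibPDF classes, and the destructor counter exists only to make the effect observable.

```cpp
#include <memory>

int g_destroyed_documents = 0;

struct Document;

struct Parser {
    // Non-owning back-reference: does not keep the Document alive.
    // If this were a std::shared_ptr, Document and Parser would own each
    // other and neither reference count could ever reach zero.
    std::weak_ptr<Document> document;
};

struct Document {
    std::shared_ptr<Parser> parser; // owning reference
    ~Document() { ++g_destroyed_documents; }
};

// Builds a Document/Parser pair, drops the last external reference, and
// reports how many Documents were destroyed as a result.
int destroyed_documents_after_scope()
{
    g_destroyed_documents = 0;
    {
        auto document = std::make_shared<Document>();
        document->parser = std::make_shared<Parser>();
        document->parser->document = document;
    }
    return g_destroyed_documents;
}
```

With the weak back-reference, `destroyed_documents_after_scope()` returns 1. Holding the back-reference as an owning pointer would leave the pair alive after the scope ends.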
2021-11-17 | AK: Convert AK::Format formatting helpers to returning ErrorOr<void> | Andreas Kling
This isn't a complete conversion to ErrorOr<void>, but a good chunk. The end goal here is to propagate buffer allocation failures to the caller, and allow the use of TRY() with formatting functions.
2021-11-16 | LibPDF: Check if there is data left before consuming | Simon Woertz
Add a check to `Parser::consume_eol` to ensure that there is more data to read before actually consuming any data. Not checking if there is data left leads to failing an assertion in case of e.g., a truncated pdf file.
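The guard can be sketched as follows. This `Reader` is a hypothetical stand-in built on `std::string_view`, not the LibPDF class, and it reports failure through a `bool` instead of asserting. The point is only that every byte access is preceded by a check for remaining input.

```cpp
#include <cstddef>
#include <string_view>

class Reader {
public:
    explicit Reader(std::string_view bytes)
        : m_bytes(bytes)
    {
    }

    bool done() const { return m_offset >= m_bytes.size(); }
    std::size_t offset() const { return m_offset; }

    // Consumes "\n", "\r" or "\r\n" if one is next in the input.
    // Returns false and consumes nothing at end of input or on other bytes.
    bool consume_eol()
    {
        if (done())
            return false; // truncated input: nothing left to read
        if (m_bytes[m_offset] == '\n') {
            ++m_offset;
            return true;
        }
        if (m_bytes[m_offset] == '\r') {
            ++m_offset;
            // Check again before peeking at a possible "\n" after "\r".
            if (!done() && m_bytes[m_offset] == '\n')
                ++m_offset;
            return true;
        }
        return false;
    }

private:
    std::string_view m_bytes;
    std::size_t m_offset { 0 };
};
```

On an empty or truncated buffer, `consume_eol()` simply returns `false` rather than reading past the end.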
2021-11-11 | Everywhere: Pass AK::ReadonlyBytes by value | Andreas Kling
2021-11-10 | AK: Make ByteBuffer::try_* functions return ErrorOr<void> | Andreas Kling
Same as Vector, ByteBuffer now also signals allocation failure by returning an ENOMEM Error instead of a bool, allowing us to use the TRY() and MUST() patterns.
2021-10-30 | LibPDF: Parser::parse_header() return false if remaining bytes is zero | Brendan Coles
2021-09-20 | LibPDF: Replace Value class by AK::Variant | Ben Wiederhake
This decreases the memory consumption by LibPDF by 4 bytes per Value, compensating exactly for the increase in an earlier commit. :^)
2021-09-20 | LibPDF: Extract reference bitpacking into dedicated class | Ben Wiederhake
2021-09-20 | LibPDF: Move inline function definition | Ben Wiederhake
This breaks the dependency cycle between Parser and Document.
2021-09-06 | Everywhere: Make ByteBuffer::{create_*,copy}() OOM-safe | Ali Mohammad Pur
2021-09-06 | Everywhere: Use OOM-safe ByteBuffer APIs where possible | Ali Mohammad Pur
If we can easily communicate failure, let's avoid asserting and report failure instead.
2021-07-19 | Everywhere: Use AK/Math.h if applicable | Hendiadyoin1
AK's version should see better inlining behavior than the LibM one. We avoid mixed usage for now though. Also clean up some stale math includes and improper floating-point usage.
2021-07-16 | LibPDF: Fix treating not finding the linearized dict as a fatal error | Wesley Moret
We now try to parse the first indirect value and see if it's the `Linearization Parameter Dictionary`. If it's not, we fall back to reading the xref table from the end of the document.
2021-07-16 | LibPDF: Fix checking `minor_ver` instead of `major_ver` | Wesley Moret
2021-06-12 | LibPDF: Convert to east-const to comply with the recent style changes | Matthew Olsson
2021-06-12 | LibPDF: Parse hint tables | Matthew Olsson
This code isn't _actually_ used as of right now, but I wrote it at the same time as all of the code in the previous commit. I realized after I wrote it that these hint tables aren't super useful if the parser already has access to the full file. However, this will be useful if we ever want to stream PDFs from the web (and possibly view them in the browser).
2021-06-12 | LibPDF: Parse linearized PDF files | Matthew Olsson
This is a big step, as most PDFs which are downloaded online will be linearized. Pretty much the only difference is that the xref structure is slightly different.
2021-06-12 | LibPDF: Fix two parser bugs | Matthew Olsson
- A newline was assumed to follow the "stream" keyword, when it can also be a Windows-style line break
- Fix not consuming the "endobj" at the end of every indirect object
2021-06-12 | LibPDF: Refine the distinction between the Document and Parser | Matthew Olsson
The Parser should hold information relevant for parsing, whereas the Document should hold information relevant for displaying pages. With this in mind, there is no reason for the Document to hold the xref table and trailer. These objects have been moved to the Parser, which allows the Parser to expose fewer public methods (which will be even more evident once linearized PDFs are supported).
2021-06-12 | LibPDF: Harden the document/parser against errors | Matthew Olsson
2021-06-12 | LibPDF: Differentiate Value's null and empty states | Matthew Olsson
2021-06-06 | AK+Everywhere: Disallow constructing Functions from incompatible types | Ali Mohammad Pur
Previously, AK::Function would accept _any_ callable type, and try to call it when called, first with the given set of arguments, then with zero arguments, and if all of those failed, it would simply not call the function and **return a value-constructed Out type**. This led to many, many, many hard-to-debug situations when someone forgot a `const` in their lambda argument types, and many cases of people taking zero arguments in their lambdas to ignore them. This commit reworks the Function interface to not include any such surprising behaviour: if your function instance is not callable with the declared argument set of the Function, it simply cannot be assigned to that Function instance, end of story.
2021-05-25 | LibPDF: Pre-initialize common FlyStrings in CommonNames.h | Matthew Olsson
2021-05-25 | LibPDF: Handle string encodings | Matthew Olsson
Strings can be encoded in either UTF16-BE or UTF8. In either case, there are a few initial bytes which specify the encoding that must be checked and also removed from the final string.
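The check described above might look roughly like this. The sketch assumes the "initial bytes" are the usual byte order marks (`FE FF` for UTF-16BE, `EF BB BF` for UTF-8). `strip_encoding_marker`, `StringEncoding` and `DetectedString` are hypothetical names, and decoding the payload is left out.

```cpp
#include <string_view>

enum class StringEncoding {
    Utf16BigEndian,
    Utf8,
    Other, // no recognized marker
};

struct DetectedString {
    StringEncoding encoding;
    std::string_view payload; // the input with any marker bytes removed
};

// Looks at the leading bytes, reports the encoding they announce, and
// returns a view of the remaining bytes without the marker.
DetectedString strip_encoding_marker(std::string_view bytes)
{
    constexpr std::string_view utf16be_marker("\xFE\xFF", 2);
    constexpr std::string_view utf8_marker("\xEF\xBB\xBF", 3);

    // substr() clamps the length, so short inputs are safe here.
    if (bytes.substr(0, utf16be_marker.size()) == utf16be_marker)
        return { StringEncoding::Utf16BigEndian, bytes.substr(utf16be_marker.size()) };
    if (bytes.substr(0, utf8_marker.size()) == utf8_marker)
        return { StringEncoding::Utf8, bytes.substr(utf8_marker.size()) };
    return { StringEncoding::Other, bytes };
}
```

A caller would then pick a decoder based on `encoding` and run it over `payload` only, so the marker never ends up in the final string.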
2021-05-25 | LibPDF: Store indirect value refs in Value objects | Matthew Olsson
IndirectValueRef is so simple that it can be stored directly in the Value class instead of being heap allocated. As the comment in Value says, however, in theory the max bits needed to store is 48 (16 for the generation index and 32(?) for the object index), but 32 should be good enough for now. We can increase it to u64 later if necessary.
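As an illustration of packing both numbers into 32 bits, here is a minimal sketch. The 18/14 bit split and the `Reference` class shape are assumptions made for the example, not taken from the commit. The asserts make the "good enough for now" limit explicit instead of silently truncating.

```cpp
#include <cassert>
#include <cstdint>

class Reference {
public:
    Reference(std::uint32_t index, std::uint32_t generation)
        : m_combined((index << generation_bits) | generation)
    {
        // Values that do not fit the assumed split would corrupt each other.
        assert(index < (1u << index_bits));
        assert(generation < (1u << generation_bits));
    }

    std::uint32_t index() const { return m_combined >> generation_bits; }
    std::uint32_t generation() const { return m_combined & ((1u << generation_bits) - 1); }

private:
    // Assumed split: low 14 bits for the generation, high 18 for the index.
    static constexpr unsigned generation_bits = 14;
    static constexpr unsigned index_bits = 32 - generation_bits;

    std::uint32_t m_combined;
};

static_assert(sizeof(Reference) == sizeof(std::uint32_t), "fits in one 32-bit word");
```

Widening `m_combined` to a 64-bit integer later, as the commit message suggests, would only require changing the storage type and the two bit-width constants.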
2021-05-25 | LibPDF: Add support for stream filters | Matthew Olsson
This commit also splits up StreamObject into PlainTextStreamObject and EncodedStreamObject, which is essentially just a stream object which does not own its bytes vs one which does.
2021-05-25 | LibPDF: Do not assume value is an object in parse_indirect_value | Matthew Olsson
2021-05-18 | LibPDF: Parse graphics commands | Matthew Olsson
2021-05-18 | LibPDF: Don't rely on a stream's /Length key existing | Matthew Olsson
Some PDFs omit this key apparently, but Firefox opens them fine. Let's emulate that behavior.
2021-05-10 | LibPDF: Parse nested Page Tree structures | Matthew Olsson
We now follow nested page tree nodes to find all of the actual page dicts, whereas previously we just assumed the root level page tree node contained all of the page children directly.
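The traversal amounts to a depth-first walk that descends into intermediate nodes and records leaves in document order. In the sketch below, `PageTreeNode` is a simplified in-memory stand-in (an assumption for illustration) for the PDF dictionaries, where intermediate nodes have `/Type /Pages` with a `/Kids` array and leaves have `/Type /Page`.

```cpp
#include <vector>

struct PageTreeNode {
    int object_index;
    bool is_page;                   // true for a leaf page, false for a page tree node
    std::vector<PageTreeNode> kids; // children of a page tree node
};

// Appends the object index of every leaf page below `node`, in order.
void collect_pages(PageTreeNode const& node, std::vector<int>& page_object_indices)
{
    if (node.is_page) {
        page_object_indices.push_back(node.object_index);
        return;
    }
    // Intermediate node: recurse instead of assuming the kids are pages.
    for (PageTreeNode const& kid : node.kids)
        collect_pages(kid, page_object_indices);
}
```

The earlier behaviour described in the commit corresponds to looping over the root's kids once and treating each as a page, which misses any pages nested one level deeper.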
2021-05-10 | LibPDF: Parse page structures | Matthew Olsson
This commit introduces the ability to parse the document catalog dict, as well as the page tree and individual pages. Pages obviously aren't fully parsed, as we won't care about most of the fields until we start actually rendering PDFs.

One of the primary benefits of the PDF format is laziness. PDFs are not meant to be parsed all at once, and the same is true for pages. When a Document is constructed, it builds a map of page number to object index, but it does not fetch and parse any of the pages. A page is only parsed when a caller requests that particular page (and is cached going forwards).

Additionally, this commit also adds an object_cast function which logs bad casts if DEBUG_PDF is set. Additionally, utility functions were added to ArrayObject and DictObject to get all types of objects from the collections to avoid having to manually cast.
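The lazy, cached lookup described above can be sketched like this. `LazyDocument`, `Page` and `parse_count` are hypothetical names, and "parsing" is reduced to building a small struct so that the caching behaviour is easy to observe.

```cpp
#include <map>
#include <utility>

struct Page {
    int object_index;
};

class LazyDocument {
public:
    // The page-number-to-object-index map is built up front; no page is parsed yet.
    explicit LazyDocument(std::map<int, int> page_object_indices)
        : m_page_object_indices(std::move(page_object_indices))
    {
    }

    // Parses the page on first request and serves later requests from the cache.
    // Throws std::out_of_range for an unknown page number (via map::at).
    Page const& get_page(int page_number)
    {
        auto cached = m_cache.find(page_number);
        if (cached != m_cache.end())
            return cached->second;

        int object_index = m_page_object_indices.at(page_number);
        ++m_parse_count; // stands in for the actual parsing work
        return m_cache.emplace(page_number, Page { object_index }).first->second;
    }

    int parse_count() const { return m_parse_count; }

private:
    std::map<int, int> m_page_object_indices;
    std::map<int, Page> m_cache;
    int m_parse_count { 0 };
};
```

Requesting the same page twice increments `parse_count()` only once, and pages that are never requested are never parsed.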