path: root/Userland/Libraries/LibPDF/Parser.cpp
Age | Commit message | Author
2023-05-19 | LibPDF: Avoid unnecessary HashMap copy, mark other copies | Ben Wiederhake
2023-02-19 | LibTextCodec+Everywhere: Port Decoders to new Strings | Sam Atkins
2023-02-15 | LibTextCodec+Everywhere: Make TextCodec::decoder_for() take a StringView | Sam Atkins
We don't need a full String/DeprecatedString inside this function, so we might as well not force users to create one.
2023-02-12 | LibPDF: Allow filter DecodeParms array entries to be null | Julian Offenhäuser
Filters will use the default values in this case.
2023-02-08 | LibPDF: Improve stream parsing | Rodrigo Tobar
When parsing streams we rely on a /Length item being defined in the stream's dictionary to know how much data comprises the stream. Its value is usually a direct value, but it can be indirect. There was, however, a contradiction in the code: the condition that allowed it to read and use the /Length value required it to be a direct value, but the actual code using the value would have worked with indirect ones. This meant that indirect /Length values triggered the fallback, "manual" stream parsing code. On the other hand, this latter code was also buggy, because it relied on the "endstream" keyword appearing on a separate line, which isn't always the case. This commit fixes the bug in the manual stream parsing scenario and also allows indirect /Length values to be used to parse streams more directly, avoiding the manual approach. The main caveat to this second change is that for a brief period of time the Document is not able to resolve references (i.e., before the xref table itself has been parsed). Any parsing happening before that (e.g., the linearization dictionary) must therefore use the manual stream parsing approach.
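The two paths described above can be sketched roughly as follows. This is a minimal illustration in standard C++, not the LibPDF code: `read_stream_body` is a hypothetical helper, `body` is assumed to start right after the "stream" keyword and its EOL, and `length` stands for an already resolved /Length value (or nothing, if references cannot be resolved yet).

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper, not part of LibPDF.
// `length` holds the resolved /Length value, or std::nullopt when it cannot be
// resolved yet (for example before the xref table has been parsed).
std::optional<std::string_view> read_stream_body(std::string_view body, std::optional<std::size_t> length)
{
    if (length.has_value()) {
        // Direct path: trust /Length, but never read past the end of the data.
        if (*length > body.size())
            return std::nullopt;
        return body.substr(0, *length);
    }

    // Manual fallback: look for the keyword anywhere, not only at the start of a line.
    std::size_t end = body.find("endstream");
    if (end == std::string_view::npos)
        return std::nullopt;

    // Drop a single EOL marker (LF, CR or CRLF) that may precede the keyword.
    if (end > 0 && body[end - 1] == '\n')
        --end;
    if (end > 0 && body[end - 1] == '\r')
        --end;
    return body.substr(0, end);
}
```

Note that the fallback in this sketch can be fooled by binary data that happens to contain the bytes "endstream", which is one reason to prefer /Length whenever it can be resolved.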
2023-01-09 | AK+Everywhere: Rename FlyString to DeprecatedFlyString | Timothy Flynn
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so let's rename it to A) match the name of DeprecatedString, and B) allow writing a new FlyString class that is tied to String.
2023-01-09 | LibPDF: Allow numbers to start with whitespace | Julian Offenhäuser
2022-12-16 | LibPDF: Add support for multi-line comments | Rodrigo Tobar
The code parsing comments parsed only a single line of comments, but callers assumed it parsed all comments that appeared contiguously in a block. The latter is an easier-to-understand API, so this commit changes the parse_comment function to parse entire blocks of comments instead of single lines.
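A block-oriented comment parser along these lines might look like the sketch below. The function name `parse_comment_block`, its signature, and the use of std::string are assumptions for illustration; the real parse_comment works on LibPDF's own reader.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: consumes every contiguous comment line starting at
// `offset` and returns their text, one comment per line.
std::string parse_comment_block(std::string_view input, std::size_t& offset)
{
    std::string comments;
    while (offset < input.size() && input[offset] == '%') {
        // A comment runs from '%' to the end of the current line.
        std::size_t line_end = input.find_first_of("\r\n", offset);
        if (line_end == std::string_view::npos)
            line_end = input.size();
        comments.append(input.substr(offset, line_end - offset));
        comments.push_back('\n');
        offset = line_end;

        // Skip the EOL and any whitespace, so a following '%' continues the block.
        while (offset < input.size()
            && (input[offset] == '\r' || input[offset] == '\n' || input[offset] == ' ' || input[offset] == '\t'))
            ++offset;
    }
    return comments;
}
```

After the call, `offset` points at the first byte that is neither comment nor whitespace, which is what callers of a block-level API would expect.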
2022-12-06 | Everywhere: Rename to_{string => deprecated_string}() where applicable | Linus Groh
This will make it easier to support both string types at the same time while we convert code, and to track down remaining uses. One big exception is Value::to_string() in LibJS, where the name is dictated by the ToString AO.
2022-12-06 | AK+Everywhere: Rename String to DeprecatedString | Linus Groh
We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)
2022-11-30 | LibPDF: Ignore whitespace on hex strings | Rodrigo Tobar
The spec says that whitespace should be ignored, but we weren't doing so. PDFs with whitespace in their hex strings were thus crashing the parser.
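As a rough illustration of whitespace-tolerant hex string decoding, here is a standalone sketch. `decode_hex_string` is a hypothetical helper, and std::isspace is only an approximation of the PDF whitespace set. Padding an odd trailing digit with 0 follows the PDF specification.

```cpp
#include <cctype>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical sketch: decodes a PDF hex string such as "<48 65 6C 6C 6F>".
std::optional<std::string> decode_hex_string(std::string_view input)
{
    if (input.size() < 2 || input.front() != '<' || input.back() != '>')
        return std::nullopt;
    input = input.substr(1, input.size() - 2);

    std::string out;
    int high = -1; // pending high nibble, or -1 if none
    for (unsigned char c : input) {
        if (std::isspace(c))
            continue; // whitespace between digits is ignored
        if (!std::isxdigit(c))
            return std::nullopt; // report an error instead of crashing
        int value = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
        if (high < 0) {
            high = value;
        } else {
            out.push_back(static_cast<char>(high * 16 + value));
            high = -1;
        }
    }
    // An odd number of digits means the last digit is followed by an implicit 0.
    if (high >= 0)
        out.push_back(static_cast<char>(high * 16));
    return out;
}
```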
2022-11-19 | LibPDF: Parse integer numbers with atoi() instead of strtof() | Julian Offenhäuser
strtof() produces rounding errors for very large numbers, which we don't want for integers, as they may have to be precise.
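The rounding problem is easy to reproduce outside of LibPDF. A 32-bit float has a 24-bit significand, so 16777217 (2^24 + 1) cannot be represented exactly. The two helper names below are made up for this comparison.

```cpp
#include <cstdlib>

// Hypothetical helpers that contrast the two parsing strategies.
int parse_int_via_float(char const* text)
{
    // Goes through a float first, so large values lose their low bits.
    return static_cast<int>(std::strtof(text, nullptr));
}

int parse_int_directly(char const* text)
{
    // Stays in integer arithmetic and keeps every digit (within int range).
    return std::atoi(text);
}
```

With the input "16777217", the float-based version yields 16777216 while the integer version keeps the exact value. atoi() has its own limitation: its behavior is undefined for values outside the range of int.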
2022-11-19 | LibPDF: Implement png predictor decoding for flate filter | Julian Offenhäuser
For flate and lzw filters, the data can be transformed by this predictor function to make it compress better. For us this means that we have to undo this step in order to get the right result. Although this feature is meant for images, I found at least a few documents that use it all over the place, making this step very important.
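Undoing the PNG predictor means reversing a per-row filter, where each row is prefixed with one byte naming the filter type (None, Sub, Up, Average, Paeth). The sketch below is a standalone illustration, not the LibPDF implementation: `undo_png_predictor` and its parameters `bytes_per_row` and `bytes_per_pixel` are assumed names, and the caller is assumed to have derived both values from the filter's decode parameters.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <utility>
#include <vector>

// Paeth predictor as defined by the PNG specification.
static std::uint8_t paeth(std::uint8_t left, std::uint8_t up, std::uint8_t up_left)
{
    int p = left + up - up_left;
    int pa = std::abs(p - left), pb = std::abs(p - up), pc = std::abs(p - up_left);
    if (pa <= pb && pa <= pc)
        return left;
    return pb <= pc ? up : up_left;
}

// Hypothetical sketch: every input row is one filter-type byte plus `bytes_per_row` data bytes.
std::optional<std::vector<std::uint8_t>> undo_png_predictor(
    std::vector<std::uint8_t> const& data, std::size_t bytes_per_row, std::size_t bytes_per_pixel)
{
    if (bytes_per_row == 0 || bytes_per_pixel == 0 || data.size() % (bytes_per_row + 1) != 0)
        return std::nullopt;

    std::vector<std::uint8_t> out;
    std::vector<std::uint8_t> previous(bytes_per_row, 0); // the row above the first row is all zeros
    for (std::size_t pos = 0; pos < data.size(); pos += bytes_per_row + 1) {
        std::uint8_t filter = data[pos];
        std::vector<std::uint8_t> row(data.data() + pos + 1, data.data() + pos + 1 + bytes_per_row);
        for (std::size_t i = 0; i < bytes_per_row; ++i) {
            std::uint8_t left = i >= bytes_per_pixel ? row[i - bytes_per_pixel] : 0;
            std::uint8_t up = previous[i];
            std::uint8_t up_left = i >= bytes_per_pixel ? previous[i - bytes_per_pixel] : 0;
            int predicted = 0;
            switch (filter) {
            case 0: predicted = 0; break;                         // None
            case 1: predicted = left; break;                      // Sub
            case 2: predicted = up; break;                        // Up
            case 3: predicted = (left + up) / 2; break;           // Average
            case 4: predicted = paeth(left, up, up_left); break;  // Paeth
            default: return std::nullopt;                         // unknown filter type
            }
            row[i] = static_cast<std::uint8_t>(row[i] + predicted); // arithmetic is modulo 256
        }
        out.insert(out.end(), row.begin(), row.end());
        previous = std::move(row);
    }
    return out;
}
```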
2022-11-19 | LibPDF: Support cascading stream filters | Julian Offenhäuser
You can specify multiple filters as an array, where each one is fed the output of the one before it.
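Conceptually, a filter cascade is a left-to-right fold over the filter array. The sketch below models filters as plain functions on std::string. The `Filter` alias and `apply_filters` are illustrative names only and do not reflect the actual Filter::decode() interface.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical model of a decoding filter: bytes in, bytes out.
using Filter = std::function<std::string(std::string const&)>;

// Applies the filters in array order; each one receives the previous output.
std::string apply_filters(std::string data, std::vector<Filter> const& filters)
{
    for (auto const& filter : filters)
        data = filter(data);
    return data;
}
```

An empty filter list returns the data unchanged, which matches the case of a stream without any filter entry.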
2022-11-19 | LibPDF: Parse hexadecimal values in name objects correctly | Julian Offenhäuser
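In PDF name objects, a '#' followed by two hexadecimal digits encodes a single byte, so /Lime#20Green names "Lime Green". A minimal sketch of that decoding step is shown here. `decode_name` and `hex_value` are hypothetical helpers, and the input is assumed to be the raw name without its leading slash.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Returns the numeric value of a hex digit, or -1 if `c` is not one.
static int hex_value(unsigned char c)
{
    if (std::isdigit(c))
        return c - '0';
    if (std::isxdigit(c))
        return std::tolower(c) - 'a' + 10;
    return -1;
}

// Hypothetical sketch: expands "#xx" escapes in a name (given without the leading '/').
std::optional<std::string> decode_name(std::string_view raw)
{
    std::string name;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] != '#') {
            name.push_back(raw[i]);
            continue;
        }
        // An escape needs exactly two hex digits after the '#'.
        if (i + 2 >= raw.size())
            return std::nullopt;
        int high = hex_value(static_cast<unsigned char>(raw[i + 1]));
        int low = hex_value(static_cast<unsigned char>(raw[i + 2]));
        if (high < 0 || low < 0)
            return std::nullopt;
        name.push_back(static_cast<char>(high * 16 + low));
        i += 2;
    }
    return name;
}
```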
2022-10-16 | LibPDF: Allow text operator sequences to start with whitespace | Julian Offenhäuser
2022-09-17 | LibPDF: Allow whitespace other than EOL after an object marker | Julian Offenhäuser
2022-09-17 | LibPDF: Disallow parsing indirect values as operands | Julian Offenhäuser
An operation like 0 0 0 RG would have been confused for [ 0, 0 0 R ] G
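To make the ambiguity concrete: inside a content stream, operands are direct objects only, so the three numbers must stay three numbers and RG must be read as the operator. The toy tokenizer below illustrates that reading. `Operation` and `parse_operation` are invented for this sketch, which handles only numeric operands and uses strtod as a loose stand-in for PDF number parsing.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: numeric operands followed by one operator name.
struct Operation {
    std::vector<double> operands;
    std::string op;
};

// Reads whitespace-separated tokens; no token sequence is ever combined into
// an indirect reference ("0 0 R"), because operands must be direct values.
Operation parse_operation(std::string const& text)
{
    Operation result;
    std::istringstream stream(text);
    std::string token;
    while (stream >> token) {
        char* end = nullptr;
        double value = std::strtod(token.c_str(), &end);
        if (end != token.c_str() && *end == '\0') {
            result.operands.push_back(value); // the whole token is a number
            continue;
        }
        result.op = token; // first non-numeric token is the operator
        break;
    }
    return result;
}
```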
2022-09-17 | LibPDF: Move document-specific parsing functionality into its own class | Julian Offenhäuser
The Parser class is now a generic PDF object parser, from which the new DocumentParser class derives. DocumentParser now takes over all functions relating to linearization, pages, xref and trailer handling. This allows the use of multiple parsers in the same document's context, which will be needed in order to handle PDF object streams.
2022-09-17 | LibPDF: Move consume and match helper functions to the Reader class | Julian Offenhäuser
2022-07-12 | Everywhere: Add sv suffix to strings relying on StringView(char const*) | sin-ack
Each of these strings would previously rely on StringView's char const* constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.
2022-04-01 | Everywhere: Run clang-format | Idan Horowitz
2022-03-31 | LibPDF: Rename Command to Operator | Matthew Olsson
This is the correct name, according to the spec
2022-03-29 | LibPDF: Attempt to decrypt strings and streams | Matthew Olsson
2022-03-29 | LibPDF: Require Document* in Parser constructor | Matthew Olsson
This makes it a bit easier to avoid calling parser->set_document, an issue which cost me ~30 minutes to find.
2022-03-29 | LibPDF: Keep track of the current object index/generation while Parsing | Matthew Olsson
This information is required to decrypt encrypted strings/streams.
2022-03-29 | LibPDF: Get rid of PlainText/Encoded StreamObject | Matthew Olsson
This was a small optimization to allow a stream object to simply hold a reference to the bytes in a PDF document rather than duplicating them. However, as we move into features such as encryption, this optimization does more harm than good. This can be revisited in the future if necessary.
2022-03-07 | LibPDF: Allow newlines between xref table and "trailer" keyword | Matthew Olsson
2022-03-07 | LibPDF: Fix bad hex string parsing logic | Matthew Olsson
2022-03-07 | LibPDF: Remove useless hex string substring call | Matthew Olsson
2022-03-07 | LibPDF: Propagate errors in Parser and Document | Matthew Olsson
2022-03-07 | LibPDF: Remove unused function in Parser | Matthew Olsson
2022-01-24 | LibPDF: Make Filter::decode() return ErrorOr | Sam Atkins
2022-01-24 | Everywhere: Convert ByteBuffer factory methods from Optional -> ErrorOr | Sam Atkins
Apologies for the enormous commit, but I don't see a way to split this up nicely. In the vast majority of cases it's a simple change. A few extra places can use TRY instead of manual error checking though. :^)
2022-01-08 | LibPDF: Convert `PDF::Parser::m_document` from `RefPtr` to `WeakPtr` | Simon Woertz
Otherwise both `PDF::Document` and `PDF::Parser` have a `RefPtr` pointing to each other which leads to a memory leak due to a circular dependency.
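The same ownership pattern can be shown with the standard library, using std::shared_ptr and std::weak_ptr as stand-ins for AK's RefPtr and WeakPtr. The `Document` and `Parser` structs below are simplified placeholders, not the LibPDF classes, and `live_count` exists only to make the destruction observable.

```cpp
#include <memory>

struct Document;

struct Parser {
    // Non-owning back-reference: it does not keep the Document alive.
    std::weak_ptr<Document> document;
};

struct Document {
    std::shared_ptr<Parser> parser; // the Document owns its Parser
    static inline int live_count = 0;
    Document() { ++live_count; }
    ~Document() { --live_count; }
};

// Wires up both directions without creating an ownership cycle.
std::shared_ptr<Document> make_document()
{
    auto document = std::make_shared<Document>();
    document->parser = std::make_shared<Parser>();
    document->parser->document = document;
    return document;
}
```

If `Parser::document` were a std::shared_ptr instead, dropping the last external reference would leave both objects alive, each kept around by the other.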
2021-11-17 | AK: Convert AK::Format formatting helpers to returning ErrorOr<void> | Andreas Kling
This isn't a complete conversion to ErrorOr<void>, but a good chunk. The end goal here is to propagate buffer allocation failures to the caller, and allow the use of TRY() with formatting functions.
2021-11-16 | LibPDF: Check if there is data left before consuming | Simon Woertz
Add a check to `Parser::consume_eol` to ensure that there is more data to read before actually consuming any data. Not checking whether there is data left leads to a failed assertion in the case of, e.g., a truncated PDF file.
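The guard amounts to testing for end of input before every read. Below is a minimal standalone sketch with an invented `Reader` struct. The real Parser and Reader classes have a different interface.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical, simplified reader over a byte buffer.
struct Reader {
    std::string_view bytes;
    std::size_t offset = 0;

    bool done() const { return offset >= bytes.size(); }

    // Consumes one EOL marker (LF, CR or CRLF) if present.
    // Returns false without reading anything when no data is left.
    bool consume_eol()
    {
        if (done())
            return false;
        if (bytes[offset] == '\n') {
            ++offset;
            return true;
        }
        if (bytes[offset] == '\r') {
            ++offset;
            // Check again: the input may end right after the CR.
            if (!done() && bytes[offset] == '\n')
                ++offset;
            return true;
        }
        return false;
    }
};
```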
2021-11-11 | Everywhere: Pass AK::ReadonlyBytes by value | Andreas Kling
2021-11-10 | AK: Make ByteBuffer::try_* functions return ErrorOr<void> | Andreas Kling
Same as Vector, ByteBuffer now also signals allocation failure by returning an ENOMEM Error instead of a bool, allowing us to use the TRY() and MUST() patterns.
2021-10-30 | LibPDF: Parser::parse_header() return false if remaining bytes is zero | Brendan Coles
2021-09-20 | LibPDF: Replace Value class by AK::Variant | Ben Wiederhake
This decreases the memory consumption by LibPDF by 4 bytes per Value, compensating exactly for the increase in an earlier commit. :^)
2021-09-20 | LibPDF: Extract reference bitpacking into dedicated class | Ben Wiederhake
2021-09-20 | LibPDF: Move inline function definition | Ben Wiederhake
This breaks the dependency cycle between Parser and Document.
2021-09-06 | Everywhere: Make ByteBuffer::{create_*,copy}() OOM-safe | Ali Mohammad Pur
2021-09-06 | Everywhere: Use OOM-safe ByteBuffer APIs where possible | Ali Mohammad Pur
If we can easily communicate failure, let's avoid asserting and report failure instead.
2021-07-19 | Everywhere: Use AK/Math.h if applicable | Hendiadyoin1
AK's version should see better inlining behavior than the LibM one. We avoid mixed usage for now though. Also clean up some stale math includes and improper floating-point usage.
2021-07-16 | LibPDF: Fix treating not finding the linearized dict as a fatal error | Wesley Moret
We now try to parse the first indirect value and see if it's the `Linearization Parameter Dictionary`. If it's not, we fall back to reading the xref table from the end of the document.
2021-07-16 | LibPDF: Fix checking `minor_ver` instead of `major_ver` | Wesley Moret
2021-06-12 | LibPDF: Convert to east-const to comply with the recent style changes | Matthew Olsson
2021-06-12 | LibPDF: Parse hint tables | Matthew Olsson
This code isn't _actually_ used as of right now, but I wrote it at the same time as all of the code in the previous commit. I realized after I wrote it that these hint tables aren't super useful if the parser already has access to the full file. However, this will be useful if we ever want to stream PDFs from the web (and possibly view them in the browser).