Age | Commit message (Collapse) | Author |
|
|
|
|
|
An operation like 0 0 0 RG would have been confused for [ 0, 0 0 R ] G
|
|
The Parser class is now a generic PDF object parser, of which the new
DocumentParser class derives. DocumentParser now takes over all
functions relating to linearization, pages, xref and trailer handling.
This allows the use of multiple parsers in the same document's
context, which will be needed in order to handle PDF object streams.
|
|
|
|
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
|
|
|
|
This is the correct name, according to the spec
|
|
|
|
This makes it a bit easier to avoid calling parser->set_document, an
issue which cost me ~30 minutes to find.
|
|
This information is required to decrypt encrypted strings/streams.
|
|
This was a small optimization to allow a stream object to simply hold
a reference to the bytes in a PDF document rather than duplicating
them. However, as we move into features such as encryption, this
optimization does more harm than good. This can be revisited in the
future if necessary.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Apologies for the enormous commit, but I don't see a way to split this
up nicely. In the vast majority of cases it's a simple change. A few
extra places can use TRY instead of manual error checking though. :^)
|
|
Otherwise both `PDF::Document` and `PDF::Parser` have a `RefPtr`
pointing to each other which leads to a memory leak due to a circular
dependency.
|
|
This isn't a complete conversion to ErrorOr<void>, but a good chunk.
The end goal here is to propagate buffer allocation failures to the
caller, and allow the use of TRY() with formatting functions.
|
|
Add a check to `Parser::consume_eol` to ensure that there is more data
to read before actually consuming any data. Not checking if there is
data left leads to failing an assertion in case of e.g., a truncated
pdf file.
|
|
|
|
Same as Vector, ByteBuffer now also signals allocation failure by
returning an ENOMEM Error instead of a bool, allowing us to use the
TRY() and MUST() patterns.
|
|
|
|
This decreases the memory consumption by LibPDF by 4 bytes per Value,
compensating exactly for the increase in an earlier commit. :^)
|
|
|
|
This breaks the dependency cycle between Parser and Document.
|
|
|
|
If we can easily communicate failure, let's avoid asserting and report
failure instead.
|
|
AK's version should see better inlining behaviors, than the LibM one.
We avoid mixed usage for now though.
Also clean up some stale math includes and improper floatingpoint usage.
|
|
We now try to parse the first indirect value and see
if it's the `Linearization Parameter Dictionary`. if it's not, we
fallback to reading the xref table from the end of the document
|
|
|
|
|
|
This code isn't _actually_ used as of right now, but I wrote it at the
same time as all of the code in the previous commit. I realized after
I wrote it that these hint tables aren't super useful if the parser
already has access to the full file. However, this will be useful if
we ever want to stream PDFs from the web (and possibly view them in
the browser).
|
|
This is a big step, as most PDFs which are downloaded online will be
linearized. Pretty much the only difference is that the xref structure
is slightly different.
|
|
- A newline was assumed to follow the "stream" keyword, when it can also
be a windows-style line break
- Fix not consuming the "endobj" at the end of every indirect object
|
|
The Parser should hold information relevant for parsing, whereas the
Document should hold information relevant for displaying pages.
With this in mind, there is no reason for the Document to hold the
xref table and trailer. These objects have been moved to the Parser,
which allows the Parser to expose less public methods (which will be
even more evident once linearized PDFs are supported).
|
|
|
|
|
|
Previously, AK::Function would accept _any_ callable type, and try to
call it when called, first with the given set of arguments, then with
zero arguments, and if all of those failed, it would simply not call the
function and **return a value-constructed Out type**.
This lead to many, many, many hard to debug situations when someone
forgot a `const` in their lambda argument types, and many cases of
people taking zero arguments in their lambdas to ignore them.
This commit reworks the Function interface to not include any such
surprising behaviour, if your function instance is not callable with
the declared argument set of the Function, it can simply not be
assigned to that Function instance, end of story.
|
|
|
|
Strings can be encoded in either UTF16-BE or UTF8. In either case,
there are a few initial bytes which specify the encoding that must
be checked and also removed from the final string.
|
|
IndirectValueRef is so simple that it can be stored directly in the
Value class instead of being heap allocated.
As the comment in Value says, however, in theory the max bits needed to
store is 48 (16 for the generation index and 32(?) for the object
index), but 32 should be good enough for now. We can increase it to u64
later if necessary.
|
|
This commit also splits up StreamObject into PlainTextStreamObject and
EncodedStreamObject, which is essentially just a stream object which
does not own its bytes vs one which does.
|
|
|
|
|
|
Some PDFs omit this key apparently, but Firefox opens them fine. Let's
emulate that behavior.
|
|
We now follow nested page tree nodes to find all of the actual
page dicts, whereas previously we just assumed the root level
page tree node contained all of the page children directly.
|
|
This commit introduces the ability to parse the document catalog dict,
as well as the page tree and individual pages. Pages obviously aren't
fully parsed, as we won't care about most of the fields until we
start actually rendering PDFs.
One of the primary benefits of the PDF format is laziness. PDFs are
not meant to be parsed all at once, and the same is true for pages.
When a Document is constructed, it builds a map of page number to
object index, but it does not fetch and parse any of the pages. A page
is only parsed when a caller requests that particular page (and is
cached going forwards).
Additionally, this commit also adds an object_cast function which
logs bad casts if DEBUG_PDF is set. Additionally, utility functions
were added to ArrayObject and DictObject to get all types of objects
from the collections to avoid having to manually cast.
|