(Note: This advisory can also be found at http://pdfsig-collision.florz.de/)
The specification of the Portable Document Format (PDF) from version
1.3 onward, including ISO 19005-1:2005 (PDF/A-1) and ISO 32000-1:2008
(equivalent to PDF 1.7), ostensibly defines a mechanism for digitally
signing a document's contents so as to integrate cryptographic
authentication of a document's contents into the existing container
format. A common use of this mechanism is for the creation of supposedly
non-repudiable signatures on legal documents, including scenarios where
digital signatures are mandated by law.
This advisory shows how a signed PDF document can be constructed in such a
way that its appearance can be changed without necessarily invalidating the
It is not entirely clear whether the files provided as a demonstration of
the vulnerability can actually be considered (syntactically valid) PDF
documents or not--I haven't found a cleaner way so far. Also, the
demonstration documents do not work with all implementations in the same
way--however, I would argue that the mere fact that implementations
(and in at least one case even two different interfaces to what seems to
be the same implementation) don't agree on how to interpret a document
and its signing status while not being in conflict with the specification
in any obvious way is sufficient evidence that at the very least the
specification is lacking.
That said, my opinion is that the mechanism is fundamentally flawed and
ought to be replaced. Also, the main point of this advisory is not the
practical demonstration (even though it takes up most of the space),
but rather the theoretical deficiency of the specification, which is
only shown by the demonstration to not be purely theoretical in nature.
= The Problem
The PDF specification does not itself specify any of the cryptographic
operations to be performed for signing or for the verification of
signatures, but instead limits itself to providing a framework that any
signature mechanism can be plugged into.
The specification defines how the document is to be serialized into the
sequence of bytes that is fed into the signature mechanism, it defines the
way the resulting signature blob along with possible mechanism-specific
signature meta data is to be stored within the file, and it specifies
a marker that is used to distinguish between signature mechanisms.
Practically, PKCS#7 seems to be the prevalent signature mechanism in use,
thus building on proven, well-documented, and well-understood technology
for the core cryptographic operations.
The problem lies in the serialization step: The transformation that
creates the byte sequence that is fed into the signature mechanism is
non-injective, and, of course, not collision resistant, thus allowing
for the construction of colliding documents.
= How PDF Signatures (Don't) Work
Note: In the following, it is assumed that you are familiar with the
structure of PDF files. If you are not, you can find a short introduction
below that covers everything that you need to know in order to understand
The data structure at the heart of the digital signature framework is
a "signature dictionary" like this one (defined under 12.8.1 in
/Contents <12[lots of hex digits ...]ef>
/ByteRange [0 123 456 789]
The SubFilter field indicates the signature mechanism used, the Contents
field stores the signature blob produced by the signature mechanism as
a hexadecimal string, and the ByteRange field specifies the regions
of the file that are covered by the signature (it's a list of pairs,
where each pair specifies a start offset and the number of bytes to
include starting at that offset--it should, as per the specification,
cover the whole file, excluding only exactly the value of the Contents
field of the signature dictionary).
The serialization of the document that's fed into the signature mechanism
is exactly bit-identical to the serialization in the final (signed) file
(including the cross-reference table), with the only difference that
the range of bytes that is occupied by the value of the Contents field
in the final file is left out--or, to put it another way: It is the
concatenation of those byte ranges of the final file that are specified
by the ByteRange field.
To simplify the reasoning about the attack, a model of a "byte sequence
with a gap of a defined size" will be used in the following--the "gap"
being an object that occupies address space in the sequence but doesn't
have any properties besides that.
Using this model, one can describe the process of the creation of a
signed PDF file as follows:
First, the document is serialized into such a byte sequence with a
gap of appropriate size inserted in the location where the value of
the signature dictionary's Contents field would belong in the final
file. This is called the "preliminary serialization" in the following.
The preliminary serialization then is transformed into the byte sequence
that is fed into the signature mechanism by simply dropping the gap from
the sequence. This is called the "signing serialization" in the following.
Finally, the final, signed file is created by replacing the gap in the
preliminary serialization with the signature blob that has been created
by the signature mechanism.
= The Attack in Detail
Using the model described above, if we assume the values of the bytes in
the sequence to be opaque to us, it is obvious that the transformation
=66rom the preliminary to the signing serialization is not injective, as
there is no way for reconstructing the gap from the signing serialization.
If we do allow for the contents of the sequence to be used for the
reconstruction of the gap, at first glance it may seem as if the
ByteRange field provided the needed information (position and size of
the gap)--however, in order to locate the signature dictionary with
the ByteRange field in it, one first would have to reconstruct the gap,
as the offsets of the indirect objects one has to traverse in order to
find the signature dictionary (such as the document catalog) may depend
on the size of the gap (depending on the object's position relative to
the gap). This circular dependency can not be broken in the general case,
as is demonstrated by the following example.
So, the idea of the attack is to create a pair of preliminary
serializations that differ only in the position of the gap (thus
resulting in colliding signing serializations) with two alternative
document catalogs in between the two possible gap locations, so that
depending on which of the two locations the signature blob is injected at,
either one or the other of the two catalogs will be moved to the offset
that the cross-reference table entry for the document catalog points to.
Schematically, the two preliminary serializations look like this: