PDF File analysis

From: https://trailofbits.github.io/ctf/forensics/

PDF is an extremely complicated document file format, with enough tricks and hiding places to write about for years. This also makes it popular for CTF forensics challenges. In 2008 the NSA wrote a guide to these hiding places titled "Hidden Data and Metadata in Adobe PDF Files: Publication Risks and Countermeasures." It is no longer available at its original URL, but copies are still mirrored elsewhere. Ange Albertini also keeps a wiki on GitHub of PDF file format tricks.

The PDF format is partially plain text, like HTML, but with many binary "objects" in the contents. Didier Stevens has written good introductory material about the format. The binary objects can hold compressed or even encrypted data, and can embed active content such as JavaScript or Flash. To display the structure of a PDF, you can either browse it with a text editor or open it with a PDF-aware file-format editor like Origami.
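Because the object framing is plain text, a first look at a file's structure does not need a PDF library. The following is a minimal sketch that uses only the Python standard library; the function names `iter_objects` and `decode_stream` are made up for this illustration. It assumes unencrypted objects and a single `/FlateDecode` filter, and the regular-expression scan is a heuristic rather than a real parser, so anything it cannot inflate is returned as raw bytes.

```python
import re
import zlib

# Heuristic patterns: "N G obj ... endobj" framing and "stream ... endstream" payloads.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def iter_objects(data: bytes):
    """Yield (object number, generation, raw body) for each indirect object found."""
    for match in OBJ_RE.finditer(data):
        yield int(match.group(1)), int(match.group(2)), match.group(3)


def decode_stream(body: bytes):
    """Return an object's stream payload, inflated when possible, or None if it has no stream."""
    match = STREAM_RE.search(body)
    if match is None:
        return None
    raw = match.group(1)
    if b"/FlateDecode" in body:
        try:
            return zlib.decompress(raw)
        except zlib.error:
            # Other filters, predictors, or encryption are out of scope for this sketch.
            return raw
    return raw


def dump(path: str) -> None:
    """Print a one-line summary per object; `path` is whatever file you are examining."""
    with open(path, "rb") as handle:
        data = handle.read()
    for number, generation, body in iter_objects(data):
        payload = decode_stream(body)
        size = "no stream" if payload is None else f"{len(payload)} stream bytes"
        print(f"obj {number} {generation}: {size}")
```

Calling `dump()` on a challenge file gives a quick inventory of objects, and printing the result of `decode_stream` for the interesting ones is often enough to spot embedded scripts or text.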

qpdf is one tool that can be useful for exploring a PDF and transforming or extracting information from it. Another is a framework in Ruby called Origami.
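One common use of qpdf in this setting is rewriting a file into its "QDF" form, which uncompresses streams and lays objects out so that a text editor can read them. A minimal sketch of driving it from Python follows; it assumes the `qpdf` binary is installed and on `PATH`, and the helper names `qpdf_normalize_cmd` and `normalize` are hypothetical.

```python
import subprocess


def qpdf_normalize_cmd(src: str, dst: str) -> list:
    """Build a qpdf command line that rewrites src into editor-friendly QDF form."""
    return [
        "qpdf",
        "--qdf",                     # normalized layout with uncompressed streams
        "--object-streams=disable",  # unpack objects stored inside object streams
        src,
        dst,
    ]


def normalize(src: str, dst: str) -> None:
    """Run qpdf on src and write the normalized copy to dst."""
    result = subprocess.run(qpdf_normalize_cmd(src, dst))
    # qpdf exits with 3 when it only emitted warnings; the output is often still usable.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit code {result.returncode}")
```

After `normalize("challenge.pdf", "challenge.qdf.pdf")` (both file names are placeholders), the output file can be searched with ordinary text tools.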

When exploring PDF content for hidden data, some of the hiding places to check include:

  • non-visible layers

  • Adobe's metadata format "XMP"

  • the "incremental update" feature of PDF (sometimes called incremental generation), wherein a previous version of the document is retained in the file but not visible to the user

  • white text on a white background

  • text behind images

  • an image behind an overlapping image

  • non-displayed comments
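Two of the hiding places above, retained earlier versions and XMP metadata, can be checked directly on the raw bytes. The sketch below uses only the standard library, and the names `split_revisions` and `extract_xmp` are invented for this example. It assumes that each `%%EOF` marker closes one saved revision, so the bytes up to that marker form a candidate earlier version of the file, and that the XMP packet is stored uncompressed. Neither assumption always holds: linearized PDFs, for instance, usually contain two `%%EOF` markers without any hidden revision.

```python
import re

# "%%EOF" plus its optional trailing whitespace and end-of-line.
EOF_RE = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")
# An XMP packet wrapped in <x:xmpmeta> ... </x:xmpmeta>.
XMP_RE = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def split_revisions(data: bytes):
    """Return one byte string per %%EOF marker, each a candidate earlier revision."""
    return [data[: match.end()] for match in EOF_RE.finditer(data)]


def extract_xmp(data: bytes):
    """Return every uncompressed XMP packet found in the raw bytes, decoded as text."""
    return [
        match.group(0).decode("utf-8", errors="replace")
        for match in XMP_RE.finditer(data)
    ]


def carve(path: str) -> None:
    """Write each candidate revision next to the input file and print any XMP found."""
    with open(path, "rb") as handle:
        data = handle.read()
    for index, revision in enumerate(split_revisions(data), start=1):
        with open(f"{path}.rev{index}.pdf", "wb") as out:
            out.write(revision)
    for packet in extract_xmp(data):
        print(packet)
```

Opening the carved `.rev1.pdf`, `.rev2.pdf`, and so on in a viewer shows what the document looked like at each save, which is where deleted text or images tend to reappear.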

There are also several Python packages for working with the PDF file format, like PeepDF, that enable you to write your own parsing scripts.
