File types

Published: July 20, 2023

Description

The provided PDF is actually a shell archive containing multiple nested formats (ar, cpio, bzip2, gzip, lzip, etc.). Extract them sequentially until you reach ASCII text, then hex-decode the contents.

Run the file as a shell archive (`sh Flag.pdf`) to extract `flag`.

Inspect each resulting file with `file` and use the appropriate extractor (ar, cpio, bzip2, gzip, lzip, lz4, lzma, lzop, xz, etc.).

Once the final ASCII file appears, hex-decode it with `xxd -r -p`.

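# If a tool rejects the name, rename the file to the suffix it expects first (for example, mv flag flag.gz).
sh Flag.pdf
ar x flag
cpio -i --file flag.cpio
bzip2 -d flag
gunzip flag.gz
lzip -d flag
unlz4 flag.lz4
lzma -d flag.lzma
lzop -d flag.lzop
unxz flag.xz
xxd -r -p flag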

Solution

  1. Peel each layer
    After each extraction, run `file flag` to identify the next compression/container type and use its counterpart to extract again.

    The `file` command reads a file's magic bytes, the first few bytes of the file, to determine its actual type regardless of the filename extension. Most binary formats begin with a fixed signature: PNG files start with \x89PNG, ZIP files with PK, gzip streams with \x1f\x8b, and so on. This makes `file` reliable even when extensions have been changed or stripped, which is exactly what this challenge does.

    Shell archives (shar) are self-extracting scripts that encode file contents as shell commands. Running one with `sh` executes the script and recreates the original files. Because a shar is plain text, it was a common way to distribute source code via email or newsgroups in the early Unix era. The "PDF" extension here is a red herring: `file` looks at the contents and identifies a shell archive.

    The succession of formats in this challenge (ar, cpio, bzip2, gzip, lzip, lz4, lzma, lzop, xz) is a tour through Unix archive and compression history. ar and cpio are archive containers; the rest are compressors, each designed to improve on earlier tools in speed, compression ratio, or freedom from patents. xz, lzma, and lzip, all built on the LZMA algorithm, generally achieve the best compression ratios of the group, while lz4 and lzop favor speed; gzip remains the most widely deployed due to its long history.
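As a small illustration of the signature check described in this step, the sketch below compresses a scratch file, stores it under a misleading `.pdf` name, and reads its first two bytes directly. The scratch file names are throwaway assumptions; only the `1f 8b` gzip signature comes from the text above.

```shell
# Make a gzip stream but store it under a misleading .pdf name.
sample=$(mktemp)
printf 'hello\n' | gzip -c > "$sample.pdf"

# Read the first two bytes as hex; a gzip stream starts with 1f 8b.
magic=$(head -c 2 "$sample.pdf" | od -An -tx1 | tr -d ' \n')
echo "magic bytes: $magic"

# Clean up the scratch files.
rm -f "$sample" "$sample.pdf"
```

Running `file` on the same scratch file would report gzip data for the same reason: it looks at the content, not the name.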

  2. Decode the hex
    The final file is ASCII hex; `xxd -r -p` converts it back into the readable picoCTF flag.

    Hexadecimal encoding represents each byte as two hex digits (0-9, a-f), so a 10-byte file becomes a 20-character hex string. It's commonly used to display binary data in a human-readable, copy-pasteable form. The xxd tool both creates hex dumps (xxd file) and reverses them (xxd -r -p reads plain hex and outputs raw bytes).

    The -p flag tells xxd to use "plain" hex format - just the hex digits with no address offsets or ASCII sidebar. This is the format produced by tools like Python's bytes.hex() and is the cleanest form to pipe between tools.

    Nested compression challenges like this one teach you to systematically identify and peel container formats rather than panicking when a file isn't what its extension claims. In real forensics work, files are frequently renamed or given wrong extensions to obscure their nature, and checking the magic bytes with `file` is far more trustworthy than the extension.
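To make the "two hex digits per byte" rule from this step concrete, here is a minimal sketch that encodes a short string with `od` and decodes it again in plain shell. The `hex_decode` helper is hypothetical and only mimics what `xxd -r -p` does for input without spaces or newlines; on a machine with `xxd` installed, `xxd -r -p` is the simpler choice.

```shell
# Encode: od prints each byte as two hex digits; tr strips the spacing.
encoded=$(printf 'picoCTF' | od -An -tx1 | tr -d ' \n')
echo "$encoded"    # 7069636f435446

# Decode a plain hex string (no spaces or newlines) back to raw bytes.
# hex_decode is a hypothetical helper, not part of xxd.
hex_decode() {
    hex=$1
    # Two digits per byte, so an odd-length string cannot be valid.
    if [ $(( ${#hex} % 2 )) -ne 0 ]; then
        echo "odd-length hex string" >&2
        return 1
    fi
    while [ -n "$hex" ]; do
        pair=${hex%"${hex#??}"}   # first two hex digits
        hex=${hex#??}             # drop them from the input
        # Turn the pair into an octal escape and let printf emit the byte.
        printf "\\$(printf '%03o' "0x$pair")"
    done
}

hex_decode "$encoded"; echo    # picoCTF
```

The seven input bytes become fourteen hex characters, which matches the doubling described above.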

Flag

picoCTF{f1len@m3_m@n1pul@t10n_f0r_0b2cur17y_3c7...}

Automating the extraction loop with `while file flag | grep ...` can save time on nested compression challenges.
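One possible shape for such a loop is sketched below. The function names `decompressor_for` and `peel` are hypothetical, the patterns assume the descriptions that `file -b` commonly prints for these formats, and the loop assumes every archive member has the same name as the file being peeled, as the commands earlier in this writeup suggest. It also assumes all of the listed tools are installed and that the shell archive has already been unpacked by hand.

```shell
# Map a `file -b` description to a decompressor that accepts -dc.
# Prints nothing for formats this sketch does not know.
decompressor_for() {
    case "$1" in
        *"gzip compressed"*)  echo "gzip" ;;
        *"bzip2 compressed"*) echo "bzip2" ;;
        *"lzip compressed"*)  echo "lzip" ;;
        *"LZ4 compressed"*)   echo "lz4" ;;
        *"LZMA compressed"*)  echo "lzma" ;;
        *"lzop compressed"*)  echo "lzop" ;;
        *"XZ compressed"*)    echo "xz" ;;
    esac
}

# Peel layers from the file named in $1 until `file` reports ASCII text.
# Assumption: each ar/cpio member is also named "$1".
peel() {
    f=$1
    while kind=$(file -b "$f"); do
        echo "layer: $kind"
        case "$kind" in
            *"ASCII text"*) return 0 ;;
            *"ar archive"*)   mv "$f" "$f.ar"   && ar x "$f.ar" || return 1 ;;
            *"cpio archive"*) mv "$f" "$f.cpio" && cpio -i < "$f.cpio" || return 1 ;;
            *)
                tool=$(decompressor_for "$kind")
                [ -n "$tool" ] || { echo "unknown layer: $kind" >&2; return 1; }
                # Decompress via stdout so the file's suffix does not matter.
                "$tool" -dc < "$f" > "$f.next" && mv "$f.next" "$f" || return 1 ;;
        esac
    done
}
```

A typical run would be `sh Flag.pdf && peel flag && xxd -r -p flag`. Streaming through stdout with `-dc` sidesteps the renaming that tools such as `gunzip` and `unxz` otherwise require.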
