File Carving and Magic Bytes: Repairing Corrupted Files for CTF

The file won't open. Here is the 60-second triage

You downloaded the challenge file. You double-clicked it. Nothing. The image viewer says "not a valid image," or the archive tool says "unexpected end of file," or it opens to garbage. Do not panic and do not reach for a hex editor yet. Run these three commands in order, because they answer different questions and each one takes a second:

$ file mystery.bin          # what does the OS think this is?
mystery.bin: data           # 'data' means the magic bytes are wrong or missing
$ xxd mystery.bin | head     # look at the raw first 16 bytes per row
00000000: 0000 0000 0000 0000 0000 000d 4948 4452  ............IHDR
$ strings -n 8 mystery.bin | head   # any human-readable text, URLs, flags?
tEXtComment
picoCTF{not_this_easy_but_sometimes}

That is the whole opening move. file tells you whether the header matches a known type. xxd shows you the actual bytes so you can compare them to what the header should be. strings pulls out readable text, which is where lazy flags and format hints hide. In the example above, file reported data (it could not identify the type) but xxd clearly shows an IHDR type tag at offset 12, exactly where a PNG puts it: 8 signature bytes, a 4-byte chunk length (00 00 00 0d), then IHDR. That is a PNG with its 8-byte signature zeroed out. You already know the fix before you have done any real work: write the correct magic bytes back to the front of the file.

A "broken" file is almost never random damage. It is a known format with a deliberately edited header, an appended second file, or a missing chunk. Forensics is pattern recognition, not luck.
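
When the command-line tools are not at hand, the same first look can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for xxd or strings: the function name first_look, the sample bytes, and the flag text are all made up for the example.

```python
import re

def first_look(data: bytes, min_len: int = 8):
    """Return (hex of the first 16 bytes, printable ASCII runs of at least min_len)."""
    head = data[:16].hex(" ")
    # Runs of printable ASCII (0x20-0x7e), the same idea as `strings -n <min_len>`.
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    runs = [m.group().decode("ascii") for m in re.finditer(pattern, data)]
    return head, runs

# Hypothetical sample: a PNG whose 8-byte signature was zeroed out.
sample = (
    bytes(8)                                # zeroed signature
    + b"\x00\x00\x00\x0dIHDR"               # chunk length 13, then the IHDR type tag
    + b"\x00\x00\x01\x90\x00\x00\x01\x2c"   # width 400, height 300
    + b"\x00picoCTF{example_flag}\x00"      # made-up text for illustration
)
head, runs = first_look(sample)
print(head)   # the IHDR tag (49 48 44 52) sits at offset 12
print(runs)   # IHDR itself is too short to show up with min_len=8
```

Lowering min_len to 4 makes the IHDR tag appear in the text runs as well, which is a quick way to spot chunk names when the hex view is not conclusive.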

The rest of this guide is the toolbox for each of those three cases: patching a header by its magic bytes, repairing a structurally broken PNG, and pulling embedded or appended files out with binwalk and foremost. If you want the deep dive on reading raw bytes first, the Hex Dumps for CTF post covers xxd and endianness in detail, and the file magic identifier tool will name a type from a hex prefix you paste in.

What are magic bytes, and why does file say data?

File extensions are a lie your file manager tells you. Tools like file do not trust .png or .pdf; they read the first few bytes of the file and match them against a database of known signatures. Those leading bytes are the magic number, and most formats put a fixed, recognizable value there on purpose so that tools can identify the content regardless of what the file is named. The file command uses a compiled database (the magic file) of exactly these patterns.

When file prints data, it means none of the signatures it knows matched the start of your file. In a CTF that is a signal, not a dead end: the header was edited. Here is the short list worth memorizing, because these are the formats that show up in forensics challenges over and over:

Format	Magic (hex)	ASCII	Notes
PNG	89 50 4E 47 0D 0A 1A 0A	.PNG....	Full 8-byte signature; ends with IEND chunk
JPEG	FF D8 FF	...	Fourth byte is usually E0, E1, or DB; ends with FF D9
GIF	47 49 46 38 39 61	GIF89a	Also GIF87a (...37 61)
PDF	25 50 44 46	%PDF	Ends with %%EOF
ZIP	50 4B 03 04	PK..	Also docx, xlsx, jar, apk (all ZIP)
ELF	7F 45 4C 46	.ELF	Linux binaries and core dumps
PCAP	D4 C3 B2 A1	....	Little-endian; A1 B2 C3 D4 is big-endian
PCAPNG	0A 0D 0D 0A	....	Section Header Block type; the byte-order magic 1A 2B 3C 4D (stored as 4D 3C 2B 1A in little-endian captures) sits at offset 8

Key insight: Many formats also have a known trailer, not just a header. PNG ends with an IEND chunk, JPEG ends with FF D9, PDF ends with %%EOF, and a ZIP ends with the End of Central Directory record (50 4B 05 06). When you carve or repair, the trailer tells you where the real file stops, which matters the moment a second file has been appended.
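
The header-plus-trailer idea can be sketched as a small lookup in Python. This is a rough illustration under stated assumptions: the names HEADERS, TRAILERS, identify, and bytes_after_trailer are hypothetical, only a few formats from the table are included, and the trailer search takes the first match, which is a heuristic (FF D9 in particular can appear inside JPEG data).

```python
from typing import Optional

HEADERS = {
    "png": bytes.fromhex("89504e470d0a1a0a"),
    "jpeg": bytes.fromhex("ffd8ff"),
    "gif": b"GIF8",            # covers both GIF87a and GIF89a
    "pdf": b"%PDF",
    "zip": b"PK\x03\x04",
    "elf": b"\x7fELF",
}
TRAILERS = {
    "png": bytes.fromhex("49454e44ae426082"),   # IEND type tag plus its CRC
    "jpeg": bytes.fromhex("ffd9"),
}

def identify(data: bytes) -> Optional[str]:
    """Name the format whose magic bytes match the start of data, if any."""
    for name, magic in HEADERS.items():
        if data.startswith(magic):
            return name
    return None

def bytes_after_trailer(data: bytes, kind: str) -> Optional[int]:
    """Count the bytes that follow the first trailer match (heuristic)."""
    trailer = TRAILERS.get(kind)
    if trailer is None:
        return None
    pos = data.find(trailer)
    if pos == -1:
        return None
    return len(data) - (pos + len(trailer))

# Hypothetical sample: a PNG-shaped blob with 8 extra bytes glued on.
png_like = HEADERS["png"] + b"...chunks..." + bytes(4) + TRAILERS["png"]
blob = png_like + b"PK\x03\x04rest"
print(identify(blob), bytes_after_trailer(blob, "png"))
```

A nonzero count after the trailer is the cue to look for an appended second file.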

For anything not in the table, the canonical reference is Gary Kessler's File Signatures table, which lists thousands of header and trailer patterns. Keep it open during a forensics challenge. If you would rather paste bytes and get an answer, the on-site hex viewer and strings extractor do the same first-look without leaving the browser.

How do I fix a corrupted header by patching the magic bytes?

Once xxd has told you which format the file should be, repairing a wrong or missing header is a single byte-level edit. There are three reliable ways to do it, from quickest to most controlled.

1. Overwrite in place with a hex editor. Open the file in hexedit, navigate to offset 0, type the correct hex over the wrong bytes (press Tab to toggle between the hex and ASCII panes), and save. This is the right move when the file length is already correct and only the signature is wrong, for example a PNG whose first 8 bytes were zeroed:

$ hexedit mystery.png
# move cursor to offset 0, overwrite with:  89 50 4E 47 0D 0A 1A 0A
# Ctrl-X to save and exit
$ file mystery.png
mystery.png: PNG image data, 400 x 300, 8-bit/color RGBA, non-interlaced

2. Surgical patch with dd. When you want a repeatable, scriptable edit (and no interactive editor), write the bytes with printf and splice them in with dd. The conv=notrunc flag is the part people forget: without it, dd truncates the rest of your file.

# write the 8-byte PNG signature over the first 8 bytes, in place
$ printf '\x89\x50\x4e\x47\x0d\x0a\x1a\x0a' | \
    dd of=mystery.png bs=1 seek=0 conv=notrunc
8+0 records in
8+0 records out

3. Prepend bytes that are missing entirely. If the header was deleted (not overwritten), the file is too short by however many bytes the signature is, and overwriting offset 0 would clobber real data. In that case you concatenate the missing prefix in front of the existing bytes:

# the JPEG lost its leading FF D8 FF; prepend them, keep the body intact
$ printf '\xff\xd8\xff' | cat - body.jpg > fixed.jpg
$ file fixed.jpg
fixed.jpg: JPEG image data, JFIF standard 1.01

Warning: Decide whether the header was overwritten (file length is right, use seek with conv=notrunc) or removed (file is short, prepend). Guessing wrong shifts every subsequent byte and produces a file that still will not open. Compare the file size against a known-good sample of the same dimensions when you are unsure.
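
For a PNG, the overwrite-or-prepend decision can be made from the position of the IHDR tag, since an intact file has it at offset 12. The sketch below assumes exactly that and nothing more; the function names are hypothetical, and a file with damage beyond the signature will need the manual approach.

```python
PNG_SIG = bytes.fromhex("89504e470d0a1a0a")

def overwrite_header(data: bytes, magic: bytes) -> bytes:
    """Header was overwritten: replace the first len(magic) bytes, length unchanged."""
    return magic + data[len(magic):]

def prepend_header(data: bytes, magic: bytes) -> bytes:
    """Header was removed: put the magic in front, length grows by len(magic)."""
    return magic + data

def repair_png_signature(data: bytes) -> bytes:
    """Pick the repair based on where the IHDR type tag currently sits."""
    pos = data.find(b"IHDR")
    if pos == 12:      # 8 signature bytes + 4 length bytes are still in place
        return overwrite_header(data, PNG_SIG)
    if pos == 4:       # only the 4 length bytes precede IHDR: signature is gone
        return prepend_header(data, PNG_SIG)
    raise ValueError(f"IHDR found at offset {pos}; not a simple signature fix")

# Hypothetical samples: the same body with the signature zeroed, then removed.
body = b"\x00\x00\x00\x0dIHDR" + bytes(17)
zeroed = bytes(8) + body
removed = body
print(repair_png_signature(zeroed) == repair_png_signature(removed))
```

Both damaged variants come back as the same repaired bytes, which is the point of checking the offset before choosing a method.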

picoCTF 2019 extensions is the gentlest possible introduction to this idea: a file is misnamed, and file plus a rename is the whole solution. From there, picoCTF 2024 Scan Surprise rewards simply trusting file over the extension and treating the bytes as what they actually are.

How do I repair a structurally broken PNG?

PNG is the format CTF authors love to break, because its structure is rich enough to hide damage in several places at once. A PNG is the 8-byte signature followed by a sequence of chunks, and each chunk has the same shape: a 4-byte big-endian length, a 4-byte type tag (IHDR, IDAT, IEND, and so on), the data, then a 4-byte CRC-32 over the type and data. The authoritative layout is in the W3C PNG specification, and it is worth skimming once so the chunk names below mean something.

The four places a PNG challenge typically breaks, and the fix for each:

The signature. Zeroed or altered first 8 bytes. Patch them back to 89 50 4E 47 0D 0A 1A 0A as shown above.
The IHDR dimensions. The IHDR chunk holds width and height as two 4-byte big-endian integers right after the type tag. Authors shrink the height to 1 pixel to hide the image. Edit those bytes to the real dimensions and the picture reappears.
A bad CRC. Editing any chunk data invalidates its CRC, and strict viewers refuse to render. Recompute it, or use a repair tool that fixes CRCs in bulk.
A missing IEND. The trailing IEND chunk (00 00 00 00 49 45 4E 44 AE 42 60 82, a fixed 12-byte sequence) marks the end. If it is gone, append it.
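
The chunk layout described above (length, type, data, CRC-32 over type and data) is simple enough to walk by hand. The following is a minimal sketch of that walk in Python, not a substitute for a real checker; walk_chunks and make_chunk are hypothetical helper names, and the tiny 1x1 image is built only so the example has something to read.

```python
import struct
import zlib

PNG_SIG = bytes.fromhex("89504e470d0a1a0a")

def make_chunk(ctype: bytes, body: bytes) -> bytes:
    """Build one chunk: 4-byte length, type tag, data, CRC-32 over type + data."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)

def walk_chunks(data: bytes):
    """Yield (chunk_start, type, length, crc_ok) for each chunk after the signature."""
    pos = len(PNG_SIG)
    while pos + 12 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        crc_bytes = data[pos + 8 + length:pos + 12 + length]
        if len(body) < length or len(crc_bytes) < 4:
            break                          # chunk runs past the end of the file
        stored = struct.unpack(">I", crc_bytes)[0]
        computed = zlib.crc32(ctype + body) & 0xFFFFFFFF
        yield pos, ctype.decode("latin-1"), length, stored == computed
        pos += 12 + length
        if ctype == b"IEND":
            break

# Hypothetical sample: a minimal 1x1 grayscale PNG assembled from three chunks.
png = (
    PNG_SIG
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + make_chunk(b"IEND", b"")
)
for chunk in walk_chunks(png):
    print(chunk)
```

Note that this sketch reports the offset where each chunk's length field starts, which is four bytes before the type tag.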

You do not have to find the CRC errors by eye. The PNG toolkit pngcheck walks every chunk and tells you exactly which one is malformed and why:

$ pngcheck -v broken.png
File: broken.png (12044 bytes)
  chunk IHDR at offset 0x0000c, length 13
    400 x 1 image, 32-bit RGB+alpha, non-interlaced   <-- height of 1 is suspicious
  chunk IDAT at offset 0x00025, length 8192
    CRC error in chunk IDAT (computed 00007a3f, expected 00000000)
ERROR: broken.png

Here pngcheck has handed you the whole solution: the height is 1 (it should be larger), and there is a CRC error. Open the file in a hex editor and find the IHDR data, which starts at offset 0x10, right after the type tag. The width is the four bytes at 0x10 through 0x13 (00 00 01 90 for an image 400 pixels wide), and the height is the next four bytes, 0x14 through 0x17. Set the height to a plausible value and re-run pngcheck.

Tip: You can patch the height blind. Set it to something tall, like 0x00 0x00 0x05 0x00 (1280), and most viewers will render whatever real rows exist and pad or ignore the rest. The flag is usually visible long before the true height, so an overestimate works. Many viewers tolerate a wrong CRC entirely, so fixing dimensions alone often reveals the flag without touching the checksum.
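
For viewers that do insist on a valid checksum, the height patch and the IHDR CRC fix can be done together. Here is a minimal sketch, assuming the signature is intact and IHDR is the first chunk; set_png_height is a hypothetical helper name and the sample header is constructed just for the demonstration.

```python
import struct
import zlib

def set_png_height(data: bytes, height: int) -> bytes:
    """Rewrite the IHDR height and recompute the IHDR CRC to match."""
    if data[12:16] != b"IHDR":
        raise ValueError("IHDR type tag not found at offset 12")
    patched = bytearray(data)
    patched[20:24] = struct.pack(">I", height)              # height: 0x14..0x17
    crc = zlib.crc32(bytes(patched[12:29])) & 0xFFFFFFFF    # type tag + 13 data bytes
    patched[29:33] = struct.pack(">I", crc)                 # CRC follows the data
    return bytes(patched)

# Hypothetical sample: signature plus an IHDR chunk for a 400 x 1 RGBA image
# whose CRC field has been left as zeros.
ihdr = struct.pack(">IIBBBBB", 400, 1, 8, 6, 0, 0, 0)
png_head = (
    bytes.fromhex("89504e470d0a1a0a")
    + struct.pack(">I", 13) + b"IHDR" + ihdr + bytes(4)
)
fixed = set_png_height(png_head, 0x500)
print(struct.unpack(">II", fixed[16:24]))   # width and height after the patch
```

The file length does not change, so this is an in-place edit in the same sense as the dd patch earlier.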

For the corruption that goes beyond a single field, picoCTF 2019 c0rrupt is the canonical exercise: the file's signature and several chunk fields are all damaged, and you rebuild them one at a time using pngcheck and a hex editor until the image renders. It is the single best practice target for everything in this section.

What is hidden inside this file? Finding embedded and appended data

A file that opens correctly can still be hiding a second file. The most common trick in forensics challenges is appending: a valid JPEG, then a ZIP archive glued to its end. The image viewer reads up to the JPEG trailer and stops, so it looks normal, but the archive sits in the bytes after it. binwalk scans the entire file for every known magic signature, not just the one at offset 0, and reports each one it finds:

$ binwalk innocent.jpg
DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             JPEG image data, JFIF standard 1.01
12387         0x3063          Zip archive data, at least v2.0 to extract
12511         0x30DF          End of Zip archive, footer length: 22

That output says there is a ZIP starting at byte 12387. To pull the embedded files out, add the -e flag (extract using its signature rules) or, more reliably for archives, the --dd flag with an explicit type. binwalk drops everything into a _innocent.jpg.extracted/ directory:

$ binwalk -e innocent.jpg          # auto-extract known types
$ ls _innocent.jpg.extracted/
3063.zip   flag.txt
# if -e is stubborn, carve the region yourself with dd
$ dd if=innocent.jpg of=hidden.zip bs=1 skip=12387
$ unzip hidden.zip

Note: binwalk -e is signature-based, so it occasionally reports false positives (a random run of bytes that happens to look like a gzip header). Trust the entries that line up with a plausible offset and a matching trailer; treat a lone, sizeless hit in the middle of compressed image data with suspicion. The binwalk project page documents the full signature and extraction options.
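
To see why signature scanning both works and produces false positives, it helps to write the naive version. The sketch below is a toy under stated assumptions, not how binwalk is implemented: it searches for a handful of raw magic values at every offset, and the names SIGNATURES and scan plus the sample blob are made up for illustration.

```python
import io
import zipfile

SIGNATURES = {
    "jpeg": bytes.fromhex("ffd8ff"),
    "png": bytes.fromhex("89504e470d0a1a0a"),
    "zip": b"PK\x03\x04",
    "zip-eocd": b"PK\x05\x06",
}

def scan(data: bytes):
    """Return sorted (offset, name) hits for every signature anywhere in data."""
    hits = []
    for name, magic in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(magic, pos + 1)
    return sorted(hits)

# Hypothetical sample: a stand-in "JPEG" with a small ZIP glued to its end.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    info = zipfile.ZipInfo("flag.txt", date_time=(2020, 1, 1, 0, 0, 0))
    zf.writestr(info, b"picoCTF{example_flag}")
jpeg_like = bytes.fromhex("ffd8ffe0") + b"A" * 64 + bytes.fromhex("ffd9")
blob = jpeg_like + buf.getvalue()

hits = scan(blob)
zip_offset = next(off for off, name in hits if name == "zip")
hidden = blob[zip_offset:]     # same effect as dd bs=1 skip=<offset>
print(hits)
```

A three-byte pattern like FF D8 FF will match by accident in any large enough blob, which is why a hit is only convincing when a matching trailer follows it.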

One more reflex: when you suspect appended data but want the simplest possible extraction, you do not even need binwalk for a ZIP. unzip finds the archive through the End of Central Directory record at the end of the file, so leading bytes do not bother it, and 7z copes with the same layout. 7z x image.jpg often just works.
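
Python's standard zipfile module handles an archive with leading bytes as well, so the same shortcut can be scripted. A minimal sketch, with read_appended_zip as a hypothetical helper name and a fabricated image-plus-ZIP blob as input:

```python
import io
import zipfile

def read_appended_zip(blob: bytes) -> dict:
    """Open blob as a ZIP even if other bytes precede the archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}

# Hypothetical sample: a stand-in image followed by a one-file ZIP.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("flag.txt", b"picoCTF{example_flag}")
image_like = bytes.fromhex("ffd8ffe0") + b"A" * 64 + bytes.fromhex("ffd9")
blob = image_like + buf.getvalue()

# The End of Central Directory record is found by searching from the end.
eocd = blob.rfind(b"PK\x05\x06")
print(len(blob) - eocd)            # size of the record when there is no comment
print(read_appended_zip(blob))
```

With no archive comment the record is 22 bytes long, which matches the footer length binwalk reports.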

picoCTF 2019 like1000 builds on a close cousin of this pattern: nested TAR archives, each one wrapping the next, peeled open layer after layer. For the same idea applied to network captures and embedded streams, the Wireshark and PCAP post covers extracting transferred files out of a capture.

How do I carve files out of a raw blob by signature?

When you have a disk image, a memory dump, or one large undifferentiated blob and you do not know where the interesting files are, you carve. File carving means scanning the raw bytes for headers and trailers and writing out everything between them as a reconstructed file, ignoring any filesystem metadata entirely. Two tools own this job: foremost and scalpel.

foremost is the quickest to run. Point it at the blob, give it an output directory, and it sorts everything it recovers into per-type subfolders:

$ foremost -i disk.img -o carved/
$ ls carved/
audit.txt   jpg/   png/   pdf/   zip/
$ cat carved/audit.txt
Num   Name (bs=512)        Size     File Offset     Comment
0:    00000048.jpg         88 KB    24576
1:    00001102.png         12 KB    564224
2:    00002310.pdf        301 KB   1182720

You can restrict carving to the types you care about with -t (for example -t jpg,png,pdf,zip), which is faster and cuts down on false hits. The audit file records the byte offset each file was carved from, which is useful when you need to go back and look at the surrounding bytes by hand. The foremost project page lists the built-in signatures and the config-file format for adding your own.

scalpel is a faster, more configurable descendant of the same idea. You edit scalpel.conf to enable the signatures you want (each line gives an extension, a case-sensitivity flag, a maximum carve size, a header, and an optional footer), then run it against the blob:

# enable a line in /etc/scalpel/scalpel.conf, e.g.:
#   png   y   20000000   \x89PNG\x0d\x0a\x1a\x0a   \x49\x45\x4e\x44\xae\x42\x60\x82
$ scalpel disk.img -o carved_scalpel/
$ ls carved_scalpel/
png-0-0/   zip-2-0/   audit.txt

Key insight: Carving recovers files even when the filesystem is gone, but it can only find a file whose bytes are contiguous. A fragmented file (split across non-adjacent regions) carves out broken, because the tool grabs everything from header to the first matching trailer regardless of fragmentation. If a carved JPEG is half-corrupt, that is usually fragmentation, not a tool bug. For full-disk-image work, pair carving with a filesystem walk as described in the disk forensics post.
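
The header-to-first-trailer behavior, and the way fragmentation defeats it, can be shown with a toy carver. This is a sketch of the general idea only, not the logic of foremost or scalpel; the carve function and the fake PNG bodies are hypothetical.

```python
PNG_SIG = bytes.fromhex("89504e470d0a1a0a")
PNG_END = bytes.fromhex("49454e44ae426082")   # IEND type tag plus its CRC

def carve(data: bytes, header: bytes, trailer: bytes, max_size: int = 20_000_000):
    """Return (offset, bytes) for each header-to-first-trailer span in data."""
    found = []
    pos = data.find(header)
    while pos != -1:
        end = data.find(trailer, pos + len(header), pos + max_size)
        if end != -1:
            stop = end + len(trailer)
            found.append((pos, data[pos:stop]))
            pos = data.find(header, stop)
        else:
            pos = data.find(header, pos + 1)   # no trailer in range, move on
    return found

# Hypothetical blob: two fake PNGs separated by padding, as in a raw image.
png1 = PNG_SIG + b"first" + PNG_END
png2 = PNG_SIG + b"second" + PNG_END
blob = bytes(512) + png1 + bytes(100) + png2 + bytes(50)
print([(off, len(body)) for off, body in carve(blob, PNG_SIG, PNG_END)])

# A fragmented copy of png1: foreign bytes sit between its two halves.
fragmented = PNG_SIG + b"fir" + b"\xee" * 32 + b"st" + PNG_END
carved = carve(fragmented, PNG_SIG, PNG_END)[0][1]
print(carved == png1)   # the carver cannot tell the foreign bytes apart
```

The carved result for the fragmented input includes the foreign bytes, which is the half-corrupt output described above.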

The disk-image flavor of this shows up in picoCTF 2021 Disk, disk, sleuth, where you mount or carve a raw image to recover the hidden file. Carving is the fallback for when the structured filesystem tools do not surface what you are after.

What is a polyglot file, and how do I spot one?

A polyglot is a single file that is simultaneously valid as two or more formats. The classic is a file that is both a valid PDF and a valid ZIP, or a GIF that is also runnable JavaScript. They work by exploiting a gap in how each format defines its boundaries: a ZIP is located by its central directory at the end of the file, while a PDF or a GIF is parsed from the start, so a single byte stream can satisfy both parsers at once.

A polyglot is not corruption. It is a file that tells the truth to two different parsers at the same time.

You spot a polyglot the same way you spot appended data: file names one format, but binwalk or a second tool reports another signature inside the same bytes. The tell is that the second signature is not a random false positive but a fully formed file with both a header and a trailer. When that happens, try opening the file as each type in turn:

$ file weird.gif
weird.gif: GIF image data, version 89a, 320 x 240
$ binwalk weird.gif
0      0x0     GIF image data
5008   0x1390  Zip archive data, at least v1.0 to extract
$ unzip weird.gif        # the GIF is also a ZIP; just extract it
Archive:  weird.gif
  inflating: secret/flag.txt

The defensive lesson is the same as the offensive one: never trust a file's extension or even its file output as the final word. A file can wear two faces. If something feels too clean, run it through a second parser. Steganography challenges lean on this constantly, and the steganography post covers the adjacent tricks (data hidden in pixel bits, in metadata, in trailing bytes) that pair naturally with polyglots.
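
The second-parser habit can be sketched as a check that asks several format tests the same question. This is a minimal illustration with a deliberately short list of formats; parsers_accepting is a hypothetical name, and the GIF bytes are a stand-in that only carries a valid header.

```python
import io
import zipfile

def parsers_accepting(blob: bytes) -> list:
    """List the formats whose basic checks accept this byte stream."""
    kinds = []
    if blob[:6] in (b"GIF87a", b"GIF89a"):
        kinds.append("gif")
    if blob.startswith(b"%PDF"):
        kinds.append("pdf")
    if zipfile.is_zipfile(io.BytesIO(blob)):   # looks for the end-of-archive record
        kinds.append("zip")
    return kinds

# Hypothetical samples: a GIF-shaped header alone, then with a ZIP appended.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("secret/flag.txt", b"picoCTF{example_flag}")
gif_like = b"GIF89a" + bytes(32) + b";"
polyglot = gif_like + buf.getvalue()

print(parsers_accepting(gif_like))
print(parsers_accepting(polyglot))
```

More than one name in the result is the signal to open the file as each type in turn.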

Which picoCTF challenges build this muscle?

The picoCTF forensics track is a near-perfect ladder for the skills above, from one-command renames up to multi-field PNG reconstruction. Work them roughly in this order:

picoCTF 2019 extensions and picoCTF 2024 Scan Surprise: trust file over the extension. The whole point is to stop believing the filename.
picoCTF 2019 c0rrupt: the deep PNG repair. Damaged signature and chunk fields rebuilt with pngcheck and a hex editor. This is the one that teaches the chunk format for real.
picoCTF 2019 like1000: nested TAR archives. Pure peeling practice for tar (or binwalk) in a loop.
picoCTF 2021 Disk, disk, sleuth: carving and filesystem recovery on a raw image, the natural home for foremost.

Do them in order and you will have touched every technique in this guide on a real target: identify, patch, repair, extract, carve. After that, the next blob that file calls data is a puzzle you already know how to start.

Quick reference

Triage order for any unidentified file

file mystery.bin then xxd mystery.bin | head then strings -n 8 mystery.bin. Compare the first bytes to the table below.
Header wrong but length right? Patch in place: printf '\x89\x50...' | dd of=f bs=1 seek=0 conv=notrunc.
Header missing entirely? Prepend: printf '\xff\xd8\xff' | cat - body > fixed.jpg.
PNG that opens to garbage? pngcheck -v, fix IHDR dimensions and CRC, append IEND if missing.
File opens fine but feels heavy? binwalk f for appended data, then binwalk -e f or 7z x f to extract.
Raw blob or disk image? foremost -i blob -o carved/ to carve every known type by signature.

Format	Header (hex)	Trailer (hex)
PNG	89 50 4E 47 0D 0A 1A 0A	49 45 4E 44 AE 42 60 82
JPEG	FF D8 FF	FF D9
GIF	47 49 46 38 39 61	3B
PDF	25 50 44 46	25 25 45 4F 46
ZIP	50 4B 03 04	50 4B 05 06 (EOCD)
ELF	7F 45 4C 46	none (section-defined)
PCAP	D4 C3 B2 A1	none (record stream)

The whole discipline collapses to one habit: read the first sixteen bytes before you do anything else, because a corrupted file is just a known format wearing the wrong hat.