The file won't open. Here is the 60-second triage
You downloaded the challenge file. You double-clicked it. Nothing. The image viewer says "not a valid image," or the archive tool says "unexpected end of file," or it opens to garbage. Do not panic and do not reach for a hex editor yet. Run these three commands in order, because they answer different questions and each one takes a second:

    $ file mystery.bin                  # what does the OS think this is?
    mystery.bin: data                   # 'data' means the magic bytes are wrong or missing
    $ xxd mystery.bin | head            # look at the raw bytes, 16 per row
    00000000: 0000 0000 0000 0000 0000 000d 4948 4452  ............IHDR
    $ strings -n 8 mystery.bin | head   # any human-readable text, URLs, flags?
    tEXtComment
    picoCTF{not_this_easy_but_sometimes}

That is the whole opening move. file tells you whether the header matches a known type. xxd shows you the actual bytes so you can compare them to what the header should be. strings pulls out readable text, which is where lazy flags and format hints hide. In the example above, file reported data (it could not identify the type), but xxd clearly shows an IHDR type tag at offset 12, exactly where a PNG puts it: after the 8-byte signature and the 4-byte chunk length. That is a PNG with its 8-byte signature zeroed out. You already know the fix before you have done any real work: write the correct magic bytes back to the front of the file.
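If you want to see what xxd and strings are doing under the hood, the sketch below reproduces a rough equivalent of each in Python. The function names hexdump_row and printable_strings are made up for this example, and the output only approximates the real tools (GNU strings, for instance, also counts tabs as printable).

```python
import re


def hexdump_row(data: bytes, offset: int = 0) -> str:
    """Format up to 16 bytes the way a single row of xxd output looks."""
    row = data[offset:offset + 16]
    # xxd groups bytes in pairs: 8 groups of 4 hex digits, 39 columns wide.
    hex_part = " ".join(row[i:i + 2].hex() for i in range(0, len(row), 2))
    ascii_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in row)
    return f"{offset:08x}: {hex_part:<39}  {ascii_part}"


def printable_strings(data: bytes, min_len: int = 8) -> list[str]:
    """Return runs of printable ASCII at least min_len long (like strings -n)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]
```

Note the effect of the minimum length: a four-character tag such as IHDR never shows up with a threshold of 8, so a quiet strings run does not prove the tag is absent. Lower the threshold when you are hunting for chunk names.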
A "broken" file is almost never random damage. It is a known format with a deliberately edited header, an appended second file, or a missing chunk. Forensics is pattern recognition, not luck.
The rest of this guide is the toolbox for each of those three cases: patching a header by its magic bytes, repairing a structurally broken PNG, and pulling embedded or appended files out with binwalk and foremost. If you want the deep dive on reading raw bytes first, the Hex Dumps for CTF post covers xxd and endianness in detail, and the file magic identifier tool will name a type from a hex prefix you paste in.
What are magic bytes, and why does file say data?
File extensions are a lie your file manager tells you. Tools like file do not trust .png or .pdf; they read the first few bytes of the file and match them against a database of known signatures. Those leading bytes are the magic number, and most formats put a fixed, recognizable value there on purpose so that tools can identify the content regardless of what the file is named. The file command uses a compiled database (the magic file) of exactly these patterns.
When file prints data, it means none of the signatures it knows matched the start of your file. In a CTF that is a signal, not a dead end: the header was edited. Here is the short list worth memorizing, because these are the formats that show up in forensics challenges over and over:
| Format | Magic (hex) | ASCII | Notes |
|---|---|---|---|
| PNG | 89 50 4E 47 0D 0A 1A 0A | .PNG.... | Full 8-byte signature; ends with IEND chunk |
| JPEG | FF D8 FF | ... | Ends with FF D9; next byte is E0/E1/DB |
| GIF | 47 49 46 38 39 61 | GIF89a | Also GIF87a (...37 61) |
| PDF | 25 50 44 46 | %PDF | Ends with %%EOF |
| ZIP | 50 4B 03 04 | PK.. | Also docx, xlsx, jar, apk (all ZIP) |
| ELF | 7F 45 4C 46 | .ELF | Linux binaries and core dumps |
| PCAP | D4 C3 B2 A1 | .... | Little-endian; A1 B2 C3 D4 is big-endian |
| PCAPNG | 0A 0D 0D 0A | .... | Section Header Block; the byte-order magic 1A 2B 3C 4D (stored as 4D 3C 2B 1A when little-endian) sits at offset 8 |
Formats have trailers as well as headers: a PNG ends with an IEND chunk, a JPEG ends with FF D9, a PDF ends with %%EOF, and a ZIP ends with the End of Central Directory record (50 4B 05 06). When you carve or repair, the trailer tells you where the real file stops, which matters the moment a second file has been appended.
For anything not in the table, the canonical reference is Gary Kessler's File Signatures table, which lists thousands of header and trailer patterns. Keep it open during a forensics challenge. If you would rather paste bytes and get an answer, the on-site hex viewer and strings extractor do the same first-look without leaving the browser.
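The signature table above is small enough to turn into a lookup. The following is a minimal sketch of what a magic-byte matcher does, assuming only the signatures listed in this guide; the real file command uses a far larger database and looks beyond offset 0. The names SIGNATURES and identify are invented for the example.

```python
# (format name, header bytes as hex) taken from the table in this guide.
SIGNATURES = [
    ("PNG", "89 50 4E 47 0D 0A 1A 0A"),
    ("JPEG", "FF D8 FF"),
    ("GIF", "47 49 46 38 39 61"),  # GIF89a
    ("GIF", "47 49 46 38 37 61"),  # GIF87a
    ("PDF", "25 50 44 46"),
    ("ZIP", "50 4B 03 04"),
    ("ELF", "7F 45 4C 46"),
    ("PCAP", "D4 C3 B2 A1"),       # little-endian
    ("PCAP", "A1 B2 C3 D4"),       # big-endian
    ("PCAPNG", "0A 0D 0D 0A"),
]


def identify(data: bytes) -> str:
    """Return the first format whose header matches, or 'data' if none do."""
    for name, hex_sig in SIGNATURES:
        if data.startswith(bytes.fromhex(hex_sig)):
            return name
    return "data"
```

Feed it the first 16 bytes of the file (open in binary mode and read 16). A return value of "data" means the same thing it means from file: nothing at offset 0 matched, so go look at the bytes yourself.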
How do I fix a corrupted header by patching the magic bytes?
Once xxd has told you which format the file should be, repairing a wrong or missing header is a single byte-level edit. There are three reliable ways to do it, from quickest to most controlled.
1. Overwrite in place with a hex editor. Open the file in hexedit, navigate to offset 0, type the correct hex over the wrong bytes (press Tab to toggle between the hex and ASCII panes), and save. This is the right move when the file length is already correct and only the signature is wrong, for example a PNG whose first 8 bytes were zeroed:

        $ hexedit mystery.png
        # move cursor to offset 0, overwrite with: 89 50 4E 47 0D 0A 1A 0A
        # Ctrl-X to save and exit
        $ file mystery.png
        mystery.png: PNG image data, 400 x 300, 8-bit/color RGBA, non-interlaced

2. Surgical patch with dd. When you want a repeatable, scriptable edit (and no interactive editor), write the bytes with printf and splice them in with dd. The conv=notrunc flag is the part people forget: without it, dd truncates the rest of your file.

        # write the 8-byte PNG signature over the first 8 bytes, in place
        $ printf '\x89\x50\x4e\x47\x0d\x0a\x1a\x0a' | \
            dd of=mystery.png bs=1 seek=0 conv=notrunc
        8+0 records in
        8+0 records out

3. Prepend bytes that are missing entirely. If the header was deleted (not overwritten), the file is too short by however many bytes the signature is, and overwriting offset 0 would clobber real data. In that case you concatenate the missing prefix in front of the existing bytes:

        # the JPEG lost its leading FF D8 FF; prepend them, keep the body intact
        $ printf '\xff\xd8\xff' | cat - body.jpg > fixed.jpg
        $ file fixed.jpg
        fixed.jpg: JPEG image data, JFIF standard 1.01

Decide first whether the header was overwritten (file length is right, use seek with conv=notrunc) or removed (file is short, prepend). Guessing wrong shifts every subsequent byte and produces a file that still will not open. Compare the file size against a known-good sample of the same dimensions when you are unsure.
The picoCTF challenge picoCTF 2019 extensions is the gentlest possible introduction to this idea: a file is misnamed, and file plus a rename is the whole solution. From there, picoCTF 2024 Scan Surprise rewards simply trusting file over the extension and treating the bytes as what they actually are.
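For a PNG, the overwrite-or-prepend decision can be made from the bytes themselves, because the IHDR type tag belongs at offset 12. Here is a small sketch under that assumption; repair_png_header is a hypothetical helper, and it deliberately refuses to guess when the tag is anywhere else (a partially deleted signature, for example).

```python
PNG_SIG = bytes.fromhex("89504E470D0A1A0A")


def repair_png_header(data: bytes) -> bytes:
    """Restore a PNG signature, choosing overwrite or prepend by IHDR position."""
    ihdr_at = data.find(b"IHDR", 0, 32)
    if ihdr_at == 12:
        # Length is right: the 8 signature bytes were overwritten in place.
        return PNG_SIG + data[8:]
    if ihdr_at == 4:
        # File is 8 bytes short: the signature was deleted, so prepend it.
        return PNG_SIG + data
    raise ValueError(f"IHDR tag at unexpected offset {ihdr_at}; inspect by hand")
```

The first branch is the Python equivalent of dd with conv=notrunc, the second of cat with a prepended prefix. Read the file with open(path, "rb"), write the result to a new file, and keep the original untouched until the repaired copy opens.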
How do I repair a structurally broken PNG?
PNG is the format CTF authors love to break, because its structure is rich enough to hide damage in several places at once. A PNG is the 8-byte signature followed by a sequence of chunks, and each chunk has the same shape: a 4-byte big-endian length, a 4-byte type tag (IHDR, IDAT, IEND, and so on), the data, then a 4-byte CRC-32 over the type and data. The authoritative layout is in the W3C PNG specification, and it is worth skimming once so the chunk names below mean something.
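The chunk layout described above is simple enough to walk by hand. The sketch below parses length, type, data, and CRC for each chunk and checks the CRC with zlib.crc32, which implements the same CRC-32 the PNG format uses. walk_chunks is a name chosen for this example; it reports offsets at the type tag, which is the convention pngcheck output follows.

```python
import struct
import zlib

PNG_SIG = bytes.fromhex("89504E470D0A1A0A")


def walk_chunks(data: bytes):
    """Yield (type_tag_offset, type, length, crc_ok) for each chunk."""
    pos = len(PNG_SIG)  # chunks start right after the 8-byte signature
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        body = data[pos + 8:pos + 8 + length]
        stored = data[pos + 8 + length:pos + 12 + length]
        if len(body) < length or len(stored) < 4:
            break  # truncated chunk: stop rather than guess
        # The CRC covers the type tag and the data, not the length field.
        crc_ok = zlib.crc32(ctype + body) == struct.unpack(">I", stored)[0]
        yield pos + 4, ctype.decode("latin-1"), length, crc_ok
        pos += 12 + length
```

Running list(walk_chunks(data)) on a damaged file gives you a quick map of which chunks survive intact. A chunk with a wildly large length usually means the length field itself was edited, and the walk stops there.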
The four places a PNG challenge typically breaks, and the fix for each:
- The signature. Zeroed or altered first 8 bytes. Patch them back to 89 50 4E 47 0D 0A 1A 0A as shown above.
- The IHDR dimensions. The IHDR chunk holds width and height as two 4-byte big-endian integers right after the type tag. Authors shrink the height to 1 pixel to hide the image. Edit those bytes to the real dimensions and the picture reappears.
- A bad CRC. Editing any chunk data invalidates its CRC, and strict viewers refuse to render. Recompute it, or use a repair tool that fixes CRCs in bulk.
- A missing IEND. The trailing IEND chunk (00 00 00 00 49 45 4E 44 AE 42 60 82, a fixed 12-byte sequence) marks the end. If it is gone, append it.
You do not have to find the CRC errors by eye. The PNG toolkit pngcheck walks every chunk and tells you exactly which one is malformed and why:

    $ pngcheck -v broken.png
    File: broken.png (12044 bytes)
      chunk IHDR at offset 0x0000c, length 13
        400 x 1 image, 32-bit RGB+alpha, non-interlaced   <-- height of 1 is suspicious
      chunk IDAT at offset 0x00025, length 8192
      CRC error in chunk IDAT (computed 7a3f, expected 0000)
    ERROR: broken.png

Here pngcheck has handed you the whole solution: the height is 1 (it should be larger), and there is a CRC error. Open the file in a hex editor and find the IHDR data, which starts right after the type tag at offset 0x10. The first four bytes there are the width (00 00 01 90 for a 400-pixel-wide image), and the next four, at offset 0x14, are the height. Set the height to a plausible value and re-run pngcheck.
If you do not know the true height, guess high. Set it to something like 0x00 0x00 0x05 0x00 (1280), and most viewers will render whatever real rows exist and pad or ignore the rest. The flag is usually visible long before the true height, so an overestimate works. Many viewers tolerate a wrong CRC entirely, so fixing dimensions alone often reveals the flag without touching the checksum.
For corruption that goes beyond a single field, picoCTF 2019 c0rrupt is the canonical exercise: the file's signature and several chunk fields are all damaged, and you rebuild them one at a time using pngcheck and a hex editor until the image renders. It is the single best practice target for everything in this section.
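When a strict viewer does insist on a valid checksum, the height edit and the CRC fix can be done together. This is a minimal sketch, assuming the signature is already repaired and IHDR is the first chunk (type tag at offset 12); set_png_height is a hypothetical helper, not part of any PNG toolkit.

```python
import struct
import zlib


def set_png_height(data: bytes, height: int) -> bytes:
    """Return a copy with the IHDR height replaced and the IHDR CRC recomputed."""
    if data[12:16] != b"IHDR":
        raise ValueError("IHDR is not the first chunk; fix the header first")
    patched = bytearray(data)
    patched[20:24] = struct.pack(">I", height)   # height: 4 bytes at 0x14
    crc = zlib.crc32(bytes(patched[12:29]))      # type tag + 13 data bytes
    patched[29:33] = struct.pack(">I", crc)      # stored CRC sits at 0x1d
    return bytes(patched)
```

Calling set_png_height(data, 1280) writes the 00 00 05 00 guess from the text and leaves the file length unchanged. It only repairs the IHDR checksum; a CRC error reported in IDAT or another chunk is a separate fix.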
What is hidden inside this file? Finding embedded and appended data
A file that opens correctly can still be hiding a second file. The most common trick in forensics challenges is appending: a valid JPEG, then a ZIP archive glued to its end. The image viewer reads up to the JPEG trailer and stops, so it looks normal, but the archive sits in the bytes after it. binwalk scans the entire file for every known magic signature, not just the one at offset 0, and reports each one it finds:

    $ binwalk innocent.jpg
    DECIMAL       HEXADECIMAL     DESCRIPTION
    --------------------------------------------------------------------------------
    0             0x0             JPEG image data, JFIF standard 1.01
    12387         0x3063          Zip archive data, at least v2.0 to extract
    12511         0x30DF          End of Zip archive, footer length: 22

That output says there is a ZIP starting at byte 12387. To pull the embedded files out, add the -e flag (extract using its signature rules) or, more reliably for archives, the --dd flag with an explicit type. binwalk drops everything into a _innocent.jpg.extracted/ directory:

    $ binwalk -e innocent.jpg              # auto-extract known types
    $ ls _innocent.jpg.extracted/
    3063.zip  flag.txt
    # if -e is stubborn, carve the region yourself with dd
    $ dd if=innocent.jpg of=hidden.zip bs=1 skip=12387
    $ unzip hidden.zip

binwalk -e is signature-based, so it occasionally reports false positives (a random run of bytes that happens to look like a gzip header). Trust the entries that line up with a plausible offset and a matching trailer; treat a lone, sizeless hit in the middle of compressed image data with suspicion. The binwalk project page documents the full signature and extraction options.
One more reflex: when you suspect appended data but want the simplest possible extraction, you do not even need binwalk for a ZIP. unzip and 7z will both scan forward to find the central directory and extract a ZIP that is glued to the back of an image, ignoring the leading bytes. 7z x image.jpg often just works.
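The core of a signature scan is small: search the whole file, not just offset 0, for every header you know. The sketch below is a simplified stand-in for that idea, not a reimplementation of binwalk; the three signatures and the helper names scan_signatures and open_embedded_zip are assumptions made for the example.

```python
import io
import zipfile

SIGNATURES = {
    "JPEG": bytes.fromhex("FFD8FF"),
    "ZIP": bytes.fromhex("504B0304"),
    "ZIP end": bytes.fromhex("504B0506"),
}


def scan_signatures(data: bytes, signatures: dict[str, bytes] = SIGNATURES) -> list[tuple[int, str]]:
    """Return (offset, name) for every occurrence of every signature, by offset."""
    hits = []
    for name, sig in signatures.items():
        start = data.find(sig)
        while start != -1:
            hits.append((start, name))
            start = data.find(sig, start + 1)
    return sorted(hits)


def open_embedded_zip(data: bytes, offset: int) -> zipfile.ZipFile:
    """Slice from offset to the end (what dd skip= does) and open it as a ZIP."""
    return zipfile.ZipFile(io.BytesIO(data[offset:]))
```

Short signatures such as FF D8 FF will match by chance inside compressed data, which is exactly the false-positive problem described above. Treat each hit as a lead to confirm, for example by checking that a ZIP hit is followed later by a "ZIP end" hit.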
This appended-archive pattern is exactly what picoCTF 2019 like1000 builds on: nested TAR archives, each one wrapping the next, peeled open layer after layer. For the same idea applied to network captures and embedded streams, the Wireshark and PCAP post covers extracting transferred files out of a capture.
How do I carve files out of a raw blob by signature?
When you have a disk image, a memory dump, or one large undifferentiated blob and you do not know where the interesting files are, you carve. File carving means scanning the raw bytes for headers and trailers and writing out everything between them as a reconstructed file, ignoring any filesystem metadata entirely. Two tools own this job: foremost and scalpel.
foremost is the quickest to run. Point it at the blob, give it an output directory, and it sorts everything it recovers into per-type subfolders:

    $ foremost -i disk.img -o carved/
    $ ls carved/
    audit.txt  jpg/  png/  pdf/  zip/
    $ cat carved/audit.txt
    Num      Name (bs=512)         Size      File Offset     Comment
    0:       00000048.jpg          88 KB           24576
    1:       00001102.png          12 KB          564224
    2:       00002310.pdf         301 KB         1182720

You can restrict carving to the types you care about with -t (for example -t jpg,png,pdf,zip), which is faster and cuts down on false hits. The audit file records the byte offset each file was carved from, which is useful when you need to go back and look at the surrounding bytes by hand. The foremost project page lists the built-in signatures and the config-file format for adding your own.
scalpel is a faster, more configurable descendant of the same idea. You edit scalpel.conf to enable the signatures you want (each line is a header, an optional footer, and a maximum carve size), then run it against the blob:

    # enable a line in /etc/scalpel/scalpel.conf, e.g.:
    #   png  y  20000000  \x89PNG\x0d\x0a\x1a\x0a  \x49\x45\x4e\x44\xae\x42\x60\x82
    $ scalpel disk.img -o carved_scalpel/
    $ ls carved_scalpel/
    png-0-0/  zip-2-0/  audit.txt

The disk-image flavor of this shows up in picoCTF 2021 Disk, disk, sleuth, where you mount or carve a raw image to recover the hidden file. Carving is the fallback for when the structured filesystem tools do not surface what you are after.
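A header, a footer, and a maximum carve size are all a basic carver needs. The sketch below shows that loop in its simplest form, assuming the whole blob fits in memory; real carvers stream the input and handle formats without trailers. The function name carve and its default size limit are choices made for this example.

```python
def carve(blob: bytes, header: bytes, trailer: bytes, max_size: int = 20_000_000):
    """Yield (offset, carved_bytes) for each header..trailer span within max_size."""
    pos = blob.find(header)
    while pos != -1:
        # Only accept a trailer that ends within max_size bytes of the header.
        end = blob.find(trailer, pos + len(header), pos + max_size)
        if end == -1:
            pos = blob.find(header, pos + 1)  # no trailer in range: skip this hit
            continue
        stop = end + len(trailer)
        yield pos, blob[pos:stop]
        pos = blob.find(header, stop)         # resume after the carved file
```

For PNG you would pass the 8-byte signature as the header and 49 45 4E 44 AE 42 60 82 as the trailer, the same pair shown in the scalpel configuration line. The yielded offset plays the role of the File Offset column in the audit file: it tells you where to look in the blob if the carved result needs a second opinion.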
What is a polyglot file, and how do I spot one?
A polyglot is a single file that is simultaneously valid as two or more formats. The classic is a file that is both a valid PDF and a valid ZIP, or a GIF that is also runnable JavaScript. They work by exploiting a gap in how each format defines its boundaries: a ZIP is located by its central directory at the end of the file, while a PDF or a GIF is parsed from the start, so a single byte stream can satisfy both parsers at once.
A polyglot is not corruption. It is a file that tells the truth to two different parsers at the same time.
You spot a polyglot the same way you spot appended data: file names one format, but binwalk or a second tool reports another signature inside the same bytes. The tell is that the second signature is not a random false positive but a fully formed file with both a header and a trailer. When that happens, try opening the file as each type in turn:

    $ file weird.gif
    weird.gif: GIF image data, version 89a, 320 x 240
    $ binwalk weird.gif
    0             0x0             GIF image data
    5008          0x1390          Zip archive data, at least v1.0 to extract
    $ unzip weird.gif                # the GIF is also a ZIP; just extract it
    Archive:  weird.gif
      inflating: secret/flag.txt

The defensive lesson is the same as the offensive one: never trust a file's extension or even its file output as the final word. A file can wear two faces. If something feels too clean, run it through a second parser. Steganography challenges lean on this constantly, and the steganography post covers the adjacent tricks (data hidden in pixel bits, in metadata, in trailing bytes) that pair naturally with polyglots.
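The "second parser" check is easy to automate for the GIF-plus-ZIP case. This sketch relies on Python's zipfile module, which finds an archive by its end-of-central-directory record and so behaves like the ZIP side of a polyglot; the function name looks_like_gif_zip_polyglot is invented here, and the check covers only this one pairing.

```python
import io
import zipfile


def looks_like_gif_zip_polyglot(data: bytes) -> bool:
    """True when the bytes start with a GIF header and also open as a ZIP."""
    # Parsed from the start: the GIF reader only cares about the first bytes.
    is_gif = data[:6] in (b"GIF87a", b"GIF89a")
    # Located from the end: the ZIP reader looks for the central directory.
    is_zip = zipfile.is_zipfile(io.BytesIO(data))
    return is_gif and is_zip
```

When it returns True, zipfile.ZipFile(io.BytesIO(data)) can usually list and read the members directly, the same way unzip did in the transcript above.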
Which picoCTF challenges build this muscle?
The picoCTF forensics track is a near-perfect ladder for the skills above, from one-command renames up to multi-field PNG reconstruction. Work them roughly in this order:
- picoCTF 2019 extensions and picoCTF 2024 Scan Surprise: trust file over the extension. The whole point is to stop believing the filename.
- picoCTF 2019 c0rrupt: the deep PNG repair. Damaged signature and chunk fields rebuilt with pngcheck and a hex editor. This is the one that teaches the chunk format for real.
- picoCTF 2019 like1000: nested archives. Pure peeling practice for binwalk and tar.
- picoCTF 2021 Disk, disk, sleuth: carving and filesystem recovery on a raw image, the natural home for foremost.
Do them in order and you will have touched every technique in this guide on a real target: identify, patch, repair, extract, carve. After that, the next blob that file calls data is a puzzle you already know how to start.
Quick reference
Triage order for any unidentified file
- file mystery.bin then xxd mystery.bin | head then strings -n 8 mystery.bin. Compare the first bytes to the table below.
- Header wrong but length right? Patch in place: printf '\x89\x50...' | dd of=f bs=1 seek=0 conv=notrunc.
- Header missing entirely? Prepend: printf '\xff\xd8\xff' | cat - body > fixed.jpg.
- PNG that opens to garbage? pngcheck -v, fix IHDR dimensions and CRC, append IEND if missing.
- File opens fine but feels heavy? binwalk f for appended data, then binwalk -e f or 7z x f to extract.
- Raw blob or disk image? foremost -i blob -o carved/ to carve every known type by signature.
| Format | Header (hex) | Trailer (hex) |
|---|---|---|
| PNG | 89 50 4E 47 0D 0A 1A 0A | 49 45 4E 44 AE 42 60 82 |
| JPEG | FF D8 FF | FF D9 |
| GIF | 47 49 46 38 39 61 | 3B |
| PDF | 25 50 44 46 | 25 25 45 4F 46 |
| ZIP | 50 4B 03 04 | 50 4B 05 06 (EOCD) |
| ELF | 7F 45 4C 46 | none (section-defined) |
| PCAP | D4 C3 B2 A1 | none (record stream) |
The whole discipline collapses to one habit: read the first sixteen bytes before you do anything else, because a corrupted file is just a known format wearing the wrong hat.