Audio Steganography and Spectrograms for CTF Forensics

You got a .wav and a flag is in it. Now what?

A forensics challenge handed you an audio file and nothing else. Before you put on headphones and listen to thirty seconds of static hoping the flag spells itself out, run three commands. They answer most audio-stego challenges before you ever open an editor.

$ file flag.wav
flag.wav: RIFF (little-endian) data, WAVE audio, ...
# plaintext flag, base64, or a hint hiding in the metadata? check first.
$ strings -n 8 flag.wav | head
# draw the whole file as a frequency picture and look at it
$ sox flag.wav -n spectrogram -o spectrogram.png

That last command is the one that wins. Audio steganography in CTFs almost always falls into one of six buckets, and a spectrogram tells you which one you are in at a glance. Here is the whole decision in one place so you can stop reading and start solving:

What you see or hear	Technique	Go to
Text drawn into the high frequencies of the spectrogram	Spectrogram art	Read it
A warbling, scanning tone that ramps line by line	SSTV image	Decode it
Phone keypad beeps, two tones at once	DTMF tones	Dial it
Sounds totally normal, nothing in the spectrogram	LSB in samples	Extract it
A single beeping tone, long and short bursts	Morse code	Translate it
Garbled speech, too fast, too slow, or backward	Reversed / shifted	Fix it

Key insight: Audio steganography is not image steganography with a different file extension. Images hide data in pixels you cannot see; audio hides data in a time-frequency plane you have to render before you can see it. The spectrogram is your microscope. If you have not built one yet, you have not started the challenge.

This guide is the audio companion to the CTF Steganography overview and the Steganography Tools roundup. If you would rather drag the file onto one page and let a tool triage it, the on-site Stegall analyzer runs the common checks for you. The rest of this post is what to do once you know which bucket you are in.

Reading a spectrogram: the flag is literally drawn in the frequencies

The most common audio-stego trick is also the most direct. The author renders the flag as text, treats that text image as a brightness map, and paints it into the spectrum so the letters appear when you plot frequency against time. To your ears it is a hiss or a buzz near the top of the audible range. To a spectrogram it is picoCTF{...} in block capitals.

You have three ways to render it, in order of convenience:

# 1) sox, no GUI, dumps a PNG you can open anywhere
$ sox flag.wav -n spectrogram -o spectrogram.png
# wider/taller render for a long file or fine print
$ sox flag.wav -n spectrogram -x 1600 -y 800 -o spectrogram.png
# 2) ffmpeg, if sox is not installed
$ ffmpeg -i flag.wav -lavfi showspectrumpic=s=1600x800:legend=1 spectrogram.png

For interactive work, two GUI tools let you zoom, scroll, and adjust the color scale live, which matters when the text is faint or crammed into a narrow band:

Audacity: open the file, click the track name dropdown, choose Spectrogram instead of Waveform. Then open Preferences -> Spectrograms and raise the maximum frequency and window size until the text sharpens.
Sonic Visualiser: built for exactly this. Pane -> Add Spectrogram, then drag the color and gain sliders. Its scrolling and zoom are smoother than Audacity for long files.

Tip: If sox gives you a wall of solid color and no readable text, the message may be packed into a thin slice of the spectrum. In Audacity, set the spectrogram maximum frequency just above where the bright band sits, and shrink the gain range. The letters are often hiding between, say, 15 kHz and 20 kHz where casual listening never reaches.

A spectrogram-art flag is not encoded, encrypted, or split. It is a picture. Once you can read it, you transcribe it by hand and submit. The only real difficulty is render settings, so treat a blank or smeared plot as a tuning problem, not a dead end.
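To make "render settings" concrete, here is a minimal stdlib-only sketch of what a spectrogram renderer computes: slice the samples into windows, take an FFT of each, and keep the magnitudes. The function names (fft, spectrogram, loudest_band) and the 1024-sample non-overlapping Hann window are illustrative assumptions, not what sox or Audacity do internally. Real tools are far faster and use overlapping windows.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def spectrogram(samples, window=1024):
    """Return one list of bin magnitudes (0 Hz up to Nyquist) per window."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / window) for i in range(window)]
    rows = []
    for start in range(0, len(samples) - window + 1, window):
        frame = [s * h for s, h in zip(samples[start:start + window], hann)]
        rows.append([abs(c) for c in fft(frame)[:window // 2]])
    return rows

def loudest_band(rows, rate, window=1024):
    """Frequency in Hz of the bin with the most total magnitude across all windows."""
    totals = [sum(col) for col in zip(*rows)]
    return totals.index(max(totals)) * rate / window

# synthetic demo: one second of a 16 kHz tone sampled at 44.1 kHz
rate = 44100
tone = [math.sin(2 * math.pi * 16000 * n / rate) for n in range(rate)]
peak_hz = loudest_band(spectrogram(tone), rate)
print(round(peak_hz))
```

The window size is the knob that matters: a larger window gives finer frequency resolution (rate / window Hz per bin) at the cost of coarser time resolution, which is the trade-off you are adjusting when you change the window size in Audacity.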

If you can render the spectrum and you still see nothing, you have not ruled out a spectrogram. You have only ruled out your current zoom level.

SSTV: a whole image transmitted as sound

Slow-Scan Television (SSTV) is a ham-radio mode that sends a still image over an audio channel one scan line at a time. The brightness of each pixel becomes an audio frequency, and a sync tone marks the start of each line. When you play an SSTV file it sounds like a rhythmic warble that steps and scans. This is different from spectrogram art: the picture is not visible in a spectrogram. You have to demodulate it with an SSTV decoder, the same way a radio operator would.
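The brightness-to-frequency idea can be sketched in a few lines. The 1500 Hz black, 2300 Hz white, and 1200 Hz sync values below are the conventional ones for the common analog modes and are stated here as an assumption; line timing and color ordering differ per mode, which is why you should let a real decoder do the actual work.

```python
SYNC_HZ = 1200.0   # line sync pulse (conventional value, assumed here)
BLACK_HZ = 1500.0  # tone for the darkest pixel
WHITE_HZ = 2300.0  # tone for the brightest pixel

def brightness_to_freq(level):
    """Map an 8-bit brightness (0..255) to the tone a sender would emit."""
    return BLACK_HZ + (WHITE_HZ - BLACK_HZ) * level / 255

def freq_to_brightness(freq_hz):
    """Inverse mapping, clamped to 0..255 so noise cannot overflow a pixel."""
    level = round((freq_hz - BLACK_HZ) / (WHITE_HZ - BLACK_HZ) * 255)
    return max(0, min(255, level))

def is_sync(freq_hz, tolerance_hz=50.0):
    """True if a measured tone is close enough to the sync frequency."""
    return abs(freq_hz - SYNC_HZ) <= tolerance_hz

print(brightness_to_freq(0), brightness_to_freq(255))
print(freq_to_brightness(1900.0))
```

This also explains why the image never shows up as readable text in a spectrogram: the picture is spread over time as a wandering tone between roughly 1.5 and 2.3 kHz, one scan line after another.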

The standard tool is QSSTV. Older versions do not read an audio file directly, so the classic workflow plays the audio into a virtual sink while QSSTV records from that sink's monitor:

# install QSSTV and pavucontrol (the PulseAudio mixer GUI) on Debian/Ubuntu
$ sudo apt install qsstv pavucontrol
# create a virtual (null) sink; its monitor is what QSSTV will record from
$ pactl load-module module-null-sink sink_name=virtual
# optional: also hear the audio on your speakers while it decodes
$ pactl load-module module-loopback source=virtual.monitor
# 1) launch qsstv, open Options -> Configuration -> Sound,
#    and select PulseAudio as the audio interface
# 2) in qsstv, press Receive and turn on Auto Slant + Auto Save
# 3) in pavucontrol's Recording tab, point qsstv at 'Monitor of Null Output'
# 4) play the file into the virtual sink:
$ paplay -d virtual flag.wav

QSSTV auto-detects the SSTV mode (Martin, Scottie, Robot, and friends), scans the incoming audio, and paints the received image into its window. When the scan completes you have a picture, and the flag is usually printed on it in plain text.

Note: QSSTV reads live audio, so the loopback step is what trips people up. The goal is simply that the audio QSSTV "hears" is your file, not a microphone. Any virtual cable works: PulseAudio null-sink on Linux, or a tool like VB-Cable on other platforms. Once QSSTV is listening on the monitor of that sink, just play the file.

This is exactly the puzzle behind the picoCTF m00nwalk challenge: a .wav that decodes through QSSTV into an image with the flag on it. The challenge name is a nod to the slow-scan television used to relay pictures from the Moon during Apollo. If a file sounds like a fax machine arguing with a modem, reach for an SSTV decoder before anything else.

DTMF: the flag is dialed on a phone keypad

Dual-Tone Multi-Frequency (DTMF) is the sound a touch-tone phone makes. Each key plays two sine tones at once, one from a low group and one from a high group, and the pair uniquely identifies the digit. If your file sounds like someone dialing a phone number, the flag is most likely a string of keypad presses: digits, plus * and #, and on a full keypad the letters A through D.

The frequency pairs are fixed, so a spectrogram shows two clean horizontal lines per tone and you can read it by hand if you must. The fast path is a decoder:

# multimon-ng decodes DTMF (and POCSAG, Morse, and more)
# it wants 16-bit signed, mono, 22050 Hz raw samples
$ sox flag.wav -t raw -r 22050 -e signed -b 16 -c 1 - | \
    multimon-ng -t raw -a DTMF -
# output looks like a stream of detected keys:
# DTMF: 7
# DTMF: 3
# DTMF: *

If you do not have multimon-ng, online DTMF decoders accept a .wav upload, and a short Python script using a Goertzel filter or an FFT per tone window will do it too. The key mapping is fixed, so the table is worth keeping nearby:

         1209 Hz   1336 Hz   1477 Hz   1633 Hz
697 Hz      1         2         3         A
770 Hz      4         5         6         B
852 Hz      7         8         9         C
941 Hz      *         0         #         D
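The Goertzel approach mentioned above can be sketched with the standard library alone. This is a minimal illustration built on the table: it measures the energy at each of the eight DTMF frequencies in one window and picks the strongest row and column. The function names and the synthetic test tones are hypothetical, and the sketch assumes you have already cut the audio into one window per keypress; it does no silence detection.

```python
import math

LOW = [697, 770, 852, 941]          # row frequencies in Hz
HIGH = [1209, 1336, 1477, 1633]     # column frequencies in Hz
KEYS = ["123A", "456B", "789C", "*0#D"]

def goertzel_power(samples, rate, freq):
    """Energy of `samples` at a single target frequency (Goertzel algorithm)."""
    coeff = 2 * math.cos(2 * math.pi * freq / rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def decode_window(samples, rate):
    """Return the key whose low/high tone pair is strongest in this window."""
    row = max(range(4), key=lambda i: goertzel_power(samples, rate, LOW[i]))
    col = max(range(4), key=lambda i: goertzel_power(samples, rate, HIGH[i]))
    return KEYS[row][col]

def make_tone(key, rate=8000, seconds=0.1):
    """Synthesize one keypress; used here only to exercise the decoder."""
    row = next(i for i, r in enumerate(KEYS) if key in r)
    col = KEYS[row].index(key)
    return [math.sin(2 * math.pi * LOW[row] * n / rate) +
            math.sin(2 * math.pi * HIGH[col] * n / rate)
            for n in range(int(rate * seconds))]

decoded = "".join(decode_window(make_tone(k), 8000) for k in "73*0#D")
print(decoded)
```

On a real file you would first split the samples on the silent gaps between presses, then call decode_window on each burst. Goertzel is a good fit here because only eight frequencies matter, so a full FFT is unnecessary.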

Warning: DTMF rarely encodes the literal picoCTF{...} wrapper, because the keypad alphabet is tiny. More often the digits are an intermediate step: a phone number that is a hint, a sequence that indexes into something, or ASCII codes you map back to letters. Decode the tones first, then ask what the digits mean.
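The "ASCII codes" case is quick to test once the keys are decoded. The two layouts below (fixed-width decimal codes, and codes separated by * or #) are hypothetical examples of how an author might pack letters into keypad digits; the digit strings are made up for illustration.

```python
import re

def from_fixed_width(digits, width=3):
    """Treat every `width` digits as one decimal ASCII code."""
    return "".join(chr(int(digits[i:i + width]))
                   for i in range(0, len(digits) - width + 1, width))

def from_separated(keys):
    """Treat * or # as separators between decimal ASCII codes."""
    return "".join(chr(int(tok)) for tok in re.split(r"[*#]", keys) if tok)

print(from_fixed_width("112105099111"))
print(from_separated("112*105*99*111"))
```

If neither layout produces readable text, try the digits as hex pairs, as a multi-tap phone keypad message, or as indices into text supplied elsewhere in the challenge.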

LSB steganography: the data is in the quietest bit of every sample

When a file sounds completely normal and the spectrogram is clean, suspect least-significant-bit (LSB) encoding. A 16-bit WAV stores each sample as a number from -32768 to 32767. Flipping the lowest bit changes the amplitude by one part in tens of thousands, which is inaudible. Hide one bit of your message in the low bit of each sample and a few seconds of audio carries kilobytes of payload with no perceptible change.

Python's standard library reads WAV without any dependency. Here is a complete LSB extractor for a mono 16-bit file:

import wave
w = wave.open('flag.wav', 'rb')
frames = w.readframes(w.getnframes())
w.close()
# 16-bit little-endian: take the low byte of each 2-byte sample,
# pull its least significant bit
bits = []
for i in range(0, len(frames), 2):
    sample_low_byte = frames[i]          # little-endian low byte
    bits.append(sample_low_byte & 1)
# regroup the bit stream into bytes (MSB first is the common convention)
out = bytearray()
for i in range(0, len(bits) - 7, 8):
    byte = 0
    for b in bits[i:i + 8]:
        byte = (byte << 1) | b
    out.append(byte)
# print the start; the flag is usually right at the front
print(bytes(out[:200]))

That snippet is the right first attempt, but LSB has parameters and the author picked one specific combination. When the output is garbage, you may be right about the technique and wrong about a single knob. The knobs are:

Bit order: the byte may be assembled least-significant-bit first. Reverse the inner loop and try again.
Channel: a stereo file interleaves left and right samples. The payload may live only in the left channel, so take every other sample.
Sample width: an 8-bit file is one byte per sample, a 24-bit file is three. Read w.getsampwidth() and index accordingly.
Depth: sometimes the low two bits carry data, not just one.
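The four knobs above can be folded into one function so each guess is a single call. This is a sketch under stated assumptions: little-endian PCM, a payload that uses at most the low 8 bits of each sample, and hypothetical parameter names (channel, depth, msb_first) chosen for this example.

```python
def extract_lsb(frames, sampwidth=2, nchannels=1, channel=0, depth=1, msb_first=True):
    """Pull `depth` low bits from each sample of one channel and pack them into bytes.

    frames: raw bytes as returned by wave.readframes(); little-endian PCM assumed.
    depth:  number of low bits taken per sample (1..8 in this sketch).
    """
    step = sampwidth * nchannels                  # bytes per frame, all channels
    bits = []
    for pos in range(channel * sampwidth, len(frames) - sampwidth + 1, step):
        low = frames[pos]                         # low byte of a little-endian sample
        for shift in range(depth - 1, -1, -1):
            bits.append((low >> shift) & 1)
    out = bytearray()
    for i in range(0, len(bits) - 7, 8):
        chunk = bits[i:i + 8] if msb_first else bits[i:i + 8][::-1]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

# synthetic check: hide b"CTF" in the low bit of 24 mono 16-bit samples
msg_bits = [(byte >> s) & 1 for byte in b"CTF" for s in range(7, -1, -1)]
frames = b"".join(bytes([0x40 | bit, 0x01]) for bit in msg_bits)
print(extract_lsb(frames))
```

On a real file, the arguments would come from the wave module: w.getsampwidth(), w.getnchannels(), and w.readframes(w.getnframes()). Looping over channel, depth, and msb_first and grepping each result for the flag prefix covers the common layouts in a few seconds.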

Tip: Do not hand-roll every combination of those four knobs if you do not have to. zsteg does not handle audio, but Stegolsb (the stegolsb wavsteg command) extracts WAV LSB payloads, with flags for the number of LSBs per sample (-n) and the number of bytes to recover (-b). Recovery needs the -r/--recover flag: stegolsb wavsteg -r -i flag.wav -o out.bin -n 1 -b 1000. Try the tool, and fall back to the script above when the layout is non-standard.

picoCTF's Surfing the Waves lives in the wider family of WAV-sample problems, although its payload sits in the sample amplitudes rather than in the low bit. When nothing is audible and nothing is visible, the data is in the samples, and a few lines of the wave module are usually the whole solution.

Morse code: a single tone that blinks long and short

If the file is one steady pitch that switches on and off in a pattern of short and long bursts, that is Morse code. A short beep is a dot, a long beep is a dash, gaps separate letters and words. It is the lowest-tech audio channel there is, and it shows up constantly because it is trivial to generate and unmistakable once you recognize it.

You can read it three ways. By ear, if you know Morse. By eye, from a spectrogram or even a waveform where the bursts are obvious. Or automatically:

# multimon-ng has a Morse (CW) decoder
$ sox flag.wav -t raw -r 22050 -e signed -b 16 -c 1 - | \
    multimon-ng -t raw -a MORSE_CW -
# if the timing is off, normalize and trim silence first
# (this shortens every pause longer than 0.1 s, so word gaps can merge into letter gaps)
$ sox flag.wav clean.wav norm -3 silence -l 1 0.1 1% -1 0.1 1%

When automated decoders stumble (and they do, because they guess the dot length from the audio), fall back to reading the waveform in Audacity. Measure the length of a short burst, call that a dot, and anything roughly three times longer is a dash. Transcribe to dots and dashes, then map with a Morse table. The mapping is short enough to keep nearby:

A .-     B -...   C -.-.   D -..    E .      F ..-.
G --.    H ....   I ..     J .---   K -.-    L .-..
M --     N -.     O ---    P .--.   Q --.-   R .-.
S ...    T -      U ..-    V ...-   W .--    X -..-
Y -.--   Z --..   0 -----  1 .----  2 ..---  3 ...--
4 ....-  5 .....  6 -....  7 --...  8 ---..  9 ----.

Note: Morse has no lowercase and no curly braces, so a Morse flag is almost always the inside of the wrapper. You decode something like FL4G1SH3R3 and wrap it yourself as picoCTF{...}. Read the challenge prompt for the exact format it expects.
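The manual method (measure a dot, call anything about three times longer a dash) is easy to automate once you have the burst lengths. The sketch below assumes you have already measured the audio as a list of (is_tone, seconds) runs, for example by reading them off the waveform in Audacity. The function name and the thresholds of 2 and 5 dot lengths are assumptions based on standard Morse timing (dash and letter gap = 3 dots, word gap = 7 dots).

```python
CODES = (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- .--. "
         "--.- .-. ... - ..- ...- .-- -..- -.-- --.. "
         "----- .---- ..--- ...-- ....- ..... -.... --... ---.. ----.")
MORSE = dict(zip(CODES.split(), "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"))

def bursts_to_text(events, dot):
    """Decode (is_tone, seconds) runs into text, given the dot length in seconds."""
    out, symbol = [], ""
    for is_tone, seconds in events:
        units = seconds / dot
        if is_tone:
            symbol += "-" if units >= 2 else "."
        elif units >= 2:                          # letter gap or word gap
            if symbol:
                out.append(MORSE.get(symbol, "?"))
                symbol = ""
            if units >= 5 and out:                # long pause: word boundary
                out.append(" ")
    if symbol:                                    # flush the final letter
        out.append(MORSE.get(symbol, "?"))
    return "".join(out)

# synthetic demo with a 0.1 s dot
dot = 0.1
S = [(True, dot), (False, dot)] * 2 + [(True, dot)]
O = [(True, 3 * dot), (False, dot)] * 2 + [(True, 3 * dot)]
gap = [(False, 3 * dot)]
print(bursts_to_text(S + gap + O + gap + S, dot))
```

Unknown symbols come back as ?, which is a useful signal that the dot length you picked is off; adjust it and rerun rather than guessing letters.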

Reversed and speed-shifted audio: it is speech, just mangled

Some challenges hide the flag in plain speech and then distort the playback so a casual listen reveals nothing. The two classic distortions are reversal (the audio plays backward) and speed or pitch shifting (the audio is sped up into a chipmunk squeak or slowed into a growl). Both are reversible with one sox effect.

# play it backward
$ sox flag.wav reversed.wav reverse
# slow it down to half speed without changing pitch
$ sox flag.wav slow.wav tempo 0.5
# change speed AND pitch together (undo a chipmunk effect)
$ sox flag.wav fixed.wav speed 0.5
# shift pitch only, in cents (1200 cents = one octave; negative values shift down)
$ sox flag.wav down.wav pitch -1200

The distinction between tempo, speed, and pitch matters. tempo changes duration but preserves pitch, speed changes both the way a turntable does, and pitch changes pitch without touching duration. If the speech sounds high and fast, the author probably used speed to compress it, so undo it with speed at the reciprocal factor: audio sped up 2x comes back with speed 0.5.
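When sox is not available, reversal and a turntable-style speed change are simple enough to do with the wave module. This is a sketch, not a reimplementation of sox: it changes speed by relabeling the sample rate in the header rather than resampling, and the function names are hypothetical. The second helper shows the arithmetic linking a speed factor to the pitch change in cents.

```python
import io
import math
import wave

def rewrite_wav(wav_bytes, reverse=False, factor=1.0):
    """Return new WAV bytes, optionally reversed and/or played `factor` times faster.

    Speed is changed by relabeling the sample rate, so pitch moves with it,
    as with a turntable. No resampling is done.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    if reverse:
        size = params.sampwidth * params.nchannels      # bytes per frame
        frames = b"".join(frames[i:i + size]
                          for i in range(len(frames) - size, -1, -size))
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)
        dst.setframerate(max(1, round(params.framerate * factor)))
        dst.writeframes(frames)
    return out.getvalue()

def speed_to_cents(factor):
    """Pitch change in cents caused by a turntable-style speed factor."""
    return 1200 * math.log2(factor)

print(speed_to_cents(0.5), speed_to_cents(2.0))
```

Reversal has to flip whole frames, not bytes: reversing the raw byte string would swap the byte order inside each sample and the channel order inside each frame, which produces noise instead of backward speech.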

Watch for layered tricks. A challenge can hide one message in the left channel and a different one in the right, or reverse only part of the file. Split and inspect channels independently:

# split a stereo file into two mono files
$ sox flag.wav left.wav remix 1
$ sox flag.wav right.wav remix 2
# subtract one channel from the other to expose a hidden difference signal
$ sox -m -v 1 left.wav -v -1 right.wav diff.wav
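# The same channel comparison can be sketched in Python (shown as comments plus
# code so it fits this command listing). Assumptions: 16-bit little-endian
# stereo PCM, and hypothetical helper names split_channels / channel_difference.

```python
import struct

def split_channels(frames):
    """Deinterleave 16-bit little-endian stereo PCM into (left, right) lists."""
    count = len(frames) // 2
    samples = struct.unpack("<%dh" % count, frames[:count * 2])
    return list(samples[0::2]), list(samples[1::2])

def channel_difference(frames):
    """Left minus right per frame; all zeros means the channels are identical."""
    left, right = split_channels(frames)
    return [l - r for l, r in zip(left, right)]

# synthetic demo: the channels match except in the last frame
frames = struct.pack("<6h", 100, 100, -50, -50, 7, 9)
diff = channel_difference(frames)
print(diff)
```

A difference signal that is zero almost everywhere with small nonzero values in places is a strong hint that one channel carries a payload the other does not.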

Warning: Reversed audio in a spectrogram looks like the mirror image of normal speech, but it is easy to mistake for a stego payload. If a spectrogram looks "almost like speech but wrong," try reverse before you go hunting for LSB or SSTV. The simplest explanation usually wins in forensics.

The picoCTF waves over lambda challenge is a reminder that "waves" in a title does not always mean audio processing, so confirm the file type first. But when the file really is audio and it sounds like garbled talking, reversal and speed are the two cheapest fixes to try.

picoCTF audio challenges, mapped to the buckets

picoCTF uses audio forensics sparingly, but the challenges it does ship cover the main techniques cleanly. Use them to drill each bucket on a known-good target:

picoCTF 2019 m00nwalk is the canonical SSTV challenge. Loop the .wav into QSSTV and read the flag off the decoded image.
picoCTF 2021 Surfing the Waves is a WAV-sample problem where the answer lives in the data, not the sound: read the raw sample values and pull the payload out of them.
picoCTF 2021 Wave a flag and picoCTF 2019 waves over lambda are the reminders to check the category and run file on the artifact first. The former is a General Skills warm-up about a program's help flag and the latter is a cryptography challenge, so the title does not always describe the medium.

For each one, force yourself to name the bucket before you touch a tool. The discipline of "render the spectrogram, then decide" is what separates a five-minute solve from an hour of flailing.

Quick reference

Decision list for any audio file

file flag.wav and strings -n 8 flag.wav. Rule out a plaintext flag or a metadata hint before anything clever.
sox flag.wav -n spectrogram -o s.png. Open it. Text in the high band means spectrogram art, transcribe and done.
Sounds like a scanning fax machine? SSTV. Loop it into QSSTV and read the image.
Sounds like phone dialing? DTMF. Pipe through multimon-ng -a DTMF and map the digits.
One tone blinking short and long? Morse. multimon-ng -a MORSE_CW or read the waveform.
Garbled speech? Try sox ... reverse, then speed 0.5, then split the channels.
Sounds perfectly normal and the spectrogram is clean? LSB. Pull the low bit of every sample with the wave module, then vary bit order, channel, and depth.

Command cheat sheet

# render a spectrogram
sox in.wav -n spectrogram -x 1600 -y 800 -o s.png
ffmpeg -i in.wav -lavfi showspectrumpic=s=1600x800 s.png
# decode DTMF or Morse
sox in.wav -t raw -r 22050 -e signed -b 16 -c 1 - | multimon-ng -t raw -a DTMF -
sox in.wav -t raw -r 22050 -e signed -b 16 -c 1 - | multimon-ng -t raw -a MORSE_CW -
# undo distortions
sox in.wav out.wav reverse
sox in.wav out.wav speed 0.5
sox in.wav out.wav tempo 0.5
# split stereo channels
sox in.wav left.wav remix 1
sox in.wav right.wav remix 2

The whole skill fits in one habit: render before you reason. An audio file is a picture you cannot see and a message you cannot hear until you transform it, so transform it first and let the spectrogram tell you which of the six doors to walk through.