ARM Assembly for CTF: The ARMssembly Series

You already know enough to read AArch64

You open the picoCTF ARMssembly challenge, expecting the mov rax, ... you read fluently, and instead the file is full of w0, str, ldr, and b.eq. It looks alien. It is not. AArch64 (the 64-bit ARM instruction set, also called ARM64) is, if anything, easier to read than x86 once you accept three rules.

Here is the whole language in one breath. Registers are named x0 through x30 for 64-bit access and w0 through w30 for the low 32 bits of the same register. Arithmetic only ever touches registers, never memory, so add w0, w1, w2 means w0 = w1 + w2. The only instructions that touch memory are ldr (load register from memory) and str (store register to memory). Function arguments arrive in x0 through x7, and the return value goes back in x0. That is 90 percent of every ARMssembly challenge.

x86 lets any instruction reach into memory. ARM does not. Compute in registers, move to and from memory only with ldr and str. Once that clicks, the disassembly reads like pseudocode.

This is the sibling of the x86 Assembly for CTF post. If you have read that one, you already own the mental model (registers, stack, control flow); this post is just the translation table for ARM. The five picoCTF ARMssembly challenges, from ARMssembly 0 up, ask you to hand-trace a short function and report the value it returns. No exploit, no shell. Just reading. By the end of this page you will trace one with me.

Note: ARMssembly source files are AArch64 emitted by GCC. The same skills carry to real firmware reversing, Android native libraries, Apple Silicon binaries, and embedded CTF targets, which are increasingly ARM. Learning to read it once pays off well past picoCTF.

How is ARM different from x86, structurally?

Three architectural choices explain almost every surface difference you will notice. Internalize these and the rest is vocabulary.

Load/store architecture

Only ldr and str talk to memory. Every other instruction operates strictly on registers. In x86 you can write add [rbx], rax; in ARM you must load, add, then store back.

Fixed-width instructions

Every AArch64 instruction is exactly 4 bytes. No variable-length decoding, no mid-instruction jumps. Addresses advance by 4 every time, which makes manual tracing far less error-prone than x86.

Three-operand form

ARM writes add dst, src1, src2: destination first, then two sources. Because the destination is its own operand, neither source is overwritten unless you name it as the destination, unlike x86's add rax, rbx (which means rax += rbx and clobbers rax).

There is one more thing worth flagging up front: AArch64 has no push or pop. The compiler manages the stack by hand, decrementing sp and using str/ldr with an offset. You will see stp and ldp (store-pair and load-pair) constantly: they move two registers at once, which is how the prologue saves the frame pointer and return address together.

Key insight: x86 is a CISC design with hundreds of instructions and many addressing modes. AArch64 is RISC: a small, regular instruction set where the same handful of operations cover everything. The disassembly is more verbose (more instructions for the same task) but each instruction does exactly one obvious thing.

What are the registers, and what is the w vs x thing?

AArch64 gives you 31 general-purpose registers plus a few special ones. The naming is the only thing that trips people up.

x0  - x30   64-bit general-purpose registers (31 of them)
w0  - w30   the low 32 bits of x0..x30 (same physical register)
x0  - x7    function arguments and return values
x8          indirect result / syscall number on Linux
x9  - x15   caller-saved scratch (temporaries)
x19 - x28   callee-saved (a function must preserve these)
x29 / fp    frame pointer
x30 / lr    link register: the return address
sp          stack pointer (NOT x31; xzr shares the encoding)
pc          program counter (not a general-purpose register; cannot be named as an operand)
xzr / wzr   the zero register: reads as 0, writes are discarded

When you see w0 and x0 in the same listing, they are the same register seen at two widths. Writing to w0 zeroes the upper 32 bits of x0, exactly like writing eax zeroes the top of rax in x86. ARMssembly challenges are full of w registers because the C source uses int (32-bit) locals, so the compiler works at 32-bit width.
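If the two-widths idea is still fuzzy, here is a minimal Python sketch of it. The helper names read_w and write_w are made up for illustration; they only model the rule stated above (a w read sees the low 32 bits, a w write zeroes the upper 32).

```python
MASK32 = 0xFFFFFFFF

def read_w(x_value):
    """Reading wN: you see only the low 32 bits of xN."""
    return x_value & MASK32

def write_w(old_x_value, value):
    """Writing wN: the low 32 bits are replaced and the upper 32 bits of xN
    become zero, so the old contents (old_x_value) do not survive at all."""
    return value & MASK32

x0 = 0xDEADBEEFCAFEF00D
print(hex(read_w(x0)))               # 0xcafef00d
x0 = write_w(x0, 0x12345678)
print(hex(x0))                       # 0x12345678 (upper half is gone)
```

The practical consequence for tracing: once a function writes w0, you can forget whatever the full x0 held before.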

Tip: The two registers you must always track are lr (x30), which holds where the function returns to, and sp, which the prologue moves to carve out local space. The instant a function is entered, the first stp usually saves x29 and x30 so the return address survives any nested calls.

The zero register is a small genius move. There is no dedicated "clear" instruction because mov w0, wzr already means "set w0 to zero", and cmp w0, wzr means "compare w0 against zero". When you spot wzr or xzr, mentally substitute the literal 0.

How do I read data processing instructions?

Data processing is the arithmetic and logic core. Every one of these takes a destination register followed by its sources, and never touches memory. Read add dst, a, b as dst = a + b and you are done.

mov  w0, w1            // w0 = w1            (copy)
mov  w0, #5            // w0 = 5             (# marks an immediate constant)
add  w0, w1, w2        // w0 = w1 + w2
add  w0, w1, #1        // w0 = w1 + 1
sub  w0, w1, w2        // w0 = w1 - w2
mul  w0, w1, w2        // w0 = w1 * w2
lsl  w0, w1, #2        // w0 = w1 << 2       (logical shift left = multiply by 4)
lsr  w0, w1, #1        // w0 = w1 >> 1       (logical shift right)
and  w0, w1, w2        // w0 = w1 & w2       (bitwise AND)
orr  w0, w1, w2        // w0 = w1 | w2       (bitwise OR)
eor  w0, w1, w2        // w0 = w1 ^ w2       (XOR; 'exclusive or')

Two gotchas for x86 readers. First, the destination comes first, like Intel syntax but unlike AT&T syntax, so do not read sub w0, w1, w2 as w1 - w0. Second, an immediate is written with a # prefix, so #0x1b is the constant 27, not a memory reference.

The shift instructions are where challenge authors hide arithmetic. lsl w0, w0, #3 is a multiply by 8, and you will see compilers emit a shift instead of a mul whenever the multiplier is a power of two. If a trace produces a number that looks too large, check whether you read a shift as an add.
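To sanity-check a shift while tracing, a small Python model helps. This is a sketch, not an emulator: lsl32 and lsr32 are hypothetical helper names, and the code assumes 32-bit (w register) operands, where the result wraps at 32 bits and a register-supplied shift amount is taken modulo 32.

```python
MASK32 = 0xFFFFFFFF

def lsl32(value, shift):
    """lsl wD, wS, #shift: shift left, keep only the low 32 bits."""
    return (value << (shift % 32)) & MASK32

def lsr32(value, shift):
    """lsr wD, wS, #shift: shift right, filling with zeros from the top."""
    return (value & MASK32) >> (shift % 32)

print(lsl32(5, 3))            # 40, the same as 5 * 8
print(lsr32(48, 1))           # 24, the same as 48 // 2
print(lsl32(0x80000000, 1))   # 0: the top bit falls off the 32-bit register
```

The last line is the case that bites in longer traces: a shift can silently drop high bits, which a plain multiply on paper would not.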

Warning: eor w0, w0, w0 sets w0 to zero (anything XOR'd with itself is 0), the same trick as xor eax, eax in x86. AArch64 compilers usually prefer mov w0, #0 or mov w0, wzr, but if you meet the eor form, do not trace it literally as a meaningful XOR; it is just a clear.

What do ldr and str actually do?

These are the only two instructions that cross the boundary between registers and memory, so in a load/store architecture they carry all the traffic. The mnemonic tells you the direction: ldr loads into a register from memory, str stores from a register out to memory.

ldr  w0, [x1]          // w0 = *(int*)x1         (load from the address in x1)
ldr  w0, [x1, #8]      // w0 = *(int*)(x1 + 8)   (offset addressing)
str  w0, [x1]          // *(int*)x1 = w0         (store w0 to that address)
str  w0, [sp, #12]     // store a local variable into the stack frame
stp  x29, x30, [sp, #-16]!  // store-PAIR: push fp and lr, pre-decrement sp
ldp  x29, x30, [sp], #16    // load-PAIR: restore fp and lr, post-increment sp
adrp x0, #0x21000      // form the page address of a global (used with :lo12:)

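The bracket is the dereference. [x1, #8] is "the memory at address x1 + 8", the direct analogue of x86's [rbx+8]. The ! suffix means write the computed address back into the base register (pre-index), which is how stp x29, x30, [sp, #-16]! both reserves 16 bytes and saves the pair in one instruction. An offset written after the closing bracket, as in ldp x29, x30, [sp], #16, is post-index: access memory at the old base, then add the offset to it. The stp line does the job of x86's push rbp, plus saving the return address that x86's call would already have pushed.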
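Pre-index and post-index differ only in when the base register is updated. The sketch below models just that address arithmetic; pre_index and post_index are illustrative names, and the starting sp value is an arbitrary assumption.

```python
def pre_index(base, offset):
    """[base, #offset]! : update the base first, then access the new address.
    Returns (address accessed, base register afterwards)."""
    new_base = base + offset
    return new_base, new_base

def post_index(base, offset):
    """[base], #offset : access the old address, then update the base.
    Returns (address accessed, base register afterwards)."""
    return base, base + offset

sp = 0x1000                                   # assumed starting stack pointer
addr, sp = pre_index(sp, -16)                 # stp x29, x30, [sp, #-16]!
print(hex(addr), hex(sp))                     # 0xff0 0xff0
addr, sp = post_index(sp, 16)                 # ldp x29, x30, [sp], #16
print(hex(addr), hex(sp))                     # 0xff0 0x1000
```

Both instructions touch the same 16 bytes at 0xff0, and sp ends where it started, which is exactly what a matched prologue and epilogue should do.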

Globals are loaded in two steps because a 64-bit address does not fit in one 4-byte instruction. You will see adrp x0, somewhere (get the 4 KB page) followed by add x0, x0, #:lo12:somewhere or an ldr with a :lo12: offset (add the low 12 bits). Treat the pair as a single "put the address of this global into x0".
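The split is plain bit masking, which a few lines of Python make concrete. This sketch models only the result of the pair (the real adrp computes the page relative to pc); the address 0x21a48 and the helper names are assumptions for illustration.

```python
PAGE_MASK = 0xFFF  # low 12 bits = offset within a 4 KB page

def adrp_page(address):
    """What adrp leaves in the register: the address with its low 12 bits cleared."""
    return address & ~PAGE_MASK

def lo12(address):
    """What :lo12: contributes: the low 12 bits of the address."""
    return address & PAGE_MASK

target = 0x21A48                       # hypothetical address of a global
page = adrp_page(target)
print(hex(page), hex(lo12(target)))    # 0x21000 0xa48
print(hex(page + lo12(target)))        # 0x21a48, the full address again
```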

Note: The size of the transfer follows the register width. ldr w0, [x1] moves 4 bytes (a 32-bit int), ldr x0, [x1] moves 8. You will also see ldrb (byte) and ldrh (halfword, 2 bytes) for char and short access.
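Here is a rough Python model of how the transfer width changes what you read back. It assumes little-endian memory (the default for AArch64 Linux targets), and the names memory, store, and load are invented for the sketch.

```python
memory = bytearray(64)  # a tiny flat memory, addresses 0..63

def store(addr, value, size):
    """str-style store of `size` bytes, little-endian, truncating the value."""
    value &= (1 << (8 * size)) - 1
    memory[addr:addr + size] = value.to_bytes(size, "little")

def load(addr, size):
    """ldr-style load of `size` bytes, zero-extended into the register."""
    return int.from_bytes(memory[addr:addr + size], "little")

store(8, 0x1122334455667788, 8)   # str  x0, [x1]
print(hex(load(8, 8)))            # ldr  x0, [x1] -> 0x1122334455667788
print(hex(load(8, 4)))            # ldr  w0, [x1] -> 0x55667788
print(hex(load(8, 2)))            # ldrh w0, [x1] -> 0x7788
print(hex(load(8, 1)))            # ldrb w0, [x1] -> 0x88
```

Same address, four different answers: when a trace goes wrong around a load, check the register width before anything else.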

How does control flow work without flags everywhere?

ARM control flow is a two-step dance: an instruction that sets the condition flags (almost always cmp), followed by a conditional branch that reads them. It is the same model as x86's cmp plus jcc. ARM calls its four flags NZCV (negative, zero, carry, overflow), and only the branch mnemonics differ.

cmp  w0, w1        // set flags from (w0 - w1), discard the result
cmp  w0, #0        // compare against the constant 0
b    label         // unconditional branch (like x86 jmp)
b.eq label         // branch if equal       (w0 == w1)
b.ne label         // branch if not equal   (w0 != w1)
b.lt label         // branch if less than    (signed <)
b.gt label         // branch if greater than (signed >)
b.le label         // branch if <=
b.ge label         // branch if >=
cbz  w0, label     // Compare and Branch if Zero      (if w0 == 0)
cbnz w0, label     // Compare and Branch if Non-Zero  (if w0 != 0)
bl   func          // Branch with Link: call func, save return addr in lr (x30)
ret                // return: jump to the address in lr

The conditional branches read like English once you expand them: b.eq is branch-if-equal, b.ne is branch-if-not-equal, b.lt is branch-if-less-than. They always pair with a preceding cmp (or a subs/adds, the s suffix meaning "set flags"). When you trace one, the question is simply: did the last cmp make this condition true?

The cbz and cbnz forms fold the compare and the branch into one instruction, testing a single register against zero with no preceding cmp needed. They are extremely common in loop and null-check code, so do not go looking for a flag-setter before them; there is not one.
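If you want to check a branch decision mechanically, the following Python sketch computes the NZCV flags for a 32-bit cmp and evaluates the signed conditions listed above. It is a simplified model with made-up helper names (cmp32, branch_taken, cbz), not a full description of the architecture.

```python
MASK32 = 0xFFFFFFFF

def cmp32(a, b):
    """cmp wA, wB: compute a - b, keep only the flags."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    return {
        "N": sign_r == 1,                             # result is negative
        "Z": result == 0,                             # operands were equal
        "C": a >= b,                                  # no unsigned borrow
        "V": sign_a != sign_b and sign_r != sign_a,   # signed overflow
    }

CONDITIONS = {
    "eq": lambda f: f["Z"],
    "ne": lambda f: not f["Z"],
    "lt": lambda f: f["N"] != f["V"],
    "ge": lambda f: f["N"] == f["V"],
    "gt": lambda f: not f["Z"] and f["N"] == f["V"],
    "le": lambda f: f["Z"] or f["N"] != f["V"],
}

def branch_taken(cond, a, b):
    """Would b.<cond> be taken right after cmp a, b?"""
    return CONDITIONS[cond](cmp32(a, b))

def cbz(reg):
    """cbz wN, label: taken when the register is zero; no flags involved."""
    return (reg & MASK32) == 0

print(branch_taken("lt", 3, 4))           # True: 3 < 4
print(branch_taken("gt", 0xFFFFFFFF, 1))  # False: 0xFFFFFFFF is -1 when signed
```

The second example is the reason to track signedness: b.gt and b.lt treat the registers as signed values, so a large unsigned bit pattern can compare as negative.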

Tip: For calls, the rule is: bl func stashes the return address in lr, and ret jumps back to whatever lr holds. Because a nested bl would overwrite lr, any non-leaf function saves lr to the stack in its prologue (the stp x29, x30 you keep seeing) and restores it before ret.

What is the AArch64 calling convention?

The procedure call standard for AArch64 (AAPCS64) is the contract every compiled function obeys, and knowing it lets you read a bl without tracing into the callee. It is simpler than the x86_64 System V ABI.

Integer / pointer arguments:   x0, x1, x2, x3, x4, x5, x6, x7  (in order)
Return value:                  x0  (or w0 for a 32-bit int)
Extra args beyond 8:           passed on the stack
Caller-saved (scratch):        x0-x17  (callee may clobber freely; x18 is platform-reserved)
Callee-saved (preserved):      x19-x28, plus x29 (fp) and x30 (lr)
Frame pointer:                 x29 (fp)
Return address:                x30 (lr)

So when you see mov w0, #42 then bl power, you are reading power(42), and whatever power leaves in w0 is its return value. The ARMssembly challenges lean entirely on this: main loads constants into w0 and friends, calls a helper, and prints whatever comes back in w0. The flag is that returned number, formatted.

Note: The authoritative documents are the AAPCS64 procedure call standard and the Arm Architecture Reference Manual. You will not need them for picoCTF, but they are the ground truth when a real firmware target does something unusual.

Can we hand-trace a real ARMssembly function?

Here is the shape of an ARMssembly challenge: a small function plus a main that calls it with fixed constants, and your job is the single number it returns. The listing below is representative of the style. We trace it by hand, one 4-byte instruction at a time.

func1:
    sub  sp, sp, #16        // reserve 16 bytes of stack frame
    str  w0, [sp, #12]      // save arg0 into the frame
    str  w1, [sp, #8]       // save arg1 into the frame
    ldr  w1, [sp, #12]      // w1 = arg0
    ldr  w0, [sp, #8]       // w0 = arg1
    lsl  w0, w1, w0         // w0 = arg0 << arg1
    add  sp, sp, #16        // release the frame
    ret                     // return w0
main:
    stp  x29, x30, [sp, #-32]!   // prologue: save fp/lr, reserve 32 bytes
    mov  w0, #3                  // arg0 = 3
    mov  w1, #4                  // arg1 = 4
    bl   func1                   // call func1(3, 4)
    str  w0, [sp, #28]           // save the result
    ldr  w0, [sp, #28]           // reload it as the return value
    ldp  x29, x30, [sp], #32     // epilogue: restore fp/lr
    ret

Trace main first. It sets w0 = 3 and w1 = 4, then calls func1. By the calling convention, that is func1(3, 4) with arg0 = 3 and arg1 = 4. Now step into func1: it stores both args to the frame, reloads arg0 into w1 and arg1 into w0 (note the swap), then runs lsl w0, w1, w0, which is w0 = w1 << w0 = 3 << 4.

A left shift by 4 is a multiply by 16, so 3 << 4 = 48. That value rides back to main in w0, gets saved and reloaded, and is the function's answer: 48. The ARMssembly flag format wraps this number, typically as the zero-padded 8-digit hexadecimal of the result, so 48 decimal becomes 0x00000030.
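The same trace can be written out as a line-by-line Python model, which is a handy way to double-check a paper trace. This is a sketch of this one listing only: the registers are a plain dict, the stack frame is a dict keyed by offset from sp, and the sp adjustments are left out because they do not affect the result.

```python
MASK32 = 0xFFFFFFFF

def func1(regs):
    frame = {}                           # sub sp, sp, #16 (frame keyed by offset)
    frame[12] = regs["w0"]               # str w0, [sp, #12]
    frame[8] = regs["w1"]                # str w1, [sp, #8]
    regs["w1"] = frame[12]               # ldr w1, [sp, #12]  -> w1 = arg0
    regs["w0"] = frame[8]                # ldr w0, [sp, #8]   -> w0 = arg1
    shift = regs["w0"] % 32              # register shifts use the amount mod 32
    regs["w0"] = (regs["w1"] << shift) & MASK32   # lsl w0, w1, w0
    return regs["w0"]                    # ret (result is in w0)

def main():
    regs = {"w0": 3, "w1": 4}            # mov w0, #3 ; mov w1, #4
    return func1(regs)                   # bl func1

print(main())                            # 48
```

Writing the swap as two explicit assignments makes it hard to get wrong: after the reloads, w1 holds 3 and w0 holds 4, so the shift is 3 << 4.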

Warning: The single most common mistake is misreading the operand order on the shift. The destination is first, so lsl w0, w1, w0 shifts w1 by w0, not the other way around. When two registers get swapped through the stack like this, write down what each holds at every step rather than trusting your memory of which was which.

That is the entire skill the early ARMssembly challenges test. The later ones add a loop (watch for a cbnz or b.ne branching back up) or a comparison that picks one of two return paths (a cmp then b.gt), but the method does not change: track every register through every 4-byte step until ret.

How do I run and debug an ARM binary on an x86 box?

Some ARMssembly challenges hand you a real ELF rather than a source listing, and you want to run it on your x86_64 laptop. QEMU's user-mode emulation runs a single ARM binary by translating instructions on the fly, no full virtual machine required.

# Debian/Ubuntu: install the emulator and the cross toolchain
$ sudo apt install qemu-user qemu-user-static gdb-multiarch
$ sudo apt install gcc-aarch64-linux-gnu libc6-arm64-cross
# Identify the target first
$ file ./chall
chall: ELF 64-bit LSB pie executable, ARM aarch64, dynamically linked
# Run a static AArch64 binary directly
$ qemu-aarch64-static ./chall
# Run a dynamically linked one: point QEMU at the ARM libraries
$ qemu-aarch64-static -L /usr/aarch64-linux-gnu ./chall

To debug, have QEMU open a gdb stub with -g and a port, then attach with gdb-multiarch (plain gdb only knows the host architecture). The workflow is identical to the one in the GDB CTF Guide, just cross-architecture.

# Terminal 1: run under QEMU, waiting for a debugger on port 1234
$ qemu-aarch64-static -g 1234 ./chall
# Terminal 2: attach gdb-multiarch to the stub
$ gdb-multiarch ./chall
(gdb) set architecture aarch64
(gdb) target remote :1234
(gdb) break main
(gdb) continue
(gdb) info registers x0 x1 w0 sp lr   # inspect the ARM register file
(gdb) stepi                            # single-step one 4-byte instruction
(gdb) x/8i $pc                         # disassemble the next 8 instructions

If you only have the source listing, you can assemble it with the cross toolchain and run the result under QEMU, or honestly, just trace it on paper. For static understanding, Ghidra decompiles AArch64 to C and is the fastest way to confirm a hand-trace on a larger function.

Tip: When stepping in gdb-multiarch, stepi always executes exactly one 4-byte instruction, so pc moves forward by 4 unless that instruction is a taken branch. Watch x0 across a bl to read a function's return value without ever stepping into the callee.

The five ARMssembly challenges, in order

The picoCTF 2021 ARMssembly series is a graded ramp. Each one adds exactly one new construct, so do them in order and the difficulty curve stays gentle.

ARMssembly 0 is pure orientation: read a couple of mov and arithmetic instructions and report the result. This is where the w0/x0 naming stops feeling foreign.
ARMssembly 1 adds the calling convention and a helper function: trace main into a bl and back, exactly like the walkthrough above.
ARMssembly 2 introduces a loop. Find the cmp plus conditional branch that jumps back up and count the iterations rather than unrolling forever.
ARMssembly 3 mixes shifts, masks, and multiple helpers. This is where reading lsl/and/orr correctly stops being optional.
ARMssembly 4 is the capstone: a longer function with branching control flow where you must hold several registers in your head at once. Use gdb-multiarch to check your hand-trace if the value comes out wrong.

A practical tip for all five: the expected answer is a fixed-width hexadecimal of the computed value, so do your arithmetic in decimal, then convert at the very end. An off-by-one in the hex conversion has sunk more correct traces than any misread instruction.
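To take the conversion step out of your head entirely, let Python do it. The helper below is a small sketch (to_hex32 is a made-up name) that produces the 8-digit, zero-padded, lowercase hexadecimal of a 32-bit value; check the challenge text for the exact wrapper it expects around the digits.

```python
def to_hex32(value):
    """Format a traced result as 8 zero-padded lowercase hex digits.
    Masking to 32 bits makes negative results come out in two's complement."""
    return format(value & 0xFFFFFFFF, "08x")

print(to_hex32(48))    # 00000030
print(to_hex32(27))    # 0000001b
print(to_hex32(-1))    # ffffffff
```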

Quick reference

x0-x7    args + return (x0 = return value)     w0-w7  = low 32 bits
x8       indirect result / Linux syscall number
x9-x15   caller-saved scratch
x19-x28  callee-saved (must be preserved)
x29 fp   frame pointer       x30 lr   return address
sp       stack pointer        pc       program counter
wzr/xzr  zero register: reads 0, writes discarded

Instruction cheat sheet

mov  d, s        d = s                  add d,a,b   d = a + b
sub  d, a, b     d = a - b              mul d,a,b   d = a * b
lsl  d, s, #n    d = s << n             lsr d,s,#n  d = s >> n
and  d, a, b     d = a & b              orr d,a,b   d = a | b
eor  d, a, b     d = a ^ b              (eor d,d,d  -> clear)
ldr  d, [b, #o]  d = *(b + o)  (load)
str  s, [b, #o]  *(b + o) = s  (store)
stp/ldp          store/load a register PAIR (push/pop the frame)
cmp  a, b        set flags from a - b
b.eq b.ne b.lt b.gt b.le b.ge   conditional branch on those flags
cbz/cbnz r,lbl   branch if r is zero / non-zero
bl   func        call (return addr -> lr)    ret   return via lr

Run and debug

file ./chall                              # confirm aarch64
qemu-aarch64-static ./chall               # run static
qemu-aarch64-static -L /usr/aarch64-linux-gnu ./chall   # dynamic
qemu-aarch64-static -g 1234 ./chall       # open a gdb stub
gdb-multiarch ./chall                     # then: target remote :1234

Read the difference once and it disappears. ARM is x86 with the memory accesses pulled out into ldr and str, every instruction the same width, and arguments lined up in x0 through x7. Trace it on paper, one four-byte step at a time, and an ARMssembly "flag" is just the number that lands in w0.