Web Recon for CTF: robots.txt, Page Source, DevTools, and Hidden Endpoints

The flag is usually hidden in plain sight

You opened your first web challenge. The page loads, there is a title, maybe a login box, maybe nothing at all. There is no obvious place to type a flag, no big red SOLVE ME button. So where is it? In the overwhelming majority of beginner web challenges, the flag (or the next clue to it) is already on the page or one short request away. You just have to look where the browser does not show you by default.

Here is the entire entry-level recon loop, in order. Run it on every web challenge before you do anything clever:

1. View source (Ctrl+U). Read every HTML comment and <script> tag.
2. Open DevTools (F12). Check the Network tab and the Sources tab.
3. Fetch /robots.txt and /sitemap.xml.
4. Probe the obvious files: /.git/, /backup.zip, /index.php.bak, /admin.
5. Look at cookies and localStorage in the Application tab.
6. If nothing yet, brute-force directories with ffuf or gobuster.

That is the whole post in six lines. Everything below explains why each step works, what you are actually looking at, and the exact commands to run. If you are brand new to CTFs, start with the picoCTF Beginners Guide for the lay of the land, then come back here for the web-specific moves.
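Steps 3 and 4 of the loop are easy to script. The function below is a minimal sketch rather than a finished tool: the name first_pass and the exact path list are assumptions chosen for illustration, drawn from the steps above.

```shell
# first_pass BASE_URL
# Print "STATUS PATH" for each predictable path from steps 3 and 4.
# The function name and the path list are illustrative assumptions.
first_pass() {
  for path in /robots.txt /sitemap.xml /.git/HEAD /backup.zip /index.php.bak /admin; do
    # -o /dev/null discards the body; -w prints only the HTTP status code
    code="$(curl -s -o /dev/null -w '%{http_code}' "$1$path")"
    printf '%s %s\n' "$code" "$path"
  done
}

# Usage (only against a challenge instance you are allowed to test):
#   first_pass http://example.com:8080
```

Anything that does not come back as 404 deserves a closer look by hand.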

Note: Recon never stops being useful. The same six steps that solve a 50-point intro challenge are the first thing experienced players do on a 500-point one. The hard challenges just hide the clue deeper, behind a second request or an obfuscated script. The reflex is identical.

What is the recon mindset, and why does it win?

A web server hands your browser a pile of files: HTML, JavaScript, CSS, images, and the occasional configuration leak. The browser renders a tidy page out of that pile and keeps the messy parts out of view. Recon is the habit of refusing to accept the rendered page as the whole truth. The flag lives in the pile, not the picture.

The rendered page is what the author wanted you to see. Recon is reading everything they forgot to hide.

Three questions drive every web recon session. Keep them taped to your monitor:

What did the server actually send? Not the rendered page, the raw bytes. Comments, hidden fields, and dead code all survive in the source.
What else is on this server that I was not linked to? Backup files, admin panels, old endpoints, and version-control directories sit at predictable paths.
What is the client storing or sending on my behalf? Cookies, tokens, and local storage are a recon surface the page never advertises.

Beginners lose time because they treat a web challenge like a puzzle to be reasoned out from the visible page. Strong players treat it like a search problem: enumerate everything the server exposes, then read it. The challenge picoCTF 2019 dont-use-client-side is the canonical lesson here: the password check runs entirely in JavaScript that was shipped to your browser, so reading the source hands you the answer.

How do I read the page source and find hidden comments?

The fastest recon move on the planet is Ctrl+U (or right-click and choose "View Page Source"). It shows the raw HTML the server sent, before JavaScript rewrote anything. Authors of intro challenges love to tuck the flag, a hint, or a hidden path into an HTML comment that never renders on screen:

<!-- TODO: remove before launch -->
<!-- flag is at /super_secret_admin_page.html -->
<!-- picoCTF{c0mm3nts_4r3_n0t_s3cr3t} -->
<input type="hidden" name="debug" value="0">

Read the whole thing. Comments, hidden <input> fields, base64 blobs in data- attributes, and linked .js files are all fair game. If you prefer the terminal, curl fetches the raw source so you can grep it:

# Dump the raw HTML the server sends
curl -s http://example.com:8080/
# Pull out just the comments and hidden fields
curl -s http://example.com:8080/ | grep -E '<!--|hidden|picoCTF'
# List every script the page loads so you can read each one
curl -s http://example.com:8080/ | grep -oE 'src="[^"]+\.js"'

Tip: View-source shows the original HTML. If the page builds itself with JavaScript after loading, the flag may not be in view-source at all. For that, use the DevTools Elements tab (which shows the live, post-script DOM) and the Sources tab (which shows the scripts themselves). The two views answer different questions.
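A line-based grep misses comments that span several lines. The two helpers below are a small sketch of one way around that, assuming GNU grep; the names extract_comments and hidden_inputs are made up for this example and are not part of any standard tool.

```shell
# Both helpers read HTML on stdin. The names are illustrative.

# Print every HTML comment, including ones that span several lines.
extract_comments() {
  # Flatten newlines first so a multi-line comment becomes a single match
  tr '\n' ' ' | grep -oE '<!--([^-]|-[^-])*-->'
}

# Print every <input> tag whose type is "hidden".
hidden_inputs() {
  grep -oiE '<input[^>]*type="hidden"[^>]*>'
}

# Usage:
#   curl -s http://example.com:8080/ | extract_comments
#   curl -s http://example.com:8080/ | hidden_inputs
```

These only see the HTML the server sent, so content built later by JavaScript still needs the Elements tab.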

Two picoCTF challenges drill this exact skill. picoCTF 2022 Inspect HTML hides the flag in the markup, and picoCTF 2022 Search Source buries it inside one of several linked source files so you have to grep across all of them. Solve both and the reflex sticks.

What do the DevTools Network and Sources tabs reveal?

Press F12 (or Ctrl+Shift+I) to open Developer Tools. It is the single most important web recon instrument in the browser, and it is free. Four tabs matter for CTF recon:

Elements shows the live DOM after JavaScript has run. Use it when the page builds content dynamically and view-source comes up empty.
Network records every request the page made: the HTML, every script, every image, and crucially every background API call (often labeled XHR or Fetch). Reload with the tab open and read the list top to bottom.
Sources lists every JavaScript file the page loaded, fully readable and pretty-printable. This is where client-side logic, hardcoded endpoints, and the occasional plaintext credential live.
Application shows cookies, localStorage, and sessionStorage. Covered in its own section below.

The Network tab is the one beginners under-use. Many challenges fetch data from a /api/ endpoint that is never linked anywhere on the visible page. Reload the page with Network open, click each request, and read its Response. You will often see the endpoint that holds the next clue:

GET /                      200   document   (the page)
GET /style.css             200   stylesheet
GET /app.js                200   script     <- read this in Sources
GET /api/v1/user?id=1      200   xhr        <- try id=2, id=0, id=admin
GET /api/v1/flag           403   xhr        <- interesting, why forbidden?

In the Sources tab, click the { } pretty-print button to un-minify a script before reading it. Search across all loaded scripts with Ctrl+Shift+F for strings like flag, password, admin, secret, and /api. For a deeper tour of the tooling, the official MDN guide to browser developer tools is the authoritative reference.
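The same keyword search works from the terminal once you have downloaded a script with curl. This is a rough sketch assuming GNU grep; the helper name find_interesting and its keyword list are illustrative and worth extending per challenge.

```shell
# find_interesting: read JavaScript on stdin and print each hit with up to
# 30 characters of context on either side, which keeps minified one-line
# bundles readable. The name and the keyword list are illustrative.
find_interesting() {
  grep -oEi '.{0,30}(flag|password|admin|secret|/api).{0,30}'
}

# Usage:
#   curl -s http://example.com:8080/app.js | find_interesting
```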

Key insight: Anything your browser can see, you control. Client-side password checks, hidden form fields, and disabled buttons are not security boundaries, they are suggestions. If the logic runs in JavaScript on your machine, you can read it, edit it, or skip it entirely from the DevTools console.

What can robots.txt and sitemap.xml leak?

robots.txt is a plaintext file at the web root that tells search-engine crawlers which paths to skip. The irony is delicious: to tell a crawler to stay out of /admin, the site has to publicly name /admin. For a CTF, the Disallow list is a free map of the paths the author considered sensitive enough to hide:

$ curl -s http://example.com:8080/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /secret_flag_directory/     <- well, thanks
Disallow: /api/internal/

Every path in that Disallow list is somewhere you should immediately visit. sitemap.xml is the companion file: it lists URLs the site wants indexed, and sometimes it includes old or forgotten pages that are not linked from the homepage. Fetch both on every challenge, and check for security.txt while you are there (its standard home is /.well-known/security.txt, with /security.txt as a legacy location):

curl -s http://example.com:8080/robots.txt
curl -s http://example.com:8080/sitemap.xml
curl -s http://example.com:8080/.well-known/security.txt
curl -s http://example.com:8080/security.txt
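Visiting every disallowed path by hand gets tedious when the list is long. Here is a minimal sketch that pulls the paths out of a robots.txt; the helper name disallowed_paths is an assumption for this example, and it only handles the usual "Disallow: /path" form with whitespace after the colon.

```shell
# disallowed_paths: read robots.txt on stdin, print the path from each Disallow line.
# Assumes the usual "Disallow: /path" form with whitespace after the colon.
disallowed_paths() {
  # Strip carriage returns, then keep the second field of each Disallow line
  tr -d '\r' | awk 'tolower($1) == "disallow:" && $2 != "" { print $2 }'
}

# Usage: request every disallowed path and show its status code
#   base=http://example.com:8080
#   curl -s "$base/robots.txt" | disallowed_paths | while read -r p; do
#     curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "$base$p"
#   done
```

Paths containing wildcards such as * are patterns rather than real URLs, so treat those as hints and probe them manually.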

This is exactly the lesson of picoCTF 2019 Where are the robots, whose entire solution is reading robots.txt and following the disallowed path it reveals. It is also a stop on the longer scavenger hunt in picoCTF 2021 Scavenger Hunt, where pieces of the flag are scattered across the page source, the CSS, the robots.txt, and other recon surfaces, one clue pointing to the next.

How do exposed .git directories and backup files give the game away?

Developers often deploy by copying a folder to the server, and that folder can still contain things that were never meant to ship. The two richest finds are exposed version control and stray backup files.

A .git directory left in the web root is a full history of the source code. If http://target/.git/HEAD returns content instead of a 404, you can reconstruct the entire repository, including files that were deleted in later commits but still live in history:

# Is .git exposed? If this returns 'ref: refs/heads/...' you are in business
curl -s http://example.com:8080/.git/HEAD
# Dump the whole repo from an exposed .git directory
pip install git-dumper
git-dumper http://example.com:8080/.git/ ./loot
# Then read the history for anything deleted or secret
cd loot && git log --all --oneline
# Replace <commit> with a hash from the log output
git show <commit>
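A server with a catch-all page can answer 200 for /.git/HEAD without any repository being there, so it helps to check the shape of the body as well as the status. The helper below is a small sketch of that check; the name is_git_head is illustrative, and it assumes a SHA-1 repository (40 hex characters for a detached HEAD).

```shell
# is_git_head: succeed only if stdin looks like a real .git/HEAD file,
# either a symbolic ref ("ref: refs/heads/main") or a detached 40-hex commit id.
# The helper name is illustrative.
is_git_head() {
  head -n 1 | tr -d '\r' | grep -qE '^(ref: refs/[^[:space:]]+|[0-9a-f]{40})$'
}

# Usage:
#   curl -s http://example.com:8080/.git/HEAD | is_git_head && echo '.git is exposed'
```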

Backup and editor-leftover files are the other staple. Editing index.php in place on the server leaves copies behind: Vim writes a swap file named .index.php.swp, and Emacs (and nano, when backups are enabled) writes index.php~. Configuration and archive files get forgotten in the web root too. Probe the predictable names directly:

# One request per candidate: -o /dev/null only covers a single URL, so loop
for f in index.php.bak 'index.php~' .index.php.swp backup.zip config.php.old .env; do
  curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' \
    "http://example.com:8080/$f"
done

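Warning: A 200 status code usually means the file exists and you can fetch it (unless the server answers 200 for every path, in which case compare response sizes). A 403 usually means the path exists but access is forbidden, which is itself a strong signal that something is there worth fighting for, although some servers return 403 for whole patterns such as any dotfile. A 404 means keep looking. Always read the status code, not just the page body.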
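Once you know one real file name on the server, you can generate its likely leftovers instead of typing them out. This is a sketch under stated assumptions: the helper name backup_candidates is made up, the suffix list is a starting point rather than an exhaustive one, and the argument is expected to be a bare file name with no directory part.

```shell
# backup_candidates FILE
# Print common backup and editor-leftover names for a file the site serves.
# Assumes FILE is a bare name with no directory part; the suffix list is a
# starting point, not an exhaustive one.
backup_candidates() {
  for suffix in .bak .old .orig .save '~'; do
    printf '%s%s\n' "$1" "$suffix"
  done
  printf '.%s.swp\n' "$1"   # Vim swap file
}

# Usage:
#   backup_candidates index.php | while read -r name; do
#     curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "http://example.com:8080/$name"
#   done
```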

How do I brute-force hidden directories with ffuf and gobuster?

When manual probing runs dry, automate it. Directory brute-forcing throws a wordlist of common file and folder names at the server and reports which ones exist. Two tools dominate: ffuf and gobuster. They do the same job; pick whichever you have installed.

You need a wordlist. The community standard is SecLists; a great starting list is directory-list-2.3-medium.txt or, for a quick first pass, the smaller common.txt. Here is gobuster:

# gobuster: dir mode, one wordlist, show found paths
gobuster dir \
  -u http://example.com:8080 \
  -w /usr/share/seclists/Discovery/Web-Content/common.txt \
  -t 40
# Add common extensions so it finds files, not just folders
gobuster dir -u http://example.com:8080 \
  -w common.txt -x php,html,txt,bak,zip

And the same idea with ffuf. The FUZZ keyword marks where each wordlist entry gets substituted:

# ffuf: FUZZ is replaced by each line of the wordlist
ffuf -u http://example.com:8080/FUZZ \
  -w /usr/share/seclists/Discovery/Web-Content/common.txt
# Filter out the noise: show only 200, 301, 302, and 403 (this hides the 404s)
ffuf -u http://example.com:8080/FUZZ -w common.txt -mc 200,301,302,403
# Hide responses of a boring size (e.g. a default 'not found' page)
ffuf -u http://example.com:8080/FUZZ -w common.txt -fs 1234

The skill in brute-forcing is filtering. Some servers return 200 for every path (a catch-all page), so a naive run reports thousands of false hits. Use -mc (match status code) and -fs (filter by response size) in ffuf, or --status-codes and --exclude-length in gobuster, to cut the wall of noise down to the handful of paths that are genuinely different. In gobuster v3 you may also need -b '' to clear the default 404 blacklist before --status-codes is accepted.
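To pick a size to filter on, ask the server for a path that almost certainly does not exist and note what comes back. The helper below is a minimal sketch of that measurement; the name baseline and the throwaway path it requests are assumptions for this example.

```shell
# baseline BASE_URL
# Request a path that almost certainly does not exist and print "STATUS SIZE".
# The helper name and the throwaway path are illustrative.
baseline() {
  curl -s -o /dev/null -w '%{http_code} %{size_download}\n' \
    "$1/no-such-page-$$-$(date +%s)"
}

# Usage:
#   baseline http://example.com:8080
# If it prints something like "200 1234", the server has a catch-all page and
# 1234 is the size to pass to ffuf -fs (or gobuster --exclude-length).
```

If the catch-all page echoes the requested path, its size will vary with the path length, so run the measurement a couple of times before trusting a single number.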

Note: Brute-forcing is loud and slow. On a CTF instance it is fine and expected. Never point these tools at a site you do not have explicit permission to test. In a CTF, the scope is the challenge box and nothing else.

The reward for finding a hidden path is the next half of the challenge, which is often access control. picoCTF 2022 Forbidden Paths is a clean follow-on: once you know a file exists, the puzzle becomes reaching it through a path that the server tried to restrict.

Why are cookies and localStorage a recon surface?

The server does not only send you a page. It often hands your browser small pieces of state to carry back on every request: cookies. Open the DevTools Application tab (the Storage tab in Firefox) and look under Storage. You will see cookies, localStorage, and sessionStorage. All three are fully readable and editable by you.

Beginner challenges frequently store the access decision in a cookie and trust it blindly. If you see a cookie like admin=0 or role=guest, the obvious experiment is to change it:

# Read the cookies the server set
curl -s -i http://example.com:8080/ | grep -i set-cookie
# Send a tampered cookie back and see what changes
curl -s http://example.com:8080/ --cookie 'admin=1'
curl -s http://example.com:8080/ --cookie 'role=administrator'

You can also edit cookies live in the Application tab, double-clicking the value and reloading. The challenge picoCTF 2021 Cookies is built entirely around this: the site tracks which "cookie" you picked in a cookie value, and walking that value through its range reveals the flag. It is the perfect introduction to treating client-side state as something you own.
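Walking a numeric cookie through its range is easy to automate. The loop below is a sketch, not the official solution to any challenge: the function name sweep_cookie, its arguments, and the picoCTF{...} flag pattern are assumptions you would adapt to the target.

```shell
# sweep_cookie URL COOKIE_NAME MAX
# Try COOKIE_NAME=0 .. COOKIE_NAME=MAX and print "VALUE FLAG" for any response
# containing something shaped like picoCTF{...}. All three arguments and the
# flag pattern are assumptions to adapt per challenge.
sweep_cookie() {
  i=0
  while [ "$i" -le "$3" ]; do
    # -L follows redirects, since the flag may sit behind one
    hit="$(curl -s -L --cookie "$2=$i" "$1" | grep -oE 'picoCTF\{[^}]*\}' | head -n 1)"
    [ -n "$hit" ] && printf '%s %s\n' "$i" "$hit"
    i=$((i + 1))
  done
}

# Usage:
#   sweep_cookie http://example.com:8080/check name 30
```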

Tip: A cookie that looks like random noise, often ending in one or two equals signs (...= or ...==), is probably base64. A cookie with two dots splitting it into three chunks (xxxxx.yyyyy.zzzzz), usually starting with eyJ, is a JSON Web Token. Both are decodable and sometimes forgeable. The Cookies and JWT for CTF post covers how to decode, tamper with, and forge them.
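Decoding either kind of cookie takes a few lines of shell. The sketch below assumes the GNU coreutils base64 command (base64 -d); the helper names b64url_decode and jwt_payload are illustrative. It only reads the token and does not verify or forge the signature.

```shell
# b64url_decode: read one base64 or base64url string on stdin and decode it.
# JWTs use the URL-safe alphabet and drop the = padding, so restore both.
b64url_decode() {
  input="$(tr -d '\r\n' | tr '_-' '/+')"
  case $(( ${#input} % 4 )) in
    2) input="${input}==" ;;
    3) input="${input}=" ;;
  esac
  printf '%s' "$input" | base64 -d
}

# jwt_payload TOKEN
# TOKEN has the form header.payload.signature; print the decoded payload (the claims).
jwt_payload() {
  printf '%s' "$1" | cut -d. -f2 | b64url_decode
}

# Usage:
#   jwt_payload "$TOKEN"      # TOKEN copied from the Application tab
```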

When should I stop using the browser and open Burp Suite?

Everything above runs in a browser, curl, and a handful of command-line tools. That covers most beginner web challenges. You hit the ceiling of those tools when the challenge needs you to intercept and modify requests mid-flight, replay a request dozens of times with small changes, or work through a multi-step flow where each request depends on the last.

That is when you reach for an intercepting proxy. The standard is Burp Suite (the free Community Edition is plenty for CTF). Burp sits between your browser and the server, so you can pause any request, edit headers or parameters the browser would never let you touch, and send it on. Reach for it when you need to:

Change a request method, header, or body that the page hardcodes (for example, flip a POST field the form will not let you edit).
Replay one request many times with tweaks, using Burp Repeater, to test how the server responds to each variation.
Automate a parameter sweep (Burp Intruder, which is rate-limited in the Community Edition) such as trying every user ID or every value of a token.
Inspect or strip headers that the browser adds automatically and the challenge cares about.

The handoff is natural: recon with the browser and command line tells you where the interesting endpoints are, and Burp lets you manipulate the requests to them. The dedicated Burp Suite for CTF post walks through setup, the proxy, Repeater, and Intruder from zero.

Key insight: Recon finds the door. Burp lets you pick the lock. Most beginners try to pick locks before they have walked the whole building, which is why they get stuck. Enumerate first, manipulate second.

Which picoCTF challenges teach pure recon?

picoCTF is the best place to drill these reflexes because its web track is dense with recon-only challenges. Work through these roughly in order and each step of the loop above gets its own dedicated practice:

picoCTF 2022 Inspect HTML and picoCTF 2022 Search Source are pure view-source and grep-the-scripts practice.
picoCTF 2019 Where are the robots is the robots.txt challenge by name.
picoCTF 2019 dont-use-client-side teaches that anything running in your browser is yours to read.
picoCTF 2021 Scavenger Hunt chains every recon surface together, one clue pointing to the next.
picoCTF 2021 Cookies introduces client-side state as a recon and tampering surface.
picoCTF 2022 Forbidden Paths is where recon (finding a path) hands off to exploitation (reaching it).

Once recon feels automatic, the natural next steps are the injection families. The SQL Injection for CTF post and the Command Injection for CTF post both start from an endpoint you found during recon and turn it into a flag.

Quick reference: the recon checklist

Run this on every web challenge, in order

1. Ctrl+U view source. Read every comment, hidden field, and <script> tag.
2. F12 DevTools. Network tab for background API calls, Sources tab to read every .js, Elements tab for the live DOM.
3. Fetch /robots.txt, /sitemap.xml, and /.well-known/security.txt. Visit every Disallow path.
4. Probe exposed files: /.git/HEAD, /.env, /backup.zip, /index.php.bak. Watch the status code.
5. Application tab: read and tamper with cookies, localStorage, sessionStorage.
6. Stuck? Brute-force with ffuf -u http://target/FUZZ -w common.txt -mc 200,301,302,403 or gobuster dir -u http://target -w common.txt -x php,bak,zip.
7. Need to forge or replay requests? Open Burp Suite and switch to the proxy workflow.

curl recon one-liners

# Raw source with headers
curl -s -i http://target/
# Just the status code of a path (great for probing files)
curl -s -o /dev/null -w '%{http_code}\n' http://target/.git/HEAD
# Follow redirects and keep cookies across requests
curl -s -L -c jar.txt -b jar.txt http://target/
# Send a tampered cookie
curl -s http://target/ --cookie 'admin=1'

The whole discipline reduces to one habit: never trust the rendered page to be the whole story. View the source, watch the network, read the cookies, and knock on every door the server forgot to lock. Do that first, every time, and most beginner web challenges solve themselves.