Project 17: MCP Contract Verifier
Conformance report covering schema validity, auth behavior, and safety flags.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 5-10 days (capstone: 3-5 weeks) |
| Main Programming Language | Go |
| Alternative Programming Languages | TypeScript, Python |
| Coolness Level | Level 4: Protocol Engineering |
| Business Potential | 4. Integration Platform |
| Knowledge Area | Protocol Compliance |
| Software or Tool | MCP client/server conformance suite |
| Main Book | Computer Networks (Tanenbaum) |
| Concept Clusters | Tool Calling and MCP Interoperability; Instruction Hierarchy and Injection Defense |
1. Learning Objectives
By completing this project, you will:
- Design a reliable artifact: An automated compliance runner for MCP interfaces that validates protocol conformance, schema contracts, and security posture.
- Understand the MCP specification deeply enough to generate conformance test cases for capability negotiation, tool schemas, resource discovery, and error handling.
- Build a test harness that distinguishes MUST/SHOULD/MAY conformance levels and produces actionable verdicts per contract.
- Validate tool behavioral contracts beyond schema (idempotency, side effects, annotation accuracy) through probe-based testing.
- Implement security posture validation covering OAuth 2.1 flows, input injection resistance, least privilege, and transport security.
- Produce machine-readable conformance reports that integrate into CI/CD pipelines with severity-based gating.
2. All Theory Needed (Per-Concept Breakdown)
Protocol Conformance Testing
Fundamentals
Protocol Conformance Testing is the practice of verifying that an implementation correctly follows a protocol specification. For MCP (Model Context Protocol), this means checking that a server or client implementation handles capability negotiation, message framing, error responses, and transport behavior exactly as the specification requires. Conformance testing is fundamentally different from functional testing: functional tests verify that a feature works; conformance tests verify that an implementation adheres to a shared standard so that any compliant client can interoperate with any compliant server. Without conformance testing, the MCP ecosystem fragments: each implementation makes slightly different assumptions about message formats, error codes, and capability behavior, and integrations break in subtle, hard-to-diagnose ways. For this project, conformance testing is the core verification methodology that the entire test suite is built upon.
Deep Dive into the concept
At depth, Protocol Conformance Testing for MCP requires understanding three layers: the specification structure, the conformance level system, and the test generation strategy.
The MCP specification is built on JSON-RPC 2.0 as its transport-agnostic message format. Every MCP message is a JSON-RPC request, response, or notification. The specification defines three primitive types that a server can expose: tools (model-invocable functions with input schemas), resources (data sources the model can read), and prompts (reusable prompt templates). The specification also defines a capability negotiation handshake: during initialization, the client sends an initialize request declaring its capabilities, and the server responds with its own capabilities. This handshake determines which features are available for the session. A conformance test suite must verify that this handshake works correctly, that undeclared capabilities are not available, and that the server rejects requests for capabilities it did not advertise.
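The handshake described above can be sketched in Go. This is a minimal illustration, assuming the 2024-11-05 message shapes; the helper names (`BuildInitialize`, `Declares`) are our own, not an MCP SDK API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a minimal JSON-RPC 2.0 request envelope.
type Request struct {
	JSONRPC string         `json:"jsonrpc"`
	ID      int            `json:"id"`
	Method  string         `json:"method"`
	Params  map[string]any `json:"params,omitempty"`
}

// BuildInitialize constructs the client's initialize request declaring
// its protocol version and capabilities.
func BuildInitialize(version string, caps map[string]any) Request {
	return Request{
		JSONRPC: "2.0",
		ID:      1,
		Method:  "initialize",
		Params: map[string]any{
			"protocolVersion": version,
			"capabilities":    caps,
			"clientInfo":      map[string]any{"name": "conformance-runner", "version": "0.1.0"},
		},
	}
}

// Declares reports whether a server's capability object (from the
// initialize response) advertises a given capability key. Tests for an
// undeclared capability should expect a "Method not found" error.
func Declares(serverCaps map[string]any, key string) bool {
	_, ok := serverCaps[key]
	return ok
}

func main() {
	req := BuildInitialize("2024-11-05", map[string]any{"roots": map[string]any{}})
	b, _ := json.Marshal(req)
	fmt.Println(string(b))

	// A server that declared only "tools" must not be probed for resources.
	serverCaps := map[string]any{"tools": map[string]any{}}
	fmt.Println(Declares(serverCaps, "tools"), Declares(serverCaps, "resources")) // true false
}
```

The `Declares` check is what the suite consults before running capability-specific tests: declared capabilities get positive tests, undeclared ones get rejection tests.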
The conformance level system uses RFC 2119 keywords: MUST, SHOULD, and MAY. A MUST requirement is a hard failure if violated: the implementation is non-conformant. A SHOULD requirement is a strong recommendation; violating it is a warning that indicates poor quality but not protocol breakage. A MAY requirement is optional behavior; testing it confirms feature presence but absence is not a defect. Your test suite must categorize every test case by conformance level and produce verdicts that distinguish hard failures from warnings. This categorization is critical for CI gating: you might block deployment on MUST failures but only warn on SHOULD violations.
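The verdict classification and CI gate described above can be sketched as follows; the `Verdict` record and `Gate` function are illustrative names under the assumption that the gate blocks only on MUST failures.

```go
package main

import "fmt"

// Level mirrors the RFC 2119 conformance levels.
type Level int

const (
	MUST Level = iota
	SHOULD
	MAY
)

// Verdict is one test outcome tagged with its conformance level.
type Verdict struct {
	TestID string
	Level  Level
	Passed bool
}

// Gate returns ok=false (block deployment) if any MUST-level test failed;
// SHOULD failures are counted as warnings but do not block.
func Gate(verdicts []Verdict) (ok bool, warnings int) {
	ok = true
	for _, v := range verdicts {
		if v.Passed {
			continue
		}
		switch v.Level {
		case MUST:
			ok = false // hard failure: implementation is non-conformant
		case SHOULD:
			warnings++ // quality gap, not protocol breakage
		}
	}
	return ok, warnings
}

func main() {
	ok, warns := Gate([]Verdict{
		{"CT-TOOLS-LIST-001", MUST, true},
		{"CT-INIT-CAPS-002", SHOULD, false},
	})
	fmt.Println(ok, warns) // true 1
}
```

MAY-level failures are deliberately ignored by the gate: absence of an optional feature is not a defect.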
Test generation strategy determines how you create test cases. For MCP, the primary approach is specification-driven: read each normative statement in the spec and derive one or more test cases. For example, the spec states that tools/list MUST return an array of tool objects, each with a name (string) and inputSchema (JSON Schema object). From this single statement, you derive: a positive test (call tools/list and verify the response shape), a negative test (verify that tools without name fields are rejected), and an edge case test (verify behavior when the tool list is empty). The second approach is interaction-driven: simulate multi-step client-server conversations and verify that state transitions are correct (e.g., calling a tool before initialization should fail). The third approach is fuzz-based: send malformed messages and verify that the server responds with proper JSON-RPC error codes rather than crashing or hanging.
A critical subtlety is versioning. The MCP specification evolves (e.g., the 2025-03-26 revision added tool annotations and Streamable HTTP transport). Your conformance suite must be version-aware: test cases should be tagged with the minimum spec version they apply to, and the suite should skip tests that target features not supported by the server’s declared version.
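Version-aware skipping can be implemented with a simple lexical comparison, since MCP revisions are ISO dates and therefore sort chronologically as strings. The `TestCase` shape and `Applicable` helper here are our own sketch.

```go
package main

import "fmt"

// TestCase carries the earliest spec revision a test applies to.
type TestCase struct {
	ID         string
	MinVersion string // e.g. "2025-03-26"; ISO dates sort lexically
}

// Applicable reports whether a test should run against a server that
// declared the given protocol version; otherwise the verdict is SKIP.
func Applicable(tc TestCase, serverVersion string) bool {
	return serverVersion >= tc.MinVersion
}

func main() {
	// Tool-annotation tests only apply from the revision that added them.
	annotTest := TestCase{ID: "CT-TOOL-ANNOT-001", MinVersion: "2025-03-26"}
	fmt.Println(Applicable(annotTest, "2024-11-05")) // false -> SKIP, not FAIL
	fmt.Println(Applicable(annotTest, "2025-03-26")) // true  -> run
}
```

Emitting SKIP rather than FAIL for inapplicable tests is what prevents the false failures described above.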
How this fits into the project
This concept is the primary design driver for Project 17. Every test case in the conformance suite is derived from MCP specification statements, categorized by conformance level, and version-tagged. The test harness architecture, report format, and CI integration all flow from this methodology.
Definitions & key terms
- Conformance: the degree to which an implementation follows a specification’s normative requirements.
- Normative statement: a specification requirement using RFC 2119 keywords (MUST, SHOULD, MAY) that defines expected behavior.
- Capability negotiation: the MCP handshake where client and server declare which protocol features they support.
- JSON-RPC 2.0: the message format underlying MCP, defining request/response/notification patterns with `jsonrpc`, `method`, `params`, `id`, and `error` fields.
- Conformance level: MUST (hard requirement), SHOULD (strong recommendation), MAY (optional feature).
- Specification-driven testing: deriving test cases directly from normative statements in the protocol specification.
Mental model diagram (ASCII)
+-------------------+
| MCP Specification |
| (normative text) |
+-------------------+
|
| extract normative statements
v
+-------------------+
| Test Case |
| Generator |
| - positive tests |
| - negative tests |
| - edge cases |
| - fuzz inputs |
+-------------------+
|
| tagged with: conformance_level, spec_version, category
v
+-------------------+ +-------------------+
| Test Runner |---->| Server Under Test |
| (MCP client) |<----| (MCP server) |
+-------------------+ +-------------------+
|
| compare actual vs expected
v
+-------------------+
| Conformance |
| Verdict |
| - PASS / FAIL |
| - level: MUST |
| / SHOULD / MAY |
| - category |
| - evidence |
+-------------------+
|
v
+-------------------+
| Report Generator |
| - machine JSON |
| - human summary |
| - CI gate signal |
+-------------------+
How it works (step-by-step, with invariants and failure modes)
- Load MCP specification version and extract normative requirements.
- Generate test cases tagged by conformance level, version, and category.
- Establish connection to server under test (stdio, SSE, or Streamable HTTP).
- Send `initialize` request and record server capabilities.
- Execute test cases in dependency order (initialization tests first, then capability-specific tests).
- For each test case: send request, capture response, compare against expected behavior.
- Record verdict (PASS/FAIL/SKIP) with evidence (actual response, expected response, spec reference).
- Generate conformance report with summary counts and per-test details.
- Invariant: MUST-level failures always produce a FAIL verdict regardless of other context.
- Failure mode: server closes connection mid-suite; the runner must handle partial results gracefully and report which tests could not be executed.
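The runner loop and its partial-result failure mode can be sketched in Go. `Transport`, `Case`, and `Run` are illustrative names, not an MCP SDK API; a real runner would speak JSON-RPC over stdio or Streamable HTTP.

```go
package main

import "fmt"

// Result is one test verdict with its supporting evidence.
type Result struct {
	TestID   string
	Verdict  string // PASS, FAIL, or SKIP
	Evidence string
}

// Transport abstracts the connection to the server under test.
type Transport interface {
	Send(method string) (response string, err error)
}

// Case is a minimal test case: a method to send and the expected reply.
type Case struct {
	ID       string
	Method   string
	Expected string
}

// Run executes cases in order; if the connection dies mid-suite, the
// remaining cases are recorded as SKIP so the report stays complete.
func Run(t Transport, cases []Case) []Result {
	results := make([]Result, 0, len(cases))
	dead := false
	for _, c := range cases {
		if dead {
			results = append(results, Result{c.ID, "SKIP", "connection lost earlier in suite"})
			continue
		}
		resp, err := t.Send(c.Method)
		switch {
		case err != nil:
			dead = true
			results = append(results, Result{c.ID, "SKIP", "send failed: " + err.Error()})
		case resp == c.Expected:
			results = append(results, Result{c.ID, "PASS", resp})
		default:
			results = append(results, Result{c.ID, "FAIL", "got " + resp + ", want " + c.Expected})
		}
	}
	return results
}

// fakeTransport simulates a server that closes the connection after a
// fixed number of requests.
type fakeTransport struct{ failAfter, n int }

func (f *fakeTransport) Send(method string) (string, error) {
	f.n++
	if f.n > f.failAfter {
		return "", fmt.Errorf("connection closed")
	}
	return "ok", nil
}

func main() {
	cases := []Case{{"T1", "initialize", "ok"}, {"T2", "tools/list", "ok"}, {"T3", "tools/call", "ok"}}
	for _, r := range Run(&fakeTransport{failAfter: 2}, cases) {
		fmt.Println(r.TestID, r.Verdict) // T1 PASS, T2 PASS, T3 SKIP
	}
}
```

The `dead` flag is the key design choice: the runner never aborts, so the report always accounts for every test, including those it could not execute.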
Minimal concrete example
test_case:
id: "CT-TOOLS-LIST-001"
category: "tool_discovery"
conformance_level: "MUST"
spec_version: "2024-11-05"
spec_reference: "Section 6.1: tools/list"
description: "Server MUST return tools array with name and inputSchema"
steps:
- send: { jsonrpc: "2.0", method: "tools/list", id: 1 }
- expect_response:
id: 1
result:
tools: array_of:
name: string (non-empty)
inputSchema: object (valid JSON Schema)
verdict_if_match: PASS
verdict_if_mismatch: FAIL
evidence: "Captured response vs expected schema"
test_case:
id: "CT-INIT-CAPS-002"
category: "capability_negotiation"
conformance_level: "MUST"
description: "Server MUST NOT expose tools/call if tools capability not declared"
steps:
- send: initialize (without tools capability)
- send: { jsonrpc: "2.0", method: "tools/call", params: {...}, id: 2 }
- expect_error: { code: -32601, message: "Method not found" }
Common misconceptions
- “If the server returns valid JSON, it is conformant.” Conformance requires correct message structure, correct semantics, correct error codes, and correct state transitions, not just parseable responses.
- “Functional tests and conformance tests are the same thing.” Functional tests verify features work for your use case. Conformance tests verify the implementation matches the specification, enabling interoperability with any compliant counterpart.
- “Testing happy paths is sufficient for conformance.” Negative tests (malformed inputs, unauthorized requests, unsupported methods) are where most conformance failures hide. A server that handles happy paths but crashes on malformed input is non-conformant.
- “The spec version does not matter.” MCP evolves; a test case valid for 2025-03-26 may not apply to 2024-11-05. Without version tagging, the suite produces false failures.
Check-your-understanding questions
- What is the difference between a MUST-level failure and a SHOULD-level failure in a conformance report?
- Why must conformance test cases be derived from specification text rather than from observed implementation behavior?
- How should the test suite handle a server that declares a capability in initialization but fails to implement it correctly?
- Why is test execution order important for protocol conformance suites?
Check-your-understanding answers
- A MUST-level failure means the implementation violates a hard requirement and is non-conformant; it should block deployment. A SHOULD-level failure indicates a quality gap but does not break protocol interoperability; it is a warning.
- Deriving tests from implementation behavior encodes implementation bugs as expected behavior. Specification-driven tests verify against the authoritative contract, catching implementation deviations that would break interoperability with other compliant implementations.
- Report it as a conformance failure: the capability declaration is a contract promise. If `tools` is declared in capabilities but `tools/list` returns an error, the server has broken its declared contract. This should be a MUST-level failure.
- Protocol state matters. The `initialize` handshake must succeed before capability-specific tests can run. Testing `tools/call` before initialization produces misleading results because the failure might be due to missing initialization, not a tool implementation bug.
Real-world applications
- W3C Web Platform Tests: the browser conformance suite that verifies HTML/CSS/JS standard compliance across browsers.
- Bluetooth SIG certification: devices must pass conformance tests before using the Bluetooth brand.
- OAuth 2.0 conformance suites that verify authorization server implementations.
- HTTP/2 conformance testing (h2spec) that validates server frame handling and stream management.
Where you’ll apply it
- The core test harness architecture, test case format, and verdict classification.
- The specification parser that extracts normative statements and generates test stubs.
- The report generator that produces conformance reports with MUST/SHOULD/MAY breakdowns.
- The CI integration that gates deployments based on conformance level thresholds.
References
- MCP Specification (modelcontextprotocol.io/specification) - the authoritative protocol definition
- JSON-RPC 2.0 Specification (jsonrpc.org/specification) - the message format underlying MCP
- RFC 2119 “Key words for use in RFCs to Indicate Requirement Levels” - MUST/SHOULD/MAY semantics
- “Computer Networks” by Tanenbaum - protocol layering and conformance testing methodology
- W3C Web Platform Tests project (web-platform-tests.org) - large-scale conformance testing example
Key insights
Protocol conformance testing verifies that implementations match the specification (not just that they work), enabling reliable interoperability across the entire ecosystem of clients and servers.
Summary
Protocol Conformance Testing derives test cases from MCP specification normative statements, categorizes them by conformance level (MUST/SHOULD/MAY), executes them against a server under test, and produces verdicts that distinguish hard failures from quality warnings, enabling CI gating and ecosystem-wide interoperability assurance.
Homework/Exercises to practice the concept
- Read the MCP specification’s `initialize` method section and list every MUST-level requirement. Write one test case for each.
- Design a test case that verifies capability negotiation: a server that declares `tools` but not `resources` should accept `tools/list` and reject `resources/list`. Write the full test case specification.
- Explain why fuzz testing is valuable for conformance even though the specification does not explicitly define responses to malformed input.
- Design a test execution order for a suite with 5 categories: initialization, capability negotiation, tool discovery, tool execution, error handling. Justify the order.
Solutions to the homework/exercises
Strong solutions include: test cases that cite specific spec section numbers and quote the normative text they verify, capability negotiation tests that check both positive (declared capability works) and negative (undeclared capability rejected with correct error code) paths, fuzz testing justifications that reference robustness requirements implicit in JSON-RPC error handling (servers MUST NOT crash on invalid input), and execution orders that respect state dependencies (initialization before capabilities, capabilities before tool execution).
MCP Tool Contract Semantics
Fundamentals
MCP Tool Contract Semantics is about understanding what a tool declaration in MCP actually promises and how to verify those promises beyond simple schema validation. In MCP, a tool is declared with a name, a description, an inputSchema (a JSON Schema object defining valid arguments), and optional annotations that describe behavioral properties like whether the tool is read-only, destructive, or idempotent. The schema tells you what shape the input must have; the annotations and description tell you what the tool will do with that input. But neither the schema nor the annotations are enforced by the protocol itself; they are assertions made by the server developer. A tool annotated as readOnlyHint: true might still mutate state. A tool with a well-defined inputSchema might silently ignore extra fields or crash on edge-case values. Tool contract semantics is the discipline of treating these declarations as testable claims and building verification strategies that detect when actual behavior diverges from declared behavior.
Deep Dive into the concept
At depth, MCP Tool Contract Semantics requires understanding three verification layers: schema contract verification, annotation contract verification, and behavioral drift detection.
Schema contract verification checks that the tool’s inputSchema accurately describes what the tool accepts and rejects. This goes beyond “does the schema parse as valid JSON Schema.” You need to verify: does the tool reject inputs that violate the schema (e.g., missing required fields, wrong types, values outside enum constraints)? Does the tool accept all inputs that satisfy the schema? Does the tool handle edge cases at schema boundaries (empty strings, maximum-length arrays, deeply nested objects)? Many implementations have schema declarations that are aspirational rather than enforced; the schema says a field is required but the tool works fine without it, or the schema says a field is a string but the tool crashes on strings longer than 1000 characters. These gaps between declared schema and actual acceptance are contract violations.
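One simple probe-generation strategy for the enforcement gap described above: from a known-valid input, derive one negative probe per required field by removing that field. The tool MUST reject every such probe. This is a minimal sketch; `NegativeProbes` is our own name, and a full generator would also mutate types and boundary values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NegativeProbes returns, for each required field, a copy of a valid
// input with that field removed. Each probe should be rejected by the
// tool with an invalid-params error; acceptance is a contract gap.
func NegativeProbes(valid map[string]any, required []string) []map[string]any {
	probes := make([]map[string]any, 0, len(required))
	for _, field := range required {
		probe := make(map[string]any)
		for k, v := range valid {
			if k != field {
				probe[k] = v // keep everything except the dropped field
			}
		}
		probes = append(probes, probe)
	}
	return probes
}

func main() {
	valid := map[string]any{"query": "test", "limit": 10}
	for _, p := range NegativeProbes(valid, []string{"query"}) {
		b, _ := json.Marshal(p)
		fmt.Println(string(b)) // {"limit":10} -- must be rejected by the tool
	}
}
```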
Annotation contract verification tests the behavioral hints introduced in the MCP 2025-03-26 revision. The specification defines four standard annotations: readOnlyHint (tool does not modify state), destructiveHint (tool may perform irreversible operations), idempotentHint (calling the tool multiple times with the same input produces the same result), and openWorldHint (tool interacts with external systems beyond the server’s control). These annotations inform the LLM’s tool-use decisions; a model should prefer read-only tools over destructive ones for exploratory queries. But annotations are hints, not enforced constraints. Your verifier must test them: call a readOnlyHint: true tool and verify that observable state does not change. Call an idempotentHint: true tool twice with the same input and verify that results match and no additional side effects occur. Call a destructiveHint: false tool and verify that no irreversible changes happen.
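The annotation probes can be expressed as small state-comparison checks. This sketch assumes two capabilities the harness must supply: `Invoke`, a stand-in for a `tools/call` round trip, and `Snapshot`, some way of capturing observable primary state (for example via a separate read-only tool); both names are illustrative.

```go
package main

import "fmt"

// Probe bundles the two operations a behavioral probe needs.
type Probe struct {
	Invoke   func(args string) (result string, err error)
	Snapshot func() string
}

// CheckReadOnly verifies that observable state is unchanged by a call to
// a tool annotated readOnlyHint: true.
func CheckReadOnly(p Probe, args string) bool {
	before := p.Snapshot()
	p.Invoke(args)
	return p.Snapshot() == before
}

// CheckIdempotent invokes twice with identical args and compares results,
// per idempotentHint: true. External state drift between calls is a known
// limitation; run against a frozen test environment.
func CheckIdempotent(p Probe, args string) bool {
	r1, err1 := p.Invoke(args)
	r2, err2 := p.Invoke(args)
	return err1 == nil && err2 == nil && r1 == r2
}

func main() {
	state := "v1"
	honest := Probe{
		Invoke:   func(args string) (string, error) { return "result:" + args, nil },
		Snapshot: func() string { return state },
	}
	fmt.Println(CheckReadOnly(honest, "q"), CheckIdempotent(honest, "q")) // true true
}
```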
Behavioral drift detection addresses the problem that tools change over time. A tool that was read-only in version 1.0 might acquire side effects in version 1.1 without updating its annotations. Your verifier should compare current behavior against a baseline snapshot: if the tool’s response shape changed, if new side effects appeared, or if error handling behavior shifted, the contract may have drifted. Drift detection requires maintaining a history of test results and alerting when results diverge from the established baseline.
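Response-shape drift against a baseline can be detected with a field-by-field comparison. This sketch assumes shapes are recorded as field-name-to-JSON-type maps; the severity labels (removed = CRITICAL, changed type = HIGH, added = MEDIUM) follow the scheme used later in this section's exercises and are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// DriftEvent records one deviation from the baseline response shape.
type DriftEvent struct {
	Field    string
	Severity string
}

// DetectDrift compares a baseline shape against the current one and
// emits one event per deviation, sorted by field name for stable output.
func DetectDrift(baseline, current map[string]string) []DriftEvent {
	var events []DriftEvent
	for field, typ := range baseline {
		cur, ok := current[field]
		switch {
		case !ok:
			events = append(events, DriftEvent{field, "CRITICAL: field removed"})
		case cur != typ:
			events = append(events, DriftEvent{field, "HIGH: type changed"})
		}
	}
	for field := range current {
		if _, ok := baseline[field]; !ok {
			events = append(events, DriftEvent{field, "MEDIUM: field added"})
		}
	}
	sort.Slice(events, func(i, j int) bool { return events[i].Field < events[j].Field })
	return events
}

func main() {
	baseline := map[string]string{"results": "array", "total": "integer"}
	current := map[string]string{"results": "array", "total": "integer", "metadata": "object"}
	for _, e := range DetectDrift(baseline, current) {
		fmt.Println(e.Field, "-", e.Severity) // metadata - MEDIUM: field added
	}
}
```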
A subtle but important aspect is the relationship between tool description and tool behavior. The description field is consumed by the LLM to decide when to invoke the tool. If the description says “searches the internal knowledge base” but the tool actually makes external API calls, the model will misuse it. Testing description accuracy requires semantic analysis: does the tool’s actual behavior match what the description promises? This is harder to automate but can be approximated by checking observable behaviors (network calls, file system changes, database writes) against description claims.
How this fits into the project
This concept powers the tool-specific verification logic in Project 17. The verifier engine’s tool contract tests validate schema enforcement, annotation accuracy, and behavioral consistency. Contract drift detection feeds into the trend reporting and regression detection features.
Definitions & key terms
- inputSchema: a JSON Schema object declared by a tool that defines the shape of valid arguments.
- Tool annotations: metadata hints in MCP (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) that describe behavioral properties of a tool.
- Schema enforcement gap: the difference between what a tool’s declared schema allows and what the tool’s implementation actually accepts or rejects.
- Behavioral probe: a test that invokes a tool and observes side effects to verify annotation claims (e.g., checking for state changes after calling a read-only tool).
- Contract drift: a change in tool behavior that is not reflected in updated schema or annotations.
- Baseline snapshot: a recorded set of test results representing known-good tool behavior, used for drift detection.
Mental model diagram (ASCII)
Tool Declaration (from tools/list)
+----------------------------------+
| name: "search_docs" |
| description: "Search internal |
| knowledge base" |
| inputSchema: |
| { query: string (required), |
| limit: integer (optional) } |
| annotations: |
| readOnlyHint: true |
| destructiveHint: false |
| idempotentHint: true |
| openWorldHint: false |
+----------------------------------+
|
| verify each claim
v
+----------------------------------+
| Schema Contract Tests |
| - valid input accepted? |
| - invalid input rejected? | --> Schema Verdict
| - boundary values handled? |
+----------------------------------+
|
v
+----------------------------------+
| Annotation Contract Tests |
| - readOnly: no state change? |
| - idempotent: same result 2x? | --> Annotation Verdict
| - destructive: matches hint? |
| - openWorld: external calls? |
+----------------------------------+
|
v
+----------------------------------+
| Drift Detection |
| - compare to baseline snapshot |
| - response shape changed? | --> Drift Verdict
| - new side effects detected? |
| - error behavior shifted? |
+----------------------------------+
|
v
+----------------------------------+
| Combined Tool Contract Verdict |
| PASS / WARN / FAIL per claim |
+----------------------------------+
How it works
- Retrieve tool declarations from `tools/list` response.
- For each tool, extract `inputSchema`, `annotations`, and `description`.
- Schema contract tests: generate valid, invalid, and boundary inputs from `inputSchema`; call `tools/call` with each; verify acceptance/rejection matches schema.
- Annotation contract tests: for `readOnlyHint`, capture state before and after invocation and compare. For `idempotentHint`, invoke twice with identical inputs and compare results. For `destructiveHint` and `openWorldHint`, observe side effects.
- Drift detection: compare current test results against baseline snapshot; flag deviations.
- Produce per-tool verdict with separate scores for schema, annotations, and drift.
- Invariant: a tool that fails a schema enforcement test (accepts invalid input or rejects valid input) is a contract violation regardless of annotation results.
- Failure mode: behavioral probes may have false positives if side effects are not observable from the client (e.g., internal logging). Document which annotations can be fully verified and which are partially verified.
Minimal concrete example
tool_contract_test:
tool: "search_docs"
schema_tests:
- input: { query: "test" } # valid, required field present
expect: success
- input: { } # invalid, missing required 'query'
expect: error (INVALID_PARAMS)
- input: { query: "", limit: -1 } # boundary: empty string, negative int
expect: error or graceful handling
- input: { query: "test", extra: 1 } # extra field not in schema
expect: either accepted (open) or rejected (strict)
annotation_tests:
readOnlyHint:
- invoke: { query: "test" }
- check_state_before_after: MUST be identical
- verdict: PASS if no state change, FAIL if state mutated
idempotentHint:
- invoke_1: { query: "test" } -> result_1
- invoke_2: { query: "test" } -> result_2
- verdict: PASS if result_1 == result_2, WARN if results differ
drift_check:
baseline_date: "2026-01-15"
baseline_response_shape: { results: array, total: integer }
current_response_shape: { results: array, total: integer, metadata: object }
verdict: DRIFT_DETECTED (new field 'metadata' not in baseline)
Common misconceptions
- “If the inputSchema is valid JSON Schema, the tool is correctly contracted.” The schema may be valid syntactically but not actually enforced by the implementation. A tool that accepts inputs violating its own schema has a contract gap.
- “Tool annotations are guarantees.” Annotations are hints. The specification explicitly calls them hints (`readOnlyHint`, not `readOnlyGuarantee`). They must be verified, not trusted.
- “Schema validation only needs to check required fields.” Boundary values (empty strings, extreme integers, deeply nested objects, unicode edge cases) reveal contract gaps that simple presence checks miss.
- “If a tool works correctly today, the contract is stable.” Tools evolve. Without drift detection against a baseline, contract violations accumulate silently across versions.
Check-your-understanding questions
- What is the difference between schema validity (the schema itself is well-formed JSON Schema) and schema enforcement (the tool actually accepts/rejects according to the schema)?
- How would you test the `idempotentHint` annotation, and what are the limitations of this test?
- Why might a `readOnlyHint: true` tool still cause observable state changes, and how would your verifier handle this?
- What is contract drift, and why is baseline comparison necessary to detect it?
Check-your-understanding answers
- Schema validity means the `inputSchema` object parses as valid JSON Schema. Schema enforcement means the tool’s runtime behavior matches the schema: rejecting inputs that violate it and accepting inputs that satisfy it. A tool can have a valid schema but poor enforcement.
- Invoke the tool twice with identical inputs and compare results. Limitations: the tool may depend on external state that changes between invocations (e.g., a search tool whose index is being updated), producing different results for reasons unrelated to idempotency. Mitigate by using a test environment with stable state.
- The tool might write audit logs, update access timestamps, or increment usage counters. These are side effects even though the tool’s primary function is read-only. The verifier should distinguish primary state (data the tool operates on) from incidental state (logs, metrics) and focus on primary state for the readOnly check.
- Contract drift is when a tool’s behavior changes (new response fields, different error codes, new side effects) without corresponding updates to its schema or annotations. Baseline comparison detects drift by flagging any deviation from a previously recorded known-good state. Without baselines, you only verify current behavior against current declarations, missing the case where both changed together incorrectly.
Real-world applications
- API contract testing (Pact, Spring Cloud Contract) where consumer expectations are verified against provider behavior.
- OpenAPI specification validation tools that check whether a REST API matches its declared schema.
- Terraform provider testing where infrastructure resource behavior is verified against declared schemas.
- GraphQL schema validation that verifies resolver behavior matches type declarations.
Where you’ll apply it
- The tool-specific verification module that generates and runs schema contract tests per tool.
- The annotation verifier that performs behavioral probes for readOnly, idempotent, destructive, and openWorld hints.
- The drift detection system that compares current test results against stored baselines.
- The report section that produces per-tool contract verdicts with specific evidence for each claim.
References
- MCP Specification (modelcontextprotocol.io/specification) - tool declaration format and annotations
- JSON Schema Specification (json-schema.org) - the foundation for inputSchema validation
- “Building Microservices” by Sam Newman - contract testing patterns and consumer-driven contracts
- Pact contract testing documentation - practical contract verification methodology
Key insights
A tool’s schema and annotations are claims, not guarantees; the verifier’s job is to test every claim with both positive and negative probes and detect when actual behavior drifts from declared behavior.
Summary
MCP Tool Contract Semantics treats tool declarations (schema, annotations, description) as testable contracts. Verification requires schema enforcement tests (does the tool accept/reject according to its schema?), annotation behavioral probes (does the tool actually behave as its hints claim?), and drift detection (has behavior changed since the last verified baseline?).
Homework/Exercises to practice the concept
- Given a tool with `inputSchema: { type: "object", properties: { query: { type: "string" }, limit: { type: "integer", minimum: 1, maximum: 100 } }, required: ["query"] }`, list 8 test inputs covering valid, invalid, and boundary cases.
- Design a behavioral probe for `idempotentHint: true` that accounts for external state changes between invocations.
- Explain why testing `openWorldHint` is harder than testing `readOnlyHint`, and propose a partial verification strategy.
- Write a drift detection specification: what fields do you compare, what constitutes a “drift event,” and what severity levels do you assign?
Solutions to the homework/exercises
Strong solutions include: 8 test inputs such as {query: "test"} (valid), {} (missing required), {query: 123} (wrong type), {query: "test", limit: 0} (below minimum), {query: "test", limit: 100} (at maximum), {query: "test", limit: 101} (above maximum), {query: ""} (empty string boundary), {query: "test", unknown: true} (extra field). Idempotency probes should use a controlled test environment with frozen external state or use a comparison tolerance that accounts for timestamp-like fields. OpenWorld testing is harder because external system interactions are not directly observable from the client; partial verification involves monitoring network traffic or using a proxy to detect outbound calls. Drift detection specs should compare response shape (field names and types), error code set, and side effect profile, with severity levels: CRITICAL (removed field), HIGH (changed type), MEDIUM (added field), LOW (changed error message text).
Security Posture Validation for Integrations
Fundamentals
Security Posture Validation for Integrations is the practice of testing the security boundaries of an MCP server integration through systematic probing of authentication flows, input validation, privilege boundaries, and transport security. MCP servers expose tools that can read data, modify state, and interact with external systems. A compromised or poorly secured MCP server becomes an attack vector: a malicious client could exfiltrate data through resource endpoints, perform unauthorized actions through tool calls, or inject payloads through tool arguments that are passed to downstream systems. Security posture validation ensures that the server implementation enforces the security properties that the deployment assumes. For this project, the security test suite is the third pillar of the conformance verifier, alongside protocol conformance and tool contract verification.
Deep Dive into the concept
At depth, Security Posture Validation for MCP integrations covers four test categories: authorization flow testing, input validation testing, least privilege verification, and transport security validation.
Authorization flow testing verifies the OAuth 2.1 integration that MCP uses for authenticated access. The MCP specification (since the 2025-03-26 revision) mandates OAuth 2.1 for HTTP-based transports. Your security tests must verify: the server requires authorization before granting access to protected tools/resources, the server correctly validates access tokens (rejecting expired, malformed, or revoked tokens), the server enforces scope restrictions (a token with read scope cannot invoke write tools), and the authorization flow follows OAuth 2.1 requirements (PKCE required, no implicit grant, token rotation). Testing authorization is not just about checking that unauthorized requests get a 401; it requires verifying that the entire flow is correct, including edge cases like token refresh during an active session and behavior when the authorization server is unreachable.
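The token-validation portion of these requirements can be expressed as a probe table. This is a sketch under the assumption of bearer-token auth over HTTP; `AuthProbe`, `AuthProbes`, and the placeholder token strings are illustrative, and a real suite would obtain tokens from a test authorization server.

```go
package main

import "fmt"

// AuthProbe pairs an Authorization header value with the HTTP status the
// server under test must return.
type AuthProbe struct {
	Name       string
	AuthHeader string // value sent as Authorization; "" means no header
	WantStatus int    // expected HTTP status from the server
}

// AuthProbes builds the core rejection cases: no token, malformed token,
// and expired token must all yield 401; a valid but read-scoped token
// used against a write tool must yield 403.
func AuthProbes(expiredToken, malformedToken, readOnlyToken string) []AuthProbe {
	return []AuthProbe{
		{"no token", "", 401},
		{"malformed token", "Bearer " + malformedToken, 401},
		{"expired token", "Bearer " + expiredToken, 401},
		{"insufficient scope on write tool", "Bearer " + readOnlyToken, 403},
	}
}

func main() {
	for _, p := range AuthProbes("expired-token", "not-a-jwt", "read-only-token") {
		fmt.Printf("%-35s expect HTTP %d\n", p.Name, p.WantStatus)
	}
}
```

Each probe is a MUST-level security test: any deviation (e.g., a 200 on an expired token) is a hard failure in the posture report.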
Input validation testing probes whether tool arguments are sanitized before being used in downstream operations. MCP tool arguments are provided by the LLM, which means they are effectively untrusted user input shaped by prompt content. If a tool takes a query argument and passes it directly to a SQL query, a prompt injection attack could cause data exfiltration. Your security tests should send known injection payloads through tool arguments: SQL injection strings, command injection attempts, path traversal sequences, and SSRF-triggering URLs. The expected behavior is that the server either rejects the input with an error or safely handles it without the injection succeeding. This category also includes testing for argument size limits (can you crash the server with a 10MB tool argument?) and encoding attacks (does the server handle null bytes, unicode control characters, and overlong UTF-8 sequences safely?).
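A small payload corpus plus a pass rule captures the injection tests above. The payloads are standard textbook examples; `InjectionPayloads` and `VerdictFor` are our own names, and the pass rule assumes the harness can observe whether injection evidence appeared in the response.

```go
package main

import "fmt"

// InjectionPayloads returns one representative payload per attack class
// to be sent through tool arguments.
func InjectionPayloads() map[string]string {
	return map[string]string{
		"sql":            "'; DROP TABLE users; --",
		"command":        "$(cat /etc/passwd)",
		"path_traversal": "../../../../etc/passwd",
		"ssrf":           "http://169.254.169.254/latest/meta-data/",
	}
}

// VerdictFor classifies an observed response to a payload probe:
// either a clean error or a safe result passes; any evidence that the
// payload was interpreted downstream fails.
func VerdictFor(serverErrored, leaked bool) string {
	switch {
	case leaked:
		return "FAIL" // payload was interpreted, not neutralized
	case serverErrored:
		return "PASS" // rejected with an error: acceptable
	default:
		return "PASS" // handled safely without the injection succeeding
	}
}

func main() {
	for category := range InjectionPayloads() {
		fmt.Println("probe category:", category)
	}
	fmt.Println(VerdictFor(true, false), VerdictFor(false, true)) // PASS FAIL
}
```

Note the asymmetry in the pass rule: rejecting and safely handling are both acceptable; only successful interpretation of the payload is a failure.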
Least privilege verification checks that the server only exposes capabilities it needs to. If the server declares tools and resources capabilities but the deployment only requires tool access, the excess resources capability is an unnecessary attack surface. Your tests should verify: the server only responds to methods matching its declared capabilities, tools that are documented as internal-only are not accessible through the public interface, and resources with access restrictions actually enforce those restrictions. This is related to the tool contract tests but focused specifically on the security implications of over-permissioning.
Transport security validation checks the security of the communication channel itself. For Streamable HTTP transport, this means verifying TLS is required (no plaintext HTTP fallback), certificates are valid, and origin checks are enforced (the server rejects requests from unauthorized origins). For SSE transport, the same TLS requirements apply plus verification that the event stream cannot be hijacked by a cross-origin request. For stdio transport (used in local development), security concerns are different: the trust boundary is the process, and the test should verify that the server does not expose network-accessible endpoints.
How this fits into the project This concept provides the security testing module in Project 17. The security test suite runs after the protocol conformance and tool contract tests, producing a security posture report that complements the conformance report with security-specific findings.
Definitions & key terms
- Authorization flow: the OAuth 2.1 sequence (token request, validation, refresh, revocation) that controls access to MCP server capabilities.
- Input injection: an attack where malicious data in tool arguments is interpreted as commands by a downstream system (SQL injection, command injection, SSRF).
- Least privilege: the principle that a system should only expose the minimum capabilities required for its function.
- Transport security: protections at the communication layer (TLS encryption, certificate validation, origin checking) that prevent eavesdropping and tampering.
- PKCE (Proof Key for Code Exchange): a mechanism required by OAuth 2.1 that prevents authorization code interception attacks.
- Security posture: the overall security strength of a system, measured by the results of systematic security testing across all attack categories.
Mental model diagram (ASCII)
Security Test Suite
|
+---> [1. Auth Flow Tests]
| - token required for protected endpoints?
| - expired token rejected?
| - scope enforcement correct?
| - PKCE required?
| - token refresh works?
| --> Auth Verdict: PASS / FAIL (per check)
|
+---> [2. Input Validation Tests]
| - SQL injection payloads rejected?
| - command injection blocked?
| - path traversal neutralized?
| - SSRF URLs blocked?
| - oversize input handled?
| --> Input Validation Verdict
|
+---> [3. Least Privilege Tests]
| - undeclared capabilities inaccessible?
| - internal tools not exposed?
| - resource access restrictions enforced?
| --> Privilege Verdict
|
+---> [4. Transport Security Tests]
- TLS required (no HTTP fallback)?
- certificate valid?
- origin checks enforced?
- CORS headers correct?
--> Transport Verdict
|
v
+----------------------------+
| Security Posture Report |
| Auth: 8/9 PASS |
| Input: 12/12 PASS |
| Privilege: 5/5 PASS |
| Transport: 4/4 PASS |
| Overall: 29/30 (1 FAIL) |
| Severity: HIGH (auth gap) |
+----------------------------+
How it works
- Load server’s declared capabilities and transport configuration.
- Run authorization flow tests: attempt access without token, with expired token, with wrong-scope token, with valid token.
- Run input validation tests: send injection payloads through each tool’s arguments and observe responses.
- Run least privilege tests: attempt to access undeclared capabilities and restricted resources.
- Run transport security tests: attempt plaintext connection, check certificate validity, test origin restrictions.
- Produce security posture report with per-category verdicts, severity ratings, and remediation guidance.
- Invariant: any test that demonstrates data exfiltration, unauthorized state modification, or authentication bypass is a CRITICAL severity finding regardless of category.
- Failure mode: some security tests are destructive by nature (testing whether a destructive tool can be invoked without authorization); these must run against a test environment, not production.
Minimal concrete example
security_test_matrix:
auth_flow:
- test: "no_token_access"
request: GET /tools/list (no Authorization header)
expected: 401 Unauthorized
severity_if_fail: CRITICAL
- test: "expired_token"
request: GET /tools/list (Authorization: Bearer <expired>)
expected: 401 Unauthorized
severity_if_fail: HIGH
- test: "wrong_scope_tool_call"
request: POST /tools/call (scope=read, tool=delete_record)
expected: 403 Forbidden
severity_if_fail: CRITICAL
input_validation:
- test: "sql_injection_in_query_arg"
tool: "search_docs"
input: { query: "'; DROP TABLE docs;--" }
expected: error or safe result (NOT SQL execution)
severity_if_fail: CRITICAL
- test: "path_traversal_in_file_arg"
tool: "read_file"
input: { path: "../../../../etc/passwd" }
expected: error (path outside allowed directory)
severity_if_fail: CRITICAL
- test: "ssrf_in_url_arg"
tool: "fetch_url"
input: { url: "http://169.254.169.254/latest/meta-data/" }
expected: error (internal IP blocked)
severity_if_fail: CRITICAL
least_privilege:
- test: "undeclared_resource_access"
precondition: server capabilities do NOT include 'resources'
request: GET /resources/list
expected: error (method not supported)
severity_if_fail: HIGH
transport:
- test: "plaintext_http_rejected"
request: HTTP (not HTTPS) connection attempt
expected: connection refused or redirect to HTTPS
severity_if_fail: HIGH
report_summary:
total_tests: 30
passed: 29
failed: 1
critical_findings: 0
high_findings: 1
finding: "expired_token accepted (auth_flow test 2)"
remediation: "Implement token expiration validation in auth middleware"
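The `Severity: HIGH` line in the report above is a roll-up: the overall severity is the maximum across failed findings. A sketch of that logic, with the rank ordering and the `NONE` sentinel as assumptions of this sketch:

```go
package main

import "fmt"

// severityRank orders finding severities; unknown labels rank lowest.
var severityRank = map[string]int{"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}

// overallSeverity returns the highest severity among failed findings,
// or "NONE" when everything passed.
func overallSeverity(failedFindings []string) string {
	best, label := 0, "NONE"
	for _, s := range failedFindings {
		if severityRank[s] > best {
			best, label = severityRank[s], s
		}
	}
	return label
}

func main() {
	fmt.Println(overallSeverity([]string{"HIGH"}))              // the report above: one auth gap
	fmt.Println(overallSeverity([]string{"LOW", "CRITICAL"}))   // CRITICAL dominates
	fmt.Println(overallSeverity(nil))                           // clean run
}
```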
Common misconceptions
- “MCP security is the responsibility of the transport layer alone.” TLS protects the channel, but input validation, authorization enforcement, and privilege scoping are application-level concerns that transport security cannot address.
- “Tool arguments come from the LLM so they are trusted.” The LLM generates arguments based on prompt content, which may include adversarial injections. Tool arguments are untrusted input and must be validated as such.
- “If the OAuth flow works in the happy path, authorization is secure.” Authorization security depends on edge cases: expired tokens, revoked tokens, wrong scopes, concurrent sessions, and authorization server unavailability. Happy-path testing is necessary but insufficient.
- “Local development with stdio transport does not need security testing.” Stdio transport shifts the trust boundary to the process level, but the server’s input validation and privilege enforcement should still be correct regardless of transport.
Check-your-understanding questions
- Why are tool arguments in MCP considered untrusted input even though they come from the LLM?
- What is the difference between a 401 and a 403 response, and when should each be returned in the MCP security context?
- How would you test for SSRF vulnerabilities in a tool that accepts URL arguments?
- Why is least privilege verification important even if all tools have correct input validation?
Check-your-understanding answers
- The LLM generates arguments based on prompt content, which includes user input and potentially injected instructions from documents, web pages, or other context sources. An attacker can craft prompt content that causes the LLM to generate malicious tool arguments (e.g., SQL injection strings). The server must validate arguments as if they came from an untrusted external source.
- A 401 (Unauthorized) means the request lacks valid credentials (no token, expired token, malformed token). A 403 (Forbidden) means the request has valid credentials but insufficient permissions (valid token but wrong scope). In MCP: 401 for missing/invalid authentication, 403 for insufficient scope or capability restrictions.
- Send tool arguments containing internal/private IP addresses (169.254.169.254, 10.x.x.x, 127.0.0.1), internal hostnames, and cloud metadata endpoints. Verify that the server either rejects these inputs or fetches them in a sandboxed environment that cannot exfiltrate the results. Also test URL schemes beyond HTTP (file://, gopher://) to ensure they are blocked.
- Even with perfect input validation, unnecessary capabilities increase the attack surface. If a server exposes the `resources` capability when only `tools` is needed, an attacker who discovers a vulnerability in the resource handling code can exploit it. Least privilege reduces the number of code paths that can be targeted.
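The SSRF answer above can be partially mechanized. Below is a Go sketch of a first-layer target check; `isForbiddenFetchTarget` is a hypothetical helper, and it deliberately omits DNS resolution and redirect re-validation, both of which a real check must perform:

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// isForbiddenFetchTarget reports whether a user-supplied URL points at a
// scheme or address an outbound-fetch tool should refuse.
func isForbiddenFetchTarget(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return true // unparseable input: reject
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return true // block file://, gopher://, and other schemes outright
	}
	ip := net.ParseIP(u.Hostname())
	if ip == nil {
		// Hostname: must be resolved and every resolved address
		// re-checked against this same rule (omitted in this sketch).
		return false
	}
	// Loopback, RFC 1918 private ranges, and link-local (which includes
	// the 169.254.169.254 cloud metadata endpoint) are all blocked.
	return ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast()
}

func main() {
	for _, u := range []string{
		"http://169.254.169.254/latest/meta-data/",
		"http://127.0.0.1:8080/admin",
		"file:///etc/passwd",
		"https://example.com/",
	} {
		fmt.Println(u, "forbidden:", isForbiddenFetchTarget(u))
	}
}
```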
Real-world applications
- PCI DSS compliance testing for payment processing integrations (penetration testing, access control verification).
- SOC 2 security control validation for SaaS platforms (access management, input validation, encryption in transit).
- AWS Well-Architected security pillar reviews (least privilege IAM, encryption, network controls).
- OWASP ASVS (Application Security Verification Standard) assessments for web applications.
Where you’ll apply it
- The security test module that runs auth flow, input validation, privilege, and transport tests.
- The injection payload library used for input validation probing.
- The security posture report with per-category verdicts and severity ratings.
- The CI gate that blocks deployment when CRITICAL security findings are present.
References
- MCP Specification - Authorization section (OAuth 2.1 requirements for HTTP transport)
- OWASP Testing Guide - systematic web application security testing methodology
- OWASP LLM Top 10 - security risks specific to LLM-integrated applications
- OAuth 2.1 specification (RFC draft) - authorization flow requirements including PKCE
- CWE (Common Weakness Enumeration) entries for injection (CWE-89), SSRF (CWE-918), and path traversal (CWE-22)
Key insights Security posture validation for MCP integrations must treat tool arguments as untrusted input, verify authorization at every boundary, enforce least privilege on capabilities, and test transport security; a single gap in any category can compromise the entire integration.
Summary Security Posture Validation for Integrations systematically tests four security dimensions of an MCP server: OAuth 2.1 authorization enforcement, input injection resistance, least privilege capability scoping, and transport layer security. The output is a security posture report with severity-rated findings that can gate deployment.
Homework/Exercises to practice the concept
- Design 5 input injection test cases for a tool named `execute_query` that takes a `sql` string argument. Include SQL injection, command injection (if the query is passed to a subprocess), and encoding attacks.
- Write a test specification for verifying OAuth 2.1 PKCE enforcement: what requests do you send, what do you expect, and what does failure look like?
- Explain the security implications of a tool annotated as `readOnlyHint: true` that actually writes to an audit log. Is this a security finding? What severity?
- Design a least privilege test for a server that declares only the `tools` capability but also responds to `prompts/list`. What is the expected behavior and what is the security risk?
Solutions to the homework/exercises
Strong solutions include: injection test cases with specific payloads (e.g., `' OR 1=1--`, `; cat /etc/passwd`, `%00` null byte), PKCE test specs that verify the server rejects authorization requests without `code_challenge` and rejects token requests with a wrong `code_verifier`, audit log analysis that classifies the `readOnlyHint` violation as LOW severity if the log is internal-only but MEDIUM if the log is accessible to other users (information disclosure), and least privilege tests that flag `prompts/list` responding successfully as a HIGH finding because it exposes an undeclared capability that could be exploited.
3. Project Specification
3.1 What You Will Build
A conformance test suite that validates MCP tool/resource behavior against declared contracts and policies.
3.2 Functional Requirements
- Read MCP manifest defining expected tools/resources and policies.
- Run protocol-level handshake and capability checks.
- Validate request/response payloads against declared schemas.
- Emit actionable report with failing contract ids.
3.3 Non-Functional Requirements
- Performance: Full conformance suite under 2 minutes for 20 endpoints.
- Reliability: Deterministic report ordering and stable test ids.
- Security/Policy: Auth boundary tests never leak privileged resource data.
3.4 Example Usage / Output
$ go run ./cmd/p17-mcp-verify --manifest fixtures/mcp-manifest.yaml --target http://localhost:8787/mcp --out out/p17
[INFO] Loaded manifest with 14 tools and 6 resources
[PASS] Handshake protocol checks: 12/12
[PASS] Tool schema conformance: 14/14
[PASS] Auth boundary checks: 9/9
[PASS] Safety rule checks: 11/11
[INFO] Conformance bundle: out/p17/conformance_report.json
3.5 Data Formats / Schemas / Protocols
- Manifest YAML: expected endpoints, schemas, and policy annotations.
- Report JSON: pass/fail per contract test with traces.
- Replay script JSONL for failed cases.
3.6 Edge Cases
- Tool exists but schema fields mismatch declared manifest.
- Server returns partial capability list under load.
- Auth token accepted for wrong privilege scope.
- Resource response shape drifts between versions.
3.7 Real World Outcome
This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.
3.7.1 How to Run (Copy/Paste)
$ go run ./cmd/p17-mcp-verify --manifest fixtures/mcp-manifest.yaml --target http://localhost:8787/mcp --out out/p17
- Working directory: `project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS`
- Required inputs: project fixtures under `fixtures/`
- Output directory: `out/p17`
3.7.2 Golden Path Demo (Deterministic)
Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.
3.7.3 Exact Terminal Transcript (CLI)
$ go run ./cmd/p17-mcp-verify --manifest fixtures/mcp-manifest.yaml --target http://localhost:8787/mcp --out out/p17
[INFO] Loaded manifest with 14 tools and 6 resources
[PASS] Handshake protocol checks: 12/12
[PASS] Tool schema conformance: 14/14
[PASS] Auth boundary checks: 9/9
[PASS] Safety rule checks: 11/11
[INFO] Conformance bundle: out/p17/conformance_report.json
$ echo $?
0
Failure demo:
$ go run ./cmd/p17-mcp-verify --manifest fixtures/mcp-manifest.yaml --target http://localhost:9999/mcp --out out/p17
[ERROR] Unable to connect to MCP endpoint: connection refused
[HINT] Start local MCP server with: npm run dev --workspace p17-mcp-server
$ echo $?
2
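The transcripts imply an exit-code contract: 0 for a clean run, 2 for a connection failure. One plausible mapping as a sketch (using 1 for conformance failures is an assumption the transcript does not show):

```go
package main

import "fmt"

// exitCode maps a run's outcome to a process exit status: 2 when the
// suite could not even reach the target, 1 when it ran but contracts
// failed, 0 on a clean pass.
func exitCode(connected bool, failedContracts int) int {
	switch {
	case !connected:
		return 2 // infrastructure error: suite never ran
	case failedContracts > 0:
		return 1 // suite ran, some contracts failed
	default:
		return 0
	}
}

func main() {
	fmt.Println(exitCode(true, 0), exitCode(true, 3), exitCode(false, 0)) // prints: 0 1 2
}
```

Distinguishing "could not run" from "ran and failed" lets CI retry the former without masking the latter.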
4. Solution Architecture
4.1 High-Level Design
User Input / Trigger
|
v
+-------------------------+
| Manifest Loader |
+-------------------------+
|
v
+-------------------------+
| Verifier Engine |
+-------------------------+
|
v
+-------------------------+
| Report Generator |
+-------------------------+
|
v
Artifacts / API / UI / Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Manifest Loader | Parses expected protocol contracts. | Manifest is the source of truth for conformance. |
| Verifier Engine | Runs handshake/schema/auth/safety checks. | Fail by contract id for precise triage. |
| Report Generator | Outputs machine-readable and human-friendly reports. | Support CI gating by severity. |
4.3 Data Structures (No Full Code)
P17_Request:
- trace_id
- input payload/context
- policy profile
P17_Decision:
- status (ALLOW | DENY | RETRY | ESCALATE | PROMOTE | ROLLBACK)
- reason_code
- artifact pointers
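A Go sketch of the two shapes above; field names and types are assumptions drawn from the bullet lists:

```go
package main

import "fmt"

// Status enumerates the decision outcomes listed above.
type Status string

const (
	Allow    Status = "ALLOW"
	Deny     Status = "DENY"
	Retry    Status = "RETRY"
	Escalate Status = "ESCALATE"
	Promote  Status = "PROMOTE"
	Rollback Status = "ROLLBACK"
)

// P17Request mirrors P17_Request: traceable input plus the policy
// profile that governs its evaluation.
type P17Request struct {
	TraceID       string
	Payload       map[string]any
	PolicyProfile string
}

// P17Decision mirrors P17_Decision: an outcome, a machine-readable
// reason code, and pointers to the artifacts that justify it.
type P17Decision struct {
	Status     Status
	ReasonCode string
	Artifacts  []string
}

func main() {
	d := P17Decision{
		Status:     Deny,
		ReasonCode: "AUTH_EXPIRED_TOKEN_ACCEPTED",
		Artifacts:  []string{"out/p17/replay_001.jsonl"},
	}
	fmt.Printf("%s %s %v\n", d.Status, d.ReasonCode, d.Artifacts)
}
```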
4.4 Algorithm Overview
Key algorithm: Policy-aware decision pipeline
- Normalize input and attach deterministic trace metadata.
- Run contract/schema validation and project-specific core checks.
- Apply policy gates and decide: success, retry, deny, escalate, or rollback.
- Persist artifacts and publish operational metrics.
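The steps above, combined with the fail-fast recommendation in section 5.11, can be sketched as an ordered chain that stops at the first failing gate (check names are illustrative):

```go
package main

import "fmt"

// namedCheck pairs a stable id with a predicate over some request context.
type namedCheck struct {
	ID  string
	Run func() (pass bool, reasonCode string)
}

// runPipeline executes checks in order and stops at the first failure,
// returning a terminal reason code. Fail-fast ordering means cheap,
// high-signal checks (normalization, schema, policy gates) run before
// expensive or side-effecting probes.
func runPipeline(checks []namedCheck) (status, reason string) {
	for _, c := range checks {
		if pass, rc := c.Run(); !pass {
			return "DENY", c.ID + ":" + rc
		}
	}
	return "ALLOW", ""
}

func main() {
	checks := []namedCheck{
		{"normalize", func() (bool, string) { return true, "" }},
		{"schema", func() (bool, string) { return true, "" }},
		{"policy", func() (bool, string) { return false, "SCOPE_EXCEEDED" }},
		{"probe", func() (bool, string) { return true, "" }}, // never reached
	}
	fmt.Println(runPipeline(checks)) // prints: DENY policy:SCOPE_EXCEEDED
}
```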
Complexity Analysis (conceptual):
- Time: O(n) over fixture/request items in a batch run.
- Space: O(n) for traces and report artifacts.
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies
# 2) Prepare fixtures under fixtures/
# 3) Run the project command(s) listed in section 3.7
5.2 Project Structure
p17/
├── src/
├── fixtures/
├── policies/
├── out/
└── README.md
5.3 The Core Question You’re Answering
“Do exposed MCP tools/resources truly conform to their declared contracts and safety rules?”
This question matters because it forces the project to produce objective evidence instead of relying on subjective prompt impressions.
5.4 Concepts You Must Understand First
- Protocol conformance testing
- Why does this concept matter for P17?
- Book Reference: “Computer Networks” by Tanenbaum - protocol validation mindset
- Schema and capability checks
- Why does this concept matter for P17?
- Book Reference: MCP specification + JSON schema validation
- Interoperability diagnostics
- Why does this concept matter for P17?
- Book Reference: Contract testing patterns
5.5 Questions to Guide Your Design
- Boundary and contracts
- What is the smallest safe contract surface for mcp contract verifier?
- Which failure reasons must be explicit and machine-readable?
- Runtime policy
- What is allowed automatically, what needs retry, and what must escalate?
- Which policy checks must happen before any side effect?
- Evidence and observability
- What traces/metrics are required for fast incident triage?
- What specific thresholds trigger rollback or human review?
5.6 Thinking Exercise
Pre-Mortem for MCP Contract Verifier
Before implementing, write down 10 ways this project can fail in production. Classify each failure into: contract, policy, security, or operations.
Questions to answer:
- Which failures can be prevented before runtime?
- Which failures require runtime detection and escalation?
5.7 The Interview Questions They’ll Ask
- “What is the difference between API tests and protocol conformance tests?”
- “How do you structure manifest-driven verification?”
- “Which failures should block release immediately?”
- “How do you make conformance reports useful to developers?”
- “How would you add backward-compatibility testing?”
5.8 Hints in Layers
Hint 1: Start from manifest truth Do not infer contracts from implementation behavior.
Hint 2: Test negative paths Conformance is proven by rejects as much as accepts.
Hint 3: Keep test ids stable Stable ids make trend analysis and CI gating easier.
Hint 4: Bundle replays Developers should replay failing cases quickly.
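One way to honor Hint 3 is to derive ids from a check's semantic coordinates rather than from execution order, so adding or reordering tests never renumbers existing ones. A sketch, with the `<category>.<case>-<hash>` scheme as an assumption:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// testID derives a stable identifier from the semantic coordinates of a
// check. The NUL separator prevents collisions between field boundaries
// (e.g. "a"+"bc" vs "ab"+"c").
func testID(category, contractID, caseName string) string {
	sum := sha256.Sum256([]byte(category + "\x00" + contractID + "\x00" + caseName))
	return fmt.Sprintf("%s.%s-%x", category, caseName, sum[:4])
}

func main() {
	fmt.Println(testID("auth", "tools/list", "no_token_access"))
	fmt.Println(testID("auth", "tools/list", "expired_token"))
}
```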
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Network protocol mindset | “Computer Networks” by Tanenbaum | Protocol chapters |
| Reliability checks | “Site Reliability Engineering” by Google | Testing/verification chapters |
| API contract evolution | “Building Microservices” by Sam Newman | Contract testing chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Define contracts, policy profiles, and deterministic fixtures.
- Build the core execution path and baseline artifact output.
- Checkpoint: One golden-path scenario runs end-to-end with trace id and artifact.
Phase 2: Core Functionality
- Add project-specific evaluation/routing/verification logic.
- Add error paths with unified reason codes.
- Checkpoint: Golden-path and one failure-path both behave deterministically.
Phase 3: Operational Hardening
- Add metrics, trend reporting, and release/rollback or escalation gates.
- Document runbook and incident/debug flow.
- Checkpoint: Team member can reproduce output from clean checkout.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Validation order | Late checks vs early checks | Early checks | Fail-fast saves cost and reduces unsafe execution |
| Failure handling | Silent retries vs explicit reason codes | Explicit reason codes | Enables automation and faster debugging |
| Rollout/escalation | Manual-only vs policy-driven | Policy-driven with manual override | Balances speed and safety |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate deterministic building blocks | schema checks, policy gates, parser behaviors |
| Integration Tests | Verify end-to-end project path | golden-path command/API flow |
| Edge Case Tests | Ensure robust failure handling | malformed fixture, blocked policy action |
6.2 Critical Test Cases
- Golden path succeeds and emits expected artifact shape.
- High-risk/invalid path returns deterministic error with reason code.
- Replay with same seed/config yields same decision summary.
6.3 Test Data
fixtures/golden_case.*
fixtures/failure_case.*
fixtures/edge_cases/*
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| “Everything passes locally but fails in CI” | Environment capability mismatch. | Pin server version and manifest in CI fixtures. |
| “Schema checks miss real breakage” | Only happy-path payloads are tested. | Add negative and boundary payload tests. |
| “Auth tests are flaky” | Token lifecycle not controlled in tests. | Use deterministic mock auth tokens. |
7.2 Debugging Strategies
- Re-run deterministic fixtures with fixed seed and compare trace ids.
- Diff latest artifacts against last known-good baseline.
- Isolate whether failure is contract, policy, or runtime dependency related.
7.3 Performance Traps
- Unbounded retries inflate latency and cost.
- Overly broad logging can slow hot paths.
- Missing cache/canonicalization can create avoidable compute churn.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one new fixture category and expected outcome labels.
- Add one new reason code with deterministic validation.
8.2 Intermediate Extensions
- Add dashboard-ready trend exports.
- Add automated regression diff against previous run artifacts.
8.3 Advanced Extensions
- Integrate with rollout gates or human approval workflows.
- Add chaos-style fault injection and recovery assertions.
9. Real-World Connections
9.1 Industry Applications
- PromptOps platform teams operating AI features under compliance constraints.
- Internal AI governance tooling for release safety and incident response.
9.2 Related Open Source Projects
- LangChain/LangSmith style eval and tracing workflows.
- OpenTelemetry-based observability stacks for decision traces.
9.3 Interview Relevance
- Demonstrates ability to convert probabilistic model behavior into deterministic software guarantees.
- Shows practical production-thinking: contracts, policies, monitoring, and operational controls.
10. Resources
10.1 Essential Reading
- OpenAI/Anthropic/Google provider docs for structured outputs, tool calling, and prompt controls.
- OWASP LLM Top 10 and NIST AI RMF guidance for safety and governance.
10.2 Video Resources
- Talks on LLM eval systems, PromptOps, and AI safety operations.
10.3 Tools & Documentation
- JSON schema validators, policy engines, and tracing infrastructure docs.
10.4 Related Projects in This Series
- Previous projects: build specialized primitives.
- Next projects: integrate these primitives into broader operational systems.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the core risk boundaries and policy gates for this project.
- I can explain the artifact format and why each field exists.
- I can justify the release/escalation criteria.
11.2 Implementation
- Golden-path and failure-path flows both work.
- Deterministic artifacts are produced and reproducible.
- Observability fields are present for debugging and audits.
11.3 Growth
- I can describe one tradeoff I made and why.
- I can explain this project design in an interview setting.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Golden path works with deterministic output artifact.
- At least one failure-path scenario returns unified error shape/reason code.
- Core metrics are emitted and documented.
Full Completion:
- Includes automated tests, trend reporting, and reproducible runbook.
- Includes operational thresholds for promote/rollback or escalate/approve.
Excellence (Above & Beyond):
- Integrates with adjacent projects (registry, rollout, firewall, HITL) cleanly.
- Demonstrates incident drill replay and fast root-cause workflow.