Project 2: Schema-First Contract Extractor

Quick Reference

Attribute	Value
Difficulty	3
Time	2 weeks
Main Stack	Elixir + req_llm + jido_action (optional)
Alternatives	JSON schema validators + direct APIs
Why Now	Removes brittle unstructured LLM output dependencies

What You Will Build

An extraction service that converts free-form input (text, email, incident message, call transcript) into a typed contract using req_llm structured output, then emits a contract object your agents can execute deterministically.

Real World Outcome

Output from the extractor is auditable and machine-consumable:

$ mix run -e "ExtractorDemo.extract(\"incident\")"
[info] contract_type=incident_payload
[info] schema=IncidentSchema v2.1
[info] validation=ok
[info] result=%{
  title: "Database backup failed",
  severity: :high,
  service: "postgres-prod",
  step: :retry_pending,
  evidence: ["pg_basebackup: broken pipe", "exit code 32"]
}
[info] extraction_ms=840 usage_cost=0.0021

The Core Question You Are Answering

“How do we force LLM output to match a business schema so downstream workflows stay deterministic?”

Why This Project Matters

Without structured contracts, every model update risks changing output shape. req_llm’s schema-based object mode lets you treat LLM responses like typed records, which is essential for automations, governance, and handoffs to Jido agents.

Minimal Example (Pseudocode)

schema = [title: ..., severity: ..., service: ..., step: ...]
# generate_object/4 returns {:ok, response}; the bang variant returns the object itself
{:ok, response} = ReqLLM.generate_object(model, prompt, schema, structured_mode: :json_schema)
contract = ReqLLM.Response.object(response)
if contract_is_valid?(contract, schema_version), do: enqueue_to_agent(contract)

Diagram

Input Artifact
                |
                v
+--------------------------------+
| Prompt Normalizer              |
| - strip noise                  |
| - normalize timestamps         |
| - add extraction hints         |
+---------------+----------------+
                |
                v
+--------------------------------+
| req_llm.generate_object        |
| schema + provider              |
| options + fallback             |
+---------------+----------------+
                |
                v
+--------------------------------+
| Contract Validation            |
| schema version checks          |
| compatibility gates            |
+-------+----------------+-------+
        |                |
      valid           invalid
        |                |
        v                v
   Agent Queue      Dead Letter
        |                |
        v                v
FSM / Next Steps  Operator Alert

Deep Dive Steps

1. Define Versioned Contracts

Start with three versions:
- v1: base fields only
- v1.1: optional metadata
- v2: required identifiers + severity enum
Never change required fields in place; add new fields as optional first.
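
One way to keep the three versions honest is to hold them as plain data and let a test check that a minor bump adds no new required fields. The sketch below assumes a hypothetical `ContractVersions` module and NimbleOptions-style keyword schemas; the `incident_id` and `metadata` field names are illustrative and not taken from the project.

```elixir
defmodule ContractVersions do
  @moduledoc "Hypothetical registry of versioned incident schemas."

  @v1 [
    title: [type: :string, required: true],
    service: [type: :string, required: true],
    severity: [type: :string, required: true]
  ]

  # v1.1 only adds an optional field, so v1 consumers keep working.
  @v1_1 @v1 ++ [metadata: [type: :map, required: false]]

  # v2 is a major bump: a new required identifier, and severity becomes an enum.
  @v2 Keyword.merge(@v1_1,
        incident_id: [type: :string, required: true],
        severity: [type: {:in, ["low", "medium", "high"]}, required: true]
      )

  def schema("1"), do: @v1
  def schema("1.1"), do: @v1_1
  def schema("2"), do: @v2

  @doc "Names of the fields a version marks as required."
  def required_keys(version) do
    for {key, opts} <- schema(version), opts[:required], do: key
  end

  @doc "True when `new` requires nothing that `old` did not already require."
  def backward_compatible?(old, new) do
    (required_keys(new) -- required_keys(old)) == []
  end
end
```

With these definitions, `ContractVersions.backward_compatible?("1", "1.1")` holds while the jump from `"1.1"` to `"2"` does not, which is the signal that v2 needs its own rollout plan.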

2. Create Schema to Contract Map

For each contract type, map:
- extraction prompt templates
- required keys
- field normalization rules
Keep default values explicit so missing fields do not spread as nil values downstream.
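
A minimal sketch of such a map, assuming a hypothetical `ContractMap` module. The `{{input}}` placeholder syntax, the `:triage_pending` default, and the two normalizers are illustrative choices, not part of req_llm.

```elixir
defmodule ContractMap do
  @moduledoc "Hypothetical map from contract type to prompt, keys, and rules."

  def fetch(:incident_payload) do
    %{
      prompt_template: "Extract an incident report from this text:\n{{input}}",
      required_keys: [:title, :severity, :service],
      # Explicit defaults: downstream code never has to guess what nil means.
      defaults: %{step: :triage_pending, evidence: []},
      normalizers: %{title: &String.trim/1, service: &String.downcase/1}
    }
  end

  @doc "Fills the template for one input artifact."
  def prompt(type, input) do
    String.replace(fetch(type).prompt_template, "{{input}}", input)
  end

  @doc "Applies defaults, then normalizers, then checks required keys."
  def to_contract(type, raw) when is_map(raw) do
    %{required_keys: required, defaults: defaults, normalizers: normalizers} = fetch(type)

    contract =
      defaults
      |> Map.merge(raw)
      |> Map.new(fn {key, value} ->
        normalize = Map.get(normalizers, key, &Function.identity/1)
        {key, normalize.(value)}
      end)

    case Enum.reject(required, &Map.has_key?(contract, &1)) do
      [] -> {:ok, contract}
      missing -> {:error, {:missing_keys, missing}}
    end
  end
end
```

Keeping the template, keys, defaults, and normalizers in one place means a schema change touches a single entry rather than several modules.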

3. Use `generate_object` as Gate

Use structured-object mode from the first call; do not parse free text and retrofit a schema afterwards.
Reject unparseable outputs and request a re-run with stricter constraints.
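
The reject-and-re-run rule can be sketched as a small gate. In this sketch, `generate` is an injected 1-arity function standing in for the req_llm call, so the gate can be exercised without a provider. `ExtractionGate`, the attempt budget of two, and the wording of the strict suffix are assumptions made for illustration.

```elixir
defmodule ExtractionGate do
  @moduledoc "Hypothetical gate: nothing passes unless it came back as an object."

  @strict_suffix "\nNo assumptions. Use only schema fields supported by the input."

  @doc """
  `generate` is any 1-arity function returning `{:ok, map}` or `{:error, reason}`.
  In a real service it would wrap the req_llm call.
  """
  def run(prompt, generate, attempts_left \\ 2)

  def run(_prompt, _generate, 0), do: {:error, :unparseable}

  def run(prompt, generate, attempts_left) do
    case generate.(prompt) do
      {:ok, object} when is_map(object) ->
        {:ok, object}

      _rejected ->
        # Re-run with a stricter instruction appended to the prompt.
        run(prompt <> @strict_suffix, generate, attempts_left - 1)
    end
  end
end

# Stand-in generator: fails on the plain prompt, succeeds on the strict one.
fake = fn prompt ->
  if String.contains?(prompt, "No assumptions"),
    do: {:ok, %{"title" => "Database backup failed"}},
    else: {:error, :bad_json}
end

ExtractionGate.run("Extract the incident.", fake)
```

Injecting the generator also keeps transport logic out of the gate, so the same function can be tested with canned replies.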

4. Error Taxonomy

Distinguish parse failures, schema violations, and semantic rule failures.
Parse failures may retry the same prompt at a lower temperature.
Semantic failures should go to manual triage.
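
A sketch of the three-way split, assuming a hypothetical `ExtractionError` module and a model reply with string keys. The semantic rule shown (a high-severity incident must carry evidence) is an invented example, and the handling for schema violations is an assumption, since the text only prescribes actions for parse and semantic failures.

```elixir
defmodule ExtractionError do
  @moduledoc "Hypothetical three-way classification of extraction failures."

  @severities ["low", "medium", "high"]
  @required ["title", "severity", "service"]

  @doc "Returns :ok or {:error, class, detail} for a raw model reply."
  def classify(reply) when not is_map(reply), do: {:error, :parse_failure, reply}

  def classify(reply) do
    missing = Enum.reject(@required, &Map.has_key?(reply, &1))

    cond do
      missing != [] ->
        {:error, :schema_violation, {:missing, missing}}

      reply["severity"] not in @severities ->
        {:error, :schema_violation, {:bad_enum, reply["severity"]}}

      # Semantic rule (illustrative): a high-severity incident needs evidence.
      reply["severity"] == "high" and Map.get(reply, "evidence", []) == [] ->
        {:error, :semantic_failure, :high_severity_without_evidence}

      true ->
        :ok
    end
  end

  @doc "Maps each failure class to a follow-up action."
  def action(:parse_failure), do: :retry_lower_temperature
  # Assumption: schema violations re-run with a stricter prompt.
  def action(:schema_violation), do: :retry_stricter_prompt
  def action(:semantic_failure), do: :manual_triage
end
```

Returning the class as a separate tuple element lets metrics and dead-letter reason codes key off it without parsing the detail.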

5. Integration with Agentic Workflow

Produce an envelope of contract_type + payload + schema_version.
Feed the contract into a jido-based orchestrator later (Project 6).
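
The three-part output can be sketched as a struct. `ContractEnvelope` and its `routing_key/1` helper are hypothetical names; the project does not prescribe a shape beyond the three fields.

```elixir
defmodule ContractEnvelope do
  @moduledoc "Hypothetical envelope handed to the agent queue."

  @enforce_keys [:contract_type, :payload, :schema_version]
  defstruct [:contract_type, :payload, :schema_version]

  @doc "Wraps a validated payload; refuses anything that is not a map."
  def wrap(contract_type, payload, schema_version)
      when is_atom(contract_type) and is_map(payload) and is_binary(schema_version) do
    {:ok,
     %__MODULE__{
       contract_type: contract_type,
       payload: payload,
       schema_version: schema_version
     }}
  end

  def wrap(_type, _payload, _version), do: {:error, :invalid_envelope}

  @doc "Routing key a consumer can match on without opening the payload."
  def routing_key(%__MODULE__{contract_type: type, schema_version: version}) do
    "#{type}.v#{version}"
  end
end
```

Carrying `schema_version` on the envelope rather than inside the payload lets a consumer reject incompatible versions before it touches any field.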

Concepts You Must Understand First

Schema-driven interfaces
- Why the choice between optional and required fields affects rollout safety.
Structured object mode
- req_llm maps schemas into provider-specific JSON modes.
Dead-letter strategy
- Every non-conforming extraction should have a recovery lane.
Contract evolution
- How versioned schemas reduce breaking changes.
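
As an illustration of the dead-letter idea, the sketch below gives every failed extraction a recovery lane: a bounded number of attempts, then exactly one record carrying a trace ID and the reasons seen. `DeadLetter`, the two-attempt budget, and the record fields are assumptions for illustration.

```elixir
defmodule DeadLetter do
  @moduledoc "Hypothetical recovery lane for extractions that never conform."

  @max_attempts 2

  @doc """
  Runs `attempt` (a 0-arity function returning `{:ok, contract}` or
  `{:error, reason}`) up to two times, then emits one dead-letter record.
  """
  def run(trace_id, input, attempt) do
    outcome =
      Enum.reduce_while(1..@max_attempts, [], fn _n, seen ->
        case attempt.() do
          {:ok, contract} -> {:halt, {:ok, contract}}
          {:error, reason} -> {:cont, [reason | seen]}
        end
      end)

    case outcome do
      {:ok, contract} ->
        {:ok, contract}

      failures when is_list(failures) ->
        # Keep the reasons in the order they occurred for the operator.
        {:dead_letter, %{trace_id: trace_id, input: input, reasons: Enum.reverse(failures)}}
    end
  end
end
```

Because the record keeps the original input, an operator can replay it once the prompt or schema has been fixed.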

Questions to Guide Your Design

Schema boundaries
- Which fields are stable in all inputs?
- Which fields must always be inferred from evidence?
Retry strategy
- Are retries safe for each field type?
- Which errors should escalate directly?
Downstream impact
- How does a change to one field's type alter downstream action selection?

Thinking Exercise

Given a noisy support email, sketch by hand the exact schema fields it should produce.

Questions:

- Which value is most likely to be wrong first (service, severity, action)?
- Where would you detect ambiguity?
- How do you annotate uncertainty without blocking progress?

Interview Questions They Will Ask

- “Why is generate_object not just JSON mode?”
- “How do you validate semantic consistency after schema validation?”
- “What do you do with schema drift from provider to provider?”
- “How do you avoid silent data loss in extracted contracts?”
- “What is your DLQ behavior and why?”

Hints in Layers

Hint 1: Start with an immutable schema contract. Write one schema file and freeze a v1 baseline.

Hint 2: Add an explicit coercion layer. Handle enums and date parsing outside the model.

Hint 3: Keep the extractor and validator independent. If the schema changes, replace the prompt and schema only; keep transport logic untouched.

Hint 4: Use reproducible fallback prompts. Retry with stricter constraints: “No assumptions, only schema fields from input evidence.”
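
For Hint 2, a coercion layer might look like the sketch below. `Coercion` is a hypothetical module, and the three severity levels are an assumption based on the `:high` value in the sample output. `DateTime.from_iso8601/1` is from the standard library and returns the timestamp converted to UTC.

```elixir
defmodule Coercion do
  @moduledoc "Hypothetical coercion layer applied after the model returns."

  # A fixed lookup avoids creating atoms from untrusted model output.
  @severities %{"low" => :low, "medium" => :medium, "high" => :high}

  @doc "Maps a free-form severity string onto the closed enum."
  def severity(value) when is_binary(value) do
    key = value |> String.trim() |> String.downcase()

    case Map.fetch(@severities, key) do
      {:ok, atom} -> {:ok, atom}
      :error -> {:error, {:unknown_severity, value}}
    end
  end

  @doc "Parses an ISO 8601 timestamp; the result is expressed in UTC."
  def timestamp(value) when is_binary(value) do
    case DateTime.from_iso8601(value) do
      {:ok, datetime, _offset} -> {:ok, datetime}
      {:error, reason} -> {:error, {:bad_timestamp, reason}}
    end
  end
end
```

An unknown severity such as `"urgent"` comes back as an error tuple rather than a guessed value, which keeps the decision with the validator instead of the model.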

Common Pitfalls and Debugging

Problem: generate_object returns output that is schema-valid but semantically wrong.
- Why: weak prompt contract.
- Fix: add schema examples and anti-ambiguity guardrails.
- Quick test: compare output against golden samples.
Problem: Contract shape changes break downstream consumers.
- Why: mixed schema versions in queue.
- Fix: enforce version compatibility at sink boundary.
- Quick test: inject synthetic v1 and v2 payloads.
Problem: LLM retries hide systemic drift.
- Why: repeated retries without prompt guardrails.
- Fix: escalate with trace IDs after 2 failures.
- Quick test: assert one incident emits one dead-letter record.
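
The sink-boundary fix for mixed schema versions can be sketched as a major-version gate. `SinkGate` is a hypothetical module, and the rule that minor bumps only add optional fields is an assumption carried over from the versioning step.

```elixir
defmodule SinkGate do
  @moduledoc "Hypothetical consumer-side check on schema_version."

  @doc """
  Accepts a payload when its major version matches the one the consumer was
  built for; minor bumps are assumed to add optional fields only.
  """
  def accept?(payload_version, supported_major) when is_binary(payload_version) do
    case Integer.parse(payload_version) do
      {^supported_major, rest} -> rest == "" or String.starts_with?(rest, ".")
      _other -> false
    end
  end

  @doc "Splits a mixed queue into deliverable and dead-lettered payloads."
  def route(payloads, supported_major) do
    Enum.split_with(payloads, &accept?(&1.schema_version, supported_major))
  end
end

# Quick test from the list above: inject synthetic v1 and v2 payloads.
queue = [%{schema_version: "1.1"}, %{schema_version: "2"}, %{schema_version: "2.1"}]
{deliver, dead_letter} = SinkGate.route(queue, 2)
```

Here `deliver` holds the two v2 payloads and `dead_letter` holds the v1.1 payload, so a stale producer shows up as dead-letter traffic instead of a crash inside the consumer.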

Books That Will Help

Topic	Book	Chapter
Contracts and APIs	Designing Data-Intensive Applications	Encoding and Evolution
Practical Elixir	Elixir in Action	Error Handling in Actor Systems

Definition of Done

- Contract schema versioning is in place and documented
- Extractor returns typed objects for all supported input classes
- Parse/semantic/fatal failures are observable and recoverable
- Dead-letter queue receives malformed cases with reason codes
- Extraction service is consumed by at least one agent flow

References

https://hexdocs.pm/req_llm/1.5.1/overview.html