Project 2: Schema-First Contract Extractor
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | 3 |
| Time | 2 weeks |
| Main Stack | Elixir + req_llm + jido_action (optional) |
| Alternatives | JSON schema validators + direct APIs |
| Why Now | Removes brittle unstructured LLM output dependencies |
What You Will Build
An extraction service that converts free-form input (text, email, incident message, call transcript) into a typed contract using req_llm structured output, then emits a contract object your agents can execute deterministically.
Real World Outcome
Output from the extractor is auditable and machine-consumable:
$ mix run -e "ExtractorDemo.extract(\"incident\")"
[info] contract_type=incident_payload
[info] schema=IncidentSchema v2.1
[info] validation=ok
[info] result=%{
title: "Database backup failed",
severity: :high,
service: "postgres-prod",
step: :retry_pending,
evidence: ["pg_basebackup: broken pipe", "exit code 32"]
}
[info] extraction_ms=840 usage_cost=0.0021
The Core Question You Are Answering
“How do we force LLM output to match a business schema so downstream workflows stay deterministic?”
Why This Project Matters
Without structured contracts, every model update risks changing output shape. req_llm’s schema-based object mode lets you treat LLM responses like typed records, which is essential for automations, governance, and handoffs to Jido agents.
Minimal Example (Pseudocode)
schema = [title: ..., severity: ..., service: ..., step: ...]
{:ok, response} = ReqLLM.generate_object(model, prompt, schema, structured_mode: :json_schema)
contract = ReqLLM.Response.object(response)
if contract_is_valid?(contract, schema_version) then enqueue_to_agent(contract)
Diagram
Input Artifact
|
v
+------------------------+
| Prompt Normalizer |
| - strip noise |
| - normalize timestamps |
| - add extraction hints |
+-----------+------------+
|
v
+------------------------+
| req_llm.generate_object |
| schema + provider |
| options + fallback |
+-----------+------------+
|
v
+-----+-----------------+
| Contract Validation |
| schema version checks |
| compatibility gates |
+----+------------------+
|
valid|invalid
|
v v
Agent Queue Dead Letter
| |
v v
FSM / Next Steps Operator Alert
Deep Dive Steps
1. Define Versioned Contracts
- Start with 3 levels:
- v1: base fields only
- v1.1: optional metadata
- v2: required identifiers + severity enum
- Never change required fields in place; add optional fields first.
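The three version levels can be kept as plain data so that a breaking change is detectable in code. The sketch below is illustrative only: the module name `ContractVersions` and the field names (`:incident_id`, `:metadata`, and so on) are assumptions, not part of req_llm or the original project.

```elixir
defmodule ContractVersions do
  # Hypothetical field sets for each version level; replace with real fields.
  @schemas %{
    "v1" => %{required: [:title, :service], optional: []},
    "v1.1" => %{required: [:title, :service], optional: [:metadata]},
    "v2" => %{required: [:title, :service, :incident_id, :severity], optional: [:metadata]}
  }

  # Returns {:ok, schema} or :error for an unknown version.
  def fetch(version), do: Map.fetch(@schemas, version)

  # Fields that became required between two versions.
  # A non-empty list marks a breaking change for existing producers.
  def newly_required(from, to) do
    with {:ok, old} <- fetch(from), {:ok, new} <- fetch(to) do
      {:ok, new.required -- old.required}
    end
  end
end
```

Calling `ContractVersions.newly_required("v1", "v1.1")` yields an empty list under these assumed field sets, which is the property the "add optional fields first" rule is meant to preserve.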
2. Create Schema to Contract Map
- Create mapping for:
- extraction prompt templates
- required keys
- field normalization rules
- Keep default values explicit so missing fields do not surface as nil errors downstream.
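A minimal sketch of such a map, assuming a single `:incident_payload` contract type. The module name `ContractMap`, the template text, the default values, and the choice of normalizers are all hypothetical; only the three categories (prompt template, required keys, normalization rules) come from the step above.

```elixir
defmodule ContractMap do
  # Hypothetical entry for one contract type.
  def entry(:incident_payload) do
    %{
      prompt_template: "Extract an incident contract from the input below.",
      required: [:title, :severity, :service],
      # Explicit defaults, so absent optional fields never appear as nil.
      defaults: %{evidence: [], step: :unknown},
      normalizers: %{service: &String.downcase/1, title: &String.trim/1}
    }
  end

  # Merge defaults under the raw map, normalize string fields that are
  # present, then check that every required key exists.
  def prepare(type, raw) when is_map(raw) do
    %{defaults: defaults, normalizers: normalizers, required: required} = entry(type)
    merged = Map.merge(defaults, raw)

    normalized =
      Enum.reduce(normalizers, merged, fn {key, fun}, acc ->
        case acc do
          %{^key => value} when is_binary(value) -> Map.put(acc, key, fun.(value))
          _ -> acc
        end
      end)

    case Enum.reject(required, &Map.has_key?(normalized, &1)) do
      [] -> {:ok, normalized}
      missing -> {:error, {:missing_required, missing}}
    end
  end
end
```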
3. Use generate_object as Gate
- Always use structured-object mode from the first parse.
- Reject unparseable outputs and request a re-run with stricter constraints.
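The gate can be sketched without tying it to a specific provider call. Here `extract_fun` is a hypothetical one-argument function that wraps the structured-output request and returns `{:ok, map}` or `{:error, reason}`; the module name `ExtractionGate` and the attempt limit of 2 are also assumptions. The stricter suffix reuses the wording from Hint 4.

```elixir
defmodule ExtractionGate do
  @strict_suffix "\nNo assumptions, only schema fields from input evidence."

  # Runs the extractor; on any non-conforming result, re-runs with a
  # stricter prompt until max_attempts is reached.
  def run(prompt, extract_fun, max_attempts \\ 2) do
    attempt(prompt, extract_fun, 1, max_attempts)
  end

  defp attempt(prompt, extract_fun, n, max) do
    case extract_fun.(prompt) do
      {:ok, object} when is_map(object) ->
        {:ok, object, n}

      _other when n < max ->
        attempt(prompt <> @strict_suffix, extract_fun, n + 1, max)

      other ->
        # Out of attempts: hand the last result to the dead-letter lane.
        {:dead_letter, other, n}
    end
  end
end
```

Injecting the extractor as a function keeps the gate testable with stubs and leaves the transport logic untouched when the schema or prompt changes.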
4. Error Taxonomy
- Distinguish parse failures, schema violations, and semantic rule failures.
- Parse failures may retry the same prompt at a lower sampling temperature.
- Semantic failures should be manually triaged.
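One possible encoding of this taxonomy is a pair of pure functions: one that classifies a failure and one that picks the next step. The reason shapes (`:unparseable`, `{:missing_required, _}`, `{:rule_failed, _}`) and the module name are hypothetical, and routing schema violations to the dead-letter lane is an assumption, since the text above only specifies handling for parse and semantic failures.

```elixir
defmodule ExtractionError do
  # Map a failure tuple to one of the taxonomy classes.
  def classify({:error, :unparseable}), do: :parse_failure
  def classify({:error, {:missing_required, _fields}}), do: :schema_violation
  def classify({:error, {:invalid_type, _field}}), do: :schema_violation
  def classify({:error, {:rule_failed, _rule}}), do: :semantic_failure
  def classify(_other), do: :fatal

  # Decide what happens next for each class.
  def next_step(:parse_failure), do: {:retry, :lower_temperature}
  # Assumption: schema violations are not retried blindly.
  def next_step(:schema_violation), do: {:dead_letter, :schema_violation}
  def next_step(:semantic_failure), do: {:manual_triage, :semantic_failure}
  def next_step(:fatal), do: {:dead_letter, :fatal}
end
```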
5. Integration with Agentic Workflow
- Produce contract_type + payload + schema_version.
- Feed the contract into a jido-based orchestrator later (Project 6).
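The three-part output from step 5 can be modeled as a small struct. This is a sketch under assumptions: the module name `ContractEnvelope` and the choice of an atom for the type and a string for the version are illustrative, and nothing here depends on jido.

```elixir
defmodule ContractEnvelope do
  # All three parts are mandatory when the struct is built.
  @enforce_keys [:contract_type, :schema_version, :payload]
  defstruct [:contract_type, :schema_version, :payload]

  # Build an envelope only from well-typed parts.
  def new(contract_type, schema_version, payload)
      when is_atom(contract_type) and is_binary(schema_version) and is_map(payload) do
    {:ok,
     %__MODULE__{
       contract_type: contract_type,
       schema_version: schema_version,
       payload: payload
     }}
  end

  def new(_contract_type, _schema_version, _payload), do: {:error, :invalid_envelope}
end
```

Carrying `schema_version` alongside the payload lets a later consumer reject or route the contract without inspecting the payload itself.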
Concepts You Must Understand First
- Schema-driven interfaces
- Why optional vs required fields affect rollout safety.
- Structured object mode
- req_llm maps schemas into provider-specific JSON modes.
- Dead-letter strategy
- Every non-conforming extraction should have a recovery lane.
- Contract evolution
- How versioned schemas reduce breaking changes.
Questions to Guide Your Design
- Schema boundaries
- Which fields are stable in all inputs?
- Which fields must always be inferred from evidence?
- Retry strategy
- Are retries safe for each field type?
- Which errors should escalate directly?
- Downstream impact
- How does a change to one field's type alter downstream action selection?
Thinking Exercise
Given a noisy support email, hand-draw the exact schema fields it should produce.
Questions:
- Which value is most likely to be wrong first (service, severity, action)?
- Where would you detect ambiguity?
- How do you annotate uncertainty without blocking progress?
Interview Questions They Will Ask
- “Why is generate_object not just JSON mode?”
- “How do you validate semantic consistency after schema validation?”
- “What do you do with schema drift from provider to provider?”
- “How do you avoid silent data loss in extracted contracts?”
- “What is your DLQ behavior and why?”
Hints in Layers
Hint 1: Start with immutable schema contract
Write one schema file and freeze a v1 baseline.
Hint 2: Add explicit coercion layer
Handle enums and date parsing outside the model.
Hint 3: Keep extractor and validator independent
If schema changes, replace prompt and schema only; keep transport logic untouched.
Hint 4: Use reproducible fallback prompts
Retry with stricter constraints: “No assumptions, only schema fields from input evidence.”
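For Hint 2, a coercion layer might look like the sketch below. The set of severity levels is an assumption (the sample output only shows `:high`), as is the module name `Coercion`; `DateTime.from_iso8601/1` is from the Elixir standard library.

```elixir
defmodule Coercion do
  # Hypothetical severity enum; only :high appears in the sample output.
  @severities %{"low" => :low, "medium" => :medium, "high" => :high, "critical" => :critical}

  # Turn a model-produced string into a known atom, never String.to_atom/1.
  def severity(value) when is_binary(value) do
    case Map.fetch(@severities, value |> String.trim() |> String.downcase()) do
      {:ok, atom} -> {:ok, atom}
      :error -> {:error, {:invalid_enum, :severity, value}}
    end
  end

  def severity(value), do: {:error, {:invalid_enum, :severity, value}}

  # Parse ISO 8601 timestamps outside the model; the result is in UTC.
  def timestamp(value) when is_binary(value) do
    case DateTime.from_iso8601(value) do
      {:ok, datetime, _utc_offset} -> {:ok, datetime}
      {:error, reason} -> {:error, {:invalid_timestamp, reason}}
    end
  end
end
```

Looking values up in a fixed map avoids creating atoms from untrusted model output, and unknown values become explicit errors rather than silent defaults.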
Common Pitfalls and Debugging
- Problem: generate_object returns valid but semantically wrong output.
- Why: weak prompt contract.
- Fix: add schema examples and anti-ambiguity guardrails.
- Quick test: compare output against golden samples.
- Problem: Contract shape changes break downstream.
- Why: mixed schema versions in queue.
- Fix: enforce version compatibility at sink boundary.
- Quick test: inject synthetic v1 and v2 payloads.
- Problem: LLM retries hide systemic drift.
- Why: repeated retries without prompt guardrails.
- Fix: escalate with trace IDs after 2 failures.
- Quick test: assert one incident emits one dead-letter record.
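The version check at the sink boundary from the second pitfall can be sketched as a guard on the consumer side. The module name `SinkGate` and the supported-version list are assumptions; the contract is assumed to be a map or struct with a `schema_version` key.

```elixir
defmodule SinkGate do
  # Hypothetical: versions this consumer knows how to execute.
  @supported ["v1.1", "v2"]

  # Admit only contracts whose version the consumer supports.
  def admit(%{schema_version: version} = contract) when version in @supported,
    do: {:ok, contract}

  # Known shape, unsupported version: route to dead letter with a reason code.
  def admit(%{schema_version: version}),
    do: {:dead_letter, {:unsupported_version, version}}

  # No version at all is treated as malformed.
  def admit(_contract), do: {:dead_letter, :missing_schema_version}
end
```

Feeding this function synthetic `v1` and `v2` payloads is one way to run the quick test described for that pitfall.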
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Contracts and APIs | Designing Data-Intensive Applications | Data Encoding and Evolution |
| Practical Elixir | Elixir in Action | Error Handling in Actor Systems |
Definition of Done
- Contract schema versioning is in place and documented
- Extractor returns typed objects for all supported input classes
- Parse/semantic/fatal failures are observable and recoverable
- Dead-letter queue receives malformed cases with reason codes
- Extraction service is consumed by at least one agent flow
References
- https://hexdocs.pm/req_llm/1.5.1/overview.html