ADR 015: Temporal as the Orchestration Substrate
Author: jomcgi
Status: Accepted
Created: 2026-05-30
Supersedes: 014 (AX + Substrate Agent Runtime)
Depends on: 016 (NATS as Canonical Event Stream), 017 (Domain Event Schema)
Problem
ADR 014 committed to AX + Substrate as the agent runtime stack. Two weeks of design work since then surfaced three concerns that argue against continuing on that path:
Maturity. AX is v0.1.0 (2026-05-20, "pausing external PRs while we stabilize core architecture") and Substrate is v0.0.0 (2026-05-19, literal initial commit, "APIs almost guaranteed to change"). Both projects are explicitly pre-stability; the homelab would be the production validation surface. ADR 014 accepted this risk, mitigating it by pinning to a SHA and budgeting half a day per month for churn. That is workable, but the actual blast radius is wider than that estimate when both projects are evolving simultaneously.
Multiplexing isn't load-bearing for current workloads. Substrate's headline feature is 30× actor oversubscription via pod multiplexing + sub-second snapshot/resume. This earns its keep when workloads are many long-lived idle actors (chat sessions, per-user agents). The current homelab workload is bounded-concurrency, actively-running gap-drain and cron jobs — none of which sit idle waiting for events. The multiplexing complexity buys nothing today.
The substrate abstraction we needed turns out to be smaller than two upstream projects. What we actually need is: (a) durable workflow execution with retries/heartbeats/replay; (b) cron scheduling unified with workflow execution; (c) horizontal worker scaling driven by queue depth; (d) workflow identity that survives pod rollover. Temporal provides all four out of the box, with eight years of production maturity, native Postgres backing, and a thoroughly debugged failure model.
The ADR 014 "delete a Go service, adopt two upstream projects" trade has the right instinct (less code we own) but the wrong substitution. We can delete the orchestrator + cluster_agents Go services without adopting pre-1.0 upstream projects, by leaning on a mature workflow engine.
Proposal
Adopt Temporal as the orchestration substrate for all agent workflows, cron-scheduled jobs, and long-running async work in the homelab. Deploy in-cluster via the official temporalio/helm-charts, backed by the existing CNPG Postgres cluster (separate database, same instance).
| Layer | ADR 014 (AX + Substrate) | This ADR |
|---|---|---|
| Workflow engine | AX (v0.1.0, pre-stability) | Temporal (Apache 2.0, production-mature, 2017→) |
| Dispatch | AX event log (gRPC, durable, resumable) | Temporal task queues (gRPC, durable, replayable) |
| Cron scheduling | Monolith register-routine-job → AX submit | Temporal ScheduleClient — first-class workflow with cron spec |
| Pod multiplexing | Substrate (ateapi / atelet / atecontroller / ateom-gvisor) | Plain k8s Deployments + KEDA — scale workers per task queue |
| Snapshotting | Substrate gVisor checkpoint/restore | Not implemented (workflow durability via Temporal event history) |
| Sandbox kernel | gVisor via Substrate | gVisor still available via security/003 RuntimeClass when needed |
| Persistence | AX event log in postgres + Substrate worker pool state | Single Postgres DB (Temporal's state) — no separate event log |
| Harnesses | Goose recipes, Claude CLI subprocess | Goose recipes, Claude CLI subprocess (unchanged) |
| Tool gateway | Context Forge | Context Forge (unchanged) |
The seam between domain code and orchestration shrinks too: workflows are first-class Python code in projects/monolith/monolith/orchestrator/, not a separate Go service with a gRPC adapter.
Architecture
graph TB
User[Joe / MCP Client] -->|MCP tool call| Mono
Webhook[External webhooks] -->|HTTPS| Mono
subgraph "Monolith (Control Plane)"
Mono[Monolith HTTP API]
Schedules[Schedules registered<br/>on monolith startup]
end
Mono -->|start_workflow| TemporalCluster
Schedules -->|create_schedule| TemporalCluster
subgraph "Temporal (own namespace)"
TemporalCluster[Temporal Cluster<br/>frontend · history · matching · worker · UI<br/>Postgres-backed]
end
TemporalCluster -->|task queue dispatch| WorkerPools
subgraph WorkerPools["Worker Pools (per task queue)"]
GD[gap-drain-worker]
WF[weather-fetch-worker]
HK[housekeeping-worker]
IB[iceberg-builder-worker]
end
KEDA[KEDA<br/>scale on queue depth] -->|min=0 max=N| WorkerPools
GD -->|Goose / Claude CLI<br/>per recipe| Harness[Existing harness<br/>preserved from ADR 014]
WF -->|HTTP / DB activities| External[External APIs / DBs]
HK -->|gardener / vault backup| External
IB -->|Iceberg writes| Storage[(SeaweedFS<br/>Iceberg warehouse)]
style TemporalCluster fill:#326CE5,color:#fff
style KEDA fill:#326CE5,color:#fff
style WorkerPools fill:#326CE5,color:#fff

Workflow identity = work identity
Workflows are keyed to the thing being done, not to the cron firing that scheduled them. Per-entity workflow IDs (e.g., gap-drain-{gap_id}) give native deduplication via Temporal's WorkflowAlreadyStartedError semantics. The same entity can be retriggered (manually, event-driven, or via cron sweep) without risk of double-execution.
flowchart TB
subgraph Triggers
EventDriven[Event-driven<br/>monolith publishes<br/>gap-ready event]
CronSweep[Cron sweep<br/>every 5min:<br/>find ready gaps<br/>without active workflows]
Manual[Manual<br/>Joe via UI/CLI]
end
EventDriven -->|start_workflow<br/>id=gap-drain-42| Temporal
CronSweep -->|start_workflow<br/>id=gap-drain-42| Temporal
Manual -->|start_workflow<br/>id=gap-drain-42| Temporal
Temporal{Already<br/>running?}
Temporal -->|yes| NoOp[WorkflowAlreadyStartedError<br/>silently swallowed]
Temporal -->|no| Schedule[Schedule on task queue]
Schedule --> Worker[gap-drain-worker pod<br/>picks up activity]
style NoOp fill:#fff3e0
style Schedule fill:#e8f5e9

Three independent trigger paths converge on the same deterministic workflow ID. Whichever fires first wins; subsequent attempts are silent no-ops. No coordination between triggers is needed.
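The convergence of triggers on one workflow ID can be sketched as follows. This is a minimal illustration, not the monolith's actual code: `FakeTemporalClient` and the local `WorkflowAlreadyStartedError` are stand-ins so the example runs without a Temporal server, and the workflow name `GapDrainWorkflow` and task queue name `gap-drain` are assumptions. In the real Python SDK the corresponding pieces would be `Client.start_workflow(..., id=..., task_queue=...)` and `temporalio.exceptions.WorkflowAlreadyStartedError`.

```python
import asyncio


class WorkflowAlreadyStartedError(Exception):
    """Local stand-in for temporalio.exceptions.WorkflowAlreadyStartedError."""


class FakeTemporalClient:
    """Hypothetical in-memory client that rejects duplicate running workflow IDs."""

    def __init__(self) -> None:
        self.running: set[str] = set()

    async def start_workflow(self, workflow: str, arg: int, *, id: str, task_queue: str) -> str:
        if id in self.running:
            raise WorkflowAlreadyStartedError(id)
        self.running.add(id)
        return id


def gap_drain_workflow_id(gap_id: int) -> str:
    """Derive the workflow ID from the entity, never from the trigger."""
    return f"gap-drain-{gap_id}"


async def trigger_gap_drain(client: FakeTemporalClient, gap_id: int) -> bool:
    """Start the workflow; return False if one is already running for this gap."""
    try:
        await client.start_workflow(
            "GapDrainWorkflow",  # hypothetical workflow name
            gap_id,
            id=gap_drain_workflow_id(gap_id),
            task_queue="gap-drain",  # hypothetical task queue name
        )
        return True
    except WorkflowAlreadyStartedError:
        # Another trigger won the race; treat this attempt as a no-op.
        return False


async def demo() -> list[bool]:
    client = FakeTemporalClient()
    # Event-driven, cron-sweep, and manual triggers all target gap 42.
    return [await trigger_gap_drain(client, 42) for _ in range(3)]


results = asyncio.run(demo())
print(results)
```

Only the first call starts a workflow; the other two return `False` without any coordination between callers, because the ID alone carries the identity of the work.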
Worker pools, not "the orchestrator"
There is no single "agent orchestrator" pod. Each task queue gets its own Deployment of long-lived worker pods that poll Temporal for activities to execute. KEDA scales each pool to queue depth (including scale-to-zero for bursty queues).
graph TB
subgraph "Temporal task queues"
Q1[gap-drain queue]
Q2[weather-fetch queue]
Q3[housekeeping queue]
Q4[iceberg-builder queue]
end
KEDA[KEDA scalers<br/>poll queue depth]
KEDA -.->|min=2 max=10| Q1
KEDA -.->|min=0 max=2| Q2
KEDA -.->|min=1 max=2| Q3
KEDA -.->|min=0 max=1| Q4
Q1 -->|poll| GD1[gap-drain pod 1]
Q1 --> GD2[gap-drain pod 2]
Q1 -.->|scale up<br/>on backlog| GD3[gap-drain pod N]
Q2 -.->|idle: 0 pods| Empty1[no pods]
Q2 -->|work arrives| WF[weather-fetch pod]
Q3 --> HK[housekeeping pod]
Q4 -.->|idle: 0 pods| Empty2[no pods]
style KEDA fill:#7CCE53,color:#000
style Empty1 fill:#f5f5f5,color:#999
style Empty2 fill:#f5f5f5,color:#999

Bursty queues (weather-fetch, iceberg-builder) scale to zero when idle and back up on demand. Persistent queues (gap-drain) keep a minimum number of replicas warm at all times. Each pool is tuned independently: KEDA's per-ScaledObject configuration means we don't pay for warm capacity we don't need.
This replaces ADR 014's split-roles plan (monolith → AX → Substrate → warm pool). The seam is one layer (workflow code in monolith's Python package → Temporal task queue → worker pods running activities) instead of three.
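The per-pool scaling behaviour described above can be approximated with a small calculation. This is a rough sketch of the usual "backlog divided by target, clamped to bounds" rule, not KEDA's exact algorithm (KEDA handles activation from zero itself and hands scaling between one and N replicas to the Kubernetes HPA). The min/max bounds come from the diagram; the `target_per_replica` value of 5 is an assumption chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolLimits:
    min_replicas: int
    max_replicas: int


# Bounds taken from the task-queue diagram in this section.
POOLS = {
    "gap-drain": PoolLimits(2, 10),
    "weather-fetch": PoolLimits(0, 2),
    "housekeeping": PoolLimits(1, 2),
    "iceberg-builder": PoolLimits(0, 1),
}


def desired_replicas(backlog: int, limits: PoolLimits, target_per_replica: int = 5) -> int:
    """Return ceil(backlog / target), clamped to the pool's [min, max] bounds."""
    wanted = math.ceil(backlog / target_per_replica)
    return max(limits.min_replicas, min(limits.max_replicas, wanted))


for name, backlog in [("weather-fetch", 0), ("weather-fetch", 3), ("gap-drain", 0), ("gap-drain", 120)]:
    print(name, backlog, desired_replicas(backlog, POOLS[name]))
```

An idle weather-fetch queue yields zero replicas, while an idle gap-drain queue still yields its warm minimum of two, which is the distinction between bursty and persistent pools.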
Scheduling unified with execution
Cron-style schedules (gardener, calendar poll, vault backup, weather fetch, Iceberg batch commit, Iceberg builder) become Temporal Schedules, registered idempotently from monolith startup. Same Postgres persistence, same retry policy, same UI as event-triggered workflows. The monolith's per-domain scheduler module shrinks to "register Temporal Schedules from a SCHEDULES list at boot."
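A minimal sketch of what "register from a SCHEDULES list at boot" could look like. Everything here is illustrative: `ScheduleDef`, `InMemoryScheduleStore`, the schedule IDs, the cron expressions, and the workflow names are hypothetical, and the in-memory store stands in for the Temporal schedule API so the example runs on its own. The point is the idempotency rule: running registration twice must not create duplicates or churn unchanged schedules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduleDef:
    schedule_id: str
    cron: str
    workflow: str
    task_queue: str


# Hypothetical entries: IDs, cron specs, and workflow names are illustrative only.
SCHEDULES = [
    ScheduleDef("weather-fetch", "*/30 * * * *", "WeatherFetchWorkflow", "weather-fetch"),
    ScheduleDef("vault-backup", "0 3 * * *", "VaultBackupWorkflow", "housekeeping"),
]


class InMemoryScheduleStore:
    """Hypothetical stand-in for the schedule API, keyed by schedule ID."""

    def __init__(self) -> None:
        self.schedules: dict[str, ScheduleDef] = {}

    def upsert(self, definition: ScheduleDef) -> None:
        self.schedules[definition.schedule_id] = definition


def register_schedules(store: InMemoryScheduleStore, definitions: list[ScheduleDef]) -> dict[str, str]:
    """Create missing schedules, update changed ones, and leave identical ones alone."""
    outcome: dict[str, str] = {}
    for definition in definitions:
        current = store.schedules.get(definition.schedule_id)
        if current == definition:
            outcome[definition.schedule_id] = "unchanged"
            continue
        store.upsert(definition)
        outcome[definition.schedule_id] = "created" if current is None else "updated"
    return outcome


store = InMemoryScheduleStore()
first_boot = register_schedules(store, SCHEDULES)
second_boot = register_schedules(store, SCHEDULES)
print(first_boot)
print(second_boot)
```

Against a real cluster the same loop would call the SDK's schedule creation and update operations instead of `upsert`; the comparison step is what keeps a monolith restart from being an event.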
What gets deleted
| Today / per ADR 014 plan | After |
|---|---|
| projects/agent_platform/orchestrator/ Go service | Deleted (was already slated for deletion by ADR 014) |
| projects/agent_platform/cluster_agents/ Go service + five autonomous loops | Deleted; each loop becomes a Temporal Schedule + workflow |
| Custom claim / heartbeat / reaper / DLQ machinery in monolith | Deleted — Temporal's native primitives replace |
| Per-domain advisory lock dance for any work coordination | Deleted — Temporal task queues handle worker coordination |
| AX deployment manifests + gRPC adapter (not yet shipped, planned by ADR 014) | Not built |
| Substrate deployment + atecontroller + atelet daemonset (planned by ADR 014) | Not built |
| agent-sandbox warm-pool chart (per ADR 014's transition plan) | Stays for now; revisit if Temporal-based worker pools cover all use cases |
What does NOT change
- Goose recipes in projects/agent_platform/goose_agent/image/recipes/: preserved as agent behaviour per ADR 010
- Claude CLI subprocess pattern: preserved per feedback_claude_cli_subprocess_for_tos.md; ToS-compliant under Claude Max subscription
- Context Forge as the MCP gateway per ADR 003
- vLLM + Qwen3.5 hybrid: inference plane unchanged; orchestration is orthogonal to inference choice
- MCP surface (monolith-agent-* tools): external interface preserved; internal implementation migrates to Temporal SDK calls
- gVisor availability: security/003 RuntimeClass remains available for any workload that needs strong isolation; Temporal worker pods can be tagged with it per Deployment
Worker pod lifecycle (the property that matters)
Temporal handles workflow lifecycle so completely that pod lifecycle becomes a separate, simpler concern:
sequenceDiagram
participant T as Temporal
participant W1 as Worker pod 1
participant W2 as Worker pod 2
participant A as Activity
T->>W1: Schedule activity X
W1->>A: Execute activity X
A->>W1: heartbeat (every 5min)
W1->>T: heartbeat → extend lease
Note over W1: Pod evicted /<br/>OOM killed / rolled
W1--xT: heartbeat stops
T->>T: Heartbeat timeout expires
T->>T: Mark activity for retry
T->>W2: Schedule activity X (retry)
W2->>A: Execute activity X (new attempt)
A->>W2: heartbeat
W2->>T: heartbeat → extend lease
A->>W2: Complete
W2->>T: Activity result
T->>T: Workflow continues

- Workflow durability: workflow state survives any worker pod crash (event history is the source of truth, replayed on recovery)
- Activity heartbeats: detect dead workers; reassign in-flight work to live workers
- Worker pool: managed by k8s Deployment + KEDA; rolls happen on chart updates, scaling happens on queue depth
- Result: kill any worker pod at any time, workflows keep working
This eliminates the entire class of "what happens if the orchestrator pod restarts mid-job?" concerns that ADR 008 was bug-fixing. Those resilience patterns are now Temporal's responsibility, not ours.
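The heartbeat-and-retry sequence above works best when an activity records progress in its heartbeats, so a retry on another pod can resume rather than start over. The sketch below simulates that with plain callables: `drain_items`, `PodKilled`, and the item list are hypothetical, and the `heartbeat` and `resume_from` parameters stand in for what the Python SDK exposes as `activity.heartbeat(...)` and `activity.info().heartbeat_details`. It assumes each item is safe to reprocess if the last heartbeat was not delivered.

```python
from typing import Callable, Optional, Sequence


class PodKilled(Exception):
    """Simulates the worker pod dying mid-activity."""


def drain_items(
    items: Sequence[str],
    handle: Callable[[str], None],
    heartbeat: Callable[[int], None],
    resume_from: Optional[int] = None,
) -> int:
    """Process items in order, heartbeating the index of the next item to process."""
    start = resume_from or 0
    for index in range(start, len(items)):
        handle(items[index])
        heartbeat(index + 1)  # progress marker that a retry can resume from
    return len(items) - start


items = ["a", "b", "c", "d"]
processed: list[str] = []
beats: list[int] = []


def handle_then_die(item: str) -> None:
    # First attempt: the pod is evicted while working on item "c".
    if item == "c":
        raise PodKilled()
    processed.append(item)


try:
    drain_items(items, handle_then_die, beats.append)
except PodKilled:
    pass

# Retry on another worker resumes from the last recorded heartbeat.
last_progress = beats[-1] if beats else None
handled_on_retry = drain_items(items, processed.append, beats.append, resume_from=last_progress)
print(processed, handled_on_retry)
```

The second attempt handles only the two remaining items, which is the behaviour that makes "kill any worker pod at any time" cheap rather than merely safe.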
Security
- Temporal is internal-only. The frontend gRPC API is reachable only from monolith pods and worker pods within the cluster. No Cloudflare exposure for the Temporal API. UI is exposed via the existing internal-only ingress for admin use.
- Worker pod identity uses dedicated ServiceAccounts per task queue, with RBAC scoped to that workload's needs (e.g., gap-drain worker needs Postgres read access for KG; iceberg-builder worker needs SeaweedFS write).
- Authentication into Temporal uses internal mTLS (Linkerd-managed) plus Temporal's namespace authorization. No external clients.
- Credentials (Anthropic API key, NIM API key, S3 credentials) injected via 1Password Operator into worker pods at deploy time.
- No new ingress introduced by this ADR. All external clients continue through monolith → Cloudflare → CF Tunnel.
- gVisor isolation remains available via security/003 RuntimeClass for workloads that need it (e.g., agents executing LLM-generated code). Temporal worker pods can be selectively tagged with the runsc runtime class via Deployment spec — no platform-wide change required.
- Workflow history visibility: per-namespace authorization in Temporal allows scoping who can query/replay/cancel workflows. For homelab single-user, namespace=default is sufficient.
See docs/security.md for baseline. No deviations introduced.
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Temporal becomes critical-path infrastructure | Certain | High | Production-mature project; Helm chart well-tested; Postgres backing reuses existing CNPG cluster. Operational burden lower than AX+Substrate would have been. |
| Workflow code determinism violations (non-deterministic constructs in workflow) | Medium | Medium | Temporal SDK enforces; documented patterns ("I/O in activities, computation in workflows"); CI lint can catch common cases |
| Per-entity workflow ID collisions on retry | Low | Low | WorkflowIDReusePolicy.ALLOW_DUPLICATE_FAILED_ONLY semantics handle natively; cron sweep is idempotent via WorkflowAlreadyStartedError |
| Activity retry storms on external dependency outages | Medium | Medium | Exponential backoff with high maximum_interval (1h+); next_retry_delay from Retry-After headers; circuit-breaker fallback patterns |
| Temporal version upgrades require schema migrations | Low | Medium | Helm chart handles automatically; pg_dump before major version bumps as belt-and-suspenders |
| Worker pool resource sizing miscalibrated | Medium | Low | KEDA scales to demand; per-task-queue Deployments allow per-workload tuning; in-place vertical scaling available for long-lived workers |
| Cron schedule drift if monolith fails to register schedules on startup | Low | Low | Schedule registration is idempotent; missed run causes one delayed execution, not silent loss |
| Loss of Substrate's snapshot/resume capability for future workloads | Medium | Low | If/when a workload genuinely needs multiplexing, can integrate Substrate (or Modal/E2B/Daytona) as a substrate adapter at that point |
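The retry-storm mitigation in the table (exponential backoff with a high maximum interval, overridden by Retry-After when the dependency supplies one) can be sketched as a pure function. The parameter names mirror the fields of Temporal's retry policy (`initial_interval`, `backoff_coefficient`, `maximum_interval`), but the function itself, its defaults, and the `retry_after` argument are assumptions for illustration; in practice the server computes the delay and the activity would pass the Retry-After hint along with its failure.

```python
from datetime import timedelta
from typing import Optional


def next_retry_delay(
    attempt: int,
    initial_interval: timedelta = timedelta(seconds=1),
    backoff_coefficient: float = 2.0,
    maximum_interval: timedelta = timedelta(hours=1),
    retry_after: Optional[timedelta] = None,
) -> timedelta:
    """Delay before the retry that follows failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        # An explicit hint from the dependency overrides the computed backoff.
        return retry_after
    try:
        delay = initial_interval * (backoff_coefficient ** (attempt - 1))
    except OverflowError:
        # Very high attempt counts overflow float or timedelta; the cap applies anyway.
        return maximum_interval
    return min(delay, maximum_interval)


for attempt in (1, 2, 5, 13):
    print(attempt, next_retry_delay(attempt))
```

With these defaults the delay doubles from one second and reaches the one-hour cap at the thirteenth attempt, so a day-long outage costs a few dozen retries per activity rather than thousands.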
Open Questions
These are questions to answer during execution, not gates that block the decision.
- Should Temporal's temporal and temporal_visibility databases share the existing CNPG cluster (same instance, different DBs) or get their own instance? For homelab scale, shared is fine; revisit if Temporal throughput pressures the existing cluster.
- Does Temporal's standard visibility (Postgres-backed) suffice, or do we need Elasticsearch-backed advanced visibility? Default to standard; revisit if cross-workflow search queries become important.
- What retention policy for completed workflow histories? Default 7-14 days for most workflows; longer for audit-relevant ones (e.g., KG mutations).
- Goose recipe ergonomics inside Temporal activities: does the existing goose run invocation pattern fit cleanly inside an activity, or do we need a small adapter? Probably fine; verify with first migrated workflow.
- Per-workflow OTel trace correlation through Temporal SDK: does the Python SDK auto-propagate trace context to activities? If not, manual propagation pattern needed early.
- KEDA's Temporal scaler queue-depth metric: does it work cleanly with very low absolute volumes (single-digit messages)? Probably yes; verify before relying on scale-to-zero behavior.
References
| Resource | Relevance |
|---|---|
| Temporal | The workflow engine being adopted |
| temporalio/helm-charts | Deployment mechanism |
| KEDA Temporal scaler | Queue-depth-driven worker pool scaling |
| 014 — AX + Substrate Agent Runtime | Superseded by this ADR |
| 007 — Agent Run Orchestration Service | Dispatch plumbing already retired by ADR 014; this ADR confirms |
| 008 — Cluster Patrol Loop Resilience | Autonomous-loop plumbing already retired by ADR 014; this ADR confirms |
| 010 — Recipe-Driven Agent Registry | Goose recipes preserved as agent behaviour |
| 003 — Context Forge | MCP gateway, unchanged |
| 016 — NATS as Canonical Event Stream | Companion: events between system components |
| 017 — Domain Event Schema | Companion: shape of events flowing through NATS |
| platform/004 — Iceberg Lakehouse + Hot-Swap Quack Serving | Companion: storage architecture |