Wire protocol
The HTTP contract between the Durablex engine and a runner.
You don't need this to build workflows - the SDK handles all of it. This reference is for people building or porting an SDK, or integrating with the engine at the protocol level.
This is the contract between the engine and a runner (your code plus an SDK). It covers the five step operations - running a step, sleeping, waiting for an event, invoking a child workflow, and emitting an event - over HTTP. More transports are planned; the shapes here are designed to grow without breaking.
How the engine drives a workflow
The engine never holds your workflow in memory. It makes progress by calling your runner's
/invoke endpoint, once per step, sending along the results of every step that has already
completed. Your handler runs from the top each time: completed steps return their saved result, and
the first unfinished step does real work and reports back. The engine saves that result and calls
again, until the handler returns.
This is why a runner is stateless and a run survives an engine restart: all progress lives in the engine's store and is replayed to the runner on each call.
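The replay loop described above can be sketched in a few lines. This is a minimal, synchronous illustration and not the SDK's implementation: `invokeOnce`, `drive`, `StepDone`, and the in-memory `Memo` are hypothetical names, step ids are left unhashed, and a real runner would be asynchronous and speak HTTP.

```typescript
// Saved step results keyed by step id (the real protocol keys by a hashed id).
type Memo = Map<string, { data: unknown }>;
type Pass =
  | { status: 200; data: unknown }
  | { status: 206; opcode: { id: string; data: unknown } };

// Thrown to unwind the handler once the first unfinished step has done its work.
class StepDone {
  id: string;
  data: unknown;
  constructor(id: string, data: unknown) {
    this.id = id;
    this.data = data;
  }
}

interface Step {
  run<T>(id: string, fn: () => T): T;
}
type Handler = (step: Step) => unknown;

// Runner side: one pass. Completed steps return their saved result;
// the first unfinished step does real work and is reported back.
function invokeOnce(handler: Handler, memo: Memo): Pass {
  const step: Step = {
    run<T>(id: string, fn: () => T): T {
      const saved = memo.get(id);
      if (saved) return saved.data as T;
      throw new StepDone(id, fn());
    },
  };
  try {
    return { status: 200, data: handler(step) };
  } catch (e) {
    if (e instanceof StepDone) {
      return { status: 206, opcode: { id: e.id, data: e.data } };
    }
    throw e;
  }
}

// Engine side: save each reported result and call again until the handler returns.
function drive(handler: Handler): { result: unknown; passes: number } {
  const memo: Memo = new Map();
  let passes = 0;
  for (;;) {
    passes++;
    const pass = invokeOnce(handler, memo);
    if (pass.status === 200) return { result: pass.data, passes };
    memo.set(pass.opcode.id, { data: pass.opcode.data });
  }
}

let chargeCalls = 0;
const handler: Handler = (step) => {
  const charge = step.run("charge", () => {
    chargeCalls++;
    return { chargeId: "ch_A1" };
  });
  return step.run("ship", () => `shipped:${charge.chargeId}`);
};

const outcome = drive(handler);
```

With two steps the handler runs three times (two 206 passes, then the 200), yet the body of `charge` executes only once because later passes read it from the memo.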
Endpoints
| Method + path | On | Purpose |
|---|---|---|
| POST /register | engine | Runner announces its app, optional stable runner id, workflows, and invoke URL. |
| POST /events | engine | Ingest an event: resume any waitForEvent waiters and fan out to every workflow whose triggers match (optionally pinned to a runner). Returns 202 {runId?, woke, triggered}. |
| GET /runs, GET /runs/{id}, GET /runs/{id}/steps | engine | Inspect runs and steps. |
| GET /workflows | engine | List registered workflow definitions (name, app, retry policy, and scheduled + schedules[] for cron workflows). |
| GET /runners | engine | List registered runners (id, app, url, runtime/version, last-seen, live). Filter ?app=. Backs the console's connected-runners view. |
| GET /events, GET /events/{id} | engine | The event log: each ingested event with what it triggered/woke. Filter ?app=&name=&limit=, newest first. |
| GET /events/stream | engine | Live tail of ingested events as Server-Sent Events (text/event-stream). |
| POST /invoke | runner | The engine drives one pass. Returns 200 (done) or 206 (more work). |
| GET /connect | engine | WebSocket upgrade for the Connect transport. A runner with no inbound URL dials this, registers, and receives invokes over the socket. |
The runner's invoke URL is whatever it advertises in /register - unless it connects over the Connect
transport (see below), in which case it has no inbound URL at all.
Messages
POST /register (runner → engine)
```json
{
  "app": "order-app",
  "runner": "node-7",
  "url": "http://localhost:6773/invoke",
  "runtime": "bun",
  "language": "typescript",
  "version": "0.1.0",
  "protocolVersion": 1,
  "workflows": [
    {
      "name": "order.created",
      "retry": { "maxAttempts": 3 },
      "concurrency": { "limit": 5, "key": "accountId" }
    }
  ]
}
```

| Field | Type | Description |
|---|---|---|
| app | string | App the runner serves. |
| runner | string (optional) | Stable runner id; omitted → the engine keys the endpoint by url. |
| url | string | The runner's /invoke endpoint. |
| runtime / language / version | string (optional) | Handshake metadata describing the runner (e.g. bun / typescript / the SDK version), surfaced in the console's connected-runners view. The SDK sends them automatically. |
| protocolVersion | number (optional) | The wire version the runner speaks (see Protocol version). Omitted → assumed compatible. |
| workflows | object[] | Each entry has a name plus optional triggers, retry, onFailure (true if the workflow has an onFailure handler), and flow-control fields (see Flow Control). |
| workflows[].triggers | object[] (optional) | What starts the workflow: event triggers { event, if? } (event may end in a * wildcard; if is a CEL filter) and cron triggers { cron }. Omitted → an implicit event trigger on the workflow name. See Triggers. |
POST /events (caller → engine)
```json
{ "name": "order.created", "app": "order-app", "runner": "node-7", "dedupeId": "evt-A1", "data": { "orderId": "A1" } }
```

| Field | Type | Description |
|---|---|---|
| name | string | Event name; matched against every workflow's event triggers (exact or * wildcard) and resumes awaiting waitForEvent steps. |
| app | string | Target app. |
| runner | string (optional) | Present → pin the run to that runner; absent → anycast. |
| dedupeId | string (optional) | A repeat of the same id (per app) within 24h is dropped entirely - no waiters woken, no fan-out, no new log row - so an at-least-once caller can safely retry. The response is 202 { "deduped": true }. |
| data | JSON | Event payload. Any valid JSON (object, array, or scalar); may be omitted. |
name is required and must be non-blank. name, app, runner, targetApp, and dedupeId are each length-bounded;
data, when present, must be valid JSON. A request that violates any of these is rejected with 400.
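The dedupeId rule (a repeat of the same id, per app, within 24h is dropped) can be pictured as a small window check. The sketch below is illustrative only: `DedupeWindow` is a hypothetical in-memory class, and measuring the 24h window from the first sighting is an assumption the source does not spell out.

```typescript
const DEDUPE_WINDOW_MS = 24 * 60 * 60 * 1000; // 24h, per the dedupeId description

// Hypothetical in-memory stand-in for the engine's dedupe bookkeeping.
class DedupeWindow {
  private firstSeen = new Map<string, number>();

  // Returns true when the event should be dropped as a duplicate.
  isDuplicate(app: string, dedupeId: string | undefined, nowMs: number): boolean {
    if (dedupeId === undefined) return false; // no id: never deduped
    const key = JSON.stringify([app, dedupeId]); // ids are scoped per app
    const seenAt = this.firstSeen.get(key);
    if (seenAt !== undefined && nowMs - seenAt < DEDUPE_WINDOW_MS) return true;
    // Assumption: the window starts at the first sighting and restarts once it lapses.
    this.firstSeen.set(key, nowMs);
    return false;
  }
}

const dedupe = new DedupeWindow();
const t0 = 1_718_900_000_000;
const first = dedupe.isDuplicate("order-app", "evt-A1", t0);
const retry = dedupe.isDuplicate("order-app", "evt-A1", t0 + 5_000);
const otherApp = dedupe.isDuplicate("billing-app", "evt-A1", t0 + 5_000);
const nextDay = dedupe.isDuplicate("order-app", "evt-A1", t0 + DEDUPE_WINDOW_MS);
```

Here only `retry` is a duplicate: the same id under another app, or after the window has lapsed, is ingested normally.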
The 202 response reports what the event did:
```json
{ "runId": "01H...", "woke": 0, "skipped": false, "dropped": false, "debounced": false, "batched": false, "deduped": false }
```

| Field | Type | Description |
|---|---|---|
| runId | string | Set if the event triggered a run; mirrors the first matched workflow (sorted by name) for the common single-match case. |
| woke | number | waitForEvent runs resumed (broadcast). |
| skipped | boolean | A singleton skip policy dropped the trigger. |
| dropped | boolean | A rateLimit policy shed the trigger. |
| debounced | boolean | Coalesced into a debounce buffer. |
| batched | boolean | Buffered into a batch. |
| deduped | boolean | The event was a duplicate dedupeId, or a workflow's idempotency key was already seen in its window. |
| triggered | object[] | One entry per workflow the event matched (event triggers can fan out); each has workflow, runId?, and the gate booleans. |
POST /invoke (engine → runner)
```json
{
  "event": { "name": "order.created", "data": { "orderId": "A1" } },
  "steps": { "9f2b8c...": { "data": { "chargeId": "ch_A1" } } },
  "ctx": { "runId": "01H...", "workflow": "fulfillment", "attempt": 1, "app": "order-app", "runner": "node-7" }
}
```

| Field | Type | Description |
|---|---|---|
| event | object | The triggering event (name, data); name is informational and may differ from the workflow. |
| steps | object | Memo map: hashed step id → its saved state. |
| ctx.runId | string | The run being replayed. |
| ctx.workflow | string | The dispatch key: the registered workflow name the runner routes to (distinct from event.name). |
| ctx.attempt | number | Run-level attempt counter. |
| ctx.app | string | The run's app. |
| ctx.runner | string | The run's pin; "" for an anycast run. |
| ctx.onFailure | boolean (optional) | Set on an onFailure invocation; the runner dispatches to the workflow's onFailure handler. |
| ctx.error | StepError (optional) | The terminal error, present only when ctx.onFailure is set. |
| ctx.traceparent | string (optional) | W3C trace context of the engine's invoke span; the runner extracts it to nest its spans in the run's distributed trace. Rides the body (not a header) so it propagates the same over HTTP and the Connect WebSocket. |
Each steps entry carries one of:
| Field | Type | Description |
|---|---|---|
| data | JSON | A completed step's result. |
| error | StepError | A completed step that threw. |
| pending | boolean | Step already started (a parked sleep / wait / child); the runner blocks on it without re-running or re-emitting. |
The runner replies 200 { "data": <result>, "logs": [LogLine, ...] } when the run completes, or
206 { "opcodes": [Opcode, ...], "logs": [LogLine, ...] } listing the steps discovered this pass. The
logs array carries any ctx.log lines captured during the pass (see Logs); it is [] when
none were emitted.
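On the runner side, each `steps` entry decides what a step call does during replay. The sketch below maps the three documented entry shapes (plus the absent case) to an action. The `StepAction` union and `resolveStep` helper are hypothetical names for illustration, not SDK API.

```typescript
interface StepError {
  message: string;
  stack?: string;
}

// One entry of the invoke request's `steps` memo map.
interface StepState {
  data?: unknown;
  error?: StepError;
  pending?: boolean;
}

// Hypothetical description of what the runner should do for a step.
type StepAction =
  | { kind: "return"; data: unknown } // completed: hand back the saved result
  | { kind: "throw"; error: StepError } // completed with a failure: rethrow it
  | { kind: "block" } // started engine-side: wait, do not re-run or re-emit
  | { kind: "execute" }; // not in the memo: do the real work and report an opcode

function resolveStep(steps: Record<string, StepState>, hashedId: string): StepAction {
  const entry = Object.prototype.hasOwnProperty.call(steps, hashedId)
    ? steps[hashedId]
    : undefined;
  if (entry === undefined) return { kind: "execute" };
  if (entry.pending) return { kind: "block" };
  if (entry.error) return { kind: "throw", error: entry.error };
  return { kind: "return", data: entry.data };
}

const memo: Record<string, StepState> = {
  aaa: { data: { chargeId: "ch_A1" } },
  bbb: { error: { message: "card declined" } },
  ccc: { pending: true },
};

const actions = ["aaa", "bbb", "ccc", "ddd"].map((id) => resolveStep(memo, id).kind);
```

For the memo above, `actions` is `["return", "throw", "block", "execute"]`.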
Opcode
```json
{ "op": "StepRun", "id": "9f2b8c...", "name": "charge", "data": { "chargeId": "ch_A1" } }
```

| Field | Type | Used by | Description |
|---|---|---|---|
| op | enum | all | StepRun \| Sleep \| SleepUntil \| WaitForEvent \| RunWorkflow \| Emit \| Webhook. |
| id | string | all | Hashed step id; the engine stores the result under this key. |
| name | string | all | Human-readable step id (for the console). |
| data | JSON | StepRun, Emit, Webhook | Step result / event payload / webhook body. |
| error | StepError | StepRun | Step failure. |
| retriable | boolean (optional) | StepRun | false fails the run now, skipping remaining attempts (NonRetriableError). |
| retryAfterMs | number (optional) | StepRun | Overrides the policy backoff for this retry (RetryAfterError). |
| sleepMs | number | Sleep | Duration in ms. |
| sleepUntilMs | number | SleepUntil | Absolute wake time (UTC epoch ms). |
| eventName | string | WaitForEvent, Emit | Awaited / emitted event name. |
| timeoutMs | number | WaitForEvent | Timeout in ms. |
| childName | string | RunWorkflow | Child workflow to invoke. |
| childData | JSON | RunWorkflow | Input passed to the child. |
| webhookUrl | string | Webhook | Destination URL for a ctx.webhook.send; the engine enqueues a durable outbound delivery to it carrying data (a custom send has no endpoint secret, so it is delivered unsigned). |
StepError is { "message": string, "stack"?: string }.
Logs
A pass's response also carries the structured logs the handler emitted via ctx.log. The same array
shape rides every status (200, 206, and 500 - logs up to a throw still ship), so a log is never
lost to the path a pass took:
```json
{ "level": "info", "message": "charging card", "fields": { "amount": 4200 }, "scope": "charge", "index": 0, "tsMs": 1718900000000 }
```

| Field | Type | Description |
|---|---|---|
| level | enum | debug \| info \| warn \| error. |
| message | string | The log message. |
| fields | object (optional) | Structured fields. Sensitive keys are redacted engine-side before storage. |
| scope | string | The enclosing step name, or @root for a handler-level log. |
| index | number | A per-scope counter the engine uses (with scope + attempt) to give each line a replay-stable dedupe id. |
| tsMs | number | Runner wall clock (advisory). |
Because the handler body re-runs on every pass, a top-level (@root) log re-emits each pass; the
engine deduplicates it by (runId, attempt, scope, index) so it persists once. An in-step log only
runs on the pass where its step executes, and its dedupe is keyed on the step's attempt, so a retried
step's logs stay distinct per attempt. Read them back via
GET /runs/{id}/logs.
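The dedupe rule for replayed logs can be sketched as a set keyed on (runId, attempt, scope, index). `LogStore` below is a hypothetical in-memory stand-in for the engine's storage. It assumes the caller passes whichever attempt applies to the line (the run's attempt for @root, the step's attempt for an in-step log).

```typescript
interface LogLine {
  level: "debug" | "info" | "warn" | "error";
  message: string;
  fields?: Record<string, unknown>;
  scope: string;
  index: number;
  tsMs: number;
}

// Hypothetical engine-side store: persists each log line at most once.
class LogStore {
  private seen = new Set<string>();
  readonly lines: LogLine[] = [];

  // Returns how many lines from this pass were newly stored.
  ingest(runId: string, attempt: number, batch: LogLine[]): number {
    let stored = 0;
    for (const line of batch) {
      const key = JSON.stringify([runId, attempt, line.scope, line.index]);
      if (this.seen.has(key)) continue; // same line re-emitted by a replayed pass
      this.seen.add(key);
      this.lines.push(line);
      stored++;
    }
    return stored;
  }
}

const rootLine: LogLine = {
  level: "info",
  message: "handler started",
  scope: "@root",
  index: 0,
  tsMs: 1718900000000,
};

const store = new LogStore();
const pass1 = store.ingest("run-1", 1, [rootLine]); // first pass: stored
const pass2 = store.ingest("run-1", 1, [rootLine]); // replayed pass: deduplicated
const retryAttempt = store.ingest("run-1", 2, [rootLine]); // new attempt: kept distinct
```

The second pass stores nothing, while the same line under a different attempt is kept as a separate entry.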
Sleep.sleepMs is a duration, not an absolute time. The runner has no clock authority; it says
"sleep 10s" and the engine resolves the wake time when it persists the sleep, which keeps the
directive idempotent across passes. SleepUntil.sleepUntilMs is the absolute counterpart (UTC epoch
ms) for step.sleepUntil: a fixed wall-clock target the engine stores verbatim as the deadline and
parks against its own clock until it arrives. A target already in the past wakes on the next cycle.
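One way to picture the difference between the two sleep opcodes is a store that resolves the deadline once, when the sleep is first persisted. This is a sketch under assumptions: `SleepStore` and its methods are hypothetical, and the engine's actual persistence is not described here.

```typescript
type SleepOpcode =
  | { op: "Sleep"; id: string; sleepMs: number }
  | { op: "SleepUntil"; id: string; sleepUntilMs: number };

// Hypothetical engine-side store mapping a hashed step id to its wake deadline.
class SleepStore {
  private deadlines = new Map<string, number>();

  // Resolves the wake time on first persist; later passes reuse the stored value.
  persist(opcode: SleepOpcode, engineNowMs: number): number {
    const existing = this.deadlines.get(opcode.id);
    if (existing !== undefined) return existing;
    const wakeAtMs =
      opcode.op === "Sleep"
        ? engineNowMs + opcode.sleepMs // duration: resolved against the engine clock
        : opcode.sleepUntilMs; // absolute target: stored verbatim
    this.deadlines.set(opcode.id, wakeAtMs);
    return wakeAtMs;
  }

  isDue(id: string, engineNowMs: number): boolean {
    const wakeAtMs = this.deadlines.get(id);
    return wakeAtMs !== undefined && wakeAtMs <= engineNowMs;
  }
}

const sleeps = new SleepStore();
const firstWake = sleeps.persist({ op: "Sleep", id: "s1", sleepMs: 10_000 }, 1_000);
const replayWake = sleeps.persist({ op: "Sleep", id: "s1", sleepMs: 10_000 }, 5_000);
const pastWake = sleeps.persist({ op: "SleepUntil", id: "s2", sleepUntilMs: 500 }, 1_000);
```

Re-sending the same `Sleep` on a later pass does not push the deadline out (`firstWake` and `replayWake` are both 11000), and a `SleepUntil` target already in the past is due immediately.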
A Webhook opcode (ctx.webhook.send) is a durable side effect, not a step result: the engine
enqueues an outbound delivery to webhookUrl carrying data (unsigned, since a custom send has no
endpoint secret), then records the step. Like Emit, it is at-least-once on the wire but made
exactly-once by the deterministic step id, so a replayed pass never re-sends. See the webhooks guide.
A single pass can return several opcodes: a handler that runs steps with Promise.all discovers the
whole batch at once, and the 206 array carries them all (sequential steps are just the one-opcode
case). The engine persists every opcode, parks at the earliest deadline across the batch, and
re-invokes as each branch is ready; the handler proceeds once all are memoized. A terminal failure in
one branch fails the run and cancels its in-flight siblings. A step the engine has already started but
not finished comes back as "pending": true in the memo - the runner must not re-run or re-emit it,
so a parked branch's timer survives re-invokes its siblings trigger.
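The "do not re-emit a pending step" rule for parallel branches can be sketched as a planning function the runner applies to a batch. Everything here is illustrative: `planPass` and `Branch` are hypothetical, the ids are placeholders rather than real hashes, and the sketch does not cover what response the runner sends when nothing new is discovered.

```typescript
interface StepState {
  data?: unknown;
  pending?: boolean;
}

// A parallel branch the handler wants to run (here: only sleeps, for brevity).
interface Branch {
  id: string; // placeholder for the hashed step id
  name: string;
  sleepMs: number;
}

interface SleepOpcode {
  op: "Sleep";
  id: string;
  name: string;
  sleepMs: number;
}

// Emits opcodes only for branches the engine has never seen; the batch is ready
// once every branch is memoized with a result.
function planPass(
  branches: Branch[],
  steps: Map<string, StepState>
): { opcodes: SleepOpcode[]; ready: boolean } {
  const opcodes: SleepOpcode[] = [];
  let ready = true;
  for (const b of branches) {
    const state = steps.get(b.id);
    if (state === undefined) {
      opcodes.push({ op: "Sleep", id: b.id, name: b.name, sleepMs: b.sleepMs });
      ready = false;
    } else if (state.pending) {
      ready = false; // already parked engine-side: block on it, never re-emit
    }
  }
  return { opcodes, ready };
}

const branches: Branch[] = [
  { id: "h1", name: "short-nap", sleepMs: 1_000 },
  { id: "h2", name: "long-nap", sleepMs: 60_000 },
];

const firstPass = planPass(branches, new Map());
const midPass = planPass(
  branches,
  new Map<string, StepState>([
    ["h1", { data: null }],
    ["h2", { pending: true }],
  ])
);
const lastPass = planPass(
  branches,
  new Map<string, StepState>([
    ["h1", { data: null }],
    ["h2", { data: null }],
  ])
);
```

The first pass discovers both branches at once. The middle pass emits nothing, because the long sleep is pending, so its timer is left untouched. The last pass reports the batch as ready.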
Step id hashing
```
hashedStepId = lowercase_hex( SHA-256( utf8(stepId) ) )
```

The runner hashes the human-readable step id (e.g. "charge") to produce the steps map key and the
Opcode.id; the engine treats it as an opaque key. Implementations in different languages must
produce byte-identical output. A step id reused within one run is disambiguated runner-side before
hashing by a positional suffix ("x", then "x:1", "x:2", ...), so each occurrence gets a distinct
key; see the full spec for the exact scheme.
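As a concrete reading of the formula, the sketch below hashes a step id with Node's built-in `crypto` module and applies the positional suffix to repeated ids. `hashStepId` and `StepIdAllocator` are hypothetical names, and the suffix handling is a simplified guess at the scheme, so defer to the full spec for edge cases (for example a literal id that already looks like "x:1").

```typescript
import { createHash } from "node:crypto";

// lowercase_hex( SHA-256( utf8(stepId) ) )
function hashStepId(stepId: string): string {
  return createHash("sha256").update(stepId, "utf8").digest("hex");
}

// Assumed disambiguation: the first use keeps the id, later uses get ":1", ":2", ...
class StepIdAllocator {
  private counts = new Map<string, number>();

  next(stepId: string): { name: string; id: string } {
    const n = this.counts.get(stepId) ?? 0;
    this.counts.set(stepId, n + 1);
    const name = n === 0 ? stepId : `${stepId}:${n}`;
    return { name, id: hashStepId(name) };
  }
}

const allocator = new StepIdAllocator();
const occurrences = [allocator.next("x"), allocator.next("x"), allocator.next("x")];
const names = occurrences.map((o) => o.name); // ["x", "x:1", "x:2"]
const abcHash = hashStepId("abc"); // the standard SHA-256 test vector for "abc"
```

A port in another language can check itself against the published SHA-256 test vector for "abc", which is `ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad`.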
Routing ({app, runner?})
A run is owned by an app and executed by one of that app's registered runners (an app may have many).
The runner id is the routing handle:
- Anycast (no runner): each invoke goes to any one registered runner of the app (random pick). Runners are stateless and the full step memo is resent every invoke, so different passes may safely hit different replicas. If none is registered yet, the run parks and retries (the event-before-register race self-heals).
- Pinned (runner set on the event): routed only to that runner id. If it isn't registered, the run fails fast - a pin to a non-existent runner is a caller error in v0. Bounded waiting and offline drain for pinned runs are a future addition.
A runner that omits an id in /register is keyed by its URL. A child workflow inherits its parent's
pin only in the same app. ctx.runner carries the pin to your handler (empty for anycast).
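The routing rules can be summarized as a small decision function. This is a sketch, not the engine's code: `route`, `runnerKey`, and the `Route` union are hypothetical, and the random source is injectable only so the example is deterministic.

```typescript
interface Registration {
  app: string;
  runner?: string; // optional stable runner id
  url: string;
}

type Route =
  | { kind: "dispatch"; target: Registration }
  | { kind: "park" } // anycast with no runner yet: retry later
  | { kind: "fail"; reason: string }; // pin to an unregistered runner

// A runner that omits an id is keyed by its URL.
function runnerKey(reg: Registration): string {
  return reg.runner ?? reg.url;
}

// `pin` is the run's runner id, or "" for an anycast run (as in ctx.runner).
function route(
  registered: Registration[],
  app: string,
  pin: string,
  random: () => number = Math.random
): Route {
  const candidates = registered.filter((r) => r.app === app);
  if (pin !== "") {
    const pinned = candidates.find((r) => runnerKey(r) === pin);
    return pinned
      ? { kind: "dispatch", target: pinned }
      : { kind: "fail", reason: `runner ${pin} is not registered for app ${app}` };
  }
  if (candidates.length === 0) return { kind: "park" };
  const picked = candidates[Math.floor(random() * candidates.length)];
  return { kind: "dispatch", target: picked };
}

const registered: Registration[] = [
  { app: "order-app", runner: "node-7", url: "http://localhost:6773/invoke" },
  { app: "order-app", url: "http://localhost:6774/invoke" },
];

const anycast = route(registered, "order-app", "", () => 0.99);
const pinnedRoute = route(registered, "order-app", "node-7");
const missingPin = route(registered, "order-app", "node-9");
const noRunners = route(registered, "billing-app", "");
```

The asymmetry is the point of the sketch: an anycast run with no runner parks and self-heals, while a pinned run with no matching runner fails immediately.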
Connect transport (WebSocket)
Two transports drive a runner, and the choice is invisible to your workflow code:
- Serve (default): you stand up an HTTP server and the engine POSTs your /invoke URL. Simple, but the runner must be inbound-reachable.
- Connect: your runner dials the engine over a WebSocket (GET /connect) and receives invokes on that socket, so it needs no inbound address - the model for an agent on a node behind NAT. In the SDK this is connect({ engineUrl, app, runner?, workflows }) instead of serve(...) + register(...).
The execution model is identical (same step-memoization replay); only the connection direction differs.
Connect uses durablex's own protocol (subprotocol durablex.connect.v0, a generic
{type, id, payload} envelope, message types hello / runner.register / invoke / invoke.result),
and the stable runner id is part of the handshake, so a connected runner is pinnable by
{app, runner} exactly like an HTTP one. The hello and runner.register frames also carry the wire
version (see Protocol version). A dropped socket is detected by heartbeat and the
runner is evicted from routing until it reconnects (the SDK reconnects automatically).
Status codes
| Code | Meaning |
|---|---|
| 200 | Handler returned. Body { data, logs } carries the final result. The run succeeds. |
| 206 | Handler emitted new opcodes and isn't finished. Body is { opcodes, logs }. |
| 4xx | Non-retriable error (bad request, unknown workflow). The engine fails the run. |
| 5xx | Retriable transport error. The engine retries the invoke with backoff on a fixed transient budget, then fails the run. |
A step failure is different from a transport error: the runner reports it as a 206 whose opcode
carries error, and the engine retries that step per the workflow's retry policy. Per-step retries
and the transient transport budget are counted separately.
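The status-code table can be read as a classification the engine applies to every invoke response. The sketch below is an illustration with hypothetical names (`nextAction`, `Action`). The size of the transient budget is not stated here, so it is a parameter, and treating codes outside the table as non-retriable is an assumption.

```typescript
type Action = "complete-run" | "persist-opcodes" | "retry-invoke" | "fail-run";

// Decides what to do with an invoke response, given how much of the transient
// transport budget has been spent. Step retries are counted elsewhere.
function nextAction(
  status: number,
  transientRetriesUsed: number,
  transientBudget: number
): Action {
  if (status === 200) return "complete-run"; // handler returned
  if (status === 206) return "persist-opcodes"; // more work; may include step errors
  if (status >= 500 && status <= 599) {
    return transientRetriesUsed < transientBudget ? "retry-invoke" : "fail-run";
  }
  // 4xx is non-retriable; any other code is treated the same way (assumption).
  return "fail-run";
}

const budget = 3; // hypothetical budget size, for illustration only
const decisions = [
  nextAction(200, 0, budget),
  nextAction(206, 0, budget),
  nextAction(404, 0, budget),
  nextAction(503, 2, budget),
  nextAction(503, 3, budget),
];
```

A step that throws never reaches the 5xx branch: it arrives as a 206 whose opcode carries `error`, which is why the two retry counters stay independent.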
The engine caps the invoke response at 1 MiB (the same wire-message limit the connect transport applies to a result frame), so a runaway runner can't exhaust engine memory with an unbounded result; a response over the cap fails the invoke. Large step state is offloaded by API, not carried inline.
When a run fails terminally and its workflow registered onFailure: true, the engine marks the run
failed and spawns a separate follow-on run invoked with ctx.onFailure + ctx.error to run the
onFailure handler. The failed run is retained: queryable via GET /runs?status=failed
and redrivable via POST /runs/{id}/replay. See Retries.
Protocol version
The engine and runner share a single integer wire version, currently 1. (v1 wraps the invoke
response in an object on every status so logs ride alongside the result; v0 returned a bare
{data} on 200 and a bare opcode array on 206.) Each side advertises its version and checks the
peer's at every boundary, so a breaking wire change fails loudly instead of misparsing:
| Boundary | Carried as |
|---|---|
| HTTP invoke (engine → runner) | the X-Durablex-Protocol header |
| HTTP register (runner → engine) | RegisterRequest.protocolVersion |
| Connect handshake | protocolVersion on the hello and runner.register frames |
The rule is lenient on absence, strict on a present mismatch: a peer that sends no version is
assumed compatible, so the field is additive and never breaks an older peer, but a version that is
present and differs is rejected (400 on the HTTP paths; the socket closes on Connect). The SDK sets
this for you - you only encounter it if a runner and engine are on incompatible releases.
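The "lenient on absence, strict on a present mismatch" rule reduces to a short check. The sketch below uses hypothetical helper names (`isCompatible`, `readVersionHeader`). Only the header name X-Durablex-Protocol and the current version come from this page.

```typescript
const PROTOCOL_VERSION = 1; // the current wire version

// Absent version: assumed compatible. Present version: must match exactly.
function isCompatible(peerVersion: number | undefined, own: number = PROTOCOL_VERSION): boolean {
  return peerVersion === undefined || peerVersion === own;
}

// Parses the X-Durablex-Protocol header value; a missing header means "no version sent".
// A present but non-numeric value becomes NaN and therefore fails the check.
function readVersionHeader(value: string | null | undefined): number | undefined {
  if (value === null || value === undefined) return undefined;
  return Number(value);
}

const checks = {
  absent: isCompatible(readVersionHeader(null)),
  same: isCompatible(readVersionHeader("1")),
  newer: isCompatible(readVersionHeader("2")),
  garbage: isCompatible(readVersionHeader("abc")),
};
```

On the HTTP paths a `false` result corresponds to a 400. On Connect it corresponds to closing the socket.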
Signing
Every request between engine and runner carries an integrity header. Today this is a pass-through stub behind a signing interface, so real HMAC-SHA256 can drop in later without changing call sites.