Retries & failure handling
How a failing step retries, and what happens when a run exhausts its retries: onFailure handlers and replaying failed runs.
When a step throws, Durablex retries that step - not the whole workflow. Steps that already succeeded keep their saved results and are never re-run.
Opting in
By default a workflow does not retry: a step runs once, and if it throws, the run fails. Opt in
per workflow with retry:
const orderCreated = defineWorkflow<OrderData>({
  name: "order.created",
  retry: { maxAttempts: 3 },
  handler: async (ctx) => {
    await ctx.step.run("charge", () => chargeCard(ctx.event.data));
  },
});
| Property | Type | Default | Description |
|---|---|---|---|
| maxAttempts | number | 1 | Total times a step may run, including the first. 1 means no retry. |
maxAttempts is the total number of times a step may run, including the first. With maxAttempts: 3
a step runs once and retries up to twice more.
What happens on failure
- A step throws.
- Durablex waits a backoff delay, then runs that step again.
- This repeats until the step succeeds or reaches maxAttempts.
- If the step still fails on its last attempt, the run fails.
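The sequence above can be pictured with a small, synchronous model. This is a sketch of the counting rules only, not Durablex's implementation: `runStepWithRetry`, `StepOutcome`, and the 1000 ms default are illustrative names and values, and the model records the delays it would wait instead of sleeping.

```typescript
// Simplified model of the retry sequence: run a step up to maxAttempts times.
// Hypothetical names; the real engine schedules attempts durably.
interface StepOutcome<T> {
  status: "succeeded" | "failed";
  attempts: number; // how many times the step actually ran
  waitsMs: number[]; // backoff delays that would have been waited
  value?: T;
  error?: unknown;
}

function runStepWithRetry<T>(
  step: () => T,
  maxAttempts: number,
  backoffMs = 1000, // assumed constant standing in for the fixed delay
): StepOutcome<T> {
  const limit = Math.max(1, Math.floor(maxAttempts)); // 1 means no retry
  const waitsMs: number[] = [];
  let lastError: unknown;

  for (let attempt = 1; attempt <= limit; attempt++) {
    try {
      return { status: "succeeded", attempts: attempt, waitsMs, value: step() };
    } catch (err) {
      lastError = err;
      // Wait only if another attempt is still allowed.
      if (attempt < limit) waitsMs.push(backoffMs);
    }
  }
  // The last allowed attempt also failed, so the run fails.
  return { status: "failed", attempts: limit, waitsMs, error: lastError };
}

// Usage: a step that fails twice, then succeeds on its third run.
let calls = 0;
const flaky = (): string => {
  calls += 1;
  if (calls < 3) throw new Error("temporary failure");
  return "charged";
};

const outcome = runStepWithRetry(flaky, 3);
console.log(outcome.status, outcome.attempts, outcome.waitsMs);
```

With maxAttempts: 3 the flaky step runs three times and waits twice; with maxAttempts: 1 the same step would fail on its first throw without waiting.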
Each retry is a fresh attempt at that one step. Earlier steps are not repeated, so a workflow that charged a card and then failed to ship won't charge again while retrying the shipment.
const charge = await ctx.step.run("charge", () => chargeCard(order));
const ship = await ctx.step.run("ship", () => createShipment(order));
Backoff
Retries wait between attempts rather than hammering the step immediately. Today the delay is fixed
at roughly 1s between attempts: the engine supports linear and exponential curves internally, but
registration sends only maxAttempts, so every registered workflow uses fixed backoff. Either way, a
struggling dependency gets room to recover.
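For intuition, here is how fixed, linear, and exponential curves are commonly defined. The formulas, the 1000 ms base, and the `backoffDelayMs` name are assumptions for illustration; the engine's internal curves may use different parameters.

```typescript
// Common textbook definitions of three backoff curves (assumed, not taken
// from the engine). `retry` is 1 for the first retry, 2 for the second, etc.
type BackoffCurve = "fixed" | "linear" | "exponential";

function backoffDelayMs(curve: BackoffCurve, retry: number, baseMs = 1000): number {
  switch (curve) {
    case "fixed":
      return baseMs; // same delay every time
    case "linear":
      return baseMs * retry; // grows by baseMs per retry
    case "exponential":
      return baseMs * Math.pow(2, retry - 1); // doubles per retry
  }
}

const retries = [1, 2, 3, 4];
for (const curve of ["fixed", "linear", "exponential"] as const) {
  console.log(curve, retries.map((r) => backoffDelayMs(curve, r)));
}
```

Only the fixed curve reflects what registered workflows get today; the other two are shown for comparison.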
Controlling retries from a step
Two error types let a step override the default retry behavior. Import them from @durablex/sdk.
NonRetriableError
Throw NonRetriableError to fail the run immediately, skipping any remaining attempts. Use it for
failures retrying cannot fix, such as a validation error or missing configuration.
import { NonRetriableError } from "@durablex/sdk";
await ctx.step.run("validate", () => {
  if (!apiKey) throw new NonRetriableError("missing API key");
});
RetryAfterError
Throw RetryAfterError to retry after a delay you choose instead of the policy's backoff, for
example honoring an upstream 429's Retry-After. The delay is a duration string ("30s"), a
millisecond number, or an absolute Date.
import { RetryAfterError } from "@durablex/sdk";
await ctx.step.run("call-upstream", async () => {
  const res = await fetch(url);
  if (res.status === 429) throw new RetryAfterError("rate limited", "30s");
  return res.json();
});
RetryAfterError does not grant extra attempts - it only changes when the next attempt runs.
Once the step reaches maxAttempts the run fails as usual.
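The rules for both error types can be summarized as a single decision taken after each failed attempt. The sketch below models that decision with plain data instead of the SDK classes so that it runs on its own. `StepFailure`, `Decision`, `toDelayMs`, and `decide` are hypothetical names, and the duration grammar (a number followed by ms, s, m, or h) is an assumption; the text only confirms the "30s" form.

```typescript
// What the step threw, modeled as data rather than the real SDK classes.
type StepFailure =
  | { kind: "error" } // any ordinary thrown error
  | { kind: "non-retriable" } // stands in for NonRetriableError
  | { kind: "retry-after"; retryAfter: string | number | Date }; // RetryAfterError

type Decision = { action: "fail" } | { action: "retry"; delayMs: number };

// Normalize the three delay forms to milliseconds from `now`.
function toDelayMs(retryAfter: string | number | Date, now: Date): number {
  if (typeof retryAfter === "number") return Math.max(0, retryAfter);
  if (retryAfter instanceof Date) {
    return Math.max(0, retryAfter.getTime() - now.getTime());
  }
  // Assumed grammar: "<number><unit>", unit one of ms, s, m, h.
  const match = /^(\d+(?:\.\d+)?)(ms|s|m|h)$/.exec(retryAfter.trim());
  if (!match) throw new Error("unsupported duration: " + retryAfter);
  const unit = match[2];
  const factor = unit === "ms" ? 1 : unit === "s" ? 1000 : unit === "m" ? 60000 : 3600000;
  return Number(match[1]) * factor;
}

function decide(
  failure: StepFailure,
  attempt: number, // the attempt that just failed, starting at 1
  maxAttempts: number,
  now: Date = new Date(),
  defaultBackoffMs = 1000, // assumed stand-in for the policy's backoff
): Decision {
  // A non-retriable failure skips any remaining attempts.
  if (failure.kind === "non-retriable") return { action: "fail" };
  // No error type grants attempts beyond maxAttempts.
  if (attempt >= maxAttempts) return { action: "fail" };
  // A retry-after failure only changes when the next attempt runs.
  if (failure.kind === "retry-after") {
    return { action: "retry", delayMs: toDelayMs(failure.retryAfter, now) };
  }
  return { action: "retry", delayMs: defaultBackoffMs };
}

console.log(decide({ kind: "retry-after", retryAfter: "30s" }, 1, 3));
console.log(decide({ kind: "retry-after", retryAfter: "30s" }, 3, 3));
console.log(decide({ kind: "non-retriable" }, 1, 3));
```

The second call shows the budget rule: on the last allowed attempt, a retry-after failure still fails the run.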
Transport errors
If the engine can't reach your runner at all (the process is down or returns a server error), it
retries the request on its own short schedule, separate from your maxAttempts. A brief network blip
won't burn a step's retry budget or fail the run.
onFailure
Declare an onFailure handler to run compensation or notification logic when a run fails. It fires only
once the run has exhausted its retries (or hit a non-retriable error) and been marked failed, and it
receives the original event plus ctx.error.
const capturePayment = defineWorkflow<{ paymentId: string }>({
  name: "payment.capture",
  retry: { maxAttempts: 3 },
  handler: async (ctx) => {
    await ctx.step.run("capture", () => capture(ctx.event.data.paymentId));
  },
  onFailure: async (ctx) => {
    await ctx.step.run("void-hold", () => voidHold(ctx.event.data.paymentId));
  },
});
| Property | Type | Description |
|---|---|---|
| ctx.error | { message, stack? } | The terminal error that failed the run. Set only inside onFailure. |
onFailure runs as its own durable execution: its steps are memoized and retried like any handler. It
cannot un-fail the run - the failed run stays failed. Use it to compensate (release a hold, reverse
a write) or notify, not to retry the work.
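The ordering described here (fail first, then compensate, and stay failed) can be modeled in a few lines. This is a toy, synchronous sketch: `settleRun` and `RunStatus` are hypothetical names, `handler` stands for the workflow after its retries are already exhausted, and a failure inside onFailure itself is not modeled.

```typescript
// Shape of the error handed to onFailure, matching the table above.
interface TerminalError {
  message: string;
  stack?: string;
}

type RunStatus = "completed" | "failed";

// Toy model: onFailure runs only after the run is terminally failed,
// and its success does not change the run's status.
function settleRun(
  handler: () => void,
  onFailure?: (error: TerminalError) => void,
): RunStatus {
  try {
    handler();
    return "completed";
  } catch (err) {
    const error: TerminalError =
      err instanceof Error
        ? { message: err.message, stack: err.stack }
        : { message: String(err) };
    if (onFailure) onFailure(error); // compensate or notify
    return "failed"; // the failed run stays failed
  }
}

// Usage: the capture fails for good, so the hold is released.
const log: string[] = [];
const status = settleRun(
  () => {
    throw new Error("capture declined");
  },
  (error) => {
    log.push("void-hold after: " + error.message);
  },
);
console.log(status, log);
```

Even though the compensation succeeds, the status remains failed, which is why onFailure suits cleanup and notification rather than retrying the work.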
Failed runs
A run that exhausts its retries (or hits a non-retriable error) is retained as a failed run carrying
its error. A run is marked failed only once it is terminal; runs that are still retrying stay
non-terminal. The set of failed runs is therefore exactly the set of runs that failed permanently.
List them with GET /runs?status=failed, inspect one with GET /runs/{id}, and redrive it with
POST /runs/{id}/replay, which starts a fresh run from the same trigger. To redrive without
re-running work that already succeeded, use retry-from-step, which forks from a chosen step and
carries over the completed steps before it. Nothing is silently dropped; failed work waits
to be inspected or replayed.
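As a rough sketch, the three endpoints above can be wrapped in small request builders. The paths come from this page; the base URL (`http://localhost:8080`), the `RunRequest` shape, and the function names are assumptions, and authentication and response formats are not covered here.

```typescript
// A request described as data, so it can be inspected before sending.
interface RunRequest {
  method: "GET" | "POST";
  url: string;
}

// Drop trailing slashes so paths join cleanly.
function trimBase(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, "");
}

function listFailedRuns(baseUrl: string): RunRequest {
  return { method: "GET", url: trimBase(baseUrl) + "/runs?status=failed" };
}

function getRun(baseUrl: string, id: string): RunRequest {
  return { method: "GET", url: trimBase(baseUrl) + "/runs/" + encodeURIComponent(id) };
}

function replayRun(baseUrl: string, id: string): RunRequest {
  return {
    method: "POST",
    url: trimBase(baseUrl) + "/runs/" + encodeURIComponent(id) + "/replay",
  };
}

// Hypothetical base URL and run id; substitute your own deployment's values.
const base = "http://localhost:8080";
console.log(listFailedRuns(base));
console.log(getRun(base, "run_123"));
console.log(replayRun(base, "run_123"));
```

Each descriptor can then be sent with something like `fetch(req.url, { method: req.method })`.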
In the console, failed runs live in the normal Runs view, surfaced for triage: each failed row
shows its failure reason inline, and the Failed stat tile is a one-click filter (click it to focus
on failed runs, click again to clear). You can also filter by the Failed status from the dropdown,
replay one from its inspector, or bulk replay everything matching the
current filter.