Retries & failure handling

How a failing step retries, and what happens when a run exhausts its retries: onFailure handlers and replaying failed runs.

When a step throws, Durablex retries that step - not the whole workflow. Steps that already succeeded keep their saved results and are never re-run.

Opting in

By default a workflow does not retry: a step runs once, and if it throws, the run fails. Opt in per workflow with retry:

const orderCreated = defineWorkflow<OrderData>({
  name: "order.created",
  retry: { maxAttempts: 3 },
  handler: async (ctx) => {
    await ctx.step.run("charge", () => chargeCard(ctx.event.data));
  },
});

Property	Type	Default	Description
`maxAttempts`	`number`	`1`	Total times a step may run, including the first. `1` means no retry.

For example, with maxAttempts: 3 a step runs once and retries up to twice more.

What happens on failure

1. A step throws.
2. Durablex waits a backoff delay, then runs that step again.
3. This repeats until the step succeeds or reaches maxAttempts.
4. If the step still fails on its last attempt, the run fails.
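
The sequence above can be sketched as a small loop. This is an illustration only, not Durablex's engine code: runStepWithRetry, sleep, and backoffMs are hypothetical names, and the 1000 ms default simply mirrors the fixed delay described on this page.

```typescript
// Hypothetical sketch of a per-step retry loop (not the real engine).
type StepFn<T> = () => Promise<T> | T;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runStepWithRetry<T>(
  fn: StepFn<T>,
  maxAttempts = 1, // total runs, including the first; 1 means no retry
  backoffMs = 1000, // assumed fixed delay between attempts
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(); // success ends the loop
    } catch (err) {
      lastError = err;
      // Wait only if another attempt is still allowed.
      if (attempt < maxAttempts) await sleep(backoffMs);
    }
  }
  // Out of attempts: the step's last error fails the run.
  throw lastError;
}
```

With maxAttempts: 3, a step that fails twice and then succeeds resolves on the third call; a step that always throws rejects after the third call.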

Each retry is a fresh attempt at that one step. Earlier steps are not repeated, so a workflow that charged a card and then failed to ship won't charge again while retrying the shipment.

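// "charge" succeeds and its result is saved.
const charge = await ctx.step.run("charge", () => chargeCard(order));
// If "ship" throws, only "ship" is retried; "charge" is not run again.
const ship = await ctx.step.run("ship", () => createShipment(order));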
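
To see why the card is not charged twice, here is a minimal in-memory sketch of step memoization. It is an assumption about the general technique, not a description of how Durablex stores results: StepMemo and the counters are hypothetical, and a real engine would persist results durably rather than in a Map.

```typescript
// Hypothetical in-memory stand-in for durable step storage.
class StepMemo {
  private results = new Map<string, unknown>();

  run<T>(name: string, fn: () => T): T {
    // A step that already succeeded returns its saved result without running.
    if (this.results.has(name)) return this.results.get(name) as T;
    const value = fn(); // if fn throws, nothing is saved for this step
    this.results.set(name, value);
    return value;
  }
}

let charges = 0;
let shipAttempts = 0;
const memo = new StepMemo();

function workflow(): string {
  memo.run("charge", () => {
    charges++;
    return "charged";
  });
  return memo.run("ship", () => {
    shipAttempts++;
    if (shipAttempts < 2) throw new Error("carrier unavailable");
    return "shipped";
  });
}

try {
  workflow(); // first pass: "charge" succeeds, "ship" throws
} catch {
  // the failure is expected here
}
const outcome = workflow(); // second pass: "charge" is served from the memo
```

After both passes the charge function has run once and the ship function twice.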

Retries wait between attempts rather than firing immediately, which gives a struggling dependency room to recover. Today the delay is fixed at roughly 1s between attempts. The engine supports linear and exponential curves internally, but workflow registration sends only maxAttempts, so every registered workflow uses the fixed backoff.
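
For intuition about how the three curve shapes differ, the sketch below computes the delay before each retry. The formulas and the 1000 ms base are common textbook definitions chosen for illustration; the engine's actual parameters are not documented here, and backoffDelayMs is a hypothetical name.

```typescript
type Backoff = "fixed" | "linear" | "exponential";

// Delay before retry number `retry` (1 = the first retry).
// Formulas are illustrative assumptions, not the engine's real curves.
function backoffDelayMs(kind: Backoff, retry: number, baseMs = 1000): number {
  switch (kind) {
    case "fixed":
      return baseMs; // same wait every time
    case "linear":
      return baseMs * retry; // grows by baseMs per retry
    case "exponential":
      return baseMs * 2 ** (retry - 1); // doubles each retry
  }
}

// Total time spent waiting across all retries for a given maxAttempts.
function totalWaitMs(kind: Backoff, maxAttempts: number, baseMs = 1000): number {
  let total = 0;
  for (let retry = 1; retry < maxAttempts; retry++) {
    total += backoffDelayMs(kind, retry, baseMs);
  }
  return total;
}
```

Under the fixed curve that registered workflows use today, maxAttempts: 3 means two waits of about one second each.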

Controlling retries from a step

Two error types let a step override the default retry behavior. Import them from @durablex/sdk.

NonRetriableError

Throw NonRetriableError to fail the run immediately, skipping any remaining attempts. Use it for failures retrying cannot fix, such as a validation error or missing configuration.

import { NonRetriableError } from "@durablex/sdk";

await ctx.step.run("validate", () => {
  if (!apiKey) throw new NonRetriableError("missing API key");
});
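
The effect on the retry budget can be sketched as follows. This is a hypothetical model, not engine code: NonRetriable is a local stand-in for the SDK class so the snippet runs on its own, runStep is an invented helper, and backoff delays are omitted for brevity.

```typescript
// Local stand-in for the SDK's NonRetriableError.
class NonRetriable extends Error {}

type Outcome = { status: "succeeded" | "failed"; attempts: number };

function runStep(fn: () => void, maxAttempts: number): Outcome {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      fn();
      return { status: "succeeded", attempts: attempt };
    } catch (err) {
      // A non-retriable error ends the run now, whatever budget remains.
      if (err instanceof NonRetriable) {
        return { status: "failed", attempts: attempt };
      }
      // Any other error falls through to the next attempt.
    }
  }
  return { status: "failed", attempts: maxAttempts };
}

const fatal = runStep(() => {
  throw new NonRetriable("missing API key");
}, 3);

const transient = runStep(() => {
  throw new Error("timeout");
}, 3);
```

Here fatal fails after a single attempt, while transient uses all three attempts before failing.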

RetryAfterError

Throw RetryAfterError to retry after a delay you choose instead of the policy's backoff, for example to honor the Retry-After header of an upstream 429 response. The delay can be a duration string ("30s"), a number of milliseconds, or an absolute Date.

import { RetryAfterError } from "@durablex/sdk";

await ctx.step.run("call-upstream", async () => {
  const res = await fetch(url);
  if (res.status === 429) throw new RetryAfterError("rate limited", "30s");
  return res.json();
});
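
Rather than hard-coding "30s", you may want to pass along the delay the upstream asked for. In HTTP, Retry-After carries either a number of seconds or an HTTP-date, which map onto the millisecond and Date forms described above. The helper below is a sketch; parseRetryAfter and its fallbackMs parameter are hypothetical names, not part of the SDK.

```typescript
// Convert a Retry-After header value into a millisecond number
// (from delay-seconds) or an absolute Date (from an HTTP-date).
// Falls back to `fallbackMs` when the header is missing or unparseable.
function parseRetryAfter(header: string | null, fallbackMs = 30000): number | Date {
  if (header === null) return fallbackMs;
  const trimmed = header.trim();
  // delay-seconds form, e.g. "120"
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const when = new Date(trimmed);
  return Number.isNaN(when.getTime()) ? fallbackMs : when;
}
```

Inside the step you could then throw new RetryAfterError("rate limited", parseRetryAfter(res.headers.get("Retry-After"))).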

RetryAfterError does not grant extra attempts - it only changes when the next attempt runs. Once the step reaches maxAttempts the run fails as usual.

Transport errors

If the engine can't reach your runner at all (the process is down or returns a server error), it retries the request on its own short schedule, separate from your maxAttempts. A brief network blip won't burn a step's retry budget or fail the run.

onFailure

Declare an onFailure handler to run compensation or notification logic when a run fails. It fires only once the run has exhausted its retries (or hit a non-retriable error) and been marked failed, and it receives the original event plus ctx.error.

const capturePayment = defineWorkflow<{ paymentId: string }>({
  name: "payment.capture",
  retry: { maxAttempts: 3 },
  handler: async (ctx) => {
    await ctx.step.run("capture", () => capture(ctx.event.data.paymentId));
  },
  onFailure: async (ctx) => {
    await ctx.step.run("void-hold", () => voidHold(ctx.event.data.paymentId));
  },
});

Property	Type	Description
`ctx.error`	`{ message, stack? }`	The terminal error that failed the run. Set only inside `onFailure`.

onFailure runs as its own durable execution: its steps are memoized and retried like any handler. It cannot un-fail the run - the failed run stays failed. Use it to compensate (release a hold, reverse a write) or notify, not to retry the work.
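
For the notification case, ctx.error gives you the message and, when available, the stack. The helper below is a hypothetical sketch of turning that shape into a one-line summary you might post to a chat channel or pager from inside an onFailure step; describeFailure and RunError are illustrative names, not SDK exports.

```typescript
// Mirrors the documented shape of ctx.error: { message, stack? }.
interface RunError {
  message: string;
  stack?: string;
}

// Build a short, single-line summary of a failed run.
function describeFailure(workflowName: string, error: RunError): string {
  const where = error.stack
    ? error.stack
        .split("\n")
        .slice(0, 2) // error line plus the first stack frame
        .map((line) => line.trim())
        .join(" | ")
    : "no stack available";
  return `${workflowName} failed: ${error.message} (${where})`;
}
```

Calling this inside ctx.step.run keeps the notification memoized, so a retried onFailure does not send it twice once the step has succeeded.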

Failed runs

A run that exhausts its retries (or hits a non-retriable error) is retained as a failed run carrying its error. A run is marked failed only once it is terminal; runs that are still retrying stay non-terminal, so the set of failed runs is exactly the runs that failed permanently. List them with GET /runs?status=failed, inspect one with GET /runs/{id}, and redrive it with POST /runs/{id}/replay, which starts a fresh run from the same trigger. To redrive without re-running work that already succeeded, use retry-from-step, which forks from a chosen step and carries over the completed steps before it. Nothing is silently dropped; failed work waits to be inspected or replayed.
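
As a sketch of scripting against these endpoints, the helpers below build the three requests. Only the paths and methods come from this page; BASE_URL is a hypothetical address, the helper names are invented, and authentication and response shapes are not covered here, so they are left out.

```typescript
// Hypothetical engine address; substitute your own.
const BASE_URL = "http://localhost:8080";

interface RunRequest {
  method: "GET" | "POST";
  url: string;
}

// GET /runs?status=failed
function listFailedRuns(base: string = BASE_URL): RunRequest {
  return { method: "GET", url: `${base}/runs?status=failed` };
}

// GET /runs/{id}
function inspectRun(id: string, base: string = BASE_URL): RunRequest {
  return { method: "GET", url: `${base}/runs/${encodeURIComponent(id)}` };
}

// POST /runs/{id}/replay starts a fresh run from the same trigger.
function replayRun(id: string, base: string = BASE_URL): RunRequest {
  return { method: "POST", url: `${base}/runs/${encodeURIComponent(id)}/replay` };
}

// Usage, in an environment with a global fetch:
//   const req = replayRun("run_123");
//   const res = await fetch(req.url, { method: req.method });
```

Keeping the request construction separate from the network call makes the paths easy to check without a running engine.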

In the console, failed runs live in the normal Runs view, surfaced for triage: each failed row shows its failure reason inline, and the Failed stat tile is a one-click filter (click it to focus on failed runs, click again to clear). You can also filter by the Failed status from the dropdown, replay one from its inspector, or bulk replay everything matching the current filter.