Retry Strategies

When a job fails, Zeridion Flare automatically retries it with exponential backoff and jitter. You control how many times a job is retried and what happens when all attempts are exhausted.

How retries work

  1. A worker picks up a job and calls your ExecuteAsync method
  2. If ExecuteAsync throws an unhandled exception, the worker reports the failure back to Flare
  3. The server checks whether AttemptNumber < MaxAttempts
  4. If retries remain: the job returns to Pending with a RunAt delay (exponential backoff + jitter)
  5. If retries are exhausted: the job moves to DeadLetter
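The decision in steps 3 to 5 can be sketched as a small pure function. This is an illustration only, not Flare's server code: the `JobState` enum and the `RetryDecision.AfterFailure` helper are hypothetical names, and only the `AttemptNumber < MaxAttempts` comparison comes from the steps above.

```csharp
using System;

// Hypothetical subset of job states, named after the states mentioned on this page.
public enum JobState { Pending, DeadLetter }

public static class RetryDecision
{
    // Returns the state a job would move to after a failed attempt.
    // attemptNumber is 1-based: 1 means the first execution just failed.
    public static JobState AfterFailure(int attemptNumber, int maxAttempts)
    {
        return attemptNumber < maxAttempts
            ? JobState.Pending      // retries remain: requeue with a backoff delay
            : JobState.DeadLetter;  // attempts exhausted
    }
}
```

With MaxAttempts = 3, a failure on attempt 1 or 2 yields Pending, and a failure on attempt 3 yields DeadLetter.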

Exponential backoff with jitter

The retry delay doubles with each attempt, starting at 60 seconds. Uniform random jitter in the range 0–3000 ms (0–3 seconds, millisecond resolution) is added to prevent a thundering herd when many jobs fail simultaneously. The server clamps both the exponent and the resulting delay, so a retry can never be scheduled in the past. The maximum effective delay is 6 hours regardless of attempt number.

Formula: delay = 60s × 2^(attempt - 1) + random_uniform(0–3000 ms)

Attempt   Base delay       Actual range
1         60s (1 min)      60–63s
2         120s (2 min)     120–123s
3         240s (4 min)     240–243s
4         480s (8 min)     480–483s
5         960s (16 min)    960–963s
6         1920s (32 min)   1920–1923s
7         3840s (64 min)   3840–3843s
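The table values can be reproduced with a short sketch of the formula. The 60-second base, the 0–3000 ms jitter, and the 6-hour cap come from the text above. The class and method names are hypothetical, and the exact order in which the server applies its clamps is an assumption.

```csharp
using System;

public static class RetryBackoff
{
    // Values taken from the text above; the clamping order is an assumption.
    private static readonly TimeSpan BaseDelay = TimeSpan.FromSeconds(60);
    private static readonly TimeSpan MaxDelay = TimeSpan.FromHours(6);
    private const int MaxJitterMs = 3000;

    // Deterministic part: 60s * 2^(attempt - 1), capped at 6 hours.
    public static TimeSpan BaseDelayFor(int attempt)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        // Clamp the exponent so the shift cannot overflow; 2^16 minutes already exceeds 6 hours.
        int exponent = Math.Min(attempt - 1, 16);
        TimeSpan delay = TimeSpan.FromSeconds(BaseDelay.TotalSeconds * (1L << exponent));
        return delay < MaxDelay ? delay : MaxDelay;
    }

    // Full delay: base delay plus uniform jitter in [0, 3000] ms, still capped at 6 hours.
    public static TimeSpan DelayFor(int attempt, Random rng)
    {
        TimeSpan jitter = TimeSpan.FromMilliseconds(rng.Next(0, MaxJitterMs + 1));
        TimeSpan delay = BaseDelayFor(attempt) + jitter;
        return delay < MaxDelay ? delay : MaxDelay;
    }
}
```

Under these assumptions the 6-hour cap first takes effect at attempt 10, where the uncapped base delay would be 60s × 2^9 = 30,720s.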

With the default MaxAttempts = 3, a job gets three tries with roughly 3 minutes of total backoff (60s after attempt 1 plus 120s after attempt 2, not counting jitter or execution time) before it is dead-lettered.
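When choosing a MaxAttempts value, it can help to estimate how long a persistently failing job waits in backoff before it is dead-lettered. The sketch below sums the base delays from the formula above, ignoring jitter and execution time. `BackoffBudget` is a hypothetical helper, and applying the 6-hour cap per delay is an assumption.

```csharp
using System;

public static class BackoffBudget
{
    // Minimum total time spent waiting between attempts for a job that fails
    // every attempt, ignoring jitter and execution time.
    // Assumes the 60s base, doubling, and 6-hour cap described above.
    public static TimeSpan MinimumTotalBackoff(int maxAttempts)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        double totalSeconds = 0;
        // A delay follows each failed attempt except the last one.
        for (int attempt = 1; attempt < maxAttempts; attempt++)
        {
            double delay = 60.0 * Math.Pow(2, attempt - 1);
            totalSeconds += Math.Min(delay, 6 * 3600);
        }
        return TimeSpan.FromSeconds(totalSeconds);
    }
}
```

For example, `MinimumTotalBackoff(3)` is 3 minutes and `MinimumTotalBackoff(5)` is 15 minutes.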

Configuring MaxAttempts

You can set the maximum number of attempts per class or per call; if neither is set, a hardcoded default applies. More specific settings override less specific ones.

Per-class default

Apply [JobConfig] to set a default for all enqueues of this job type:

[JobConfig(MaxAttempts = 5)]
public class SendWelcomeEmail : IJob<NewUserPayload>
{
    public async Task ExecuteAsync(NewUserPayload payload, JobContext ctx)
    {
        // Up to 5 attempts before dead letter
    }
}

Per-call override

Pass JobOptions when enqueuing to override the class default for a specific enqueue:

await jobs.EnqueueAsync<SendWelcomeEmail>(payload, new JobOptions
{
    MaxAttempts = 10
});

Precedence

Level               How to set                           Default
Per-call            new JobOptions { MaxAttempts = N }   -
Per-class           [JobConfig(MaxAttempts = N)]         3
Server-side clamp   -                                    1–100

Resolution order: JobOptions (per-call) > [JobConfig] (per-class) > 3 (hardcoded default).

The server restricts the final value to the range 1–100. A value outside this range is not pinned to the nearest bound; it is reset to the default of 3.
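The resolution order and range check can be sketched as follows. `MaxAttemptsResolver` is a hypothetical helper rather than part of the SDK. It follows the text above literally, so an out-of-range per-call value resets to 3 instead of falling back to the per-class value; the actual server behavior in that case is an assumption.

```csharp
using System;

public static class MaxAttemptsResolver
{
    public const int DefaultMaxAttempts = 3;
    public const int MinAllowed = 1;
    public const int MaxAllowed = 100;

    // perCall: value from JobOptions, or null when not set.
    // perClass: value from [JobConfig], or null when the attribute is absent.
    public static int Resolve(int? perCall, int? perClass)
    {
        // The most specific setting wins; the hardcoded default is the fallback.
        int requested = perCall ?? perClass ?? DefaultMaxAttempts;
        // Out-of-range values reset to the default rather than the nearest bound.
        return requested is >= MinAllowed and <= MaxAllowed ? requested : DefaultMaxAttempts;
    }
}
```

For example, `Resolve(10, 5)` gives 10, `Resolve(null, 5)` gives 5, and `Resolve(500, null)` gives 3.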

AttemptNumber tracking

AttemptNumber starts at 0 on the job entity and is incremented to 1 when a worker first claims the job. Inside your ExecuteAsync, ctx.AttemptNumber gives the current attempt number.

Use it for conditional logic:

public async Task ExecuteAsync(PaymentPayload payload, JobContext ctx)
{
    if (ctx.AttemptNumber == ctx.MaxAttempts)
    {
        ctx.Logger.LogWarning("Final attempt for job {JobId}, alerting ops", ctx.JobId);
        await _alertService.NotifyAsync($"Job {ctx.JobId} on final attempt");
    }

    await ProcessPayment(payload, ctx.CancellationToken);
}

Dead letter

When AttemptNumber >= MaxAttempts after a failure, the job moves to DeadLetter:

  • State is set to DeadLetter
  • CompletedAt is set to the current time
  • Error details (ErrorType, ErrorMessage, ErrorStackTrace) are preserved from the last failure
  • Any child continuation jobs in Scheduled state are cancelled

Dead-lettered jobs remain in the database for inspection. They are not deleted or cleaned up automatically.

Querying dead letter jobs

GET /flare/v1/jobs?state=dead_letter&limit=50
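The request above could be issued from C# with a plain HttpClient, as in the sketch below. `DeadLetterQuery` is a hypothetical helper. It assumes the caller has already set `BaseAddress` and any authentication headers on the client, since neither is covered here. The response schema is not described on this page, so the body is returned unparsed.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class DeadLetterQuery
{
    // Builds the relative request URI for listing dead-lettered jobs.
    public static string BuildUri(int limit)
    {
        if (limit < 1) throw new ArgumentOutOfRangeException(nameof(limit));
        return $"/flare/v1/jobs?state=dead_letter&limit={limit}";
    }

    // Fetches the raw response body. The response schema is not described here,
    // so the body is returned as-is rather than parsed.
    public static async Task<string> FetchAsync(HttpClient http, int limit = 50)
    {
        using HttpResponseMessage response = await http.GetAsync(BuildUri(limit));
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```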

Manual retry from dead letter

You can requeue a dead-lettered job via the API, SDK, or dashboard:

API

POST /flare/v1/jobs/{id}/retry

This resets the job to Pending, clears error/worker/timing fields, and bumps MaxAttempts if the current AttemptNumber has already reached it.

SDK

var retried = await jobs.RetryAsync(jobId);
// returns true if requeued, false if job is not in a retryable state
Tip: RetryAsync returns false (instead of throwing) when the job is in a state that cannot be retried (e.g., Processing or Succeeded), so no try/catch is needed for that case.
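Because RetryAsync reports non-retryable jobs through its return value, requeuing a batch can be a simple loop. The sketch below is a hypothetical helper, not part of the SDK. It takes the retry call as a delegate so that it does not assume the SDK's client type or the job ID type.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BulkRetry
{
    // Calls retry for each id and returns the ids that were NOT requeued
    // (the call returned false). The delegate stands in for jobs.RetryAsync.
    public static async Task<List<TId>> RequeueAllAsync<TId>(
        IEnumerable<TId> jobIds, Func<TId, Task<bool>> retry)
    {
        var skipped = new List<TId>();
        foreach (TId id in jobIds)
        {
            if (!await retry(id))
            {
                skipped.Add(id);
            }
        }
        return skipped;
    }
}
```

A call might look like `await BulkRetry.RequeueAllAsync(ids, id => jobs.RetryAsync(id))`, with the returned list logged for follow-up.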

Dashboard

Click the Retry button on the job detail page to requeue a dead-lettered job with one click.

HTTP client retries (SDK to API)

The job-level retries described above are separate from the SDK's HTTP transport retries. The SDK registers its HTTP client with AddStandardResilienceHandler() from Microsoft.Extensions.Http.Resilience, which provides:

  • Retry — automatic retry with exponential backoff for transient HTTP failures (5xx, timeouts)
  • Circuit breaker — stops sending requests when the API is consistently failing
  • Timeout — per-request and total timeout enforcement

These transport-level retries protect against network blips and temporary API outages. They happen transparently before your code sees the response.
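For reference, the sketch below shows how an application could attach the same standard pipeline to one of its own HTTP clients. It is not the SDK's internal registration code. `ResilientHttpClientSetup` and `AddResilientClient` are hypothetical names, and the snippet assumes the Microsoft.Extensions.Http and Microsoft.Extensions.Http.Resilience packages are referenced.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

public static class ResilientHttpClientSetup
{
    // Registers a named HttpClient and attaches the standard resilience pipeline
    // (retry, circuit breaker, timeouts) with its default options.
    public static IServiceCollection AddResilientClient(
        this IServiceCollection services, string name, Uri baseAddress)
    {
        services.AddHttpClient(name, client => client.BaseAddress = baseAddress)
                .AddStandardResilienceHandler();
        return services;
    }
}
```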

Best practices

  1. Keep jobs idempotent — since jobs may execute more than once, design ExecuteAsync so that re-running with the same payload produces the same result. Use database upserts, check-before-write, or idempotency keys on downstream calls.

  2. Use ctx.AttemptNumber for logging — always include the attempt number in your log messages so you can trace the retry history:

    ctx.Logger.LogInformation(
        "Attempt {Attempt}/{Max} for job {JobId}",
        ctx.AttemptNumber, ctx.MaxAttempts, ctx.JobId);

  3. Set reasonable timeouts — jobs without timeouts can run indefinitely and block the worker. Use [JobConfig(TimeoutSeconds = 300)] to cap execution time. The worker reports progress periodically; if it stops reporting, Flare reclaims the job for retry.

  4. Don't catch and swallow all exceptions — let unexpected exceptions bubble up so the retry engine can do its job. Only catch exceptions when you need to prevent retries (e.g., invalid input data that will never succeed).

  5. Monitor dead letter counts — use GET /flare/v1/metrics/summary to track dead letter accumulation. A rising dead letter count signals a systemic issue.
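The idempotency advice in item 1 can be illustrated with a check-before-write sketch keyed on an idempotency key. `IdempotentProcessor` is a hypothetical class, and its in-memory dictionary only stands in for a durable store (for example, a table with a unique key constraint). A real job would need that durable store, because retries can run on a different worker.

```csharp
using System;
using System.Collections.Generic;

public sealed class IdempotentProcessor
{
    // In-memory stand-in for a durable store with a unique key constraint.
    private readonly Dictionary<string, string> _results = new();
    private readonly object _gate = new();

    // Runs work only if no result is recorded for the key; otherwise returns
    // the recorded result without running work again.
    public string Process(string idempotencyKey, Func<string> work)
    {
        lock (_gate)
        {
            if (_results.TryGetValue(idempotencyKey, out var existing))
            {
                return existing;
            }
            // If work throws, nothing is recorded, so a retry runs it again.
            string result = work();
            _results[idempotencyKey] = result;
            return result;
        }
    }
}
```

A retry after a failure runs the work again. A retry after a success returns the recorded result and does not repeat the side effect.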

See also