Retry Strategies
When a job fails, Zeridion Flare automatically retries it with exponential backoff and jitter. You control how many times a job is retried and what happens when all attempts are exhausted.
How retries work
- A worker picks up a job and calls your ExecuteAsync method
- If ExecuteAsync throws an unhandled exception, the worker reports the failure back to Flare
- The server checks whether AttemptNumber < MaxAttempts
- If retries remain: the job returns to Pending with a RunAt delay (exponential backoff + jitter)
- If retries are exhausted: the job moves to DeadLetter
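The decision in the steps above can be sketched as a small pure function. This is an illustrative model, not Flare's server code: the JobState enum and the NextStateAfterFailure name are hypothetical stand-ins, and only the AttemptNumber < MaxAttempts comparison and the Pending/DeadLetter outcomes come from the description above.

```csharp
using System;

// Hypothetical stand-in for the two outcomes named in this guide.
public enum JobState { Pending, DeadLetter }

public static class RetryDecision
{
    // Mirrors the check described above: a failed job is retried while
    // AttemptNumber < MaxAttempts, otherwise it is dead-lettered.
    public static JobState NextStateAfterFailure(int attemptNumber, int maxAttempts)
    {
        return attemptNumber < maxAttempts ? JobState.Pending : JobState.DeadLetter;
    }
}
```

With MaxAttempts = 3, a failure on attempt 1 or 2 yields Pending, and a failure on attempt 3 yields DeadLetter.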
Exponential backoff with jitter
The retry delay doubles with each attempt, starting at 60 seconds. A uniform random jitter in the range 0–3000 ms (0–3 seconds, millisecond resolution) is added to prevent a thundering herd when many jobs fail simultaneously. Both the exponent and the resulting delay are clamped on the server, so a retry can never be scheduled in the past. The maximum effective delay is 6 hours regardless of attempt number.
Formula: delay = 60s × 2^(attempt - 1) + random_uniform(0–3000 ms)
| Attempt | Base delay | Actual range |
|---|---|---|
| 1 | 60s (1 min) | 60–63s |
| 2 | 120s (2 min) | 120–123s |
| 3 | 240s (4 min) | 240–243s |
| 4 | 480s (8 min) | 480–483s |
| 5 | 960s (16 min) | 960–963s |
| 6 | 1920s (32 min) | 1920–1923s |
| 7 | 3840s (64 min) | 3840–3843s |
With the default MaxAttempts = 3, a job gets three tries separated by roughly 3 minutes of total backoff (60s after attempt 1 plus 120s after attempt 2, on top of jitter and execution time) before dead-lettering.
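The formula and table above can be reproduced with a short calculation. This is a minimal sketch rather than Flare's implementation: the Backoff class and its method names are hypothetical, and applying the 6-hour cap after adding jitter is an assumption, since the text does not say where the cap is applied.

```csharp
using System;

public static class Backoff
{
    private static readonly TimeSpan MaxDelay = TimeSpan.FromHours(6);

    // Base delay after a failed attempt: 60s x 2^(attempt - 1), capped at 6 hours.
    public static TimeSpan BaseDelay(int attempt)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        // Clamp the exponent so the shift cannot overflow; 60s x 2^9 already exceeds 6 hours.
        int exponent = Math.Min(attempt - 1, 9);
        TimeSpan delay = TimeSpan.FromSeconds(60L << exponent);
        return delay < MaxDelay ? delay : MaxDelay;
    }

    // Full delay: base delay plus uniform jitter in [0, 3000] ms.
    // Assumption: the cap is re-applied after jitter is added.
    public static TimeSpan Delay(int attempt, Random rng)
    {
        TimeSpan jitter = TimeSpan.FromMilliseconds(rng.Next(0, 3001)); // upper bound is exclusive
        TimeSpan total = BaseDelay(attempt) + jitter;
        return total < MaxDelay ? total : MaxDelay;
    }
}
```

For the default MaxAttempts = 3, the total base backoff is Backoff.BaseDelay(1) + Backoff.BaseDelay(2), which is 180 seconds.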
Configuring MaxAttempts
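You can set the maximum number of attempts per class or per call; a hardcoded default applies when neither is set. MaxAttempts counts every execution, including the first, so MaxAttempts = 3 means one initial run plus up to two retries. More specific settings override less specific ones.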
Per-class default
Apply [JobConfig] to set a default for all enqueues of this job type:
[JobConfig(MaxAttempts = 5)]
public class SendWelcomeEmail : IJob<NewUserPayload>
{
    public async Task ExecuteAsync(NewUserPayload payload, JobContext ctx)
    {
        // Up to 5 attempts before the job is dead-lettered
    }
}
Per-call override
Pass JobOptions when enqueuing to override the class default for a specific enqueue:
await jobs.EnqueueAsync<SendWelcomeEmail>(payload, new JobOptions
{
    MaxAttempts = 10
});
Precedence
| Level | How to set | Default |
|---|---|---|
| Per-call | new JobOptions { MaxAttempts = N } | — |
| Per-class | [JobConfig(MaxAttempts = N)] | 3 |
| Server-side clamp | — | 1–100 |
Resolution order: JobOptions (per-call) > [JobConfig] (per-class) > 3 (hardcoded default).
The server accepts final values in the range 1–100. Any value outside this range is reset to the default of 3, not to the nearest bound.
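The resolution order and range check can be sketched as follows. This is an illustrative model only: the MaxAttemptsResolver class is hypothetical, and null is used here to mean "not set" at a given level.

```csharp
using System;

public static class MaxAttemptsResolver
{
    public const int DefaultMaxAttempts = 3;

    // perCall models JobOptions.MaxAttempts; perClass models [JobConfig(MaxAttempts = N)].
    // null means the value was not set at that level (an assumption of this sketch).
    public static int Resolve(int? perCall, int? perClass)
    {
        // Per-call wins over per-class, which wins over the hardcoded default.
        int requested = perCall ?? perClass ?? DefaultMaxAttempts;

        // Per the text above, a final value outside 1..100 falls back to 3.
        bool inRange = requested >= 1 && requested <= 100;
        return inRange ? requested : DefaultMaxAttempts;
    }
}
```

For example, Resolve(10, 5) gives 10, Resolve(null, 5) gives 5, and Resolve(500, null) gives 3.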
AttemptNumber tracking
AttemptNumber starts at 0 on the job entity and is incremented to 1 when a worker first claims the job. Inside your ExecuteAsync, ctx.AttemptNumber gives the current attempt number.
Use it for conditional logic:
public async Task ExecuteAsync(PaymentPayload payload, JobContext ctx)
{
    // On the last allowed attempt, warn ops before trying once more
    if (ctx.AttemptNumber == ctx.MaxAttempts)
    {
        ctx.Logger.LogWarning("Final attempt for job {JobId}, alerting ops", ctx.JobId);
        await _alertService.NotifyAsync($"Job {ctx.JobId} on final attempt");
    }
    await ProcessPayment(payload, ctx.CancellationToken);
}
Dead letter
When AttemptNumber >= MaxAttempts after a failure, the job moves to DeadLetter:
- State is set to DeadLetter
- CompletedAt is set to the current time
- Error details (ErrorType, ErrorMessage, ErrorStackTrace) are preserved from the last failure
- Any child continuation jobs in Scheduled state are cancelled
Dead-lettered jobs remain in the database for inspection. They are not deleted or cleaned up automatically.
Querying dead letter jobs
GET /flare/v1/jobs?state=dead_letter&limit=50
Manual retry from dead letter
You can requeue a dead-lettered job via the API, SDK, or dashboard:
API
POST /flare/v1/jobs/{id}/retry
This resets the job to Pending, clears error/worker/timing fields, and bumps MaxAttempts if the current AttemptNumber has already reached it.
SDK
var retried = await jobs.RetryAsync(jobId);
// returns true if requeued, false if job is not in a retryable state
RetryAsync returns false (instead of throwing) when the job is in a state that cannot be retried (e.g., Processing or Succeeded). No try/catch needed.
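Because RetryAsync signals a non-retryable job with false, requeuing a batch can be a plain loop that counts outcomes. The helper below is a sketch under assumptions: DeadLetterRequeue is a hypothetical name, job IDs are modelled as strings, and the retry call is passed in as a delegate so the example does not depend on the SDK's client type.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class DeadLetterRequeue
{
    // retryAsync is expected to wrap jobs.RetryAsync(jobId).
    // Assumption: job IDs are strings; adjust to the SDK's actual ID type.
    public static async Task<(int Requeued, List<string> Skipped)> RequeueAllAsync(
        IEnumerable<string> jobIds, Func<string, Task<bool>> retryAsync)
    {
        int requeued = 0;
        var skipped = new List<string>();
        foreach (string id in jobIds)
        {
            // false means the job was not in a retryable state.
            if (await retryAsync(id)) requeued++;
            else skipped.Add(id);
        }
        return (requeued, skipped);
    }
}
```

A call might look like: var (requeued, skipped) = await DeadLetterRequeue.RequeueAllAsync(ids, id => jobs.RetryAsync(id));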
Dashboard
Click the Retry button on the job detail page to requeue a dead-lettered job.
HTTP client retries (SDK to API)
The job-level retries described above are separate from the SDK's HTTP transport retries. The SDK registers its HTTP client with AddStandardResilienceHandler() from Microsoft.Extensions.Http.Resilience, which provides:
- Retry — automatic retry with exponential backoff for transient HTTP failures (5xx, timeouts)
- Circuit breaker — stops sending requests when the API is consistently failing
- Timeout — per-request and total timeout enforcement
These transport-level retries protect against network blips and temporary API outages. They happen transparently before your code sees the response.
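For readers unfamiliar with the standard resilience handler, the fragment below shows how it is typically attached to a named HttpClient. It is not the Flare SDK's own registration: the client name "example-api" and the retry count are illustrative, and whether the SDK exposes these settings is not covered here. It requires the Microsoft.Extensions.Http.Resilience NuGet package.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// "example-api" is a hypothetical client name used only for illustration.
services.AddHttpClient("example-api")
    .AddStandardResilienceHandler(options =>
    {
        // Illustrative value; the handler's other defaults
        // (circuit breaker, timeouts) are left unchanged.
        options.Retry.MaxRetryAttempts = 3;
    });
```

These retries happen inside the HTTP pipeline, so they do not change a job's AttemptNumber.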
Best practices
- Keep jobs idempotent: since jobs may execute more than once, design ExecuteAsync so that re-running with the same payload produces the same result. Use database upserts, check-before-write, or idempotency keys on downstream calls.
- Use ctx.AttemptNumber for logging: always include the attempt number in your log messages so you can trace the retry history:
  ctx.Logger.LogInformation("Attempt {Attempt}/{Max} for job {JobId}",
      ctx.AttemptNumber, ctx.MaxAttempts, ctx.JobId);
- Set reasonable timeouts: jobs without timeouts can run indefinitely and block the worker. Use [JobConfig(TimeoutSeconds = 300)] to cap execution time. The worker reports progress periodically; if it stops reporting, Flare reclaims the job for retry.
- Don't catch and swallow all exceptions: let unexpected exceptions bubble up so the retry engine can do its job. Only catch exceptions when you need to prevent retries (e.g., invalid input data that will never succeed).
- Monitor dead letter counts: use GET /flare/v1/metrics/summary to track dead letter accumulation. A rising dead letter count signals a systemic issue.
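The check-before-write approach from the idempotency practice above can be sketched as follows. Everything here is hypothetical: WelcomeEmailSender and its in-memory HashSet stand in for a real handler and a durable store, and a production version would need the check and the write to be atomic (for example, a unique constraint or an upsert).

```csharp
using System;
using System.Collections.Generic;

public sealed class WelcomeEmailSender
{
    // In-memory stand-in for a durable record of completed work (hypothetical).
    private readonly HashSet<string> _completed = new HashSet<string>();

    public int EmailsSent { get; private set; }

    // Safe to call again on a retry: the side effect is skipped
    // if an earlier attempt already completed it.
    public void Send(string userId)
    {
        string key = "welcome-email:" + userId;
        if (_completed.Contains(key)) return; // already done on an earlier attempt

        EmailsSent++;         // placeholder for the real side effect
        _completed.Add(key);  // record only after the side effect succeeds
    }
}
```

Calling Send twice for the same user leaves EmailsSent at 1, which is the behaviour a retried job needs.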
See also
- Error Handling — exception types and catch patterns
- Idempotency — preventing duplicate work across retries
- JobConfigAttribute — class-level MaxAttempts and TimeoutSeconds
- JobOptions — per-call MaxAttempts override