Monitoring
Zeridion Flare provides three monitoring pillars out of the box: OpenTelemetry tracing, a Metrics API for custom dashboards, and health endpoints for container orchestration.
OpenTelemetry tracing
The Flare API ships with OpenTelemetry pre-configured. Every request generates distributed traces with automatic instrumentation for:
- ASP.NET Core — HTTP request spans with route, status code, and duration
- HttpClient — outbound HTTP call spans
- Database client — durable-storage query spans with command text and duration
Tenant tagging
Authenticated /flare/v1/* request spans are enriched with tenant context:
| Tag | Value | Description |
|---|---|---|
| tenant.id | Project ID | Identifies which project (customer) owns the request |
| tenant.plan | Plan name | The project's pricing tier (free, starter, pro, business) |
This makes it straightforward to filter traces and metrics by tenant in your observability platform.
Trace export
Spans emitted by your application code are visible in your own observability stack via standard OpenTelemetry exporters. Flare propagates the W3C traceparent header end-to-end so that your client traces, the Flare API request, and your worker's job execution all roll up into a single distributed trace.
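To make the propagation concrete, the following is a minimal sketch of how a worker-side span could join a trace that a client started. It relies only on the built-in System.Diagnostics APIs. The traceparent value, the source name Example.Worker, and the span name process-job are illustrative assumptions, not values Flare emits.

```csharp
using System;
using System.Diagnostics;

// A W3C traceparent header has four dash-separated fields:
// version, trace-id (32 hex chars), parent span-id (16 hex chars), trace-flags.
const string traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

// Register a listener so StartActivity returns a real Activity
// even without the OpenTelemetry SDK wired up.
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData
};
ActivitySource.AddActivityListener(listener);

using var source = new ActivitySource("Example.Worker"); // hypothetical source name

// Parse the incoming header and start a child span under it.
ActivityContext.TryParse(traceparent, null, out var parent);
using var span = source.StartActivity("process-job", ActivityKind.Consumer, parent);

Console.WriteLine($"trace id:  {span?.TraceId}");
Console.WriteLine($"parent id: {span?.ParentSpanId}");
```

The child span reuses the trace ID from the header, which is what lets a tracing backend stitch the client, API, and worker spans into one trace.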
OpenTelemetry export configuration
Point your worker (and any Flare API instance you self-host) at an OTLP collector by setting the standard OTEL_EXPORTER_OTLP_* environment variables. The .NET host honours them out of the box:
# OTLP endpoint: collectors conventionally listen on port 4317 (gRPC) or 4318 (HTTP, where traces are served at /v1/traces)
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otel.example.com:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" # or "grpc"
export OTEL_EXPORTER_OTLP_HEADERS="api-key=$OTEL_API_KEY"
export OTEL_SERVICE_NAME="my-worker" # shows up as service.name on every span
Then register the exporter in your worker's Program.cs alongside the SDK:
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services.AddZeridionFlare(o =>
{
    o.ApiKey = builder.Configuration["FLARE_API_KEY"]!;
});

builder.Services
    .AddOpenTelemetry()
    .ConfigureResource(r => r.AddService(serviceName: "my-worker"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSource("Zeridion.Flare.Sdk") // worker poll/ack/heartbeat spans
        .AddOtlpExporter());             // reads OTEL_EXPORTER_OTLP_* from env
Every /flare/v1/* request span and worker SDK span carries the tenant.id and tenant.plan attributes automatically, so no manual enrichment is needed. You can filter traces by tenant in your observability platform's UI, or configure a tail-based sampler that keeps 100% of traces for a specific tenant.id while you triage a customer incident.
Metrics API
Three endpoints provide aggregate job metrics, all scoped to the authenticated project. Use them to build custom dashboards, feed alerting systems, or integrate with external monitoring tools.
Summary
GET /flare/v1/metrics/summary?period=24h
Returns state counts, success rate, and average duration for the specified period.
Query parameters:
| Parameter | Values | Default | Description |
|---|---|---|---|
| period | 1h, 24h, 7d, 30d | 24h | Time window for aggregation |
Response:
{
  "total": 1523,
  "pending": 12,
  "scheduled": 3,
  "processing": 8,
  "succeeded": 1450,
  "failed": 5,
  "cancelled": 20,
  "dead_letter": 25,
  "success_rate": 0.9797,
  "avg_duration_ms": 342.5,
  "period": "24h"
}
| Field | Description |
|---|---|
| success_rate | succeeded / (succeeded + failed + dead_letter), rounded to 4 decimal places |
| avg_duration_ms | Average execution time across jobs that reported duration_ms, or null if none |
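As a worked check of the success_rate formula, the sketch below recomputes the value from the state counts in the sample response. Treating an empty window as null is an assumption of this sketch; the source does not say what the API returns when no jobs have finished.

```csharp
using System;

// State counts copied from the sample response above.
int succeeded = 1450, failed = 5, deadLetter = 25;

// Only finished outcomes enter the denominator: pending, scheduled,
// processing, and cancelled jobs are excluded by the formula.
int finished = succeeded + failed + deadLetter;

// Assumption: report null when nothing has finished in the window.
double? successRate = finished == 0
    ? null
    : Math.Round((double)succeeded / finished, 4);

Console.WriteLine(successRate); // 1450 / 1480, rounded to 4 decimal places
```

The result matches the 0.9797 shown in the sample response, which also confirms that cancelled jobs do not count against the rate.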
Throughput
GET /flare/v1/metrics/throughput?period=7d&granularity=hour
Returns time-bucketed counts of enqueued, succeeded, and failed jobs.
Query parameters:
| Parameter | Values | Default | Description |
|---|---|---|---|
| period | 1h, 24h, 7d, 30d | 24h | Time window |
| granularity | minute, hour, day | Auto | Bucket size |
Auto-granularity (when granularity is omitted):
| Period | Default granularity |
|---|---|
| 1h | minute |
| 24h | hour |
| 7d | hour |
| 30d | day |
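If a client needs to know the bucket size before the response arrives (for example, to pre-size a chart axis), the table can be mirrored locally. This is a client-side sketch only: ResolveGranularity is a hypothetical helper, and the server's actual selection logic is not shown in this document.

```csharp
using System;

// Mirrors the auto-granularity table: an explicit granularity wins,
// otherwise the bucket size is derived from the period.
static string ResolveGranularity(string period, string? granularity = null) =>
    granularity ?? (period switch
    {
        "1h" => "minute",
        "24h" => "hour",
        "7d" => "hour",
        "30d" => "day",
        _ => throw new ArgumentOutOfRangeException(
            nameof(period), period, "Expected 1h, 24h, 7d or 30d.")
    });

Console.WriteLine(ResolveGranularity("7d"));        // hour
Console.WriteLine(ResolveGranularity("7d", "day")); // day
```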
Response:
{
  "period": "7d",
  "granularity": "hour",
  "data": [
    {
      "timestamp": "2026-03-20T00:00:00+00:00",
      "enqueued": 45,
      "succeeded": 42,
      "failed": 1
    },
    {
      "timestamp": "2026-03-20T01:00:00+00:00",
      "enqueued": 38,
      "succeeded": 37,
      "failed": 0
    }
  ]
}
Queue depth
GET /flare/v1/metrics/queues
Returns the current depth of each queue — how many jobs are pending, processing, or scheduled per queue.
Response:
{
  "queues": [
    {
      "name": "default",
      "pending": 15,
      "processing": 5,
      "scheduled": 2
    },
    {
      "name": "email",
      "pending": 3,
      "processing": 1,
      "scheduled": 0
    }
  ]
}
Use queue depth for autoscaling decisions: when pending grows faster than processing can drain it, you need more workers on that queue.
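One way to turn queue depth into a scaling decision is a simple proportional rule, sketched below. The sizing rule and every parameter (targetJobsPerWorker, min, max) are assumptions for illustration; Flare does not prescribe them, and a production autoscaler would usually also smooth the signal over time.

```csharp
using System;

// Hypothetical sizing rule: one worker per `targetJobsPerWorker` jobs that are
// waiting or in flight, clamped to the allowed fleet size.
static int DesiredWorkers(int pending, int processing, int targetJobsPerWorker, int min, int max)
{
    int backlog = pending + processing;
    int wanted = (int)Math.Ceiling(backlog / (double)targetJobsPerWorker);
    return Math.Clamp(wanted, min, max);
}

// Values from the "default" queue in the sample response: 15 pending, 5 processing.
Console.WriteLine(DesiredWorkers(pending: 15, processing: 5, targetJobsPerWorker: 10, min: 1, max: 20)); // 2
```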
Health endpoints
Two unauthenticated endpoints for container orchestration probes. Understand the semantic difference before wiring them up:
- Liveness failure → pod restart. A failing liveness probe tells the orchestrator the process is wedged and must be killed. Use this for catastrophic in-process state (deadlock, OOM-induced unresponsiveness) where restarting fixes things.
- Readiness failure → traffic removal, cluster-level incident. A failing readiness probe tells the orchestrator to stop sending traffic but leave the pod alive. When every replica goes unready simultaneously (e.g. the database is down), that's a cluster-wide outage — restarting won't help, and an aggressive liveness probe would just crash-loop every pod into the same downstream failure.
Liveness
GET /health/live
Returns 200 Healthy if the process is running and can accept HTTP requests. No external dependencies are checked — this is purely a process liveness signal.
Use as: Kubernetes liveness probe, Azure Container Apps liveness probe.
Readiness
GET /health/ready
Returns 200 Healthy if the process can serve traffic, including verifying that durable storage is reachable. Returns 503 Unhealthy if the storage connection fails.
Use as: Kubernetes readiness probe, Azure Container Apps readiness probe, load balancer health check.
Recommended probe configuration
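These values balance fast failure detection against transient flakiness. Tune them for your environment, but keep liveness more forgiving than readiness: a failed readiness probe only removes the pod from rotation, while a failed liveness probe kills it. Because /health/live checks no external dependencies, a 30-second DB blip then takes pods out of rotation instead of crash-looping the pod fleet.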
| Probe | initialDelaySeconds | periodSeconds | timeoutSeconds | failureThreshold | Effective fail-over time |
|---|---|---|---|---|---|
| Liveness (/health/live) | 30 | 10 | 3 | 3 | ~60s before the pod is killed (delay + 3× period) |
| Readiness (/health/ready) | 15 | 5 | 3 | 3 | ~30s before the pod is taken out of rotation |
# Kubernetes deployment snippet
livenessProbe:
  httpGet:
    path: /health/live
    port: 5100
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 5100
  initialDelaySeconds: 15
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
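The fail-over figures in the table follow from a simple approximation, sketched here so you can recompute them when you tune the probe values. SecondsToFailure is a hypothetical helper and ignores timeoutSeconds, as the table does.

```csharp
using System;

// Approximate seconds from container start until the probe is declared failed:
// initial delay plus failureThreshold consecutive failed periods.
static int SecondsToFailure(int initialDelaySeconds, int periodSeconds, int failureThreshold) =>
    initialDelaySeconds + failureThreshold * periodSeconds;

Console.WriteLine(SecondsToFailure(30, 10, 3)); // 60 (liveness)
Console.WriteLine(SecondsToFailure(15, 5, 3));  // 30 (readiness)
```

Once a pod has been running for a while, the initial delay no longer applies, so steady-state detection takes roughly failureThreshold × periodSeconds (30s for liveness and 15s for readiness with these values).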
Both health endpoints are unauthenticated — no API key required. They are excluded from the authentication middleware pipeline.
Rate limit monitoring
Every /flare/v1/* response includes rate limit headers:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests per hour for this project |
| X-RateLimit-Remaining | Requests remaining in the current window |
| X-RateLimit-Reset | Unix timestamp (seconds) when the window resets |
Monitor X-RateLimit-Remaining proactively. When it drops below 10% of X-RateLimit-Limit, throttle your request rate to avoid hitting 429.
See the Rate Limits reference for tier details and backoff strategies.
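The 10% rule can be checked on each response with a small helper like the sketch below. The header names come from the table above; ShouldThrottle and TryReadLong are hypothetical helpers, and the response is simulated so the example runs without network access.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Http;

// Returns true when fewer than 10% of the hourly request budget remains.
static bool ShouldThrottle(HttpResponseMessage response)
{
    if (TryReadLong(response, "X-RateLimit-Limit", out var limit)
        && TryReadLong(response, "X-RateLimit-Remaining", out var remaining)
        && limit > 0)
    {
        return remaining < limit * 0.10;
    }
    return false; // headers missing or malformed: make no throttling decision
}

// Reads a single numeric header value, returning false if it is absent or not a number.
static bool TryReadLong(HttpResponseMessage response, string name, out long value)
{
    value = 0;
    return response.Headers.TryGetValues(name, out var values)
        && long.TryParse(values.FirstOrDefault(), out value);
}

// Simulated response with made-up header values.
using var response = new HttpResponseMessage(HttpStatusCode.OK);
response.Headers.Add("X-RateLimit-Limit", "1000");
response.Headers.Add("X-RateLimit-Remaining", "42");

Console.WriteLine(ShouldThrottle(response)); // True: 42 is below 10% of 1000
```

In a real client, a delegating handler is a natural place for this check, since it sees every response without changes to the calling code.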
Dashboard
The built-in Zeridion dashboard visualizes all metrics in real time:
- Overview page — summary cards (total, succeeded, failed, dead letter), throughput chart, state distribution pie chart, queue depth bar chart
- Jobs list — filterable by state, searchable, with cursor-based pagination
- Job detail — payload, error details, progress bar, metadata, state badge
The dashboard polls the metrics API automatically using TanStack React Query, so data stays fresh without manual refresh.
Alerting patterns
While built-in alerting (email, Slack) is planned for a future release, you can build custom alerting today by polling the metrics API.
The Flare API returns snake_case JSON (e.g., dead_letter, success_rate). When using GetFromJsonAsync with custom POCOs, pass JsonSerializerOptions with PropertyNamingPolicy set to JsonNamingPolicy.SnakeCaseLower (available in .NET 8 and later), or annotate properties with [JsonPropertyName] to match the API field names.
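The sketch below shows the naming policy in action against a trimmed copy of the summary response from earlier on this page. MetricsSummary is a hypothetical POCO written for this example; the SDK may ship its own types, in which case prefer those.

```csharp
using System;
using System.Text.Json;

var options = new JsonSerializerOptions
{
    PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower // requires .NET 8 or later
};

// Trimmed copy of the summary response shown in the Metrics API section.
const string json = """
    {"total": 1523, "succeeded": 1450, "failed": 5, "dead_letter": 25,
     "success_rate": 0.9797, "avg_duration_ms": 342.5, "period": "24h"}
    """;

var summary = JsonSerializer.Deserialize<MetricsSummary>(json, options)!;
Console.WriteLine($"{summary.DeadLetter} dead-lettered, success rate {summary.SuccessRate}");

// Hypothetical POCO: property names are chosen so the snake_case policy maps them
// onto the API fields (DeadLetter -> dead_letter, AvgDurationMs -> avg_duration_ms).
public sealed record MetricsSummary(
    int Total, int Succeeded, int Failed, int DeadLetter,
    double SuccessRate, double? AvgDurationMs, string Period);
```

Without the naming policy, DeadLetter and SuccessRate would silently deserialize to their default values, which is an easy way to end up with an alert that never fires.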
Building a runbook
A useful alert tells the on-call engineer what to do, not just that something is wrong. For each threshold below, the suggested runbook shape is:
| Alert category | Likely root causes | Triage playbook |
|---|---|---|
| Dead-letter spike | New deploy broke a job handler; downstream service (DB, third-party API) is failing for a specific job type | Filter GET /flare/v1/jobs?state=dead_letter for the affected job_type and inspect error_message / error_stack_trace on the most recent rows. If a single error type dominates, roll back or hot-fix; if errors are diverse, suspect a shared downstream. |
| Queue backlog growth | Worker fleet under-scaled for current load; a slow job_type is hogging worker slots; workers are crash-looping | Check worker pod count vs. pending count, look at processing count (low = workers idle = worker-side problem, high = workers busy = capacity problem). Scale the worker deployment or split slow jobs into their own queue. |
| Success-rate drop | Downstream outage (third-party API down); recent deploy regression; idempotency conflict storm from a buggy client | Cross-reference the drop time with recent deploys and external status pages. Inspect error_message distribution to fingerprint the failure. |
| Probe failure (cluster-wide) | DB outage; cluster-level DNS or networking issue; bad config rollout to all pods | This is a readiness incident, not a per-pod restart. Check the DB first, then roll back the most recent config push. |
| Probe failure (single pod) | OOM, deadlock, or runaway thread in one replica | This is a liveness event — the orchestrator will restart the pod automatically. Investigate post-mortem if it recurs. |
When you write a runbook entry, link the alert directly to its Triage playbook row in your team wiki so the paged engineer doesn't have to memorise it.
Dead letter alert
Poll the summary endpoint and alert when the dead-letter count exceeds a threshold. Calibrate the threshold to your plan tier and steady-state volume: "10 dead letters per hour" is meaningful for a starter-plan project processing ~1000 jobs/hour (a 1% dead-letter rate), but routine noise for a business-plan project processing millions of jobs per hour:
| Plan tier | Suggested starting threshold (dead letters / hour) | Notes |
|---|---|---|
| Free | > 3 | Low traffic — any spike is unusual |
| Starter | > 10 | Equivalent to ~1% of a 1k/hr workload |
| Pro | > 50 | Tolerate routine noise; alert on burst |
| Business | > 500 (or dead_letter_rate > 1%) | Use a rate-based threshold, not an absolute count |
using System.Net.Http.Json;
using System.Text.Json;
using Microsoft.Extensions.Hosting;

public class DeadLetterMonitor(HttpClient http, IAlertService alerts) : BackgroundService
{
    // Shared by all metrics calls: maps snake_case API fields onto PascalCase properties.
    private static readonly JsonSerializerOptions JsonOptions = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower,
        PropertyNameCaseInsensitive = true
    };

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            try
            {
                var summary = await http.GetFromJsonAsync<MetricsSummary>(
                    "/flare/v1/metrics/summary?period=1h", JsonOptions, ct);

                // TODO: Adjust threshold based on your plan tier and steady-state job volume.
                // Starting point: 10 for starter, 50 for pro, rate-based (>1%) for business.
                if (summary?.DeadLetter > 10)
                {
                    await alerts.SendAsync(
                        $"Dead letter alert: {summary.DeadLetter} dead-lettered jobs in the last hour");
                }
            }
            catch (HttpRequestException)
            {
                // Transient API failure: skip this cycle rather than letting the
                // exception end the background service.
            }

            await Task.Delay(TimeSpan.FromMinutes(5), ct);
        }
    }
}
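For high-volume projects, the table suggests a rate-based threshold instead of an absolute count. A minimal sketch of that check follows. The choice of denominator (all finished jobs, matching the success_rate formula) is an assumption, since this page does not define dead_letter_rate precisely, and DeadLetterRateExceeded is a hypothetical helper.

```csharp
using System;

// Compares dead letters against all finished jobs in the window
// instead of alerting on an absolute count.
static bool DeadLetterRateExceeded(long succeeded, long failed, long deadLetter, double maxRate = 0.01)
{
    long finished = succeeded + failed + deadLetter;
    if (finished == 0) return false; // nothing finished, nothing to alert on
    return (double)deadLetter / finished > maxRate;
}

// Counts from the sample summary response: 25 / 1480 is about 1.7%.
Console.WriteLine(DeadLetterRateExceeded(1450, 5, 25));           // True

// A made-up high-volume hour: 6,000 / 1,000,000 is 0.6%.
Console.WriteLine(DeadLetterRateExceeded(990_000, 4_000, 6_000)); // False
```

Note that the second case would trip the absolute "> 500" threshold while staying well under 1%, which is why the rate-based form scales better.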
Queue backlog alert
Monitor queue depth and alert when pending jobs accumulate beyond capacity. Pending-job thresholds scale with your worker concurrency and acceptable end-to-end latency SLO:
| Plan tier (or worker fleet) | Suggested starting threshold (pending / queue) | Notes |
|---|---|---|
| Free / single worker | > 100 | Backlog above ~10× concurrency means you're losing ground |
| Starter / 1–2 workers | > 250 | |
| Pro / 5–10 workers | > 1000 | |
| Business / 10+ workers | > 5000 (or pending / claimed_per_minute > SLO_minutes) | Express as time-to-drain, not absolute count |
var queues = await http.GetFromJsonAsync<QueueDepthResponse>(
    "/flare/v1/metrics/queues", JsonOptions, ct);

foreach (var queue in queues?.Queues ?? [])
{
    // TODO: Adjust based on your worker fleet size and end-to-end latency SLO.
    // Express as "time to drain" (pending / claim-rate) when you have stable throughput data.
    if (queue.Pending > 1000)
    {
        await alerts.SendAsync(
            $"Queue backlog: {queue.Name} has {queue.Pending} pending jobs");
    }
}
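The time-to-drain form of the threshold can be sketched as follows. Both claimedPerMinute and sloMinutes are assumptions: the claim rate would come from your own throughput data, and the SLO from your latency targets. Neither is returned by the queues endpoint.

```csharp
using System;

// Estimated minutes needed to clear the backlog at the current claim rate.
// A non-positive rate means the queue is not draining at all.
static double MinutesToDrain(int pending, double claimedPerMinute) =>
    claimedPerMinute <= 0 ? double.PositiveInfinity : pending / claimedPerMinute;

const double sloMinutes = 15; // hypothetical end-to-end latency SLO

double drain = MinutesToDrain(pending: 5000, claimedPerMinute: 250);
Console.WriteLine($"{drain} min to drain, breach: {drain > sloMinutes}"); // 20 min to drain, breach: True
```

Expressing the alert this way keeps it stable as the worker fleet grows: 5000 pending jobs is a breach at 250 claims per minute but not at 1000.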
Success rate drop
Alert when the success rate falls below an acceptable threshold. Success-rate thresholds are SLO-driven, not plan-tier-driven — but the floor you can practically defend varies with traffic volume (tiny denominators produce noisy rates):
| Steady-state job volume / hour | Suggested success-rate floor | Notes |
|---|---|---|
| < 100 | < 0.80 | Tiny denominator: a single failure swings the rate by at least 1 pp |
| 100 – 1k | < 0.90 | |
| 1k – 100k | < 0.95 | Common SLO target |
| > 100k | < 0.99 (or per-job_type rate) | Aggregate hides individual-handler regressions; alert per-type |
var summary = await http.GetFromJsonAsync<MetricsSummary>(
    "/flare/v1/metrics/summary?period=1h", JsonOptions, ct);

// TODO: Adjust based on your SLO. Floor at 0.80 only makes sense for tiny denominators;
// most production workloads should target 0.95+ aggregate or per-job-type.
if (summary is not null && summary.SuccessRate < 0.95)
{
    await alerts.SendAsync(
        $"Success rate dropped to {summary.SuccessRate:P1} in the last hour");
}
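Because tiny denominators produce noisy rates, one option is to suppress the alert until enough jobs have finished in the window. The sketch below illustrates this; minSample is a hypothetical knob chosen for the example, not a Flare setting.

```csharp
using System;

// Alerts only when the rate is below the floor AND enough jobs finished
// for the rate to be statistically meaningful.
static bool ShouldAlert(double successRate, int finishedJobs, double floor, int minSample = 100) =>
    finishedJobs >= minSample && successRate < floor;

// One failure out of four jobs: the rate looks alarming, but the sample is too small.
Console.WriteLine(ShouldAlert(successRate: 0.75, finishedJobs: 4, floor: 0.95));    // False

// A modest drop over 2000 jobs is a real signal.
Console.WriteLine(ShouldAlert(successRate: 0.93, finishedJobs: 2000, floor: 0.95)); // True
```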
Quota monitoring
Every POST /flare/v1/jobs response includes X-Quota-* headers that tell you how much of your daily job creation quota has been used:
| Header | Description |
|---|---|
| X-Quota-Limit | Maximum jobs per day for this project |
| X-Quota-Used | Jobs created today (UTC) |
| X-Quota-Reset | Unix timestamp of next midnight UTC |
Proactive quota monitoring
Monitor the ratio of X-Quota-Used to X-Quota-Limit on job creation responses. When usage exceeds 80% of the daily limit, consider upgrading your plan or deferring non-critical jobs.
var response = await httpClient.PostAsync("/flare/v1/jobs", content);

if (response.Headers.TryGetValues("X-Quota-Used", out var usedValues)
    && response.Headers.TryGetValues("X-Quota-Limit", out var limitValues)
    && int.TryParse(usedValues.First(), out var used)
    && int.TryParse(limitValues.First(), out var limit)
    && limit > 0) // skip the ratio when no usable limit is reported
{
    var usagePercent = (double)used / limit;
    if (usagePercent > 0.8)
    {
        logger.LogWarning("Daily quota at {Percent:P0} ({Used}/{Limit})",
            usagePercent, used, limit);
    }
}
If you receive a 429 with error code monthly_allowance_exceeded, the allowance resets at the time indicated by X-Quota-Reset (the billing-period end). See Rate Limits for full details.
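To act on X-Quota-Reset, convert the Unix timestamp into a wait duration. The sketch below does this with a fixed clock so the result is reproducible. TimeUntilReset is a hypothetical helper and the header value is made up for the example.

```csharp
using System;

// Converts a Unix-seconds reset header into the time left until the window resets.
// Returns null if the header is missing or not a number.
static TimeSpan? TimeUntilReset(string? resetHeader, DateTimeOffset now)
{
    if (!long.TryParse(resetHeader, out var seconds)) return null;
    var wait = DateTimeOffset.FromUnixTimeSeconds(seconds) - now;
    return wait < TimeSpan.Zero ? TimeSpan.Zero : wait; // the window already rolled over
}

// Fixed "current time" so the example is deterministic; use DateTimeOffset.UtcNow in real code.
var now = new DateTimeOffset(2026, 3, 20, 18, 0, 0, TimeSpan.Zero);

// 1774051200 is midnight UTC on 2026-03-21.
Console.WriteLine(TimeUntilReset("1774051200", now)); // 06:00:00
```

The same conversion applies to X-RateLimit-Reset, which is also expressed in Unix seconds.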
See also
- Metrics API — full endpoint documentation
- Rate Limits — tier limits, monthly allowances, and 429 handling
- Progress Reporting — per-job progress tracking
- Queues and Concurrency — queue depth and scaling