
Monitoring

Zeridion Flare provides three monitoring pillars out of the box: OpenTelemetry tracing, a Metrics API for custom dashboards, and health endpoints for container orchestration.

OpenTelemetry tracing

The Flare API ships with OpenTelemetry pre-configured. Every request generates distributed traces with automatic instrumentation for:

  • ASP.NET Core — HTTP request spans with route, status code, and duration
  • HttpClient — outbound HTTP call spans
  • Database client — durable-storage query spans with command text and duration

Tenant tagging

Authenticated /flare/v1/* request spans are enriched with tenant context:

| Tag | Value | Description |
| --- | --- | --- |
| tenant.id | Project ID | Identifies which project (customer) owns the request |
| tenant.plan | Plan name | The project's pricing tier (free, starter, pro, business) |

This makes it straightforward to filter traces and metrics by tenant in your observability platform.

Trace export

Spans emitted by your application code are visible in your own observability stack via standard OpenTelemetry exporters. Flare propagates the W3C traceparent header end-to-end so that your client traces, the Flare API request, and your worker's job execution all roll up into a single distributed trace.
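
To make the propagation concrete, here is a minimal sketch that parses a traceparent value with .NET's built-in ActivityContext.TryParse. The header value is a made-up example in the W3C format, not output captured from Flare; the trace ID it contains is the part that every span in the same distributed trace shares.

```csharp
using System;
using System.Diagnostics;

// Hypothetical traceparent value: version-traceid-parentspanid-flags.
const string traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

// ActivityContext.TryParse validates the header and splits it into its parts.
bool parsed = ActivityContext.TryParse(traceparent, null, out ActivityContext context);

string traceId = context.TraceId.ToString();      // shared by every span in the trace
string parentSpanId = context.SpanId.ToString();  // the caller's span
bool sampled = (context.TraceFlags & ActivityTraceFlags.Recorded) != 0;

Console.WriteLine($"parsed={parsed} trace={traceId} parent={parentSpanId} sampled={sampled}");
```

Searching your observability platform for that trace ID should surface the client, API, and worker spans together, provided each hop forwards the header unchanged.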

OpenTelemetry export configuration

Point your worker (and any Flare API instance you self-host) at an OTLP collector by setting the standard OTEL_EXPORTER_OTLP_* environment variables. The .NET host honours them out of the box:

# OTLP endpoint: collectors conventionally listen on port 4317 (gRPC) or 4318 (HTTP).
# With http/protobuf, the exporter appends /v1/traces to this base URL.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otel.example.com:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf" # or "grpc"
export OTEL_EXPORTER_OTLP_HEADERS="api-key=$OTEL_API_KEY"
export OTEL_SERVICE_NAME="my-worker" # shows up as service.name on every span

Then register the exporter in your worker's Program.cs alongside the SDK:

using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services.AddZeridionFlare(o =>
{
    o.ApiKey = builder.Configuration["FLARE_API_KEY"]!;
});

builder.Services
    .AddOpenTelemetry()
    .ConfigureResource(r => r.AddService(serviceName: "my-worker"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSource("Zeridion.Flare.Sdk") // worker poll/ack/heartbeat spans
        .AddOtlpExporter());             // reads OTEL_EXPORTER_OTLP_* from env

Every /flare/v1/* request span and worker SDK span carries the tenant.id and tenant.plan attributes automatically — no manual enrichment needed. You can filter traces by tenant in your observability platform's UI, or add a tail-based sampler that keeps 100% of spans for a specific tenant.id when triaging a customer incident.

Metrics API

Three endpoints provide aggregate job metrics, all scoped to the authenticated project. Use them to build custom dashboards, feed alerting systems, or integrate with external monitoring tools.

Summary

GET /flare/v1/metrics/summary?period=24h

Returns state counts, success rate, and average duration for the specified period.

Query parameters:

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| period | 1h, 24h, 7d, 30d | 24h | Time window for aggregation |

Response:

{
  "total": 1523,
  "pending": 12,
  "scheduled": 3,
  "processing": 8,
  "succeeded": 1450,
  "failed": 5,
  "cancelled": 20,
  "dead_letter": 25,
  "success_rate": 0.9797,
  "avg_duration_ms": 342.5,
  "period": "24h"
}

| Field | Description |
| --- | --- |
| success_rate | succeeded / (succeeded + failed + dead_letter), rounded to 4 decimal places |
| avg_duration_ms | Average execution time across jobs that reported duration_ms, or null if none |
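
As a cross-check of the success_rate definition, the following sketch recomputes the value from the state counts in the sample response. The function name is illustrative, and returning null when no jobs have finished is an assumption of this sketch; the API's behaviour in that case is not stated here.

```csharp
using System;

// success_rate = succeeded / (succeeded + failed + dead_letter), rounded to 4 places.
// Returning null for an empty denominator is an assumption of this sketch.
static double? SuccessRate(long succeeded, long failed, long deadLetter)
{
    long finished = succeeded + failed + deadLetter;
    if (finished == 0) return null;
    return Math.Round((double)succeeded / finished, 4);
}

// Counts from the sample response: 1450 / (1450 + 5 + 25) = 1450 / 1480, which rounds to 0.9797.
double? sampleRate = SuccessRate(1450, 5, 25);
Console.WriteLine(sampleRate);
```

Note that pending, scheduled, processing, and cancelled jobs do not appear in the denominator, so a large backlog does not lower the reported rate.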

Throughput

GET /flare/v1/metrics/throughput?period=7d&granularity=hour

Returns time-bucketed counts of enqueued, succeeded, and failed jobs.

Query parameters:

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| period | 1h, 24h, 7d, 30d | 24h | Time window |
| granularity | minute, hour, day | Auto | Bucket size |

Auto-granularity (when granularity is omitted):

| Period | Default granularity |
| --- | --- |
| 1h | minute |
| 24h | hour |
| 7d | hour |
| 30d | day |
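
A client that wants to know the bucket size before it calls the endpoint can mirror the auto-granularity table locally. The sketch below is one way to do that; the helper names DefaultGranularity and ThroughputUrl are hypothetical and not part of any Flare SDK.

```csharp
using System;

// Mirrors the auto-granularity table: the bucket size applied when the
// granularity query parameter is omitted.
static string DefaultGranularity(string period) => period switch
{
    "1h" => "minute",
    "24h" => "hour",
    "7d" => "hour",
    "30d" => "day",
    _ => throw new ArgumentException($"Unsupported period: {period}", nameof(period)),
};

// Builds the throughput URL, adding granularity only when the caller overrides it.
static string ThroughputUrl(string period, string? granularity = null) =>
    granularity is null
        ? $"/flare/v1/metrics/throughput?period={period}"
        : $"/flare/v1/metrics/throughput?period={period}&granularity={granularity}";

Console.WriteLine(DefaultGranularity("7d"));
Console.WriteLine(ThroughputUrl("7d", "day"));
```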

Response:

{
  "period": "7d",
  "granularity": "hour",
  "data": [
    {
      "timestamp": "2026-03-20T00:00:00+00:00",
      "enqueued": 45,
      "succeeded": 42,
      "failed": 1
    },
    {
      "timestamp": "2026-03-20T01:00:00+00:00",
      "enqueued": 38,
      "succeeded": 37,
      "failed": 0
    }
  ]
}

Queue depth

GET /flare/v1/metrics/queues

Returns the current depth of each queue — how many jobs are pending, processing, or scheduled per queue.

Response:

{
  "queues": [
    {
      "name": "default",
      "pending": 15,
      "processing": 5,
      "scheduled": 2
    },
    {
      "name": "email",
      "pending": 3,
      "processing": 1,
      "scheduled": 0
    }
  ]
}

Use queue depth for autoscaling decisions: when pending grows faster than processing can drain it, you need more workers on that queue.
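
One way to turn "pending grows faster than it drains" into a signal is to compare consecutive samples of the same queue. The sketch below is a hypothetical policy, not a Flare feature: it reports growth per minute between two samples and recommends scaling out only when the backlog has grown across every consecutive pair in the window, which filters out one-off bursts.

```csharp
using System;

// Pending growth per minute between two samples of the same queue.
static double PendingGrowthPerMinute(long earlierPending, long laterPending, TimeSpan elapsed)
{
    if (elapsed <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(elapsed));
    return (laterPending - earlierPending) / elapsed.TotalMinutes;
}

// Hypothetical policy: scale out when pending rose between every pair of
// consecutive samples (oldest first).
static bool ShouldScaleOut(long[] pendingSamples)
{
    if (pendingSamples.Length < 2) return false;
    for (int i = 1; i < pendingSamples.Length; i++)
    {
        if (pendingSamples[i] <= pendingSamples[i - 1]) return false;
    }
    return true;
}

Console.WriteLine(ShouldScaleOut(new long[] { 15, 40, 90 }));
```

The sample count and polling interval are tuning knobs; a longer window reacts more slowly but is less likely to scale on noise.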

Health endpoints

Flare exposes two unauthenticated endpoints for container orchestration probes. Understand the semantic difference before wiring them up:

  • Liveness failure → pod restart. A failing liveness probe tells the orchestrator the process is wedged and must be killed. Use this for catastrophic in-process state (deadlock, OOM-induced unresponsiveness) where restarting fixes things.
  • Readiness failure → traffic removal, cluster-level incident. A failing readiness probe tells the orchestrator to stop sending traffic but leave the pod alive. When every replica goes unready simultaneously (e.g. the database is down), that's a cluster-wide outage — restarting won't help, and an aggressive liveness probe would just crash-loop every pod into the same downstream failure.

Liveness

GET /health/live

Returns 200 Healthy if the process is running and can accept HTTP requests. No external dependencies are checked — this is purely a process liveness signal.

Use as: Kubernetes liveness probe, Azure Container Apps liveness probe.

Readiness

GET /health/ready

Returns 200 Healthy if the process can serve traffic, including verifying that durable storage is reachable. Returns 503 Unhealthy if the storage connection fails.

Use as: Kubernetes readiness probe, Azure Container Apps readiness probe, load balancer health check.

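The probe settings below balance fast failure detection against transient flakiness. Tune them for your environment, but keep liveness more forgiving than readiness, so that an unhealthy pod is taken out of rotation well before the orchestrator restarts it. Because /health/live checks no external dependencies, a 30-second DB blip fails only readiness and does not crash-loop the pod fleet.

| Probe | initialDelaySeconds | periodSeconds | timeoutSeconds | failureThreshold | Effective fail-over time |
| --- | --- | --- | --- | --- | --- |
| Liveness (/health/live) | 30 | 10 | 3 | 3 | ~60s after container start (delay + 3× period), ~30s in steady state, before the pod is killed |
| Readiness (/health/ready) | 15 | 5 | 3 | 3 | ~30s after container start (delay + 3× period), ~15s in steady state, before the pod is taken out of rotation |

# Kubernetes deployment snippet
livenessProbe:
  httpGet:
    path: /health/live
    port: 5100
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 5100
  initialDelaySeconds: 15
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3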
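
When you tune these values, it helps to recompute the fail-over time rather than guess. The sketch below uses the same approximation as the table (delay + threshold × period) and also reports the steady-state figure for a pod that is already running. The function names are illustrative, and the formula ignores timeoutSeconds and scheduling jitter.

```csharp
using System;

// Approximate seconds from container start until the failure action fires,
// assuming the probe fails from its first attempt: delay + threshold × period.
static int SecondsToActionFromStart(int initialDelaySeconds, int periodSeconds, int failureThreshold) =>
    initialDelaySeconds + failureThreshold * periodSeconds;

// Approximate seconds for an already-running pod: threshold × period.
static int SecondsToActionSteadyState(int periodSeconds, int failureThreshold) =>
    failureThreshold * periodSeconds;

Console.WriteLine($"liveness:  {SecondsToActionFromStart(30, 10, 3)}s from start, {SecondsToActionSteadyState(10, 3)}s steady state");
Console.WriteLine($"readiness: {SecondsToActionFromStart(15, 5, 3)}s from start, {SecondsToActionSteadyState(5, 3)}s steady state");
```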
note

Both health endpoints are unauthenticated — no API key required. They are excluded from the authentication middleware pipeline.

Rate limit monitoring

Every /flare/v1/* response includes rate limit headers:

| Header | Description |
| --- | --- |
| X-RateLimit-Limit | Maximum requests per hour for this project |
| X-RateLimit-Remaining | Requests remaining in the current window |
| X-RateLimit-Reset | Unix timestamp (seconds) when the window resets |

Monitor X-RateLimit-Remaining proactively. When it drops below 10% of X-RateLimit-Limit, throttle your request rate to avoid hitting 429.
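
The 10% rule can be checked on every response with a few lines of header parsing. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the demo header values are made up, and what you do once ShouldThrottle returns true (delay, queue, or shed requests) is left to your client.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Headers;

// Reads a numeric header, returning null when it is missing or malformed.
static long? ReadLong(HttpResponseHeaders headers, string name) =>
    headers.TryGetValues(name, out var values) && long.TryParse(values.FirstOrDefault(), out var n)
        ? (long?)n
        : null;

// True when fewer than 10% of the window's requests remain.
static bool ShouldThrottle(HttpResponseHeaders headers)
{
    long? limit = ReadLong(headers, "X-RateLimit-Limit");
    long? remaining = ReadLong(headers, "X-RateLimit-Remaining");
    if (limit is null || remaining is null || limit.Value <= 0) return false;
    return remaining.Value < limit.Value * 0.10;
}

// Time left until the window resets, from the Unix timestamp in X-RateLimit-Reset.
static TimeSpan? TimeUntilReset(HttpResponseHeaders headers, DateTimeOffset now)
{
    long? reset = ReadLong(headers, "X-RateLimit-Reset");
    if (reset is null) return null;
    TimeSpan wait = DateTimeOffset.FromUnixTimeSeconds(reset.Value) - now;
    return wait > TimeSpan.Zero ? wait : TimeSpan.Zero;
}

// Demo with an in-memory response; the header values are made up.
using var response = new HttpResponseMessage();
response.Headers.TryAddWithoutValidation("X-RateLimit-Limit", "1000");
response.Headers.TryAddWithoutValidation("X-RateLimit-Remaining", "42");
response.Headers.TryAddWithoutValidation("X-RateLimit-Reset", "1767225600");

Console.WriteLine(ShouldThrottle(response.Headers));
```

Spreading the remaining requests evenly over TimeUntilReset is one simple throttling strategy; the Rate Limits reference covers backoff after a 429.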

See the Rate Limits reference for tier details and backoff strategies.

Dashboard

The built-in Zeridion dashboard visualizes all metrics in real time:

  • Overview page — summary cards (total, succeeded, failed, dead letter), throughput chart, state distribution pie chart, queue depth bar chart
  • Jobs list — filterable by state, searchable, with cursor-based pagination
  • Job detail — payload, error details, progress bar, metadata, state badge

The dashboard polls the metrics API automatically using TanStack React Query, so data stays fresh without manual refresh.

Alerting patterns

While built-in alerting (email, Slack) is planned for a future release, you can build custom alerting today by polling the metrics API.

tip

The Flare API returns snake_case JSON (e.g., dead_letter, success_rate). When using GetFromJsonAsync with custom POCOs, pass JsonSerializerOptions with PropertyNamingPolicy set to JsonNamingPolicy.SnakeCaseLower (available in .NET 8 and later), or annotate properties with [JsonPropertyName] to match the API field names.
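
To check which POCO property names will bind to which API fields, you can ask the naming policy directly. This short sketch calls JsonNamingPolicy.SnakeCaseLower.ConvertName on a few property names; the names are examples of what a custom POCO might declare, not a published Flare type.

```csharp
using System;
using System.Text.Json;

// JsonNamingPolicy.SnakeCaseLower converts PascalCase property names to the
// snake_case field names used in the API responses.
foreach (var property in new[] { "DeadLetter", "SuccessRate", "AvgDurationMs" })
{
    string field = JsonNamingPolicy.SnakeCaseLower.ConvertName(property);
    Console.WriteLine($"{property} -> {field}");
}
```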

Building a runbook

A useful alert tells the on-call engineer what to do, not just that something is wrong. For each threshold below, the suggested runbook shape is:

| Alert category | Likely root causes | Triage playbook |
| --- | --- | --- |
| Dead-letter spike | New deploy broke a job handler; downstream service (DB, third-party API) is failing for a specific job type | Filter GET /flare/v1/jobs?state=dead_letter for the affected job_type and inspect error_message / error_stack_trace on the most recent rows. If a single error type dominates, roll back or hot-fix; if errors are diverse, suspect a shared downstream. |
| Queue backlog growth | Worker fleet under-scaled for current load; a slow job_type is hogging worker slots; workers are crash-looping | Check worker pod count vs. pending count, then look at the processing count (low = workers idle = worker-side problem, high = workers busy = capacity problem). Scale the worker deployment or split slow jobs into their own queue. |
| Success-rate drop | Downstream outage (third-party API down); recent deploy regression; idempotency conflict storm from a buggy client | Cross-reference the drop time with recent deploys and external status pages. Inspect error_message distribution to fingerprint the failure. |
| Probe failure (cluster-wide) | DB outage; cluster-level DNS or networking issue; bad config rollout to all pods | This is a readiness incident, not a per-pod restart. Check the DB first, then roll back the most recent config push. |
| Probe failure (single pod) | OOM, deadlock, or runaway thread in one replica | This is a liveness event: the orchestrator will restart the pod automatically. Investigate post-mortem if it recurs. |

When you write a runbook entry, link the alert directly to its Triage playbook row in your team wiki so the paged engineer doesn't have to memorise it.

Dead letter alert

Poll the summary endpoint and alert when dead letter count exceeds a threshold. The threshold should be calibrated to your plan tier and steady-state volume — "10 dead letters per hour" is meaningful for a starter-plan project processing ~1000 jobs/hour (1% dead-letter rate), but trivially common noise for a business-plan project processing millions/hour:

| Plan tier | Suggested starting threshold (dead letters / hour) | Notes |
| --- | --- | --- |
| Free | > 3 | Low traffic; any spike is unusual |
| Starter | > 10 | Equivalent to ~1% of a 1k/hr workload |
| Pro | > 50 | Tolerate routine noise; alert on burst |
| Business | > 500 (or dead_letter_rate > 1%) | Use a rate-based threshold, not an absolute count |

using System.Net.Http.Json;
using System.Text.Json;
using Microsoft.Extensions.Hosting;

// MetricsSummary (a custom POCO) and IAlertService are assumed to be defined by your application.
public class DeadLetterMonitor(HttpClient http, IAlertService alerts) : BackgroundService
{
    // Declared inside the class so the polling code below can share it.
    private static readonly JsonSerializerOptions JsonOptions = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower,
        PropertyNameCaseInsensitive = true
    };

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var summary = await http.GetFromJsonAsync<MetricsSummary>(
                "/flare/v1/metrics/summary?period=1h", JsonOptions, ct);

            // TODO: Adjust threshold based on your plan tier and steady-state job volume.
            // Starting point: 10 for starter, 50 for pro, rate-based (>1%) for business.
            if (summary?.DeadLetter > 10)
            {
                await alerts.SendAsync(
                    $"Dead letter alert: {summary.DeadLetter} dead-lettered jobs in the last hour");
            }

            // Poll every 5 minutes; Task.Delay throws on shutdown, which ends the loop.
            await Task.Delay(TimeSpan.FromMinutes(5), ct);
        }
    }
}
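
Rather than hard-coding 10, the check can pick its threshold from the plan tier. The sketch below encodes the suggested starting points as a hypothetical helper. Following the note for the business tier, it uses only the rate-based rule there, and it computes the rate as dead_letter / total, which is an assumption of this sketch since the denominator is not defined above.

```csharp
using System;

// Decides whether to alert, using the suggested starting thresholds per plan.
// Business uses a rate (deadLetter / total > 1%); that denominator is an
// assumption of this sketch, so adjust it to match your own SLO definition.
static bool ShouldAlertOnDeadLetters(string plan, long deadLetter, long total) => plan switch
{
    "free" => deadLetter > 3,
    "starter" => deadLetter > 10,
    "pro" => deadLetter > 50,
    "business" => total > 0 && (double)deadLetter / total > 0.01,
    _ => throw new ArgumentException($"Unknown plan: {plan}", nameof(plan)),
};

Console.WriteLine(ShouldAlertOnDeadLetters("starter", deadLetter: 11, total: 1_000));
Console.WriteLine(ShouldAlertOnDeadLetters("business", deadLetter: 600, total: 1_000_000));
```

In the second call, 600 dead letters out of a million jobs is a 0.06% rate, so the rate-based rule stays quiet where an absolute count of 500 would have fired.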

Queue backlog alert

Monitor queue depth and alert when pending jobs accumulate beyond capacity. Pending-job thresholds scale with your worker concurrency and acceptable end-to-end latency SLO:

| Plan tier (or worker fleet) | Suggested starting threshold (pending / queue) | Notes |
| --- | --- | --- |
| Free / single worker | > 100 | Backlog above ~10× concurrency means you're losing ground |
| Starter / 1–2 workers | > 250 | |
| Pro / 5–10 workers | > 1000 | |
| Business / 10+ workers | > 5000 (or pending / claimed_per_minute > SLO_minutes) | Express as time-to-drain, not absolute count |

var queues = await http.GetFromJsonAsync<QueueDepthResponse>(
    "/flare/v1/metrics/queues", JsonOptions, ct);

foreach (var queue in queues?.Queues ?? [])
{
    // TODO: Adjust based on your worker fleet size and end-to-end latency SLO.
    // Express as "time to drain" (pending / claim-rate) when you have stable throughput data.
    if (queue.Pending > 1000)
    {
        await alerts.SendAsync(
            $"Queue backlog: {queue.Name} has {queue.Pending} pending jobs");
    }
}
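
The time-to-drain form of the threshold can be sketched as follows. The claimedPerMinute input is an assumption: the queue depth endpoint does not report it, so you would derive it from your own throughput data (for example, completed jobs per bucket from the throughput endpoint). The helper names are hypothetical.

```csharp
using System;

// Estimated minutes to drain a queue at the observed claim rate.
// An empty queue drains in zero minutes; a non-empty queue with no
// claims is treated as never draining.
static double MinutesToDrain(long pending, double claimedPerMinute)
{
    if (pending <= 0) return 0;
    return claimedPerMinute > 0 ? pending / claimedPerMinute : double.PositiveInfinity;
}

// True when the backlog would take longer to drain than the latency SLO allows.
static bool BacklogBreachesSlo(long pending, double claimedPerMinute, double sloMinutes) =>
    MinutesToDrain(pending, claimedPerMinute) > sloMinutes;

Console.WriteLine(BacklogBreachesSlo(pending: 5000, claimedPerMinute: 500, sloMinutes: 15));
Console.WriteLine(BacklogBreachesSlo(pending: 5000, claimedPerMinute: 100, sloMinutes: 15));
```

The same 5000-job backlog is acceptable at 500 claims per minute (10 minutes to drain) and a breach at 100 claims per minute (50 minutes), which is why an absolute count alone misleads at high volume.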

Success rate drop

Alert when the success rate falls below an acceptable threshold. Success-rate thresholds are SLO-driven, not plan-tier-driven — but the floor you can practically defend varies with traffic volume (tiny denominators produce noisy rates):

| Steady-state job volume / hour | Suggested success-rate floor | Notes |
| --- | --- | --- |
| < 100 | < 0.80 | Tiny denominator: a single failure swings the rate by at least 1 pp |
| 100 – 1k | < 0.90 | |
| 1k – 100k | < 0.95 | Common SLO target |
| > 100k | < 0.99 (or per-job_type rate) | Aggregate hides individual-handler regressions; alert per-type |

var summary = await http.GetFromJsonAsync<MetricsSummary>(
    "/flare/v1/metrics/summary?period=1h", JsonOptions, ct);

// TODO: Adjust based on your SLO. A floor of 0.80 only makes sense for tiny denominators;
// most production workloads should target 0.95+ aggregate or per-job-type.
if (summary is not null && summary.SuccessRate < 0.95)
{
    await alerts.SendAsync(
        $"Success rate dropped to {summary.SuccessRate:P1} in the last hour");
}
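
If your volume varies, the floor can be chosen from the observed hourly job count instead of being fixed at 0.95. The sketch below maps volume to the suggested floors. The helper names are hypothetical, and the handling of values that fall exactly on a band boundary (such as 1,000 jobs/hour) is an assumption of this sketch.

```csharp
using System;

// Picks the suggested success-rate floor for a steady-state hourly volume.
// Boundary values go to the stricter band here; that choice is an assumption.
static double SuccessRateFloor(long jobsPerHour) => jobsPerHour switch
{
    < 100 => 0.80,
    < 1_000 => 0.90,
    <= 100_000 => 0.95,
    _ => 0.99,
};

// True when the observed rate is below the floor for that volume.
static bool SuccessRateBelowFloor(double successRate, long jobsPerHour) =>
    successRate < SuccessRateFloor(jobsPerHour);

Console.WriteLine(SuccessRateBelowFloor(0.93, jobsPerHour: 500));
Console.WriteLine(SuccessRateBelowFloor(0.93, jobsPerHour: 50_000));
```

A 93% success rate passes at 500 jobs/hour (floor 0.90) but fails at 50,000 jobs/hour (floor 0.95).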

Quota monitoring

Every POST /flare/v1/jobs response includes X-Quota-* headers that tell you how much of your daily job creation quota has been used:

| Header | Description |
| --- | --- |
| X-Quota-Limit | Maximum jobs per day for this project |
| X-Quota-Used | Jobs created today (UTC) |
| X-Quota-Reset | Unix timestamp of next midnight UTC |

Proactive quota monitoring

Monitor the ratio of X-Quota-Used to X-Quota-Limit on job creation responses. When usage exceeds 80% of the daily limit, consider upgrading your plan or deferring non-critical jobs.

var response = await httpClient.PostAsync("/flare/v1/jobs", content);

// Parse as long so large quotas cannot overflow, and guard against a zero limit.
if (response.Headers.TryGetValues("X-Quota-Used", out var usedValues)
    && response.Headers.TryGetValues("X-Quota-Limit", out var limitValues)
    && long.TryParse(usedValues.First(), out var used)
    && long.TryParse(limitValues.First(), out var limit)
    && limit > 0)
{
    var usageRatio = (double)used / limit;
    if (usageRatio > 0.8)
    {
        logger.LogWarning("Daily quota at {Percent:P0} ({Used}/{Limit})",
            usageRatio, used, limit);
    }
}

If you receive a 429 with error code monthly_allowance_exceeded, the allowance resets at the time indicated by X-Quota-Reset (the billing-period end). See Rate Limits for full details.
