Monitoring

Zeridion Flare provides three monitoring pillars out of the box: OpenTelemetry tracing, a Metrics API for custom dashboards, and health endpoints for container orchestration.

OpenTelemetry tracing

The Flare API ships with OpenTelemetry pre-configured. Every request generates distributed traces with automatic instrumentation for:

ASP.NET Core — HTTP request spans with route, status code, and duration
HttpClient — outbound HTTP call spans
Database client — durable-storage query spans with command text and duration

Tenant tagging

Authenticated /flare/v1/* request spans are enriched with tenant context:

Tag	Value	Description
`tenant.id`	Project ID	Identifies which project (customer) owns the request
`tenant.plan`	Plan name	The project's pricing tier (free, starter, pro, business)

This makes it straightforward to filter traces and metrics by tenant in your observability platform.

Trace export

Spans emitted by your application code are visible in your own observability stack via standard OpenTelemetry exporters. Flare propagates the W3C traceparent header end-to-end so that your client traces, the Flare API request, and your worker's job execution all roll up into a single distributed trace.
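
To make the propagated header concrete, here is a minimal sketch that parses a traceparent value with .NET's built-in ActivityContext.TryParse. The header value is the example from the W3C Trace Context specification, not a real Flare trace, and nothing in the snippet is Flare-specific.

```csharp
using System;
using System.Diagnostics;

// Example value from the W3C Trace Context specification (not a real Flare trace).
// Layout: version-traceId-parentSpanId-flags
const string traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

if (ActivityContext.TryParse(traceparent, null, out ActivityContext context))
{
    // Every span in the same distributed trace shares this trace ID.
    Console.WriteLine($"trace id:       {context.TraceId}");
    Console.WriteLine($"parent span id: {context.SpanId}");
    Console.WriteLine($"sampled:        {context.TraceFlags.HasFlag(ActivityTraceFlags.Recorded)}");
}
```

If the client, the Flare API request, and the worker all report the same trace ID, your backend can stitch their spans into one trace.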

OpenTelemetry export configuration

Point your worker (and any Flare API instance you self-host) at an OTLP collector by setting the standard OTEL_EXPORTER_OTLP_* environment variables. The .NET host honours them out of the box:

# OTLP endpoint: collectors conventionally listen on port 4317 (gRPC) and 4318 (HTTP); over HTTP the exporter appends /v1/traces
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otel.example.com:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"      # or "grpc"
export OTEL_EXPORTER_OTLP_HEADERS="api-key=$OTEL_API_KEY"
export OTEL_SERVICE_NAME="my-worker"                    # shows up as service.name on every span

Then register the exporter in your worker's Program.cs alongside the SDK:

using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services.AddZeridionFlare(o =>
{
    o.ApiKey = builder.Configuration["FLARE_API_KEY"]!;
});

builder.Services
    .AddOpenTelemetry()
    .ConfigureResource(r => r.AddService(serviceName: "my-worker"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSource("Zeridion.Flare.Sdk")    // worker poll/ack/heartbeat spans
        .AddOtlpExporter());                // reads OTEL_EXPORTER_OTLP_* from env

Every /flare/v1/* request span and worker SDK span carries the tenant.id and tenant.plan attributes automatically — no manual enrichment needed. You can filter traces by tenant in your observability platform's UI, or add a tail-based sampler that keeps 100% of spans for a specific tenant.id when triaging a customer incident.

Metrics API

Three endpoints provide aggregate job metrics, all scoped to the authenticated project. Use them to build custom dashboards, feed alerting systems, or integrate with external monitoring tools.

Summary

GET /flare/v1/metrics/summary?period=24h

Returns state counts, success rate, and average duration for the specified period.

Query parameters:

Parameter	Values	Default	Description
`period`	`1h`, `24h`, `7d`, `30d`	`24h`	Time window for aggregation

Response:

{
  "total": 1523,
  "pending": 12,
  "scheduled": 3,
  "processing": 8,
  "succeeded": 1450,
  "failed": 5,
  "cancelled": 20,
  "dead_letter": 25,
  "success_rate": 0.9797,
  "avg_duration_ms": 342.5,
  "period": "24h"
}

Field	Description
`success_rate`	`succeeded / (succeeded + failed + dead_letter)`, rounded to 4 decimal places
`avg_duration_ms`	Average execution time across jobs that reported `duration_ms`, or `null` if none

Throughput

GET /flare/v1/metrics/throughput?period=7d&granularity=hour

Returns time-bucketed counts of enqueued, succeeded, and failed jobs.

Query parameters:

Parameter	Values	Default	Description
`period`	`1h`, `24h`, `7d`, `30d`	`24h`	Time window
`granularity`	`minute`, `hour`, `day`	Auto	Bucket size

Auto-granularity (when granularity is omitted):

Period	Default granularity
`1h`	`minute`
`24h`	`hour`
`7d`	`hour`
`30d`	`day`

Response:

{
  "period": "7d",
  "granularity": "hour",
  "data": [
    {
      "timestamp": "2026-03-20T00:00:00+00:00",
      "enqueued": 45,
      "succeeded": 42,
      "failed": 1
    },
    {
      "timestamp": "2026-03-20T01:00:00+00:00",
      "enqueued": 38,
      "succeeded": 37,
      "failed": 0
    }
  ]
}

Queue depth

GET /flare/v1/metrics/queues

Returns the current depth of each queue — how many jobs are pending, processing, or scheduled per queue.

Response:

{
  "queues": [
    {
      "name": "default",
      "pending": 15,
      "processing": 5,
      "scheduled": 2
    },
    {
      "name": "email",
      "pending": 3,
      "processing": 1,
      "scheduled": 0
    }
  ]
}

Use queue depth for autoscaling decisions: when pending grows faster than processing can drain it, you need more workers on that queue.
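
One way to turn that rule of thumb into a scaling signal is to compare successive pending readings for a queue. The sketch below is hypothetical: the helper name, the window of three polls, and the sample numbers are illustrative, and polling /flare/v1/metrics/queues is left to the caller.

```csharp
using System;
using System.Collections.Generic;

// Returns true when the pending count rose on every one of the last `window` polls,
// i.e. workers are not draining the queue as fast as jobs arrive.
// Hypothetical helper: the name and the default window of 3 are illustrative choices.
static bool IsBacklogGrowing(IReadOnlyList<long> pendingSamples, int window = 3)
{
    if (window < 1 || pendingSamples.Count < window + 1) return false; // not enough history yet

    for (int i = pendingSamples.Count - window; i < pendingSamples.Count; i++)
    {
        if (pendingSamples[i] <= pendingSamples[i - 1]) return false; // flat or draining
    }
    return true;
}

// Successive `pending` readings for one queue, oldest first (made-up numbers).
var samples = new List<long> { 15, 14, 22, 37, 61 };
Console.WriteLine(IsBacklogGrowing(samples) ? "scale up" : "steady");
```

Requiring several consecutive increases keeps a single burst of enqueues from triggering a scale-up.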

Health endpoints

Flare exposes two unauthenticated endpoints for container orchestration probes. Understand the semantic difference before wiring them up:

Liveness failure → pod restart. A failing liveness probe tells the orchestrator the process is wedged and must be killed. Use this for catastrophic in-process state (deadlock, OOM-induced unresponsiveness) where restarting fixes things.
Readiness failure → traffic removal, cluster-level incident. A failing readiness probe tells the orchestrator to stop sending traffic but leave the pod alive. When every replica goes unready simultaneously (e.g. the database is down), that's a cluster-wide outage — restarting won't help, and an aggressive liveness probe would just crash-loop every pod into the same downstream failure.

Liveness

GET /health/live

Returns 200 Healthy if the process is running and can accept HTTP requests. No external dependencies are checked — this is purely a process liveness signal.

Use as: Kubernetes liveness probe, Azure Container Apps liveness probe.

Readiness

GET /health/ready

Returns 200 Healthy if the process can serve traffic, including verifying that durable storage is reachable. Returns 503 Unhealthy if the storage connection fails.

Use as: Kubernetes readiness probe, Azure Container Apps readiness probe, load balancer health check.

Recommended probe configuration

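These values balance fast failure detection against transient flakiness. Tune them for your environment, but keep liveness more forgiving than readiness: an unready pod should leave rotation quickly, while a restart should require a sustained failure. Because /health/live checks no external dependencies, a 30-second DB blip fails readiness only and cannot crash-loop the pod fleet.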

Probe	`initialDelaySeconds`	`periodSeconds`	`timeoutSeconds`	`failureThreshold`	Effective fail-over time
Liveness (`/health/live`)	30	10	3	3	~60s from container start before the pod is killed (delay + 3× period); ~30s once running (3× period)
Readiness (`/health/ready`)	15	5	3	3	~30s from container start before the pod is taken out of rotation; ~15s once running

# Kubernetes deployment snippet
livenessProbe:
  httpGet:
    path: /health/live
    port: 5100
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 5100
  initialDelaySeconds: 15
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

Note: Both health endpoints are unauthenticated, so no API key is required. They are excluded from the authentication middleware pipeline.

Rate limit monitoring

Every /flare/v1/* response includes rate limit headers:

Header	Description
`X-RateLimit-Limit`	Maximum requests per hour for this project
`X-RateLimit-Remaining`	Requests remaining in the current window
`X-RateLimit-Reset`	Unix timestamp (seconds) when the window resets

Monitor X-RateLimit-Remaining proactively. When it drops below 10% of X-RateLimit-Limit, throttle your request rate to avoid hitting 429.
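
A minimal sketch of that check follows. ShouldThrottle and TryReadLong are hypothetical helper names, and the response is built by hand with made-up header values so the snippet runs without calling the API.

```csharp
using System;
using System.Linq;
using System.Net.Http;

// Reads a numeric header; returns false when it is missing or not a number.
static bool TryReadLong(HttpResponseMessage response, string name, out long value)
{
    value = 0;
    return response.Headers.TryGetValues(name, out var values)
        && long.TryParse(values.FirstOrDefault(), out value);
}

// True when fewer than `lowWatermark` (10% by default) of the window's requests remain.
static bool ShouldThrottle(HttpResponseMessage response, double lowWatermark = 0.10)
{
    return TryReadLong(response, "X-RateLimit-Limit", out var limit)
        && TryReadLong(response, "X-RateLimit-Remaining", out var remaining)
        && limit > 0
        && remaining < limit * lowWatermark;
}

// Stand-in for a real /flare/v1/* response (header values are made up).
using var lowResponse = new HttpResponseMessage();
lowResponse.Headers.TryAddWithoutValidation("X-RateLimit-Limit", "1000");
lowResponse.Headers.TryAddWithoutValidation("X-RateLimit-Remaining", "80");
Console.WriteLine(ShouldThrottle(lowResponse) ? "slow down" : "ok");
```

When the headers are missing or malformed, the helper reports that no throttling is needed, so the check cannot block requests by mistake.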

See the Rate Limits reference for tier details and backoff strategies.

Dashboard

The built-in Zeridion dashboard visualizes these metrics in near real time:

Overview page — summary cards (total, succeeded, failed, dead letter), throughput chart, state distribution pie chart, queue depth bar chart
Jobs list — filterable by state, searchable, with cursor-based pagination
Job detail — payload, error details, progress bar, metadata, state badge

The dashboard polls the metrics API automatically using TanStack React Query, so data stays fresh without manual refresh.

Alerting patterns

While built-in alerting (email, Slack) is planned for a future release, you can build custom alerting today by polling the metrics API.

Tip: The Flare API returns snake_case JSON (e.g., dead_letter, success_rate). When using GetFromJsonAsync with custom POCOs, pass JsonSerializerOptions with JsonNamingPolicy.SnakeCaseLower (available from .NET 8) or use [JsonPropertyName] attributes to match the API field names.
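
The alerting snippets in this section deserialize into MetricsSummary and QueueDepthResponse without showing those types. One possible shape is sketched below. The record definitions are assumptions inferred from the sample responses, not types shipped by Flare, and the JSON literal is a trimmed copy of the summary sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

var options = new JsonSerializerOptions
{
    // Maps dead_letter -> DeadLetter, avg_duration_ms -> AvgDurationMs, and so on.
    PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower
};

// Trimmed copy of the summary sample; fields missing from the JSON keep their defaults.
const string sampleJson = """
    { "total": 1523, "succeeded": 1450, "failed": 5, "dead_letter": 25,
      "success_rate": 0.9797, "avg_duration_ms": null, "period": "24h" }
    """;

var parsedSummary = JsonSerializer.Deserialize<MetricsSummary>(sampleJson, options);
Console.WriteLine($"dead-lettered: {parsedSummary?.DeadLetter}");

// Hypothetical POCOs inferred from the sample responses.
public sealed record MetricsSummary(
    long Total, long Pending, long Scheduled, long Processing,
    long Succeeded, long Failed, long Cancelled, long DeadLetter,
    double SuccessRate, double? AvgDurationMs, string Period);

public sealed record QueueDepth(string Name, long Pending, long Processing, long Scheduled);

public sealed record QueueDepthResponse(IReadOnlyList<QueueDepth> Queues);
```

AvgDurationMs is nullable because the API returns null when no job reported a duration.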

Building a runbook

A useful alert tells the on-call engineer what to do, not just that something is wrong. For each threshold below, the suggested runbook shape is:

Alert category	Likely root causes	Triage playbook
Dead-letter spike	New deploy broke a job handler; downstream service (DB, third-party API) is failing for a specific job type	Filter `GET /flare/v1/jobs?state=dead_letter` for the affected `job_type` and inspect `error_message` / `error_stack_trace` on the most recent rows. If a single error type dominates, roll back or hot-fix; if errors are diverse, suspect a shared downstream.
Queue backlog growth	Worker fleet under-scaled for current load; a slow `job_type` is hogging worker slots; workers are crash-looping	Check worker pod count vs. `pending` count, look at `processing` count (low = workers idle = worker-side problem, high = workers busy = capacity problem). Scale the worker deployment or split slow jobs into their own queue.
Success-rate drop	Downstream outage (third-party API down); recent deploy regression; idempotency conflict storm from a buggy client	Cross-reference the drop time with recent deploys and external status pages. Inspect `error_message` distribution to fingerprint the failure.
Probe failure (cluster-wide)	DB outage; cluster-level DNS or networking issue; bad config rollout to all pods	This is a readiness incident, not a per-pod restart. Check the DB first, then roll back the most recent config push.
Probe failure (single pod)	OOM, deadlock, or runaway thread in one replica	This is a liveness event — the orchestrator will restart the pod automatically. Investigate post-mortem if it recurs.

When you write a runbook entry, link the alert directly to its Triage playbook row in your team wiki so the paged engineer doesn't have to memorise it.

Dead letter alert

Poll the summary endpoint and alert when dead letter count exceeds a threshold. The threshold should be calibrated to your plan tier and steady-state volume — "10 dead letters per hour" is meaningful for a starter-plan project processing ~1000 jobs/hour (1% dead-letter rate), but trivially common noise for a business-plan project processing millions/hour:

Plan tier	Suggested starting threshold (dead letters / hour)	Notes
Free	`> 3`	Low traffic — any spike is unusual
Starter	`> 10`	Equivalent to ~1% of a 1k/hr workload
Pro	`> 50`	Tolerate routine noise; alert on burst
Business	`> 500` (or `dead_letter_rate > 1%`)	Use a rate-based threshold, not an absolute count

using System.Net.Http.Json;
using System.Text.Json;
using Microsoft.Extensions.Hosting;

// IAlertService and MetricsSummary are application-defined types.
public class DeadLetterMonitor(HttpClient http, IAlertService alerts) : BackgroundService
{
    // Shared options so the API's snake_case fields bind to PascalCase properties.
    private static readonly JsonSerializerOptions JsonOptions = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower,
        PropertyNameCaseInsensitive = true
    };

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var summary = await http.GetFromJsonAsync<MetricsSummary>(
                "/flare/v1/metrics/summary?period=1h", JsonOptions, ct);

            // TODO: Adjust threshold based on your plan tier and steady-state job volume.
            // Starting point: 10 for starter, 50 for pro, rate-based (>1%) for business.
            if (summary?.DeadLetter > 10)
            {
                await alerts.SendAsync(
                    $"Dead letter alert: {summary.DeadLetter} dead-lettered jobs in the last hour");
            }

            // Poll every five minutes; the delay ends early when the host shuts down.
            await Task.Delay(TimeSpan.FromMinutes(5), ct);
        }
    }
}
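
For high-volume projects the table suggests a rate-based threshold instead of an absolute count. The sketch below is one hypothetical way to express it. It assumes the rate is dead_letter divided by total for the polled period, which is not defined precisely above, and it adds a minimum-volume guard so a handful of jobs cannot trip the alert.

```csharp
using System;

// Hypothetical helper: true when the dead-letter rate exceeds `maxRate`.
// Assumption: rate = dead_letter / total for the polled period.
// `minJobs` skips windows with too few jobs for the rate to be meaningful.
static bool ExceedsDeadLetterRate(long deadLetter, long total, double maxRate = 0.01, long minJobs = 100)
{
    if (total < minJobs) return false;
    return (double)deadLetter / total > maxRate;
}

// Counts from the summary sample: 25 / 1523 is about 1.6%, above the 1% threshold.
Console.WriteLine(ExceedsDeadLetterRate(deadLetter: 25, total: 1523));
```

Swap the result into the if condition of the monitor loop in place of the absolute count when your volume is high enough for a rate to be stable.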

Queue backlog alert

Monitor queue depth and alert when pending jobs accumulate beyond capacity. Pending-job thresholds scale with your worker concurrency and acceptable end-to-end latency SLO:

Plan tier (or worker fleet)	Suggested starting threshold (pending / queue)	Notes
Free / single worker	`> 100`	Backlog above ~10× concurrency means you're losing ground
Starter / 1–2 workers	`> 250`
Pro / 5–10 workers	`> 1000`
Business / 10+ workers	`> 5000` (or `pending / claimed_per_minute > SLO_minutes`)	Express as time-to-drain, not absolute count

var queues = await http.GetFromJsonAsync<QueueDepthResponse>(
    "/flare/v1/metrics/queues", JsonOptions, ct);

foreach (var queue in queues?.Queues ?? [])
{
    // TODO: Adjust based on your worker fleet size and end-to-end latency SLO.
    // Express as "time to drain" (pending / claim-rate) when you have stable throughput data.
    if (queue.Pending > 1000)
    {
        await alerts.SendAsync(
            $"Queue backlog: {queue.Name} has {queue.Pending} pending jobs");
    }
}
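
The time-to-drain form from the table (pending / claimed_per_minute > SLO_minutes) can be sketched as follows. How you measure the claim rate is an assumption left to you, for example from recent throughput buckets. The helper name and the numbers are illustrative.

```csharp
using System;

// Estimated minutes needed to work off the current backlog at the observed claim rate.
// Hypothetical helper; `claimedPerMinute` must be measured by the caller.
static double MinutesToDrain(long pending, double claimedPerMinute)
{
    if (pending <= 0) return 0;
    if (claimedPerMinute <= 0) return double.PositiveInfinity; // nothing is being claimed
    return pending / claimedPerMinute;
}

// Made-up numbers: 1,000 pending jobs, workers claiming 40 jobs/minute, 15-minute SLO.
double minutesToDrain = MinutesToDrain(pending: 1000, claimedPerMinute: 40);
bool sloBreached = minutesToDrain > 15;
Console.WriteLine($"drain time {minutesToDrain} min, SLO breached: {sloBreached}");
```

Here the backlog needs 25 minutes to drain, so a 15-minute SLO is breached even though 1,000 pending jobs is not above the absolute Pro threshold.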

Success rate drop

Alert when the success rate falls below an acceptable threshold. Success-rate thresholds are SLO-driven, not plan-tier-driven — but the floor you can practically defend varies with traffic volume (tiny denominators produce noisy rates):

Steady-state job volume / hour	Suggested success-rate floor	Notes
`< 100`	`< 0.80`	Tiny denominator: a single failure swings the rate by at least 1pp
`100 – 1k`	`< 0.90`
`1k – 100k`	`< 0.95`	Common SLO target
`> 100k`	`< 0.99` (or per-`job_type` rate)	Aggregate hides individual-handler regressions; alert per-type

var summary = await http.GetFromJsonAsync<MetricsSummary>(
    "/flare/v1/metrics/summary?period=1h", JsonOptions, ct);

// TODO: Adjust based on your SLO. Floor at 0.80 only makes sense for tiny denominators —
// most production workloads should target 0.95+ aggregate or per-job-type.
if (summary is not null && summary.SuccessRate < 0.95)
{
    await alerts.SendAsync(
        $"Success rate dropped to {summary.SuccessRate:P1} in the last hour");
}

Quota monitoring

Every POST /flare/v1/jobs response includes X-Quota-* headers that tell you how much of your monthly job allowance has been used:

Header	Description
`X-Quota-Limit`	Job allowance for the current billing period (`-1` = unlimited)
`X-Quota-Used`	Jobs created this billing period
`X-Quota-Overage`	Jobs created beyond the allowance this period (opt-in overage)
`X-Quota-Overage-Cost`	Accrued overage cost this period, in cents
`X-Quota-Spend-Cap`	Your overage spend cap in cents — only present when a cap is set
`X-Quota-Reset`	Unix timestamp of the period end (billing-period end; month-end UTC on Free)

Proactive quota monitoring

Monitor the ratio of X-Quota-Used to X-Quota-Limit on job creation responses. When usage exceeds 80% of the monthly allowance, consider enabling opt-in overage, upgrading your plan, or deferring non-critical jobs.

var response = await httpClient.PostAsync("/flare/v1/jobs", content);

// Parse as long so large allowances cannot overflow, and skip the check when the
// allowance is unlimited (X-Quota-Limit: -1) so the ratio is never meaningless.
if (response.Headers.TryGetValues("X-Quota-Used", out var usedValues)
    && response.Headers.TryGetValues("X-Quota-Limit", out var limitValues)
    && long.TryParse(usedValues.FirstOrDefault(), out var used)
    && long.TryParse(limitValues.FirstOrDefault(), out var limit)
    && limit > 0)
{
    var usagePercent = (double)used / limit;
    if (usagePercent > 0.8)
    {
        logger.LogWarning("Monthly allowance at {Percent:P0} ({Used}/{Limit})",
            usagePercent, used, limit);
    }
}

If you receive a 429 with error code monthly_allowance_exceeded, the allowance resets at the time indicated by X-Quota-Reset (the billing-period end). See Rate Limits for full details.
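
To decide how long to hold back job creation after that 429, you can convert X-Quota-Reset into a wait time. The sketch below assumes the header is expressed in Unix seconds, as X-RateLimit-Reset is. The helper name and the timestamps are illustrative.

```csharp
using System;

// Time remaining until the quota resets, clamped at zero when the reset is in the past.
// Assumption: X-Quota-Reset is a Unix timestamp in seconds.
static TimeSpan DelayUntilReset(long resetUnixSeconds, DateTimeOffset now)
{
    var resetAt = DateTimeOffset.FromUnixTimeSeconds(resetUnixSeconds);
    return resetAt > now ? resetAt - now : TimeSpan.Zero;
}

// Made-up timestamps one hour apart; in production pass DateTimeOffset.UtcNow.
var currentTime = DateTimeOffset.FromUnixTimeSeconds(1_775_000_000);
Console.WriteLine(DelayUntilReset(1_775_003_600, currentTime));
```

Passing the current time as a parameter keeps the helper deterministic and easy to unit test.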
