Observability
Snowpack exposes metrics via OpenTelemetry, emits structured logs via structlog, and surfaces operational data through a hybrid Grafana dashboard.
Metrics
The API and worker processes expose an OTel Prometheus endpoint on :9464. The cluster-wide OTel Collector scrapes this endpoint and forwards metrics to Mimir.
Metric inventory
| Metric | Type | Labels | Description |
|---|---|---|---|
| snowpack.job.duration | Histogram | database, table, status | End-to-end job wall-clock time |
| snowpack.job.total | Counter | database, table, status | Total jobs by terminal status |
| snowpack.action.duration | Histogram | database, table, action, success | Per-action wall-clock time |
| snowpack.action.total | Counter | database, table, action, success | Total action executions |
| snowpack.queue.depth | Observable Gauge | | Number of unclaimed, visible jobs in the queue |
| snowpack.workers.active | Observable Gauge | | Count of distinct claimed_by values (active workers) |
| snowpack.tables.discovered | Gauge | | Number of tables in the table cache |
Observable gauges
snowpack.queue.depth and snowpack.workers.active are observable gauges: they re-query Postgres on every Prometheus scrape rather than maintaining in-memory counters. This design eliminates state drift across API replicas, because every scrape returns the current value from the database regardless of which replica serves the request.
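The compute-on-read idea can be sketched as follows. This is a minimal illustration, not Snowpack's implementation: sqlite3 stands in for Postgres so the example runs without a server, and the `jobs` table, the `visible_at` column, and both function names are hypothetical. Only `claimed_by` comes from the metric description.

```python
import sqlite3

# sqlite3 stands in for Postgres; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, claimed_by TEXT, visible_at REAL)"
)
conn.executemany(
    "INSERT INTO jobs (claimed_by, visible_at) VALUES (?, ?)",
    [
        (None, 0.0),        # unclaimed and visible
        (None, 0.0),        # unclaimed and visible
        (None, 9999.0),     # unclaimed but not yet visible
        ("worker-a", 0.0),  # claimed
        ("worker-a", 0.0),  # claimed by the same worker
        ("worker-b", 0.0),  # claimed by a second worker
    ],
)


def queue_depth(db, now):
    """Count unclaimed jobs whose visibility time has passed."""
    row = db.execute(
        "SELECT COUNT(*) FROM jobs WHERE claimed_by IS NULL AND visible_at <= ?",
        (now,),
    ).fetchone()
    return row[0]


def active_workers(db):
    """Count distinct claimed_by values among claimed jobs."""
    row = db.execute(
        "SELECT COUNT(DISTINCT claimed_by) FROM jobs WHERE claimed_by IS NOT NULL"
    ).fetchone()
    return row[0]


# Each simulated scrape recomputes the values; nothing is cached between calls.
depth_before = queue_depth(conn, now=100.0)
workers_before = active_workers(conn)

# A third worker claims job 1; the next scrape reflects it immediately.
conn.execute("UPDATE jobs SET claimed_by = 'worker-c' WHERE id = 1")
depth_after = queue_depth(conn, now=100.0)
workers_after = active_workers(conn)
```

In opentelemetry-python, functions like these would typically be wrapped in callbacks passed to `Meter.create_observable_gauge(name, callbacks=[...])`, so the query runs each time the metric reader collects.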
Structured logging
All log output uses structlog with JSON formatting. The key structured events emitted across the system are:
| Event | Emitted by | Description |
|---|---|---|
| job_started | Worker | Job claim succeeded, execution beginning |
| job_completed | Worker | All actions finished successfully |
| job_failed | Worker | Job reached terminal failure |
| job_crashed | Worker | Unhandled exception during execution |
| table_cache_synced | API / TableCacheSyncWorker | Table cache refreshed from catalog |
| executor_started | Worker | Spark query engine connected |
| health_sync_started | Health Sync | Health sync cycle beginning |
| health_sync_discovered | Health Sync | Tables discovered from catalog |
| health_sync_collected | Health Sync | Health data collected for tables |
| health_sync_pg_written | Health Sync | Health snapshots persisted to Postgres |
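Because each log line is a JSON object, events can be filtered with a few lines of code. The sketch below assumes structlog's default JSON layout, where the event name is stored under the `event` key. The sample lines, the `job_id` and `level` fields, and the `parse_events` helper are hypothetical.

```python
import json
from collections import Counter

# Hypothetical log lines; only the event names come from the table above.
raw_lines = [
    '{"event": "job_started", "job_id": "j-1", "level": "info"}',
    '{"event": "job_completed", "job_id": "j-1", "level": "info"}',
    '{"event": "job_started", "job_id": "j-2", "level": "info"}',
    '{"event": "job_failed", "job_id": "j-2", "level": "error"}',
    "not json: a stray line from another process",
]


def parse_events(lines):
    """Yield one dict per valid JSON log line that has an event name."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output
        if isinstance(record, dict) and "event" in record:
            yield record


events = list(parse_events(raw_lines))
counts = Counter(e["event"] for e in events)
failed_jobs = [e["job_id"] for e in events if e["event"] == "job_failed"]
```

The same filter-by-event pattern applies in a log aggregation tool: match on the event name first, then narrow by the remaining fields.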
Grafana dashboard
Snowpack uses a hybrid Postgres + Athena Grafana dashboard:
- Postgres panels show live operational data: active jobs, queue depth, recent failures, and lock status.
- Athena panels query historical job and action data for longer-range trend analysis (e.g., compaction duration percentiles over weeks, action success rates by table).
This split keeps the live dashboard responsive (Postgres queries are fast for small working sets) while supporting deep historical analysis without burdening the operational database.