Observability

Snowpack exposes metrics via OpenTelemetry, emits structured logs via structlog, and surfaces operational data through a hybrid Grafana dashboard.

Metrics

The API and worker processes expose an OTel Prometheus endpoint on :9464. The cluster-wide OTel Collector scrapes this endpoint and forwards metrics to Mimir.
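For a quick local check of what the Collector would scrape, the endpoint's plain-text output can be parsed with a few lines of Python. This is a minimal sketch and not part of Snowpack: the sample payload, its metric name, and its label values are illustrative (the exact exported names depend on the exporter's name translation), and the parser handles only simple `name{labels} value` samples.

```python
import re
from collections import defaultdict

# Matches simple samples such as: name{a="x",b="y"} 12
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line; skip comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a simple sample line (e.g. it carries a timestamp)
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def sum_by_label(text, metric, label):
    """Sum one metric's samples, grouped by a single label."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Illustrative payload; metric name and label values are hypothetical.
SAMPLE_PAYLOAD = """\
# TYPE snowpack_job_total counter
snowpack_job_total{database="analytics",table="events",status="completed"} 12
snowpack_job_total{database="analytics",table="orders",status="completed"} 3
snowpack_job_total{database="analytics",table="orders",status="failed"} 1
"""

jobs_by_status = sum_by_label(SAMPLE_PAYLOAD, "snowpack_job_total", "status")
print(jobs_by_status)
```

In practice the payload would come from an HTTP GET against port 9464 rather than from a string literal.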

Metric inventory

| Metric | Type | Labels | Description |
|---|---|---|---|
| snowpack.job.duration | Histogram | database, table, status | End-to-end job wall-clock time |
| snowpack.job.total | Counter | database, table, status | Total jobs by terminal status |
| snowpack.action.duration | Histogram | database, table, action, success | Per-action wall-clock time |
| snowpack.action.total | Counter | database, table, action, success | Total action executions |
| snowpack.queue.depth | Observable Gauge | (none) | Number of unclaimed, visible jobs in the queue |
| snowpack.workers.active | Observable Gauge | (none) | Count of distinct claimed_by values (active workers) |
| snowpack.tables.discovered | Gauge | (none) | Number of tables in the table cache |
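To illustrate how a duration measurement and its terminal `status` label can be captured together, here is a minimal sketch of a timing context manager. It is not Snowpack's actual code: `timed_job`, the `record` callable, and the status values `completed` and `failed` are assumptions made for the example. In a real service, `record` would wrap the histogram and counter instruments.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_job(record, database, table):
    """Time a block and report its duration with a terminal status label.

    `record(seconds, attributes)` is a hypothetical callable standing in
    for the real metric instruments.
    """
    start = time.perf_counter()
    status = "completed"  # assumed status value
    try:
        yield
    except Exception:
        status = "failed"  # assumed status value
        raise  # never swallow the job's error
    finally:
        elapsed = time.perf_counter() - start
        record(elapsed, {"database": database, "table": table, "status": status})


# Stand-in recorder that just collects measurements in a list.
recorded = []


def record(seconds, attributes):
    recorded.append((seconds, attributes))


with timed_job(record, "analytics", "events"):
    time.sleep(0.01)  # placeholder for real job work

try:
    with timed_job(record, "analytics", "orders"):
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

for seconds, attributes in recorded:
    print(f"{attributes['status']:>9} {attributes['table']}: {seconds:.3f}s")
```

Recording in the `finally` branch means a measurement is emitted on both the success and failure paths, so the histogram and the counter stay consistent with each other.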

Observable gauges

snowpack.queue.depth and snowpack.workers.active are observable gauges — they re-query Postgres on every Prometheus scrape rather than maintaining in-memory counters. This design eliminates state drift across API replicas: every scrape returns the true current value from the database, regardless of which replica serves the request.
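The pull-based pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Snowpack's implementation: an in-memory SQLite database stands in for Postgres, and the `jobs` schema, the `scrape` helper, and the worker names are hypothetical. With the OpenTelemetry Python API, callbacks like these would typically be registered through `Meter.create_observable_gauge(..., callbacks=[...])`, which invokes them at collection time.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; the `jobs` schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"
    " claimed_by TEXT,"
    " visible INTEGER NOT NULL)"
)


def queue_depth():
    """Count unclaimed, visible jobs at the moment of the call."""
    row = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE claimed_by IS NULL AND visible = 1"
    ).fetchone()
    return row[0]


def active_workers():
    """Count distinct workers that currently hold a claim."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT claimed_by) FROM jobs WHERE claimed_by IS NOT NULL"
    ).fetchone()
    return row[0]


# Gauge name -> callback. Nothing is cached between scrapes.
GAUGE_CALLBACKS = {
    "snowpack.queue.depth": queue_depth,
    "snowpack.workers.active": active_workers,
}


def scrape():
    """Simulate one scrape: every gauge is re-read from the database."""
    return {name: callback() for name, callback in GAUGE_CALLBACKS.items()}


# Three visible, unclaimed jobs arrive.
conn.executemany("INSERT INTO jobs (claimed_by, visible) VALUES (NULL, 1)", [()] * 3)
before = scrape()

# Two workers each claim one job.
conn.execute("UPDATE jobs SET claimed_by = 'worker-a' WHERE id = 1")
conn.execute("UPDATE jobs SET claimed_by = 'worker-b' WHERE id = 2")
after = scrape()

print(before)
print(after)
```

Because no value is held in process memory, two replicas running this code against the same database would report identical numbers for the same scrape instant.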

Structured logging

All log output uses structlog with JSON formatting. Key structured events emitted across the system:

| Event | Emitted by | Description |
|---|---|---|
| job_started | Worker | Job claim succeeded, execution beginning |
| job_completed | Worker | All actions finished successfully |
| job_failed | Worker | Job reached terminal failure |
| job_crashed | Worker | Unhandled exception during execution |
| table_cache_synced | API / TableCacheSyncWorker | Table cache refreshed from catalog |
| executor_started | Worker | Spark query engine connected |
| health_sync_started | Health Sync | Health sync cycle beginning |
| health_sync_discovered | Health Sync | Tables discovered from catalog |
| health_sync_collected | Health Sync | Health data collected for tables |
| health_sync_pg_written | Health Sync | Health snapshots persisted to Postgres |
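Because each log line is a JSON object, the events above can be filtered and counted with the standard library. The sketch below assumes structlog's default `event` key for the event name; the other fields (`job_id`, `table`, `level`) and the sample lines themselves are hypothetical.

```python
import json
from collections import Counter

# Hypothetical log output; only the "event" key is assumed from structlog.
SAMPLE_LOGS = """\
{"event": "job_started", "job_id": "j-1", "table": "events", "level": "info"}
{"event": "job_completed", "job_id": "j-1", "table": "events", "level": "info"}
{"event": "job_started", "job_id": "j-2", "table": "orders", "level": "info"}
{"event": "job_failed", "job_id": "j-2", "table": "orders", "level": "error"}
not json: stray line from a dependency
"""


def parse_events(text):
    """Yield each JSON log record that carries an event name."""
    for line in text.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines mixed into the stream
        if isinstance(record, dict) and "event" in record:
            yield record


events = list(parse_events(SAMPLE_LOGS))
counts = Counter(record["event"] for record in events)
failed_jobs = [r["job_id"] for r in events if r["event"] == "job_failed"]

print(counts)
print(failed_jobs)
```

The same filter works on a live stream by iterating over lines from standard input instead of a string.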

Grafana dashboard

Snowpack uses a hybrid Postgres + Athena Grafana dashboard:

  • Postgres panels show live operational data — active jobs, queue depth, recent failures, lock status.
  • Athena panels query historical job and action data for longer-range trend analysis (e.g., compaction duration percentiles over weeks, action success rates by table).

This split keeps the live dashboard responsive (Postgres queries are fast for small working sets) while supporting deep historical analysis without burdening the operational database.