Troubleshooting

This page covers the most common failure modes, their symptoms, and how to resolve them.

Workers not scaling

Symptom: The job_queue table has unclaimed rows, but no worker pods are spawning. kubectl get pods -n snowpack shows no worker pods.

Cause: The KEDA ScaledJob is not triggering. This usually means the postgresql trigger cannot connect to the database, or the trigger query is not returning the expected result.

Diagnosis:

  1. Check the ScaledJob status:

    kubectl get scaledjob -n snowpack

    Look at the READY column. If it shows False, KEDA cannot evaluate the trigger.

  2. Check KEDA operator logs for connection errors:

    kubectl logs -n keda -l app=keda-operator --tail=50
  3. Verify the job_queue has unclaimed work:

    SELECT COUNT(*) FROM job_queue
    WHERE claimed_at IS NULL AND visible_at <= NOW();
  4. Verify activationTargetQueryValue is set in the ScaledJob trigger metadata. KEDA 2.12+ requires activationTargetQueryValue (not activationLagCount) to activate from zero replicas. Without it, KEDA will not scale up from zero even when there is work in the queue.

Resolution: Fix the KEDA trigger authentication (check the Secret referenced by TriggerAuthentication), verify Postgres connectivity from the KEDA namespace, and confirm the activationTargetQueryValue is present.
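
A ScaledJob trigger with the activation threshold in place looks roughly like this (a sketch: the query mirrors the diagnosis step above, and the authenticationRef name is a placeholder, not Snowpack's actual resource):

```yaml
triggers:
  - type: postgresql
    metadata:
      query: "SELECT COUNT(*) FROM job_queue WHERE claimed_at IS NULL AND visible_at <= NOW()"
      targetQueryValue: "1"
      # Required on KEDA 2.12+ to scale up from zero replicas:
      activationTargetQueryValue: "0"
    authenticationRef:
      name: snowpack-postgres-auth
```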

Jobs stuck in pending

Symptom: Jobs show status: pending for longer than expected. Workers may or may not be running.

Cause: Any of the following:

  • KEDA polling interval is 30 seconds, so there is an inherent delay between a job being queued and a worker pod starting.
  • The visible_at timestamp on the queue row may be in the future (retry backoff).
  • A stale claim from a crashed worker may be blocking the row. The reclaim_stale sweeper releases claims older than 30 minutes, but this requires the API process to be running.

Diagnosis:

  1. Check queue row timestamps:

    SELECT job_id, visible_at, claimed_at
    FROM job_queue
    WHERE claimed_at IS NULL
    ORDER BY visible_at;
  2. Check for stale claims (claimed but not progressing):

    SELECT job_id, claimed_at
    FROM job_queue
    WHERE claimed_at IS NOT NULL
    AND claimed_at < NOW() - INTERVAL '30 minutes';
  3. Verify the API is running (the reclaim_stale sweeper runs inside the API process):

    kubectl get pods -n snowpack -l app.kubernetes.io/component=api

Resolution: If stale claims exist and the API is running, the sweeper will reclaim them within 30 seconds. If the API is not running, fix the API first — the sweeper cannot run without it. For jobs stuck behind a future visible_at, wait for the backoff window to expire.
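
The sweeper's reclaim decision is just a timestamp comparison. A minimal sketch in Python (the function name is illustrative; the actual sweeper runs inside the API and issues the equivalent SQL, using the 30-minute threshold described above):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)

def is_stale(claimed_at, now=None):
    """A claim is stale once it has been held longer than STALE_AFTER."""
    now = now or datetime.now(timezone.utc)
    return claimed_at is not None and claimed_at < now - STALE_AFTER

now = datetime(2026, 4, 25, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(minutes=45), now))  # True: eligible for reclaim
print(is_stale(now - timedelta(minutes=10), now))  # False: still within the claim window
```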

Health sync OOM

Symptom: The health-sync CronJob pod is OOMKilled. kubectl describe pod shows the container exceeded its memory limit.

Cause: PyIceberg loads table metadata into memory. With high concurrency and many large tables, the combined memory footprint exceeds the pod’s limit. This was tracked in DL-278.

Diagnosis:

kubectl get pods -n snowpack -l app.kubernetes.io/component=health-sync --sort-by=.status.startTime
kubectl describe pod <oom-killed-pod> -n snowpack

Look for Last State: Terminated with Reason: OOMKilled and check the memory limit in the container spec.

Resolution: Reduce the SNOWPACK_HEALTH_SYNC_CONCURRENCY setting. For the dev environment the concurrency is set to 2 (down from the default 10). In the Helm values:

healthSync:
  concurrency: 2
  resources:
    limits:
      memory: 768Mi

If the problem persists even at concurrency 2, increase the memory limit rather than lowering concurrency further — at concurrency 1 the sync window may exceed the 15-minute CronJob interval.
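
Concurrency is the right lever because peak memory scales with the number of tables processed at once. A sketch of how a cap like SNOWPACK_HEALTH_SYNC_CONCURRENCY typically bounds in-flight work (illustrative code, not Snowpack's actual implementation):

```python
import asyncio

async def sync_all(tables, concurrency=2):
    """Process tables with at most `concurrency` in flight, so peak memory
    is bounded by concurrency * per-table metadata footprint."""
    sem = asyncio.Semaphore(concurrency)
    peak = 0
    active = 0

    async def sync_one(table):
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # stands in for loading table metadata
            active -= 1

    await asyncio.gather(*(sync_one(t) for t in tables))
    return peak

print(asyncio.run(sync_all(range(10), concurrency=2)))  # never exceeds 2
```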

Table not appearing in orchestrator

Symptom: A table has snowpack.maintenance_enabled = true set as a table property, but the orchestrator never submits maintenance for it.

Cause: The orchestrator only processes tables that satisfy all three conditions:

  1. The table’s database is listed in healthSync.databases.
  2. The table has snowpack.maintenance_enabled = true as a table property.
  3. The table’s database is listed in orchestrator.includeDatabases.

If any condition is not met, the orchestrator will skip the table silently.
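
The three conditions combine with a logical AND. A minimal sketch of the check (a hypothetical helper mirroring the rules above, not the orchestrator's actual code):

```python
def is_eligible(table, health_sync_databases, include_databases):
    """A table gets maintenance only if all three opt-in conditions hold."""
    return (
        table["database"] in health_sync_databases                             # condition 1
        and table["properties"].get("snowpack.maintenance_enabled") == "true"  # condition 2
        and table["database"] in include_databases                             # condition 3
    )

table = {"database": "analytics", "properties": {"snowpack.maintenance_enabled": "true"}}
print(is_eligible(table, {"analytics"}, {"analytics"}))  # True
print(is_eligible(table, {"analytics"}, set()))          # False: not in includeDatabases
```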

Diagnosis:

  1. Verify the table appears in the table cache:

    curl -s "https://<snowpack-host>/tables?database=<database>&maintenance_enabled=true" | jq .

    If the table is not in the response, health-sync has not picked it up.

  2. Check healthSync.databases in the Helm values:

    helm get values snowpack -n snowpack | grep -A5 healthSync

    The table’s database must be in this list.

  3. Check orchestrator.includeDatabases:

    helm get values snowpack -n snowpack | grep -A5 orchestrator

    orchestrator.includeDatabases must be a subset of healthSync.databases. If a database is in the orchestrator list but not in the health-sync list, the opt-in flags on new tables will never reach the table cache and the orchestrator will skip them.

Resolution: Add the database to both healthSync.databases and orchestrator.includeDatabases in the Helm values, then run terraform apply to roll out the change.

409 Conflict on maintenance submit

Symptom: POST /tables/{db}/{table}/maintenance returns 409 Conflict with the message “Maintenance already in progress for {db}.{table}”.

Cause: Another job currently holds the lock for this table. Snowpack uses a table_locks table to ensure only one maintenance job runs per table at a time. The lock is acquired when a job is submitted and released when the job completes, fails, or is cancelled.

Diagnosis:

  1. Check who holds the lock:

    SELECT table_key, holder, acquired_at, expires_at
    FROM table_locks
    WHERE table_key = '<database>.<table>';
  2. Check the status of the holding job:

    curl -s https://<snowpack-host>/jobs/<holder-job-id> | jq .status

Resolution: If the holding job is still running, wait for it to complete. If the holding job has already finished but the lock was not released (crash during cleanup), the reclaim_stale sweeper will release it when the lock’s expires_at has passed. To release a stale lock immediately, cancel the holding job via POST /jobs/{id}/cancel.

Stale table cache

Symptom: The API returns outdated table lists, or newly opted-in tables are not appearing in API responses.

Cause: The table cache is populated by the health-sync CronJob, which runs every 15 minutes. If the CronJob has not completed recently, the cache may be stale.

Diagnosis:

Check the cache status endpoint for the last sync timestamp:

curl -s https://<snowpack-host>/tables/cache-status | jq .

The response includes:

{
  "last_synced": "2026-04-25T12:15:00+00:00",
  "table_count": 142
}

If last_synced is more than 15-20 minutes old, the health-sync CronJob may be failing.
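
To script this check, compare last_synced against a staleness threshold. A sketch (assumes the response shape shown above; the 20-minute threshold matches the guidance):

```python
from datetime import datetime, timedelta, timezone

def cache_is_stale(last_synced_iso, now=None, threshold=timedelta(minutes=20)):
    """True if the last health-sync run is older than the threshold."""
    last = datetime.fromisoformat(last_synced_iso)
    now = now or datetime.now(timezone.utc)
    return now - last > threshold

now = datetime(2026, 4, 25, 12, 40, tzinfo=timezone.utc)
print(cache_is_stale("2026-04-25T12:15:00+00:00", now))  # True: 25 minutes old
print(cache_is_stale("2026-04-25T12:30:00+00:00", now))  # False: 10 minutes old
```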

Resolution:

  1. Check health-sync CronJob history:

    kubectl get cronjob -n snowpack -l app.kubernetes.io/component=health-sync
    kubectl get jobs -n snowpack -l app.kubernetes.io/component=health-sync --sort-by=.status.startTime
  2. If the CronJob is failing, check pod logs for the most recent failed run:

    kubectl logs -n snowpack -l app.kubernetes.io/component=health-sync --tail=100
  3. Common causes include OOM kills (see Health sync OOM above), Glue API throttling, or Postgres connection failures. Fix the underlying issue and the next CronJob run will repopulate the cache.