Troubleshooting

This page covers the most common failure modes, their symptoms, and how to resolve them.

Workers not scaling

Symptom: The job_queue table has unclaimed rows, but no worker pods are spawning. kubectl get pods -n snowpack shows no worker pods.

Cause: The KEDA ScaledJob is not triggering. This usually means the postgresql trigger cannot connect to the database, or the trigger query is not returning the expected result.

Diagnosis:

  1. Check the ScaledJob status:

    kubectl get scaledjob -n snowpack

    Look at the READY column. If it shows False, KEDA cannot evaluate the trigger.

  2. Check KEDA operator logs for connection errors:

    kubectl logs -n keda -l app=keda-operator --tail=50
  3. Verify the job_queue has unclaimed work:

    SELECT COUNT(*) FROM job_queue
    WHERE claimed_at IS NULL AND visible_at <= NOW();
  4. Verify activationTargetQueryValue is set in the ScaledJob trigger metadata. KEDA 2.12+ requires activationTargetQueryValue (not activationLagCount) to activate from zero replicas. Without it, KEDA will not scale up from zero even when there is work in the queue.

Resolution: Fix the KEDA trigger authentication (check the Secret referenced by TriggerAuthentication), verify Postgres connectivity from the KEDA namespace, and confirm the activationTargetQueryValue is present.
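
A ScaledJob trigger with the activation threshold in place looks roughly like this (a sketch: the query mirrors the diagnosis step above, and the authenticationRef name is a placeholder, not Snowpack's actual resource):

```yaml
triggers:
  - type: postgresql
    metadata:
      query: "SELECT COUNT(*) FROM job_queue WHERE claimed_at IS NULL AND visible_at <= NOW()"
      targetQueryValue: "1"
      # Required on KEDA 2.12+ to scale up from zero replicas:
      activationTargetQueryValue: "0"
    authenticationRef:
      name: snowpack-postgres-auth
```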

Jobs stuck in pending

Symptom: Jobs show status: pending for longer than expected. Workers may or may not be running.

Cause: Any of the following:

  • KEDA polling interval is 30 seconds, so there is an inherent delay between a job being queued and a worker pod starting.
  • The visible_at timestamp on the queue row may be in the future (retry backoff).
  • A stale claim from a crashed worker may be blocking the row. The reclaim_stale sweeper releases claims older than 30 minutes, but this requires the API process to be running.

Diagnosis:

  1. Check queue row timestamps:

    SELECT job_id, visible_at, claimed_at
    FROM job_queue
    WHERE claimed_at IS NULL
    ORDER BY visible_at;
  2. Check for stale claims (claimed but not progressing):

    SELECT job_id, claimed_at
    FROM job_queue
    WHERE claimed_at IS NOT NULL
    AND claimed_at < NOW() - INTERVAL '30 minutes';
  3. Verify the API is running (the reclaim_stale sweeper runs inside the API process):

    kubectl get pods -n snowpack -l app.kubernetes.io/component=api

Resolution: If stale claims exist and the API is running, the sweeper will reclaim them within 30 seconds. If the API is not running, fix the API first — the sweeper cannot run without it. For jobs stuck behind a future visible_at, wait for the backoff window to expire.
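
The sweeper's reclaim decision is just a timestamp comparison. A minimal sketch in Python (the function name is illustrative; the actual sweeper runs inside the API and issues the equivalent SQL, using the 30-minute threshold described above):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)

def is_stale(claimed_at, now=None):
    """A claim is stale once it has been held longer than STALE_AFTER."""
    now = now or datetime.now(timezone.utc)
    return claimed_at is not None and claimed_at < now - STALE_AFTER

now = datetime(2026, 4, 25, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(minutes=45), now))  # True: eligible for reclaim
print(is_stale(now - timedelta(minutes=10), now))  # False: still within the claim window
```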

Health sync OOM

Symptom: The health-sync CronJob pod is OOMKilled. kubectl describe pod shows the container exceeded its memory limit.

Cause: PyIceberg loads table metadata into memory. With high concurrency and many large tables, the combined memory footprint exceeds the pod’s limit. This was tracked in DL-278.

Diagnosis:

kubectl get pods -n snowpack -l app.kubernetes.io/component=health-sync --sort-by=.status.startTime
kubectl describe pod <oom-killed-pod> -n snowpack

Look for Last State: Terminated with Reason: OOMKilled and check the memory limit in the container spec.

Resolution: Reduce the SNOWPACK_HEALTH_SYNC_CONCURRENCY setting. For the dev environment the concurrency is set to 2 (down from the default 10). In the Helm values:

healthSync:
  concurrency: 2
  resources:
    limits:
      memory: 768Mi

If the problem persists even at concurrency 2, increase the memory limit rather than lowering concurrency further — at concurrency 1 the sync window may exceed the 15-minute CronJob interval.
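
Concurrency is the right lever because peak memory scales with the number of tables processed at once. A sketch of how a cap like SNOWPACK_HEALTH_SYNC_CONCURRENCY typically bounds in-flight work (illustrative code, not Snowpack's actual implementation):

```python
import asyncio

async def sync_all(tables, concurrency=2):
    """Process tables with at most `concurrency` in flight, so peak memory
    is bounded by concurrency * per-table metadata footprint."""
    sem = asyncio.Semaphore(concurrency)
    peak = 0
    active = 0

    async def sync_one(table):
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # stands in for loading table metadata
            active -= 1

    await asyncio.gather(*(sync_one(t) for t in tables))
    return peak

print(asyncio.run(sync_all(range(10), concurrency=2)))  # never exceeds 2
```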

Table not appearing in orchestrator

Symptom: A table has snowpack.maintenance_enabled = true set as a table property, but the orchestrator never submits maintenance for it.

Cause: The orchestrator only processes tables that satisfy all three conditions:

  1. The table’s database is listed in healthSync.databases.
  2. The table has snowpack.maintenance_enabled = true as a table property.
  3. The table’s database is listed in orchestrator.includeDatabases.

If any condition is not met, the orchestrator will skip the table silently.
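
The three conditions combine with a logical AND. A minimal sketch of the check (a hypothetical helper mirroring the rules above, not the orchestrator's actual code):

```python
def is_eligible(table, health_sync_databases, include_databases):
    """A table gets maintenance only if all three opt-in conditions hold."""
    return (
        table["database"] in health_sync_databases                             # condition 1
        and table["properties"].get("snowpack.maintenance_enabled") == "true"  # condition 2
        and table["database"] in include_databases                             # condition 3
    )

table = {"database": "analytics", "properties": {"snowpack.maintenance_enabled": "true"}}
print(is_eligible(table, {"analytics"}, {"analytics"}))  # True
print(is_eligible(table, {"analytics"}, set()))          # False: not in includeDatabases
```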

Diagnosis:

  1. Verify the table appears in the table cache:

    curl -s "https://<snowpack-host>/tables?database=<database>&maintenance_enabled=true" | jq .

    If the table is not in the response, health-sync has not picked it up.

  2. Check healthSync.databases in the Helm values:

    helm get values snowpack -n snowpack | grep -A5 healthSync

    The table’s database must be in this list.

  3. Check orchestrator.includeDatabases:

    helm get values snowpack -n snowpack | grep -A5 orchestrator

    orchestrator.includeDatabases must be a subset of healthSync.databases. If a database is in the orchestrator list but not in the health-sync list, the opt-in flags on new tables will never reach the table cache and the orchestrator will skip them.

Resolution: Add the database to both healthSync.databases and orchestrator.includeDatabases in the Helm values, then run terraform apply to roll out the change.

409 Conflict on maintenance submit

Symptom: POST /tables/{db}/{table}/maintenance returns 409 Conflict with the message “Maintenance already in progress for {db}.{table}”.

Cause: Another job currently holds the lock for this table. Snowpack uses a table_locks table to ensure only one maintenance job runs per table at a time. The lock is acquired when a job is submitted and released when the job completes, fails, or is cancelled.

Diagnosis:

  1. Check who holds the lock:

    SELECT table_key, holder, acquired_at, expires_at
    FROM table_locks
    WHERE table_key = '<database>.<table>';
  2. Check the status of the holding job:

    curl -s https://<snowpack-host>/jobs/<holder-job-id> | jq .status

Resolution: If the holding job is still running, wait for it to complete. If the holding job has already finished but the lock was not released (crash during cleanup), the reclaim_stale sweeper will release it when the lock’s expires_at has passed. To release a stale lock immediately, cancel the holding job via POST /jobs/{id}/cancel.

Stale table cache

Symptom: The API returns outdated table lists, or newly opted-in tables are not appearing in API responses.

Cause: The table cache is populated by the health-sync CronJob, which runs every 15 minutes. If the CronJob has not completed recently, the cache may be stale.

Diagnosis:

Check the cache status endpoint for the last sync timestamp:

curl -s https://<snowpack-host>/tables/cache-status | jq .

The response includes:

{
  "last_synced": "2026-04-25T12:15:00+00:00",
  "table_count": 142
}

If last_synced is more than 15-20 minutes old, the health-sync CronJob may be failing.
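
To script this check, compare last_synced against a staleness threshold. A sketch (assumes the response shape shown above; the 20-minute threshold matches the guidance):

```python
from datetime import datetime, timedelta, timezone

def cache_is_stale(last_synced_iso, now=None, threshold=timedelta(minutes=20)):
    """True if the last health-sync run is older than the threshold."""
    last = datetime.fromisoformat(last_synced_iso)
    now = now or datetime.now(timezone.utc)
    return now - last > threshold

now = datetime(2026, 4, 25, 12, 40, tzinfo=timezone.utc)
print(cache_is_stale("2026-04-25T12:15:00+00:00", now))  # True: 25 minutes old
print(cache_is_stale("2026-04-25T12:30:00+00:00", now))  # False: 10 minutes old
```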

Resolution:

  1. Check health-sync CronJob history:

    kubectl get cronjob -n snowpack -l app.kubernetes.io/component=health-sync
    kubectl get jobs -n snowpack -l app.kubernetes.io/component=health-sync --sort-by=.status.startTime
  2. If the CronJob is failing, check pod logs for the most recent failed run:

    kubectl logs -n snowpack -l app.kubernetes.io/component=health-sync --tail=100
  3. Common causes include OOM kills (see Health sync OOM above), Glue API throttling, or Postgres connection failures. Fix the underlying issue and the next CronJob run will repopulate the cache.