Drain Mode

Drain mode is a cluster-wide pause switch that quiesces Snowpack for planned maintenance windows. When enabled, the API rejects new maintenance submissions and the internal stale-lock sweeper is paused.

What it does

When drainMode is set to "on":

  • POST /tables/{db}/{table}/maintenance returns 503 with {"error": "drain_mode"}. All read endpoints (GET /tables, GET /jobs, /healthz, /readyz) continue to work normally.
  • The reclaim_stale sweeper pauses. This background thread normally reclaims locks from workers that died without releasing them (SIGKILL, OOM, node crash). Pausing it during drain mode keeps it from releasing locks held by workers that are still running their final actions as the cluster quiesces.

Existing running jobs are not cancelled. They are allowed to finish naturally. Drain mode only prevents new work from being submitted.
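Scripts and CI jobs that submit maintenance can treat this rejection as "try again later" rather than as a hard failure. The following is a minimal sketch under stated assumptions: `is_drain_rejection` and `submit_maintenance` are hypothetical helper names (not part of Snowpack), and the check relies only on the 503 status and the `"error": "drain_mode"` field described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Snowpack): succeed (exit 0) if a response
# looks like a drain-mode rejection, i.e. HTTP 503 with "error": "drain_mode".
# Usage: is_drain_rejection STATUS BODY
is_drain_rejection() {
  local status="$1" body="$2"
  [ "$status" = "503" ] || return 1
  # Allow any whitespace around the colon in the JSON body.
  printf '%s' "$body" | grep -Eq '"error"[[:space:]]*:[[:space:]]*"drain_mode"'
}

# Hypothetical wrapper: POST a maintenance request, print the response body,
# and map a drain-mode rejection to exit code 75 (EX_TEMPFAIL, "retry later").
# Any other non-2xx status returns 1.
# Usage: submit_maintenance URL JSON_PAYLOAD
submit_maintenance() {
  local out status body
  # -w appends the status code on its own line after the response body.
  out=$(curl -s -w '\n%{http_code}' -X POST "$1" \
    -H "Content-Type: application/json" -d "$2")
  status=${out##*$'\n'}   # last line: status code
  body=${out%$'\n'*}      # everything before it: response body
  if is_drain_rejection "$status" "$body"; then
    echo "drain mode is on; retry after the maintenance window" >&2
    return 75
  fi
  printf '%s\n' "$body"
  case "$status" in
    2??) return 0 ;;
    *) return 1 ;;
  esac
}
```

A distinct exit code lets a caller (for example a CI retry step) back off and resubmit later instead of reporting the table maintenance itself as broken.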

When to use

Enable drain mode before any operation that could disrupt in-flight maintenance work:

  • Postgres maintenance or upgrades (schema migrations, version bumps, PVC resize)
  • Spark Thrift Server / Kyuubi maintenance windows
  • Snowpack version upgrades that require a clean queue

How to enable

Set drainMode to "on" in the Helm values and apply via Terraform:

# values-dev.yaml (or the environment-specific values file)
drainMode: "on"

Then run:

cd terraform/snowpack-api/env/dev
terraform apply

The API pods will restart with SNOWPACK_DRAIN_MODE=on and immediately begin rejecting new submissions.
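The value change can be scripted instead of hand-editing YAML. This is a minimal sketch, assuming the values file has a single unindented `drainMode:` line like the one above. `set_drain_mode` is a hypothetical helper, not a Snowpack tool. It rewrites the whole line, so an inline comment on that line would be dropped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set the top-level drainMode value in a Helm values file.
# Assumes exactly one line starting with "drainMode:" (no indentation).
# Usage: set_drain_mode FILE on|off
set_drain_mode() {
  local file="$1" mode="$2" tmp rc
  case "$mode" in
    on|off) ;;
    *) echo "mode must be 'on' or 'off', got '$mode'" >&2; return 2 ;;
  esac
  grep -q '^drainMode:' "$file" || { echo "no drainMode key in $file" >&2; return 1; }
  # Write via a temp file instead of sed -i (GNU and BSD sed -i differ),
  # then copy back with cat so the original file's permissions are kept.
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  sed "s/^drainMode:.*/drainMode: \"$mode\"/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

For example, run `set_drain_mode values-dev.yaml on` (adjusting the path to wherever the values file lives), then `terraform apply` from the environment directory as above. Passing `off` reverses the change.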

Pre-enable checklist

Before enabling drain mode, make sure in-flight work has completed:

  1. Check for running jobs:

    curl -s "https://<snowpack-host>/jobs?status=running" | jq .

    If any jobs are still running, wait for them to complete or cancel them explicitly before proceeding.

  2. Verify the job queue is empty:

    curl -s "https://<snowpack-host>/jobs?status=pending" | jq .

    Pending jobs will not be picked up while drain mode is active, but they will resume when drain mode is disabled. If you want a clean slate, cancel pending jobs before enabling drain mode.
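You can poll these two checks until both counts reach zero instead of re-running them by hand. This is a minimal sketch with one key assumption: `GET /jobs?status=...` returns a JSON array. The response shape is not documented here, so adjust the `jq` filter if jobs are wrapped in an object. `count_jobs`, `wait_until_zero`, and `SNOWPACK_URL` are hypothetical names.

```shell
#!/usr/bin/env bash
# Base URL of the Snowpack API; <snowpack-host> is the same placeholder used above.
SNOWPACK_URL="${SNOWPACK_URL:-https://<snowpack-host>}"

# Print how many jobs are in the given status (e.g. running, pending).
# ASSUMPTION: the endpoint returns a JSON array; adjust the jq filter otherwise.
count_jobs() {
  curl -sf "$SNOWPACK_URL/jobs?status=$1" | jq 'length'
}

# Re-run COMMAND until it prints exactly "0".
# Returns 0 when the count reaches zero, 1 if MAX_ATTEMPTS run out.
# A failed request prints nothing, which is treated as "not zero yet".
# Usage: wait_until_zero MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
wait_until_zero() {
  local max="$1" delay="$2" attempt n
  shift 2
  for ((attempt = 1; attempt <= max; attempt++)); do
    n=$("$@") || true
    if [ "$n" = "0" ]; then
      return 0
    fi
    echo "attempt $attempt/$max: $* -> ${n:-no answer}" >&2
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_until_zero 60 30 count_jobs running && wait_until_zero 60 30 count_jobs pending` waits up to about 30 minutes for each queue. The attempt and interval values are arbitrary. Treating an empty answer as "not zero" avoids enabling drain mode just because the API was briefly unreachable.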

How to disable

Set drainMode back to "off" and apply:

drainMode: "off"

Then run:

cd terraform/snowpack-api/env/dev
terraform apply
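The API only picks up the change after `terraform apply`, so it can help to confirm what the values file actually says before applying. The hypothetical sketch below prints the current `drainMode` value and refuses to continue unless it matches the expected one. It assumes the same single unindented `drainMode:` line shown above, with quoted or unquoted values. `read_drain_mode` and `require_drain_mode` are illustrative names, not Snowpack tools.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the drainMode value from a Helm values file,
# skipping any quotes. Assumes a single unindented "drainMode:" line.
read_drain_mode() {
  sed -n 's/^drainMode:[^[:alpha:]]*\([[:alpha:]]*\).*/\1/p' "$1"
}

# Fail (return 1) unless the file's drainMode equals the expected value.
# Usage: require_drain_mode FILE on|off
require_drain_mode() {
  local actual
  actual=$(read_drain_mode "$1")
  if [ "$actual" != "$2" ]; then
    echo "drainMode in $1 is '${actual:-unset}', expected '$2'" >&2
    return 1
  fi
}
```

For example, run `require_drain_mode values-dev.yaml off && (cd terraform/snowpack-api/env/dev && terraform apply)`, adjusting the values-file path for your checkout. The apply runs only if the file really says `off`.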

Post-disable verification

After disabling drain mode, confirm that Snowpack is fully operational:

  1. Check readiness:

    curl -s -o /dev/null -w '%{http_code}\n' "https://<snowpack-host>/readyz"

    Expect a 200 response. If the health-sync CronJob has not run since the maintenance window, /readyz may return 503 until the next sync cycle populates the table cache.

  2. Submit a test dry-run job:

    curl -i -X POST "https://<snowpack-host>/tables/<database>/<table>/maintenance" \
      -H "Content-Type: application/json" \
      -d '{"actions": ["expire_snapshots"], "dry_run": true}'

    A 202 Accepted response confirms that the submission path is working end-to-end.
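Because `/readyz` may keep returning 503 until the next health-sync cycle, the two checks above can be combined into a small wait-then-verify script. This is a minimal sketch. `SNOWPACK_URL`, `http_status`, `dry_run_status`, and `wait_for_status` are hypothetical names. The attempt counts are placeholders that you should size to your health-sync CronJob schedule, which is not stated here.

```shell
#!/usr/bin/env bash
# Base URL of the Snowpack API; <snowpack-host> is the same placeholder used above.
SNOWPACK_URL="${SNOWPACK_URL:-https://<snowpack-host>}"

# Print only the HTTP status code of a GET ("000" if the connection fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Submit the dry-run request from step 2 and print its status code.
# Usage: dry_run_status DATABASE TABLE
dry_run_status() {
  curl -s -o /dev/null -w '%{http_code}' -X POST \
    "$SNOWPACK_URL/tables/$1/$2/maintenance" \
    -H "Content-Type: application/json" \
    -d '{"actions": ["expire_snapshots"], "dry_run": true}'
}

# Re-run COMMAND until it prints WANT; returns 1 after MAX_ATTEMPTS misses.
# Usage: wait_for_status WANT MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
wait_for_status() {
  local want="$1" max="$2" delay="$3" attempt code
  shift 3
  for ((attempt = 1; attempt <= max; attempt++)); do
    code=$("$@") || true
    if [ "$code" = "$want" ]; then
      return 0
    fi
    echo "attempt $attempt/$max: got ${code:-nothing}, want $want" >&2
    sleep "$delay"
  done
  return 1
}
```

For example, run `wait_for_status 200 20 30 http_status "$SNOWPACK_URL/readyz" && wait_for_status 202 1 0 dry_run_status my_db my_table`, where `my_db` and `my_table` stand in for a real table. The dry-run submission gets a single attempt on purpose, so a retry loop does not queue several jobs.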