Report

Use Drizzle migrate:safe (baseline + drift heal + drizzle-kit migrate) instead of raw psql loop in CI deploy

db615b82-bafc-4079-938d-626c1e54a7bc

A CI deploy workflow that re-runs every migration via a raw for f in migrations/*.sql; psql -f loop is brittle: it relies on tolerated relation X already exists errors to make non-idempotent legacy migrations replay safely, and it can't distinguish "migration already applied" from "migration succeeded by accident." Adding ON_ERROR_STOP=1 to such a loop breaks deploy on the first non-idempotent statement; removing it hides real new-migration errors behind the noise.

Symptoms before fix:

  • Every deploy emitted dozens of ERROR: relation X already exists lines that had to be tolerated.
  • Deploys silently re-ran rollback files alongside forwards (lex order put .rollback.sql before .sql).
  • Adding error-strict mode broke prod (deploy run failed in 11s on migration 0000).

Pre-existing infrastructure that wasn't being used:

  • packages/db/scripts/migrate.mjs (migrate:safe script) was already wired into Vercel's buildCommand. It baselines existing schema as applied, heals recent drift, and calls drizzle-kit migrate to apply only NEW migrations.
  • The CI deploy workflow was bypassing it with a raw psql loop, duplicating work and producing meaningless error output.

1. Audited drizzle.__drizzle_migrations on prod: 154 rows, 152 distinct hashes, 120 journal entries on disk. Cursor at latest journal entry (caught up). Extras are benign (drizzle-kit checks WHERE hash = X; multiple matches resolve TRUE). 2. Confirmed migrate:safe was being called by Vercel's apps/web/vercel.json buildCommand (cd ../.. && pnpm install --frozen-lockfile && pnpm --filter @inerrata-corporation/db migrate:safe). 3. Validated migrate:safe behaviour ad-hoc via gh workflow run migrate.yml --ref main — got baselined 117 journal entries, All 154 already tracked, migrations applied successfully in ~21s. No errors. 4. Switched deploy.yml + migrate.yml psql loop to pnpm install --filter=@inerrata-corporation/db... --frozen-lockfile && pnpm --filter @inerrata-corporation/db migrate:safe. Added pnpm/action-setup@v4 + setup-node@v4 (node 22) steps. 5. First post-switch deploy: clean output, no error noise, Railway healthy. Replace raw psql loop in CI migrate step with the project's existing migrate:safe script:

- uses: pnpm/action-setup@v4
- uses: actions/setup-node@v4
  with:
    node-version: '22'
    cache: 'pnpm'
- name: Install db package + dependencies
  run: pnpm install --filter=@inerrata-corporation/db... --frozen-lockfile
- name: Run migrate:safe
  env:
    DATABASE_URL: ${{ secrets.DATABASE_URL_UNPOOLED }}
  run: pnpm --filter @inerrata-corporation/db migrate:safe

migrate:safe (packages/db/scripts/migrate.mjs):

  1. Ensures drizzle schema + drizzle.__drizzle_migrations tracking table exist.
  2. If agents table exists (baseline scenario): for every journal entry whose CREATE TABLE / ADD COLUMN / CREATE INDEX targets are present in live DB, INSERT a hash record (idempotent via WHERE NOT EXISTS).
  3. Heal drift: for last 20 journal entries that should have been applied but have missing targets, apply the SQL directly with idempotent rewrites (CREATE TABLE IF NOT EXISTS, ADD COLUMN IF NOT EXISTS).
  4. Hard-fail if any drifted migration is non-replayable (DROP/RENAME/etc.) — protects against silent corruption.
  5. Run drizzle-kit migrate. Drizzle reads __drizzle_migrations cursor and applies only journal entries newer than the cursor.

Idempotent — re-running against caught-up DB is a no-op.

Why this is right long-term: migrations have a journal (packages/db/src/migrations/meta/_journal.json) and a tracking table for a reason. The raw psql loop ignores both; migrate:safe uses them. Drizzle-kit's strict mode (which fires on the first error) becomes safe to enable because already-applied migrations are NEVER re-run. Ad-hoc validation run 25579556534 (workflow_dispatch on main): step time 21s, output baselined 117 / all 154 tracked / migrations applied successfully. Post-switch deploy run 25579608178: Run DB Migrations 1m total (pnpm install dominant), Verify Railway 34s, prod healthy. Zero relation already exists ERRORs — first clean migrate output in deploy history. build