Orphaned Postgres Slots Causing WAL Bloat (2026)

Akshit Ahuja


Co-Founder & Lead Engineer

February 25, 2026
#postgres #replication #logical-replication #wal #production-incidents #managed-postgres #devops

This is a weird one because nothing looks broken until your database is 95% full.

Your app is fine. CPU is fine. P95 is fine. Then storage alerts start screaming and the box is basically bricked.

A common cause in 2026: an orphaned replication slot forcing Postgres to retain write-ahead log (WAL) segments indefinitely.

This happens a lot on managed Postgres (Neon, Supabase, RDS, Cloud SQL) because you try CDC for analytics, a read model, or a tool like Debezium or Airbyte. Then you kill the consumer and forget the slot.

Postgres will keep WAL around for a slot even if the consumer died weeks ago. That is the whole point of slots. It is also how you get a surprise outage.

The failure mode nobody graphs (until after the outage)

Replication slots persist independently of the connection using them and they are crash-safe. That means a slot can sit there, quietly pinning WAL, even when there is zero replication traffic.

The Postgres docs warn about this directly: a slot prevents removal of WAL that its consumer still needs, and in extreme cases it can consume enough storage to fill the disk until you drop it.

The most annoying part is the lag between cause and effect. A connector breaks on Monday. The incident happens on Thursday when WAL backlog finally fills disk.

A real incident shape (numbers you can sanity check)

Here is the pattern we see in rescues.

1) Founder runs logical replication to a reporting DB or a CDC pipeline to a warehouse.

2) A DDL change hits prod (add a column, change a type).

3) Subscriber does not apply the DDL (logical replication does not copy DDL). Apply worker starts erroring and retrying.

4) The slot stays around and WAL cannot be recycled.

One published case study shows a slot retaining about 185 GB of WAL and a pg_wal directory around 210 GB before someone stepped in. That matches what we see in the wild when a DB has steady writes.

On a managed plan with 100-200 GB of storage, that can be a same-day outage.

Quick mental model: why WAL grows forever

WAL is append-only. Postgres checkpoints and removes old segments when it is safe.

A replication slot pins a restart_lsn. Postgres must keep WAL from restart_lsn onward so the consumer can catch up.

If the consumer never acknowledges progress (confirmed_flush_lsn), restart_lsn never moves. Disk use is unbounded unless you put a cap on it.

If you remember one sentence: a slot is a promise to never forget. Promises cost disk.
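The arithmetic behind that promise is simple enough to model in a few lines. A toy simulation, with made-up numbers (the 2 GB/hour write rate is an assumption, not a measurement):

```python
# Toy model of WAL retention: the primary appends WAL, and a slot pins
# everything from restart_lsn onward. If the consumer never acks,
# retained WAL grows without bound. Illustrative numbers only.

def retained_wal_bytes(current_lsn: int, restart_lsn: int) -> int:
    """What pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) measures."""
    return current_lsn - restart_lsn

current = 0
restart = 0                   # consumer acked up to here, then died
wal_per_hour = 2 * 1024**3    # assume ~2 GB of WAL per hour

for hour in range(72):        # consumer dead for 3 days
    current += wal_per_hour
    # restart never advances: no acknowledgments arriving

print(retained_wal_bytes(current, restart) / 1024**3, "GB retained")
# 72 hours * 2 GB/h = 144 GB pinned on disk
```

The point of the model: retention is linear in your write rate and in how long the consumer stays silent, which is why quiet failures turn into outages days later.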

Step 0: confirm you are actually dying from WAL

Before you nuke slots, prove pg_wal is the thing growing.

On a self-managed box:

du -sh $PGDATA/pg_wal

On managed Postgres you may not have shell access, so you use SQL views and provider metrics. If your provider exposes disk usage but not pg_wal size, look for a strong correlation: disk usage rises while table bloat does not.
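That correlation check can be scripted against whatever metrics your provider exports. A minimal sketch, assuming you can pull daily samples of total disk usage and total table size (the sample numbers and the 10 GB gap threshold are made up):

```python
# If you cannot see pg_wal directly, compare total disk growth with
# table growth over the same window. Heuristic only.

def wal_suspected(disk_gb: list[float], tables_gb: list[float],
                  gap_threshold_gb: float = 10.0) -> bool:
    """True when disk grew far more than tables did over the window."""
    disk_growth = disk_gb[-1] - disk_gb[0]
    table_growth = tables_gb[-1] - tables_gb[0]
    return disk_growth - table_growth > gap_threshold_gb

# Disk climbs 60 GB in four days while tables barely move: smells like WAL.
print(wal_suspected([120, 135, 155, 180], [80, 81, 81, 82]))  # True
```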

Step 1: list slots and compute retained WAL

Run this on the primary:

SELECT
slot_name,
slot_type,
database,
active,
restart_lsn,
confirmed_flush_lsn,
wal_status,
invalidation_reason,
inactive_since,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

What you are looking for: a slot with retained_wal in the tens of GB (or more) that is not clearly a critical replica.
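If you only have raw LSN strings (from logs or a dashboard) rather than a live connection, you can do the pg_wal_lsn_diff arithmetic yourself. An LSN prints as two hex words, high/low 32 bits. A sketch; the sample LSNs below are made up:

```python
# pg_wal_lsn_diff in plain Python, for eyeballing LSNs from logs.

def lsn_to_int(lsn: str) -> int:
    """Parse Postgres pg_lsn text format 'XXXXXXXX/YYYYYYYY' into bytes."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lsn_diff_bytes(newer: str, older: str) -> int:
    """Bytes of WAL between older and newer. Mirrors pg_wal_lsn_diff(newer, older)."""
    return lsn_to_int(newer) - lsn_to_int(older)

# A slot pinned roughly 34 GB behind the current write position:
diff = lsn_diff_bytes("16/B374D848", "E/12000000")
print(round(diff / 1024**3, 1), "GB")
```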

Two details worth knowing in 2026:

- wal_status tells you whether WAL for the slot is reserved, extended, unreserved, or lost (Postgres 13+).

- invalidation_reason (Postgres 17+) tells you why a slot was invalidated: wal_removed, rows_removed, idle_timeout, and so on.

Step 2: decide if the slot is orphaned

Orphaned does not always mean inactive. Some consumers connect, fail, and reconnect in a loop, so you can see active = true on a slot that is effectively dead.

Use a few signals together:

- inactive_since is days or weeks ago

- confirmed_flush_lsn has not changed

- retained_wal only grows

- you cannot find any running service that claims the slot name

Healthy teams name slots like app_env_purpose. If you see slot names like 'airbyte' or 'debezium' and you uninstalled that tool, you probably found the culprit.

Step 3: identify the owner without guessing

Slots get created by many things:

- CREATE SUBSCRIPTION (logical replication between Postgres servers)

- Debezium, Airbyte, Fivetran, custom CDC consumers

- Physical streaming replicas for HA

The fastest way to avoid a bad drop is to find what owns it:

1) Search your infra repo for slot_name

2) Search connector configs for 'replication_slot' or 'slot.name'

3) Check if any subscriptions exist (on the subscriber DB):

\dRs+    (in psql, lists subscriptions)

If you have a subscription, dropping the slot on the primary will break it.

The blunt fix: drop the slot

If you are confident it is orphaned, do this:

SELECT pg_drop_replication_slot('slot_name_here');

Then watch pg_wal shrink over time. It may not drop instantly; recycling follows checkpoints.

If disk is already near full, you often need to create breathing room first (temporary disk increase on managed Postgres, or add volume on self-managed) so you do not get stuck mid-fix.

If you cannot drop it: unblock the consumer

If the slot is legitimate (payments, billing events, anything that must not miss changes), you do not drop it casually. You make it advance.

Common stall causes:

1) Schema drift (logical replication does not replicate DDL)

2) Permissions (replication user lost rights)

3) Network or TLS changes

4) Connector bugs or version changes

Schema drift playbook

If you do logical replication between Postgres servers, DDL drift is the classic foot-gun. Add a column on primary, subscriber errors, WAL backlog grows.

Fix:

- Apply the same DDL on the subscriber

- Restart the apply worker / connector

- Verify confirmed_flush_lsn starts moving

On the subscriber, pg_stat_subscription is your friend:

SELECT subname, received_lsn, last_msg_send_time, last_msg_receipt_time, latest_end_lsn
FROM pg_stat_subscription;

Consumer is dead but you need data

Sometimes you actually want to keep the slot, but the consumer is down for a planned window (upgrade, warehouse outage, etc). That is where caps matter.

If you have no cap, you are gambling with disk.

Emergency mode: when storage is almost full

If you are at 98% disk, your goal is not elegance. Your goal is to keep writes alive.

Order of operations we like:

1) Add disk (fastest on managed providers). Even 20-50 GB buys you time.

2) Drop the clearly orphaned slot(s) or fix the subscriber so it catches up.

3) Force a checkpoint if you can afford it (this can spike IO):

CHECKPOINT;

4) Watch retained_wal drop and disk stabilize.

Do not run random VACUUM FULL in a panic. This outage is not table bloat.

Prevention (2026 defaults we push on clients)

1) Put a hard cap on slot WAL

Postgres 13+ lets you cap how much WAL a slot can retain with max_slot_wal_keep_size.

Example:

max_slot_wal_keep_size = '20GB'

If a consumer falls behind beyond the cap, Postgres invalidates the slot (wal_status becomes lost) and the consumer has to re-sync. That hurts the consumer, but it saves production. I will take a broken dashboard over a dead primary.
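One way to pick the cap: work backwards from how long you want a dead consumer to survive before invalidation. A back-of-envelope sketch; the 2 GB/hour write rate is a made-up number, substitute your own:

```python
# Runway calculator: hours a stalled consumer can stay down before
# max_slot_wal_keep_size invalidates its slot.

def runway_hours(cap_bytes: int, wal_bytes_per_hour: int) -> float:
    return cap_bytes / wal_bytes_per_hour

cap = 20 * 1024**3          # max_slot_wal_keep_size = '20GB'
write_rate = 2 * 1024**3    # assume ~2 GB of WAL per hour
print(runway_hours(cap, write_rate), "hours of runway")  # 10.0
```

If ten hours is not enough to page someone and fix a connector, pick a bigger cap or a bigger disk.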

2) Use idle slot timeouts for garbage slots

Newer Postgres versions support idle_replication_slot_timeout. This can automatically invalidate slots that stay inactive too long. It is not a silver bullet, but it cuts the tail risk for forgotten slots.

Do not enable this blindly on critical HA setups. Use it on analytics and experimental pipelines first.

3) Alert on retained_wal, not just disk

Disk alerts fire late. You want early alerts on slot retention.

A minimal retained WAL check:

SELECT
slot_name,
slot_type,
active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;

Alert when retained_bytes crosses something sane for your workload. For many SaaS apps, 5-10 GB on any logical slot is already a smell.
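If a monitoring agent pulls rows from that query, the alert predicate is one line. A sketch; the row tuple shape and the 10 GB threshold are assumptions:

```python
# Minimal alert predicate over (slot_name, slot_type, active, retained_bytes)
# rows, e.g. as fetched from pg_replication_slots by a cron job.

def retention_alerts(rows, threshold_bytes: int = 10 * 1024**3):
    """Return names of slots whose retained WAL crosses the threshold."""
    return [name for (name, slot_type, active, retained) in rows
            if retained is not None and retained > threshold_bytes]

rows = [
    ("ha_replica",  "physical", True,  1 * 1024**3),
    ("airbyte_old", "logical",  False, 45 * 1024**3),
]
print(retention_alerts(rows))  # ['airbyte_old']
```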

4) Heartbeats for low-traffic DBs

If you run multiple databases in one Postgres instance, one can be quiet while another is chatty. WAL is cluster-wide, but a logical slot is per database. A slot in a quiet DB may not advance and ends up pinning WAL generated by the busy DB.

Solutions:

- heartbeat table updated every minute

- pg_logical_emit_message() (Postgres 9.6+) to write small WAL messages without touching tables

5) Make slot cleanup part of offboarding

When you decommission a connector, also remove the slot. Put it in your runbook and your Terraform destroy steps.

I have seen teams cancel a tool subscription and still keep the slot. They stopped paying the vendor but kept paying in incident time later.

Cost and timeline to recover (so you can plan)

If you catch it early, this is a 30 minute fix.

If you catch it late and storage is full, it turns into a small incident:

- 1-2 hours: get writes back (increase disk, drop slot or unblock subscriber, checkpoint)

- 2-8 hours: re-seed the consumer (depends on data size, snapshot method, network)

- 1 day: add alerts, caps, and ownership tags so it does not happen again

In US-based teams, we usually see this cost $1.5k-$6k in engineering time, depending on how messy the CDC setup is and whether you need a full resync.

A tiny checklist you can paste into your ops repo

- [ ] Weekly: list pg_replication_slots, sort by retained_wal

- [ ] Alert: retained_wal > 10GB for any logical slot

- [ ] Set max_slot_wal_keep_size (pick a number you can afford)

- [ ] Document: who owns each slot and why

- [ ] When decommissioning a pipeline: drop the slot

Closing thought

Replication slots are not evil. They are just honest. They will do exactly what you asked: keep WAL until a consumer says 'got it'.

If you treat CDC as a side quest, slots will punish you. If you treat it like production infra, this problem basically disappears.
