Make backfill batch selection exclude rows inserted or updated after backfill start #648

Draft
andrew-farries wants to merge 5 commits into `main`
Conversation

andrew-farries (Collaborator)

Backfill only rows present at backfill start. This is a second approach to solving #583; the first one is #634.

Change the backfill algorithm to only backfill rows that were present at the start of the backfill process. Rows inserted or updated after backfill start will be backfilled by the already-installed `up` trigger and do not need to be backfilled by the backfill process (although doing so is safe from a correctness perspective).

Skipping rows that were inserted or updated after the backfill started guarantees that the backfill process terminates, even if a large number of rows are inserted or updated while the backfill is running.

The new algorithm works as follows (a sketch of the batch-selection step follows the list):

  • Create a 'batch table' in the `pgroll` schema. The batch table is used to store the primary key values of each batch of rows to be updated during the backfill process. The table holds at most `batchSize` rows at a time and is `TRUNCATE`d at the start of each batch.
  • Begin a `REPEATABLE READ` transaction and take a transaction snapshot. This transaction remains open for the duration of the backfill so that other transactions can use the snapshot.
  • For each batch:
    1. The primary key values of the rows in the batch are inserted into the batch table (`INSERT INTO`). The transaction that performs this insert uses the snapshot taken at the start of the backfill, so only rows present at the start of the backfill are visible to it.
    2. The batch of rows is updated in the table being backfilled by setting their primary keys to themselves (a no-op update). This update causes any `ON UPDATE` trigger to fire for the affected rows.
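
For illustration, here is a minimal Go sketch of the snapshot setup and the batch-selection step (step 1), using `database/sql` directly. The table and column names (`users`, `id`), the batch table name (`pgroll.batch_pks`), and the keyset pagination on the primary key are assumptions made for the sketch, not the actual implementation in this PR.

```go
package backfill

import (
	"context"
	"database/sql"
	"fmt"
)

// selectBatches opens the long-lived REPEATABLE READ transaction, exports its
// snapshot, and then fills the batch table with one batch of primary keys at a
// time. The no-op update of each batch (step 2) is sketched further below.
func selectBatches(ctx context.Context, db *sql.DB, batchSize int) error {
	// Long-lived transaction: its only job is to keep the snapshot alive.
	snapTx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		return err
	}
	defer snapTx.Rollback()

	var snapshotID string
	if err := snapTx.QueryRowContext(ctx, "SELECT pg_export_snapshot()").Scan(&snapshotID); err != nil {
		return err
	}

	lastID := 0 // highest primary key selected so far (keyset pagination)
	for {
		// Empty the batch table before selecting the next batch.
		if _, err := db.ExecContext(ctx, "TRUNCATE pgroll.batch_pks"); err != nil {
			return err
		}

		// The per-batch transaction adopts the exported snapshot, so rows
		// inserted or updated after backfill start are not visible to it.
		batchTx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
		if err != nil {
			return err
		}
		// Snapshot IDs cannot be bound as query parameters, hence Sprintf.
		if _, err := batchTx.ExecContext(ctx,
			fmt.Sprintf("SET TRANSACTION SNAPSHOT '%s'", snapshotID)); err != nil {
			batchTx.Rollback()
			return err
		}
		res, err := batchTx.ExecContext(ctx,
			`INSERT INTO pgroll.batch_pks (id)
			 SELECT id FROM users WHERE id > $1 ORDER BY id LIMIT $2`,
			lastID, batchSize)
		if err != nil {
			batchTx.Rollback()
			return err
		}
		if err := batchTx.Commit(); err != nil {
			return err
		}

		if n, _ := res.RowsAffected(); n == 0 {
			return nil // every row present at backfill start has been handled
		}
		if err := db.QueryRowContext(ctx,
			"SELECT max(id) FROM pgroll.batch_pks").Scan(&lastID); err != nil {
			return err
		}

		// Step 2: no-op update of this batch (see the sketch after the next
		// paragraph).
	}
}
```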

The 'batch table' is needed as a temporary store of the primary key values of the rows to be updated because the per-batch query that selects those rows must run in a `REPEATABLE READ` transaction (required in order to use the transaction snapshot). Updating the selected batch of rows in that same transaction would fail with serialization errors whenever a row in the batch had been updated by a transaction committed after the snapshot was taken. Such serialization errors can safely be ignored, because any rows updated after the snapshot was taken have already been backfilled by the `up` trigger. To avoid the serialization errors, the batch of rows to be updated is therefore written to the 'batch table', from where the batch can be `UPDATE`d in a `READ COMMITTED` transaction that cannot encounter serialization errors.
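
Continuing the sketch above (same package and imports), step 2 might look like the following: a no-op update run at the default `READ COMMITTED` level, joining against the batch table. Again, the table and column names are assumptions for illustration only.

```go
// updateBatch performs the no-op update for the primary keys currently held in
// the batch table. It runs at the default READ COMMITTED isolation level, so it
// cannot fail with a serialization error; it simply blocks behind any
// concurrent writers before updating each row.
func updateBatch(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil) // nil TxOptions => driver default (READ COMMITTED)
	if err != nil {
		return err
	}
	// Setting the primary key to itself changes no data, but it causes the
	// already-installed ON UPDATE trigger to run for every row in the batch.
	if _, err := tx.ExecContext(ctx,
		`UPDATE users
		    SET id = users.id
		   FROM pgroll.batch_pks b
		  WHERE users.id = b.id`); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}
```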

The largest drawback of this approach is that it requires holding a transaction open during the backfill process. Long-running transactions can cause bloat in the database by preventing vacuuming of dead rows.

`RawConn` returns the underlying `*sql.DB` for the connection.
Change the method signature to accept `sql.TxOptions` as the second
argument. This allows transactions to run at isolation levels other than
the default `READ COMMITTED` level.
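
As an illustration of what such a signature change enables, a transaction helper along these lines lets callers choose the isolation level per transaction. The type and method names here are hypothetical stand-ins for the sketch, not pgroll's actual API.

```go
package db

import (
	"context"
	"database/sql"
)

// DB wraps a *sql.DB; a hypothetical stand-in for the connection type whose
// transaction method gained an sql.TxOptions parameter.
type DB struct {
	conn *sql.DB
}

// WithTransaction runs fn inside a transaction opened with the supplied
// options (sql.TxOptions as the second argument). A nil value keeps the
// driver default, which is READ COMMITTED on Postgres.
func (d *DB) WithTransaction(ctx context.Context, opts *sql.TxOptions, fn func(*sql.Tx) error) error {
	tx, err := d.conn.BeginTx(ctx, opts)
	if err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}
```

A caller that needs the snapshot-pinned selection can then pass `&sql.TxOptions{Isolation: sql.LevelRepeatableRead}`, while other callers pass `nil` to stay on the default isolation level.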
Allow clients to define the schema where `pgroll` stores its internal
state.
Add tests for:
* Generating SQL to create a batch table
* Generating SQL to select a batch into a batch table
* Generating SQL to update a batch
Change the backfill algorithm to only backfill rows that were present at the start of the backfill process.
andrew-farries added the help wanted label on Feb 3, 2025