Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

copy_data fail if table definition changed after copy_data SQL string saved to memory in repack_one_database() #421

Open
zhanghien opened this issue Sep 4, 2024 · 0 comments
Milestone

Comments

@zhanghien
Copy link

Hi all.

At first I want to say that this case is just a corner case and I am not trying to stir up trouble. I just report it for robustness of pg_repack.
Maybe you have found that I am always reporting issues of corner cases. The reason it that I transplanted pg_repack to Alibaba Cloud PolarDB-PG and I built a test framework for it to find out the corner cases and improve the stability of this commercial database.

I executed this command with pg_repack 1.5.0:

pg_repack -Utest_user --dbname=ddl_full0 -k --table=normal.t1 --jobs=3

I got the error below:

NOTICE: Setting up workers.conns
INFO: repacking table "normal.t1"
NOTICE: Canceled 1 unsafe queries
ERROR: query failed: ERROR:  column "foo" does not exist
LINE 1: ...,NULL::integer AS "........pg.dropped.18........",foo,col_wi...
                                                                                             ^
DETAIL: query was: INSERT INTO repack.table_29169 SELECT id,unique1,hundred,tenthous,a,b,c,d,f_char,g_vchar,h_bool,i_enum,j_array_bigint,extra,e,NULL::integer AS "........pg.dropped.17........",NULL::integer AS "........pg.dropped.18........",foo,col_with_default FROM ONLY normal.t1

It seems that the table does not have column foo but copy_data command has the column.

The original table and new table both don't have column foo now.

ddl_full0=# \d t1
                                                Table "normal.t1"
      Column      |              Type              | Collation | Nullable |               Default
------------------+--------------------------------+-----------+----------+--------------------------------------
 id               | integer                        |           | not null |
 unique1          | numeric                        |           |          |
 hundred          | numeric                        |           |          |
 tenthous         | numeric                        |           |          |
 a                | text                           |           |          |
 b                | date                           |           |          |
 c                | timestamp(6) without time zone |           |          |
 d                | double precision               |           | not null |
 f_char           | character(256)                 |           |          |
 g_vchar          | character varying(256)         |           |          |
 h_bool           | boolean                        |           |          |
 i_enum           | enum_type                      |           |          |
 j_array_bigint   | bigint[]                       |           |          |
 extra            | jsonb                          |           |          |
 e                | integer                        |           | not null | nextval('t1_e_seq'::regclass)
 col_with_default | integer                        |           |          | nextval('seq_for_default'::regclass)
Indexes:
    "t1_pkey" PRIMARY KEY, btree (id, e)
    "t1_idx_btree_a" btree (a)
    "t1_idx_g_hash" hash (g_vchar)
    "t1_idx_gin_j_array_bigint" gin (j_array_bigint)
    "t1_idx_gist_char" gist (f_char)

ddl_full0=# \d repack.table_29169
                                   Table "repack.table_29169"
            Column             |              Type              | Collation | Nullable | Default
-------------------------------+--------------------------------+-----------+----------+---------
 id                            | integer                        |           |          |
 unique1                       | numeric                        |           |          |
 hundred                       | numeric                        |           |          |
 tenthous                      | numeric                        |           |          |
 a                             | text                           |           |          |
 b                             | date                           |           |          |
 c                             | timestamp(6) without time zone |           |          |
 d                             | double precision               |           |          |
 f_char                        | character(256)                 |           |          |
 g_vchar                       | character varying(256)         |           |          |
 h_bool                        | boolean                        |           |          |
 i_enum                        | enum_type                      |           |          |
 j_array_bigint                | bigint[]                       |           |          |
 extra                         | jsonb                          |           |          |
 e                             | integer                        |           |          |
 ........pg.dropped.17........ | integer                        |           |          |
 ........pg.dropped.18........ | integer                        |           |          |
 ........pg.dropped.19........ | integer                        |           |          |
 col_with_default              | integer                        |           |          |

Since pg_repack held shared lock in the txn, other session would not be able to drop the column concurrently.

2024-09-03 04:32:25.310 UTC 22026 ddl_full0  00000 0 pg_repack LOG:  statement: /*pg_catalog, pg_temp, public*/ BEGIN ISOLATION LEVEL SERIALIZABLE
……
2024-09-03 04:32:25.403 UTC 22026 ddl_full0  00000 2592 pg_repack LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE normal.t1 IN ACCESS SHARE MODE
2024-09-03 04:32:25.507 UTC 22026 ddl_full0  00000 2592 pg_repack LOG:  execute <unnamed>: /*pg_catalog, pg_temp, public*/ SELECT repack.create_table('29169', 'pg_default')
2024-09-03 04:32:25.509 UTC 22026 ddl_full0  00000 0 pg_repack LOG:  statement: /*pg_catalog, pg_temp, public*/ ALTER TABLE repack.table_29169 ALTER hundred SET STORAGE EXTENDED
2024-09-03 04:32:25.518 UTC 22026 ddl_full0  42703 0 pg_repack LOG:  statement: /*pg_catalog, pg_temp, public*/ INSERT INTO repack.table_29169 SELECT id,unique1,hundred,tenthous,a,b,c,d,f_char,g_vchar,h_bool,i_enum,j_array_bigint,extra,e,NULL::integer AS "........pg.dropped.17........",NULL::integer AS "........pg.dropped.18........",foo,col_with_default FROM ONLY normal.t1

With the audit log we can see that the column was dropped before copy_data executed:

2024-09-03 04:32:12.694 UTC 20935 ddl_full0  00000 2489 pgbench LOG:  execute <unnamed>: /*normal, public*/ ALTER TABLE IF EXISTS t1 DROP COLUMN IF EXISTS foo;

So it seems impossible that pg_repack's copy_data command had column foo in it. This confused me for a long time. But finally I got the answer by reading the code.

copy_data is defined in repack.tables view and the repack.get_columns_for_create_as() function is executed in repack_one_database(), then it saves the copy_data sql text into memory. At this moment, column foo is still in the table.

'INSERT INTO repack.table_' || R.oid || ' SELECT ' || repack.get_columns_for_create_as(R.oid) || ' FROM ONLY ' || repack.oid2text(R.oid) AS copy_data,
table.copy_data = getstr(res, i , c++);

Before repack_one_database() calls repack_one_table(), a DDL command drops column foo. This is possible because pg_repack holds no lock in repack_one_database().
Then repack_one_database() calls repack_one_table() and executes the copy_data command generated before. Now the table definition and copy_data's table definition mismatch.

How to reproduce it:

  1. Add command("select pg_sleep(20)", 0, NULL); in repack_one_database() after copy_data generated
  2. create table test(a int primary key, b text);
  3. pg_repack --dbname=postgres --table=test
  4. Get the pid of pg_repack client and execute gdb attach pid, pg_repack sleeps for 20s and I think it's enough for us. We can see that copy_data text includes column a and b
(gdb) f    12
#12 0x0000000000405809 in repack_one_database (orderby=0x0, errbuf=0x7ffd8562f6b0 "\240\222]", errsize=256) at pg_repack.c:1007
(gdb) p    table.copy_data
$1 = 0x5f2009 "INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.test"
  1. pg_repack hangs and now we connect to db by psql, there's no lock on table test now, so we can drop column b
postgres=# select * from pg_locks where relation = 'test'::regclass;
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | m
ode | granted | fastpath | waitstart
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+--
----+---------+----------+-----------
(0 rows)
postgres=# alter table test drop column b;
ALTER TABLE
  1. Execute 'c' in gdb to continue, copy_data is executed and we get the error
pg_repack --dbname=postgres --table=test
INFO: repacking table "public.test"
ERROR: query failed: ERROR:  column "b" does not exist
LINE 1: INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.t...
                                               ^
DETAIL: query was: INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.test
@zhanghien zhanghien changed the title copy_data fail if table definition changed after copy_data string saved to memory in repack_one_database() copy_data fail if table definition changed after copy_data SQL string saved to memory in repack_one_database() Sep 4, 2024
@za-arthur za-arthur added the bug label Sep 4, 2024
@za-arthur za-arthur added this to the 1.6.0 milestone Sep 8, 2024
@za-arthur za-arthur added enhancement and removed bug labels Sep 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants