You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
At first I want to say that this case is just a corner case and I am not trying to stir up trouble. I just report it for robustness of pg_repack.
Maybe you have found that I am always reporting issues of corner cases. The reason it that I transplanted pg_repack to Alibaba Cloud PolarDB-PG and I built a test framework for it to find out the corner cases and improve the stability of this commercial database.
NOTICE: Setting up workers.conns
INFO: repacking table "normal.t1"
NOTICE: Canceled 1 unsafe queries
ERROR: query failed: ERROR: column "foo" does not exist
LINE 1: ...,NULL::integer AS "........pg.dropped.18........",foo,col_wi...
^
DETAIL: query was: INSERT INTO repack.table_29169 SELECT id,unique1,hundred,tenthous,a,b,c,d,f_char,g_vchar,h_bool,i_enum,j_array_bigint,extra,e,NULL::integer AS "........pg.dropped.17........",NULL::integer AS "........pg.dropped.18........",foo,col_with_default FROM ONLY normal.t1
It seems that the table does not have column foo but copy_data command has the column.
The original table and new table both don't have column foo now.
ddl_full0=# \d t1
Table "normal.t1"
Column | Type | Collation | Nullable | Default
------------------+--------------------------------+-----------+----------+--------------------------------------
id | integer | | not null |
unique1 | numeric | | |
hundred | numeric | | |
tenthous | numeric | | |
a | text | | |
b | date | | |
c | timestamp(6) without time zone | | |
d | double precision | | not null |
f_char | character(256) | | |
g_vchar | character varying(256) | | |
h_bool | boolean | | |
i_enum | enum_type | | |
j_array_bigint | bigint[] | | |
extra | jsonb | | |
e | integer | | not null | nextval('t1_e_seq'::regclass)
col_with_default | integer | | | nextval('seq_for_default'::regclass)
Indexes:
"t1_pkey"PRIMARY KEY, btree (id, e)
"t1_idx_btree_a" btree (a)
"t1_idx_g_hash" hash (g_vchar)
"t1_idx_gin_j_array_bigint" gin (j_array_bigint)
"t1_idx_gist_char" gist (f_char)
ddl_full0=# \d repack.table_29169
Table "repack.table_29169"
Column | Type | Collation | Nullable | Default
-------------------------------+--------------------------------+-----------+----------+---------
id | integer | | |
unique1 | numeric | | |
hundred | numeric | | |
tenthous | numeric | | |
a | text | | |
b | date | | |
c | timestamp(6) without time zone | | |
d | double precision | | |
f_char | character(256) | | |
g_vchar | character varying(256) | | |
h_bool | boolean | | |
i_enum | enum_type | | |
j_array_bigint | bigint[] | | |
extra | jsonb | | |
e | integer | | |
........pg.dropped.17........ | integer | | |
........pg.dropped.18........ | integer | | |
........pg.dropped.19........ | integer | | |
col_with_default | integer | | |
Since pg_repack held shared lock in the txn, other session would not be able to drop the column concurrently.
2024-09-03 04:32:25.310 UTC 22026 ddl_full0 00000 0 pg_repack LOG: statement: /*pg_catalog, pg_temp, public*/ BEGIN ISOLATION LEVEL SERIALIZABLE
……
2024-09-03 04:32:25.403 UTC 22026 ddl_full0 00000 2592 pg_repack LOG: statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE normal.t1 IN ACCESS SHARE MODE
2024-09-03 04:32:25.507 UTC 22026 ddl_full0 00000 2592 pg_repack LOG: execute <unnamed>: /*pg_catalog, pg_temp, public*/ SELECT repack.create_table('29169', 'pg_default')
2024-09-03 04:32:25.509 UTC 22026 ddl_full0 00000 0 pg_repack LOG: statement: /*pg_catalog, pg_temp, public*/ ALTER TABLE repack.table_29169 ALTER hundred SET STORAGE EXTENDED
2024-09-03 04:32:25.518 UTC 22026 ddl_full0 42703 0 pg_repack LOG: statement: /*pg_catalog, pg_temp, public*/ INSERT INTO repack.table_29169 SELECT id,unique1,hundred,tenthous,a,b,c,d,f_char,g_vchar,h_bool,i_enum,j_array_bigint,extra,e,NULL::integer AS "........pg.dropped.17........",NULL::integer AS "........pg.dropped.18........",foo,col_with_default FROM ONLY normal.t1
With the audit log we can see that the column was dropped before copy_data executed:
2024-09-0304:32:12.694 UTC 20935 ddl_full0 000002489 pgbench LOG: execute <unnamed>: /*normal, public*/ ALTER TABLE IF EXISTS t1 DROP COLUMN IF EXISTS foo;
So it seems impossible that pg_repack's copy_data command had column foo in it. This confused me for a long time. But finally I got the answer by reading the code.
copy_data is defined in repack.tables view and the repack.get_columns_for_create_as() function is executed in repack_one_database(), then it saves the copy_data sql text into memory. At this moment, column foo is still in the table.
'INSERT INTO repack.table_'||R.oid||' SELECT '||repack.get_columns_for_create_as(R.oid) ||' FROM ONLY '||repack.oid2text(R.oid) AS copy_data,
table.copy_data= getstr(res, i , c++);
Before repack_one_database() calls repack_one_table(), a DDL command drops column foo. This is possible because pg_repack holds no lock in repack_one_database().
Then repack_one_database() calls repack_one_table() and executes the copy_data command generated before. Now the table definition and copy_data's table definition mismatch.
How to reproduce it:
Add command("select pg_sleep(20)", 0, NULL); in repack_one_database() after copy_data generated
create table test(a int primary key, b text);
pg_repack --dbname=postgres --table=test
Get the pid of pg_repack client and execute gdb attach pid, pg_repack sleeps for 20s and I think it's enough for us. We can see that copy_data text includes column a and b
(gdb) f 12
#12 0x0000000000405809 in repack_one_database (orderby=0x0, errbuf=0x7ffd8562f6b0 "\240\222]", errsize=256) at pg_repack.c:1007
(gdb) p table.copy_data
$1 = 0x5f2009 "INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.test"
pg_repack hangs and now we connect to db by psql, there's no lock on table test now, so we can drop column b
postgres=# select * from pg_locks where relation = 'test'::regclass;
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | m
ode | granted | fastpath | waitstart
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+------+---------+----------+-----------
(0 rows)
postgres=# alter table test drop column b;ALTERTABLE
Execute 'c' in gdb to continue, copy_data is executed and we get the error
pg_repack --dbname=postgres --table=test
INFO: repacking table "public.test"
ERROR: query failed: ERROR: column "b" does not exist
LINE 1: INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.t...
^
DETAIL: query was: INSERT INTO repack.table_41928 SELECT a,b FROM ONLY public.test
The text was updated successfully, but these errors were encountered:
zhanghien
changed the title
copy_data fail if table definition changed after copy_data string saved to memory in repack_one_database()
copy_data fail if table definition changed after copy_data SQL string saved to memory in repack_one_database()
Sep 4, 2024
Hi all.
I executed this command with pg_repack 1.5.0:
I got the error below:
It seems that the table does not have column foo but copy_data command has the column.
The original table and new table both don't have column foo now.
Since pg_repack held shared lock in the txn, other session would not be able to drop the column concurrently.
With the audit log we can see that the column was dropped before copy_data executed:
So it seems impossible that pg_repack's copy_data command had column foo in it. This confused me for a long time. But finally I got the answer by reading the code.
copy_data is defined in repack.tables view and the repack.get_columns_for_create_as() function is executed in repack_one_database(), then it saves the copy_data sql text into memory. At this moment, column foo is still in the table.
Before repack_one_database() calls repack_one_table(), a DDL command drops column foo. This is possible because pg_repack holds no lock in repack_one_database().
Then repack_one_database() calls repack_one_table() and executes the copy_data command generated before. Now the table definition and copy_data's table definition mismatch.
How to reproduce it:
command("select pg_sleep(20)", 0, NULL);
in repack_one_database() after copy_data generatedgdb attach pid
, pg_repack sleeps for 20s and I think it's enough for us. We can see that copy_data text includes column a and bThe text was updated successfully, but these errors were encountered: