Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide option to specify the charset or collation of text columns in the registry tables #763

Open
datafoo opened this issue Jun 14, 2023 · 5 comments

Comments

@datafoo
Copy link

datafoo commented Jun 14, 2023

PostgreSQL indicates:

The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.

Could Sqitch create its tables using the C collation rather than the default collation?

@datafoo
Copy link
Author

datafoo commented Jun 14, 2023

That would remove the need from re-indexing due to collation version mismatch (see https://www.postgresql.org/docs/current/sql-altercollation.html#SQL-ALTERCOLLATION-NOTES) since the C collation does not change.

@theory
Copy link
Collaborator

theory commented Jun 19, 2023

Yeah, probably a good idea, although even the C and POSIX collations can change when you upgrade glibc. There was a ton of discussion of the issue at PGCon this year, including this session. Maybe C locale would be the least bad, tho, considering that Sqitch mainly relies on indexes for uniqueness rather than sorting (it mainly sorts by dates).

@datafoo
Copy link
Author

datafoo commented Jun 20, 2023

even the C and POSIX collations can change when you upgrade glibc

I was under the impression that C collation doesn't ever change after reading a comment from @laurenz on a blog article he wrote at https://www.cybertec-postgresql.com/en/icu-collations-against-postgresql-data-corruption/#comment-6208123551

[the C collation] doesn't ever change. it is just the order defined by memcmp() of the strings, which only depends on the encoding.

Look here:

The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values.

@theory
Copy link
Collaborator

theory commented Jun 20, 2023

Oh, perhaps you're right, I forgot about that. The examples from the presentation use en-US.UTF-8.

theory added a commit that referenced this issue Jun 25, 2023
Applies also to Yugabyte, and separately add for Cockroach. Using the
POSIX collation ensures that index ordering never changes when the
database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
@theory theory self-assigned this Jun 25, 2023
@theory theory added engine patched Fixed in a branch but not yet merged. labels Jun 25, 2023
theory added a commit that referenced this issue Jun 25, 2023
Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher.

Using the POSIX collation ensures that index ordering never changes when
the database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
theory added a commit that referenced this issue Jul 1, 2023
Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher.

Using the POSIX collation ensures that index ordering never changes when
the database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
theory added a commit that referenced this issue Jul 29, 2023
Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher.

Using the POSIX collation ensures that index ordering never changes when
the database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
theory added a commit that referenced this issue Sep 2, 2023
Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher.

Using the POSIX collation ensures that index ordering never changes when
the database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
theory added a commit that referenced this issue Feb 4, 2024
Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher.

Using the POSIX collation ensures that index ordering never changes when
the database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
theory added a commit that referenced this issue Feb 5, 2024
Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher.

Using the POSIX collation ensures that index ordering never changes when
the database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
theory added a commit that referenced this issue Jan 4, 2025
Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher.

Using the POSIX collation ensures that index ordering never changes when
the database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
theory added a commit that referenced this issue Jan 4, 2025
Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher.

Using the POSIX collation ensures that index ordering never changes when
the database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
theory added a commit that referenced this issue Jan 4, 2025
Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher.

Using the POSIX collation ensures that index ordering never changes when
the database is upgraded, which is a particular problem with glibc
collations, but since POSIX/C collation is strictly byte-ordered, it
should be fine. Of course, any of use of `ORDER BY` on such columns will
return unexpected results when users are used to other locales, but
since Sqitch itself only ever orders by timestamp, it should not be an
issue in its own use.

Closes #763.
@theory
Copy link
Collaborator

theory commented Jan 5, 2025

I'm gonna drop this issue for now. There just hasn't been the demand, and as my notes in #765 and #857 show, it's a non-trivial problem. If, however, one would like to enforce C/POSIX collation, here's a script to do it:

BEGIN;

ALTER TABLE :"registry".releases
      ALTER COLUMN installer_name  SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN installer_email SET DATA TYPE TEXT COLLATE :"collation"
;

ALTER TABLE :"registry".projects
      ALTER COLUMN project       SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN uri           SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN creator_name  SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN creator_email SET DATA TYPE TEXT COLLATE :"collation"
;

ALTER TABLE :"registry".changes
      ALTER COLUMN change_id       SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN script_hash     SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN project         SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN note            SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN committer_name  SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN committer_email SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN planner_name    SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN planner_email   SET DATA TYPE TEXT COLLATE :"collation"
;

ALTER TABLE :"registry".tags
      ALTER COLUMN tag_id          SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN tag             SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN project         SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN change_id       SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN note            SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN committer_name  SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN committer_email SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN planner_name    SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN planner_email   SET DATA TYPE TEXT COLLATE :"collation"
;

ALTER TABLE :"registry".dependencies
      ALTER COLUMN change_id     SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN type          SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN dependency    SET DATA TYPE TEXT COLLATE :"collation",
      ALTER COLUMN dependency_id SET DATA TYPE TEXT COLLATE :"collation"
;

ALTER TABLE :"registry".events
      ALTER COLUMN change_id       SET DATA TYPE TEXT   COLLATE :"collation",
      ALTER COLUMN change          SET DATA TYPE TEXT   COLLATE :"collation",
      ALTER COLUMN project         SET DATA TYPE TEXT   COLLATE :"collation",
      ALTER COLUMN note            SET DATA TYPE TEXT   COLLATE :"collation",
      ALTER COLUMN requires        SET DATA TYPE TEXT[] COLLATE :"collation",
      ALTER COLUMN conflicts       SET DATA TYPE TEXT[] COLLATE :"collation",
      ALTER COLUMN tags            SET DATA TYPE TEXT[] COLLATE :"collation",
      ALTER COLUMN committer_name  SET DATA TYPE TEXT   COLLATE :"collation",
      ALTER COLUMN committer_email SET DATA TYPE TEXT   COLLATE :"collation",
      ALTER COLUMN planner_name    SET DATA TYPE TEXT   COLLATE :"collation",
      ALTER COLUMN planner_email   SET DATA TYPE TEXT   COLLATE :"collation"
;

COMMIT;

Use psql -set registry=sqitch --set collation=POSIX to convert all text columns to POSIX collation in the sqitch schema.

@theory theory changed the title Could sqitch tables use C collation rather than the default collation? Provide option to specify the charset or collation of text columns in the registry tables Jan 5, 2025
@theory theory added todo feature and removed patched Fixed in a branch but not yet merged. labels Jan 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants