Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PostgreSQL backup verification is failing in v2.8.1 and v2.9.0 #132

Closed
WalterMoar opened this issue Aug 14, 2024 · 3 comments · Fixed by #139
Closed

PostgreSQL backup verification is failing in v2.8.1 and v2.9.0 #132

WalterMoar opened this issue Aug 14, 2024 · 3 comments · Fixed by #139
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@WalterMoar
Copy link
Contributor

When upgrading to v2.9.0 our PostgreSQL 13.13 backup verifications are failing with the error:

ERROR: role "postgres" already exists

When looking into the problem it was found that everything works fine with v2.8.0, but v2.8.1 and v2.9.0 fail in this same way.

The backups up to and including v2.8.0 have the PostgreSQL database dump at the top of the backup file, followed by the PostgreSQL database cluster dump that is:

--
-- PostgreSQL database cluster dump
--

SET default_transaction_read_only = off;

SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;

--
-- Roles
--

CREATE ROLE postgres;
ALTER ROLE postgres WITH SUPERUSER INHERIT CREATEROLE CREATEDB LOGIN REPLICATION BYPASSRLS;
CREATE ROLE "userRLX";
ALTER ROLE "userRLX" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN NOREPLICATION NOBYPASSRLS;

--
-- PostgreSQL database cluster dump complete
--

However, in v2.8.1 and v2.9.0, the order of these two sections is reversed, with the PostgreSQL database cluster dump containing the CREATE ROLE postgres coming before the PostgreSQL database dump. It looks like the order was switched in commit eda584f.

The logs for the failure look like:

Restoring '/backups/daily/2024-08-14/chefs-redash-postgresql-redash_2024-08-14_12-16-23.sql.gz' to '127.0.0.1/redash' ...

DROP DATABASE "redash";
DROP DATABASE
CREATE DATABASE "redash";
CREATE DATABASE
DROP SCHEMA IF EXISTS metric_helpers CASCADE;
NOTICE: schema "metric_helpers" does not exist, skipping
DROP SCHEMA
DROP SCHEMA IF EXISTS user_management CASCADE;
NOTICE: schema "user_management" does not exist, skipping
DROP SCHEMA
GRANT ALL ON DATABASE "redash" TO "userXXX";
GRANT
SET
SET
SET
ERROR: role "postgres" already exists
Restore failed.

To complicate things, the same error does occur in v2.8.0, even though the verification is marked as a success:

Restoring '/backups/daily/2024-08-14/chefs-redash-postgresql-redash_2024-08-14_12-01-16.sql.gz' to '127.0.0.1/redash' ...
DROP DATABASE "redash";
DROP DATABASE
CREATE DATABASE "redash";
CREATE DATABASE
GRANT ALL ON DATABASE "redash" TO "userRLX";
GRANT
backup.sql
roles.sql
SET
SET
SET
ERROR: role "postgres" already exists
SET
SET
[...carries on without failing...]

So if it was continuing with the verification even if there were errors (which is bad) is it possible that commit 7e46504 didn't get included in the build until v2.8.1?

@esune
Copy link
Member

esune commented Jan 9, 2025

I had to dig into this issue as verification is failing when testing my latest changes from #139
It looks like the ability to ignore errors was introduced with 2.7.0.

I started with testing a known working backup created using 2.6.0: verification succeeds with 2.6.0 as well as with the latest version.
I then created a new backup with the latest version and tried verifying it: backups is executed correctly, but verification fails with the above error.
I also tried creating a backup using 2.7.0 and verifying it, and the same error appears.

Comparing the sql files from each backup shows that the difference consists in the newer (post 2.7.0) version including the roles backup section, while the old version does not have that. Removing the changes introduced in 2.7.0 by #115 seems to resolve the problem and verification succeeds.

@bolyachevets I see you introduced the change likely to fix issues with the pager duty exporter process and related custom roles. Do you have any insight on what exactly is different in your use-case when compared with more "plain" backup-container usage? It looks like while things may work in that scenario (which I can't test myself), they will consistently break for standard usage.

@esune
Copy link
Member

esune commented Jan 9, 2025

I think the way we start the local server (see here) is creating the users that are in the role dump and possibly causing the issues. I will test, if that is the case we might need to have a breaking change to only support roles, at least during verification, from the dump and not specify custom users during server startup - in this case older backup files without roles will likely not be verifiable anymore.

@WadeBarnes
Copy link
Member

WadeBarnes commented Jan 10, 2025

@esune, I think you're headed in the right direction with identifying the issue. The code you pointed out uses the built in initialization scripts in the container to start the postgres server locally (mimicking how an new database instance is initialized in OpenShift). That includes creation of the user and admin accounts and credentials based on the environemnt variable settings. So if the creation of any of those accounts is duplicated in the dump there are going to be errors.

This issue seems to affect the restore process, which is the first step in the verification process. We need to ensure the restore process continues to work without error, otherwise we loose the whole purpose of the backup-container.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants