Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repmgrd does not recognize primary failure #585

Open
mnietz opened this issue Aug 28, 2019 · 0 comments
Open

Repmgrd does not recognize primary failure #585

mnietz opened this issue Aug 28, 2019 · 0 comments
Assignees

Comments

@mnietz
Copy link

mnietz commented Aug 28, 2019

Hi,

we're using Repmgr 4.4 with PostgreSQL 11.5 on Debian 9.9.

The setup consists of three nodes, two Databases nodes and one Witness Server. We've had a outage of the primary (db02) that was not recognized by the standby (db01) but the witness node and therefore no failover happens.

In the postgres logs from the standby we can see, that the replication connection is broken:

2019-08-28 02:00:21.070 CEST [14862] DETAIL:  Last completed transaction was at log time 2019-08-28 01:56:21.004754+02.
2019-08-28 02:35:27.640 CEST [14919] FATAL:  terminating walreceiver due to timeout
2019-08-28 02:35:27.641 CEST [14861] LOG:  invalid resource manager ID 255 at 45D/AC075CA8
2019-08-28 02:37:38.483 CEST [28274] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:39:49.555 CEST [28586] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:42:00.628 CEST [28910] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:44:11.699 CEST [29200] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:46:22.771 CEST [29501] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:48:33.843 CEST [29820] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:49:39.392 CEST [30089] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
        FATAL:  the database system is starting up
2019-08-28 02:49:39.410 CEST [30297] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
        FATAL:  the database system is starting up
2019-08-28 02:49:44.423 CEST [30300] LOG:  started streaming WAL from primary at 45D/AC000000 on timeline 4
2019-08-28 02:49:53.797 CEST [14862] LOG:  restartpoint starting: time

Also the witness node logs the failure to repmgrd.log:

[2019-08-28 02:30:07] [INFO] witness node "pgmon" (ID: 3) monitoring primary node "db02" (ID: 2) in normal state
[2019-08-28 02:34:30] [WARNING] unable to ping "host=db02 user=repmgr dbname=repmgr connect_timeout=2"
[2019-08-28 02:34:30] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:34:30] [INFO] checking state of node 2, 1 of 6 attempts
[2019-08-28 02:34:32] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:34:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:34:32] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:34:42] [INFO] checking state of node 2, 2 of 6 attempts
[2019-08-28 02:34:44] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:34:44] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:34:44] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:34:54] [INFO] checking state of node 2, 3 of 6 attempts
[2019-08-28 02:34:56] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:34:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:34:56] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:35:06] [INFO] checking state of node 2, 4 of 6 attempts
[2019-08-28 02:35:08] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:35:08] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:35:08] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:35:18] [INFO] checking state of node 2, 5 of 6 attempts
[2019-08-28 02:35:20] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:35:20] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:35:20] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:35:30] [INFO] checking state of node 2, 6 of 6 attempts
[2019-08-28 02:35:32] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:35:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:35:32] [WARNING] unable to reconnect to node 2 after 6 attempts
[2019-08-28 02:36:32] [WARNING] no notification received from new primary after 60 seconds
[2019-08-28 02:36:34] [WARNING] unable to ping "host=db02 user=repmgr dbname=repmgr connect_timeout=2"
[2019-08-28 02:36:34] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:50:12] [ERROR] unable to determine if server is in recovery
[2019-08-28 02:50:12] [DETAIL] 
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[2019-08-28 02:50:12] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2019-08-28 02:50:12] [NOTICE] current upstream node "db02" (ID: 2) is not primary, restarting monitoring
[2019-08-28 02:50:12] [INFO] searching for primary node
[2019-08-28 02:50:12] [INFO] checking if node 2 is primary
[2019-08-28 02:50:12] [INFO] current primary node is 2
[2019-08-28 02:50:12] [INFO] witness monitoring connection to primary node "db02" (ID: 2)
[2019-08-28 02:55:14] [INFO] witness node "pgmon" (ID: 3) monitoring primary node "db02" (ID: 2) in normal state

The System was unavailable for about 15 minutes, but no event was logged to repmgrd.log of the standby and no action was taken to promote db01 to primary.

repmgrd.log (standby)

[2019-08-28 01:52:15] [DETAIL] last monitoring statistics update was 2 seconds ago
[2019-08-28 03:56:33] [ERROR] unable to determine if server is in recovery
[2019-08-28 03:56:33] [DETAIL] 
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[2019-08-28 03:56:33] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2019-08-28 03:56:35] [INFO] node "db01" (ID: 1) monitoring upstream node "db02" (ID: 2) in normal state
[2019-08-28 03:56:35] [WARNING] monitoring_history requested but primary connection not available

The repmgrd process was running on the standby but the monitoring_history table was not updated. The last history-event was recorded at 01:56, that is exactly the time, when the problem started on the primary with a cpu-usage of 100% (iowait).

After i restarted repmgrd service on the standby everything runs fine. For me it looks like a 'half-dead' repmgrd.

Please let me know, if you need additional information about the configuration or outage.

Best regards
Marco

@ibarwick ibarwick self-assigned this Aug 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants