Repmgrd does not recognize primary failure #585

mnietz · 2019-08-28T06:03:56Z

Hi,

we're using Repmgr 4.4 with PostgreSQL 11.5 on Debian 9.9.

The setup consists of three nodes, two Databases nodes and one Witness Server. We've had a outage of the primary (db02) that was not recognized by the standby (db01) but the witness node and therefore no failover happens.

In the postgres logs from the standby we can see, that the replication connection is broken:

2019-08-28 02:00:21.070 CEST [14862] DETAIL:  Last completed transaction was at log time 2019-08-28 01:56:21.004754+02.
2019-08-28 02:35:27.640 CEST [14919] FATAL:  terminating walreceiver due to timeout
2019-08-28 02:35:27.641 CEST [14861] LOG:  invalid resource manager ID 255 at 45D/AC075CA8
2019-08-28 02:37:38.483 CEST [28274] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:39:49.555 CEST [28586] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:42:00.628 CEST [28910] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:44:11.699 CEST [29200] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:46:22.771 CEST [29501] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:48:33.843 CEST [29820] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
                Is the server running on host "db02" (x.x.x.x) and accepting
                TCP/IP connections on port 5432?
2019-08-28 02:49:39.392 CEST [30089] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
        FATAL:  the database system is starting up
2019-08-28 02:49:39.410 CEST [30297] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
        FATAL:  the database system is starting up
2019-08-28 02:49:44.423 CEST [30300] LOG:  started streaming WAL from primary at 45D/AC000000 on timeline 4
2019-08-28 02:49:53.797 CEST [14862] LOG:  restartpoint starting: time

Also the witness node logs the failure to repmgrd.log:

[2019-08-28 02:30:07] [INFO] witness node "pgmon" (ID: 3) monitoring primary node "db02" (ID: 2) in normal state
[2019-08-28 02:34:30] [WARNING] unable to ping "host=db02 user=repmgr dbname=repmgr connect_timeout=2"
[2019-08-28 02:34:30] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:34:30] [INFO] checking state of node 2, 1 of 6 attempts
[2019-08-28 02:34:32] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:34:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:34:32] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:34:42] [INFO] checking state of node 2, 2 of 6 attempts
[2019-08-28 02:34:44] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:34:44] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:34:44] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:34:54] [INFO] checking state of node 2, 3 of 6 attempts
[2019-08-28 02:34:56] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:34:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:34:56] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:35:06] [INFO] checking state of node 2, 4 of 6 attempts
[2019-08-28 02:35:08] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:35:08] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:35:08] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:35:18] [INFO] checking state of node 2, 5 of 6 attempts
[2019-08-28 02:35:20] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:35:20] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:35:20] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-08-28 02:35:30] [INFO] checking state of node 2, 6 of 6 attempts
[2019-08-28 02:35:32] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=db02 fallback_application_name=repmgr"
[2019-08-28 02:35:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:35:32] [WARNING] unable to reconnect to node 2 after 6 attempts
[2019-08-28 02:36:32] [WARNING] no notification received from new primary after 60 seconds
[2019-08-28 02:36:34] [WARNING] unable to ping "host=db02 user=repmgr dbname=repmgr connect_timeout=2"
[2019-08-28 02:36:34] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-08-28 02:50:12] [ERROR] unable to determine if server is in recovery
[2019-08-28 02:50:12] [DETAIL] 
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[2019-08-28 02:50:12] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2019-08-28 02:50:12] [NOTICE] current upstream node "db02" (ID: 2) is not primary, restarting monitoring
[2019-08-28 02:50:12] [INFO] searching for primary node
[2019-08-28 02:50:12] [INFO] checking if node 2 is primary
[2019-08-28 02:50:12] [INFO] current primary node is 2
[2019-08-28 02:50:12] [INFO] witness monitoring connection to primary node "db02" (ID: 2)
[2019-08-28 02:55:14] [INFO] witness node "pgmon" (ID: 3) monitoring primary node "db02" (ID: 2) in normal state

The System was unavailable for about 15 minutes, but no event was logged to repmgrd.log of the standby and no action was taken to promote db01 to primary.

repmgrd.log (standby)

[2019-08-28 01:52:15] [DETAIL] last monitoring statistics update was 2 seconds ago
[2019-08-28 03:56:33] [ERROR] unable to determine if server is in recovery
[2019-08-28 03:56:33] [DETAIL] 
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[2019-08-28 03:56:33] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2019-08-28 03:56:35] [INFO] node "db01" (ID: 1) monitoring upstream node "db02" (ID: 2) in normal state
[2019-08-28 03:56:35] [WARNING] monitoring_history requested but primary connection not available

The repmgrd process was running on the standby but the monitoring_history table was not updated. The last history-event was recorded at 01:56, that is exactly the time, when the problem started on the primary with a cpu-usage of 100% (iowait).

After i restarted repmgrd service on the standby everything runs fine. For me it looks like a 'half-dead' repmgrd.

Please let me know, if you need additional information about the configuration or outage.

Best regards
Marco

The text was updated successfully, but these errors were encountered:

ibarwick self-assigned this Aug 29, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repmgrd does not recognize primary failure #585

Repmgrd does not recognize primary failure #585

mnietz commented Aug 28, 2019

Repmgrd does not recognize primary failure #585

Repmgrd does not recognize primary failure #585

Comments

mnietz commented Aug 28, 2019