DNS Outages #99

Open
benfrancis opened this issue May 17, 2023 · 1 comment

STR:

  • Leave registration server running and wait

Expected:

  • It keeps working

Actual:

  • Tunnelling service (and webthings.io website) suddenly drop offline and are inaccessible until the registration server is rebooted

This has been happening regularly for many months now, and requires a reboot of the registration server EC2 instances in order to fix it. We believe it is caused by PowerDNS crashing so that the registration server no longer resolves DNS lookups.

In the logs of the registration server Docker container there is an error which says "5001 questions waiting for database/backend attention. Limit is 5000, respawning". pdns then respawns, and after this has happened too many times in quick succession the process supervisor in the container (supervisord) gives up and stops restarting it. This is happening on both EC2 instances.

We think that the DNS servers are occasionally getting overwhelmed by traffic, but we don't know where it's coming from. I suspect it isn't WebThings users, because the logs contain lots of failed lookups for subdomains that don't exist.

Some potential solutions:

  1. Configuring rate limiting with something like dnsdist to set a limit on queries per second per IP address
  2. Re-configure pdns to use the gmysql back end so that pdns reads records directly from the database, rather than forwarding queries to the registration server, which then queries the database
  3. Modify the registration server by adding an option to use a hosted DNS service like Cloudflare as a back end, to take load off our EC2 instances. Downsides: (a) we would be dependent on Cloudflare, and (b) we'd have to set a minimum TTL of 60 seconds, so there would be brief outages when a gateway changes IP (but at least not for the whole domain)
  4. Same as number 3, but re-write the registration server in Node.js so that more people are able to work on it (we have an IoT gateway written in Node.js and a cloud service written in Rust and it should probably be the other way around!)

My personal preference is to start with option 1 and see if it helps. I suspect the spikes in traffic are not coming from WebThings users and if we cut off the source of the excessive traffic the service would hopefully go back to being stable again.

If anyone has experience of configuring rate limiting for pdns, I would be grateful for some help.
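
To make option 1 more concrete, here is a minimal sketch of what "a limit on queries per second per IP address" means in practice, written as a token bucket keyed by client IP. This is only an illustration of the idea, not dnsdist configuration and not code from the registration server; the class name `PerIPRateLimiter` and the `qps` and `burst` parameters are hypothetical, and the addresses are documentation-range placeholders.

```python
import time


class PerIPRateLimiter:
    """Token bucket per client IP (illustrative sketch only).

    Each IP may send `qps` queries per second on average, with short
    bursts of up to `burst` queries. All names here are hypothetical.
    """

    def __init__(self, qps, burst, clock=time.monotonic):
        self.qps = float(qps)
        self.burst = float(burst)
        self.clock = clock
        self._buckets = {}  # ip -> (tokens remaining, time of last update)

    def allow(self, ip):
        now = self.clock()
        tokens, last = self._buckets.get(ip, (self.burst, now))
        # Refill in proportion to the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.qps)
        if tokens >= 1.0:
            self._buckets[ip] = (tokens - 1.0, now)
            return True  # answer the query
        self._buckets[ip] = (tokens, now)
        return False  # drop or refuse the query


if __name__ == "__main__":
    limiter = PerIPRateLimiter(qps=5, burst=10)
    # A client sending 25 queries at once gets roughly its burst allowance.
    answered = sum(limiter.allow("192.0.2.1") for _ in range(25))
    print("answered", answered, "of 25 queries from 192.0.2.1")
    # A different client is unaffected by the first one's flood.
    print("192.0.2.2 allowed:", limiter.allow("192.0.2.2"))
```

The point of limiting per IP is that one noisy source exhausts only its own bucket, so well-behaved gateways keep resolving. This sketch never evicts idle entries, so a real deployment would need to bound the table, which is one reason to prefer an existing tool over hand-rolled code.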

benfrancis added the bug label on May 17, 2023

benfrancis commented Feb 6, 2025

Update: The RDS servers used for the database back end of pdns were recently upgraded to have double the RAM, but the registration server went down again 9 days later.

Some more complete logs...

System log of the Docker container:

{"log":"Feb 06 04:51:59 5001 questions waiting for database/backend attention. Limit is 5000, respawning\n","stream":"stdout","time":"2025-02-06T04:51:59.756787Z"}
{"log":"[2025-02-06T04:51:59Z ERROR registration_server::pdns] handle_socket_request(): JSON error: expected value at line 1 column 1\n","stream":"stdout","time":"2025-02-06T04:51:59.77352943Z"}
{"log":"[2025-02-06T04:51:59Z ERROR registration_server::pdns] read_json_from_stream(): Stream reading error: Connection reset by peer (os error 104)\n","stream":"stdout","time":"2025-02-06T04:51:59.774275716Z"}
{"log":"[2025-02-06T04:51:59Z ERROR registration_server::pdns] handle_socket_request(): JSON error: EOF while parsing a value at line 1 column 0\n","stream":"stdout","time":"2025-02-06T04:51:59.774842757Z"}
{"log":"[2025-02-06T04:51:59Z ERROR registration_server::pdns] read_json_from_stream(): Stream reading error: Connection reset by peer (os error 104)\n","stream":"stdout","time":"2025-02-06T04:51:59.77580027Z"}
{"log":"[2025-02-06T04:51:59Z ERROR registration_server::pdns] handle_socket_request(): JSON error: EOF while parsing a value at line 1 column 0\n","stream":"stdout","time":"2025-02-06T04:51:59.775817694Z"}
{"log":"[2025-02-06T04:51:59Z ERROR registration_server::pdns] read_json_from_stream(): Stream reading error: Connection reset by peer (os error 104)\n","stream":"stdout","time":"2025-02-06T04:51:59.776522784Z"}
{"log":"[2025-02-06T04:51:59Z ERROR registration_server::pdns] handle_socket_request(): JSON error: EOF while parsing a value at line 1 column 0\n","stream":"stdout","time":"2025-02-06T04:51:59.7765383Z"}
{"log":"2025-02-06 04:51:59,777 INFO exited: pdns (exit status 1; not expected)\n","stream":"stdout","time":"2025-02-06T04:51:59.777860256Z"}
{"log":"2025-02-06 04:52:00,779 INFO gave up: pdns entered FATAL state, too many start retries too quickly\n","stream":"stdout","time":"2025-02-06T04:52:00.779555264Z"}
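
For anyone who wants to dig through a longer stretch of this log, here is a rough sketch that pulls out the pdns backlog messages and counts the registration server errors by function name. It assumes the log file has one JSON object per line with `log`, `stream` and `time` keys, as in the entries above; the function name `summarize` and the regular expressions are my own and only match the message shapes seen here.

```python
import json
import re
from collections import Counter

# Patterns derived from the messages quoted in this issue (assumed stable).
BACKLOG = re.compile(
    r"(\d+) questions waiting for database/backend attention\. Limit is (\d+)"
)
RS_ERROR = re.compile(r"ERROR registration_server::pdns\] (\w+)\(\)")


def summarize(lines):
    """Return (backlog events, error counts per function) from log lines."""
    backlog = []
    errors = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not complete JSON objects
        message = entry.get("log", "")
        match = BACKLOG.search(message)
        if match:
            backlog.append(
                (entry.get("time"), int(match.group(1)), int(match.group(2)))
            )
        match = RS_ERROR.search(message)
        if match:
            errors[match.group(1)] += 1
    return backlog, errors


# Three entries copied from the log above (raw strings keep the JSON escapes).
SAMPLE = [
    r'{"log":"Feb 06 04:51:59 5001 questions waiting for database/backend attention. Limit is 5000, respawning\n","stream":"stdout","time":"2025-02-06T04:51:59.756787Z"}',
    r'{"log":"[2025-02-06T04:51:59Z ERROR registration_server::pdns] handle_socket_request(): JSON error: expected value at line 1 column 1\n","stream":"stdout","time":"2025-02-06T04:51:59.77352943Z"}',
    r'{"log":"[2025-02-06T04:51:59Z ERROR registration_server::pdns] read_json_from_stream(): Stream reading error: Connection reset by peer (os error 104)\n","stream":"stdout","time":"2025-02-06T04:51:59.774275716Z"}',
]

if __name__ == "__main__":
    events, counts = summarize(SAMPLE)
    print("backlog events:", events)
    print("errors by function:", dict(counts))
```

Run against a full log, the backlog list should show how often pdns hits its queue limit and whether those moments cluster at particular times of day, which might help narrow down where the traffic comes from.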

/var/log/supervisor/supervisord.log inside the Docker container:

2025-02-06 04:51:28,951 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:28,958 INFO spawned: 'pdns' with pid 9442
2025-02-06 04:51:29,841 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:30,845 INFO spawned: 'pdns' with pid 9456
2025-02-06 04:51:31,722 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:33,731 INFO spawned: 'pdns' with pid 9469
2025-02-06 04:51:35,037 INFO success: pdns entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-02-06 04:51:35,310 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:35,339 INFO spawned: 'pdns' with pid 9480
2025-02-06 04:51:36,270 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:37,541 INFO spawned: 'pdns' with pid 9495
2025-02-06 04:51:38,893 INFO success: pdns entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-02-06 04:51:39,832 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:39,841 INFO spawned: 'pdns' with pid 9512
2025-02-06 04:51:40,390 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:41,507 INFO spawned: 'pdns' with pid 9525
2025-02-06 04:51:42,756 INFO success: pdns entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-02-06 04:51:42,791 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:43,795 INFO spawned: 'pdns' with pid 9536
2025-02-06 04:51:44,872 INFO success: pdns entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-02-06 04:51:44,892 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:45,897 INFO spawned: 'pdns' with pid 9553
2025-02-06 04:51:46,922 INFO success: pdns entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-02-06 04:51:46,950 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:47,955 INFO spawned: 'pdns' with pid 9567
2025-02-06 04:51:48,964 INFO success: pdns entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-02-06 04:51:48,984 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:49,992 INFO spawned: 'pdns' with pid 9580
2025-02-06 04:51:50,978 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:51,983 INFO spawned: 'pdns' with pid 9596
2025-02-06 04:51:52,955 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:54,962 INFO spawned: 'pdns' with pid 9609
2025-02-06 04:51:55,858 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:58,865 INFO spawned: 'pdns' with pid 9620
2025-02-06 04:51:59,777 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:52:00,779 INFO gave up: pdns entered FATAL state, too many start retries too quickly
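
To see why supervisord eventually gives up, it helps to measure how long each pdns run lasted. The sketch below parses supervisord log lines in the format shown above and reports the time between each "spawned" and the following "exited". The function names are hypothetical, and the one-second threshold is taken from the "startsecs" value printed in the log itself.

```python
import re
from datetime import datetime

LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"(spawned|exited|gave up): '?(\w+)'?"
)


def run_durations(lines, program="pdns"):
    """Seconds between each 'spawned' and the next 'exited' for `program`."""
    durations = []
    started = None
    for line in lines:
        match = LINE.match(line.strip())
        if not match or match.group(3) != program:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if match.group(2) == "spawned":
            started = stamp
        elif match.group(2) == "exited" and started is not None:
            durations.append((stamp - started).total_seconds())
            started = None
    return durations


def trailing_short_runs(durations, startsecs=1.0):
    """Count consecutive runs at the end that died before `startsecs`."""
    count = 0
    for seconds in reversed(durations):
        if seconds >= startsecs:
            break
        count += 1
    return count


# The last twelve lines of the supervisord log quoted above.
SAMPLE_LOG = """\
2025-02-06 04:51:47,955 INFO spawned: 'pdns' with pid 9567
2025-02-06 04:51:48,964 INFO success: pdns entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-02-06 04:51:48,984 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:49,992 INFO spawned: 'pdns' with pid 9580
2025-02-06 04:51:50,978 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:51,983 INFO spawned: 'pdns' with pid 9596
2025-02-06 04:51:52,955 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:54,962 INFO spawned: 'pdns' with pid 9609
2025-02-06 04:51:55,858 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:51:58,865 INFO spawned: 'pdns' with pid 9620
2025-02-06 04:51:59,777 INFO exited: pdns (exit status 1; not expected)
2025-02-06 04:52:00,779 INFO gave up: pdns entered FATAL state, too many start retries too quickly
""".splitlines()

if __name__ == "__main__":
    runs = run_durations(SAMPLE_LOG)
    print("run durations (s):", runs)
    print("short runs before giving up:", trailing_short_runs(runs))
```

In this excerpt every run lasts about one second, and the final four all exit before the one-second mark. That pattern would be consistent with supervisord's default of three start retries, assuming our config does not override `startretries`, and it suggests pdns is being overwhelmed again almost immediately after each restart rather than failing to start at all.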

See also:
