A busy torrust tracker, works really well #908
Hi @lestercrest99, I'm glad to have user feedback, especially good feedback :-). We are also running a demo of the Tracker and Index on a small virtual machine to ensure the admin experience is as good as possible.

Regarding your problem with the timeout, I think you are probably getting a timeout because of this line: https://github.com/torrust/torrust-tracker/blob/develop/src/servers/apis/routes.rs#L31

We have set a hardcoded value (5 seconds) for any request to the API. If that's OK for your case, we could extract that value into a configuration option so you can increase it. Maybe you prefer to receive a response after 15 seconds rather than the timeout (see the configuration sketch at the end of this comment).

Anyway, we should find out why the API does not have time to process the request. I guess it's because of the server's high load. I would guess the problem is that the trackers (UDP and HTTP) compete to acquire the lock to update the statistics object in memory. There can be only one writer or many readers for the statistics. Whenever the server receives a connect/announce/scrape request, the thread handling the request must lock the object for writing. I think that contention could be what starves the API's read.

### How to confirm the problem

We could enable debug logging to confirm it. If that's not feasible for you, I can do it in our demo instance. It's a small server, and we could have the same problem. In fact, it's restarted more or less once a week because it runs out of memory: it starts going really slow, and the Docker container healthcheck rule makes the container restart.

### How to solve the problem

#### Proposal 1: add a cache for stats

The API could load the stats not from the core source in the tracker but from a copy in the API context. We could copy the stats to the cache every X seconds (an interval with a config option) and make sure the thread that makes the copy can acquire the lock. A sketch of this idea follows below.

Pros:

Cons:
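Here is a minimal sketch of what such a cache could look like, assuming tokio's `RwLock`; the `TrackerMetrics` and `StatsCache` names and fields are made up for illustration:

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::RwLock;

/// Hypothetical aggregate counters kept by the core tracker.
#[derive(Clone, Default)]
struct TrackerMetrics {
    udp_connections_handled: u64,
    udp_announces_handled: u64,
    http_announces_handled: u64,
}

/// Cache owned by the API context. API handlers read this copy and
/// never touch the core tracker's lock.
struct StatsCache {
    snapshot: RwLock<TrackerMetrics>,
}

/// Background task: on every tick it holds the core tracker's read
/// lock just long enough to clone the metrics, then updates the
/// API-side copy. `interval` would come from the new config option.
async fn refresh_stats_cache(
    core_metrics: Arc<RwLock<TrackerMetrics>>,
    cache: Arc<StatsCache>,
    interval: Duration,
) {
    let mut ticker = tokio::time::interval(interval);
    loop {
        ticker.tick().await;
        let copy = {
            let metrics = core_metrics.read().await;
            (*metrics).clone()
        };
        *cache.snapshot.write().await = copy;
    }
}
```

With this shape, the worst case for `/api/v1/stats` is serving data that is up to `interval` seconds old, but the endpoint never competes with the announce/scrape hot path for the core lock.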
Regarding this solution, I've been thinking that maybe we should isolate the API data from the core data. I mean, the API service (API context) should collect the data from the core tracker (core context) and make a copy of whatever the API needs. For example, API endpoints returning torrents' info currently get the data directly from the Torrent Repository. In that case, the API must also compete with the core tracker: intensive API use can affect the tracker's performance, and vice versa. Maybe this solution does not make sense under high-load pressure, because you introduce a new task: copying the data. Maybe the only solution is just to scale up the server. This problem could become even more complex in the future if we add more stats, like the ones we are discussing here.

#### Proposal 2: introduce events

The core tracker does not maintain stats. It only emits events. Events can be consumed externally, for example, by the API. The API could build a projection with the aggregate stats. A sketch follows below.
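A rough sketch of the event-based approach, assuming a tokio mpsc channel; the `TrackerEvent` and `StatsProjection` types are hypothetical:

```rust
use tokio::sync::mpsc;

/// Hypothetical events the core tracker would emit instead of
/// updating shared counters in place.
#[derive(Debug)]
enum TrackerEvent {
    UdpConnect,
    UdpAnnounce,
    HttpAnnounce,
    Scrape,
}

/// Aggregate stats built by the API as a projection of the event stream.
#[derive(Debug, Default)]
struct StatsProjection {
    connects: u64,
    announces: u64,
    scrapes: u64,
}

/// Consumer task owned by the API context. The core tracker's hot path
/// only sends an event (likely with `try_send`, so a full channel never
/// blocks request handling); all aggregation happens here, and no lock
/// is shared between the trackers and the API.
async fn run_stats_projection(mut rx: mpsc::Receiver<TrackerEvent>) {
    let mut stats = StatsProjection::default();
    while let Some(event) = rx.recv().await {
        match event {
            TrackerEvent::UdpConnect => stats.connects += 1,
            TrackerEvent::UdpAnnounce | TrackerEvent::HttpAnnounce => {
                stats.announces += 1;
            }
            TrackerEvent::Scrape => stats.scrapes += 1,
        }
    }
}
```

The API handler would still need a way to read the projection, for example by publishing it periodically on a `tokio::sync::watch` channel.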
It would be helpful to know whether losing the aggregate data at a given moment is a problem for you. I mean, maybe you absolutely need the stats every 15 minutes, and if the endpoint keeps timing out for 30 minutes, there is no way to fill in the data for the missing interval. I suppose that's not your problem; I don't think tracker stats are that important.

### Conclusion

For the time being, I would add logs to confirm the cause, and either increase the timeout or scale up the server, depending on what's better for you. I would also research whether there are other ways to avoid reader starvation, if that's the problem. For example: https://docs.rs/parking_lot/latest/parking_lot/type.RwLock.html

cc @da2ce7
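As a footnote, here is a hedged sketch of what extracting the hardcoded timeout into a configuration option could look like, assuming the API router is built with axum and tower-http; `HttpApiConfig` and `request_timeout_secs` are invented names for illustration:

```rust
use std::time::Duration;

use axum::{routing::get, Router};
use tower_http::timeout::TimeoutLayer;

/// Hypothetical API config section; today the 5-second value is
/// hardcoded in src/servers/apis/routes.rs.
struct HttpApiConfig {
    /// Maximum time the API may spend on a single request, in seconds.
    request_timeout_secs: u64,
}

fn router(config: &HttpApiConfig) -> Router {
    Router::new()
        .route("/api/v1/stats", get(|| async { "stats handler goes here" }))
        // The same kind of timeout middleware as today, but the duration
        // comes from the config instead of a hardcoded value.
        .layer(TimeoutLayer::new(Duration::from_secs(
            config.request_timeout_secs,
        )))
}
```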
My tracker is serving 22.72 billion requests every month and transferring 22.88 TB of data.
Uptime is 215 days.
Memory and CPU usage are very good: it doesn't exceed 400% CPU and 4 GB of memory under heavy load.
It can handle 9,000 req/s behind nginx.
api/v1/stats sometimes times out, though. I wonder why. Maybe I'll create an issue some time.
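A quick back-of-the-envelope check shows those two figures are consistent: averaged over a 30-day month, the monthly total works out to roughly the quoted per-second rate:

$$\frac{22.72 \times 10^{9}\ \text{requests}}{30 \times 86400\ \text{s}} \approx 8{,}766\ \text{requests/s}$$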