-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DirectFileStore
creates too many files for long running applications
#143
Comments
This is related to issue #109, namely, how do you know whether a file is being written to by a process that's still alive. It's not exactly the same problem, since in this case, we can't get rid of files from old processes (or else counters would go down), but I have a feeling both problems will be solved in a related way. |
I would think an appropriate fix for this would be to have each process open a shared lock on the file it manages, for the duration of it needing the file. Then, every time a file store tries creating a new file, you examine all those that exist in the directory that you can exclusively lock and compact them into a new file, while destroying all those that existed before. Provided you think about how to do concurrency control, this would be a decent mechanism for solving it. But at this point I reckon you should probably be using an mmap system. |
Is there a locking system that will work across all OSes? |
That would certainly be a consideration, and you'd probably end up with the compaction as an optional configuration on the file store. But honestly, I'd rather see a more performant mmap approach before doing this. Just wanted to sketch out a possible implementation. |
We haven't solved this, and we're using mmap over in Python. The reads were actually changed to not use mmap recently for performance. |
Another potential option would be to have one file per process, rather than one file per process/metric. This doesn't fundamentally solve the issue, but it makes it orders of magnitude less problematic. The downside to this is that now each metric increment has essentially a mutex around it, so it is not the best performance in multi-threaded scenarios. But if, like most Ruby apps, each process is running single-threaded, or if it's not incrementing counters that often, the performance penalty will probably be negligible. Each store save should be in the order of single-digit microseconds after all. This would need a separate store (i'm not proposing modifying the existing file store to do this), but that's the point of having the swappable store backends, and it would be significantly easier to write than the compaction we're talking about here. |
I'm taking this out of the v0.10.0 milestone. I think we should come up with an answer to this problem, but I also think there's a lot of value in getting 0.10.0 out of alpha and into more people's hands in its current state. I'm going to try and get through the documentation issues so we can get to a release. After that, we can go one of a couple of ways:
|
This was a quick experiment on having all metrics for each process on the same file, because it seemed like it would be easy. (just do these things in this commit) However, even though tests pass, this doesn't actually work. See next commit on why. If the change were just this, i'd say we should do this. However, as you'll see in the next commit, it's more involved than that, and i'm not sure it's worth doing, at least not with this approach... I'm just pushing this up so it doesn't get lost.
We use passenger to run our rails application. Worker processes recreated after every 100 requests. Recently we tried client_ruby with
DirectFileStore
. Each worker process has its own pid so depending on server load there will be tens of thousands of files in prometheus work dir after couple of hours/days when app wasn't restarted.With 50k files in prometheus work dir /metrics route starts to be very slow with processing time about 10 seconds which could lead to prometheus scraper timeouts.
Is there any workaround for long running processes with
DirectFileStore
? One possible workaround I can think of is just restart app once in a few hours. Instead of restarting the whole application we can just wipe prometheus work directory once in a few hours, but this is looks a little bit hacky for me.The text was updated successfully, but these errors were encountered: