Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database: add the ability to remove old job records from jobs table #459

Merged
merged 4 commits into from
Jun 27, 2024

Conversation

cmoussa1
Copy link
Member

@cmoussa1 cmoussa1 commented Jun 10, 2024

Problem

As it stands right now, the jobs table in the flux-accounting database grows without bound as more and more jobs finish running and are fetched by flux-accounting. However, at a certain point, job records become irrelevant for flux-accounting in terms of calculating job usage and fair-share values.


This PR adds a new command to flux-accounting called scrub-old-jobs which takes an integer argument representing the number of weeks to go back. It will remove any job records older than the number of weeks passed
in, e.g flux account scrub-old-jobs 4 will remove any job older than 4 weeks old. If no argument is specified, the default will be 26 weeks (or 6 months).

Some basic sharness tests are added that add some fake job records to the flux-accounting database and run the scrub-old-jobs command to test removing records.

A note about this command and optimizing the memory occupied by the database file by running VACUUM is also added to the flux-accounting guide.

@cmoussa1 cmoussa1 added the new feature new feature label Jun 10, 2024
@cmoussa1 cmoussa1 force-pushed the job.archive.purge branch 2 times, most recently from 38a6d98 to b98595b Compare June 11, 2024 21:08
@cmoussa1 cmoussa1 requested a review from chu11 June 11, 2024 21:36
@cmoussa1 cmoussa1 force-pushed the job.archive.purge branch from b98595b to ee6184c Compare June 24, 2024 16:01
@chu11
Copy link
Member

chu11 commented Jun 24, 2024

when do you expect users to run this command? I didn't know much about VACUUM but this on the sqlite website concerned me

A VACUUM will fail if there is an open transaction on the database connection that is attempting to run the VACUUM. Unfinalized SQL statements typically hold a read transaction open, so the VACUUM might fail if there are unfinalized SQL statements on the same connection. VACUUM (but not VACUUM INTO) is a write operation and so if another database connection is holding a lock that prevents writes, then the VACUUM will fail.

suggesting to me that scrub should only be done on a non-live system?

@cmoussa1
Copy link
Member Author

I only expect this to be run by admins on an infrequent periodic basis when they no longer require jobs past a certain date. Maybe I could just remove the VACUUM call in this function and make a note in documentation somewhere to point out that admins can manually call VACUUM by connecting to the database themselves and running it if the size of the DB becomes a problem?

@chu11
Copy link
Member

chu11 commented Jun 25, 2024

manually call VACUUM by connecting to the database themselves and running it if the size of the DB becomes a problem?

Yeah, I think that's a better idea, just documenting how to do it manually. Also, do we know how long VACUUM takes? I suspect it can take a while so maybe putting it in the code is a bad idea??

@cmoussa1 cmoussa1 force-pushed the job.archive.purge branch 3 times, most recently from 86f8fbe to 7aa8523 Compare June 25, 2024 16:32
@cmoussa1
Copy link
Member Author

OK sounds good. VACUUM can take minutes or even tens of minutes depending on how much space the database used to take up (as well as requiring an exclusive lock on the DB), so it probably is a good idea to leave it out of the code completely. 👍 Good point!

I've just force pushed a change to remove the VACUUM call to the function. I also added a note to the flux-accounting guide about passing the --scrub optional argument and also optionally connecting to the database manually and cleaning up occupied space with VACUUM.

Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good o me. But I can't help but wonder if a "flux scrub-job-records" command would be better?

I guess the question is this scrubbing a one-off thing that you expect admins to do (presumably) during a downtime/maintenance/cleanup/something?

Or do you expect people to add this to a cron along with the normal call to flux account-fetch-job-records?

If the former, maybe a new command? If the latter, then I think this is good.

Just a subtle design choice to consider. You would know best on how people would probably use this.

@cmoussa1
Copy link
Member Author

That's a good point. I guess I was expecting this to be added onto the set of cron jobs already running for flux-accounting, i.e every time new jobs were fetched and stored into the flux-accounting DB, old ones could be removed. That way, admins don't have to worry about running a separate command to clean the DB of records themselves.

However, I could absolutely see wanting to make this a one-off command that is only run every so often. In that case, I can just make this a separate command entirely.

@ryanday36 might you have a preference on what the approach is here? Would you rather scrubbing of old jobs occur periodically and in the background (with every call to flux account-fetch-job-records? Or would you like to save something like this for a separate command that you run yourself whenever you want?

@ryanday36
Copy link
Contributor

My first thought is that I don't have a strong preference for flux account-fetch-job-records --scrub versus a separate flux account-scrub-job-records (or similar). I guess if it does turn out to impose some overhead to remove the jobs or VACUUM, it would probably be better have it as a separate command so that we can just run it off hours.

@cmoussa1
Copy link
Member Author

I think that's totally fair. I'll go ahead and modify this PR to instead just introduce a separate command instead of tacking on an optional argument to fetch-job-records. 👍

cmoussa1 added 2 commits June 26, 2024 09:34
Problem: There currently exists no way to clean the "jobs" table in the
flux-accounting DB of job records that no longer affect a user/bank's
job usage and fairshare values. This means that as time goes on, the
"jobs" table will grow without bound as more and more jobs are run on a
system.

Add a new command to flux-accounting called "scrub-old-jobs" which
takes an integer argument representing the number of weeks to go back.
It will remove any job records older than the number of weeks passed
in, e.g "flux account scrub-old-jobs 4" will remove any job older than
4 weeks old.
Problem: There is no test confirming that only privileged users can run
flux account scrub-old-jobs.

Add a test.
@cmoussa1 cmoussa1 force-pushed the job.archive.purge branch from 7aa8523 to 5f187ff Compare June 26, 2024 16:36
@@ -0,0 +1,75 @@
#!/bin/bash

test_description='test removing old job records from the flux-accounting database'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think a test for the default (no args == 26 weeks) is missing? Maybe should stick one more test record in there for something > 6 months old.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, probably should cover the path argument too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks. I just pushed up a test that does both - specifying a path to the DB as well as the default for scrub-old-jobs (which should remove a simulated job that "finished" over six months ago).

cmoussa1 added 2 commits June 26, 2024 10:11
Problem: There exists no tests for the scrub-old-jobs command.

Add some tests.
Problem: The flux-accounting guide does not make a note of the
scrub-old-jobs command or running VACUUM on the database file if space
needs to be reclaimed.

Add a note to the flux-accounting guide about what the scrub-old-jobs
command does and how to tune it.

Make a note about connecting to the database file and running VACUUM if
we want to clean up and optimize the space occupied by the file.
@cmoussa1 cmoussa1 force-pushed the job.archive.purge branch from 5f187ff to d8382fa Compare June 26, 2024 17:11
@cmoussa1
Copy link
Member Author

I've restructured this PR to instead add a new command scrub-old-jobs instead of adding an optional argument to flux account-fetch-job-records. The flux-accounting guide has also been updated accordingly to account for the change. Thanks for the feedback so far @chu11 and @ryanday36!

@cmoussa1
Copy link
Member Author

@chu11 is this OK to go in? Sorry, I know you had already approved before I restructured this PR to use a separate command.

@chu11
Copy link
Member

chu11 commented Jun 27, 2024

oh yes, I think it's good to go!

@cmoussa1
Copy link
Member Author

Thanks! Setting MWP here

@mergify mergify bot merged commit ab3f9e6 into flux-framework:master Jun 27, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants