Time is not stopped when Disk Space Monitor is triggered and report files are removed #499
Comments
By the way, when this happens, I've noticed that when I try to open the hosts report in the web UI, the result is:
When there's less than 5 GiB free BDB throws DiskLimitException which Heritrix will likely be unable to handle gracefully and crawl job will break in various ways. #499
I think the problem you've run into is that DiskSpaceMonitor is set at 5000 MiB, which is lower than the BDB je.freeDisk default of 5 GiB (5120 MiB). So I think what happened is that BDB threw a DiskLimitException before DiskSpaceMonitor had a chance to pause the crawl. There's probably a lot of code in Heritrix that can't gracefully recover from a database exception. It seems like a real gotcha that the default pause threshold is 500 MiB. Perhaps the BDB threshold changed at some point. To try to address this I've increased the default pause threshold to 8 GiB and added a note to the default job profile warning that you need to keep 5 GiB free for BDB.
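To make the unit arithmetic concrete, here is a minimal Python sketch. It is not Heritrix or BDB code: the function name and constants are illustrative assumptions, and it only compares a pause threshold given in MiB against the 5 GiB BDB reserve mentioned above.

```python
# Illustrative sketch only; nothing here is part of Heritrix or BDB.
MIB = 1024 * 1024

# The BDB je.freeDisk default described above: 5 GiB, expressed in bytes.
BDB_FREE_DISK_BYTES = 5 * 1024 * MIB


def pause_fires_before_bdb_limit(pause_threshold_mib,
                                 bdb_free_disk_bytes=BDB_FREE_DISK_BYTES):
    """Return True if shrinking free space reaches the pause threshold
    before it reaches the BDB free-disk limit."""
    return pause_threshold_mib * MIB > bdb_free_disk_bytes


# Old default, the value discussed in this issue, and the new default.
for threshold_mib in (500, 5000, 8192):
    print(threshold_mib, pause_fires_before_bdb_limit(threshold_mib))
```

Under these assumptions only the 8192 MiB value leaves room for the monitor to pause the crawl first: 5000 MiB is slightly below 5120 MiB, so BDB's limit is reached before the pause threshold.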
Oh, so you're saying that there's a hard limit imposed by BDB's je.freeDisk which is not configurable? And this limit is 5 GiB, so the DiskSpaceMonitor should be configured with a higher value so that it triggers before the BDB exception is thrown, right?
Yes. BDB itself (not the Heritrix job config) does have a mechanism for configuring it by editing a file, but the BDB documentation implies it's set at 5 GiB for a good reason. I haven't looked into it deeply myself but there's some discussion in issue #340.
Oh, ok! I hadn't understood the thread very well when I ran into it. Thank you for the explanation!
Hi!
I'm crawling with the Disk Space Monitor enabled:
Sadly for me, it was triggered, but I noticed that the crawl time didn't stop, and I also had another setting configured to stop the crawl after 1 week:
I guess that if the elapsed time carries on, it will stop the crawl when the week is reached (I haven't run any further tests, so I'm not sure). Shouldn't the time stop so that the disk issue can be fixed, if possible, before carrying on with the crawl? Maybe I'm wrong and the elapsed time shown in the web UI is just for statistics and is not the same time used for stopping the crawl. In that case, sorry for the misunderstanding :/