
Time is not stopped when Disk Space Monitor is triggered and report files are removed #499

Closed
cgr71ii opened this issue Sep 9, 2022 · 5 comments


cgr71ii commented Sep 9, 2022

Hi!

I'm crawling with the Disk Space Monitor enabled:

 <bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
   <property name="pauseThresholdMiB" value="5000" />
   <property name="monitorConfigPaths" value="true" />
   <property name="monitorPaths">
     <list>
       <value>/</value>
     </list>
   </property>
 </bean>

Unfortunately, it was triggered, but I noticed that the crawl timer didn't stop. I also have another configuration intended to stop the crawl after one week:

 <bean id="crawlLimiter" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <!-- <property name="maxBytesDownload" value="0" /> -->
  <!-- <property name="maxDocumentsDownload" value="0" /> -->
  <property name="maxTimeSeconds" value="604800" /> <!-- Crawl for a week -->
 </bean>

I guess that if the elapsed time keeps running, the crawl will be stopped once the week is reached (I haven't run any further tests, so I'm not sure). Shouldn't the timer pause so that the disk issue can be fixed, if possible, and the crawl then carry on? Maybe I'm wrong and the elapsed time shown in the web UI is only for statistics and is not the same timer used to stop the crawl. In that case, sorry for the misunderstanding :/


cgr71ii commented Sep 10, 2022

By the way, when this happens, I've noticed that hosts-report.txt, seeds-report.txt, etc. are empty. I guess this is because the Disk Space Monitor stops all writes when it is triggered, which may leave write operations in an inconsistent state so the files never finish correctly, but I'm just guessing. Shouldn't the Disk Space Monitor let the report files finish writing, so that the statistics aren't all lost?

When I try to open the hosts report in the web UI, the result is:

An error occurred
Cause: com.sleepycat.je.DiskLimitException: (JE 7.5.11) Disk usage is not within je.maxDisk or je.freeDisk limits and write operations are prohibited: maxDiskLimit=0 freeDiskLimit=5,368,709,120 adjustedMaxDiskLimit=0 maxDiskOverage=0 freeDiskShortage=37,150,720 diskFreeSpace=5,331,558,400 availableLogSize=-37,150,720 totalLogSize=234,239,849,739 activeLogSize=234,239,849,739 reservedLogSize=0 protectedLogSize=0 protectedLogSizeMap={}

com.sleepycat.je.DiskLimitException: (JE 7.5.11) Disk usage is not within je.maxDisk or je.freeDisk limits and write operations are prohibited: maxDiskLimit=0 freeDiskLimit=5,368,709,120 adjustedMaxDiskLimit=0 maxDiskOverage=0 freeDiskShortage=37,150,720 diskFreeSpace=5,331,558,400 availableLogSize=-37,150,720 totalLogSize=234,239,849,739 activeLogSize=234,239,849,739 reservedLogSize=0 protectedLogSize=0 protectedLogSizeMap={}
	at com.sleepycat.je.Cursor.checkUpdatesAllowed(Cursor.java:5337)
	at com.sleepycat.je.Cursor.checkUpdatesAllowed(Cursor.java:5314)
	at com.sleepycat.je.Cursor.putInternal(Cursor.java:2410)
	at com.sleepycat.je.Cursor.putInternal(Cursor.java:830)
	at com.sleepycat.je.Cursor.put(Cursor.java:787)
	at com.sleepycat.je.Cursor.put(Cursor.java:885)
	at com.sleepycat.util.keyrange.RangeCursor.put(RangeCursor.java:1055)
	at com.sleepycat.collections.DataCursor.put(DataCursor.java:802)
	at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:329)
	at com.sleepycat.collections.StoredMap.put(StoredMap.java:285)
	at org.archive.crawler.reporting.StatisticsTracker$2.execute(StatisticsTracker.java:866)
	at org.archive.modules.fetcher.DefaultServerCache.forAllHostsDo(DefaultServerCache.java:157)
	at org.archive.crawler.reporting.StatisticsTracker.calcReverseSortedHostsDistribution(StatisticsTracker.java:862)
	at org.archive.crawler.reporting.HostsReport.write(HostsReport.java:82)
	at org.archive.crawler.reporting.StatisticsTracker.writeReportFile(StatisticsTracker.java:898)
	at org.archive.crawler.reporting.StatisticsTracker.writeReportFile(StatisticsTracker.java:875)
	at org.archive.crawler.restlet.ReportGenResource.get(ReportGenResource.java:55)
	at org.restlet.resource.ServerResource.doHandle(ServerResource.java:603)
	at org.restlet.resource.ServerResource.doNegotiatedHandle(ServerResource.java:662)
	at org.restlet.resource.ServerResource.doConditionalHandle(ServerResource.java:348)
	at org.restlet.resource.ServerResource.handle(ServerResource.java:1020)
	at org.restlet.resource.Finder.handle(Finder.java:236)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Router.doHandle(Router.java:422)
	at org.restlet.routing.Router.handle(Router.java:641)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.engine.application.StatusFilter.doHandle(StatusFilter.java:140)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.engine.CompositeHelper.handle(CompositeHelper.java:202)
	at org.restlet.engine.application.ApplicationHelper.handle(ApplicationHelper.java:77)
	at org.restlet.Application.handle(Application.java:385)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Router.doHandle(Router.java:422)
	at org.restlet.routing.Router.handle(Router.java:641)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Router.doHandle(Router.java:422)
	at org.restlet.routing.Router.handle(Router.java:641)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.engine.application.StatusFilter.doHandle(StatusFilter.java:140)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.engine.CompositeHelper.handle(CompositeHelper.java:202)
	at org.restlet.Component.handle(Component.java:408)
	at org.restlet.Server.handle(Server.java:507)
	at org.restlet.engine.connector.ServerHelper.handle(ServerHelper.java:63)
	at org.restlet.engine.adapter.HttpServerHelper.handle(HttpServerHelper.java:143)
	at org.restlet.ext.jetty.JettyServerHelper$WrappedServer.handle(JettyServerHelper.java:237)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:279)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:540)
	at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:395)
	at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:161)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:779)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:911)
	at java.base/java.lang.Thread.run(Thread.java:829)

@cgr71ii cgr71ii changed the title Time is not stopped when Disk Space Monitor is triggered Time is not stopped when Disk Space Monitor is triggered and report files are removed Sep 10, 2022
ato added a commit that referenced this issue Sep 10, 2022
When there's less than 5 GiB free BDB throws DiskLimitException which
Heritrix will likely be unable to handle gracefully and crawl job will
break in various ways. #499

ato (Collaborator) commented Sep 10, 2022

I think the problem you've run into is that DiskSpaceMonitor is set at 5000 MiB, which is lower than the BDB je.freeDisk default of 5 GiB (5120 MiB). So what probably happened is that DiskSpaceMonitor couldn't pause the crawl before BDB threw a DiskLimitException. There's probably a lot of code in Heritrix that can't gracefully recover from a database exception.

It seems like a real gotcha that the default pause threshold is only 500 MiB; perhaps the BDB threshold changed at some point. To address this I've increased the default pause threshold to 8 GiB and added a note to the default job profile warning that you need to keep 5 GiB free for BDB.
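For anyone hitting this before the new default lands, a sketch of the adjusted crawler-beans configuration: the same diskSpaceMonitor bean from above, but with pauseThresholdMiB raised above BDB's 5 GiB (5120 MiB) je.freeDisk floor (8192 MiB here mirrors the new default; adjust to taste):

```xml
<!-- Keep pauseThresholdMiB comfortably above BDB's je.freeDisk default of
     5 GiB (5120 MiB); otherwise BDB throws DiskLimitException before the
     monitor has a chance to pause the crawl. -->
<bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
  <property name="pauseThresholdMiB" value="8192" />
  <property name="monitorConfigPaths" value="true" />
  <property name="monitorPaths">
    <list>
      <value>/</value>
    </list>
  </property>
</bean>
```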


cgr71ii commented Sep 10, 2022

Oh, so you're saying that there's a hard limit imposed by BDB's je.freeDisk which is not configurable from the job config? And since this limit is 5 GiB, DiskSpaceMonitor should be configured with a higher value so that it triggers before the BDB exception does, right?


ato commented Sep 11, 2022

Yes.

BDB itself (not the Heritrix job config) does have a mechanism for configuring it by editing a file, but the BDB documentation implies it's set at 5 GiB for a good reason. I haven't looked into it deeply myself, but there's some discussion in issue #340.
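For reference, the file in question is je.properties, which BDB JE reads from the database environment's home directory (for Heritrix that is the job's state directory). A minimal sketch, assuming you accept the risks the JE documentation warns about; the value here (1 GiB in bytes) is purely illustrative:

```properties
# je.properties — place in the BDB environment home directory
# (Heritrix keeps its BDB environment under the job's "state" directory).
# Lowers JE's free-disk floor from the 5 GiB default; values are in bytes.
# The default exists to leave JE room to clean and recover, so lower it
# at your own risk.
je.freeDisk=1073741824
```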


cgr71ii commented Sep 11, 2022

Oh, OK! I hadn't understood the thread very well when I ran into it. Thank you for the explanation!

@cgr71ii cgr71ii closed this as completed Sep 11, 2022
ato added a commit to nla/heritrix3 that referenced this issue Jan 23, 2023
When there's less than 5 GiB free BDB throws DiskLimitException which
Heritrix will likely be unable to handle gracefully and crawl job will
break in various ways. internetarchive#499