
Time is not stopped when Disk Space Monitor is triggered and report files are removed #499

Closed
cgr71ii opened this issue Sep 9, 2022 · 5 comments


cgr71ii commented Sep 9, 2022

Hi!

I'm crawling with the Disk Space Monitor enabled:

 <bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
   <property name="pauseThresholdMiB" value="5000" />
   <property name="monitorConfigPaths" value="true" />
   <property name="monitorPaths">
     <list>
       <value>/</value>
     </list>
   </property>
 </bean>

Unfortunately, it was triggered, but I noticed that the crawl timer didn't stop. I also have another configuration intended to stop the crawl after one week:

 <bean id="crawlLimiter" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <!-- <property name="maxBytesDownload" value="0" /> -->
  <!-- <property name="maxDocumentsDownload" value="0" /> -->
  <property name="maxTimeSeconds" value="604800" /> <!-- Crawl for a week -->
 </bean>

I guess that if the elapsed time keeps running, the crawl will be stopped once the week is reached (I haven't run any further tests, so I'm not sure). Shouldn't the timer pause so that the disk issue can be fixed, if possible, and the crawl then carry on? Maybe I'm wrong and the elapsed time shown in the web UI is only for statistics and is not the same timer used to stop the crawl. In that case, sorry for the misunderstanding :/


cgr71ii commented Sep 10, 2022

By the way, when this happens, I've noticed that hosts-report.txt, seeds-report.txt, etc. are empty. I guess this is because the Disk Space Monitor stops all writes when it is triggered, which may leave write operations in an inconsistent state so the files never finish correctly, but I'm just guessing. Shouldn't the Disk Space Monitor let the report files finish writing, so that the statistics aren't all lost?

When I try to open the hosts report in the web UI, the result is:

An error occurred
Cause: com.sleepycat.je.DiskLimitException: (JE 7.5.11) Disk usage is not within je.maxDisk or je.freeDisk limits and write operations are prohibited: maxDiskLimit=0 freeDiskLimit=5,368,709,120 adjustedMaxDiskLimit=0 maxDiskOverage=0 freeDiskShortage=37,150,720 diskFreeSpace=5,331,558,400 availableLogSize=-37,150,720 totalLogSize=234,239,849,739 activeLogSize=234,239,849,739 reservedLogSize=0 protectedLogSize=0 protectedLogSizeMap={}

com.sleepycat.je.DiskLimitException: (JE 7.5.11) Disk usage is not within je.maxDisk or je.freeDisk limits and write operations are prohibited: maxDiskLimit=0 freeDiskLimit=5,368,709,120 adjustedMaxDiskLimit=0 maxDiskOverage=0 freeDiskShortage=37,150,720 diskFreeSpace=5,331,558,400 availableLogSize=-37,150,720 totalLogSize=234,239,849,739 activeLogSize=234,239,849,739 reservedLogSize=0 protectedLogSize=0 protectedLogSizeMap={}
	at com.sleepycat.je.Cursor.checkUpdatesAllowed(Cursor.java:5337)
	at com.sleepycat.je.Cursor.checkUpdatesAllowed(Cursor.java:5314)
	at com.sleepycat.je.Cursor.putInternal(Cursor.java:2410)
	at com.sleepycat.je.Cursor.putInternal(Cursor.java:830)
	at com.sleepycat.je.Cursor.put(Cursor.java:787)
	at com.sleepycat.je.Cursor.put(Cursor.java:885)
	at com.sleepycat.util.keyrange.RangeCursor.put(RangeCursor.java:1055)
	at com.sleepycat.collections.DataCursor.put(DataCursor.java:802)
	at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:329)
	at com.sleepycat.collections.StoredMap.put(StoredMap.java:285)
	at org.archive.crawler.reporting.StatisticsTracker$2.execute(StatisticsTracker.java:866)
	at org.archive.modules.fetcher.DefaultServerCache.forAllHostsDo(DefaultServerCache.java:157)
	at org.archive.crawler.reporting.StatisticsTracker.calcReverseSortedHostsDistribution(StatisticsTracker.java:862)
	at org.archive.crawler.reporting.HostsReport.write(HostsReport.java:82)
	at org.archive.crawler.reporting.StatisticsTracker.writeReportFile(StatisticsTracker.java:898)
	at org.archive.crawler.reporting.StatisticsTracker.writeReportFile(StatisticsTracker.java:875)
	at org.archive.crawler.restlet.ReportGenResource.get(ReportGenResource.java:55)
	at org.restlet.resource.ServerResource.doHandle(ServerResource.java:603)
	at org.restlet.resource.ServerResource.doNegotiatedHandle(ServerResource.java:662)
	at org.restlet.resource.ServerResource.doConditionalHandle(ServerResource.java:348)
	at org.restlet.resource.ServerResource.handle(ServerResource.java:1020)
	at org.restlet.resource.Finder.handle(Finder.java:236)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Router.doHandle(Router.java:422)
	at org.restlet.routing.Router.handle(Router.java:641)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.engine.application.StatusFilter.doHandle(StatusFilter.java:140)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.engine.CompositeHelper.handle(CompositeHelper.java:202)
	at org.restlet.engine.application.ApplicationHelper.handle(ApplicationHelper.java:77)
	at org.restlet.Application.handle(Application.java:385)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Router.doHandle(Router.java:422)
	at org.restlet.routing.Router.handle(Router.java:641)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Router.doHandle(Router.java:422)
	at org.restlet.routing.Router.handle(Router.java:641)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.engine.application.StatusFilter.doHandle(StatusFilter.java:140)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.routing.Filter.doHandle(Filter.java:150)
	at org.restlet.routing.Filter.handle(Filter.java:197)
	at org.restlet.engine.CompositeHelper.handle(CompositeHelper.java:202)
	at org.restlet.Component.handle(Component.java:408)
	at org.restlet.Server.handle(Server.java:507)
	at org.restlet.engine.connector.ServerHelper.handle(ServerHelper.java:63)
	at org.restlet.engine.adapter.HttpServerHelper.handle(HttpServerHelper.java:143)
	at org.restlet.ext.jetty.JettyServerHelper$WrappedServer.handle(JettyServerHelper.java:237)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:279)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:540)
	at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:395)
	at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:161)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:779)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:911)
	at java.base/java.lang.Thread.run(Thread.java:829)

@cgr71ii cgr71ii changed the title Time is not stopped when Disk Space Monitor is triggered Time is not stopped when Disk Space Monitor is triggered and report files are removed Sep 10, 2022
ato added a commit that referenced this issue Sep 10, 2022
When there's less than 5 GiB free BDB throws DiskLimitException which
Heritrix will likely be unable to handle gracefully and crawl job will
break in various ways. #499

ato (Collaborator) commented Sep 10, 2022

I think the problem you've run into is that DiskSpaceMonitor is set at 5000 MiB, which is lower than the BDB je.freeDisk default of 5 GiB (5120 MiB). So what probably happened is that DiskSpaceMonitor couldn't pause the crawl before BDB threw a DiskLimitException. There's probably a lot of code in Heritrix that can't gracefully recover from a database exception.

It seems like a real gotcha that the default pause threshold is only 500 MiB; perhaps the BDB threshold changed at some point. To address this I've increased the default pause threshold to 8 GiB and added a note to the default job profile warning that you need to keep 5 GiB free for BDB.
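For anyone hitting this before the new default lands, a sketch of the adjusted crawler-beans configuration: the same diskSpaceMonitor bean from above, but with pauseThresholdMiB raised above BDB's 5 GiB (5120 MiB) je.freeDisk floor (8192 MiB here mirrors the new default; adjust to taste):

```xml
<!-- Keep pauseThresholdMiB comfortably above BDB's je.freeDisk default of
     5 GiB (5120 MiB); otherwise BDB throws DiskLimitException before the
     monitor has a chance to pause the crawl. -->
<bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
  <property name="pauseThresholdMiB" value="8192" />
  <property name="monitorConfigPaths" value="true" />
  <property name="monitorPaths">
    <list>
      <value>/</value>
    </list>
  </property>
</bean>
```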


cgr71ii commented Sep 10, 2022

Oh, so you're saying that there's a hard limit imposed by BDB's je.freeDisk which is not configurable from the job config? And since this limit is 5 GiB, DiskSpaceMonitor should be configured with a higher value so that it triggers before the BDB exception does, right?


ato commented Sep 11, 2022

Yes.

BDB itself (not the Heritrix job config) does have a mechanism for configuring it by editing a file, but the BDB documentation implies it's set at 5 GiB for a good reason. I haven't looked into it deeply myself, but there's some discussion in issue #340.
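For reference, the file in question is je.properties, which BDB JE reads from the database environment's home directory (for Heritrix that is the job's state directory). A minimal sketch, assuming you accept the risks the JE documentation warns about; the value here (1 GiB in bytes) is purely illustrative:

```properties
# je.properties — place in the BDB environment home directory
# (Heritrix keeps its BDB environment under the job's "state" directory).
# Lowers JE's free-disk floor from the 5 GiB default; values are in bytes.
# The default exists to leave JE room to clean and recover, so lower it
# at your own risk.
je.freeDisk=1073741824
```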


cgr71ii commented Sep 11, 2022

Oh, OK! I hadn't understood the thread very well when I ran into it. Thank you for the explanation!

@cgr71ii cgr71ii closed this as completed Sep 11, 2022
ato added a commit to nla/heritrix3 that referenced this issue Jan 23, 2023
When there's less than 5 GiB free BDB throws DiskLimitException which
Heritrix will likely be unable to handle gracefully and crawl job will
break in various ways. internetarchive#499