Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERROR: AbstractMultithreadedModule$ExceptionHandler:210 #314

Open
Mayoookh opened this issue Mar 31, 2020 · 26 comments
Open

ERROR: AbstractMultithreadedModule$ExceptionHandler:210 #314

Mayoookh opened this issue Mar 31, 2020 · 26 comments

Comments

@Mayoookh
Copy link

Mayoookh commented Mar 31, 2020

Hello everyone,
I have been facing an issue with PlanRouter . It appears after the simulation has already run some iterations. Here is a screenshot from the error:

image

(This has been happening for a while now but I upload pic of my recent simulation- after debugging it for certain iterations )

The warning says totalCost for ASTAR is 0 (which might be related to infinte free speed links on network) I have run the same simulation last year with the same network and there was no error. What could be the possible reason? Do I need to analyze my network again?

@josiekre
Copy link

Same issue here in at least two cities, so it is not only you @Mayoookh.

@kainagel
Copy link
Member

kainagel commented Apr 15, 2020

Could you please provide more information? Minimally MATSim version you are using, and full log file. And did you follow the recommendations further up on the logfile (no links of length zero, no links with speed infinity, ...)? Evidently, it would be good if we caught the problem earlier in the execution. I opened an issue: matsim-org/matsim-libs#963 .

The fact that it worked last year unfortunately is not an argument; given stochasticity, the routers may go in different ways through the network, and thus encounter a problematic link with one version of MATSim but not with another. :-(. (For the same reason, sometimes switching to landmarks helps.)

There were also some problems with multithreading in some particular configuration: #258 . This had to do with a particular use of the vehicles source; as long as you are not using that, I would not assume that this is your problem (although I evidently don't know).

@mrieser
Copy link
Contributor

mrieser commented Apr 15, 2020

I have access to some of the troublesome networks and am currently performing some kind of stress test by trying to route from each node in the network to every other node. Naturally, due to the size of the networks and the quadratic number of combinations I likely won't be able to run it completely. But still, it already managed to calculate several millions of routes without the error happening, and to at least route from one node to every other node of the network, which should pass more or less every 2nd link of the network (routing in one direction, assuming there is also always a reverse direction for each link).

Reading through all of #258, was the problem actually fixed? At least from the comments I was not able to see a connection between the vehicle source and the exceptions.

Trying to analyze the problem:

  • For whatever reason, the router (FastAStarLandmarks in at least one case) seems to find totalCost = 0.0 for a link.
  • The problem could be with the network (links with length 0 or freespeed=Infinity).
  • The problem could be with the TravelCost calculator.

I can exclude the networks. In cases from both Mayookh and Josie there are no links with length = 0. In Mayookh's network there are links with freespeed infinity, but their Ids are all prefixed with pt_ and none of them has car as available mode (the links look like generated by pt2matsim).

My guess currently goes towards the travel time and travel cost calculators, and maybe a concurrency issue there (either when querying, or when collecting the data). But without additional information this seems difficult to debug.

As a first step, I plan to include additional information in the log-message with currently states travelCost = 0.0, e.g. information about the link, the link length, the current time of day, the returned travel time, etc. Sadly this means we'll have to re-run the simulations with the newer versions, hoping the error happens again, and that we get additional information this way to debug the issue further.

Any other ideas?

@josiekre
Copy link

@kainagel I don't believe we are using vehicles source. Skimming through #258 makes me think such.

@mrieser I think you have all you need to rerun Atlanta and Miami with the newer version (assuming we can pull that in pretty easily being up to date on MATSim version now).

The other thing I could do is rerun Atlanta and Miami without any changes and see if the problem occurs in the same iteration again. I'm not familiar enough yet with what happens when in AStar router to know if that will tell us anything. @mrieser let me know.

@Mayoookh
Copy link
Author

I ran with 2 versions. Firstly, I was running with 12.0-2019w27-SNAPSHOT. I sent the log file of that to @mrieser . Yesterday I ran with version 11 (I also ran with 11 last year). The simulation just freezes at certain point without warning and error and doesn't go forward from there. You would notice that in the log file here :

https://www.dropbox.com/sh/m0aun0gtdjypu67/AABtasX-k1Sj__6yzd-MDNaMa?dl=0

We do use modified vehicle file with emission related information and the file is uploaded too.
Last year we modified the network (which I have already sent to @mrieser ) by making parallel links for buses (with same ID as the original link but with a prefix PuT ) but I am not sure that it would cause problems as these links don't have "car" as mode.

@mrieser
Copy link
Contributor

mrieser commented Apr 16, 2020

I pushed a commit which includes additional logging output around the area where the router currently fails. I also attach a jar-file with the MATSim-build from that commit, so everybody could try it for themselves. Please be aware that this is based on the most recent MATSim code, so if you are running e.g. with 12.0-2019w27-SNAPSHOT, there might be some incompatibilities due to changes in the code during the last few months.

I will continue my testing with this code and see if I figure something out.

matsim-12.0-astardebug1.jar.zip

@mrieser
Copy link
Contributor

mrieser commented Apr 17, 2020

Just an observation: In one case, FastAStarLandmarks was used, and in the other case AStarLandmarks. They are very similar and share a lot of code, so it's not surprising it happens in both cases, but for a moment I was wondering if it was due to the use of FastAStarLandmarks as I would otherwise expect more people to run into these troubles...

@kainagel
Copy link
Member

kainagel commented Apr 18, 2020 via email

@mrieser
Copy link
Contributor

mrieser commented Apr 18, 2020

@Mayoookh is the code of your main class that you use to start the simulation available somewhere in public, e.g. on GitHub? Considering how many people use MATSim, and hew few have reported this problem, I suspect something very scenario specific, which is why being able to look at the code would be helpful.

@Mayoookh
Copy link
Author

Hi @mrieser , I have uploaded few files here, which I think were necessary :

https://www.dropbox.com/home/MATSim

If you need more information, I will push my code, for the above, on GitHub soon.

I will also change the routing algorithm from astar to landmarks and keep you in the loop for results.

@mrieser
Copy link
Contributor

mrieser commented Apr 19, 2020

Some Update: it's still work in progress.

In one run with additional output, I received the following log statements while running PersonPrepareForSim before the first iteration actually started:

2020-04-17T20:41:58,908  WARN AStarEuclidean:156 Expected travelCost > 0, but was 0.0.
   Link: 20237723 length=157.0 freespeed=13.4112.
   TravelTime=0.0. TimeFunction=org.matsim.core.trafficmonitoring.TravelTimeCalculator$1.
   CurrTime=93300.83840372231. CostFunction=org.matsim.core.router.costcalculators.RandomizingTimeDistanceTravelDisutility.
   Person=27003050107_2016001137191_105_2  Vehicle=unknown

So there is a link with Length > 0, freespeed < Infinity, but the TravelTime is 0.0, although it should be at least 11 seconds.

After some digging (and based on other, later messages), I've found a problem in the preparation of the network (in this case, the network is converted from some source data and directly used in the simulation, without it first being written to a network.xml file and being read back in). That was the reason why I was also interested in Mayookh's code, but in his case, the network is read from file, so this could not be the reason. Also, after fixing the network issue, the errors appeared again in a run, although this time suddenly in the replanning of iteration 6. I've added additional debugging output, and this showed the following:

2020-04-19T11:41:13,595  WARN TravelTimeCalculator$1:520 Unexpected travel time: 0.0 link=891815942_F time=07:39:13 ttVeh=0.0 ttObs=0.0
2020-04-19T11:41:13,627  WARN AStarEuclidean:156 Expected travelCost > 0, but was 0.0.
   Link: 891815942_F length=450.0 freespeed=8.9408.
   TravelTime=0.0. TimeFunction=org.matsim.core.trafficmonitoring.TravelTimeCalculator$1.
   CurrTime=27553.676688141255. CostFunction=org.matsim.core.router.costcalculators.RandomizingTimeDistanceTravelDisutility.
   Person=27003050107_2015000055075_35_4  Vehicle=unknown

So it's actually the TravelTimeCalculator that already returns the bad travel times. ttVeh=0.0 is understandable from the code, as no vehicle and thus no vehicle type is available. But why the observed travel time (ttObs) is 0 is unclear.

The problem being in TravelTimeCalculator explains why both AStarLandmarks and FastAStarLandmarks are affected.

But I have not yet found the reason why the TravelTimeCalculator returns the bad data. An additional run with even more debugging output for this was started...

@kainagel
Copy link
Member

Thank you so much that you are digging into this. One remark, maybe irrelevant: There is only one instance of TravelTimeCalculator, since in theory only one observer is needed. There is already a synchronized inside consolidateData. However, IntelliJ complains:
image

@gvermillion
Copy link

gvermillion commented Apr 20, 2020

In case this is useful information, over the weekend I re-ran a city that got caught in this "infinite loop" error to see if it happened in the same spot---it did not. The run has since been terminated due to cost, so I'm not certain if it would have run into the error again.

@Mayoookh
Copy link
Author

@mrieser Thank you for a continuous update. I have pushed my code to start the simulation here : https://github.com/Mayoookh/Matsim-Scenarios/tree/master/src/main/java/org/matsim/project

I have not pushed the input files though.
From your comment, it gets clear that there is no problem in my network. And since there is problem with TravelTimeCalculator (?), using Dijkstara isn't also a quick solution for my case.

@Mayoookh
Copy link
Author

The fact that it worked last year unfortunately is not an argument; given stochasticity, the routers may go in different ways through the network, and thus encounter a problematic link with one version of MATSim but not with another. :-(

@kainagel I tried running the model with 11.0 , the same version I used last year. I am still having the same problem. So maybe its also because of stochasticity of routers in same version too(?) (Even though I ran 16 scenarios last time and didn't run into this problem)

@mrieser
Copy link
Contributor

mrieser commented Apr 21, 2020

I'm more and more confident that it's a multi-threading issue, and that there is some strange race-condition I don't yet understand. It's at least the only explanation why it sometimes works, sometimes fails early and sometimes fails late, although it is always the same input.

There might be also other influences:

  • The exact java version used (Java might have introduced some minor change in caching or synchronization somewhere that we may trigger now)
  • The OS used
  • The CPU used (different CPUs might have different caching behavior for multiple threads).

Did any of these things change in your case, Mayookh?

In the case of Josie and Grant, we switched from Java 8 to Java 11.0.6, and run it on AWS inside a Docker container (we also upgraded the docker image to use the newer Java version). So there have been some changes, and there are parts we cannot control (e.g. the hardware AWS uses for the instances).

@Mayoookh
Copy link
Author

Mayoookh commented Apr 22, 2020

@mrieser this actually helped. I was using JDK 12 this time. I switched back to 8 and the simulation worked. It ran all the iterations without running into this problem. I would go forward with it for now and dig into the problem later

@mrieser
Copy link
Contributor

mrieser commented Apr 25, 2020

Update, still work in progress.
There is something mysterious going on. I've printed out the content of the arrays storing the travel times per time-bin. In the following case of a link that was never driven on in the last iteration (all counts cnt are zero 0). The travel times are thus -1 meaning no measurement, and the class would then return the freespeed travel time.

2020-04-22T16:19:37,822  INFO TravelTimeDataArray:93 internal data for 1240683802_F  
  tt=[-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0, -1.0, 0.0, 0.0, -1.0]  
  cnt=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

But:

  • The travel times are sometimes 0.0. And when the travel time in one time bin is 0, then the next one is guaranteed to also be 0, but the one after not! (I checked over 430k such lines with grep, it never found any line with 3 consecutive zeros). So the 0.0 always only appear in pairs.
  • Between two pairs of '0.0, 0.0' there are also at least 2 non-zero values.
  • The zeros can start appearing at even or odd indices. So it's not something like a for-loop where the counter variable is incremented twice by accident or so.

The zeros also appearing on links that have no current measurement seems to rule out the problem of a concurrency issue with collecting the data (i.e. in the Events handler during the mobsim phase, when all the values are coming in).

This pair-wise appearance is very strange, and I still have not yet figured out, how the zeros end up in those spaces... I've now basically added a logging statement to every place where the value is set, trying to figure out where these values come from.

@josiekre
Copy link

We have some new information. @gvermillion ran a smaller city (Flagstaff, AZ) yesterday for an experiment using a jar from Java 11. He kicked off 18 simulations at the same time using different scoring parameters for each. Of the 18, 4 got stuck in the infinite loop.

Flagstaff has a network of ~150k links (all streets-- large rural area) and only ~140k agents (100% of the population) and usually takes about 3h to run.

We originally thought this problem was limited to larger cities, but it turns out that assumption is not true.

@Mayoookh
Copy link
Author

Mayoookh commented May 12, 2020

I checked my network using KNIME . There were no links with zero length (as the warning said). Only the PT links had infinite speed with modes "bus, artificial". Moreover, I tried both Astar landmarks and Dijkstra but the problem persists.
@mrieser are there any updates on this problem? Thanks for your help!

@mrieser
Copy link
Contributor

mrieser commented May 12, 2020

In one case, we temporarily solved it by switching from Java 11 back to Java 8: In Java 11, 4 out of 18 runs produced the error, back on Java 8 the same scenario ran around 60 times without a problem.

I was now able to reproduce the problem with a smaller scenario on a local server (no cloud, no Docker, but bare-metal). I'm currently running more tests with additional logging output, but as soon as I include too many conditional logging statements, the error no longer happens... which makes the debugging hard (my interpretation: with the additional logging statements, the methods e.g. become too large for inlining or other JVM-internal effects which then no longer trigger the bug). So I'm currently experimenting on how I can reproduce the bug but still include enough debugging to dig deeper. It's work in progress, and usually it runs for 2 days or so until I can analyze and restart it, so it takes some time.

At the moment, the safest really seems to be to use Java 8.

@kainagel
Copy link
Member

kainagel commented May 13, 2020

If you can somewhat deterministically reproduce the error: Have you tried replacing the short[] in TravelTimeDataArray with int[] or even long[]?

@mrieser
Copy link
Contributor

mrieser commented May 13, 2020

this was an int[] until last October, and Mayookh is using a version of MATSim that predates this change. So he is having the problem with an int[] actually. Haven't tried with a long[] yet.

@Mayoookh
Copy link
Author

At the moment, the safest really seems to be to use Java 8.

I too switched from 12 to 8 and then it worked fine for me

@mrieser this actually helped. I was using JDK 12 this time. I switched back to 8 and the simulation worked.

Just for information, I am running various scenarios. In one of the scenarios, I just have 3 modes, i.e. walk, car and bike (without PT ) - its running again into the same problem, even on Java 8. It doesn't throw error but it just freezes after warning of "Astar totalcost = 0.0".

@sebhoerl
Copy link
Contributor

sebhoerl commented May 20, 2020

Just wanted to add a comment to this conversation. I'm facing the same problem now. I was moving a simulation from one server to another and suddenly I kept getting the NullPointerException with totalCost <= 0. After trying everything possible, I noticed that the old server had Java 8, the new one Java 11. Switching back to Java 8 everything runs smoothly again. So the cause really seems to be related to that. No further insight yet, though. (And this is still a MATSim 11 simulation).

@mrieser
Copy link
Contributor

mrieser commented May 25, 2020

I have now cleaned up my debugging work and created a pull-request. While doing so, I realized that I was not able to reproduce the issue anymore after I did some minor performance-optimizations to the code while debugging. So there is hope that these modifications actually prevent the issue from triggering, although I still do not yet understand how the unexpected values could have crept in.

If you are affected by this bug:

  • If you use an older version of MATSim, try to run it with Java 8 instead of Java 11 or newer.
  • If you use a newer version of MATSim that requires Java 11, try to use a version which has these changes included (likely a weekly-release 2020w22 or newer)

If the problem occurs with Java 11 and a newer MATSim version, please leave a comment with details (at least MATSim version, Java version, helpful would also be information about the OS being used etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants