
neo4j website down #48

Closed

jromanowska opened this issue Sep 16, 2022 · 6 comments

Comments

@jromanowska

Hi, I've been trying to reach the Hetionet Neo4j browser (https://neo4j.het.io/browser/) but it doesn't load. I've tried several web browsers, on both Linux and Windows; it just spins and shows nothing.

dhimmel transferred this issue from hetio/het.io Sep 17, 2022
@dhimmel (Member) commented Sep 17, 2022

Ah, thanks for the heads up. I thought we had created an uptime check in #45 (comment) that would restart the instance if it became unresponsive like this.

Tagging @falquaddoomi, who helped last time. I can restart the instance, but it might be good to keep this error active so we can make sure the uptime check detects it. (@falquaddoomi, no rush; don't interrupt your weekend.)

@falquaddoomi

Sorry for the trouble you've been having with the service, @jromanowska. Also, hey @dhimmel: we do have an uptime check set up for the neo4j instance, but it only reports that the instance is inaccessible; it doesn't reboot it. It's also unfortunately very noisy, so it's hard to tell when a real outage is occurring versus a transient network issue on Google's side. Since no one had complained, I'd assumed these were just transient issues, but apparently not. From now on I'll look into them as soon as they come up.

After looking into the logs a bit today, it seems the neo4j instance hits a series of out-of-memory exceptions that leave it unable to fully service requests. Oddly, it'll still serve static resources, just with very high (30+ second) latency. I'm going to try bumping up the RAM on the instance, and I'll also add a daemon on the machine itself that checks whether https://neo4j.het.io/browser/ is responsive and reboots the docker container if it isn't. I'll keep investigating why this is happening, since if there's a memory leak what I proposed will just delay the outages, not eliminate them.

Perhaps we should keep this issue open for a week or so to see if the problem is resolved, and close it after that?

@falquaddoomi

Just FYI, I've put in a monitoring script that'll reboot the neo4j container if https://neo4j.het.io/browser/ takes longer than 30 seconds to return, or if it returns a non-200 response. I've also increased the RAM on the instance from 8 GB to 12 GB, and I'll be watching the logs and the uptime check for "transient" issues as well. Here's hoping these changes improve its stability, but do let me know if any of you have issues with it. 🤞
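The monitoring script itself isn't shown in this thread; the following is a minimal sketch of the approach described above, not the actual implementation. It assumes the Docker container is named neo4j, that docker restart is available on the host, and that the script is run periodically (e.g. from cron); the 30-second timeout and non-200 check mirror the description in the comment.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the watchdog described above (not the deployed script).

Checks whether the Neo4j browser answers with HTTP 200 within 30 seconds;
if not, restarts the Docker container. Container name and scheduling are assumptions.
"""
import subprocess
import urllib.error
import urllib.request

URL = "https://neo4j.het.io/browser/"
TIMEOUT_SECONDS = 30
CONTAINER = "neo4j"  # assumed container name


def browser_is_healthy() -> bool:
    """Return True if the browser URL responds with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT_SECONDS) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Timeouts, connection failures, and non-2xx responses (HTTPError) land here.
        return False


def main() -> None:
    if not browser_is_healthy():
        # Restart the container so Neo4j comes back fresh and its memory usage resets.
        subprocess.run(["docker", "restart", CONTAINER], check=False)


if __name__ == "__main__":
    main()
```

Run every few minutes from cron or a systemd timer, an unresponsive browser endpoint would trigger a container restart, which also resets Neo4j's memory usage.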

@dhimmel (Member) commented Sep 20, 2022

Thanks a lot @falquaddoomi! Stoked that we're able to automate the restarts.

“I'll keep investigating why this is happening, since if there's a memory leak what I proposed will just delay the outages, not eliminate them.”

But the outages will be short-lived, and the reboot will reset the memory usage, right?

Since the instance is running a pretty old version of Neo4j, there's probably not a ton of value in spending much time diagnosing the memory leak. I played around with upgrading in #33, but was hitting a bunch of problems.

So in summary, don't worry too much about digging into the memory leak unless you think that will create an actionable insight.

@falquaddoomi

Right, the outages shouldn't last more than 5 minutes (that's the current polling interval), and if necessary the entire neo4j container gets restarted, which would reset its memory usage. Fair point about it not being worth tracking down a memory leak in an older version of neo4j. I'll take a look at #33 and see if I can make progress on it.

@dhimmel (Member) commented Sep 21, 2022

“I'll take a look at #33 and see if I can make progress on it.”

Any help is appreciated, but a forewarning: there are several things that were breaking: guides, HTTPS, and more. I'm happy to video chat at any point and give you an overview of the hurdles if that'd be helpful.
