[Fargate] [regression]: potential network instability in 1.4.0 #992

Open
ghost opened this issue Jul 21, 2020 · 4 comments
Labels
Fargate PV1.4, Fargate, Proposed

Comments

@ghost commented Jul 21, 2020

Apologies in advance for a relatively vague issue description, but unfortunately even with debug logging it is difficult to shed light on this.

We have been running Logstash (latest stable) on Fargate platform version 1.3.0 with AWS MSK (Kafka) as the input (i.e. Logstash acts as a consumer) for a couple of months without a single issue: no crashes, even with autoscaling triggering based on CPU usage.
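For context, the pipeline is roughly of the following shape (the broker, topic, and Elasticsearch endpoints below are placeholders, not our actual values):

```conf
input {
  kafka {
    # MSK bootstrap broker (placeholder endpoint)
    bootstrap_servers => "b-1.example.kafka.eu-west-1.amazonaws.com:9092"
    topics            => ["app-logs"]
    group_id          => "logstash-consumer"
    codec             => "json"
  }
}

output {
  elasticsearch {
    # Placeholder Elasticsearch endpoint
    hosts => ["https://es.example.internal:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```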

As soon as we switched to 1.4.0, though, we started noticing Logstash suddenly, and unfortunately silently, crashing: the Logstash container/process is still running and health checks are fine, but it is no longer consuming any logs.

Upon further inspection we unfortunately cannot see anything in the logs; however, we can tell that the connection to Kafka is no longer active.

Switching back to 1.3.0 without any further changes restored stability (it has been running for over a week without a single "crash").

With my limited visibility, and in conjunction with logstash-plugins/logstash-integration-kafka#15, I am concluding that the connection to Kafka may be dropping unexpectedly in a way that Logstash is unable to catch at that point.

Is it possible that a connection timeout was introduced between 1.3.0 and 1.4.0, or are there any other changes that might interfere with Kafka in any way?

We have been running other applications that mostly use short-lived connections (including a Kafka producer) on 1.4.0 for weeks without any issues.
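If it is indeed an idle-connection drop, one thing we could try on our side is tightening the kafka input's timeout/idle settings and seeing whether the consumer recovers. A rough sketch, assuming the plugin version we run exposes these options (option names follow the kafka input documentation; the values are only illustrative):

```conf
input {
  kafka {
    bootstrap_servers       => "b-1.example.kafka.eu-west-1.amazonaws.com:9092"
    topics                  => ["app-logs"]
    group_id                => "logstash-consumer"
    # Fail faster on broken or idle connections (illustrative values)
    session_timeout_ms      => 10000
    request_timeout_ms      => 40000
    connections_max_idle_ms => 60000
  }
}
```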

ghost added the Proposed label Jul 21, 2020
@ddyzhang

Hi @guntergt

The Fargate team is currently tracking a known issue where customer applications can hang or become unresponsive when the application pushes a large volume of logs to the logging driver configured in the ECS task definition. If possible, can you please try disabling logging completely in your ECS task definition and running on PV 1.4.0 again?
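To be explicit, disabling logging here means removing the logConfiguration block from the container definition entirely, along these lines (the family, container name, and image below are placeholders):

```json
{
  "family": "logstash-task",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "logstash",
      "image": "docker.elastic.co/logstash/logstash:7.8.0",
      "essential": true
    }
  ]
}
```

With logConfiguration omitted, the container's stdout/stderr is not sent to any log driver, which takes the logging path out of the picture for this test.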

Thank you.

@ghost (Author) commented Jul 22, 2020

Thanks @ddyzhang, we are aware of that issue; however, the Logstash ECS service itself hardly pushes any logs to CloudWatch Logs. It consumes logs from Kafka and pushes them to Elasticsearch, but otherwise runs very quietly.

@ddyzhang

Thanks for getting back to us. I'm assuming logs were disabled in your task definition to verify that it was indeed not the root cause? If not, I definitely recommend trying it just so we can definitively rule it out.

As for general network instability, we have not had any other reports of networking issues with Fargate PV 1.4.0 yet. Enabling VPC Flow Logs could help identify whether or not the request to Kafka is making it out of your application at all.
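For example, flow logs for the VPC the tasks run in, delivered to CloudWatch Logs, can be enabled with something like the following (the VPC ID, log group name, and IAM role ARN are placeholders):

```sh
# Enable VPC Flow Logs for the task's VPC (placeholder IDs/ARNs)
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name fargate-vpc-flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/vpc-flow-logs-role
```

Filtering the resulting flow log records on the broker port (typically 9092 for plaintext or 9094 for TLS with MSK) should show whether traffic to the Kafka brokers is leaving the task's ENI and whether anything is coming back.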

In the meantime, can you please email me at [email protected] with an example task ID so that I can look into this a bit more? Thanks.

@borfig commented Jul 16, 2021

I believe we are seeing a similar scenario: no logs are written to CloudWatch Logs for some time, and then Logstash crashes.
