[Fargate] [regression]: potential network instability in 1.4.0 #992

Open
ghost opened this issue Jul 21, 2020 · 4 comments
Labels
Fargate PV1.4, Fargate, Proposed

Comments

@ghost commented Jul 21, 2020

Apologies in advance for a relatively vague issue description, but unfortunately even with debug logging it is difficult to shed light on this.

We have been running Logstash (latest stable) on Fargate platform version 1.3.0 with AWS MSK (Kafka) as the input (i.e. Logstash acts as a consumer) for a couple of months without a single issue: no crashes, even with autoscaling triggering based on CPU usage.
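For context, the pipeline is roughly of the following shape (the broker, topic, and Elasticsearch endpoints below are placeholders, not our actual values):

```conf
input {
  kafka {
    # MSK bootstrap broker (placeholder endpoint)
    bootstrap_servers => "b-1.example.kafka.eu-west-1.amazonaws.com:9092"
    topics            => ["app-logs"]
    group_id          => "logstash-consumer"
    codec             => "json"
  }
}

output {
  elasticsearch {
    # Placeholder Elasticsearch endpoint
    hosts => ["https://es.example.internal:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```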

As soon as we switched to 1.4.0, though, we started noticing Logstash suddenly, and unfortunately silently, crashing: the Logstash container/process is still running and health checks are fine, but it is no longer consuming any logs.

Upon further inspection we unfortunately cannot see anything in the logs; however, we can tell that the connection to Kafka is no longer active.

Switching back to 1.3.0 without any further changes restored stability (it has been running for over a week without a single "crash").

With my limited visibility, and in conjunction with logstash-plugins/logstash-integration-kafka#15, I am concluding that the connection to Kafka may be dropping unexpectedly in a way that Logstash is unable to catch at that point.

Is it possible that a connection timeout was introduced between 1.3.0 and 1.4.0, or are there any other changes that might interfere with Kafka in any way?

We have been running other applications that mostly use short-lived connections (including a Kafka producer) on 1.4.0 for weeks without any issues.
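If it is indeed an idle-connection drop, one thing we could try on our side is tightening the kafka input's timeout/idle settings and seeing whether the consumer recovers. A rough sketch, assuming the plugin version we run exposes these options (option names follow the kafka input documentation; the values are only illustrative):

```conf
input {
  kafka {
    bootstrap_servers       => "b-1.example.kafka.eu-west-1.amazonaws.com:9092"
    topics                  => ["app-logs"]
    group_id                => "logstash-consumer"
    # Fail faster on broken or idle connections (illustrative values)
    session_timeout_ms      => 10000
    request_timeout_ms      => 40000
    connections_max_idle_ms => 60000
  }
}
```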

ghost added the Proposed label Jul 21, 2020
@ddyzhang

Hi @guntergt

The Fargate team is currently tracking a known issue where customer applications can hang or become unresponsive when the application pushes a large volume of logs to the logging driver configured in the ECS task definition. If possible, can you please try disabling logging completely in your ECS task definition and running on PV 1.4.0 again?
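To be explicit, disabling logging here means removing the logConfiguration block from the container definition entirely, along these lines (the family, container name, and image below are placeholders):

```json
{
  "family": "logstash-task",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "logstash",
      "image": "docker.elastic.co/logstash/logstash:7.8.0",
      "essential": true
    }
  ]
}
```

With logConfiguration omitted, the container's stdout/stderr is not sent to any log driver, which takes the logging path out of the picture for this test.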

Thank you.

@ghost (Author) commented Jul 22, 2020

Thanks @ddyzhang, we are aware of that issue; however, the Logstash ECS service itself hardly pushes any logs to CloudWatch Logs. It consumes logs from Kafka and pushes them to Elasticsearch, but otherwise runs very quietly.

@ddyzhang

Thanks for getting back to us. I'm assuming logs were disabled in your task definition to verify that it was indeed not the root cause? If not, I definitely recommend trying it just so we can definitively rule it out.

As for general network instability, we have not had any other reports of networking issues with Fargate PV 1.4.0 yet. Enabling VPC Flow Logs could help identify whether or not the request to Kafka is making it out of your application at all.
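For example, flow logs for the VPC the tasks run in, delivered to CloudWatch Logs, can be enabled with something like the following (the VPC ID, log group name, and IAM role ARN are placeholders):

```sh
# Enable VPC Flow Logs for the task's VPC (placeholder IDs/ARNs)
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name fargate-vpc-flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/vpc-flow-logs-role
```

Filtering the resulting flow log records on the broker port (typically 9092 for plaintext or 9094 for TLS with MSK) should show whether traffic to the Kafka brokers is leaving the task's ENI and whether anything is coming back.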

In the meantime, can you please email me at [email protected] with an example task ID so that I can look into this a bit more? Thanks.

@borfig commented Jul 16, 2021

I believe we are seeing a similar scenario: no logs are written to CloudWatch Logs for some time, and then Logstash crashes.
