Apologies in advance for a relatively vague issue description, but unfortunately even with debug logging it is difficult to shed light on this.
We have been running Logstash (latest stable) on Fargate 1.3.0 with AWS MSK (Kafka) as the input (i.e. Logstash acting as a consumer) for a couple of months without a single issue: no crashes, even with autoscaling triggering based on CPU usage.
As soon as we switched to 1.4.0, however, Logstash started failing suddenly and, unfortunately, silently: the container/process keeps running and health checks pass, but it no longer consumes any logs.
Upon further inspection we unfortunately see nothing in the logs, but we can tell that the connection to Kafka is no longer active.
Switching back to 1.3.0 without any other changes restored stability (it has now been running for over a week without a single "crash").
With my limited visibility, and in conjunction with logstash-plugins/logstash-integration-kafka#15, my conclusion is that the connection to Kafka may be dropping unexpectedly in a way that Logstash is unable to catch at that point.
Is it possible that a connection timeout was introduced between 1.3.0 and 1.4.0, or are there any other changes that might interfere with Kafka in some way?
We have been running other applications with mostly short-lived connections (including a Kafka producer) on 1.4.0 for weeks without any issues.
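For context, the pipeline in question is roughly of the following shape. This is a minimal sketch only: the broker endpoints, topic, group id, Elasticsearch host, and timeout values below are placeholders rather than our actual configuration.

```
input {
  kafka {
    bootstrap_servers => "b-1.example-msk.amazonaws.com:9094,b-2.example-msk.amazonaws.com:9094"
    topics            => ["application-logs"]
    group_id          => "logstash-consumer"
    security_protocol => "SSL"
    # These settings govern how quickly a silently dropped broker
    # connection should be noticed; the values here are illustrative.
    session_timeout_ms    => 10000
    heartbeat_interval_ms => 3000
    request_timeout_ms    => 40000
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch.example.com:9200"]
    index => "kafka-logs-%{+YYYY.MM.dd}"
  }
}
```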
The Fargate team is currently tracking a known issue where customer applications can hang or become unresponsive when the application pushes a large volume of logs to the logging driver configured in the ECS task definition. If possible, could you please disable logging completely in your ECS task definition and try running on PV 1.4.0 again?
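To be concrete, "disabling logging completely" means removing the `logConfiguration` block from the Logstash container definition, i.e. something along the lines of the following if you are using the awslogs driver (example values only):

```json
{
  "name": "logstash",
  "image": "docker.elastic.co/logstash/logstash:7.x",
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/logstash",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "logstash"
    }
  }
}
```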
Thanks @ddyzhang, we are aware of that issue; however, the Logstash ECS service itself hardly pushes any logs to CloudWatch Logs. It consumes logs from Kafka and pushes them to Elasticsearch, but otherwise runs very quietly.
Thanks for getting back to us. I'm assuming logs were disabled in your task definition to verify that it was indeed not the root cause? If not, I definitely recommend trying it just so we can definitively rule it out.
As for general network instability, we have not had any other reports of networking issues with Fargate PV 1.4.0 yet. Enabling VPC Flow Logs could help identify whether or not the request to Kafka is making it out of your application at all.
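For reference, flow logs can be enabled on the task's VPC with something along these lines (the VPC ID, log group name, and IAM role ARN below are placeholders):

```sh
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name fargate-kafka-debug \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role
```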
In the meantime, can you please email me at [email protected] with an example task ID so that I can look into this a bit more? Thanks.