Always retry full-request failures #523

Open · wants to merge 1 commit into base: main from fix-retry-behavior
Conversation

@andrewvc (Contributor) commented Dec 6, 2016

We now always retry full-request failures. It makes no sense to stop retrying a failed bulk request, because the bulk API should always return 200; in any other situation we may lose data. A full request can, however, legitimately fail when a too-large event produces a too-large request body. To handle this we now check the request size before sending, and have added a new max_request_size option to configure the limit.

Fixes #522
Fixes #321
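(A minimal sketch of the pre-flight check described above; `serialize_for_bulk`, `max_request_size`, and the logger calls are illustrative stand-ins, not the plugin's actual internals.)

```ruby
# Sketch only: reject a single event whose serialized bulk payload exceeds
# the configured limit, instead of sending a request guaranteed to fail.
def event_sendable?(event)
  payload = serialize_for_bulk(event) # hypothetical helper: action line + source
  if payload.bytesize > max_request_size
    logger.warn("Event too large to send, dropping",
                :size_bytes => payload.bytesize, :limit => max_request_size)
    return false
  end
  true
end
```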

@andrewvc (Contributor, Author) commented Dec 6, 2016

Alternate approach: maybe we don't retry 400s, but instead detect too-large requests ahead of time (AOT).

@jordansissel (Contributor) commented:

If we assume 400 could mean too-large, we could just retry with a smaller request on 400s?

(Detecting too-large requires we know the maximum size of a request body in Elasticsearch before we send the request, which may not be practical to obtain? I'm open to whatever)

@andrewvc (Contributor, Author) commented Dec 6, 2016

@jordansissel this only happens if there is a single event larger than the limit. We already attempt to split up requests in this case, but we obviously cannot subdivide a single event.
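(For illustration, the splitting behavior mentioned above might look roughly like this sketch; the names are hypothetical. The base case shows why a single oversized event cannot be helped by splitting.)

```ruby
# Sketch only: recursively bisect a batch until each request body fits.
# This works for many small events, but cannot help a single huge event.
def submit(events)
  body = events.map { |e| serialize_for_bulk(e) }.join # hypothetical serializer
  return bulk_send(body) if body.bytesize <= max_request_size

  if events.size == 1
    # A lone event over the limit cannot be subdivided further.
    logger.warn("Single event exceeds max_request_size", :size_bytes => body.bytesize)
    return
  end

  mid = events.size / 2
  submit(events[0...mid])
  submit(events[mid..-1])
end
```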

@andrewvc (Contributor, Author) commented Dec 6, 2016

The more I think about it the more I think we need a new option max_request_size. We just don't send any request larger than that, and that way we don't need a 'special' 400 case.

In the absence of the too-large-request scenario we should retry 400s indefinitely because that would indicate either:

  1. A bug in the output code
  2. A bug in the user's infrastructure (a bad proxy, say)

In either case the user wouldn't want to start dropping events; a stall would be appropriate.
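(Roughly, the retry policy being argued for; the error class and backoff numbers here are assumptions for illustration, not the plugin's actual code.)

```ruby
# Sketch only: retry every full-request failure indefinitely with capped
# exponential backoff, stalling the pipeline rather than dropping events.
def with_indefinite_retry
  attempts = 0
  begin
    yield
  rescue BulkRequestFailed => e # hypothetical error for any non-200 bulk response
    attempts += 1
    delay = [2 ** attempts, 64].min # capped exponential backoff, in seconds
    logger.error("Bulk request failed, retrying",
                 :message => e.message, :attempts => attempts)
    sleep(delay)
    retry
  end
end
```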

@jordansissel (Contributor) commented:

> In the absence of the too-large-request scenario we should retry 400s indefinitely

++

@andrewvc changed the title from "Always retry full-request failures except 400." to "Always retry full-request failures" on Dec 6, 2016
@andrewvc (Contributor, Author) commented Dec 6, 2016

@jordansissel OK, so I've changed the logic and updated the description per our discussion. Can I get a review from you?

@andrewvc (Contributor, Author) commented Dec 6, 2016

@jordansissel is there a plugin-api change I need to make now that we're using the bytes validation?
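(If the `:bytes` validator from the newer plugin API is what's being referred to, the declaration would look something like this; the default shown is illustrative.)

```ruby
# Hypothetical declaration: :validate => :bytes accepts strings such as "10mb"
# and coerces them to an integer byte count.
config :max_request_size, :validate => :bytes, :default => "20mb"
```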

@andrewvc force-pushed the fix-retry-behavior branch 2 times, most recently from 82d814c to b4341de on December 9, 2016 19:41
@@ -88,7 +93,7 @@ class LogStash::Outputs::ElasticSearch < LogStash::Outputs::Base
 # - A sprintf style string to change the action based on the content of the event. The value `%{[foo]}`
 # would use the foo field for the action
 #
-# For more details on actions, check out the http://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html[Elasticsearch bulk API documentation]
+# For more details on actions, check out the http://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html[Elasticsearch bthey are:ulk API documentation]
Review comment (Contributor):

typo inserting 'they are:' in the middle of the word 'bulk' ?

Reply (Contributor, Author):

Ah, clearly a complete ViM failure :)

#
# This plugin will try to send appropriately sized requests, using multiple
# requests per batch if the events are very large, but if a single event
# is larger than this value sending even that single request will break.
Review comment (Contributor):

"will break" --- it will break what?

# or other things that could account for a slight difference in the HTTTP request, as our size
# check only happens on the request body.
#
# This is specified as a number now, but we should move it to the new 'bytes' type
Review comment (Contributor):

"This is specified ..." will show up in the docs. I recommend not putting this comment here.

Reply (Contributor, Author):

I'll put in a note saying that we will change it in the future.

@andrewvc (Contributor, Author) commented:

@jordansissel can we move this forward to LGTM?

# or other things that could account for a slight difference in the HTTTP request, as our size
# check only happens on the request body.
#
# This is specified as a number now, but we will be moved to the new 'bytes' type
Review comment (Contributor):

I think this is a TODO item and we don't want it showing up in the docs (which it would, in its current location in a comment above this).

@jordansissel (Contributor) commented:

Once the comment-that-looks-like-a-todo-item thing is removed, LGTM :)

@andrewvc dismissed jordansissel's stale review December 15, 2016 18:50

Addressed your final point Jordan! Thanks for the LGTM!

@andrewvc force-pushed the fix-retry-behavior branch 2 times, most recently from fdf3cd7 to 95330c1 on December 16, 2016 16:58
We now always retry full-request failures. It makes no sense to stop
retrying a failed bulk request, because the bulk API should always
return 200; in any other situation we may lose data. A full request
can, however, legitimately fail when a too-large event produces a
too-large request body. To handle this we now check the request size
before sending, and have added a new max_request_size option to
configure the limit.

Fixes logstash-plugins#522
@andrewvc (Contributor, Author) commented:

For the life of me I can't repro these Travis failures. They ONLY happen on 5.1.1, and they definitely don't happen locally for me :(. The lack of SSH access into Travis makes this even more perplexing.

I tried some debugging last week, and it appeared that ES wasn't accepting TCP connections anymore on travis!

Bad travis, bad!

@jsvd @jordansissel any ideas here?

Successfully merging this pull request may close these issues.

- Non-retryable response codes from ES will be retried indefinitely
- Retry logic does the wrong thing