Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of fail queues #6781

Open
hannes-ucsc opened this issue Dec 21, 2024 · 0 comments
Open

Improve handling of fail queues #6781

hannes-ucsc opened this issue Dec 21, 2024 · 0 comments
Labels
+ [priority] High - [priority] Medium debt [type] A defect incurring continued engineering cost enh [type] New feature or request infra [subject] Project infrastructure like CI/CD, build and deployment scripts orange [process] Done by the Azul team test [subject] Unit and integration test code

Comments

@hannes-ucsc
Copy link
Member

hannes-ucsc commented Dec 21, 2024

Supersedes #2433 and #3661.

No automated process should empty the fail queue.

All CI/CD procedures that queue notifications (make reindex, make index, make integration_test) should fail early if there are messages in the fail queue before the procedure adds more messages to any queue or makes any irreversible changes like purging work queues, deleting indexes and so on. Having empty fail queues is a precondition to these procedures. The same procedures should fail if there are messages in the fail queues after they finish. Having empty fail queues is a post-condition of a successful procedure.

The remedy for such failure is to dump the contents of the fail queues into a file with --delete, and, in the case of a failed precondition, to retry the operation. It may be necessary to reset the indexer before dumping the fail queues.

Moved here from an older ticket:

Every reindex begins with a reset of the indexer. The lambdas are disabled, the work queues are purged, the ES indices are deleted and the lambdas are re-enabled. Note that the fail queues are not purged. A check should be added to the reindex.py script, such that it refuses to even reset the indexer with messages in the fail queue. The error message should suggest the remedy: running manage_queues.py to dump both fail queues into files and retrying the reindex.

On the other end of the reindex, there should be another check of the fail queue lengths. The script, and therefore the make target, should fail unless the lengths are zero.

Determining the length of an SQS used to be an approximate operation. Assignee to check if that's still the case. Receiving a message from a queue and letting the visibility timeout expire might be a more reliable check for emptiness but it tampers with the queue and might interfere with the suggested remedy of dumping the queues. Assignee to investigate this aspect prior to implementing the prescribed solution. Maybe there is now a "peek" operation or maybe the specific length is approximate but whether it is > 0 is not.

The fact that the same principles apply to make integration_test and make reindex suggests that these checks should be implemented in the Azul client so that they can be shared by both procedures.

@hannes-ucsc hannes-ucsc added the orange [process] Done by the Azul team label Dec 21, 2024
@achave11-ucsc achave11-ucsc added - [priority] Medium enh [type] New feature or request debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts test [subject] Unit and integration test code + [priority] High labels Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
+ [priority] High - [priority] Medium debt [type] A defect incurring continued engineering cost enh [type] New feature or request infra [subject] Project infrastructure like CI/CD, build and deployment scripts orange [process] Done by the Azul team test [subject] Unit and integration test code
Projects
None yet
Development

No branches or pull requests

2 participants