Improve handling of fail queues #6781

hannes-ucsc · 2024-12-21T04:27:31Z

Supersedes #2433 and #3661.

No automated process should empty the fail queue.

All CI/CD procedures that queue notifications (make reindex, make index, make integration_test) should fail early if there are messages in the fail queue before the procedure adds more messages to any queue or makes any irreversible changes like purging work queues, deleting indexes and so on. Having empty fail queues is a precondition to these procedures. The same procedures should fail if there are messages in the fail queues after they finish. Having empty fail queues is a post-condition of a successful procedure.

The remedy for such failure is to dump the contents of the fail queues into a file with --delete, and, in the case of a failed precondition, to retry the operation. It may be necessary to reset the indexer before dumping the fail queues.

Moved here from an older ticket:

Every reindex begins with a reset of the indexer. The lambdas are disabled, the work queues are purged, the ES indices are deleted and the lambdas are re-enabled. Note that the fail queues are not purged. A check should be added to the reindex.py script, such that it refuses to even reset the indexer with messages in the fail queue. The error message should suggest the remedy: running manage_queues.py to dump both fail queues into files and retrying the reindex.

On the other end of the reindex, there should be another check of the fail queue lengths. The script, and therefore the make target, should fail unless the lengths are zero.

Determining the length of an SQS used to be an approximate operation. Assignee to check if that's still the case. Receiving a message from a queue and letting the visibility timeout expire might be a more reliable check for emptiness but it tampers with the queue and might interfere with the suggested remedy of dumping the queues. Assignee to investigate this aspect prior to implementing the prescribed solution. Maybe there is now a "peek" operation or maybe the specific length is approximate but whether it is > 0 is not.

The fact that the same principles apply to make integration_test and make reindex suggests that these checks should be implemented in the Azul client so that they can be shared by both procedures.

The text was updated successfully, but these errors were encountered:

hannes-ucsc added the orange [process] Done by the Azul team label Dec 21, 2024

This was referenced Dec 21, 2024

make reindex should fail if it leaves message in fail queues #2433

Closed

IT does not assert that replica bundles are indexed #6647

Open

achave11-ucsc added - [priority] Medium enh [type] New feature or request debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts test [subject] Unit and integration test code + [priority] High labels Jan 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of fail queues #6781

Improve handling of fail queues #6781

hannes-ucsc commented Dec 21, 2024 •

edited

Loading

Improve handling of fail queues #6781

Improve handling of fail queues #6781

Comments

hannes-ucsc commented Dec 21, 2024 • edited Loading

hannes-ucsc commented Dec 21, 2024 •

edited

Loading