You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
hannes-ucsc opened this issue
Dec 21, 2024
· 0 comments
Labels
+[priority] High-[priority] Mediumdebt[type] A defect incurring continued engineering costenh[type] New feature or requestinfra[subject] Project infrastructure like CI/CD, build and deployment scriptsorange[process] Done by the Azul teamtest[subject] Unit and integration test code
All CI/CD procedures that queue notifications (make reindex, make index, make integration_test) should fail early if there are messages in the fail queue before the procedure adds more messages to any queue or makes any irreversible changes like purging work queues, deleting indexes and so on. Having empty fail queues is a precondition to these procedures. The same procedures should fail if there are messages in the fail queues after they finish. Having empty fail queues is a post-condition of a successful procedure.
The remedy for such failure is to dump the contents of the fail queues into a file with --delete, and, in the case of a failed precondition, to retry the operation. It may be necessary to reset the indexer before dumping the fail queues.
Every reindex begins with a reset of the indexer. The lambdas are disabled, the work queues are purged, the ES indices are deleted and the lambdas are re-enabled. Note that the fail queues are not purged. A check should be added to the reindex.py script, such that it refuses to even reset the indexer with messages in the fail queue. The error message should suggest the remedy: running manage_queues.py to dump both fail queues into files and retrying the reindex.
On the other end of the reindex, there should be another check of the fail queue lengths. The script, and therefore the make target, should fail unless the lengths are zero.
Determining the length of an SQS used to be an approximate operation. Assignee to check if that's still the case. Receiving a message from a queue and letting the visibility timeout expire might be a more reliable check for emptiness but it tampers with the queue and might interfere with the suggested remedy of dumping the queues. Assignee to investigate this aspect prior to implementing the prescribed solution. Maybe there is now a "peek" operation or maybe the specific length is approximate but whether it is > 0 is not.
The fact that the same principles apply to make integration_test and make reindex suggests that these checks should be implemented in the Azul client so that they can be shared by both procedures.
The text was updated successfully, but these errors were encountered:
achave11-ucsc
added
-
[priority] Medium
enh
[type] New feature or request
debt
[type] A defect incurring continued engineering cost
infra
[subject] Project infrastructure like CI/CD, build and deployment scripts
test
[subject] Unit and integration test code
+
[priority] High
labels
Jan 7, 2025
+[priority] High-[priority] Mediumdebt[type] A defect incurring continued engineering costenh[type] New feature or requestinfra[subject] Project infrastructure like CI/CD, build and deployment scriptsorange[process] Done by the Azul teamtest[subject] Unit and integration test code
Supersedes #2433 and #3661.
No automated process should empty the fail queue.
All CI/CD procedures that queue notifications (
make reindex
,make index
,make integration_test
) should fail early if there are messages in the fail queue before the procedure adds more messages to any queue or makes any irreversible changes like purging work queues, deleting indexes and so on. Having empty fail queues is a precondition to these procedures. The same procedures should fail if there are messages in the fail queues after they finish. Having empty fail queues is a post-condition of a successful procedure.The remedy for such failure is to dump the contents of the fail queues into a file with
--delete
, and, in the case of a failed precondition, to retry the operation. It may be necessary to reset the indexer before dumping the fail queues.Moved here from an older ticket:
The fact that the same principles apply to
make integration_test
andmake reindex
suggests that these checks should be implemented in the Azul client so that they can be shared by both procedures.The text was updated successfully, but these errors were encountered: