Job hangs in QUEUED when BMP fails #1155

Open
Christian-B opened this issue May 8, 2024 · 1 comment · May be fixed by #1213

Comments

@Christian-B
Member

We had boards/cabinets where BMP commands failed.

Jobs get allocated to these boards, but the BMP request then fails with BMPSendTimedOutException (see log).

The job then hangs in QUEUED.

Found in /home/spalloc/spalloc.log on https://spinnaker.cs.man.ac.uk/
2024-05-05 07:11:33.787 INFO 1176 --- [ThreadPoolTaskScheduler16] u.a.m.s.a.a.AllocatorTask : Job 452535 changes resulted in errors.
2024-05-05 07:11:36.799 ERROR 1176 --- [ThreadPoolTaskScheduler-8] u.a.m.s.a.b.BMPController : Requests failed on BMP 357

uk.ac.manchester.spinnaker.transceiver.ProcessException: when sending to 0:0:13, received exception: uk.ac.manchester.spinnaker.transceiver.BMPSendTimedOutException
with message: Operation CMD_VER (GetBMPVersion(command=CMD_VER, sequence=51774, argument1=0, argument2=0, argument3=0)) timed out after 0.750000 seconds
at uk.ac.manchester.spinnaker.transceiver.ProcessException.makeInstance(ProcessException.java:116) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess$RequestPipeline.finish(BMPCommandProcess.java:464) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess.call(BMPCommandProcess.java:164) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
at uk.ac.manchester.spinnaker.transceiver.Transceiver.get(Transceiver.java:1725) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
at uk.ac.manchester.spinnaker.transceiver.Transceiver.readBMPVersion(Transceiver.java:1839) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
at uk.ac.manchester.spinnaker.transceiver.BMPTransceiverInterface.readBMPVersion(BMPTransceiverInterface.java:859) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
at uk.ac.manchester.spinnaker.alloc.bmp.SpiNNaker1.canBoardManageFPGAs(SpiNNaker1.java:212) ~[classes!/:?]
at uk.ac.manchester.spinnaker.alloc.bmp.SpiNNaker1.setLinkOff(SpiNNaker1.java:228) ~[classes!/:?]
at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$PowerRequest.changeBoardPowerState(BMPController.java:502) ~[classes!/:?]
at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$PowerRequest.lambda$tryProcessRequest$10(BMPController.java:621) ~[classes!/:?]
at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$Request.bmpAction(BMPController.java:279) ~[classes!/:?]
at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$PowerRequest.tryProcessRequest(BMPController.java:620) ~[classes!/:?]
at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$Request.processRequest(BMPController.java:384) ~[classes!/:?]
at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$Worker.run(BMPController.java:1079) ~[classes!/:?]
at uk.ac.manchester.spinnaker.alloc.bmp.BMPController.lambda$triggerSearch$4(BMPController.java:226) ~[classes!/:?]
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.3.30.jar!/:5.3.30]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: uk.ac.manchester.spinnaker.transceiver.BMPSendTimedOutException: Operation CMD_VER (GetBMPVersion(command=CMD_VER, sequence=51774, argument1=0, argument2=0, argument3=0)) timed out after 0.750000 seconds
at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess$RequestPipeline.resend(BMPCommandProcess.java:530) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess$RequestPipeline.handleReceiveTimeout(BMPCommandProcess.java:515) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess$RequestPipeline.finish(BMPCommandProcess.java:456) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]

@rowleya
Member

rowleya commented May 8, 2024

Note that the job appears "stuck in QUEUED" because, after the failure, the same board is tried again, and this repeats. Ideally, a board that is tried and fails would be marked as already allocated so that it is not picked again. Better still, the board would be disabled after a number of failures and an admin emailed to evaluate it.
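
A minimal sketch of the bookkeeping suggested above, assuming boards can be identified by an integer ID. The class `BoardFailureTracker` and all of its methods are hypothetical names for illustration and are not part of the spalloc code base; how spalloc actually stores board state and notifies admins is not shown here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper (not from spalloc): counts consecutive BMP failures
 * per board and disables a board once a threshold is reached.
 */
class BoardFailureTracker {
    private final int maxFailures;
    private final Map<Integer, Integer> failures = new HashMap<>();
    private final Set<Integer> disabled = new HashSet<>();

    BoardFailureTracker(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    /**
     * Record one failed BMP request for a board.
     *
     * @return true if this failure caused the board to be disabled, so the
     *         caller knows to notify an admin exactly once.
     */
    synchronized boolean recordFailure(int boardId) {
        if (disabled.contains(boardId)) {
            return false; // already disabled; nothing new to report
        }
        int count = failures.merge(boardId, 1, Integer::sum);
        if (count >= maxFailures) {
            disabled.add(boardId);
            failures.remove(boardId);
            return true;
        }
        return false;
    }

    /** A successful request resets the consecutive-failure count. */
    synchronized void recordSuccess(int boardId) {
        failures.remove(boardId);
    }

    /** The allocator would skip boards for which this returns false. */
    synchronized boolean isUsable(int boardId) {
        return !disabled.contains(boardId);
    }

    /** Current consecutive-failure count for a board (0 if none). */
    synchronized int failureCount(int boardId) {
        return failures.getOrDefault(boardId, 0);
    }
}
```

In this sketch the allocator would check `isUsable` before selecting a board, and the code that handles the BMP exception would call `recordFailure`, sending the admin email only when it returns `true`. Counting consecutive rather than total failures is a design assumption, so that a board with an occasional timeout is not disabled.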

@rowleya rowleya moved this to Todo in EBRAINS 2.0 Project Jan 10, 2025
@rowleya rowleya linked a pull request Jan 15, 2025 that will close this issue
@rowleya rowleya moved this from Todo to In Review in EBRAINS 2.0 Project Jan 15, 2025