Let certain users allocate "broken" boards as part of a larger job #367

Open
rowleya opened this issue Sep 22, 2021 · 9 comments
Labels: enhancement (New feature or request), spalloc server (Relating to the new spalloc server)

@rowleya
Member

rowleya commented Sep 22, 2021

For debugging, it could be useful to allow certain users (admins?) to allocate boards marked as "broken" as part of a bigger job. They would have to request this specifically of course.

@rowleya added the enhancement and spalloc server labels Sep 22, 2021
@dkfellows
Member

I'd give everyone the capability (I don't believe it's actively harmful), but it does require adding more queries. That's because the filtering out of “broken” boards is currently baked into the computation of a virtual column that characterises whether the board is available for allocation.

Also, broken boards might not respond to BMP requests correctly so there's that to worry about too!
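
To illustrate what making that filter optional could look like, here is a minimal sketch. The table and column names (`boards`, `allocated`, `functioning`) and the `allow_broken` flag are assumptions made up for this example, not the real spalloc schema, and the real check lives in a virtual column rather than a `WHERE` clause.

```python
import sqlite3

# Hypothetical, simplified schema: the real spalloc tables are not shown in this thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE boards ("
    "board_id INTEGER PRIMARY KEY, allocated INTEGER, functioning INTEGER)"
)
conn.executemany(
    "INSERT INTO boards VALUES (?, ?, ?)",
    [(1, 0, 1), (2, 1, 1), (3, 0, 0)],  # free and working, in use, free but broken
)


def allocatable_boards(conn, allow_broken=False):
    """Return the ids of boards that may be offered to the allocator."""
    rows = conn.execute(
        "SELECT board_id FROM boards "
        "WHERE allocated = 0 AND (functioning = 1 OR :allow_broken) "
        "ORDER BY board_id",
        {"allow_broken": int(allow_broken)},
    )
    return [board_id for (board_id,) in rows]
```

With the flag off, only board 1 qualifies; with it on, the broken but unallocated board 3 is offered as well.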

@rowleya
Member Author

rowleya commented Nov 19, 2021

It isn't necessarily harmful to get a machine with a board marked as broken, but by default it should only happen if the allocation would still meet the request. For example, a request for 12 boards probably shouldn't return 12 boards with one broken, but a request for 11 could do that. An admin might then want to allow the return of the 12 boards even with the broken one if that is explicitly asked for, though any user could also ask for that explicitly.

If the request includes a board that doesn't do the BMP bit correctly, that is OK too; if the user asked for it, they get what they asked for!
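
The rule above can be written down as a small predicate. This is only a sketch of the policy as described in this comment; the function and parameter names are invented for illustration.

```python
def acceptable(block_size, broken, requested, allow_broken=False):
    """Decide whether a candidate block of boards satisfies a request.

    block_size:   total number of boards in the candidate block
    broken:       how many of those boards are marked as broken
    requested:    number of boards the user asked for
    allow_broken: explicit opt-in to count broken boards towards the request
    """
    if allow_broken:
        # The user asked for it, so broken boards count like any other.
        return block_size >= requested
    # Default: only working boards count towards the request.
    return block_size - broken >= requested
```

So `acceptable(12, 1, 12)` is false, `acceptable(12, 1, 11)` is true, and `acceptable(12, 1, 12, allow_broken=True)` is true.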

@Christian-B
Member

I know: easy to ask for, harder to do.

If it's not too hard, it would be nice to avoid "broken" boards in the middle of the machine, as those are more likely to create routing hotspots around them.

Anyone who specifically asks for a machine with broken boards and then moans something did not work will be laughed at!
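
One possible way to express the "not in the middle" preference is as a score on each candidate rectangle, preferring candidates with a lower score. This is purely a sketch: the function name and the idea of scoring candidates are assumptions for illustration, not something the allocator does.

```python
def interior_dead(dead, x0, y0, width, height):
    """Count dead boards strictly inside the rectangle, not on its border.

    dead is a set of (x, y) board coordinates; the rectangle spans
    x0 .. x0 + width - 1 and y0 .. y0 + height - 1.
    """
    return sum(
        1
        for (x, y) in dead
        if x0 < x < x0 + width - 1 and y0 < y < y0 + height - 1
    )
```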

@rowleya
Member Author

rowleya commented Nov 19, 2021

And of course 0,0 shouldn't be broken!

@dkfellows
Member

I'll always require 0,0 to be alive, and a minimum number of connected boards (determined by the request) to be alive too; the algorithm won't give you two disconnected chunks in an allocation (unless one of them is at least enough to satisfy your request, I suppose). Both allocating dead boards and avoiding broken boards in the middle of the machine will require different allocation algorithms from the one we currently use.

Right now, we compute whether boards are available for allocation (= not allocated and not broken), and then run a pseudorectangle (because of triads) over the machine to pick candidates for allocation. Then we ensure we have a connected subgraph of boards (rooted at 0,0 in local coordinates) within that rectangle of sufficient actual size. If we do, that's the allocation. The advantage of this approach is that it allows us to allocate a large block of boards even if there's a small existing allocation within that large block; the already-allocated boards are treated as if they were dead from the perspective of the larger allocation. It's this sort of trick that was why I very much wanted to stop using the algorithms that the old spalloc used, and instead move to the relational-set-based ones that SQL makes practical.

These are also among the most complex SQL queries I've ever written. (I've given up on trying to combine them into one; it might be possible, but the idea scares me due to the way I'd need nested aggregations. With the big machine being mostly operational, the connectivity check will pass most of the time anyway.)
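
For readers who find procedural code easier to follow than the SQL, here is a rough sketch of the two steps described above: slide a rectangle over the machine, then check how many available boards are connected to its local 0,0. It makes loud simplifying assumptions: a plain 2D grid with 4-neighbour adjacency (triads and the real board topology are ignored), and invented function names. It is not a translation of the actual queries.

```python
from collections import deque


def connected_size(available, x0, y0, width, height):
    """Count available boards reachable from the rectangle's local (0, 0)."""
    if (x0, y0) not in available:
        return 0  # the root board must itself be usable
    seen = {(x0, y0)}
    queue = deque(seen)
    while queue:
        x, y = queue.popleft()
        # Assumption: plain 4-neighbour adjacency; the real board topology differs.
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = x0 <= nx < x0 + width and y0 <= ny < y0 + height
            if inside and (nx, ny) in available and (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return len(seen)


def find_allocation(available, machine_w, machine_h, width, height, max_dead):
    """Return the root (x, y) of the first rectangle that fits, or None."""
    needed = width * height - max_dead
    for y0 in range(machine_h - height + 1):
        for x0 in range(machine_w - width + 1):
            if connected_size(available, x0, y0, width, height) >= needed:
                return (x0, y0)
    return None
```

Here `available` is the set of board coordinates that are neither allocated nor broken. A board held by a small existing job is simply absent from that set, which is why a larger allocation can be placed around it as long as `max_dead` tolerates the gap.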

@rowleya
Member Author

rowleya commented Nov 25, 2021

Wow, this sounds complex, but I really like the idea that you can nest one allocation inside another! That should allow some quite complex combinations of allocations.

@dkfellows
Member

dkfellows commented Nov 25, 2021

Also, if you want to look for yourself, the queries are in their own files (the Java code to connect them together is mostly straightforward):

  1. SpiNNaker-allocserv/src/main/resources/queries/find_rectangle.sql — does the initial search
  2. SpiNNaker-allocserv/src/main/resources/queries/allocation_connected.sql — checks the connected size
  3. SpiNNaker-allocserv/src/main/resources/queries/connected_boards_at_coords.sql — produces the list of boards that the client sees

The sequence:

WITH
	-- Name the arguments for sanity
	args(width, height, machine_id, max_dead_boards) AS (
		VALUES (:width, :height, :machine_id, :max_dead_boards)),

is just so that I can work with proper named arguments even though JDBC is strictly positional for parameters (unlike with result sets); the :name stuff is really a lie, but it's easier to read than a bare ?. It reduces the insanity.
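
As an illustration of the idea (not the server's Java code), here is a small sketch of a layer that rewrites `:name` placeholders into positional `?` ones and then runs a cut-down version of the `args` CTE. The `to_positional` helper is invented for this example, and its regular expression is deliberately naive: it would also match a colon inside a string literal.

```python
import re
import sqlite3


def to_positional(sql, params):
    """Rewrite :name placeholders as ? and order the values to match."""
    names = []

    def replace(match):
        names.append(match.group(1))
        return "?"

    positional_sql = re.sub(r":(\w+)", replace, sql)
    return positional_sql, [params[name] for name in names]


# Cut-down version of the args CTE: each parameter is bound once,
# then referred to by name in the rest of the query.
QUERY = """
WITH args(width, height) AS (VALUES (:width, :height))
SELECT args.width * args.height FROM args
"""

sql, values = to_positional(QUERY, {"height": 4, "width": 3})
area = sqlite3.connect(":memory:").execute(sql, values).fetchone()[0]
```

Because every parameter appears exactly once in the `VALUES` row, the positional order only has to be right in one place, however many times the rest of the query uses `args.width` or `args.height`.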

@rowleya
Member Author

rowleya commented Nov 25, 2021

I should add that trying to put broken boards on the edge of the machine is non-critical (and clearly doesn't work as well with the idea of nesting allocations).

@dkfellows
Member

I'm not sure how to even express that in relational algebra. 😄

@dkfellows added this to the Bluesky milestone Mar 2, 2023