
[WIP] Add grow and shrink functions to REAPI #1316

Open · wants to merge 7 commits into master from grow-api

Conversation

@milroy (Member) commented Dec 4, 2024

This PR exposes grow functions in the C and C++ REAPI bindings.

The grow functionality passes a JGF subgraph including the path from the cluster vertex to the subgraph root. For example, a JGF subgraph that adds a new node (newnode) and its subnode resources to cluster0 at rack0 includes the cluster and rack vertices as well as the induced edges. The disadvantage of this approach is that the vertex metadata in the JGF subgraph needs to be specified in enough detail to identify the vertices that already exist in the graph (e.g., cluster0).
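For concreteness, here is a minimal sketch of what that payload could look like, wrapped in a tiny Go program so it is runnable on its own (the vertex ids, metadata fields, and path format are illustrative assumptions, not taken from this PR): cluster0 and rack0 are repeated so the existing vertices can be matched, and newnode plus the induced edges are the only genuinely new elements.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative JGF subgraph for growing cluster0/rack0 with one new node.
// The exact metadata Fluxion requires is an assumption here; the point is
// that cluster0 and rack0 already exist in the resource graph and must be
// specified well enough to be identified, while newnode is genuinely new.
const growSubgraph = `{
  "graph": {
    "nodes": [
      {"id": "0", "metadata": {"type": "cluster", "name": "cluster0",
        "paths": {"containment": "/cluster0"}}},
      {"id": "1", "metadata": {"type": "rack", "name": "rack0",
        "paths": {"containment": "/cluster0/rack0"}}},
      {"id": "2", "metadata": {"type": "node", "name": "newnode",
        "paths": {"containment": "/cluster0/rack0/newnode"}}}
    ],
    "edges": [
      {"source": "0", "target": "1"},
      {"source": "1", "target": "2"}
    ]
  }
}`

func main() {
	// Sanity-check that the payload is well-formed JSON before it would be
	// handed to a grow call.
	var v map[string]interface{}
	if err := json.Unmarshal([]byte(growSubgraph), &v); err != nil {
		panic(err)
	}
	fmt.Println("grow subgraph parses cleanly")
}
```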

The PR is WIP, because we'll need to sort out whether to implement the REAPI module functions and determine if it's preferable to pass a path to the attachment point rather than include the path in the JGF subgraph.

@milroy milroy force-pushed the grow-api branch 2 times, most recently from df1e228 to 10f0f74 Compare January 15, 2025 02:27
@milroy milroy changed the title [WIP] Add grow functions to REAPI [WIP] Add grow and shrink functions to REAPI Jan 15, 2025

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 2.98507% with 65 lines in your changes missing coverage. Please review.

Project coverage is 75.0%. Comparing base (5ae7459) to head (10f0f74).

Files with missing lines                          Patch %   Lines
resource/reapi/bindings/c++/reapi_cli_impl.hpp    0.0%      65 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1316     +/-   ##
========================================
- Coverage    75.3%   75.0%   -0.4%     
========================================
  Files         111     111             
  Lines       16042   16109     +67     
========================================
+ Hits        12081   12083      +2     
- Misses       3961    4026     +65     
Files with missing lines                          Coverage Δ
resource/readers/resource_reader_jgf.cpp          71.0% <ø> (ø)
resource/utilities/command.cpp                    77.8% <100.0%> (+<0.1%) ⬆️
resource/reapi/bindings/c++/reapi_cli_impl.hpp    34.1% <0.0%> (-5.5%) ⬇️

@vsoch (Member) commented Jan 15, 2025

Flux in Kubernetes has told me "Yeaaaah man, I'm gonna get YUUUUGE!"

@vsoch (Member) commented Jan 15, 2025

This is tested (and the basics are working in fluxion-go)! 🥳

We start with cluster tiny0: one rack and two nodes (node0 and node1). We demonstrate that if we ask for 4 nodes, the request cannot be satisfied:

Asking to MatchSatisfy 4 nodes (not possible)

        ----Match Satisfy output---
satisfied: false
error: <nil>

We then ask fluxion to grow from 2 to 4 nodes. We do that with a grow request that includes an existing path in the graph, tiny0->rack0, and then defines two new nodes, node2 and node3.

🍔 Asking to Grow from 2 to 4 Nodes
Grow request return value: <nil>

We can verify that the graph now has node0 through node3 (4 nodes) by asking for 4 nodes with satisfy again. We are satisfied!

Asking to MatchSatisfy 4 nodes (now IS possible)

        ----Match Satisfy output---
satisfied: true
error: <nil>

We now want to test shrink. Shrink takes the path of the node at which to prune. For this case, we just prune off one node.

🥕 Asking to Shrink from 4 to 3 Nodes
Shrink request return value: <nil>

When we have 3 nodes we ask for 4 again, and we are no longer satisfied.

Asking to MatchSatisfy 4 nodes (again, not possible)

        ----Match Satisfy output---
satisfied: false
error: <nil>

The testing shown above is here in GitHub CI (run across OSes), and that full PR is here; it will just need to be updated when the branch here is merged (or we can work from this branch if that isn't going to happen soon - I can build and deploy a custom container that has it).

How will shrink work in Kubernetes?

High level - I'm thinking through the shrink design for a cluster in Kubernetes, and I think we have two use cases (that warrant different design strategies):

1. A need to shrink down in increments of 1

This will work fine to prune single nodes.

2. A need to shrink down in unknown increments (>1)

This could be a lot of requests to fluxion, for example, if we want to shrink down by 10, 20, or more nodes at once. We have a few options, I think. I'll think through each one.

  • We can expose a function in fluxion-go that takes a list of nodes, and then (within the same call to fluxion-go) makes multiple calls to fluxion. That mostly replaces many gRPC requests (one per node) with a single request that handles the batch internally (see the sketch after this list).

  • If we have an understanding of the increments, we can design a cluster graph with abstract levels of racks, where each rack is a group of nodes that are intended to be brought down together. Maybe we would design that graph based on topology. That way, we can do one request to trim the rack, and it will cut off all the nodes.
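A minimal sketch of the first option, assuming a hypothetical single-path Shrink method on the client (the interface name, method signature, and containment-path format are assumptions for illustration, not the actual fluxion-go API):

```go
package main

import "fmt"

// shrinker stands in for whatever fluxion-go client method ends up doing a
// single-path shrink; the name and signature are assumptions for this sketch.
type shrinker interface {
	Shrink(path string) error
}

// shrinkAll is the batching helper described above: one call from the
// caller's point of view, many single-node shrink requests internally.
func shrinkAll(cli shrinker, paths []string) error {
	for _, p := range paths {
		if err := cli.Shrink(p); err != nil {
			return fmt.Errorf("shrink %s: %w", p, err)
		}
	}
	return nil
}

// fakeClient records the pruned paths so the sketch runs on its own.
type fakeClient struct{ pruned []string }

func (f *fakeClient) Shrink(path string) error {
	f.pruned = append(f.pruned, path)
	return nil
}

func main() {
	cli := &fakeClient{}
	// Hypothetical containment paths for the nodes to prune.
	_ = shrinkAll(cli, []string{"/tiny0/rack0/node2", "/tiny0/rack0/node3"})
	fmt.Println("pruned:", cli.pruned)
}
```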

I haven't looked at what fluxion is doing in terms of the actual shrink (happening during a traversal?) but if there is an operation that can handle multiple cuts at once (e.g., done during one traversal) that could be an idea. But based on my impression of infrequent scaling, I don't really think optimizing this up the wazoo right now is that necessary.

How will grow design work in Kubernetes?

For the first case (a flat topology that has nodes added to it), we likely just need to get the highest node identifier in the graph and then generate the JSON request starting at that identifier plus 1 (for however many nodes we need). That can be stored as a variable somewhere, and on restart it can be recalculated from the live cluster.
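As a rough sketch of that bookkeeping (the node<N> naming convention and the helper are assumptions for illustration, not part of fluxion or fluxion-go):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// nextNodeNames scans the existing node names (e.g. recovered from the live
// cluster on restart), finds the highest numeric suffix, and returns the
// next n names to use in the grow request.
func nextNodeNames(existing []string, n int) []string {
	re := regexp.MustCompile(`^node(\d+)$`)
	highest := -1
	for _, name := range existing {
		if m := re.FindStringSubmatch(name); m != nil {
			if i, err := strconv.Atoi(m[1]); err == nil && i > highest {
				highest = i
			}
		}
	}
	names := make([]string, 0, n)
	for i := 1; i <= n; i++ {
		names = append(names, fmt.Sprintf("node%d", highest+i))
	}
	return names
}

func main() {
	// With node0 and node1 present, growing by 2 yields node2 and node3,
	// matching the walkthrough above.
	fmt.Println(nextNodeNames([]string{"node0", "node1"}, 2))
}
```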

For the second case, with multiple racks each holding children, we would likely apply the same strategy as above but at the level of the rack, where the number of children under a rack is constant. We would calculate node indices from the number of racks and the expected nodes per rack. If racks hold different numbers of nodes (e.g., different applications requiring different-sized increments), we could label each rack with metadata that records exactly how many children it has.
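And a similarly hedged sketch for the rack-based case, assuming each rack owns a contiguous, fixed-size block of node indices:

```go
package main

import "fmt"

// rackNodeNames returns the node names that belong to a given rack when
// every rack holds a fixed number of nodes; the naming scheme and the
// fixed-size assumption are illustrative, as discussed above.
func rackNodeNames(rack, nodesPerRack int) []string {
	names := make([]string, 0, nodesPerRack)
	for k := 0; k < nodesPerRack; k++ {
		names = append(names, fmt.Sprintf("node%d", rack*nodesPerRack+k))
	}
	return names
}

func main() {
	// rack1 with 4 nodes per rack owns node4 through node7.
	fmt.Println(rackNodeNames(1, 4))
}
```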

Anyway - there are many ways to skin an avocado! Thanks for finishing this up @milroy I'm super pumped to see it working, and (TBA) to get it merged and deployed and into some of our projects!
