Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

policy: expose internals of resource module policy factory to allow custom policies #1268

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

wihobbs
Copy link
Member

@wihobbs wihobbs commented Aug 13, 2024

Problem: the current fluxion policy matching factory doesn't allow for customizable policies.

Add a custom option and refactor existing policies to behave similarly. Add unit testing for the string parsing convenience functions.

Outstanding to-do:

  • add an option in the scheduler config to actually pass a custom string in, and probably some validation too

@wihobbs
Copy link
Member Author

wihobbs commented Aug 13, 2024

And I know the testing and the code should be in separate commits... sigh (will fix)

@wihobbs wihobbs force-pushed the new-policy-refactor branch from f7f1df3 to cd87d59 Compare August 13, 2024 01:09
Copy link
Member

@cmoussa1 cmoussa1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this @wihobbs! I have just some preliminary questions/comments on a first pass. Apologies in advance that I'm not very familiar with the sched code at all so I could definitely be misunderstanding how this works. :-)

resource/policies/dfu_match_policy_factory.cpp Outdated Show resolved Hide resolved
resource_type_t node_rt ("node");
std::shared_ptr<dfu_match_cb_t> create_match_cb (const std::string &policy_requested)
{
std::string policy = policies.find (policy_requested)->second;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this .find () call have any sort of error checking after the lookup of policy_requested? Maybe this is always guaranteed to find something successfully. If not, maybe you can check the contents of policy before moving on.

if (policy == "")
    // return an error?
    // set matcher to something default?
    // sorry that idk what the right behavior should be here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm good catch. It should call known_match_policy first to see if the policy exists.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, actually, known_match_policy is called when the config is validated in resource/modules/resource_match_opts.cpp so it's already been validated before it gets here.

This will need additional work to support custom.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, a comment making note that it is already validated by the time it gets to that lookup might be helpful here!


if (option_exists ("stop_on_k_matches", policy)) {
ptr->set_stop_on_k_matches (parse_int_match_options ("stop_on_k_matches", policy));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this if-else branch need an else statement? Excuse my ignorance on this - I'm not sure if the addition of custom policies allows for something other than low or high

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK low and high are the only options, unless locality or variation are chosen.

@wihobbs wihobbs force-pushed the new-policy-refactor branch 2 times, most recently from 5f03ae3 to ba80f7b Compare August 13, 2024 16:31
Copy link
Member

@jameshcorbett jameshcorbett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a quick look-through, it looks reasonable to me. I think the commits could be broken up better though (as you noted already)

bool known_match_policy (const std::string &policy);

const std::map<std::string, std::string> policies =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do a deeper look over this, but first thing this should be split. Put the declaration here with extern and the definition in the cpp file so this variable doesn't end up getting defined in all includers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fair, that's what the old code did, but it shouldn't 😬

@garlick
Copy link
Member

garlick commented Aug 20, 2024

PR title seems truncated?

@wihobbs
Copy link
Member Author

wihobbs commented Aug 20, 2024

Well, custom is a new type of policy that this PR is enabling. Or will enable eventually. Maybe "resource: refactor policy options to allow a user-defined policy" is more descriptive?

Note that this is still WIP and needs some config work and testing for the custom options. Happy to incorporate any feedback on the initial draft (and thanks to those who have already reviewed!), but it's going to stay in draft state until the code can:

  • accept a custom string with policy options
  • reject certain strings that are requesting invalid custom options, such as set_stop_on_k_matches=-100 and such
  • test some custom options and make sure they behave as expected
  • break out the commit history in a better way

@wihobbs
Copy link
Member Author

wihobbs commented Sep 5, 2024

Per discussion with Tom:

  • To handle custom policies, "don't validate, parse" -- parse the input on a delimiter first and then verify that each key is valid, give an error message if it isn't, possibly demote this to a warning based on a config option
  • Fluxion should issue a fatal warning if a policy named is not valid -- don't fall back on first (might already be done?)
  • set_stop_on_k_matches probably only takes a value of 1, however, we should test this
  • we should also think about having "node_centric" have a score factor of 10000, maybe we don't want to allow users to configure this, but maybe that number doesn't cover all our bases

@wihobbs wihobbs force-pushed the new-policy-refactor branch from ba80f7b to 2d176c4 Compare November 15, 2024 17:25
@wihobbs wihobbs force-pushed the new-policy-refactor branch 3 times, most recently from 147eebf to 3484d1c Compare December 23, 2024 18:07
@wihobbs wihobbs requested a review from jameshcorbett January 16, 2025 18:19
Copy link
Member

@jameshcorbett jameshcorbett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Just noticed one thing.

resource/policies/dfu_match_policy_factory.hpp Outdated Show resolved Hide resolved
@grondo
Copy link
Contributor

grondo commented Jan 16, 2025

Don't forget to address @garlick's comment about the PR title, which will end up as a line in the release notes.

@wihobbs
Copy link
Member Author

wihobbs commented Jan 16, 2025

I just asked @jameshcorbett to do a first pass to get me started. Note this still is in draft state until it has working sharness tests and additional reviews.

@wihobbs wihobbs force-pushed the new-policy-refactor branch from 3484d1c to 2efc97e Compare January 22, 2025 21:54
@wihobbs wihobbs changed the title policy: refactor options to allow custom policy: expose internals of resource module policy factory to allow custom policies Jan 22, 2025
@wihobbs wihobbs force-pushed the new-policy-refactor branch from 2efc97e to bdff4d5 Compare January 22, 2025 21:57
@wihobbs
Copy link
Member Author

wihobbs commented Jan 22, 2025

Once the automated tests pass I'll drop draft from this PR.

@wihobbs wihobbs marked this pull request as ready for review January 22, 2025 22:48
@wihobbs wihobbs requested a review from jameshcorbett January 22, 2025 22:48
@trws
Copy link
Member

trws commented Jan 23, 2025

This is starting to look really good @wihobbs! There's one tweak I'd like to see to this so we can keep extending it. I like the way the options are handled, this is exactly what I was hoping for. The one tweak I'd make is to have a policy=* option rather than separate low= and high=, since the base policy object types are low, high, locality and variation. That way, we get the "you can't set both low and high at the same time" for free, and can even collapse the two low and high branches to create a matcher based on which it is then set options on the resulting object after the condition. We can also check that policy has been set, either with the bare name or by having policy=whatever to allow options. Does that sound workable?

@wihobbs wihobbs force-pushed the new-policy-refactor branch 3 times, most recently from 5628d63 to 7e72499 Compare January 31, 2025 01:06
@wihobbs wihobbs force-pushed the new-policy-refactor branch from 7e72499 to 62b7cf4 Compare January 31, 2025 01:27
@wihobbs
Copy link
Member Author

wihobbs commented Jan 31, 2025

I think I've incorporated this feedback @trws, though slightly differently than you laid out in your comment. Instead of a policy key which accepts locality, variation, high, and low, I added a policy key that will accept low or high, and you can still set match-policy to locality, variation, high, and low, in which case the latter two would result in a "vanilla" high or low policy without the extra options. Does that suffice for what you laid out? It turned out to be not a huge lift.

I haven't implemented allowing set_stop_on_k_matches to accept an integer yet, but I'm going to go ahead and request a few reviews on this, to get feedback on the code and start getting the commits polished for approval. If you're okay with kicking that down the road to another PR (apologies, I forgot what we decided on that in our conversation yesterday), we could do that when the option itself supports integers other than 1.

@wihobbs wihobbs requested review from trws and cmoussa1 January 31, 2025 01:28
Copy link
Member

@cmoussa1 cmoussa1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good @wihobbs!! I've left some general feedback; I should probably defer to someone who knows the sched code better than I do to actually approve this but I think this looks pretty good to me!

t/CMakeLists.txt Show resolved Hide resolved
resource/policies/dfu_match_policy_factory.cpp Outdated Show resolved Hide resolved
resource/policies/dfu_match_policy_factory.cpp Outdated Show resolved Hide resolved
resource/policies/dfu_match_policy_factory.cpp Outdated Show resolved Hide resolved
Problem: policies in fluxion are sets of options that are
specified in a shared pointer, and, when a new set of options
(a new policy) is created, this requires at least a re-compilation
of flux-sched. In other words, the internal options on which policies
are built are not exposed to end users for creating new, custom policies.

Solution: continue to support the existing policy options, but make them
sets of specified options, and allow users to create their own sets and
pass them to the scheduler as a string.
Problem: the changes to the policy factory do not have
sharness testing.

Solution: building from existing, validated outputs, provide
the full string for certain policy options (rather than the
short string) and check that the matches are the same.
Problem: the policy factory in the resource module has no
unit testing.

Solution: add a series of unit tests to validate the parsing
and selection of policy options in the factory.
Problem: currently, the resource module sets a default policy
(first) when an invalid policy is specified. With the expansion
of the resource module to include custom policies, users might
be less accepting of a default policy being forced on them.

Emit an error when an invalid policy is specified in the
match-policy key. Exit with EINVAL.
Problem: the current testsuite includes two separate tests
that assert that the sched-fluxion-resource module will not emit
an error and abort if an invalid policy is specified in its
configuration. However, this behavior is undesirable, especially
since users will soon be able to specify custom policies and should
get an error if they specify an invalid one.

Amend these tests to test for failure in the cases where an invalid
policy is specified.
Problem: the example config files for tests in t/conf.d includes
an example of invalid policy settings, both for match-policy in
resource and queue-policy in fluxion. This configuration is only used
in t1005, where qmanager is tested to ensure it tolerates an invalid
queue policy and defaults to fcfs.

Recently, the resource module was updated to not tolerate invalid
policies, but this shouldn't change the behavior of qmanager. Use
first in the sched-fluxion-resource toml file example for this
test.
@wihobbs wihobbs force-pushed the new-policy-refactor branch from 62b7cf4 to 1ced65b Compare January 31, 2025 18:39
@wihobbs
Copy link
Member Author

wihobbs commented Jan 31, 2025

Thanks @cmoussa1! Fixed and pushed.

Copy link

codecov bot commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 96.73913% with 3 lines in your changes missing coverage. Please review.

Project coverage is 75.4%. Comparing base (471da50) to head (1ced65b).

Files with missing lines Patch % Lines
resource/policies/dfu_match_policy_factory.cpp 95.0% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1268     +/-   ##
========================================
+ Coverage    75.3%   75.4%   +0.1%     
========================================
  Files         111     112      +1     
  Lines       16042   16103     +61     
========================================
+ Hits        12081   12146     +65     
+ Misses       3961    3957      -4     
Files with missing lines Coverage Δ
resource/modules/resource_match_opts.cpp 84.4% <100.0%> (+0.3%) ⬆️
...licies/base/test/matcher_policy_factory_test02.cpp 100.0% <100.0%> (ø)
resource/policies/dfu_match_policy_factory.cpp 90.6% <95.0%> (+2.0%) ⬆️

... and 2 files with indirect coverage changes

*/
if (std::find (policy_options.begin (), policy_options.end (), settings.at (0))
== policy_options.end ()) {
std::cerr << "invalid policy option: " << settings.at (0) << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you should probably avoid writing directly to stdout/err from a broker module.

If there's a way to throw an exception and catch the error message at a higher level, which would then flux_log (LOG_ERR, ...) the result, that might be good. Otherwise you'd have to pass down a flux_t * handle to this function so it can log errors.

You could also return an error via a separate parameter when the result is false. Some of the other Fluxion developers may have a preference here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a chat about this yesterday after the meeting. Just above this in the call chain we're passing a std::string through by reference and appending messages to it as necessary, then returning it up to where we can handle it. The trick is we can't always have a handle here, because this is used in both the module and in resource_query. Essentially the separate parameter option. Example of the plan on line 516 in resource_match_opts.

I have a separate issue about doing a better job of this in fluxion as a whole, but don't want it to block this PR, so plan is to pass that output string down through for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's a hole in this PR. Will fix with what we discussed. Sorry, I should have done that before marking it as "ready." I got excited [and maybe a little ahead of myself] when the testsuite finally passed 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants