Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validation warnings set stacklevel=2 #377

Closed
wants to merge 10 commits into from

Conversation

bmcfee
Copy link
Collaborator

@bmcfee bmcfee commented Mar 26, 2024

Fixes #372

All validation functions now set stacklevel=2.

Before, line number comes from validate:

n [4]: mir_eval.beat.evaluate(ref_data, np.array([]))
/home/bmcfee/git/mir_eval/mir_eval/beat.py:93: UserWarning: Estimated beats are empty.
  warnings.warn("Estimated beats are empty.")

After, line number comes from the metrics:

In [9]: mir_eval.beat.evaluate(ref_data, np.array([]))
/home/bmcfee/git/mir_eval/mir_eval/beat.py:165: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:207: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:267: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:359: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:454: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:614: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)

As it currently stands, we do a lot of redundant validation when using the evaluate() functions. It might be worth considering an enhancement where evaluate() validates the data once at the top, and then each underlying metric provides an option to bypass validation thereafter.

I've proposed similar things in the past #341 as an optimization hack, but in this case, it's now also a usability issue because the relevant stacklevel to warn is different when metrics are called directly (2) vs called from evaluate (3).

I think it could make sense to roll that into this PR, or not. Curious for other folks' input.

@bmcfee bmcfee added this to the 0.8 milestone Mar 26, 2024
@codecov-commenter
Copy link

codecov-commenter commented Mar 26, 2024

Codecov Report

Attention: Patch coverage is 97.88136% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 88.65%. Comparing base (7997fdf) to head (fd2b880).
Report is 5 commits behind head on main.

Files Patch % Lines
mir_eval/multipitch.py 50.00% 4 Missing ⚠️
mir_eval/chord.py 98.41% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #377      +/-   ##
==========================================
+ Coverage   88.31%   88.65%   +0.34%     
==========================================
  Files          19       19              
  Lines        2875     2971      +96     
==========================================
+ Hits         2539     2634      +95     
- Misses        336      337       +1     
Flag Coverage Δ
unittests 88.65% <97.88%> (+0.34%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bmcfee
Copy link
Collaborator Author

bmcfee commented Mar 26, 2024

... of course, @craffel already suggested this 3 years ago 😆

#341 (comment)

@craffel
Copy link
Collaborator

craffel commented Mar 26, 2024

Thanks. What does this look like when you call a metric function directly? Regarding redundant calls to validate, this probably belongs in the other issue, but there must be some Python wizardry that basically says: If the same function is called with the same arguments within this context, skip the call. Basically some kind of argument caching/memoization but instead of caching the output, just skip the call completely, and evict the cache when you exit the context. That would work, right?

@bmcfee
Copy link
Collaborator Author

bmcfee commented Mar 26, 2024

Took the liberty of adding the safe= parameter to validators and evaluator functions. This is WIP, I'm up through (alphabetically) the segment module. TBD are separation, tempo, transcription, and transcription_velocity.

In all cases, the default behavior is unchanged. Everything is validated by default. However, we now avoid doing multiple validation calls on the same data. Doing it this way means that stacklevel=2 always kicks back to the call site of validation, regardless of whether the user called evaluate() or a more specific metric.

I also took the liberty of bumping up the chord parser regexp into a module-level compiled pattern. Recompiling that monster regexp for every chord label was probably the source of the inefficiency I mentioned back in #341.

I'll report back when all safety parameterizations are done.

@bmcfee
Copy link
Collaborator Author

bmcfee commented Mar 26, 2024

Thanks. What does this look like when you call a metric function directly? Regarding redundant calls to validate, this probably belongs in the other issue, but there must be some Python wizardry that basically says: If the same function is called with the same arguments within this context, skip the call. Basically some kind of argument caching/memoization but instead of caching the output, just skip the call completely, and evict the cache when you exit the context. That would work, right?

Hah, sorry, crossed wires here. 😁

The idea is that if the user calls evaluate(..., safe=True), the validation happens once there, and is then subsequently skipped (safe=False). This way we never have redundant calls, and the stacklevel points to the right place.

If the user calls a metric directly with safe=True, it still validates.

If the user calls either with safe=False, then validation is skipped entirely. (Dangerous, but YOLO, and it makes sense in certain circumstances.)

Regarding memoization, yeah we could do that, but I don't think it actually buys us much here. It would add at least one dependency (eg joblib) and probably complicate the stacklevel logic more than the solution above.

@craffel
Copy link
Collaborator

craffel commented Mar 26, 2024

This works, but doesn't it seem drastic to put a new arg into basically all of the functions just to avoid redundant calls in evaluate? It feels possibly worth it to do the Python wizardry to avoid adding an arg everywhere.

@bmcfee
Copy link
Collaborator Author

bmcfee commented Mar 26, 2024

This works, but doesn't it seem drastic to put a new arg into basically all of the functions just to avoid redundant calls in evaluate? It feels possibly worth it to do the Python wizardry to avoid adding an arg everywhere.

I wouldn't consider it drastic if it's implemented consistently across the entire package (which is my plan). It's perhaps a little clumsy, but it's easy to understand and use.

A slick pythonic alternative would be to bundle everything in a decorator that abstracts out the validation / bypass logic before getting onto the core of the evaluation code. This might be a little fancier on the dev side, but the user experience would be pretty much the same. In this case, as much as I like a slick decorator trick, I'd favor readability and transparency.

@bmcfee
Copy link
Collaborator Author

bmcfee commented Mar 28, 2024

... forgot to comment that I'd finished with this retrofit, and I think it's ready for CR.

Thinking more about potential python wizardry, I think the only reason to do that rather than the explicit parameterization I've implemented here is if we think we might change things later on. This kind of thing happens more for things like deprecation warnings / parameter renames / etc., which can change frequently over the lifecycle of a project, so there's benefit to making it easy to add or remove at any point. I don't think that applies in this case.

I do think we need validation bypassing as implemented here if we want all of the following:

  • inputs are (can be) validated regardless of how a metric is called (directly or through evaluate)
  • warning stacklevels point to the call site for the user and not within mir_eval
  • validation can be disabled by the user

The only remaining question for me right now is whether the stacklevel should be 2 or 3. The call path looks something like:

validate (1) ← [evaluate || specific metric] (2) ← user code (3)

main branch currently leaves it at 1, which is not helpful. PR currently does 2, which would point to either evaulate or the specific metric being called; this seems marginally more helpful. I think (3) might be best as it points to where the user is entering the mir_eval code path. The only downside here is that if a user is calling validate() directly, then the stack level might point one level higher up than it should. I expect this case is exceedingly rare and not something we should worry about, but I did want to note it here.

@craffel
Copy link
Collaborator

craffel commented Mar 28, 2024

Sorry for the delay in responding, as I mentioned my latency is a bit unpredictable these days. I appreciate the work you put into this but I don't think adding an additional argument to most user-facing functions in the library is the right solution - this hurts usability/readability significantly in addition to potential backwards compatibility issues with positional arguments and therefore should be a last resort. I also don't think it's important to support users disabling validation - the validation logic is there to catch problems that otherwise would cause issues and throw more helpful errors/warnings; the only reason it's in a separate function is to avoid code duplication in all of the metric functions. If there was no other option, I think it would make sense to consider adding an arg, but in this case the issue it solves is minor enough (repeated warnings for malformed inputs when evaluate is called) that I'm not even sure it's warranted. In any case I don't think we need to address this issue or come up with a solution for 0.8.0 so I would be most inclined to punt on it for now.

Regarding the stack level, I'm not sure since IME the stack level reported by warnings is almost never helpful... but I agree that validate should almost never be called by an end-user so (3) is probably best.

@bmcfee
Copy link
Collaborator Author

bmcfee commented Mar 28, 2024

his hurts usability/readability significantly in addition to potential backwards compatibility issues with positional arguments and therefore should be a last resort.

I don't think this affects positional arguments in existing code, as it's always the last new (positional) parameter.

I also don't see how this hurts usability or readability: the default behavior is preserved, and the behavior is explicitly stated in documentation. At most it's one extra piece of information in the docs?

In any case I don't think we need to address this issue or come up with a solution for 0.8.0 so I would be most inclined to punt on it for now.

I'm okay with punting if you think it's really unimportant.

Regarding the stack level, I'm not sure since IME the stack level reported by warnings is almost never helpful... but I agree that validate should almost never be called by an end-user so (3) is probably best.

I actually do use warning stacklevels somewhat often, so if we have a chance to do it right, I think we should. That said, I don't think we can do it right without the change I've implemented here, since there are two alternative routes to the validate function with different stacklevels (user→metric→validate and user→evaluate→metric→validate). This is why I went to the effort of lifting out the validate call and adding bypass options at all. I agree with your earlier point that disabling validation entirely is of negligible utility, but I don't think there's another good option here that still produces a correct stacklevel.

(A grungier alternative would be to pass the target stacklevel through to validate, but that's much less generally readable or usable than a bypass option.)

@craffel
Copy link
Collaborator

craffel commented Mar 29, 2024

I also don't see how this hurts usability or readability: the default behavior is preserved, and the behavior is explicitly stated in documentation. At most it's one extra piece of information in the docs?

Adding an argument to almost all user-facing functions that is almost never used harms readability and usability. I'm not saying it's catastrophic, I'm just saying it should be a last resort - if it can be avoided (which I believe it can) it should be avoided.

Regarding the stack level, thinking about it more, I think stack level 2 is reasonable - it's flagging for the user that the warning is coming from some particular metric function. If that's inside an evaluate call, that's ok, because ultimately all evaluate does is do some preprocessing and then call the metric functions. I don't think we can in general assume some point in the stack level corresponds to the point where the user has constructed and passed off the data being evaluated, because one level above the evaluate/metric function call could either be user code or some other library code that the user is using on top of mir_eval.

@bmcfee
Copy link
Collaborator Author

bmcfee commented Mar 29, 2024

Ok, I'll close this out then.

@craffel
Copy link
Collaborator

craffel commented Apr 4, 2024

Just to clarify, I wasn't saying to nuke this entirely - I still think stacklevel=2 is a QoL improvement over the current behavior and is a simple change that should be merged in.

@bmcfee
Copy link
Collaborator Author

bmcfee commented Apr 5, 2024

Just to clarify, I wasn't saying to nuke this entirely - I still think stacklevel=2 is a QoL improvement over the current behavior and is a simple change that should be merged in.

Yeah I understand. But as I've said above, stacklevel=2 without accounting for the alternative code paths is not actually that helpful. In fact, it can be worse than status quo because it triggers separate warnings every time the data is validated (when called by evaluate()):

In [9]: mir_eval.beat.evaluate(ref_data, np.array([]))
/home/bmcfee/git/mir_eval/mir_eval/beat.py:165: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:207: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:267: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:359: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:454: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:614: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)

Since these are coming from different lines in the module (due to stacklevel) these appear to be different, and are thus not absorbed by the "once" default rule for UserWarnings. When stacklevel is not set (status quo), repeat warnings appear to come from the same place, and are (correctly, imo) absorbed by the warnings filter:

In [6]: mir_eval.beat.evaluate(ref_data, np.array([]))
/home/bmcfee/git/mir_eval/mir_eval/beat.py:93: UserWarning: Estimated beats are empty.
  warnings.warn("Estimated beats are empty.")

Now, Python 3.12 actually added some functionality to do this in exactly the right way with the skip_file_prefixes parameter, but I don't think we can bump the minimum python to 3.12 just for this feature.

Which is all to say: I don't think setting stacklevel=2 universally is correct or more helpful than the current behavior, and I also don't see a cleaner way to achieve the most helpful behavior than what I'd implemented in this PR.

@craffel
Copy link
Collaborator

craffel commented Apr 12, 2024

Ah, I misunderstood the issue. I'll work on a solution to avoiding repeat calls to validate after my leave ends.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set warning stacklevels for validation
3 participants