Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redo/generalize/tighten args shorthand #530

Merged
merged 9 commits into from
Dec 14, 2023
Merged

Redo/generalize/tighten args shorthand #530

merged 9 commits into from
Dec 14, 2023

Conversation

knighton
Copy link
Contributor

@knighton knighton commented Dec 12, 2023

Get more specific functionality in place to set the stage for the more interesting PRs.

Beeves I have with existing approach:

  1. Miscalculates when you take decimal amounts of sufficiently large units and are unlucky in floating point. Solved by doing everything in int.
  2. Sometimes people need base-1000, sometimes people need base-1024. Solved by using the standard set of suffixes where you infix an i when using base-1024 (1gb vs 1gib). Then we can recognize both at the same time, or either.
  3. Having the flexibility to accept both uppercase and lowercase or any combination thereof for units is a Tower of Babel situation. People are inevitably going to make dialects out of this flexibility. We will totes get complaints that they said Gb and didn't get gigabits.
  4. Similarly, the flexibility to put spaces before, after, or in-between parts of the argument is TIMTOWDI and very fashionista, not pythonista, let's have a single canonical form.
  5. Missed oppty to support multiple parts, then we can also do things like 1h23m45s.

Paths:

  1. normalize_bytes -> normalize_dec_bytes, normalize_bin_bytes -> _normalize_nonneg_int -> _normalize_int -> _normalize_num -> _normalize_arg.
  2. normalize_count -> _normalize_nonneg_int -> _normalize_int -> _normalize_num -> _normalize_arg.
  3. normalize_duration -> _normalize_float -> _normalize_num -> _normalize_arg.

Steps of the _normalize_arg algorithm:

  1. Must be non-empty.
  2. Drop commas and underscores (useful to demarcate thousands '1,337' or '1_337').
  3. Must start with a digit.
  4. Must alternative between numbers and units, starting with a number.
  5. If just a number, that's it.
  6. Pair up numbers and units: (a) special case where the implied unit is the empty string, i.e. the smallest unit; (b) if not just a number, each number must be paired with a corresponding unit.
  7. Assign parts as numbers and units.
  8. Each number before the last one must be integral.
  9. Parse out the digits of the final number, which may be fractional.
  10. Parse the digits as an integer for exact precision, no float nonsense.
  11. Each unit must be known to us.
  12. Each unit must be used at most once.
  13. Units must be listed in descending order of size.
  14. The number of any given part must not exceed the size of the next biggest part's unit. (Otherwise you would just roll its overage into the next biggest part.)
  15. Collect parts, with last part being possibly scaled down to account for a decimal point.

Example of how configuration and functionality are decomposed in this PR:

_count_units = _get_units(1000, ' k m b t'.split(' '))


def normalize_count(count: Union[int, str]) -> int:
    """Normalize from human-friendly count to int.

    Args:
        count (int | str): Human-friendly count.

    Returns:
        int: Integral count.
    """
    return _normalize_nonneg_int(count, _count_units)

simulation/core/sim_dataset.py Outdated Show resolved Hide resolved
tests/test_shorthand.py Outdated Show resolved Hide resolved
streaming/util/shorthand.py Outdated Show resolved Hide resolved
streaming/util/shorthand.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@snarayan21 snarayan21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main things are testing simulation changes and seeing if the humanize package fulfills some of our needs here!

@@ -88,7 +88,7 @@ def get_train_dataset_params(input_params: dict, old_params: Optional[dict] = No
train_dataset_params['cache_limit'] = input_params['cache_limit']
train_dataset_params['shuffle'] = input_params['shuffle']
train_dataset_params['shuffle_algo'] = input_params['shuffle_algo']
train_dataset_params['shuffle_block_size'] = number_abbrev_to_int(
train_dataset_params['shuffle_block_size'] = normalize_count(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you been able to test these changes? could you make sure these numbers are displaying and being passed as intended by running simulator and passing in values for these parameters?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could add a short "testing" section to the PR description that would be great as well!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the procedure for checking i didn't break the simulator?

@@ -1,115 +1,371 @@
# Copyright 2023 MosaicML Streaming authors
# SPDX-License-Identifier: Apache-2.0

"""Utilities for human-friendly argument shorthand."""
"""Conversions between human-friendly string forms and int/float."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a suggestion -- would the humanize library be applicable for some of these functions? Would be nice to use an external library for stuff like this, removes some burden on us as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observations for posterity:

  • the humanfriendly library is very pretty
  • But it lacks "counts"
  • So we are rolling our own functionality either way for that vertical
  • Also, we are tight on time and this is enough, so I figure let's just go with this and revisit in 2024

@knighton knighton merged commit 7c3fa05 into dev Dec 14, 2023
5 checks passed
@knighton knighton deleted the james/arg-shorthand branch December 14, 2023 04:55
karan6181 pushed a commit that referenced this pull request Jan 26, 2024
* Redo/generalize/tighten args shorthand, clean up usage, update tests.

* Fix (cruft).

* Fix (typo).

* Fix (reference to member).

* Tweak.

* Divide tests/test_util.py into tests/util/....py.

* Fix.

* Error messages.

* Lowercase, no space.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants