Redo/generalize/tighten args shorthand #530

knighton · 2023-12-12T09:31:07Z

Get more specific functionality in place to set the stage for the more interesting PRs.

Beefs I have with the existing approach:

Miscalculates when you take decimal amounts of sufficiently large units and are unlucky in floating point. Solved by doing everything in int.
Sometimes people need base-1000, sometimes people need base-1024. Solved by using the standard set of suffixes where you infix an i when using base-1024 (1gb vs 1gib). Then we can recognize both at the same time, or either.
Having the flexibility to accept both uppercase and lowercase or any combination thereof for units is a Tower of Babel situation. People are inevitably going to make dialects out of this flexibility. We will totes get complaints that they said Gb and didn't get gigabits.
Similarly, the flexibility to put spaces before, after, or in-between parts of the argument is TIMTOWTDI and very fashionista, not pythonista. Let's have a single canonical form.
Missed opportunity to support multiple parts, so that we can also do things like 1h23m45s.
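
To make the first point concrete, here is a minimal sketch of the float problem and the integer fix. The helpers `scale_float` and `scale_exact` are hypothetical names made up for this illustration; they are not functions from this PR.

```python
def scale_float(text: str, unit_size: int) -> int:
    """Naive approach: go through float, then truncate (hypothetical helper)."""
    return int(float(text) * unit_size)


def scale_exact(text: str, unit_size: int) -> int:
    """Integer-only approach: treat '1.1' as 11 / 10 and never touch float."""
    whole, _, frac = text.partition('.')
    digits = int(whole + frac)   # '1.1' -> 11
    denom = 10 ** len(frac)      # one fractional digit -> 10
    total = digits * unit_size
    if total % denom:
        raise ValueError(f'{text} of a unit of size {unit_size} is not a whole number.')
    return total // denom


unit = 10 ** 18  # a "sufficiently large" unit
print(scale_exact('1.1', unit))  # exact: 1100000000000000000
print(scale_float('1.1', unit))  # drifts away from the exact answer
```

The float version goes wrong because 1.1 has no exact binary representation, and multiplying by a large unit magnifies the error past the integer boundary.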

Paths:

normalize_bytes -> normalize_dec_bytes, normalize_bin_bytes -> _normalize_nonneg_int -> _normalize_int -> _normalize_num -> _normalize_arg.
normalize_count -> _normalize_nonneg_int -> _normalize_int -> _normalize_num -> _normalize_arg.
normalize_duration -> _normalize_float -> _normalize_num -> _normalize_arg.
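
One way to read these paths is as a chain of progressively narrower checks. The sketch below is an assumption about the shape of that layering, not the PR's code: every `*_sketch` name is hypothetical, and `Fraction` parsing stands in for the real `_normalize_arg`, which is stricter.

```python
from fractions import Fraction
from typing import Union


def to_num_sketch(arg: Union[int, float, str]) -> Fraction:
    """Widest layer: accept a number or a plain numeric string."""
    return Fraction(arg)  # stand-in for the real string parsing


def to_int_sketch(arg: Union[int, str]) -> int:
    """Narrower layer: the value must be integral."""
    value = to_num_sketch(arg)
    if value.denominator != 1:
        raise ValueError(f'Expected an integral value, got {arg!r}.')
    return int(value)


def to_nonneg_int_sketch(arg: Union[int, str]) -> int:
    """Narrowest layer: the integer must also be non-negative."""
    value = to_int_sketch(arg)
    if value < 0:
        raise ValueError(f'Expected a non-negative value, got {arg!r}.')
    return value


def to_float_sketch(arg: Union[int, float, str]) -> float:
    """Duration-style layer: any number is fine, returned as float."""
    return float(to_num_sketch(arg))
```

Under this reading, the public functions only pick a units table and a layer; the validation itself lives in one place.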

Steps of the _normalize_arg algorithm:

Must be non-empty.
Drop commas and underscores (useful to demarcate thousands '1,337' or '1_337').
Must start with a digit.
Must alternate between numbers and units, starting with a number.
If just a number, that's it.
Pair up numbers and units: (a) special case where the implied unit is the empty string, i.e. the smallest unit; (b) if not just a number, each number must be paired with a corresponding unit.
Assign parts as numbers and units.
Each number before the last one must be integral.
Parse out the digits of the final number, which may be fractional.
Parse the digits as an integer for exact precision, no float nonsense.
Each unit must be known to us.
Each unit must be used at most once.
Units must be listed in descending order of size.
The number of any given part must not exceed the size of the next biggest part's unit. (Otherwise you would just roll its overage into the next biggest part.)
Collect parts, with last part being possibly scaled down to account for a decimal point.
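
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in streaming/util/shorthand.py: the name `normalize_arg_sketch`, the `units` mapping from unit name to size, the `duration_units` table, and the choice to return an exact `Fraction` are all made up here. The overflow rule is read as "a part must stay below the size of the previous, bigger part's unit".

```python
import re
from fractions import Fraction
from typing import Dict

# A number (optionally fractional) followed by a lowercase unit (possibly empty).
_PART = re.compile(r'([0-9]+(?:\.[0-9]+)?)([a-z]*)')


def normalize_arg_sketch(arg: str, units: Dict[str, int]) -> Fraction:
    if not arg:
        raise ValueError('Argument must be non-empty.')
    text = arg.replace(',', '').replace('_', '')  # drop thousands separators
    if not text[:1].isdigit():
        raise ValueError(f'Argument must start with a digit: {arg!r}.')
    parts = _PART.findall(text)
    if ''.join(num + unit for num, unit in parts) != text:
        raise ValueError(f'Must alternate between numbers and units: {arg!r}.')
    # A bare number implies the empty unit; otherwise every number needs a unit.
    if len(parts) > 1 and any(unit == '' for _, unit in parts):
        raise ValueError(f'Each number must be paired with a unit: {arg!r}.')

    total = Fraction(0)
    seen = set()
    prev_size = None
    for index, (num, unit) in enumerate(parts):
        if '.' in num and index != len(parts) - 1:
            raise ValueError(f'Only the final number may be fractional: {arg!r}.')
        if unit not in units:
            raise ValueError(f'Unknown unit {unit!r} in {arg!r}.')
        if unit in seen:
            raise ValueError(f'Unit {unit!r} used more than once in {arg!r}.')
        seen.add(unit)
        size = units[unit]
        # Parse digits as an integer; the decimal point becomes a denominator.
        whole, _, frac = num.partition('.')
        value = Fraction(int(whole + frac) * size, 10 ** len(frac))
        if prev_size is not None:
            if size >= prev_size:
                raise ValueError(f'Units must be in descending order: {arg!r}.')
            if value >= prev_size:
                raise ValueError(f'{num}{unit} should roll over into the bigger unit.')
        total += value
        prev_size = size
    return total


duration_units = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400}  # hypothetical table
print(normalize_arg_sketch('1h23m45s', duration_units))  # 5025
```

Returning a `Fraction` keeps the result exact, so a caller can decide afterwards whether it needs an int (and reject non-integral results) or a float.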

Example of how configuration and functionality are decomposed in this PR:

_count_units = _get_units(1000, ' k m b t'.split(' '))


def normalize_count(count: Union[int, str]) -> int:
    """Normalize from human-friendly count to int.

    Args:
        count (int | str): Human-friendly count.

    Returns:
        int: Integral count.
    """
    return _normalize_nonneg_int(count, _count_units)
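
For readers who want a feel for the configuration side, here is a guess at what a units table could look like. The body of `_get_units` is not shown in this description, so `get_units_sketch` below is a hypothetical stand-in that assumes each name is worth the base raised to its position in the list. The byte unit names follow the 1gb vs 1gib convention mentioned above.

```python
from typing import Dict, List


def get_units_sketch(base: int, names: List[str]) -> Dict[str, int]:
    """Map each unit name to base ** position (assumed behavior)."""
    return {name: base ** power for power, name in enumerate(names)}


# The leading space makes the first name the empty string, i.e. "no unit".
count_units = get_units_sketch(1000, ' k m b t'.split(' '))

# Decimal and binary byte units can live side by side thanks to the 'i' infix.
dec_bytes_units = get_units_sketch(1000, 'b kb mb gb tb'.split(' '))
bin_bytes_units = get_units_sketch(1024, 'b kib mib gib tib'.split(' '))
bytes_units = {**dec_bytes_units, **bin_bytes_units}

print(count_units['k'], bytes_units['gb'], bytes_units['gib'])
```

Keeping the tables as plain data is what lets one parser serve counts, bytes, and durations.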

snarayan21

main things are testing simulation changes and seeing if the humanize package fulfills some of our needs here!

snarayan21 · 2023-12-12T23:40:19Z

simulation/interfaces/interface_utils.py

@@ -88,7 +88,7 @@ def get_train_dataset_params(input_params: dict, old_params: Optional[dict] = No
    train_dataset_params['cache_limit'] = input_params['cache_limit']
    train_dataset_params['shuffle'] = input_params['shuffle']
    train_dataset_params['shuffle_algo'] = input_params['shuffle_algo']
-    train_dataset_params['shuffle_block_size'] = number_abbrev_to_int(
+    train_dataset_params['shuffle_block_size'] = normalize_count(


Have you been able to test these changes? Could you make sure these numbers are displayed and passed as intended by running the simulator and passing in values for these parameters?

If you could add a short "testing" section to the PR description that would be great as well!

knighton · Dec 13, 2023

What is the procedure for checking I didn't break the simulator?

snarayan21 · 2023-12-12T23:42:58Z

streaming/util/shorthand.py

@@ -1,115 +1,371 @@
 # Copyright 2023 MosaicML Streaming authors
 # SPDX-License-Identifier: Apache-2.0

-"""Utilities for human-friendly argument shorthand."""
+"""Conversions between human-friendly string forms and int/float."""


just a suggestion -- would the humanize library be applicable for some of these functions? Would be nice to use an external library for stuff like this, removes some burden on us as well.

knighton · Dec 14, 2023

Observations for posterity:

the humanfriendly library is very pretty

But it lacks "counts"

So we are rolling our own functionality either way for that vertical

Also, we are tight on time and this is enough, so I figure let's just go with this and revisit in 2024

* Redo/generalize/tighten args shorthand, clean up usage, update tests.
* Fix (cruft).
* Fix (typo).
* Fix (reference to member).
* Tweak.
* Divide tests/test_util.py into tests/util/....py.
* Fix.
* Error messages.
* Lowercase, no space.

knighton added 5 commits December 12, 2023 01:26

Redo/generalize/tighten args shorthand, clean up usage, update tests.

2f344d2

Fix (cruft).

0506485

Fix (typo).

f3fbd08

Fix (reference to member).

6cf0618

Tweak.

8e6fb33

karan6181 requested review from snarayan21, karan6181 and XiaohanZhangCMU December 12, 2023 17:15

XiaohanZhangCMU reviewed Dec 12, 2023

tests/test_shorthand.py (outdated, resolved)

XiaohanZhangCMU approved these changes Dec 12, 2023

karan6181 reviewed Dec 12, 2023

simulation/core/sim_dataset.py (outdated, resolved)

tests/test_shorthand.py (outdated, resolved)

karan6181 reviewed Dec 12, 2023

streaming/util/shorthand.py (outdated, resolved)

streaming/util/shorthand.py (outdated, resolved)

snarayan21 reviewed Dec 12, 2023

knighton added 3 commits December 12, 2023 15:46

Divide tests/test_util.py into tests/util/....py.

b23e816

Fix.

a214ecb

Error messages.

6e2c41d

karan6181 approved these changes Dec 13, 2023

Lowercase, no space.

2e99b1a

knighton merged commit 7c3fa05 into dev Dec 14, 2023
5 checks passed

knighton deleted the james/arg-shorthand branch December 14, 2023 04:55
