Stats for datetimes #3007

Open. Wants to merge 52 commits into base: main.
Commits (52)
79790e0
compute stats for datetimes
polinaeterna Jul 31, 2024
12f78cc
Merge branch 'main' into datetime-stats
polinaeterna Jul 31, 2024
851ec1b
fix typing
polinaeterna Jul 31, 2024
3347c13
add testcase
polinaeterna Aug 1, 2024
0340b54
moar tests: column with nulls and all nulls column
polinaeterna Aug 5, 2024
4cd6e0d
Merge branch 'main' into datetime-stats
polinaeterna Aug 5, 2024
434b2d8
add datetime to worker
polinaeterna Aug 8, 2024
2604587
add test
polinaeterna Aug 8, 2024
913f812
include timezone aware
polinaeterna Aug 9, 2024
06c1ae5
Merge branch 'main' into datetime-stats
polinaeterna Aug 12, 2024
7f7ecab
Merge branch 'main' into datetime-stats
polinaeterna Oct 14, 2024
d517393
refactor
polinaeterna Oct 14, 2024
7046d8b
fix
polinaeterna Oct 14, 2024
945dff0
do not typecheck dateutil
polinaeterna Oct 14, 2024
d91d365
Merge branch 'main' into datetime-stats
polinaeterna Dec 20, 2024
bdec2e4
fix
polinaeterna Dec 23, 2024
f9ffe82
more tests
polinaeterna Dec 23, 2024
d2c37c6
fix string to datetime conversion: add format inferring
polinaeterna Dec 26, 2024
658719e
fix style
polinaeterna Dec 26, 2024
5c2d94a
fix check for datetime
polinaeterna Dec 27, 2024
359a30b
minor
polinaeterna Dec 27, 2024
0744e07
mypy
polinaeterna Dec 27, 2024
53e2100
add testcase
polinaeterna Jan 6, 2025
a61108f
Merge branch 'main' into datetime-stats
polinaeterna Jan 7, 2025
c63e70e
Merge branch 'main' into datetime-stats
polinaeterna Jan 8, 2025
70197aa
Merge branch 'datetime-stats' of github.com:huggingface/datasets-serv…
polinaeterna Jan 8, 2025
3df6264
fix?
polinaeterna Jan 8, 2025
812bf36
add example to docs
polinaeterna Jan 8, 2025
c68efb7
fix + add tz string (%Z) to formats
polinaeterna Jan 9, 2025
351ef5c
test for string timezone
polinaeterna Jan 9, 2025
787ad3b
try to debug
polinaeterna Jan 10, 2025
5163500
test identify_datetime_format
polinaeterna Jan 10, 2025
033e29e
test datetime.strptime
polinaeterna Jan 13, 2025
349b651
test
polinaeterna Jan 13, 2025
6c60c27
Update services/worker/src/worker/statistics_utils.py
polinaeterna Jan 15, 2025
db10500
keep original timezone for string dates
polinaeterna Jan 15, 2025
8794b7a
let polars identify datetime format by itself
polinaeterna Jan 15, 2025
e0e7c91
do not display +0000 in timestamps (if timezone is UTC)
polinaeterna Jan 15, 2025
8afade1
remove utils test
polinaeterna Jan 15, 2025
341676c
refactor: identify datetime format manually only when polars failed
polinaeterna Jan 15, 2025
3b5d950
style
polinaeterna Jan 16, 2025
21977db
log formats in error message
polinaeterna Jan 16, 2025
0ee76bf
update openapi specs
polinaeterna Jan 16, 2025
b7fee0b
fallback to string stats if datetime didn't work
polinaeterna Jan 16, 2025
6a76dd9
fix test
polinaeterna Jan 16, 2025
f3eefea
update docs
polinaeterna Jan 16, 2025
a79eb79
Merge branch 'main' into datetime-stats
polinaeterna Jan 16, 2025
1df95ff
fix openapi specs
polinaeterna Jan 16, 2025
2f27846
Merge branch 'main' into datetime-stats
polinaeterna Jan 17, 2025
f9d7a8a
fix polars timezone switching
polinaeterna Jan 17, 2025
720aab9
Merge branch 'main' into datetime-stats
polinaeterna Jan 17, 2025
f84083f
Merge branch 'main' into datetime-stats
polinaeterna Jan 28, 2025
425 changes: 424 additions & 1 deletion docs/source/openapi.json

Large diffs are not rendered by default.

73 changes: 66 additions & 7 deletions docs/source/statistics.md
Original file line number Diff line number Diff line change
@@ -165,7 +165,7 @@ The response JSON contains three keys:

## Response structure by data type

Currently, statistics are supported for strings, float and integer numbers, lists, audio and image data and the special [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature type of the [`datasets`](https://huggingface.co/docs/datasets/) library.
Currently, statistics are supported for strings, float and integer numbers, lists, datetimes, audio and image data and the special [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature type of the [`datasets`](https://huggingface.co/docs/datasets/) library.

`column_type` in response can be one of the following values:

@@ -178,6 +178,7 @@ Currently, statistics are supported for strings, float and integer numbers, list
* `list` - for lists of any other data types (including lists)
* `audio` - for audio data
* `image` - for image data
* `datetime` - for datetime data

### `class_label`

@@ -216,7 +217,7 @@ This type represents categorical data encoded as [`ClassLabel`](https://huggingf

The following measures are returned for float data types:

* minimum, maximum, mean, and standard deviation values
* minimum, maximum, mean, median, and standard deviation values
* number and proportion of `null` and `NaN` values (`NaN` values are treated as `null`)
* histogram with 10 bins

@@ -273,7 +274,7 @@ The following measures are returned for float data types:

The following measures are returned for integer data types:

* minimum, maximum, mean, and standard deviation values
* minimum, maximum, mean, median, and standard deviation values
* number and proportion of `null` values
* histogram with less than or equal to 10 bins

@@ -377,7 +378,7 @@ If the proportion of unique values in a string column within requested split is

If string column does not satisfy the conditions to be treated as a `string_label`, it is considered to be a column containing texts and response contains statistics over text lengths which are calculated by character number. The following measures are computed:

* minimum, maximum, mean, and standard deviation of text lengths
* minimum, maximum, mean, median, and standard deviation of text lengths
* number and proportion of `null` values
* histogram of text lengths with 10 bins

@@ -434,7 +435,7 @@ If string column does not satisfy the conditions to be treated as a `string_labe

For lists, the distribution of their lengths is computed. The following measures are returned:

* minimum, maximum, mean, and standard deviation of lists lengths
* minimum, maximum, mean, median, and standard deviation of lists lengths
* number and proportion of `null` values
* histogram of lists lengths with up to 10 bins

@@ -480,7 +481,7 @@ Note that dictionaries of lists are not supported.

For audio data, the distribution of audio files durations is computed. The following measures are returned:

* minimum, maximum, mean, and standard deviation of audio files durations
* minimum, maximum, mean, median, and standard deviation of audio files durations
* number and proportion of `null` values
* histogram of audio files durations with 10 bins

@@ -539,7 +540,7 @@ For audio data, the distribution of audio files durations is computed. The follo

For image data, the distribution of images widths is computed. The following measures are returned:

* minimum, maximum, mean, and standard deviation of widths of image files
* minimum, maximum, mean, median, and standard deviation of widths of image files
* number and proportion of `null` values
* histogram of images widths with 10 bins

@@ -591,3 +592,61 @@ For image data, the distribution of images widths is computed. The following mea

</p>
</details>

### datetime

The distribution of datetime values is computed. The following measures are returned:

* minimum, maximum, mean, median, and standard deviation of datetimes represented as strings with precision up to seconds
* number and proportion of `null` values
* histogram of datetimes with 10 bins

<details><summary>Example </summary>
<p>

```json
{
"column_name": "date",
"column_type": "datetime",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": "2013-05-18 04:54:11",
"max": "2013-06-20 10:01:41",
"mean": "2013-05-27 18:03:39",
"median": "2013-05-23 11:55:50",
"std": "11 days, 4:57:32.322450",
"histogram": {
"hist": [
318776,
393036,
173904,
0,
0,
0,
0,
0,
0,
206284
],
"bin_edges": [
"2013-05-18 04:54:11",
"2013-05-21 12:36:57",
"2013-05-24 20:19:43",
"2013-05-28 04:02:29",
"2013-05-31 11:45:15",
"2013-06-03 19:28:01",
"2013-06-07 03:10:47",
"2013-06-10 10:53:33",
"2013-06-13 18:36:19",
"2013-06-17 02:19:05",
"2013-06-20 10:01:41"
]
}
}
}
```

</p>
</details>
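Since all datetime statistics in the response are serialized as strings with second precision, they can be parsed back with Python's `datetime.strptime`. A small sketch, with the values copied from the example above:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # the second-precision format used in the response

stats = {
    "min": "2013-05-18 04:54:11",
    "max": "2013-06-20 10:01:41",
    "mean": "2013-05-27 18:03:39",
    "median": "2013-05-23 11:55:50",
}

# Parse every statistic back into a datetime object.
parsed = {key: datetime.strptime(value, FMT) for key, value in stats.items()}

# The whole column spans a bit more than a month.
print(parsed["max"] - parsed["min"])  # 33 days, 5:07:30
```

Note that `std` is a duration rather than a point in time, so it is serialized as a timedelta-style string and would need separate handling.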

1 change: 1 addition & 0 deletions libs/libcommon/pyproject.toml
@@ -76,6 +76,7 @@ module = [
"moto.*",
"aiobotocore.*",
"requests.*",
"dateutil.*"
]
# ^ huggingface_hub is not typed since version 0.13.0
ignore_missing_imports = true
74 changes: 74 additions & 0 deletions libs/libcommon/src/libcommon/utils.py
@@ -15,6 +15,7 @@
import orjson
import pandas as pd
import pytz
from dateutil import parser
from huggingface_hub import constants, hf_hub_download
from requests.exceptions import ReadTimeout

@@ -93,6 +94,79 @@ def get_datetime(days: Optional[float] = None) -> datetime:
return date


def is_datetime(string: str) -> bool:
try:
parser.parse(string)
return True
except ValueError:
return False


def get_timezone(string: str) -> Any:
return parser.parse(string).tzinfo


def datetime_to_string(dt: datetime, format: str = "%Y-%m-%d %H:%M:%S%z") -> str:
if dt.utcoffset() == timedelta(0):
format = "%Y-%m-%d %H:%M:%S" # do not display +0000
return dt.strftime(format)


def identify_datetime_format(datetime_string: str) -> Optional[str]:
# Common datetime formats
common_formats = [
"%Y-%m-%dT%H:%M:%S%Z",
"%Y-%m-%dT%H:%M:%S%z",
"%Y-%m-%dT%H:%M:%S",
"%Y-%m-%dT%H:%M:%S.%f",
"%Y-%m-%d %H:%M:%S%Z",
"%Y-%m-%d %H:%M:%S%z",
"%Y-%m-%d %H:%M:%S",
"%Y-%m-%d %H:%M",
"%Y-%m-%d",
"%d-%m-%Y %H:%M:%S%Z",
"%d-%m-%Y %H:%M:%S%z",
"%d-%m-%Y %H:%M:%S",
"%d-%m-%Y %H:%M",
"%d-%m-%Y",
"%m-%d-%Y %H:%M:%S%Z",
"%m-%d-%Y %H:%M:%S%z",
"%m-%d-%Y %H:%M:%S",
"%m-%d-%Y %H:%M",
"%m-%d-%Y",
"%Y/%m/%d %H:%M:%S%Z",
"%Y/%m/%d %H:%M:%S%z",
"%Y/%m/%d %H:%M:%S",
"%Y/%m/%d %H:%M",
"%Y/%m/%d",
"%d/%m/%Y %H:%M:%S%Z",
"%d/%m/%Y %H:%M:%S%z",
"%d/%m/%Y %H:%M:%S",
"%d/%m/%Y %H:%M",
"%d/%m/%Y",
"%m/%d/%Y %H:%M:%S%Z",
"%m/%d/%Y %H:%M:%S%z",
"%m/%d/%Y %H:%M:%S",
"%m/%d/%Y %H:%M",
"%m/%d/%Y",
"%B %d, %Y",
"%d %B %Y",
"%m-%Y",
"%Y-%m",
"%m/%Y",
"%Y/%m",
"%Y",
]

for fmt in common_formats:
try:
_ = datetime.strptime(datetime_string, fmt)
return fmt
except ValueError:
continue
return None


def get_duration(started_at: datetime) -> float:
"""
Get time in seconds that has passed from `started_at` until now.
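The format list in `identify_datetime_format` is tried in order, so ordering doubles as a tie-breaker for ambiguous dates (e.g. `%d-%m-%Y` is tried before `%m-%d-%Y`). A standalone sketch of the same first-match loop, using a trimmed, hypothetical candidate list rather than the full one from `libcommon`:

```python
from datetime import datetime
from typing import Optional


def first_matching_format(value: str, formats: list[str]) -> Optional[str]:
    """Return the first strptime format that parses `value` in full, else None."""
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


CANDIDATES = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]

print(first_matching_format("2024-01-31 10:15:00", CANDIDATES))  # %Y-%m-%d %H:%M:%S
print(first_matching_format("01/02/2024", CANDIDATES))           # %d/%m/%Y, not %m/%d/%Y
print(first_matching_format("not a date", CANDIDATES))           # None
```

The second call shows the ambiguity: `01/02/2024` parses under both slash formats, and the day-first format wins only because it comes first in the list.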
54 changes: 54 additions & 0 deletions services/worker/README.md
@@ -116,6 +116,7 @@ The response has three fields: `num_examples`, `statistics`, and `partial`. `par
* `list` - for lists of other data types (including lists)
* `audio` - for audio data
* `image` - for image data
* `datetime` - for datetime data

`column_statistics` content depends on the feature type, see examples below.
##### class_label
@@ -591,6 +592,59 @@ Shows distribution of image files widths.
</p>
</details>


##### datetime

Shows distribution of datetimes.

<details><summary>example: </summary>
<p>

```python
{
"column_name": "date",
"column_type": "datetime",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": "2013-05-18 04:54:11",
"max": "2013-06-20 10:01:41",
"mean": "2013-05-27 18:03:39",
"median": "2013-05-23 11:55:50",
"std": "11 days, 4:57:32.322450",
"histogram": {
"hist": [
318776,
393036,
173904,
0,
0,
0,
0,
0,
0,
206284
],
"bin_edges": [
"2013-05-18 04:54:11",
"2013-05-21 12:36:57",
"2013-05-24 20:19:43",
"2013-05-28 04:02:29",
"2013-05-31 11:45:15",
"2013-06-03 19:28:01",
"2013-06-07 03:10:47",
"2013-06-10 10:53:33",
"2013-06-13 18:36:19",
"2013-06-17 02:19:05",
"2013-06-20 10:01:41"
]
}
}
}
```
</p>
</details>
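Aggregates like `mean` and `std` are not defined directly on datetimes; one way to picture what is being computed (a sketch only, not the worker's actual polars-based code) is to aggregate Unix timestamps and convert the result back:

```python
import statistics
from datetime import datetime, timedelta, timezone

dates = [
    datetime(2013, 5, 18, 4, 54, 11, tzinfo=timezone.utc),
    datetime(2013, 5, 23, 11, 55, 50, tzinfo=timezone.utc),
    datetime(2013, 6, 20, 10, 1, 41, tzinfo=timezone.utc),
]

# Aggregate in timestamp space, then map the mean back to a datetime.
timestamps = [d.timestamp() for d in dates]
mean_dt = datetime.fromtimestamp(statistics.mean(timestamps), tz=timezone.utc)
std = timedelta(seconds=statistics.pstdev(timestamps))  # std stays a duration

print(mean_dt.strftime("%Y-%m-%d %H:%M:%S"))
print(std)
```

This also explains why `std` in the response is a duration string while every other statistic is a point in time.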

### Splits worker

The `splits` worker does not need any additional configuration.
@@ -39,6 +39,7 @@
AudioColumn,
BoolColumn,
ClassLabelColumn,
DatetimeColumn,
FloatColumn,
ImageColumn,
IntColumn,
@@ -57,7 +58,15 @@ class SplitDescriptiveStatisticsResponse(TypedDict):


SupportedColumns = Union[
ClassLabelColumn, IntColumn, FloatColumn, StringColumn, BoolColumn, ListColumn, AudioColumn, ImageColumn
ClassLabelColumn,
IntColumn,
FloatColumn,
StringColumn,
BoolColumn,
ListColumn,
AudioColumn,
ImageColumn,
DatetimeColumn,
]


@@ -215,29 +224,34 @@ def _column_from_feature(
return ListColumn(feature_name=dataset_feature_name, n_samples=num_examples)

if isinstance(dataset_feature, dict):
if dataset_feature.get("_type") == "ClassLabel":
_type = dataset_feature.get("_type")
if _type == "ClassLabel":
return ClassLabelColumn(
feature_name=dataset_feature_name, n_samples=num_examples, feature_dict=dataset_feature
)

if dataset_feature.get("_type") == "Audio":
if _type == "Audio":
return AudioColumn(feature_name=dataset_feature_name, n_samples=num_examples)

if dataset_feature.get("_type") == "Image":
if _type == "Image":
return ImageColumn(feature_name=dataset_feature_name, n_samples=num_examples)

if dataset_feature.get("_type") == "Value":
if dataset_feature.get("dtype") in INTEGER_DTYPES:
if _type == "Value":
dtype = dataset_feature.get("dtype", "")
if dtype in INTEGER_DTYPES:
return IntColumn(feature_name=dataset_feature_name, n_samples=num_examples)

if dataset_feature.get("dtype") in FLOAT_DTYPES:
if dtype in FLOAT_DTYPES:
return FloatColumn(feature_name=dataset_feature_name, n_samples=num_examples)

if dataset_feature.get("dtype") in STRING_DTYPES:
if dtype in STRING_DTYPES:
return StringColumn(feature_name=dataset_feature_name, n_samples=num_examples)

if dataset_feature.get("dtype") == "bool":
if dtype == "bool":
return BoolColumn(feature_name=dataset_feature_name, n_samples=num_examples)

if dtype.startswith("timestamp"):
return DatetimeColumn(feature_name=dataset_feature_name, n_samples=num_examples)
return None

columns: list[SupportedColumns] = []
Expand All @@ -249,7 +263,7 @@ def _column_from_feature(
if not columns:
raise NoSupportedFeaturesError(
"No columns for statistics computation found. Currently supported feature types are: "
f"{NUMERICAL_DTYPES}, {STRING_DTYPES}, ClassLabel, list/Sequence and bool. "
f"{NUMERICAL_DTYPES}, {STRING_DTYPES}, ClassLabel, Image, Audio, list/Sequence, datetime and bool. "
)

column_names_str = ", ".join([column.name for column in columns])
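The new `dtype.startswith("timestamp")` branch matches the Arrow-style dtype strings used by `datasets` `Value` features, which encode a time unit and an optional timezone. A minimal sketch of how the prefix check classifies such strings (the sample dtypes are illustrative, not taken from a real dataset):

```python
# Arrow-style dtype strings as they appear in a `Value` feature's "dtype" field.
# Unit ("s", "ms", "us", "ns") and the optional tz suffix vary per dataset.
DTYPES = ["int64", "timestamp[s]", "timestamp[ms]", "timestamp[us, tz=UTC]", "string"]

# The same prefix check the dispatcher uses to route to DatetimeColumn.
datetime_dtypes = [d for d in DTYPES if d.startswith("timestamp")]
print(datetime_dtypes)  # ['timestamp[s]', 'timestamp[ms]', 'timestamp[us, tz=UTC]']
```

The prefix check keeps the dispatcher agnostic to unit and timezone, at the cost of accepting any future `timestamp[...]` variant as well.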