Mic-716/Coerce empty strings to np.nan for PO Box addresses #354

Merged
merged 8 commits on Dec 1, 2023

Contributor

@albrja albrja commented Nov 30, 2023

Mic-716/Coerce empty strings to np.nan for PO box addresses

Coerces null value from empty string to np.nan for addresses with a PO box.

- Coerces null values from empty strings to np.nan for addresses with a PO box. This issue was present in the full USA dataset. Note that the new full USA data did not have this bug, so I suspect the recent pandas 2.1 update may have fixed it.

Testing

Tested on two different full USA datasets; both output correct null values for the mailing address columns.
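
For readers without the diff at hand, here is a minimal sketch of the kind of coercion described above. It is not the code merged in this PR: the function name `coerce_empty_strings_to_nan` and the column names in the usage example are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

import numpy as np
import pandas as pd


def coerce_empty_strings_to_nan(data: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of `data` with "" replaced by np.nan in the given columns.

    Hypothetical helper for illustration; columns missing from `data` are skipped.
    """
    result = data.copy()
    for col in columns:
        if col in result.columns:
            # Series.mask replaces values where the condition is True.
            result[col] = result[col].mask(result[col] == "", np.nan)
    return result


# Usage with made-up column names (assumptions, not the project's schema).
example = pd.DataFrame(
    {
        "mailing_address_street_number": ["", "123"],
        "mailing_address_po_box": ["456", ""],
    }
)
cleaned = coerce_empty_strings_to_nan(example, ["mailing_address_street_number"])
```

Only the listed columns are touched, so an empty string in any other column is left as-is. That is why it matters which columns are called out explicitly.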

Base automatically changed from mic-4671/user-warnings to release-candidate/v1.0 December 1, 2023 00:33
@zmbc
Collaborator

zmbc commented Dec 1, 2023

Can you confirm/check that the columns you've explicitly called out here are the only ones that contain empty strings, at least for a few shards?
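
One way such a check could be run on a single shard is sketched below. How shards are loaded is not shown in this thread, so the sketch assumes the shard is already a pandas DataFrame; the helper name `columns_with_empty_strings` is hypothetical.

```python
from __future__ import annotations

import pandas as pd


def columns_with_empty_strings(shard: pd.DataFrame) -> list[str]:
    """List the columns of one shard that contain at least one empty string."""
    # Casting to object keeps the comparison uniform across categorical,
    # string, and numeric columns; non-string values simply compare unequal.
    return [col for col in shard.columns if (shard[col].astype(object) == "").any()]


# Tiny made-up shard for illustration.
shard = pd.DataFrame({"a": ["", "x"], "b": ["y", None], "c": [1, 2]})
found = columns_with_empty_strings(shard)
```

Comparing the result against the explicitly listed columns for a few shards would show whether any other column still carries empty strings.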

@albrja albrja force-pushed the mic-4716/po-box-street-details-nans branch from 674d245 to 009f95d Compare December 1, 2023 17:35
) -> pd.DataFrame:
    # Coerce dtypes prior to noising to catch issues early as well as
    # get most columns away from dtype 'category' and into 'object' (strings)
    for col in dataset.columns:
        if cleanse_int_cols and col.name in INT_COLUMNS:
            data[col.name] = cleanse_integer_columns(data[col.name])
        # Coerce emtpy strings to NaNs for mailing address columns that have PO boxes

typo: emtpy

@albrja albrja merged commit ffcad1e into release-candidate/v1.0 Dec 1, 2023
6 checks passed
@albrja albrja deleted the mic-4716/po-box-street-details-nans branch December 1, 2023 20:54
rmudambi pushed a commit that referenced this pull request Jan 19, 2024
Mic-716/Coerce empty strings to np.nan for PO box addresses
Coerces null value from empty string to np.nan for addresses with a PO box.
- *Category*: Bugfix
- *JIRA issue*: [MIC-4716](https://jira.ihme.washington.edu/browse/MIC-4717)

- Coerces null values from empty strings to np.nan for addresses with a PO box. This issue was present in the full USA dataset. Note that the new full USA data did not have this bug, so I suspect the recent pandas 2.1 update may have fixed it.

Testing
Tested on two different full USA datasets; both output correct null values for the mailing address columns.
albrja added a commit that referenced this pull request Feb 12, 2024
albrja added a commit that referenced this pull request Feb 12, 2024
4 participants