Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GEN-867] Add validation rule to check if INT_DOD >= INT_CONTACT #561

Merged
merged 2 commits into from
Apr 28, 2024

Conversation

danlu1
Copy link
Contributor

@danlu1 danlu1 commented Apr 25, 2024

Problem:

Lack of validation check on whether INT_DOD >= INT_CONTACT

Solution:

Add the validation check to compare INT_DOD and INT_CONTACT when both of them exist and not NAs.

Testing:
Unit test has been added.

@danlu1 danlu1 requested a review from a team as a code owner April 25, 2024 22:43
Returns:
pd.Index: The row indices of the row with INT_DOD < INT_CONTACT in the input clinical data
"""
# Generate temp dataframe to handle datatype mismatch in a column

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to limit the amount of in-line comments you add. In a lot of cases it has the unintended affect of reducing readability. The code you wrote does well at telling us what the code is doing. That should be the source of truth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will do.

# Convert INT_DOD and INT_CONTACT to numeric, coercing errors to NaN
temp["INT_DOD"] = pd.to_numeric(temp["INT_DOD"], errors="coerce")
temp["INT_CONTACT"] = pd.to_numeric(temp["INT_CONTACT"], errors="coerce")
# Compare rows with numeric values in both columns and returns comparion results("True"/"False")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This inline comment is misleading. You are only checking for False.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm let me rephrase this.



def _check_int_dod_validity_message(
invalid_int_dod_indices: pd.Index,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the type here a list of pd.Index? The length check below suggests it's an iterator of values.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also just be the pandas specific way of doing things that is not super intuitive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type is pandas.core.indexes.range.RangeIndex. We can either use len or size.
Screenshot 2024-04-26 at 11 31 37 AM

)
if has_int_dod_and_contact:
invalid_int_dod_indices = _check_int_dod_validity(clinicaldf)
errors, warnings = _check_int_dod_validity_message(invalid_int_dod_indices)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:
errors, _
As warnings is not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is adapted from

def _validate_oncotree_code_mapping(
self: "Clinical", clinicaldf: pd.DataFrame, oncotree_mapping: pd.DataFrame
) -> pd.Index:
"""Checks that the oncotree codes in the input clinical
data is a valid oncotree code from the official oncotree site
Args:
clinicaldf (pd.DataFrame): clinical input data to validate
oncotree_mapping (pd.DataFrame): table of official oncotree
mappings
Returns:
pd.Index: row indices of unmapped oncotree codes in the
input clinical data
"""
# Make oncotree codes uppercase (SpCC/SPCC)
clinicaldf["ONCOTREE_CODE"] = (
clinicaldf["ONCOTREE_CODE"].astype(str).str.upper()
)
unmapped_oncotrees = clinicaldf[
(clinicaldf["ONCOTREE_CODE"] != "UNKNOWN")
& ~(clinicaldf["ONCOTREE_CODE"].isin(oncotree_mapping["ONCOTREE_CODE"]))
]
return unmapped_oncotrees.index
def _validate_oncotree_code_mapping_message(
self: "Clinical",
clinicaldf: pd.DataFrame,
unmapped_oncotree_indices: pd.DataFrame,
) -> Tuple[str, str]:
"""This function returns the error and warning messages
if the input clinical data has row indices with unmapped
oncotree codes
Args:
clinicaldf (pd.DataFrame): input clinical data
unmapped_oncotree_indices (pd.DataFrame): row indices of the
input clinical data with unmapped oncotree codes
Returns:
Tuple[str, str]: error message that tells you how many
samples AND the unique unmapped oncotree codes that your
input clinical data has
"""
errors = ""
warnings = ""
if len(unmapped_oncotree_indices) > 0:
# sort the unique unmapped oncotree codes
unmapped_oncotree_codes = sorted(
set(clinicaldf.loc[unmapped_oncotree_indices]["ONCOTREE_CODE"])
)
errors = (
"Sample Clinical File: Please double check that all your "
"ONCOTREE CODES exist in the mapping. You have {} samples "
"that don't map. These are the codes that "
"don't map: {}\n".format(
len(unmapped_oncotree_indices), ",".join(unmapped_oncotree_codes)
)
)
return errors, warnings
and we might need to output warnings in the future.

Copy link

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few formatting suggestions, otherwise the business logic looks good!

Copy link
Contributor

@rxu17 rxu17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just one comment

"Patient Clinical File: Please double check your INT_DOD and INT_CONTACT columns. "
"INT_DOD must be >= INT_CONTACT. "
f"There are {len(invalid_int_dod_indices)} row(s) with INT_DOD < INT_CONTACT. "
f"Row {invalid_int_dod_indices.tolist()} contain invalid values in the INT_DOD field. Please correct.\n"
Copy link
Contributor

@rxu17 rxu17 Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the message is bit misleading because it implies that the INT_DOD field is the issue when it could be INT_CONTACT. You could rephrase this part of the error message to be like:
The row number(s) this occurs in are: {invalid_year_death_indices.tolist()}. Please correct.

I think this is happening for the YEAR_DEATH AND YEAR_CONTACT validation rule function too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ Please fix the above in this PR as well. Thanks!

Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 Fantastic work, and great code reviews! Thanks team - will pre-approve.

Copy link

Quality Gate Passed Quality Gate passed

Issues
8 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@danlu1 danlu1 merged commit 4c0f625 into develop Apr 28, 2024
13 checks passed
@rxu17 rxu17 deleted the gen-867-add-int-dod-validation branch May 1, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants