-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GEN-867] Add validation rule to check if INT_DOD >= INT_CONTACT #561
Conversation
Returns: | ||
pd.Index: The row indices of the row with INT_DOD < INT_CONTACT in the input clinical data | ||
""" | ||
# Generate temp dataframe to handle datatype mismatch in a column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to limit the amount of in-line comments you add. In a lot of cases it has the unintended affect of reducing readability. The code you wrote does well at telling us what the code is doing. That should be the source of truth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will do.
genie_registry/clinical.py
Outdated
# Convert INT_DOD and INT_CONTACT to numeric, coercing errors to NaN | ||
temp["INT_DOD"] = pd.to_numeric(temp["INT_DOD"], errors="coerce") | ||
temp["INT_CONTACT"] = pd.to_numeric(temp["INT_CONTACT"], errors="coerce") | ||
# Compare rows with numeric values in both columns and returns comparion results("True"/"False") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This inline comment is misleading. You are only checking for False.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm let me rephrase this.
|
||
|
||
def _check_int_dod_validity_message( | ||
invalid_int_dod_indices: pd.Index, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the type here a list of pd.Index? The length check below suggests it's an iterator of values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might also just be the pandas specific way of doing things that is not super intuitive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) | ||
if has_int_dod_and_contact: | ||
invalid_int_dod_indices = _check_int_dod_validity(clinicaldf) | ||
errors, warnings = _check_int_dod_validity_message(invalid_int_dod_indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
errors, _
As warnings is not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is adapted from
Genie/genie_registry/clinical.py
Lines 475 to 535 in 3f31802
def _validate_oncotree_code_mapping( | |
self: "Clinical", clinicaldf: pd.DataFrame, oncotree_mapping: pd.DataFrame | |
) -> pd.Index: | |
"""Checks that the oncotree codes in the input clinical | |
data is a valid oncotree code from the official oncotree site | |
Args: | |
clinicaldf (pd.DataFrame): clinical input data to validate | |
oncotree_mapping (pd.DataFrame): table of official oncotree | |
mappings | |
Returns: | |
pd.Index: row indices of unmapped oncotree codes in the | |
input clinical data | |
""" | |
# Make oncotree codes uppercase (SpCC/SPCC) | |
clinicaldf["ONCOTREE_CODE"] = ( | |
clinicaldf["ONCOTREE_CODE"].astype(str).str.upper() | |
) | |
unmapped_oncotrees = clinicaldf[ | |
(clinicaldf["ONCOTREE_CODE"] != "UNKNOWN") | |
& ~(clinicaldf["ONCOTREE_CODE"].isin(oncotree_mapping["ONCOTREE_CODE"])) | |
] | |
return unmapped_oncotrees.index | |
def _validate_oncotree_code_mapping_message( | |
self: "Clinical", | |
clinicaldf: pd.DataFrame, | |
unmapped_oncotree_indices: pd.DataFrame, | |
) -> Tuple[str, str]: | |
"""This function returns the error and warning messages | |
if the input clinical data has row indices with unmapped | |
oncotree codes | |
Args: | |
clinicaldf (pd.DataFrame): input clinical data | |
unmapped_oncotree_indices (pd.DataFrame): row indices of the | |
input clinical data with unmapped oncotree codes | |
Returns: | |
Tuple[str, str]: error message that tells you how many | |
samples AND the unique unmapped oncotree codes that your | |
input clinical data has | |
""" | |
errors = "" | |
warnings = "" | |
if len(unmapped_oncotree_indices) > 0: | |
# sort the unique unmapped oncotree codes | |
unmapped_oncotree_codes = sorted( | |
set(clinicaldf.loc[unmapped_oncotree_indices]["ONCOTREE_CODE"]) | |
) | |
errors = ( | |
"Sample Clinical File: Please double check that all your " | |
"ONCOTREE CODES exist in the mapping. You have {} samples " | |
"that don't map. These are the codes that " | |
"don't map: {}\n".format( | |
len(unmapped_oncotree_indices), ",".join(unmapped_oncotree_codes) | |
) | |
) | |
return errors, warnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few formatting suggestions, otherwise the business logic looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just one comment
genie_registry/clinical.py
Outdated
"Patient Clinical File: Please double check your INT_DOD and INT_CONTACT columns. " | ||
"INT_DOD must be >= INT_CONTACT. " | ||
f"There are {len(invalid_int_dod_indices)} row(s) with INT_DOD < INT_CONTACT. " | ||
f"Row {invalid_int_dod_indices.tolist()} contain invalid values in the INT_DOD field. Please correct.\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the message is bit misleading because it implies that the INT_DOD
field is the issue when it could be INT_CONTACT
. You could rephrase this part of the error message to be like:
The row number(s) this occurs in are: {invalid_year_death_indices.tolist()}. Please correct.
I think this is happening for the YEAR_DEATH
AND YEAR_CONTACT
validation rule function too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ Please fix the above in this PR as well. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 Fantastic work, and great code reviews! Thanks team - will pre-approve.
|
Problem:
Lack of validation check on whether INT_DOD >= INT_CONTACT
Solution:
Add the validation check to compare INT_DOD and INT_CONTACT when both of them exist and not NAs.
Testing:
Unit test has been added.