[GEN-867] Add validation rule to check if INT_DOD >= INT_CONTACT #561

danlu1 · 2024-04-25T22:43:47Z

Problem:

There is no validation check on whether INT_DOD >= INT_CONTACT.

Solution:

Add a validation check that compares INT_DOD and INT_CONTACT when both columns exist and the values are not NA.
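
A minimal sketch of the comparison described above, not the code merged in this PR. The helper name `find_invalid_int_dod` and the sample values are assumptions made for illustration; only the column names and the use of `pd.to_numeric(..., errors="coerce")` come from the diff under review.

```python
import pandas as pd


def find_invalid_int_dod(clinicaldf: pd.DataFrame) -> pd.Index:
    """Return row indices where INT_DOD < INT_CONTACT (hypothetical helper)."""
    # Coerce non-numeric placeholders and missing values to NaN.
    int_dod = pd.to_numeric(clinicaldf["INT_DOD"], errors="coerce")
    int_contact = pd.to_numeric(clinicaldf["INT_CONTACT"], errors="coerce")
    # Any comparison involving NaN evaluates to False, so rows with a
    # missing value in either column are never flagged.
    mask = int_dod < int_contact
    return clinicaldf.loc[mask].index


# Illustrative data only: row 1 violates the rule, rows 2 and 3 are skipped.
clinicaldf = pd.DataFrame(
    {
        "INT_DOD": [100, 50, "Not Applicable", None],
        "INT_CONTACT": [90, 60, 10, 20],
    }
)
invalid = find_invalid_int_dod(clinicaldf)
print(invalid.tolist())  # [1]
```

Relying on NaN comparisons returning False keeps the "both values present" condition implicit, which avoids a separate not-NA filter.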

Testing:
Unit test has been added.

BryanFauble · 2024-04-25T22:58:26Z

genie_registry/clinical.py

+    Returns:
+        pd.Index: The row indices of the row with INT_DOD < INT_CONTACT in the input clinical data
+    """
+    # Generate temp dataframe to handle datatype mismatch in a column


Try to limit the number of inline comments you add. In a lot of cases they have the unintended effect of reducing readability. The code you wrote does well at telling us what it is doing. That should be the source of truth.

Sure. Will do.

BryanFauble · 2024-04-25T22:59:53Z

genie_registry/clinical.py

+    # Convert INT_DOD and INT_CONTACT to numeric, coercing errors to NaN
+    temp["INT_DOD"] = pd.to_numeric(temp["INT_DOD"], errors="coerce")
+    temp["INT_CONTACT"] = pd.to_numeric(temp["INT_CONTACT"], errors="coerce")
+    # Compare rows with numeric values in both columns and returns comparion results("True"/"False")


This inline comment is misleading. You are only checking for False.

Hmmm let me rephrase this.

BryanFauble · 2024-04-25T23:03:34Z

genie_registry/clinical.py

+
+
+def _check_int_dod_validity_message(
+    invalid_int_dod_indices: pd.Index,


Is the type here a list of pd.Index? The length check below suggests it's an iterator of values.

It might also just be the pandas-specific way of doing things, which is not super intuitive.

The type is pandas.core.indexes.range.RangeIndex. We can either use len or size.
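
As a small illustration of the point about `len` and `size`, here is a sketch using made-up data. The DataFrame contents and the variable name `flagged` are assumptions; the exact Index subclass returned after filtering may differ between pandas versions, but both `len()` and `.size` work on any of them.

```python
import pandas as pd

clinicaldf = pd.DataFrame({"INT_DOD": [5, 30], "INT_CONTACT": [10, 20]})
# A freshly built DataFrame gets a RangeIndex by default.
print(type(clinicaldf.index).__name__)  # RangeIndex

# Filtering rows yields a single pd.Index object, not a list of indices.
flagged = clinicaldf[clinicaldf["INT_DOD"] < clinicaldf["INT_CONTACT"]].index
# len() and .size agree for a one-dimensional pandas Index.
print(len(flagged), flagged.size)  # 1 1
print(flagged.tolist())  # [0]
```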

BryanFauble · 2024-04-25T23:08:57Z

genie_registry/clinical.py

+        )
+        if has_int_dod_and_contact:
+            invalid_int_dod_indices = _check_int_dod_validity(clinicaldf)
+            errors, warnings = _check_int_dod_validity_message(invalid_int_dod_indices)


Nit: unpack as errors, _ since warnings is not used.

This is adapted from Genie/genie_registry/clinical.py, lines 475 to 535 in 3f31802:

    def _validate_oncotree_code_mapping(
        self: "Clinical", clinicaldf: pd.DataFrame, oncotree_mapping: pd.DataFrame
    ) -> pd.Index:
        """Checks that the oncotree codes in the input clinical
        data is a valid oncotree code from the official oncotree site

        Args:
            clinicaldf (pd.DataFrame): clinical input data to validate
            oncotree_mapping (pd.DataFrame): table of official oncotree
                mappings

        Returns:
            pd.Index: row indices of unmapped oncotree codes in the
                input clinical data
        """
        # Make oncotree codes uppercase (SpCC/SPCC)
        clinicaldf["ONCOTREE_CODE"] = (
            clinicaldf["ONCOTREE_CODE"].astype(str).str.upper()
        )

        unmapped_oncotrees = clinicaldf[
            (clinicaldf["ONCOTREE_CODE"] != "UNKNOWN")
            & ~(clinicaldf["ONCOTREE_CODE"].isin(oncotree_mapping["ONCOTREE_CODE"]))
        ]
        return unmapped_oncotrees.index

    def _validate_oncotree_code_mapping_message(
        self: "Clinical",
        clinicaldf: pd.DataFrame,
        unmapped_oncotree_indices: pd.DataFrame,
    ) -> Tuple[str, str]:
        """This function returns the error and warning messages
        if the input clinical data has row indices with unmapped
        oncotree codes

        Args:
            clinicaldf (pd.DataFrame): input clinical data
            unmapped_oncotree_indices (pd.DataFrame): row indices of the
                input clinical data with unmapped oncotree codes

        Returns:
            Tuple[str, str]: error message that tells you how many
                samples AND the unique unmapped oncotree codes that your
                input clinical data has
        """
        errors = ""
        warnings = ""
        if len(unmapped_oncotree_indices) > 0:
            # sort the unique unmapped oncotree codes
            unmapped_oncotree_codes = sorted(
                set(clinicaldf.loc[unmapped_oncotree_indices]["ONCOTREE_CODE"])
            )
            errors = (
                "Sample Clinical File: Please double check that all your "
                "ONCOTREE CODES exist in the mapping. You have {} samples "
                "that don't map. These are the codes that "
                "don't map: {}\n".format(
                    len(unmapped_oncotree_indices), ",".join(unmapped_oncotree_codes)
                )
            )
        return errors, warnings

and we might need to output warnings in the future.
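
The trade-off in this exchange can be sketched as follows. `build_messages` is a hypothetical stand-in for a message function that follows the (errors, warnings) return convention; its message text is invented for illustration and is not taken from the repository.

```python
from typing import Tuple

import pandas as pd


def build_messages(invalid_indices: pd.Index) -> Tuple[str, str]:
    """Hypothetical message builder returning (errors, warnings)."""
    errors = ""
    warnings = ""
    if len(invalid_indices) > 0:
        errors = f"{len(invalid_indices)} row(s) failed validation.\n"
    return errors, warnings


# Option A (the nit): discard the second value because it is unused today.
errors, _ = build_messages(pd.Index([2, 5]))

# Option B (the author's reply): keep the name so that a warning added
# later is not silently dropped at the call site.
errors_b, warnings_b = build_messages(pd.Index([]))
```

Either form runs identically; the choice is about signalling intent to the next reader of the call site.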

BryanFauble

A few formatting suggestions, otherwise the business logic looks good!

rxu17

LGTM! Just one comment

rxu17 · 2024-04-26T02:31:46Z

genie_registry/clinical.py

+            "Patient Clinical File: Please double check your INT_DOD and INT_CONTACT columns. "
+            "INT_DOD must be >= INT_CONTACT. "
+            f"There are {len(invalid_int_dod_indices)} row(s) with INT_DOD < INT_CONTACT. "
+            f"Row {invalid_int_dod_indices.tolist()} contain invalid values in the INT_DOD field. Please correct.\n"


This part of the message is a bit misleading because it implies that the INT_DOD field is the issue when it could be INT_CONTACT. You could rephrase this part of the error message along these lines:
The row number(s) this occurs in are: {invalid_year_death_indices.tolist()}. Please correct.

I think this is happening for the YEAR_DEATH AND YEAR_CONTACT validation rule function too.

^ Please fix the above in this PR as well. Thanks!
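
One way the suggested wording could look for the INT_DOD check, as a sketch only. The helper name `int_dod_message` and its single-string return value are assumptions; the message text combines the lines from the diff with the reviewer's proposed sentence, and the final wording in the merged commit may differ.

```python
import pandas as pd


def int_dod_message(invalid_int_dod_indices: pd.Index) -> str:
    """Hypothetical helper: build the error text with neutral wording."""
    if len(invalid_int_dod_indices) == 0:
        return ""
    return (
        "Patient Clinical File: Please double check your INT_DOD and "
        "INT_CONTACT columns. INT_DOD must be >= INT_CONTACT. "
        f"There are {len(invalid_int_dod_indices)} row(s) with "
        "INT_DOD < INT_CONTACT. "
        # Neutral phrasing: does not single out either column as the culprit.
        "The row number(s) this occurs in are: "
        f"{invalid_int_dod_indices.tolist()}. Please correct.\n"
    )


print(int_dod_message(pd.Index([1, 4])), end="")
```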

thomasyu888

🔥 Fantastic work, and great code reviews! Thanks team - will pre-approve.

sonarqubecloud · 2024-04-28T23:57:23Z

Quality Gate passed

Issues
8 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

add validation rule to check if INT_DOD >= INT_CONTACT

73c5e46

danlu1 requested a review from a team as a code owner April 25, 2024 22:43

BryanFauble reviewed Apr 25, 2024

BryanFauble approved these changes Apr 25, 2024

rxu17 approved these changes Apr 26, 2024

thomasyu888 approved these changes Apr 26, 2024

update inline comments and error messages

1787e57

danlu1 merged commit 4c0f625 into develop Apr 28, 2024
13 checks passed

rxu17 deleted the gen-867-add-int-dod-validation branch May 1, 2024 14:57
