check for new filenames before merge/subset #856

sheridancbio · 2021-07-19T16:10:58Z

check for new filenames before merge/subset

Python logic is written in a separate library, which:
- looks for any filenames present in the current datatype sheet (embedded in the script)
- excludes any filenames which were in use before the introduction of updated filenames
- ignores normal sample files (which are ignored by the importer)
- determines the cancer study id for proper handling of the <CANCER_STUDY> placeholder
- matches filenames in the directory against wildcard patterns (up to a single asterisk)

The check is performed from subset-impact-data.sh and merge.py, which exit on noticing new filename patterns.

import-scripts/subset-impact-data.sh

averyniceday · 2021-07-20T18:53:18Z

import-scripts/updated_filename_for_datatype_test.py

+"""
+
+def filename_maps():
+    datatype_lines = datatype_column_content().splitlines()


potentially just set this up as lists to start with?

the full datatypes sheet content (tab separated values) is now embedded in the script, and the columns are parsed through tab separation.
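
As a rough illustration of the embedded-sheet approach, the sketch below parses tab-separated content held in a string into one dict per row. The sheet content, the column names, and the function name `parse_embedded_sheet` are all made up for this example and are not taken from the actual script.

```python
# Hypothetical stand-in for the embedded datatypes sheet (tab-separated values).
# Column names and rows are illustrative assumptions only.
EMBEDDED_SHEET = (
    "DATATYPE\tOLD_FILENAME\tNEW_FILENAME\n"
    "mutation\tdata_mutations_extended.txt\tdata_mutations.txt\n"
    "cna\tdata_CNA.txt\tdata_cna.txt\n"
)


def parse_embedded_sheet(content):
    """Return a list of dicts, one per data row, keyed by the header row."""
    # Skip blank lines so a trailing newline does not produce an empty row.
    lines = [line for line in content.splitlines() if line.strip()]
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


rows = parse_embedded_sheet(EMBEDDED_SHEET)
new_filenames = [row["NEW_FILENAME"] for row in rows]
```

Keeping the sheet as text and splitting on tabs means an updated sheet can be pasted in wholesale, at the cost of a small parsing step compared with hardcoded lists.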

import-scripts/updated_filename_for_datatype_test.py

averyniceday

Just hardcode stuff LOL because of reasons we've talked about!

averyniceday

add a check for seg files

sheridancbio · 2021-07-22T20:06:51Z

I've added proper checking of wildcard filename patterns, such as data_mutations_uniprot_canonical_transcripts_*.txt (a newly added filename pattern).
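
For illustration, matching against a pattern with at most one asterisk can be done with a prefix and suffix comparison. This is only a sketch under that assumption; the function name `matches_pattern` is hypothetical and the code in this PR may differ.

```python
def matches_pattern(filename, pattern):
    """Match a filename against a pattern containing at most one '*'."""
    if pattern.count("*") > 1:
        raise ValueError("only a single asterisk is supported: " + pattern)
    if "*" not in pattern:
        # No wildcard: the filename must equal the pattern exactly.
        return filename == pattern
    prefix, suffix = pattern.split("*")
    # The length test keeps prefix and suffix from overlapping in the filename.
    return (
        len(filename) >= len(prefix) + len(suffix)
        and filename.startswith(prefix)
        and filename.endswith(suffix)
    )
```

With the pattern from the comment above, `matches_pattern("data_mutations_uniprot_canonical_transcripts_abc.txt", "data_mutations_uniprot_canonical_transcripts_*.txt")` is true, while a plain `data_mutations.txt` does not match.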

import-scripts/subset-impact-data.sh

import-scripts/study_directory_uses_updated_filenames.py

averyniceday · 2021-07-26T15:04:40Z

import-scripts/study_directory_uses_updated_filenames.py

+    for line in file.readlines():
+        line_without_whitespace = "".join(line.split()).strip()
+        if line.startswith(CANCER_STUDY_IDENTIFIER_PREFIX):
+            cancer_study_identifier = line[len(CANCER_STUDY_IDENTIFIER_PREFIX):].strip()


I'd suggest just splitting the line on colons and taking the second part, e.g. line.split(':')[1]

something like

cancer_study_identifier = line.split(":")[1].strip()
if cancer_study_identifier: blahblahblahblabha

(there should be a strip for both front and back)

Thinking about the possibility of multiple colons on the line (some colons occurring in the value), we would have to join together all parts of the list after the first element, such as
":".join(split_line_fields[1:])

So we chose to recode this by splitting at the position of the first occurrence of ":"
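
For reference, splitting at the first colon could look like the sketch below, which uses `str.partition` so that any later colons stay in the value. The function name and the `KEY` constant are hypothetical and not part of the reviewed script.

```python
KEY = "cancer_study_identifier"  # hypothetical constant name for this sketch


def parse_cancer_study_identifier(line):
    """Return the value after the first ':' when the key matches, else None."""
    # partition splits only at the first ':'; later colons remain in value.
    key, separator, value = line.partition(":")
    if not separator or key.strip() != KEY:
        return None
    # Strip both ends; treat an empty value as missing.
    return value.strip() or None
```

For example, `parse_cancer_study_identifier("cancer_study_identifier: a:b")` keeps the embedded colon and yields `"a:b"`.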

After looking again, the code already discards all whitespace on the line (from the meta file). It does this before any tests are performed, so the prefix "cancer_study_identifier:" would be seen and could be stripped off of the line based on its length, which would yield just the value string (minus any spaces). Since cancer study identifiers should contain no spaces, there should be no problem. So I think I'll keep the code as is.

import-scripts/study_directory_uses_updated_filenames.py

sheridancbio · 2021-07-26T21:10:56Z

We decided together that properly determining the cancer_study_identifier was not feasible in all use cases. The pdx integration tests, for example, perform a merge step on derived (subsetted) studies which do not contain the needed properties in meta files. Also, the cancer_study_identifier which is passed into the merge.py script was seen to be the identifier for the output study, not the input study, so it could not be used to validate input filenames of the pattern <CANCER_STUDY>_data_cna_hg19.seg after all. So we have punted on detection of "correctness" based on the cancer_study_identifier. The check is still valuable for other filename patterns, however, so it is being merged as a check for merge.py and subset-impact-data.sh, without the processing of cancer_study_identifier.

averyniceday · 2021-07-27T13:56:23Z

import-scripts/study_directory_uses_updated_filenames.py

+    returns 0 if novel/updated filenames are preset, 1 otherwise
+    """
+    if study_directory_uses_updated_filenames(args.cancer_study_path):
+        sys.exit(0);


swap these two exit codes

averyniceday · 2021-07-27T13:57:23Z

import-scripts/subset-impact-data.sh

+# check for presence of new filename pattens (not yet supported) and fail if present
+
+$PYTHON_BINARY $PORTAL_SCRIPTS_DIRECTORY/study_directory_uses_updated_filenames.py -d $INPUT_DIRECTORY
+if [ $? -eq 0 ] ; then


This checks for -eq 0, which technically works right now for the main test, but we still return a nonzero exit code for a failed test in one of the other checks.

To keep it simple, all failed tests (meaning new files present, or anything else) should stick to a non-zero exit code, and the check can be for -ne 0.
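
One way to picture the suggested convention is the sketch below, where only a clean pass maps to exit code 0. The constant names, the function name, and the use of a callable as a stand-in for the real check are assumptions made for illustration.

```python
EXIT_OK = 0             # no new filename patterns found
EXIT_NEW_FILENAMES = 1  # new filename patterns present
EXIT_ERROR = 2          # anything else went wrong (for example, unreadable directory)


def exit_code_for_directory(directory, uses_updated_filenames):
    """Map the check outcome to an exit code; only a clean pass returns 0.

    `uses_updated_filenames` is a callable standing in for the real check.
    """
    try:
        found = uses_updated_filenames(directory)
    except OSError:
        return EXIT_ERROR
    return EXIT_NEW_FILENAMES if found else EXIT_OK
```

A script following this sketch would pass the result to `sys.exit`, and the calling shell script would then only need to test `$? -ne 0` to catch both new filenames and any other failure.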

averyniceday

fix exit codes!

python logic written in separate library
- looks for any filenames present in the current datatype sheet (embedded in script)
- excludes any filenames which were in use before the introduction of updated filenames
- ignore normal sample files (which are ignored by importer)
- matches filenames in directory against wildcard patterns (up to a single asterisk)
- all filenames of the form *_data_cna_hg18.seg will be allowed (hg18/hg19 and data/meta)
  - these will not prevent the merge/subset scripts from running (determining cancer_study_id in all cases was not feasible)
perform check from subset-impact-data.sh and merge.py and exit on noticing new filename patterns present.

sheridancbio added enhancement import-scripts labels Jul 19, 2021

sheridancbio requested a review from averyniceday July 19, 2021 16:18

sheridancbio force-pushed the test_for_new_filenames_in_scripts branch 3 times, most recently from 2e47458 to 0b2cf4c July 20, 2021 14:35

averyniceday reviewed Jul 20, 2021

import-scripts/subset-impact-data.sh

averyniceday reviewed Jul 20, 2021

import-scripts/updated_filename_for_datatype_test.py