Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check for new filenames before merge/subset #856

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sheridancbio
Copy link
Contributor

@sheridancbio sheridancbio commented Jul 19, 2021

check for new filenames before merge/subset

python logic written in separate library
- looks for any filenames present in the current datatype sheet (embedded in script)
- excludes any filenames which were in use before the introduction of updated filenames
- ignore normal sample files (which are ignored by importer)
- determines cancer study id for proper handling of <CANCER_STUDY> placeholder
- matches filenames in directory against wildcard patterns (up to a single asterisk)
perform check from subset-impact-data.sh and merge.py and exit on noticing new filename patterns.

@sheridancbio sheridancbio force-pushed the test_for_new_filenames_in_scripts branch 3 times, most recently from 2e47458 to 0b2cf4c Compare July 20, 2021 14:35
"""

def filename_maps():
datatype_lines = datatype_column_content().splitlines()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially just set this up as lists to start with?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the full datatypes sheet content (tab separated values) is now embedded in the script, and the columns are parsed through tab separation.

Copy link
Collaborator

@averyniceday averyniceday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just hardcode stuff LOL because of reasons we've talked about!

@sheridancbio sheridancbio force-pushed the test_for_new_filenames_in_scripts branch 2 times, most recently from edd2d4f to 0ad4771 Compare July 21, 2021 18:20
Copy link
Collaborator

@averyniceday averyniceday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a check for seg files

@sheridancbio sheridancbio force-pushed the test_for_new_filenames_in_scripts branch from 0ad4771 to 9b8cf01 Compare July 22, 2021 20:03
@sheridancbio
Copy link
Contributor Author

I've added proper checking of wildcard filename patterns .. such as : data_mutations_uniprot_canonical_transcripts_*.txt (a newly added filename pattern)

@sheridancbio sheridancbio force-pushed the test_for_new_filenames_in_scripts branch 8 times, most recently from 8770d31 to bfacf80 Compare July 26, 2021 13:33
for line in file.readlines():
line_without_whitespace = "".join(line.split()).strip()
if line.startswith(CANCER_STUDY_IDENTIFIER_PREFIX):
cancer_study_identifier = line[len(CANCER_STUDY_IDENTIFIER_PREFIX):].strip()
Copy link
Collaborator

@averyniceday averyniceday Jul 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest just splitting the line based on colons and take the second part e.g line.split(':')[1]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like

cancer_study_identifier = line.split(":")[1].strip() (there should be a strip for both front and back)
if cancer_study_identifier:
blahblahblahblabha

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about the possibility of multiple colons on the line (some colons occurring in the value), we would have to join together all parts of the list after the first element, such as
":".join(split_line_fields[1:])

So we chose to recode this by splitting at the position of the first occurrence of ":"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking again, the code already discards all whitespace on the line (from the meta file). It does this before tests are performed, so the prefix "cancer_study_identifier:" would be seen and could be stripped off of the line based on length, which would yield just the value string (minus any spaces). Since cancer studies should have no spaces, there should be no problem. So I think I'll keep the code as is.

@sheridancbio sheridancbio force-pushed the test_for_new_filenames_in_scripts branch 3 times, most recently from 070f42f to 0b61dfb Compare July 26, 2021 20:37
@sheridancbio
Copy link
Contributor Author

We decided together that the effort to properly determine the cancer_study_identifer was not feasible in all use cases. The pdx integration tests for example perform a merge step on derived (subsetted) studies which do not contain the needed properties in meta files. Also, the cancer_study_identifier which is passed into the merge.py script was seen to be the identifier for the output study .. not the input study. So it could not be used to validate the input filenames after all for filenames of pattern <CANCER_STUDY>_data_cna_hg19.seg. So we have punted on detection of "correctness" based on the cancer_study_identifier. The check was still valuable for other filename patterns however, so it is being merged as a check for merge.py and subset_impact_data.sh - without the processing of cancer_study_identifier.

returns 0 if novel/updated filenames are preset, 1 otherwise
"""
if study_directory_uses_updated_filenames(args.cancer_study_path):
sys.exit(0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

swap these two exit codes

# check for presence of new filename pattens (not yet supported) and fail if present

$PYTHON_BINARY $PORTAL_SCRIPTS_DIRECTORY/study_directory_uses_updated_filenames.py -d $INPUT_DIRECTORY
if [ $? -eq 0 ] ; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking for -eq 0 (which is technically working right now for the main test, but we still have a nonzero exit code for a failed test in one of the other checks)

to keep it simple, all failed tests (meaning new files present, or anything else) should stick to a non-zero exit code and the check can be for -ne 0

Copy link
Collaborator

@averyniceday averyniceday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix exit codes!

python logic written in separate library
- looks for any filenames present in the current datatype sheet (embedded in script)
- excludes any filenames which were in use before the introduction of updated filenames
- ignore normal sample files (which are ignored by importer)
- matches filenames in directory against wildcard patterns (up to a single asterisk)
- all filenames of the form *_data_cna_hg18.seg will be allowed (hg18/hg19 and data/meta) - these will not prevent the merge/subset scripts from running (determining cancer_study_id in all cases was not feasible)
perform check from subset-impact-data.sh and merge.py and exit on noticing new filename patterns present.
@sheridancbio sheridancbio force-pushed the test_for_new_filenames_in_scripts branch from 0b61dfb to 47c8c67 Compare July 27, 2021 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants