-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
check for new filenames before merge/subset #856
base: master
Are you sure you want to change the base?
check for new filenames before merge/subset #856
Conversation
sheridancbio
commented
Jul 19, 2021
•
edited
Loading
edited
2e47458
to
0b2cf4c
Compare
""" | ||
|
||
def filename_maps(): | ||
datatype_lines = datatype_column_content().splitlines() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potentially just set this up as lists to start with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the full datatypes sheet content (tab separated values) is now embedded in the script, and the columns are parsed through tab separation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just hardcode stuff LOL because of reasons we've talked about!
edd2d4f
to
0ad4771
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a check for seg files
0ad4771
to
9b8cf01
Compare
I've added proper checking of wildcard filename patterns .. such as : data_mutations_uniprot_canonical_transcripts_*.txt (a newly added filename pattern) |
8770d31
to
bfacf80
Compare
for line in file.readlines(): | ||
line_without_whitespace = "".join(line.split()).strip() | ||
if line.startswith(CANCER_STUDY_IDENTIFIER_PREFIX): | ||
cancer_study_identifier = line[len(CANCER_STUDY_IDENTIFIER_PREFIX):].strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest just splitting the line based on colons and take the second part e.g line.split(':')[1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like
cancer_study_identifier = line.split(":")[1].strip() (there should be a strip for both front and back)
if cancer_study_identifier:
blahblahblahblabha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about the possibility of multiple colons on the line (some colons occurring in the value), we would have to join together all parts of the list after the first element, such as
":".join(split_line_fields[1:])
So we chose to recode this by splitting at the position of the first occurrence of ":"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking again, the code already discards all whitespace on the line (from the meta file). It does this before tests are performed, so the prefix "cancer_study_identifier:" would be seen and could be stripped off of the line based on length, which would yield just the value string (minus any spaces). Since cancer studies should have no spaces, there should be no problem. So I think I'll keep the code as is.
070f42f
to
0b61dfb
Compare
We decided together that the effort to properly determine the cancer_study_identifer was not feasible in all use cases. The pdx integration tests for example perform a merge step on derived (subsetted) studies which do not contain the needed properties in meta files. Also, the cancer_study_identifier which is passed into the merge.py script was seen to be the identifier for the output study .. not the input study. So it could not be used to validate the input filenames after all for filenames of pattern <CANCER_STUDY>_data_cna_hg19.seg. So we have punted on detection of "correctness" based on the cancer_study_identifier. The check was still valuable for other filename patterns however, so it is being merged as a check for merge.py and subset_impact_data.sh - without the processing of cancer_study_identifier. |
returns 0 if novel/updated filenames are preset, 1 otherwise | ||
""" | ||
if study_directory_uses_updated_filenames(args.cancer_study_path): | ||
sys.exit(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
swap these two exit codes
import-scripts/subset-impact-data.sh
Outdated
# check for presence of new filename pattens (not yet supported) and fail if present | ||
|
||
$PYTHON_BINARY $PORTAL_SCRIPTS_DIRECTORY/study_directory_uses_updated_filenames.py -d $INPUT_DIRECTORY | ||
if [ $? -eq 0 ] ; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking for -eq 0 (which is technically working right now for the main test, but we still have a nonzero exit code for a failed test in one of the other checks)
to keep it simple, all failed tests (meaning new files present, or anything else) should stick to a non-zero exit code and the check can be for -ne 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix exit codes!
python logic written in separate library - looks for any filenames present in the current datatype sheet (embedded in script) - excludes any filenames which were in use before the introduction of updated filenames - ignore normal sample files (which are ignored by importer) - matches filenames in directory against wildcard patterns (up to a single asterisk) - all filenames of the form *_data_cna_hg18.seg will be allowed (hg18/hg19 and data/meta) - these will not prevent the merge/subset scripts from running (determining cancer_study_id in all cases was not feasible) perform check from subset-impact-data.sh and merge.py and exit on noticing new filename patterns present.
0b61dfb
to
47c8c67
Compare