-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged tool #29
base: master
Are you sure you want to change the base?
Merged tool #29
Conversation
code. made an extra file for the sampler version
for a complete Tool
vxdetector/VXdetector.py
Outdated
def sample_fastq(file_path, sample_size, sampled_indices=None): | ||
'''Get random parts from the FASTQ file based on shared indices for paired-end reads.''' | ||
sampled_reads = [] # List to hold sampled reads | ||
open_func = gzip.open if file_path.endswith('.gz') else open | ||
|
||
# Count total reads in the file by iterating line by line | ||
with open_func(file_path, 'rt') as f: | ||
total_reads = sum(1 for _ in f) // 4 | ||
|
||
# Adjust sample_size if it's greater than total_reads | ||
if sample_size > total_reads: | ||
sample_size = total_reads | ||
|
||
if sampled_indices is None: | ||
sampled_indices = sorted(random.sample(range(total_reads), sample_size)) | ||
|
||
with open_func(file_path, 'rt') as f: | ||
read_idx = 0 | ||
read_buffer = [] | ||
for line in f: | ||
read_buffer.append(line) | ||
if len(read_buffer) == 4: # Once we have a full read | ||
if read_idx in sampled_indices: | ||
sampled_reads.extend(read_buffer) # Add the read if it matches the sampled index | ||
read_buffer = [] # Clear the buffer for the next read | ||
read_idx += 1 # Move to the next read | ||
|
||
# Stop if we've collected enough reads | ||
if len(sampled_reads) >= sample_size * 4: | ||
break | ||
|
||
return sampled_reads, sampled_indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should first merge my PR #28 to get a much faster version of this function + documentation!
open_func = gzip.open if output_path.endswith('.gz') else open # Use gzip if output is .gz | ||
|
||
with open_func(output_path, 'wt') as f: # Write in text mode ('wt') | ||
f.writelines(sampled_reads) | ||
|
||
|
||
def do_statistic(result): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's hard for me to see what these changes are good for? Is this "just" refactoring the code OR also changing the functionality. You might want to make sure the github actions pass as they should include some unit tests
the sample_fastq method from the memory_efficient_ sampling branch. Sampled indices is no longer used in sample_fastq, yet needed for the workflow to funtion.
Please remove the .DS files using .gitignore so we don't track them anymore. Additionally, do we need to track the testlog? There seems to be a module missing ("imp") according to the Git Actions output. In #30 Stefan mentioned nosetest maybe being substituted with pynose. Maybe that might solve our problem? |
It seems like pynose can't be installed using conda. You will most likely have to change parts of the test.yml
|
Hi @Landbarsch ich sehe, dass Du fleißig versuchst die Tests zu korrigieren. Es scheint gar nicht so einfach zu sein. Vielleicht liegt es daran, dass Du zu viele Änderungen in einem versuchst zu beherrschen. Ich sehe z.B. Änderungen in den Funktionen |
Merged the Subsampling and Multiprocessing Aspects into one Skript