Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged tool #29

Open
wants to merge 39 commits into
base: master
Choose a base branch
from
Open

Merged tool #29

wants to merge 39 commits into from

Conversation

Landbarsch
Copy link
Collaborator

Merged the Subsampling and Multiprocessing Aspects into one Skript

.DS_Store Outdated Show resolved Hide resolved
vxdetector/testlog Outdated Show resolved Hide resolved
Comment on lines 34 to 65
def sample_fastq(file_path, sample_size, sampled_indices=None):
'''Get random parts from the FASTQ file based on shared indices for paired-end reads.'''
sampled_reads = [] # List to hold sampled reads
open_func = gzip.open if file_path.endswith('.gz') else open

# Count total reads in the file by iterating line by line
with open_func(file_path, 'rt') as f:
total_reads = sum(1 for _ in f) // 4

# Adjust sample_size if it's greater than total_reads
if sample_size > total_reads:
sample_size = total_reads

if sampled_indices is None:
sampled_indices = sorted(random.sample(range(total_reads), sample_size))

with open_func(file_path, 'rt') as f:
read_idx = 0
read_buffer = []
for line in f:
read_buffer.append(line)
if len(read_buffer) == 4: # Once we have a full read
if read_idx in sampled_indices:
sampled_reads.extend(read_buffer) # Add the read if it matches the sampled index
read_buffer = [] # Clear the buffer for the next read
read_idx += 1 # Move to the next read

# Stop if we've collected enough reads
if len(sampled_reads) >= sample_size * 4:
break

return sampled_reads, sampled_indices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should first merge my PR #28 to get a much faster version of this function + documentation!

open_func = gzip.open if output_path.endswith('.gz') else open # Use gzip if output is .gz

with open_func(output_path, 'wt') as f: # Write in text mode ('wt')
f.writelines(sampled_reads)


def do_statistic(result):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's hard for me to see what these changes are good for? Is this "just" refactoring the code OR also changing the functionality. You might want to make sure the github actions pass as they should include some unit tests

the sample_fastq method from the memory_efficient_
sampling branch. Sampled indices is no longer used
in sample_fastq, yet needed for the workflow to
funtion.
@Anna-Rehm
Copy link
Collaborator

Please remove the .DS files using .gitignore so we don't track them anymore. Additionally, do we need to track the testlog? There seems to be a module missing ("imp") according to the Git Actions output. In #30 Stefan mentioned nosetest maybe being substituted with pynose. Maybe that might solve our problem?

@Anna-Rehm
Copy link
Collaborator

It seems like pynose can't be installed using conda. You will most likely have to change parts of the test.yml

- name: Install dependencies
and add the installation of the package via pip + remove the respective package from the environment.yml at the same time.

@Landbarsch Landbarsch marked this pull request as ready for review January 31, 2025 15:41
@sjanssen2
Copy link
Member

Hi @Landbarsch ich sehe, dass Du fleißig versuchst die Tests zu korrigieren. Es scheint gar nicht so einfach zu sein. Vielleicht liegt es daran, dass Du zu viele Änderungen in einem versuchst zu beherrschen.

Ich sehe z.B. Änderungen in den Funktionen workflow und do_statistic. Aber haben die direkt etwas miteinander zu tun? Vielleicht machst Du erstmal noch einen neuen Branch auf, in dem Du nur die Änderungen an do_statistic versuchst einzupflegen. Wenn das klappt, könnte man diese Änderungen in den Branch hier hinein mergen und hätte einen übersichtlicheren Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants