Retrieving multiple fields at once? #313

Open
rajwanir opened this issue Jan 27, 2025 · 3 comments

Comments

@rajwanir

Thanks for cyvcf2. It is really fast for a Python API for handling VCF.

Is there a way to retrieve multiple fields at once with variant.format("X"), by passing a list or by any other means? For example, variant.format(["X","Y"]).

I observe that retrieving INFO and genotype fields is really fast. However, FORMAT fields are slow, especially on VCF/BCF files with a large number of samples (100,000+).

For example:

timeit(lambda: [variant.INFO.get("ALLELE_A"),variant.INFO.get("ALLELE_B"),variant.genotypes],number=1)
2.5551998987793922e-05

However,

timeit(lambda: [variant.INFO.get("ALLELE_A"),variant.INFO.get("ALLELE_B"),variant.genotypes,variant.format("X")],number=1)
0.0012662489898502827

I understand some of these limitations might be inherent to htslib. Any chance that it may be feasible to at least retrieve multiple format fields for a sample at once?

@brentp
Owner

brentp commented Jan 28, 2025

hi, yes, this is due to how stuff is stored in htslib (and in part due to how it's accessed by cyvcf2).
when you read variant.format('X') with 100K samples, it will make a numpy array of length 100K.

it's possible to avoid some overhead if you want only a single sample, but that is not implemented in cyvcf2.
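
A minimal sketch of a workaround for the list-style call asked about above, assuming only that variant.format(name) takes one field name per call, as used in this thread. The helper name format_fields is hypothetical and not part of cyvcf2. It does not avoid the per-field cost described here (each field still builds its own array across all samples); it only gathers the results in one place.

```python
def format_fields(variant, names):
    """Collect several FORMAT fields from one record into a dict.

    Hypothetical helper, not part of cyvcf2: it simply calls
    variant.format(name) once per requested field, so the cost still
    grows with the number of fields and the number of samples.
    """
    return {name: variant.format(name) for name in names}


# Usage with cyvcf2 (assumes a file "example.bcf" with FORMAT fields X and Y):
# from cyvcf2 import VCF
# for variant in VCF("example.bcf"):
#     fields = format_fields(variant, ["X", "Y"])
#     x, y = fields["X"], fields["Y"]
```

Returning a dict keeps each array addressable by field name, so downstream code does not depend on the order in which fields were requested.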

@rajwanir
Author

Thanks @brentp for the clarification. Yeah, I tried retrieving a single sample using the VCF(vcf_file, samples=..) option but didn't see much improvement.

Retrieving sample format fields from large BCF files becomes a performance pain point (100,000+ samples * 700,00 variants). I could parallelize by records to speed this up; however, loading everything overloads memory. I am still looking into ideas for doing such retrieval more efficiently.

The built-in threads option simply sets compression/decompression threads, right? I didn't see any difference with that either.

Thank you.

@brentp
Owner

brentp commented Jan 30, 2025

yes, the threads are for decompression. to get more speed you'll need an alternative data format, not BCF.
also, BCF will be much faster than VCF but still rough for that many samples.
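
On the idea of parallelizing by records without holding everything in memory: one option is to split each contig into fixed-size windows and let each worker open the file itself and read only its own window. The sketch below only builds the region strings. The name region_windows and the window size are illustrative assumptions, and the commented usage assumes an indexed file that cyvcf2 can query by region and a numeric FORMAT field X present in every record. This spreads the work across processes but does not change the per-record cost of format() discussed above.

```python
def region_windows(contig, length, window):
    """Yield 1-based, inclusive region strings such as 'chr1:1-1000000'.

    Hypothetical helper: splits one contig of the given length into
    consecutive windows of at most `window` bases.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    start = 1
    while start <= length:
        end = min(start + window - 1, length)
        yield f"{contig}:{start}-{end}"
        start = end + 1


# Usage sketch (assumptions: indexed "example.bcf", FORMAT field X):
# from multiprocessing import Pool
# from cyvcf2 import VCF
#
# def summarize(region):
#     vcf = VCF("example.bcf")  # each worker opens its own handle
#     # Reduce every record to a small summary instead of keeping arrays.
#     return [(v.CHROM, v.POS, float(v.format("X").mean())) for v in vcf(region)]
#
# if __name__ == "__main__":
#     regions = list(region_windows("chr1", 248_956_422, 5_000_000))
#     with Pool(4) as pool:
#         results = pool.map(summarize, regions)
```

Each worker returns only a small summary per record, so peak memory is bounded by one window's results rather than the whole file. Records that span a window boundary may be returned by both neighbouring windows, so the summaries may need de-duplicating by position.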
