Retrieving multiple fields at once? #313

rajwanir · 2025-01-27T22:05:26Z

Thanks for cyvcf2. It is really fast for a Python API for handling VCF.

Is there a way to retrieve multiple fields at once with the variant.format("X") function, by passing a list or in some other way? For example, variant.format(["X","Y"]).

I observe that retrieving INFO and genotype fields is really fast. However, FORMAT fields are slow, especially on VCF/BCF files with a large number of samples (100,000+).

For example:

timeit(lambda: [variant.INFO.get("ALLELE_A"),variant.INFO.get("ALLELE_B"),variant.genotypes],number=1)
2.5551998987793922e-05

However,

timeit(lambda: [variant.INFO.get("ALLELE_A"),variant.INFO.get("ALLELE_B"),variant.genotypes,variant.format("X")],number=1)
0.0012662489898502827

I understand some of these limitations might be inherent to htslib. Any chance that it may be feasible to at least retrieve multiple format fields for a sample at once?


brentp · 2025-01-28T17:59:39Z

hi, yes, this is due to how stuff is stored in htslib (and in part due to how it's accessed by cyvcf2).
when you read variant.format('X') with 100K samples, it will make a numpy array of length 100K.

it's possible to avoid some overhead if you want only a single sample, but that is not implemented in cyvcf2.
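For reference, a user-side wrapper for the variant.format(["X","Y"]) idea might look like the minimal sketch below. The name format_fields is hypothetical and not part of cyvcf2; it only assumes the object has the variant.format(name) method used above. It calls format() once per field, so it is a convenience only and does not remove the per-field cost of building a full-length array for every sample.

```python
def format_fields(variant, names):
    """Return {field_name: values} for each FORMAT field in `names`.

    Hypothetical helper, not a cyvcf2 API. It calls variant.format(name)
    once per field, so the total cost still grows with the number of
    fields and the number of samples.
    """
    return {name: variant.format(name) for name in names}


# Possible usage with cyvcf2 (the file path is a placeholder):
#
#   from cyvcf2 import VCF
#   for variant in VCF("example.bcf"):
#       fields = format_fields(variant, ["X", "Y"])
#       x, y = fields["X"], fields["Y"]
```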

rajwanir · 2025-01-28T20:30:30Z

Thanks @brentp for the clarification. Yeah, I tried retrieving a single sample using the VCF(vcf_file, samples=..) option but didn't see much improvement.

Retrieving sample format fields from large BCF files becomes a performance pain point (100,000+ samples * 700,00 variants). I could parallelize by records to speed things up; however, loading everything at once overloads memory. I am still looking into ideas for doing this retrieval more efficiently.

The built-in threads option is simply compression/decompression threads, right? I didn't see any difference with that either.
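One way to keep memory bounded while still handing work out in pieces is to consume the records in fixed-size batches instead of loading everything. The sketch below is a generic illustration only: batched_records is a hypothetical helper, not something cyvcf2 provides, and it works on any iterable. Whether cyvcf2 Variant objects can be passed between processes is not established here, so it is probably safer to extract plain values from each record before batching.

```python
from itertools import islice


def batched_records(records, batch_size):
    """Yield lists of at most `batch_size` items from any iterable.

    Only one batch is held in memory at a time, so peak memory depends
    on batch_size rather than on the total number of records.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Possible usage (process_batch is a hypothetical user function):
#
#   for batch in batched_records(extracted_values, 1000):
#       process_batch(batch)
```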

Thank you.

brentp · 2025-01-30T17:28:33Z

yes, the threads are for decompression. to get more speed you'll need an alternative data format, not BCF.
also, BCF will be much faster than VCF but still rough for that many samples.
