Thanks for cyvcf2. It is really fast for a Python API for handling VCF.
Is there a way to retrieve multiple fields at once with the variant.format("X") function, using a list or any other way? For example, variant.format(["X","Y"]).
I observe that retrieving INFO and genotype fields is really fast. However, format fields are slow, especially on VCF/BCF files with a large number of samples (100,000+).
I understand some of these limitations might be inherent to htslib. Any chance that it may be feasible to at least retrieve multiple format fields for a sample at once?
hi, yes, this is due to how stuff is stored in htslib (and in part due to how it's accessed by cyvcf2).
when you read variant.format('X') with 100K samples, it will make a numpy array of length 100K.
it's possible to avoid some overhead if you want only a single sample, but that is not implemented in cyvcf2.
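As a stopgap for the original question, a minimal sketch: since variant.format is called with one field name at a time, a small wrapper can collect several fields into a dict. `format_fields` is a hypothetical helper, not part of cyvcf2, and the file name in the usage comment is made up. It does not avoid the per-field cost described above (each call still builds an array covering every sample); it only tidies the call site.

```python
def format_fields(variant, names):
    """Return {name: variant.format(name)} for each requested FORMAT field.

    `variant` is any object with a cyvcf2-style `format(name)` method.
    This calls `format` once per name, so the cost grows with the
    number of fields requested.
    """
    return {name: variant.format(name) for name in names}


# Usage sketch (assumes cyvcf2 is installed and "example.bcf" exists):
# from cyvcf2 import VCF
# for variant in VCF("example.bcf"):
#     fields = format_fields(variant, ["X", "Y"])
```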
Thanks @brentp for the clarification. Yeah, I tried retrieving a single sample using the VCF(vcf_file, samples=..) option but didn't see much improvement.
Retrieving sample format fields from large BCF files becomes a performance pain point (100,000+ samples * 700,00 variants). I could parallelize by records to speed things up, but loading everything at once overloads memory. I am still looking into whether such retrieval can be done more efficiently.
The built-in threads option is simply compression/decompression threads, right? I didn't see any difference with that either.
yes, the threads are for decompression. to get more speed you'll need an alternative data format, not BCF.
also, BCF will be much faster than VCF but still rough for that many samples.
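On the memory concern raised earlier in the thread (parallelizing by records without loading everything), one possible sketch is to stream records in fixed-size batches so only one batch is held at a time. `batched_records` is a hypothetical helper, not something cyvcf2 provides, and it does not make the per-record format lookup itself any faster.

```python
from itertools import islice


def batched_records(records, batch_size):
    """Yield lists of up to `batch_size` items from any iterable of records.

    Only one batch is materialized at a time, so memory use is bounded
    by `batch_size` rather than by the total number of records.
    """
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Usage sketch (assumes cyvcf2 is installed and "example.bcf" exists):
# from cyvcf2 import VCF
# for batch in batched_records(VCF("example.bcf"), 1000):
#     values = [variant.format("X") for variant in batch]
```

Whether this helps in practice depends on how much of the time is spent in the format lookup itself, which batching does not change.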