I wrote a function that reads the GT field of a VCF file into a numeric matrix, with the heavy lifting done by GeneticVariation, and then uses the matrix for analyses. Upon testing, I noticed my import routine is ~4x slower than this other software, which more or less wipes out the performance gains in the analysis portion. After some experiments, I came across the following benchmark on this test file, target.chr18.typedOnly.maf0.1.masked.vcf.gz:
using CodecZlib, GeneticVariation

# reads file line by line with function in Base
function read_linebyline(vcffile)
    io = GzipDecompressorStream(open(vcffile))
    s = 0
    for line in eachline(io)
        s += 1
    end
    close(io)
    s
end

# uses "for record in reader"
function loop_vcf(vcffile)
    reader = VCF.Reader(GzipDecompressorStream(open(vcffile, "r")))
    s = 0
    for record in reader
        s += 1
        # here I have routines to process the record into numeric data
    end
    close(reader)
    return s
end

# uses read! and eof
function loop_vcf2(reffile)
    reader = VCF.Reader(GzipDecompressorStream(open(reffile, "r")))
    record = VCF.Record()
    s = 0
    while !eof(reader)
        read!(reader, record)
        s += 1
        # here I have routines to process the record into numeric data
    end
    close(reader)
    return s
end

# timings after compilation:
tgtfile = "target.chr18.typedOnly.maf0.1.masked.vcf.gz"
@time read_linebyline(tgtfile) # 0.365078 seconds (524.71 k allocations: 176.156 MiB, 7.50% gc time)
@time loop_vcf(tgtfile)        # 4.711456 seconds (54.36 M allocations: 5.589 GiB, 10.05% gc time)
@time loop_vcf2(tgtfile)       # 2.028391 seconds (34.67 M allocations: 2.583 GiB, 13.78% gc time)
Using the loop_vcf2 approach, the read! function still takes ~80% of the time in my data import code.
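For context, the per-record processing elided in the loops above is roughly of the following shape. This is only a minimal sketch in plain Base Julia, assuming diploid GT strings such as "0|1" or "1/1"; `gt_to_dosage` and `fill_column!` are hypothetical helper names for illustration and are not part of GeneticVariation.

```julia
# Hypothetical helper: convert one GT string such as "0|1" or "1/1" into an
# alternate-allele count. Returns `missing` if any allele is ".".
function gt_to_dosage(gt::AbstractString)
    dosage = 0
    for allele in split(gt, ['|', '/'])
        allele == "." && return missing
        # any non-reference allele (index > 0) counts as one alternate copy
        dosage += parse(Int, allele) > 0 ? 1 : 0
    end
    return dosage
end

# Hypothetical helper: fill column j (one variant) of a
# samples-by-variants matrix from that variant's GT strings.
function fill_column!(X::AbstractMatrix, j::Integer, gts::AbstractVector{<:AbstractString})
    for (i, gt) in enumerate(gts)
        X[i, j] = gt_to_dosage(gt)
    end
    return X
end

# Example with three samples and one variant.
X = Matrix{Union{Missing, Float64}}(missing, 3, 1)
fill_column!(X, 1, ["0|0", "0/1", "1|1"])
```

This step is cheap compared with the record parsing itself, which is why the time spent in read! dominates.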
My questions are:
Is this expected performance?
Is there a way to modify loop_vcf or loop_vcf2 so that I can read through them faster?