Looping over vcf file 10x slower than reading line by line #27

biona001 · 2020-05-18T22:37:46Z

I wrote a function that reads the GT field of a VCF file into a numeric matrix, with the heavy lifting done by GeneticVariation, and then uses the matrix for analyses. Upon testing, I noticed that my import routine is ~4x slower than this other software, which largely cancels out the performance gains in the analysis portion.

After some experiments, I arrived at the following benchmark on the test file target.chr18.typedOnly.maf0.1.masked.vcf.gz:

using CodecZlib, GeneticVariation

# reads file line by line with function in Base
function read_linebyline(vcffile)
    io = GzipDecompressorStream(open(vcffile))
    s = 0
    for line in eachline(io)
        s += 1
    end
    close(io)
    s
end

# uses "for record in reader"
function loop_vcf(vcffile)
    reader = VCF.Reader(GzipDecompressorStream(open(vcffile, "r")))
    s = 0
    for record in reader
        s += 1  # here I have routines to process the record into numeric data
    end
    close(reader)
    return s
end

# uses read! and eof
function loop_vcf2(vcffile)
    reader = VCF.Reader(GzipDecompressorStream(open(vcffile, "r")))
    record = VCF.Record()
    s = 0
    while !eof(reader)
        read!(reader, record)
        s += 1  # here I have routines to process the record into numeric data
    end
    close(reader)
    return s
end

# timings after compilation:
tgtfile = "target.chr18.typedOnly.maf0.1.masked.vcf.gz"
@time read_linebyline(tgtfile) # 0.365078 seconds (524.71 k allocations: 176.156 MiB, 7.50% gc time)
@time loop_vcf(tgtfile)        # 4.711456 seconds (54.36 M allocations: 5.589 GiB, 10.05% gc time)
@time loop_vcf2(tgtfile)       # 2.028391 seconds (34.67 M allocations: 2.583 GiB, 13.78% gc time)

Using the loop_vcf2 approach, the read! function still takes ~80% of the time in my data import code.
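
For context, the per-record processing elided in the loops above ("routines to process the record into numeric data") is roughly of the following kind. This is only an illustrative sketch in plain Julia, not the actual import routine: `gt_to_dosage` and `fill_column!` are hypothetical helpers that are not part of GeneticVariation, and the GT strings are assumed to have already been extracted from a record.

```julia
# Hypothetical helper (not part of GeneticVariation): convert one GT string,
# e.g. "0|1" or "1/1", into an alternate-allele count (dosage).
# Returns `missing` if any allele is missing (".").
function gt_to_dosage(gt::AbstractString)
    dosage = 0
    for allele in split(gt, ('|', '/'))
        allele == "." && return missing
        # Any non-reference allele (index > 0) counts as one alternate copy.
        dosage += parse(Int, allele) > 0 ? 1 : 0
    end
    return dosage
end

# Hypothetical helper: fill column j of a preallocated matrix from the
# GT strings of one record (one string per sample).
function fill_column!(X::AbstractMatrix, j::Integer,
                      gts::AbstractVector{<:AbstractString})
    for (i, gt) in enumerate(gts)
        X[i, j] = gt_to_dosage(gt)
    end
    return X
end

# Example: three samples, one record.
X = Matrix{Union{Missing, Float64}}(undef, 3, 1)
fill_column!(X, 1, ["0|0", "0|1", "./."])
```

Work of this kind is cheap compared with the timings above, which is consistent with read! dominating the import time.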

My questions are:

1. Is this the expected performance?
2. Is there a way to modify loop_vcf or loop_vcf2 so that I can read through the file faster?
