Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Looping over vcf file 10x slower than reading line by line #27

Open
biona001 opened this issue May 18, 2020 · 0 comments
Open

Looping over vcf file 10x slower than reading line by line #27

biona001 opened this issue May 18, 2020 · 0 comments

Comments

@biona001
Copy link

biona001 commented May 18, 2020

I wrote a function that reads the GT field of a vcf file into a numeric matrix where the heavy lifting is done by GeneticVariation, then uses the matrix for analyses. Upon testing, I noticed my import routine is ~4x slower than this other software, which kind of killed the performance gains in the analyses portion.

After some experiments, I came across the following benchmark on this test file
target.chr18.typedOnly.maf0.1.masked.vcf.gz:

using CodecZlib, GeneticVariation

# reads file line by line with function in Base
function read_linebyline(vcffile)
    io = GzipDecompressorStream(open(vcffile))
    s = 0
    for line in eachline(io)
        s += 1
    end
    close(io)
    s
end

# uses "for record in reader"
function loop_vcf(vcffile)
    reader = VCF.Reader(GzipDecompressorStream(open(vcffile, "r")))
    s = 0
    for record in reader
        s += 1  # here I have routines to process the record into numeric data
    end
    close(reader)
    return s
end

# uses read! and eof
function loop_vcf2(reffile)
    reader = VCF.Reader(GzipDecompressorStream(open(reffile, "r")))
    record = VCF.Record()
    s = 0
    while !eof(reader)
        read!(reader, record)
        s += 1  # here I have routines to process the record into numeric data
    end
    close(reader)
    return s
end

# timings after compilation:
tgtfile = "target.chr18.typedOnly.maf0.1.masked.vcf.gz"
@time read_linebyline(tgtfile) # 0.365078 seconds (524.71 k allocations: 176.156 MiB, 7.50% gc time)
@time loop_vcf(tgtfile)        # 4.711456 seconds (54.36 M allocations: 5.589 GiB, 10.05% gc time)
@time loop_vcf2(tgtfile)       # 2.028391 seconds (34.67 M allocations: 2.583 GiB, 13.78% gc time)

Using the loop_vcf2 approach, the read! function still takes ~80% of the time in my data import code.

My question is:

  1. Is this expected performance?
  2. Is there a way to modify loop_vcf or loop_vcf2 so that I can read through them faster?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant