For performance reasons, the output stream is periodically flushed during a multithreaded BGZF write (htslib/bgzf.c, lines 1438 to 1449 at c705bec):
/*
 * Periodically call hflush (which calls fsync when on a file).
 * This avoids the fsync being done at the bgzf_close stage,
 * which can sometimes cause significant delays. As this is in
 * a separate thread, spreading the sync delays throughout the
 * program execution seems better.
 *
 * Frequency of 1/512 has been chosen by experimentation
 * across local XFS, NFS and Lustre tests.
 */
if (++mt->flush_pending % 512 == 0)
    if (hflush(fp->fp) != 0)
        goto err;
However, when writing to certain filesystems that do not allow appending (e.g. mounted cloud storage buckets), calling fsync marks the file as "finished" and prohibits further writes to it. For instance,
$ samtools view -@4 -b /path/to/my.bam > /mnt/s3_bucket/out.bam # /mnt/s3_bucket is a bucket mounted via mountpoint-s3
samtools view: writing to standard output failed
samtools view: error reading file "/mnt/s3_bucket/out.bam"
samtools view: error closing standard output: -1
Note that this works fine in single-threaded mode, since stdout (or any output file descriptor) is only flushed at the very end.
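One way to check whether a given mount shows this behaviour is a small standalone probe (a sketch, not htslib code; the path is whatever file you point it at):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Probe whether `path` still accepts write() after fsync().
 * Returns 1 if the post-fsync write succeeds, 0 otherwise.
 * On an ordinary local filesystem this returns 1; on a mount that
 * treats fsync as "finalise" (e.g. mountpoint-s3) the second write
 * is expected to fail. */
static int write_after_fsync_ok(const char *path) {
    int ok = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 0; }
    if (write(fd, "first\n", 6) == 6
        && fsync(fd) == 0
        && write(fd, "second\n", 7) == 7)
        ok = 1;
    close(fd);
    return ok;
}
```

Running it against a local directory versus the mounted bucket should reproduce the difference without involving samtools at all.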
I'm currently getting around this by piping the output stream through cat to absorb the fsync calls (i.e. samtools view -@4 ... | cat > ...), but it would be nice if htslib natively supported this (e.g. to allow for samtools commands with -o functionality to write directly to files). Maybe there could be an environment variable that universally disables the periodic flushing.
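A sketch of the environment-variable idea, assuming a hypothetical name HTS_NO_PERIODIC_FLUSH (not an existing htslib option) that the call site in bgzf.c would consult before flushing:

```c
#include <stdlib.h>

/* Hypothetical switch: periodic flushing stays enabled unless the
 * (made-up) HTS_NO_PERIODIC_FLUSH variable is set in the environment. */
static int periodic_flush_enabled(void) {
    return getenv("HTS_NO_PERIODIC_FLUSH") == NULL;
}

/* The call site quoted above would then become something like:
 *
 *     if (periodic_flush_enabled() && ++mt->flush_pending % 512 == 0)
 *         if (hflush(fp->fp) != 0)
 *             goto err;
 */
```

In practice the result would presumably be cached so getenv() is not called for every block, but the idea is the same.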
Hm, I'd argue that mountpoint-s3's fsync() is broken if it departs so radically from the expected behaviour, although at least they do document what happens when you use it. They do also say that it should work more as expected on S3 Express One Zone, but no doubt that comes with a cost...
On the other hand, the periodic flushing was put in place to fix a problem on an ancient system which would fill the entire memory with buffered output, and then pretty much lock up solid as it tried to write all the data to storage. That was probably running a ten-year-old OS, so it would be worth revisiting to see if more modern systems behave better. We'd still want to call fsync() on file close though, as it provides much better assurance than close() that the data really has ended up somewhere permanent.
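The close-time guarantee described here can be sketched as follows (a standalone illustration, not the actual bgzf_close code):

```c
#include <stdio.h>
#include <unistd.h>

/* Close a descriptor durably: fsync() first, so the data has reached
 * stable storage before the descriptor goes away; a plain close()
 * alone gives no such assurance. Returns 0 on success, -1 if either
 * step fails. */
static int close_durably(int fd) {
    int ret = 0;
    if (fsync(fd) != 0) {
        perror("fsync");
        ret = -1;
    }
    if (close(fd) != 0) {
        perror("close");
        ret = -1;
    }
    return ret;
}
```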