Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't error when making an iterator on a tid not in the index #1545

Merged
merged 1 commit into from
Jan 16, 2023

Conversation

daviesrob
Copy link
Member

Don't error when making an iterator on a tid not in the index, instead return one that instantly finishes.

Fixes #1534, which was an edge case where an index did not include an entry for a chromosome that was mentioned in the file header but had no data records. Normally these would be present but empty, but it was possible to use the IDX= key in a VCF file to make an index where the chromosome simply did not appear. In this case, rather than an error, we want to return the equivalent of HTS_IDX_NONE so the iterator produces no data.

Another scenario where this is useful is if you build an index, and then try to use it immediately without first saving and reading it back in again. Such an index will have NULL entries in bidx[] for any chromosomes with no data. Again we want to return an HTS_IDX_NONE iterator if one of those chromosomes is queried. (This issue didn't usually occur because most programs are loading in an existing index, and idx_read_core() makes bidx[] entries for everything even if there's nothing in the index for the chromosome.)

Note that this changes vcf_loop() in test_view.c so that it now treats bcf_itr_querys() failures as an error. The new behaviour matches sam_loop() and is needed to detect the problem being fixed here. All the other tests still work after this change so nothing was relying on the old behaviour of ignoring the errors.

Instead return one that instantly finishes.

Fixes an edge case (issue samtools#1534) where an index did not include
an entry for a chromosome that was mentioned in the file header
but had no data records.  Normally these would be present but
empty, but it was possible to use the IDX= key in a VCF file
to make an index where the chromosome simply did not appear.  In
this case, rather than an error, we want to return the equivalent
of HTS_IDX_NONE so the iterator produces no data.

Another scenario where this is useful is if you build an index,
and then try to use it immediately without first saving and
reading it back in again.  Such an index will have NULL entries
in bidx[] for any chromosomes with no data.  Again we want to return
an HTS_IDX_NONE iterator if one of those chromosomes is queried.
(This issue didn't usually occur because most programs are loading
in an existing index, and idx_read_core() makes bidx[] entries
for everything even if there's nothing in the index for the
chromosome.)

Note that this changes vcf_loop() in test_view.c so that it
now treats bcf_itr_querys() failures as an error.  The new
behaviour matches sam_loop() and is needed to detect the problem
being fixed here.  All the other tests still work after this change
no nothing was relying on the old behaviour of ignoring the errors.
@jkbonfield jkbonfield merged commit 46cfe85 into samtools:develop Jan 16, 2023
@daviesrob daviesrob deleted the missing_tid_iter branch January 16, 2023 14:38
@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 16, 2023

Merged as it works and fixes a problem, but I do wonder if there are more things wrong than this in #1534.

It sounds like we have a way to create an index for a BCF file where chromosomes don't appear in the index. This PR makes us no longer fail with such indices, but it doesn't fix the underlying problem that we still have indices with missing chromosomes. Our code works, but what about other peoples code? Do we know whether Picard would work on such data without fixing our indices to include them?

Is it a hard requirement that indices must contain an entry for every chromosome in the header, irrespective of whether they have any coverage? I looked at the VCF spec but it has a woeful single sentence describing indices! (Basically "same as BAM" which is a bit misleading.) The BAM spec doesn't explicitly state that all references have BAI entries, but experimentally it does appear to be the case. If BCF eqvuialent doesn't do this then perhaps it is still in error.

@daviesrob
Copy link
Member Author

Having index entries with no data is normal (n_bin in the entry would be zero; look at the diagram in the samtools spec. for context). The difference in this case is that the tid value was higher than n_ref, so it wouldn't appear in the list of indicies. It's possible that Picard might not like that, but you do have to try quite hard to make an index that does this so it's unlikely many exist in practice - and as the way found involved indexing a file with no variant records such files are unlikely to be much use.

#1534 (comment) mentions that we might want to fix the index building side. I could try to make a PR for that if you'd like a belt-and-braces approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

hts_itr_query fails on empty BCFs with gapped tid blocks
2 participants