Clarify MI for identifying source molecule strand #633
Sorry for the slow reply. Why /A and /B? That seems both redundant given we have /1 and /2 routinely used, and also rather sailing against the norm.
A couple of thoughts @jkbonfield. I believe the Understood that this is "one piece of software", though it should be pointed out that the one piece of software is recommended by both IDT and Twinstrand Bio, the two largest players in the duplex UMI space. I personally don't care if it's called convention, recommendation, common practice or something else - just trying to give context.
Thanks Tim. That's useful, and indeed I was confusing /1 /2 with the read numbers, so I had already fallen into that trap. Tbh I don't really know enough about this stuff to work out what's appropriate for the spec vs what's appropriate for a specific bit of software. However, clearly it's helpful for the spec to say something, as it's necessary to flag this up somehow, and if you think there's realistically only the one implementation out there right now then it's a good time to codify how things should look. (This is a mistake we made before by taking too long to define barcodes!)
I am happy to remove
We discussed it briefly at the File Formats conference call yesterday (you're welcome to join us for such things if you wish btw, as is anyone else with a general interest). The general consensus is that it perhaps needs approaching from the other direction. For example, instead of defining MI and then adding a note that it may end in /A, /B, tackle it more head on by defining MI in a structured manner of identifier plus status codes (eg I think @jmarshall has some comments to add too, hopefully along similar lines.
As alluded to in samtools/samtools#1605 (comment) (which inspired this proposal by asking for an issue to be raised here), SAMtags.pdf currently describes MI as
You seem to be saying that in fact fgbio is using the union of e.g. (Thus The problem is that there is no defined character set for MI values, so other tools may have been using slash characters as ordinary parts of identifiers. Hence downstream tools seeing a slash in these values would not know in advance whether or not they need to strip a Is this something that only fgbio does, or do other workflows heavily using MI do something similar? What do MI values typically look like, both as generated by fgbio or by other workflows? I am not necessarily against redefining MI in this way, provided MI values found in the wild do not preclude this redefinition of slashes. But the appropriate way for the spec to phrase it would be rather different (which is why I asked for an issue rather than proposed text 😄), something like
Additional notes:
In As for your suggestions, I've commented below:
Removed
Changing it to a two character suffix, the slash and single character.
Done.
@jkbonfield @jmarshall do you have any more guidance on this? We merged samtools/samtools#1605, which supports the strand of the source molecule in the template:coordinate sort.
@jmarshall makes a good point about modifying The SAMtags doc at the moment clearly states:
This is somewhat ambiguous, as a compound X/Y tag does still mean the UMI can identify the molecule X despite having different tags, but only by using additional formatting knowledge. However it then goes on to say
So this is explicit in permitting UMI to imply duplication through something other than a naive strict string comparison. This PR does appear to be in agreement with that original intention, even though it is at odds with the wording in the MI tag itself. This isn't an ideal starting point, obviously. I see your PR modifies both the introductory text I quoted above and the MI tag definition itself. That seems reasonable. Specifically, "related" is totally woolly, and this PR now clarifies how this relation is codified.
Especially as the related samtools PR has since been merged, the ship has long since sailed on weighing up altering MI vs adding e.g. MS. I fixed up the formatting previously so this can be reviewed more easily. IMHO the PR's direction is acceptable; I have some clarifications to the proposed text that I would like to make.
For duplex sequencing technology, we can identify the strand (top or bottom) from which the read was derived relative to the duplex source molecule. This convention or recommendation would allow the strand information to be appended to the MI tag. This is already being used in the wild by fgbio and its users.
Related to samtools/samtools#1605.
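To make the suffix convention discussed in this thread concrete, here is a minimal Python sketch of how a downstream tool might separate the molecule identifier from the strand. It assumes the suffix is exactly a slash plus a single character, `/A` or `/B`, as described in the conversation; the function names `split_mi` and `group_by_molecule` are hypothetical and are not taken from fgbio, samtools, or the spec.

```python
from collections import defaultdict

# Assumption: the only strand suffixes are "/A" and "/B", as discussed in this thread.
STRAND_SUFFIXES = ("/A", "/B")


def split_mi(mi):
    """Split an MI value into (molecule_id, strand).

    strand is "A", "B", or None when no recognised suffix is present.
    Only a trailing two-character suffix is stripped, so any other
    slashes inside the identifier are left untouched.
    """
    if len(mi) > 2 and mi.endswith(STRAND_SUFFIXES):
        return mi[:-2], mi[-1]
    return mi, None


def group_by_molecule(mi_values):
    """Group MI values by source molecule, ignoring the strand suffix."""
    groups = defaultdict(list)
    for mi in mi_values:
        molecule_id, strand = split_mi(mi)
        groups[molecule_id].append(strand)
    return dict(groups)


if __name__ == "__main__":
    print(split_mi("12/A"))
    print(group_by_molecule(["7/A", "7/B", "8"]))
```

Note that this sketch cannot distinguish a strand suffix from an identifier that merely happens to end in `/A` or `/B`, which is the ambiguity raised above about MI having no defined character set.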