-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarification on absence of ? or . in MM tags #654
Comments
I support this clarification in the specification. This should only effect legacy versions of the tag, as I would hope current/new tags would include the ? or . specifier. And at Nanopore we have just recently started to output all-context 5mC calls. So all old Nanopore files will default to the ? tag version since calls are not made at any non-CG contexts. |
@jrobinso, your explanation is correct for PacBio's methylation caller, and I would support modifying the spec so that The currently released version does not specify a mode but uses |
I'm not convinced in changing the spec as using At the time of writing the spec I think I had a grand total of one real called read (it was somewhat chicken and egg), but that was called end to end and non-methylated bases were simply absent. Looking at one of the PacBio called data sets, I do indeed see it's not specifying
So the 2932 skip is almost certainly skipping an initial region (maybe related to 4533S in the alignment? I haven't counted the Cs but it seems far too many), and then it's reporting methylation status for bases that it thinks have been methylated only. That bit at least is adhering to the spec as written. I don't see many 0 confidence values in the Ml tag. So I think the problem here is the partial calling over a specific range. If we start changing to spec to fit the incorrect software then we'll probably then get a complaint from another vendor saying that by changing the defaults we've altered the meaning of their data which doesn't write out those runs of 0s. Edit: if we change this to explicit |
I was just pointing out that there is no known software or existing data files that produces files with the "." default assumption, on the contrary all known software and files in existence use the "?" default assumption. Its possible I have misread the spec, but I did I confirm my reading with @amwenger and @marcus1487 . Under the "." assumption I would assume all skipped "Cs" (e.g. the first 2932 in your example) would be known to have the modification with low probability. But you say that example was written to the spec. So perhaps some clarification to the spec is in order, rather than a change. @amwenger could you comment on that example?
Yes! That is exactly what we are saying, that is how IGV is currently interpreting the pacbio and ont data. Only "CG" sites have been investigated. Within CG site some alignments could not be called, for various reasons, and nothing can be said about them. They are also skipped. Edit:
Actually that's what "." mode means from my reading. "?" mode means nothing can be said one way or another about non-called (i.e. skipped) bases. I don't read ML as confidence, but as probability, and an ML of zero means we are confident the modification is not present. I interpret a non-call (skipped base in "?" mode) to mean we haven't checked or don't have enough confidence to make a call. Bases that are called with low probability get a bright blue color in IGV to mean these bases are probably not methylated. Non-call bases don't get any color at all.
This is the nub of the misunderstanding or issue, I am assuming exactly that, i.e. nothing can be said about those bases. BTW "been checked" is not the same as no-call. I don't think whether or not its been checked is relevant. |
Ok if it's calling only CG motifs and not just C bases then that would explain why it has lots of gaps. Is it to be expected though that almost all CGs are infact methylated? I thought generally it wasn't so common, but I'm a novice really at the biology here. I'm not sure we can say "there is no known software or existing data files that produces files with the "." default assumption". The first data I had came from the ONT basecaller which was spitting out a mod_mapped.bam file. I'm pretty syre that was matching the spec, but I could be wrong and maybe it's also looking only in CG cases? |
@marcus1487 can comment on this but all the data I have seen so far have been CG context callers. So nothing can be said about other "Cs" ("Cs" not in CG contexts). That's what prompted this issue, if I interpreted these files to the spec all of those Cs would get colors indicating they are not methylated which makes no sense. Methylation is very common but just because a call is made does not mean that the site is methylated. It could equally mean it is believed to be NOT methylated (ML < 50%). It just means that something is definitely being said about it, as opposed to nothing is known. A call with ML of 0 is information, a skipped base is not, and I think that is where the confusion is. This is a confusing topic, I struggled with it, but that seems to be how CG context calls are to be interpreted. I don't have any experience with other modifications. |
BTW the non-context specific callers, of which I have not seen any data, will use the "." explicitly to save disk space since otherwise there would be a huge number of near zero calls. To date I think almost all the release pipelines and data, at least that I have seen, are CG specific 5mc callers. |
That was the whole intention when designing the format - to avoid doubling up on space by storing an extra per-base methylation channel, potentially with many new calls in the future. Not being sufficiently familier with the biological processes though I didn't think that algorithms wouldn't even look for signal outside of CpG islands, but it makes sense given the bulk of methylation occurs there in mammals and it's an easy way to improve "accuracy" (in a somewhat biased fashion). I'm stil wary of just changing the definition this late though. Given how hard it was to reach the correct bits of the community during the lengthy review (basically I failed to do so despite trying), I don't really have much confidence we can also reach enough of it again now to really know who is doing what. Maybe someone closer to the problem can enumerate all the platforms that are capable of producing such data and describe the state of their pipelines, along with who to contact? We're not going to reach the necessary audience by posing questions here (otherwise we wouldn't be in this situation to start with). |
I understand your reluctance to change the spec. OTOH I can't change IGV to conform to the spec as it then wouldn't work with current data. And AFAIK @amwenger and @marcus1487 represent the only platforms currently capable of producing this data. The only time "." makes sense, as far as I understand, is for context free callers. I don't think there are any of these released yet, some are in development. For context specific callers (e.g. CpG) "?" is the only reasonable interpretation as there are many more "Cs" (for example) out of context than in context. In other words the ratio of skipped to called bases is high. Its not reasonable to treat these skipped bases as having been investigated. None of this matters unless you are distinguishing a no-call from a call at low probability. This is subtle but important for methylation at least. A call with low probability == a call of no methylation with high probability. This is not a huge problem as both platforms will be specifying ? or . going forward, my concern is someone someday will produce files that do conform to the spec in this regard in which case IGV (and possibly other tools) won't work properly. But that can be dealt with if it occurs. The best practice would be to always specify this and not rely on defaults. I just raised this issue as it was clear that spec and practice were out of sync, feel free to close it. |
I can confirm that all nanopore production 5mC results should use the
Native human (and most native mammalian samples) have about 50% of Cs in CG contexts modified and ~1-5% of Cs in non-CG contexts modified. Plants have more CHG methylation (I don't know enough about plant CG context methylation).
Example data provided from nanopore at the time of development should have been CG-context only. From the nanopore side I would recommend @jts and @cjw85 might want to comment. |
A clarification on results from Oxford Nanopore Sequencing basecallers. The previous generation of calling algorithm was capable of performing all-context calling. There are files in the wild where the tags follow the specification with no optional There are clearly some BAMs in the wild that do not use the optional tags but are encoding their data as if they were using the explicit Similar to @jkbonfield I don't hold with merit the argument that the specification should be changed to fit erroneous files that have been produced thus far. |
This is news to me, and changes the whole discussion. In the unlikely event a user loads such a file in IGV they will get unexpected results, but I can deal with it there. As I raised this issue I will close it. |
So this leaves us in a pickle, with both defaults being used at some point. Blergh! I think what this really means is several things:
|
I would support a change to the specification that it recommends that SAM writers explicitely state |
I'm also in favour of recommending the flag is explicit. Maybe we should write a tool that performs sanity checks on mod bams to catch these issues. It could check for hard clips (#646), that all Cs (or whatever the canonical base is) are accounted for in |
The '.' code is the default interpretation, but historically tools omitting '?' and '.' have used both styles. An explicit definition in the MM string removes any ambiguity. See samtools#654 comments for background.
The '.' code is the default interpretation, but historically tools omitting '?' and '.' have used both styles. An explicit definition in the MM string removes any ambiguity. See samtools#654 comments for background. Co-authored-by: John Marshall <[email protected]>
The '.' code is the default interpretation, but historically tools omitting '?' and '.' have used both styles. An explicit definition in the MM string removes any ambiguity. See #654 comments for background.
This issue arose implementing support for MM in IGV. There is an apparent discrepancy between the spec and current practice when interpreting MM tags which do not contain a "?" or "." flag. The spec says that when this flag is not present it should be assumed the modifications at skipped bases are present with low probability (i.e. equivalent to "."). However in files produced to date without this flag the opposite is true, i.e. nothing is known about skipped bases. Consequently this is how IGV interprets skipped bases.
So the request would be to change the spec to conform to current practice, or maybe more specifically existing files from past practice since the major vendors are updating their tools to always include either ? or .
@amwenger @marcus1487 jump in if I've misunderstood.
The text was updated successfully, but these errors were encountered: