New tool: tskit2zarr #232
Comments
cc @hyanwong - as just discussed
Thanks. In my case this would basically be used as a drop-in replacement for
Yep, python API would likely be the primary interface
Great, I assume we can steal some of the code from https://github.com/tskit-dev/tsinfer/blob/9d8f9349c953da269d69b1b6a6352f10c7c5abb9/tsinfer/formats.py#L1526, although maybe we would be better starting from scratch. Either way, doing the ts->vcf->zarr gives us a reasonable way to test it, I assume.
I don't think it would share an awful lot of code to be honest, as the writing back-end is where the complexity is. Having said that, I expect it to be quite easy, the only tricky bit is to abstract some of the code used just for converting VCFs to being a bit more general (i.e., shared across the tskit, plink and VCF tools).
Just revisiting this, which will be useful for teaching. I would very much like this to be a pure Python conversion (and therefore not to depend on cyvcf2, which is, e.g., not present in pyodide installations). I don't think it needs VCF reading functionality to convert a tree sequence, right?
No, but it would involve us making vcf2zarr an optional feature of bio2zarr (i.e., not a required dependency). I think this would cause problems for lots of users, and is perhaps not worth optimising for the tskit-on-pyodide corner case.
Fair enough. If we solve tskit-dev/tsinfer#924 then that gives another potential (and perhaps faster, in-memory) route to turn a tree sequence into something that tsinfer can read.
It would be useful for simulation and other applications to be able to efficiently convert tskit tree sequences to VCF Zarr. While this can currently be done with `tskit vcf` and `vcf2zarr`, it's not very efficient, and some information is lost in the VCF conversion.

The CLI would look something like
we would include the standard options for multiple workers, etc. I don't think there's any need for distributed commands (but they could be added later, I guess).
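To make the "multiple workers" idea concrete, here is a minimal sketch of chunked parallel conversion. It is not bio2zarr's implementation: the names `chunk_ranges`, `encode_chunk` and `convert_in_parallel` are hypothetical, the per-chunk work is a stand-in (counting non-reference alleles per site), and threads are used only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(num_items, chunk_size):
    """Split range(num_items) into half-open (start, stop) chunks."""
    return [
        (start, min(start + chunk_size, num_items))
        for start in range(0, num_items, chunk_size)
    ]


def encode_chunk(genotypes, start, stop):
    """Hypothetical per-chunk work: count non-reference alleles at each site."""
    counts = [sum(1 for g in row if g != 0) for row in genotypes[start:stop]]
    return start, counts


def convert_in_parallel(genotypes, chunk_size=2, num_workers=2):
    """Process site chunks on a pool of workers and reassemble them in order."""
    out = [None] * len(genotypes)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(encode_chunk, genotypes, start, stop)
            for start, stop in chunk_ranges(len(genotypes), chunk_size)
        ]
        for future in futures:
            start, counts = future.result()
            # Each chunk writes only to its own slice, so no locking is needed.
            out[start:start + len(counts)] = counts
    return out
```

Because every chunk owns a disjoint slice of the output, the number of workers only affects throughput, not the result.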
Operationally, we could depend on tszip using the load function added in 0.2.3. Since tszip is basically tskit + zarr, there's no harm in adding support.
We should probably make this an optional dependency, though, since it is a relatively niche use-case?
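One common way to handle an optional dependency like this is to import it lazily, only when the tskit code path is actually used. The sketch below assumes that pattern; `optional_import` and `load_tree_sequence` are hypothetical helper names, and the only tszip call used is the `tszip.load` function mentioned above.

```python
import importlib


def optional_import(name, feature):
    """Import `name` on demand, with a clear error if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{name}'; "
            f"install it with: pip install {name}"
        ) from exc


def load_tree_sequence(path):
    """Load a tree sequence file via tszip, imported only when needed."""
    # Assumption: tszip >= 0.2.3, which provides the load function.
    tszip = optional_import("tszip", "tskit conversion")
    return tszip.load(path)
```

With this layout, users who never convert tree sequences do not need tszip installed, and those who do get an actionable message rather than a bare import failure.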