Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New tool: tskit2zarr #232

Open
jeromekelleher opened this issue Jun 3, 2024 · 8 comments
Open

New tool: tskit2zarr #232

jeromekelleher opened this issue Jun 3, 2024 · 8 comments

Comments

@jeromekelleher
Copy link
Contributor

jeromekelleher commented Jun 3, 2024

It would be useful for simulation and other applications to be able to efficiently convert tskit tree sequences to VCF Zarr. While this can currently be done with tskit vcf and vcf2zarr it's not very efficient, and some information is lost in the VCF conversion.

The CLI would look something like

tskit2zarr convert input.trees output.vcz

we would include the standard options for multiple workers, etc. I don't think there's any need for distributed commands (but they could be added later, I guess).

Operationally, we could depend on tszip using the load function added in 0.2.3. Since tszip is basically tskit + zarr, there's no harm in adding support.

We should probably make this an optional dependency, though, since it is a relatively niche use-case?

@jeromekelleher
Copy link
Contributor Author

cc @hyanwong - as just discussed

@hyanwong
Copy link

Thanks. In my case this would basically be used as a drop-in replacement for tsinfer.SampleData.from_tree_sequence() in a notebook, so I'm wondering about access via a python API? I'm keen to be able to do this for small instances from the REPL, so I can run from a notebook without dropping into the shell.

@jeromekelleher
Copy link
Contributor Author

Yep, python API would likely be the primary interface

@hyanwong
Copy link

Great, I assume we can steal some of the code from https://github.com/tskit-dev/tsinfer/blob/9d8f9349c953da269d69b1b6a6352f10c7c5abb9/tsinfer/formats.py#L1526, although maybe we would be better starting from scratch. Either way, doing the ts->vcf->zarr gives us a reasonable way to test it, I assume.

@jeromekelleher
Copy link
Contributor Author

I don't think it would share an awful lot of code to be honest, as the writing back-end is where the complexity is. Having said that, I expect it to be quite easy, the only tricky bit is to abstract some of the code used just for converting VCFs to being a bit more general (i.e., shared across the tskit, plink and VCF tools).

@hyanwong
Copy link

Just revisiting this, which will be useful for teaching. I would very much like this to be a pure python conversion (and therefore not to depend on cyvcf2, which is e.g. not present in pyodide installations. I don't think it needs VCF reading functionality to convert a tree sequence, right?

@jeromekelleher
Copy link
Contributor Author

No, but it would involve us making vcf2zarr an optional feature of bio2zarr (i.e., not a required dependency). I think this would cause problems for lots of users, and is perhaps not worth optimising for the tskit-on-pyodide corner case.

@hyanwong
Copy link

Fair enough. If we solve tskit-dev/tsinfer#924 then that gives another potential (and perhaps faster, in-memory) route to turn a tree sequence into something that tsinfer can read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants