Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arranging individuals in a poseidon package #300

Closed
TCLamnidis opened this issue May 29, 2024 · 11 comments
Closed

Arranging individuals in a poseidon package #300

TCLamnidis opened this issue May 29, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@TCLamnidis
Copy link
Member

TCLamnidis commented May 29, 2024

A feature in trident where it can arrange the individuals in a poseidon package by the Poseidon_ID or Group_ID would be nice to have.

Practically, the operation is somewhat superfluous, as most tools do not care about the order of individuals on input. However, being able to rearrange the janno file in some basic ways without breaking the package can be very helpful for basic categorisation of information. Currently, changing the order of individuals in the janno is prohibited without also changing the order in the fam/ind file, which in turn means also changing the order in the bed/geno file.

This is a problem when dealing with reviewing PRs that add or update packages where the information is not in an intuitive order. Minotaur suffers from this problem particularly, because nextflow's asynchronous nature means that the order of individuals in the package will be different with each rerun of a package, and will always be random unless steps are taken to reorder janno/ind/geno files during package creation.

The minotaur-specific issue of order when rerunning packages is somewhat mitigated by jannocoalesce, but reordering of individuals is a data manipulation that might be useful for many users, and it makes sense to do within trident since it requires fiddling with the actual genotype data as well as the janno.

I assume this is technically already possible through some complicated forge command, where a user specifies all the poseidon_IDs in <> and in the desired order, but having an option or subcommand to sort the package without needing to explicitly state the order would be a nice quality of life feature imo.

@TCLamnidis TCLamnidis added the enhancement New feature or request label May 29, 2024
@nevrome
Copy link
Member

nevrome commented May 29, 2024

I assume this is technically already possible through some complicated forge command, where a user specifies all the poseidon_IDs in <> and in the desired order

If I'm not mistaken this is not the case. The order in forgeScript does not translate to the output order. It's a selection language only. The sample order in the forge output instead depends on the order in the source packages and the order by which these packages are discovered in the file system via -d. The forge algorithm just checks for each input individual if they conform to the entity specification in the forgeScript, but does not change their order from the input sequence.

We discussed the idea to change this behaviour in the past in #230, but it would be a big change. @stschiff wrote back then:

In full generality, this would require a complete rewrite of how we currently read the forge DSL.

Back then we decided to postpone this.

Now regarding a pure arrange or sort subcommand my gut feeling is that this would be more easy to realize 🤔. We could allow to sort by a list of .janno columns. Entirely custom ordering would then be possible by adding an own column with the desired ranks.

@TCLamnidis
Copy link
Member Author

That would cover my use case perfectly, and then some!

@stschiff
Copy link
Member

I completely agree that we need this. And I think it should be an option for forge. I actually think that adding an option --ordered to forge should be possible and not too complicated. I would like to look into it.

@TCLamnidis
Copy link
Member Author

i think a subcommand that rearranged the individuals while adding a changelog entry and incrementing the patch version also makes sense. afaik forge always creates a new package, potentially losing any info that was in the YAML file, right?

@stschiff
Copy link
Member

Good point. I mean, the functionality is the bit that we need to look into. The API can be discussed. I can also imagine both. If we have a rearrange command we might as well add --ordered

@stschiff stschiff mentioned this issue May 31, 2024
@stschiff
Copy link
Member

I've submitted a PR (#301). Was easier than I thought, at least for forge. I haven't yet added a new arrange command, so this issue should remain open even if #301 gets merged.

@nevrome
Copy link
Member

nevrome commented Jun 2, 2024

Brilliant, @stschiff!

Regarding the bigger workflow of rearranging the individuals in a package according to some property I would like to point out that qjanno allows to generate the required ordered forgeScript files in just a single line of code:

qjanno "SELECT '<'||Poseidon_ID||'>' FROM d(2018_Lamnidis_Fennoscandia) ORDER BY Poseidon_ID DESC" --raw --noOutHeader

Maybe the recommended workflow to reorder a package should indeed be as follows:

  • generate forgeFile with qjanno or any other tool
  • use forge with the --ordered flag to create the sorted package
  • apply rectify to change the checksum, add the changelog and increment the package version number

An in-place reordering of the package where data does not get copied is not a good idea, @TCLamnidis. This can easily lead to corrupted packages and data loss. So using forge for this is overall a good idea, I think.

The only issue not covered by this workflow is the loss of the .yml file information. For this I wonder if we should add yet another flag to forge. It could be called --preservePyml (we already have a --pyml option in validate). It could be set to fail if the forge command operates on more than one source package.

@stschiff
Copy link
Member

stschiff commented Jun 3, 2024

yes, I was thinking along the same lines. I was even wondering whether we should make it a default that forge always copies the old YAML, README and CHANGELOG as long as there is only a single source package? Wouldn't that make sense and be the expected behaviour? We could even output a log-info "All output comes from a single source package, copying README, CHANGELOG and YAML" or so?

@nevrome
Copy link
Member

nevrome commented Jun 3, 2024

I would prefer a specific flag. Otherwise adding or removing a single individual from a selection could lead to a very different output. That is confusing - especially for automatic processing.

@stschiff
Copy link
Member

OK, I agree. Also to not make this a breaking change, a flag would be good. There are two basic options:

1.) A flag --preservePyml that simply checks if there is only a single source package and in that case takes in the yml file (upon reflection I would not extend this towards the README and CHANGELOG). If there are more than one source-package than this option will result in an error.

2.) An option --usePyml that actually takes in the title of a source package from which it should copy. This would be more flexible but also more cumbersome.

I would vote for option 1.

@nevrome nevrome mentioned this issue Jun 28, 2024
@nevrome
Copy link
Member

nevrome commented Jul 11, 2024

Implemented and merged 1.) in #304

@nevrome nevrome closed this as completed Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants