Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add package for Allentoft at al. 2024 #202

Merged
merged 3 commits into from
Oct 11, 2024

Conversation

smpeltola
Copy link
Contributor

@smpeltola smpeltola commented Aug 15, 2024

PR Checklist for a new package submission

  • The package does not exist already in the community archive, also not with a different name.
  • The package title in the POSEIDON.yml conforms to the general title structure suggested here: <Year>_<Last name of first author>_<Region, time period or special feature of the paper>, e.g. 2021_Zegarac_SoutheasternEurope, 2021_SeguinOrlando_BellBeaker or 2021_Kivisild_MedievalEstonia.
  • The package is stored in a directory that is named like the package title.

  • The package is complete and features the following elements:
    • Genotype data in binary PLINK format (not EIGENSTRAT format).
    • A POSEIDON.yml file with not just the file-referencing fields, but also the following meta-information fields present and filled: poseidonVersion, title, description, contributor, packageVersion, lastModified (see here for their definition)
    • A reasonably filled .janno file (for a list of available fields look here and here for more detailed documentation about them).
    • A .bib file with the necessary literature references for each sample in the .janno file.
  • Every file in the submission is correctly referenced in the POSEIDON.yml file and there are no additional, supplementary files in the submission that are not documented there.
  • Genotype data, .janno and .bib file are all named after the package title and only differ in the file extension.
  • The package version in the POSEIDON.yml file is 1.0.0.
  • The poseidonVersion of the package in the POSEIDON.yml file is set to the latest version of the Poseidon schema.
  • The POSEIDON.yml file contains the corresponding checksums for the fields genoFile, snpFile, indFile, jannoFile and bibFile.
  • There is either no CHANGELOG file or one with a single entry for version 1.0.0.

  • The Publication column in the .janno file is filled and the respective .bib file has complete entries for the listed mentioned keys.
  • The .janno file does not include any empty columns or columns only filled with n/a.
  • The order of columns in the .janno file adheres to the standard order as defined in the Poseidon schema here.
  • The .janno and the .ssf files are not fully quoted, so they only use single- or double quotes ("...", '...') to enclose text fields where it is strictly necessary (i.e. their entry includes a TAB).

  • The package passes a validation with trident validate --fullGeno.

  • Large genotype data files are properly tracked with Git LFS and not directly pushed to the repository. For an instruction on how to set up Git LFS please look here. If you accidentally pushed the files the wrong way you can fix it with git lfs migrate import --no-rewrite path/to/file.bed (see here).

@stschiff
Copy link
Member

Thanks a lot, @smpeltola, for preparing this. Where did you get the genotype data from? Did you process the raw data yourself?

@smpeltola
Copy link
Contributor Author

@stschiff, the genotypes are directly from the vcf files Allentoft et al. provided, here: https://erda.ku.dk/archives/917f1ac64148c3800ab7baa29402d088/published-archive.html. I have pulled down 1240k positions, and for imputed data, set low quality imputed genotypes (maxGP < 0.99) to missing, and converted to PLINK.

@stschiff
Copy link
Member

stschiff commented Sep 2, 2024

OK brillant. I'll take a look soon.

@nevrome nevrome changed the title Add Allentoft at al. 2024 Add package for Allentoft at al. 2024 Sep 6, 2024
@stschiff
Copy link
Member

stschiff commented Oct 9, 2024

Sorry for taking so long @smpeltola. I think there is an amazing amount of work in this package, and I don't want to stand in the way of having it merged-in quickly. A quick question to the team:

This package does something we haven't seen before, I think, in that it provides both the imputed and the non-imputed genotypes within one package. @smpeltola has painstakingly marked all duplicate pairs as "identical", so in principle I think the information to properly analyze this dataset is there. The Poseidon_IDs reflect the processing, for example "NEO100" vs. "NEO100_imputed". I think this isn't exactly how we envisioned this, but it technically doesn't break our schema.

Any thoughts from the team? If not, I suggest @AyGhal can merge this in, I couldn't see any obvious problem with the package. Thanks again @smpeltola for this work!

@AyGhal
Copy link
Contributor

AyGhal commented Oct 9, 2024

The package looks good to me, thanks @smpeltola!
As far as I understood, both imputed and non-imputed data are 1240k pull-downs. I agree that this does not break our schema. The only thing I can request/recommend is to fill the Nr_SNPs column.
What do you all think?

@stschiff
Copy link
Member

Thanks, @AyGhal. Is filling the Nr_SNPs column easi, @smpeltola, in the sense that it can be done within an hour or so? If not, I suggest we merge this in as is and try to fill it later. I was anyway at some point planning to make that a feature of trident rectify, since it is directly retrievable from the genotype data.

@AyGhal
Copy link
Contributor

AyGhal commented Oct 11, 2024

Okay. We'll add that later.

@AyGhal AyGhal merged commit a8d321b into poseidon-framework:master Oct 11, 2024
1 check passed
@nevrome
Copy link
Member

nevrome commented Oct 23, 2024

I was just investigating why our mastodon action did not trigger when this PR was merged. It took me a while to realize that @smpeltola manually updated the chronicle file in this PR and therefore preempted the automated processes. You're impressively thorough 😮! It's not necessary to do this manually, because we have a GitHub action, which also triggers other downstream automation. But props to you for the attention to detail!

@stschiff & @AyGhal: Let's keep an eye out for this in future reviews. I never thought somebody would notice the archive.chron file and manually update it.

@smpeltola
Copy link
Contributor Author

Sorry about that @nevrome ! I saw a notification from the chronicle file and just clicked buttons until it went away 😅 Maybe you can add a note in the submission instructions that those can be ignored.

@stschiff I didn't fill the Nr_SNPs because, for some reason, though that trident rectify already had the functionality to retrieve them from the genotype data. I can fill them in if that's needed!

@stschiff
Copy link
Member

All good, @smpeltola, I think we were just impressed that you went to the trouble and updating the chronicle file manually. And indeed, we should definitely have a feature to fill-in the number of SNPs. I will put this on the to-do list. In the mean-time, whenever you find a minute to do it, a Pull Request to update the number of SNPs would be welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants