Improve the error messages in case of a genotype data parsing failure #224

Open
nevrome opened this issue Feb 8, 2023 · 4 comments
Labels
enhancement New feature or request for the future

Comments

@nevrome
Member

nevrome commented Feb 8, 2023

At the moment an issue in the genotype data is always reduced to

Issues in genotype data parsing:
SeqFormatException "Error while parsing: not enough input. Error occurred when trying to parse this chunk: \"...\""

That often does not help to identify and solve the underlying issue, because it omits the package and SNP (and individual?) in which the problem occurred. If such an error comes up in a big forge, debugging becomes a search for a needle in a haystack. The short snippet of the relevant chunk in the error message above can also be pointless when the genotype data is in a binary format.

I wonder if there is a way to include additional, crucial information in this error message.

@nevrome nevrome added the enhancement New feature or request label Feb 8, 2023
@nevrome nevrome transferred this issue from poseidon-framework/community-archive Feb 8, 2023
@stschiff
Member

We can definitely add a size check for the genotype data, which I'm happy to include in our validation pipeline.

@nevrome
Member Author

nevrome commented Feb 22, 2023

It has become clear that a size-check is not possible. But independent of that I was hoping for more.

I hope we could get an error message that looks something like this:

Issues in genotype data parsing:
Can not parse SNP A in line B of file C of package D for individual E.
Error occurred when trying to parse this chunk: "this is not the SNP you're searching for"

Is this science fiction with our current implementation?

@stschiff
Member

OK, so going back to size checks: Since we do have the snpSet (1240K, HumanOrigins, Other) in the YAML file, we should actually be able to give a size-check warning after all, at least in cases where it's either 1240K or HumanOrigins. We can hardcode the expected number of SNPs for these categories and then use the number of individuals to compute an expected byte size of the *.bed or the *.geno files. Of course, I think a mismatch between the expected and the actual size should not yield a hard error, because technically one could imagine some packages simply dropping SNPs which are uncovered (the schema doesn't forbid this, and our forging technology can handle this explicitly). But at least we can spit out a warning.

I'll work on that.
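
A minimal sketch of how such a size check could look, written in Python purely for illustration (it is not the project's implementation). All function names are hypothetical. It assumes an uncompressed SNP-major PLINK *.bed file (3 header bytes, then 2 bits per genotype with each SNP row padded to a whole byte) and an unpacked, text-based EIGENSTRAT *.geno file with Unix line endings (one character per individual plus a newline per SNP). Packed binary *.geno files are not covered. The expected SNP count per snpSet is deliberately left as a parameter here.

```python
import os
from typing import Optional

BED_HEADER_BYTES = 3  # PLINK .bed files start with three magic bytes


def expected_bed_size(n_snps: int, n_individuals: int) -> int:
    """Expected byte size of a SNP-major PLINK .bed file."""
    # Four genotypes (2 bits each) fit into one byte; each SNP row is
    # padded to a whole number of bytes.
    bytes_per_snp = (n_individuals + 3) // 4
    return BED_HEADER_BYTES + n_snps * bytes_per_snp


def expected_eigenstrat_geno_size(n_snps: int, n_individuals: int) -> int:
    """Expected byte size of a text EIGENSTRAT .geno file (assumes '\\n' line endings)."""
    # One character per individual, plus one newline character per SNP line.
    return n_snps * (n_individuals + 1)


def size_check_warning(path: str, expected_size: int) -> Optional[str]:
    """Return a warning message on a size mismatch, or None if the size matches.

    A mismatch is reported as a warning rather than raised as an error,
    because a package may legitimately contain fewer SNPs than the snpSet suggests.
    """
    actual_size = os.path.getsize(path)
    if actual_size == expected_size:
        return None
    return (
        f"Warning: {path} has {actual_size} bytes, "
        f"but {expected_size} bytes were expected from the snpSet and the number of individuals."
    )
```

For example, a *.bed file with 10 SNPs and 5 individuals would be expected to have 3 + 10 * 2 = 23 bytes under these assumptions.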

@stschiff
Member

> It has become clear that a size-check is not possible. But independent of that I was hoping for more.
>
> I hope we could get an error message that looks something like this:
>
> Issues in genotype data parsing:
> Can not parse SNP A in line B of file C of package D for individual E.
> Error occurred when trying to parse this chunk: "this is not the SNP you're searching for"
>
> Is this science fiction with our current implementation?

I think it's not science fiction. My sequence-formats parsers can provide all that information; it's just a matter of having all the data ready to create that error message, which might involve some refactoring here and there. I'll look into it.
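
To make the idea of "having all the data ready" concrete, here is a rough Python sketch of an error type that carries the package, file, line, SNP and individual along with the offending chunk. It is only an illustration of the pattern, not how sequence-formats or the Poseidon tooling is implemented. `GenotypeParseError` and `parse_geno_line` are hypothetical names, and the parser assumes a text EIGENSTRAT *.geno line with one genotype code (0, 1, 2, or 9 for missing) per individual.

```python
from typing import List

VALID_CALLS = {"0", "1", "2", "9"}  # EIGENSTRAT genotype codes; 9 means missing


class GenotypeParseError(Exception):
    """Parse failure that records where in the data the problem occurred."""

    def __init__(self, package: str, file_name: str, line_number: int,
                 snp_id: str, individual: str, chunk: str) -> None:
        self.package = package
        self.file_name = file_name
        self.line_number = line_number
        self.snp_id = snp_id
        self.individual = individual
        self.chunk = chunk
        super().__init__(
            "Issues in genotype data parsing:\n"
            f"Can not parse SNP {snp_id} in line {line_number} of file {file_name} "
            f"of package {package} for individual {individual}.\n"
            f'Error occurred when trying to parse this chunk: "{chunk}"'
        )


def parse_geno_line(line: str, line_number: int, snp_id: str,
                    individual_ids: List[str], file_name: str,
                    package: str) -> List[int]:
    """Parse one genotype line, raising a contextualised error on failure."""
    calls = line.rstrip("\n")
    for index, individual in enumerate(individual_ids):
        # A line that is too short or contains an unknown code is reported
        # for the first individual whose genotype cannot be read.
        if index >= len(calls) or calls[index] not in VALID_CALLS:
            raise GenotypeParseError(
                package, file_name, line_number, snp_id, individual, calls[:20]
            )
    return [int(call) for call in calls[:len(individual_ids)]]
```

The caller is the one that knows the package and file name, so in this sketch that context is passed down into the parsing function. An alternative would be to catch a low-level error further up and re-raise it with the additional fields attached.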
