Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate possible issue with text alignment for deuterocanonical books and unusual versifications #478

Open
Enkidu93 opened this issue Sep 4, 2024 · 9 comments
Assignees

Comments

@Enkidu93
Copy link
Collaborator

Enkidu93 commented Sep 4, 2024

See here for more information: #471 (and here - #442 (comment))

Let's add some tests to verify that we are handling ALL of the deuterocanonical books properly (same amount of testing as for the standard English versification books.

@johnml1135 johnml1135 added this to Serval Nov 1, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Serval Nov 1, 2024
@Enkidu93 Enkidu93 moved this from 🆕 New to 🏗 In progress in Serval Dec 11, 2024
@johnml1135
Copy link
Collaborator

johnml1135 commented Dec 23, 2024

The issue identified was that the mapping between DAN and SUS was incorrect because of a bug in the preprocess code. To really verify that all books work, there are a few main points where this could mess up:

  • In the versification files themselves (there could be a VRS error - how do we handle it?)
  • In the ParallelCorpus algorithm mapping versifications
  • In the preprocess script (as there was here)
  • In the USFM update script when the new USFM files are created.

Where should we make tests for this?

  • First, ensure that there are unit tests for preprocess and USFM update to specifically verify that the target references are used, not the source references. These test may already exist.

Then, use ParallelCorpus level testing to verify the following:

  • Note: Versification mappings are already tested in SIL.Scripture: https://github.com/sillsdev/libpalaso/blob/master/SIL.Scripture/Versification.cs
  • We want to make sure we are testing improper versification mappings or mismatched between the VRS files and the SFM files including:
    • Verses in SFM but not in vrs
    • Verse in vrs, but not in SFM
    • Double mapping? Two different verses map to the same location? What do we do? Do we throw an error, use the first one or combine the verses (i.e., Daniel 13 and BEL are both defined)
  • Mapping between books works
  • Mapping between chapters works
  • Mapping where both the source and target have odd (non-original) versification (i.e. Target has DAN, Orig has BEL, Source has SUS (can this even happen?) - make sure it makes the trip successfully through).
  • I don't see the immediate value in testing all versifications or all deuterocanonical books - just in testing the edge cases. I am assuming that SIL.Scripture does full testing of these books.

@Enkidu93, what do you think?

@Enkidu93
Copy link
Collaborator Author

The issue identified was that the mapping between DAN and SUS was incorrect because of a bug in the preprocess code. To really verify that all books work, there are a few main points where this could mess up:

* In the versification files themselves (there could be a VRS error - how do we handle it?)

I'd be curious for Damien's thoughts, but I wouldn't worry to much about incorrect .vrs files. As the saying goes, 'garbage in garbage out'. I don't think we can take responsibility for this, and undefined behavior is appropriate behavior when bad data is provided. This is something that's skirting the edge of the purview of Serval/Machine. If we do anything at all about this, I'd suggest just a quick validation of the .vrs file (if that's possible) and throw a meaningful error if it's just plain invalid. We simply can't account for all the possible bad .vrs files out there, and furthermore, I'm not sure how often this is an issue: Seems like Paratext should have an automatic check for this or something (?).

* In the ParallelCorpus algorithm mapping versifications

* In the preprocess script (as there was here)

What exactly was the bug?

* In the USFM update script when the new USFM files are created.

This should be less of an issue, right, since we're not doing aligning here so much as just (potentially) changing versification? Where are you imagining an error like this would creep in?

Where should we make tests for this?

* First, ensure that there are unit tests for preprocess and USFM update to specifically verify that the target references are used, not the source references.  These test may already exist.

Yes, we should do this - at least for preprocessing. I'm not sure a test exists that specifically covers this.

Then, use ParallelCorpus level testing to verify the following:

* Note: Versification mappings are already tested in SIL.Scripture: https://github.com/sillsdev/libpalaso/blob/master/SIL.Scripture/Versification.cs

* We want to make sure we are testing improper versification mappings or mismatched between the VRS files and the SFM files including:
  
  * Verses in SFM but not in vrs
  * Verse in vrs, but not in SFM
  * Double mapping?  Two different verses map to the same location?  What do we do?  Do we throw an error, use the first one or combine the verses (i.e., Daniel 13 and BEL are both defined)

Like I said above, I don't know if these are things we really can/should test.

* Mapping between books works

* Mapping between chapters works

* Mapping where both the source and target have odd (non-original) versification (i.e. Target has DAN, Orig has BEL, Source has SUS (can this even happen?) - make sure it makes the trip successfully through).

I think these are all in scope, and hopefully, Mudi has covered some of these in his tests so far (?).

* I don't see the immediate value in testing all versifications or all deuterocanonical books - just in testing the edge cases.  I am assuming that SIL.Scripture does full testing of these books.

Yep, I agree.

@ddaspit
Copy link
Contributor

ddaspit commented Jan 3, 2025

Regarding incorrect vrs files, there might be some simpler cases that we could handle automatically. For example, if the USFM contains a verse that isn't specified in the vrs file, we could automatically add it.

@mudiagaobrikisil
Copy link
Collaborator

I have written tests for the following:

  • Improper versification mappings
  • Verses in vrs but not in SFM. Ensure the verse is not in the parsed paratext project. It should gracefully continue even if there’s a verse in the vrs and not in the SFM.
  • Verses in SFM but not in vrs. Ensure the verse is not in the parsed paratext project. It should gracefully continue even if there’s a verse in the vrs and not in the SFM.
  • Mapping where both the source and target have odd (non-original) versification. As it stands, the test would throw an exception if there is alignment issue with the varying versifications, which is the case for some of the books. However I intend to modify the test to handle the sample case John mentioned (Target has DAN, Orig has BEL, Source has SUS (can this even happen?) and make the test throw an error if there is an oddity.

I intend to write the following tests:

  • To handle double mapping - undefined behaviour but it shouldn’t throw an exception

  • To ensure mapping between books work - this is essential as the bug has to do with mapping between DAN and SUS

Mapping between chapters work

I also intend to verify that there are tests that check if the target references are used for the preprocess and USFM update. If there are none, I write the tests

@johnml1135
Copy link
Collaborator

@ddaspit - how would we add the verse if it isn't in the VRS file? Would we just assume there is no mapping? What if there is mapping? Should we follow some logic like "if there is no mapping used in this specific book then add the verse"?

Also, I am assuming in general we want to be graceful and forgiving for bad VRS and USFM files. There will be other places in Serval that will analyze these issues and tell the user to fix up their stuff, but I believe that this would be a bad place to do it - I believe people are expecting it to to a "best guess" job and not crash.

@ddaspit
Copy link
Contributor

ddaspit commented Jan 10, 2025

The versification defines the number of chapters for each book and the number of verses for each chapter. If the verse number for a particular chapter in the USFM exceeds the number of verses for that chapter in the versification, then we could update the number of verses in the versification.

@johnml1135
Copy link
Collaborator

And, I am assuming, extend any mapping from one book/chapter to another. That sounds feasible.

@ddaspit
Copy link
Contributor

ddaspit commented Jan 13, 2025

@johnml1135 I'm not sure what you mean by extend any mapping from one book/chapter to another. Can you clarify?

@johnml1135
Copy link
Collaborator

vrs files contains:

NAM 1:15 2:13 3:19
NAM 1:15 = NAM 2:1
NAM 2:1-13 = NAM 2:2-14

What if there are extra verses NAM 1:15, NAM 1:16, NAM 2:14 and NAM 2:16. How should they be handled?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🏗 In progress
Development

No branches or pull requests

4 participants