Analyze Corpora Job #585

Open
johnml1135 opened this issue Dec 20, 2024 · 2 comments

@johnml1135
Collaborator

johnml1135 commented Dec 20, 2024

When a Paratext project is uploaded to Serval as a file and a corpus is created from it, it would be nice to be able to run a gauntlet of analysis tests to see how the corpus is expected to perform on different NLP tasks, including (but not limited to) AI drafting with NLLB-200. The basic flow would be this:

Architecture Serval level

On the corpora endpoint, add a [GET] analysis endpoint
When a corpus is created, the analysis is automatically queued
When a corpus (or a file that the corpus is made of) is updated, the analysis is automatically updated
You can request the analysis; if it is not complete, the response will carry a status of "not complete"
Analysis status will be tracked within the corpora IRepository
Analysis data will be a JSON file stored on disk
Analysis files will be deleted when they are superseded or the corpus is deleted
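
The lifecycle above can be sketched as a small state tracker. This is only an illustration, not Serval code: `AnalysisTracker`, `AnalysisRecord`, the revision counter, and the `resultPath` key are all hypothetical names, and an in-memory dict stands in for the corpora IRepository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class AnalysisStatus(str, Enum):
    QUEUED = "queued"
    COMPLETE = "complete"


@dataclass
class AnalysisRecord:
    # Hypothetical record; in the proposal this would live in the corpora repository.
    corpus_id: str
    revision: int = 1
    status: AnalysisStatus = AnalysisStatus.QUEUED
    result_path: Optional[str] = None  # JSON file on disk once the job finishes


class AnalysisTracker:
    """In-memory stand-in for the status tracking described above."""

    def __init__(self) -> None:
        self._records: Dict[str, AnalysisRecord] = {}

    def on_corpus_created(self, corpus_id: str) -> None:
        # Creating a corpus queues an analysis automatically.
        self._records[corpus_id] = AnalysisRecord(corpus_id)

    def on_corpus_updated(self, corpus_id: str) -> None:
        # An update supersedes the old result; a real service would also delete the old file.
        record = self._records[corpus_id]
        record.revision += 1
        record.status = AnalysisStatus.QUEUED
        record.result_path = None

    def on_analysis_finished(self, corpus_id: str, revision: int, result_path: str) -> bool:
        # Ignore results for deleted corpora or for revisions that were superseded.
        record = self._records.get(corpus_id)
        if record is None or revision != record.revision:
            return False
        record.status = AnalysisStatus.COMPLETE
        record.result_path = result_path
        return True

    def on_corpus_deleted(self, corpus_id: str) -> None:
        self._records.pop(corpus_id, None)

    def get_analysis(self, corpus_id: str) -> dict:
        record = self._records[corpus_id]
        if record.status is not AnalysisStatus.COMPLETE:
            return {"status": "not complete"}
        return {"status": "complete", "resultPath": record.result_path}
```

Tagging each result with the revision it was computed for is one way to keep a slow job from overwriting the status of a corpus that was updated while the job was running.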

Architecture Machine level

A new gRPC endpoint called analysis
The machine-engine server will queue the job
The machine-job server will run the job in Python directly in k8s (not ClearML, no S3 bucket interaction) to keep the response time under 30 seconds
The results will be a JSON file that is either written to disk or returned over gRPC
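
As a rough sketch of the job side, the runner below executes a set of checks inside a time budget and produces one JSON document that can be written to a file or returned to the caller. Everything here is an assumption for illustration: the `run_analysis` name, the `checks` mapping, and the result schema are hypothetical, and no gRPC or machine.py API is shown.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, Optional

# A check takes the corpus text and returns JSON-serializable data (hypothetical contract).
Check = Callable[[str], dict]


def run_analysis(
    corpus_text: str,
    checks: Dict[str, Check],
    out_path: Optional[Path] = None,
    budget_seconds: float = 30.0,
) -> str:
    """Run each check in order and return the combined results as a JSON string."""
    started = time.monotonic()
    results = {}
    for name, check in checks.items():
        # Skip remaining checks once the time budget is used up.
        if time.monotonic() - started > budget_seconds:
            results[name] = {"status": "skipped", "reason": "time budget exceeded"}
            continue
        results[name] = {"status": "ok", "data": check(corpus_text)}
    payload = json.dumps({"checks": results}, indent=2, sort_keys=True)
    if out_path is not None:
        # Option 1: write the result to disk.
        out_path.write_text(payload, encoding="utf-8")
    # Option 2: hand the same payload back to the caller (e.g. to return over gRPC).
    return payload
```

The budget is only checked between checks, so a single slow check can still overrun it; a real job would need a per-check timeout as well.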

What types of analysis could be done?

Before we implement these, we need to review them for suitability, determine whether research is needed, and decide what format we should use to provide the results to SF.

  • Check for all instances of unparsable USFM.
    • Parse each book and see if there are any failures. Don’t let the user select books that fail, but tell them that they are failing and need to be updated.
  • Check for malformed USFM that will likely not be parsed to match intent:
    • We can also create algorithms to look for "suspicious" USFM, including lines that don't begin with a verse number, missing verses, chapter markers without a chapter number, verses in the wrong order, duplicate verses, etc. We can respond "nicely" to them when they occur, but we should warn the user that something is wrong.
  • Detecting incorrect versification
    • We can run the algorithms we already have for this, specifically to detect that the verses present in the books match the expected versification, and that there are not extra (or missing) verses at the end of some chapters (where the rest of the chapter is filled out).
  • Detecting non-normalized text (mixed scripts, spelling issues, etc.)
    • Can we run Wildebeest?
    • From it we can get a lot of analysis about scripts and odd characters - how would we present it to the user?
  • Book metrics
    • Verse counts, completion status, words per verse, characters per verse
  • NLLB Tokenization metrics
    • Number of characters not recognized by the NLLB tokenizer
    • Number of characters per token
  • Detecting versification misalignment
    • Testing for "verses being off by one" would need research, but could be deterministic, for example by correlating sentence lengths per chapter and making sure that they match to a specified degree, or by running an actual alignment and looking for significantly misaligned verse chunks. This could be a comparison against a standard translation, either the Greek and Hebrew or the KJV, etc.
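
Two of the deterministic checks listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an existing Serval or machine.py routine: `check_verse_sequence` and `best_verse_shift` are hypothetical names, verse numbers are assumed to be already extracted per chapter, and the off-by-one test uses a plain Pearson correlation of verse lengths as a stand-in for whatever measure research ends up favoring.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def check_verse_sequence(verses: Sequence[int], expected_last: int) -> Dict[str, List[int]]:
    """Flag duplicate, out-of-order, and missing verse numbers in one chapter."""
    seen = set()
    duplicates: List[int] = []
    out_of_order: List[int] = []
    highest = 0
    for verse in verses:
        if verse in seen:
            duplicates.append(verse)
        elif verse < highest:
            out_of_order.append(verse)
        seen.add(verse)
        highest = max(highest, verse)
    missing = [v for v in range(1, expected_last + 1) if v not in seen]
    return {"duplicates": duplicates, "out_of_order": out_of_order, "missing": missing}


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_verse_shift(
    target_lengths: List[float], reference_lengths: List[float], max_shift: int = 1
) -> Tuple[int, float]:
    """Return the target offset (in verses) whose lengths correlate best with the reference."""
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            target, reference = target_lengths[shift:], reference_lengths
        else:
            target, reference = target_lengths, reference_lengths[-shift:]
        n = min(len(target), len(reference))
        if n < 3:  # too few verses to say anything
            continue
        score = pearson(target[:n], reference[:n])
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score
```

A non-zero best shift for a chapter would only be a hint that verses may be off by one, worth surfacing as a warning rather than an error, since translations legitimately differ in verse length.
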
@ddaspit
Contributor

ddaspit commented Jan 3, 2025

Would Scripture Forge use this? Or is this for tech support troubleshooting?

@johnml1135
Collaborator Author

I would assume that it will likely go through 3 phases (or that each aid would go through these 3):

  1. Test it out with the EITL team, in SILNLP
  2. Transcribe it into machine.py, roll out the proposed architecture in Serval and have the analysis be displayed in a simple way in SF, grouping error types (basic GUI)
  3. Integrate output into Lynx intelligently
