We do all of NeMo-Text-Processing's development in the open. Contributions from the open-source community are welcome.
Send your PRs to the main
branch
- Make sure your PR does one thing. Have a clear answer to "What does this PR do?".
- Make sure you sign your commits. E.g. use
git commit -s
when you commit. - Make sure to add test cases for both
pytest
and Sparrowhawk here. - Make sure all unittests finish successfully before sending PR:
pytest
or (if your machine does not have GPU)pytest --cpu
from the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')
).- Sparrowhawk tests
bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
- If you are adding a new Python file with a license header, the first line needs to be
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
(change2023
to the current year). - If your
text_normalization/LANG/graph_utils.py
is mainly copied from nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should beCopyright 2015 and onwards Google, Inc.
. See an example here. - Add
__init__.py
for every folder and subfolder. - Remove import guards (
try import: ... except: ...
) if not already done. - follow codeQL results and remove unused variables and imports (report is at the bottom of the PR in github review box)
- Add your language support to tools/text_processing_deployment/pynini_export.py.
- Optional: if you added a new language or a new feature please update the NeMo documentation (lives in different repo).
- Send your PR and request a review
-
en/graph_utils.py
anden/utils.py
are the de facto parents for all othergraph_util
andutils
functions, respectively. Please refrain from duplicating code logics into newgraph_utils.py
andutils.py
files and default to imports fromen/graph_utils.py
anden/utils.py
instead.LANG/graph_utils.py
andLANG/utils.py
files should only contain new methods and variables. Not all new languages will require a submodule specificgraph_utils.py
orutils.py
file. -
NeMo-Text-Processing allows creation of FST graphs through two backends: the Python based library itself (via Pynini backend) and C++ based Sparrowhawk in an upstream repo. Due to the typical tradeoffs between these Python and C++ development languages, the NeMo-Text-Processing library assumes development to be performed with the Python library for final deployment in Sparrowhawk/C++. This dual framework approach can lead to issues in development, notably in the case of tagging additional properties during tokenization.
When writing taggers for semiotic classes, you may need to tag additional token properties (e.g. grammatical gender, case) for accurate verbalization. For example, the Spanish ordinal 21.º
carries masculine gender and is verbalized with a specific spelling. As such, it would be desired for the TN tagger to tokenize the string with the gender property included.
Naively, one may be tempted to simply include the property string gender: "masc"
and check for this string during the verbalization phase. This is not advised. While the NeMo-Text-Processing library itself will permit any custom string in the tagger, Sparrowhawk limits permissible strings, and will fail with custom property strings. Given the performance loss in not providing Sparrowhawk support, we cannot integrate new graphs that cause Sparrowhawk failure. As such, tagged properties should be limited to Sparrowhawk supported strings.
For all classes, Sparrowhawk supports the morphosyntactic_features
property, and it is recommended to default to this property for tagging additional features. For example:
21.º" -> ordinal { integer: "vigésimo primero" morphosyntactic_features: "masc" }
For additional Sparrowhawk supported properties by class, see here
N.B. The same limitation applies for novel semiotic classes as well. Only predefined classes are supported in Sparrowhawk.
- Between the tagging and verbalizing stages, both the NeMo-Text-Processing and Sparrowhawk engines permute order of tagged properties. That is, assuming the tagger parsed
1ᵉʳ juillet
as:
date { month: "juillet" day: "1" } }
the verbalizer will receive as input both
date { month: "juillet" day: "1" }
and
date { day: "1" month: "juillet" }
While this eases construction of verbalization graphs, permutation can be computationally expensive. If you know that the tagger output will not require permutation of token properties, you can improve model performance by including the preserve_order: "true"
property:
date { day: "1" month: "juillet" preserve_order: "true" }