Add some questions; sample config in YAML.

lcnetdev · Jul 10, 2022 · 23c0904 · 23c0904
1 parent b149a87
commit 23c0904
Showing 3 changed files with 8,132 additions and 12 deletions.
diff --git a/NOTES.md b/NOTES.md
@@ -5,6 +5,9 @@
 The `.cfg` format seems to follow an INI-like syntax that is ad-hoc-parsed by
 the Transliterator.
 
+Unicode code points are expressed as `U+????` rather than `\x????` of the
+standard INI syntax.
+
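As an illustration of the `U+????` notation, here is a minimal Python sketch of how a mapping line could be decoded. The helper names are hypothetical, and the sketch assumes exactly four hexadecimal digits after `U+` (as the notation above suggests) and no whitespace trimming around `=`, since that question is still open.

```python
import re

# Hypothetical helpers, not part of the legacy code.
# Assumption: a code point is always written as U+ followed by 4 hex digits.
CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4})")


def expand_code_points(text: str) -> str:
    """Replace each U+XXXX token with the character it denotes."""
    return CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), text)


def parse_mapping(line: str) -> tuple[str, str]:
    """Split a `key=value` line at the first `=` and expand both sides."""
    key, _, value = line.partition("=")
    return expand_code_points(key), expand_code_points(value)
```

For example, `parse_mapping("U+4E00=yi_")` yields the character U+4E00 as key and `yi_` as value.
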
 So far only top-level section names have been encountered.
 
 Key-value pairs may express either a transliteration operation, e.g.
@@ -31,27 +34,44 @@ transliteration sections.
 It is unclear how configuration directives can be distinguished from
 transliteration rules, except by naming all the possible verbatim copy options.
 A more readable and efficient format would have discrete subsections for
-configuration and transliteration; if possible, vebatim copy should be
-implicit, which would make maintenance easier.
+configuration and transliteration—if necessary, with configuration and mapping
+subsections inside S2R and R2S sections.
+
+Q: Is it possible to copy non-mapped characters verbatim in script to Roman?
+That would remove the need to explicitly add English phrases to the S2R section
+such as `publisher not identified=publisher not identified`.
 
-It is unclear at the moment if spaces around the `=` sign are ignored.
+Q: Shall spaces around the `=` sign be ignored?
 
+Q: What are the `_` at the end of some mappings, e.g. `U+4E00=yi_` for Chinese?
+Are they supposed to add a space where the underscore appears?
 
 ## `ReRomanizeRecord.bas`
 
 Much of the code deals with MARC records. No need to be concerned about that since
 the new Transliterator is meant to convert text strings to text strings.
 
+Q: Is it possible (and desirable) to determine the S2R/R2S direction from user
+prompt rather than guessing it from the text as the legacy software seems to
+be doing?
+
+Q: The software seems to take multi-line directives in the configuration into
+account. Is it possible to avoid these for simplicity, or is there a need to
+express some mapping in multiple lines?
+
+Detailed breakdown of individual functions follows.
+
+
 ### Functions
 
-#### ReRomanizeText
+#### `ReRomanizeText`
 
 - Determine direction (input param): R2S or S2R
 - Determine personal name handling
 - Whether to uppercase first word
 
 
-#### LoadOneRomanizationTable
+#### `LoadOneRomanizationTable`
 
 Load cfg file (line by line, we can do the whole thing) and parse table
 metadata.
@@ -100,28 +120,128 @@ Likely irrelevant keywords:
 - Subfield6Code
 - SubfieldsAlwaysExcluded
 
-If no keyword is detected, proceed to transliteration. [TODO transliteration
-logic details still to be looked at]
+If no keyword is detected, proceed to transliteration.
 
 
 ##### `ScriptToRoman` section
 
 The logic is the same as the `RomanToscript` section, but the configuration
-keyword are different.
+keywords are different.
 
 Currently supported, potentially relevant:
 
  - UppercaseFirstCharacterInSubfield [TODO verify]
  - PersonalNameHandling
 
- Likely irrelevant:
+ Likely irrelevant (MARC-related):
 
  - CreateEmptyFields
  - FieldsIncluded
  - SubfieldsAlwaysExcluded
  - OtherSubfieldsExcludedByTag
 
-[TODO Complete other function analysis]
+There is some code (deactivated) to dump the whole table.
+
+
+#### `RomanizeConvertText`
+
+This gets called at several points by `ReRomanizeText`.
+
+It accepts `in`, supposedly the input text, a UTF8MarcRecordClass (probably the
+transliteration table) and a UTF8CharClass (?).
+
+It replaces `"_"` with `" "` within the transliteration table (so values with
+`_` add spaces to the transliterated text).
+
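The underscore replacement described above could be sketched in Python as follows. This assumes the table has already been loaded into a plain dictionary; the function name is hypothetical and the legacy code operates on its own table object instead.

```python
def apply_underscore_spacing(table: dict[str, str]) -> dict[str, str]:
    """Return a copy of the table with `_` in each value turned into a space.

    Assumption: keys are source strings and values are their transliterations.
    """
    return {source: target.replace("_", " ") for source, target in table.items()}
```

With the `U+4E00=yi_` mapping mentioned earlier, the value `yi_` becomes `yi` followed by a space, so consecutive syllables come out separated.
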
+It loops through the text and detects the following leaders: `&H`, `U+`,
+`&x`, `&X`, `&h` (as listed in the `Select Case` block below).
+
+There is a `LocalMarcRecordObject.SafeStuff()` function that seems critical but
+I can't seem to find a definition for it.
+
+There is a select switch within a loop whose function is not entirely clear. Is
+it to advance through sections of the text by MARC record internal markers?
+
+```
+Select Case sLeader$
+    Case "&H"
+        sLeader$ = "U+"
+    Case "U+"
+        sLeader$ = "&x"
+    Case "&x"
+        sLeader$ = "&X"
+    Case "&X"
+        sLeader$ = "&h"
+    Case "&h"
+        Exit Do
+End Select
+```
+
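One possible reading of the `Select Case` block above, offered as an assumption rather than a confirmed description of the legacy behaviour: the surrounding loop makes one pass over the text per code point prefix, moving to the next prefix after each pass and exiting after `&h`. A Python sketch of that reading follows; the function name, the four hex digits, and `&H` as the starting leader are all assumptions.

```python
import re

# Order taken from the Select Case block; the starting value "&H" is assumed.
LEADERS = ["&H", "U+", "&x", "&X", "&h"]


def expand_leaders(text: str) -> str:
    """Expand <leader>XXXX escapes, one pass per leader, in the order above."""
    for leader in LEADERS:
        # Assumption: each leader is followed by exactly 4 hex digits.
        pattern = re.compile(re.escape(leader) + r"([0-9A-Fa-f]{4})")
        text = pattern.sub(lambda m: chr(int(m.group(1), 16)), text)
    return text
```

If this reading is right, the new Transliterator could replace the state-machine style loop with a single pass over a fixed list of prefixes, or drop all but `U+` if the other notations never occur in the tables.
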
+
+#### `LoadRomanizationTables`
+
+This function seems to load all the tables by calling
+`LoadOneRomanizationTable` over a list of files.
+
+
+#### `ReRomanizeTextDetails`
+
+This is the logic of the romanization process by character or syllable.
+
+- Proceed by syllable or by character based on config (note: currently there
+    doesn't seem to be any cfg file containing the `BySyllables` option)
+- Decide to allow case variation based on config
+- Proceed scanning the text and looping; look for MARC delimiters
+  - Define initial only, terminal only, medial only characters
+  - Decide whether to translate R2V or V2R
+
+
+#### `EvaluateFirstCharacter`
+
+This determines if the translation is R2V or V2R. Does this work reliably and
+independently of any external directive? Could there be some strings in foreign
+scripts that start with Latin characters (e.g. numbers or Western terms), and
+lead to unexpected results?
+
+(Also the translation is supposedly purpose-driven, as the user should have a
+specific direction in mind and wouldn't want the software to decide for them.)
+
+
+#### `ReRomanizeTextDetailsReplaceApostrophes`
+
+Replace apostrophe characters with glyphs supported by the foreign script?
+
+
+#### Field- and UI-related functions
+
+- `RomanizationAssistance`
+- `FindFieldCurrentlyPointedTo`
+- `RomanizationAssistanceConvertWholeRecord`
+- `ReRomanizeAdjustNonfilingIndicators`
+- `AddCharSetCodes2Utf8Record`
+- `FindScriptByKeyPress`
+- `IsFontInstalled`
+
+These functions seem to deal with interface interaction, field/text selection,
+and RTF cleanup. Probably very little to nothing needs to be carried over.
+
+Q: Do we need to keep any formatting of the original text? And if yes,
+which formatting tags are allowed? (In an HTML UI the formatting could be in
+HTML or Markdown, so this may need to be taken into account.)
+
+Q: It seems like several characters are parsed and added to the text to denote
+MARC markers. Do we need to deal with these manually as indicators related to
+the script/language handled, or shall we expect any text string input in the
+new Transliterator to be clean from MARC flags? 
+
+#### `RomanizeConvertDecimalChars`
+
+Convert escape sequences `&#\d{4,5}` to code points.
+
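A minimal Python sketch of this conversion, assuming the escapes are decimal numeric references of four or five digits. The pattern in the note has no trailing `;`, so the sketch treats it as optional; that choice and the function name are assumptions.

```python
import re

# Assumption: 4 or 5 decimal digits, optionally terminated by a semicolon.
DECIMAL_ESCAPE = re.compile(r"&#(\d{4,5});?")


def convert_decimal_chars(text: str) -> str:
    """Replace each &#NNNN or &#NNNNN escape with the matching character."""
    return DECIMAL_ESCAPE.sub(lambda m: chr(int(m.group(1))), text)
```

For instance, `&#19968;` is decimal for U+4E00, so it is replaced by that character; shorter references such as `&#65;` are left untouched because they fall outside the four-to-five digit range.
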
+#### `CreateRomanizationScriptList`
+
+Create list of romanization script options by reading a master file. This will
+likely be replaced by a glob-like approach.
+
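The glob-like approach could look roughly like the Python sketch below. The function name, the directory argument, and the `*.cfg` default pattern are assumptions; the pattern would change if the tables move to another format such as YAML.

```python
from pathlib import Path


def list_tables(config_dir: str, pattern: str = "*.cfg") -> list[str]:
    """Return the sorted names (without extension) of all table files found."""
    return sorted(p.stem for p in Path(config_dir).glob(pattern) if p.is_file())
```

This removes the need for a separately maintained master file: adding a table to the directory is enough to make it discoverable.
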
 
 ## General strategy for rewrite
 
@@ -178,7 +298,7 @@ server start. This should be auth-protected.
 
 `POST /reload_config`
 
-##### Authantication
+##### Authentication
 
 API token (probably just a hard-coded value in a .env file should suffice)
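
A minimal sketch of such a token check in Python, assuming the token has already been loaded from the `.env` file into the process environment. The variable name `API_TOKEN` and the function name are hypothetical.

```python
import hmac
import os


def is_authorized(presented_token: str, env_var: str = "API_TOKEN") -> bool:
    """Compare the presented token with the configured one in constant time."""
    expected = os.environ.get(env_var, "")
    if not expected:
        # No token configured: refuse rather than allow unauthenticated reloads.
        return False
    return hmac.compare_digest(presented_token.encode(), expected.encode())
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not depend on how many leading characters match.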
 
@@ -190,7 +310,7 @@ API token (probably just a hard-coded value in a .env file should suffice)
 
 ### Functional approach
 
-1. Load all translation table metadata on server startup. This is equivalent
+1. Upon server startup: load all translation table metadata. This is equivalent
    to invoking `reload_config` via REST API (see above) and is done by
    scanning a designated directory containing only the translation table,
    finding the metadata in the `General` section, discovering the