Commit

Add some questions; sample config in YAML.
scossu committed Jul 10, 2022
1 parent b149a87 commit 23c0904
Showing 3 changed files with 8,132 additions and 12 deletions.
144 changes: 132 additions & 12 deletions NOTES.md
@@ -5,6 +5,9 @@
The `.cfg` format seems to follow an INI-like syntax that is parsed ad hoc by
the Transliterator.

Unicode code points are expressed as `U+????` rather than the `\x????` escape
of the standard INI syntax.
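
A minimal sketch of how such tokens could be decoded in the new Transliterator.
It assumes a token is `U+` followed by four to six hexadecimal digits; the
digit count accepted by the legacy parser has not been verified, and the name
`decode_codepoints` is hypothetical.

```python
import re

# Assumption: a code point token is "U+" followed by 4 to 6 hex digits.
# Values above 0x10FFFF are not valid code points and make chr() raise
# ValueError.
CODEPOINT_RE = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def decode_codepoints(text: str) -> str:
    """Replace every U+XXXX token in text with the character it denotes."""
    return CODEPOINT_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


# The key of the mapping `U+4E00=yi_` becomes the character itself.
print(decode_codepoints("U+4E00=yi_") == "\u4e00=yi_")  # True
```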

So far only top-level section names have been encountered.

Key-value pairs may express either a transliteration operation, e.g.
@@ -31,27 +34,44 @@ transliteration sections.
It is unclear how configuration directives can be distinguished from
transliteration rules, except by naming all the possible verbatim copy options.
A more readable and efficient format would have discrete subsections for
configuration and transliteration; if possible, vebatim copy should be
implicit, which would make maintenance easier.
configuration and transliteration—if necessary, with configuration and mapping
subsections inside S2R and R2S sections.

Q: Is it possible to copy non-mapped characters verbatim in script to Roman?
That would remove the need to explicitly add English phrases to the S2R section
such as `publisher not identified=publisher not identified`.

It is unclear at the moment if spaces around the `=` sign are ignored.
Q: Shall spaces around the `=` sign be ignored?

Q: What are the `_` at the end of some mappings, e.g. `U+4E00=yi_` for Chinese?
Are they supposed to add a space where the underscore appears?

## `ReRomanizeRecord.bas`

Much of the code deals with MARC records. There is no need to be concerned with
that, since the new Transliterator is meant to convert text strings to text
strings.

Q: Is it possible (and desirable) to determine the S2R/R2S direction from a
user prompt rather than guessing it from the text, as the legacy software seems
to do?

Q: The software seems to take multi-line directives in the configuration into
account. Is it possible to avoid these for simplicity, or is there a need to
express some mapping in multiple lines?

Detailed breakdown of individual functions follows.


### Functions

#### ReRomanizeText
#### `ReRomanizeText`

- Determine direction (input param): R2S or S2R
- Determine personal name handling
- Whether to uppercase first word


#### LoadOneRomanizationTable
#### `LoadOneRomanizationTable`

Load the cfg file and parse the table metadata. The legacy code reads the file
line by line; we can read the whole thing at once.
@@ -100,28 +120,128 @@ Likely irrelevant keywords:
- Subfield6Code
- SubfieldsAlwaysExcluded

If no keyword is detected, proceed to transliteration. [TODO transliteration
logic details still to be looked at]
If no keyword is detected, proceed to transliteration.


##### `ScriptToRoman` section

The logic is the same as the `RomanToScript` section, but the configuration
keyword are different.
keywords are different.

Currently supported, potentially relevant:

- UppercaseFirstCharacterInSubfield [TODO verify]
- PersonalNameHandling

Likely irrelevant:
Likely irrelevant (MARC-related):

- CreateEmptyFields
- FieldsIncluded
- SubfieldsAlwaysExcluded
- OtherSubfieldsExcludedByTag

[TODO Complete other function analysis]
There is some code (deactivated) to dump the whole table.


#### `RomanizeConvertText`

This gets called at several points by `ReRomanizeText`.

It accepts `in`, supposedly the input text, a `UTF8MarcRecordClass` (probably
the transliteration table) and a `UTF8CharClass` (?).

It replaces `"_"` with `" "` within the transliteration table (so values with
`_` add spaces to the transliterated text).
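
As an illustration of this behavior, here is a minimal sketch. It assumes the
table is a plain mapping of source strings to target strings; the function
name `expand_underscores` and the sample entry are hypothetical.

```python
def expand_underscores(table: dict[str, str]) -> dict[str, str]:
    """Return a copy of the table with "_" in the values replaced by a space."""
    return {src: dest.replace("_", " ") for src, dest in table.items()}


# Sample entry taken from the Chinese mapping `U+4E00=yi_`.
table = expand_underscores({"\u4e00": "yi_"})
print(repr(table["\u4e00"]))  # 'yi '
```

If this reading is correct, the underscore is only a visible placeholder for a
trailing space, which a YAML configuration could express with a quoted string
instead.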

It loops through the text and detects the following leaders:

There is a `LocalMarcRecordObject.SafeStuff()` function that seems critical but
I can't seem to find a definition for it.

There is a select switch within a loop whose function is not entirely clear. Is
it to advance through sections of the text by MARC record internal markers?

```
Select Case sLeader$
    Case "&H"
        sLeader$ = "U+"
    Case "U+"
        sLeader$ = "&x"
    Case "&x"
        sLeader$ = "&X"
    Case "&X"
        sLeader$ = "&h"
    Case "&h"
        Exit Do
End Select
```


#### `LoadRomanizationTables`

This function seems to load all the tables by calling
`LoadOneRomanizationTable` over a list of files.


#### `ReRomanizeTextDetails`

This is the logic of the romanization process by character or syllable.

- Proceed by syllable or by character based on config (note: currently there
doesn't seem to be any cfg file containing the `BySyllables` option)
- Decide to allow case variation based on config
- Proceed scanning the text and looping; look for MARC delimiters
- Define initial only, terminal only, medial only characters
- Decide whether to translate R2V or V2R


#### `EvaluateFirstCharacter`

This determines if the translation is R2V or V2R. Does this work reliably and
independently of any external directive? Could there be some strings in foreign
scripts that start with Latin characters (e.g. numbers or Western terms), and
lead to unexpected results?

(Also the translation is supposedly purpose-driven, as the user should have a
specific direction in mind and wouldn't want the software to decide for them.)
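
The concern can be illustrated with a naive first-character heuristic. This is
a sketch, not the legacy algorithm: the names `guess_direction` and
`resolve_direction` are hypothetical, the direction labels follow the S2R/R2S
naming used earlier in these notes, and "Latin" is approximated by the Unicode
character name.

```python
import unicodedata
from typing import Optional


def guess_direction(text: str) -> str:
    """Naively guess the direction from the first non-space character."""
    first = text.lstrip()[:1]
    if not first:
        raise ValueError("empty input")
    # The default "" avoids a ValueError for characters without a name.
    is_latin = unicodedata.name(first, "").startswith("LATIN")
    return "R2S" if is_latin else "S2R"


def resolve_direction(text: str, direction: Optional[str] = None) -> str:
    """Prefer an explicit direction; guess only when none is given."""
    return direction if direction is not None else guess_direction(text)


print(guess_direction("Beijing"))  # R2S
print(guess_direction("\u5317\u4eac"))  # S2R
# Mostly Chinese text that starts with a Western term is misclassified:
print(guess_direction("IBM \u4e2d\u56fd"))  # R2S
print(resolve_direction("IBM \u4e2d\u56fd", direction="S2R"))  # S2R
```

An explicit `direction` parameter sidesteps the ambiguity entirely, which is in
line with the purpose-driven use described above.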


#### `ReRomanizeTextDetailsReplaceApostrophes`

Replace apostrophe characters with glyphs supported by the foreign script?


#### Field- and UI-related functions

- `RomanizationAssistance`
- `FindFieldCurrentlyPointedTo`
- `RomanizationAssistanceConvertWholeRecord`
- `ReRomanizeAdjustNonfilingIndicators`
- `AddCharSetCodes2Utf8Record`
- `FindScriptByKeyPress`
- `IsFontInstalled`

These functions seem to deal with interface interaction, field/text selection,
and RTF cleanup. Probably very little to nothing needs to be carried over.

Q: Do we need to keep any formatting of the original text? If so, which
formatting tags are allowed? (In an HTML UI the formatting could be in HTML or
Markdown, so this may need to be taken into account.)

Q: It seems like several characters are parsed and added to the text to denote
MARC markers. Do we need to deal with these manually as indicators related to
the script/language handled, or shall we expect any text string input in the
new Transliterator to be clean from MARC flags?

#### `RomanizeConvertDecimalChars`

Convert escape sequences `&#\d{4,5}` to code points.
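
A possible Python equivalent, as a sketch: the pattern is the one given above,
while the optional trailing `;` and the function name `convert_decimal_chars`
are assumptions.

```python
import re

# Pattern from the notes, plus an optional trailing ";" (an assumption).
DECIMAL_ESCAPE_RE = re.compile(r"&#(\d{4,5});?")


def convert_decimal_chars(text: str) -> str:
    """Replace decimal escapes such as &#19968; with the matching character."""
    return DECIMAL_ESCAPE_RE.sub(lambda m: chr(int(m.group(1))), text)


# 19968 is the decimal form of U+4E00.
print(convert_decimal_chars("&#19968;") == "\u4e00")  # True
```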

#### `CreateRomanizationScriptList`

Create list of romanization script options by reading a master file. This will
likely be replaced by a glob-like approach.
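
A sketch of what the glob-like approach could look like, assuming one table
file per script in a single directory. The `.yml` suffix and the function name
`list_script_tables` are assumptions, not decisions recorded in these notes.

```python
from pathlib import Path


def list_script_tables(config_dir: str, suffix: str = ".yml") -> list[str]:
    """Return the sorted table names (file stems) found in config_dir."""
    return sorted(
        p.stem for p in Path(config_dir).glob(f"*{suffix}") if p.is_file()
    )
```

With this approach, adding a script only requires dropping a new file into the
directory; no master file has to be kept in sync.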


## General strategy for rewrite

@@ -178,7 +298,7 @@ server start. This should be auth-protected.

`POST /reload_config`

##### Authantication
##### Authentication

API token (a hard-coded value in a `.env` file should probably suffice)
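
A framework-independent sketch of such a check. It assumes the `.env` file has
already been loaded into the process environment; the variable name
`TRANSLITERATOR_API_TOKEN` and the function name `is_authorized` are
hypothetical.

```python
import hmac
import os


def is_authorized(provided_token: str) -> bool:
    """Check a request token against the one configured in the environment."""
    expected = os.environ.get("TRANSLITERATOR_API_TOKEN", "")
    # Reject everything if no token is configured. compare_digest compares in
    # constant time, which avoids leaking the token through timing.
    return bool(expected) and hmac.compare_digest(
        provided_token.encode(), expected.encode()
    )
```

The handler for `POST /reload_config` would call this with the token sent by
the client and refuse the request when it returns `False`.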

@@ -190,7 +310,7 @@ API token (probably just a hard-coded value in a .env file should suffice)

### Functional approach

1. Load all translation table metadata on server startup. This is equivalent
1. Upon server startup: load all translation table metadata. This is equivalent
to invoking `reload_config` via REST API (see above) and is done by
scanning a designated directory containing only the translation table,
  finding the metadata in the `General` section, discovering the