natural language not identified #88

Open
sydb opened this issue Jul 19, 2024 · 0 comments

sydb commented Jul 19, 2024

#74 notwithstanding, there seem to be at least some cases where the natural language is simply not encoded. For a gross example, there is an entire <div> towards the end of article 000114 (starts on line ~3070) that is (except for part of the <title>) entirely in French, but is explicitly encoded as being in English. (There is only one @xml:lang in the entire file, and it is <text xml:lang="en">. :-)

There are various ways to go about finding such cases. The first is manual: using the new search interface, enter a common word from a non-English language (say “deux”, which is French), and search for it with “Search only in” set to every language except French. (This might be done more easily either by using the URL directly, e.g. something like this monstrosity, or by just selecting English, on the theory that a French passage is either properly encoded as French or not encoded at all, in which case it most likely shows up as English; it is unlikely to be mis-encoded as, say, Arabic.)

Another method to kick this off would be to

  1. Add xml:lang="en" to every outermost element.
  2. Find all tokens that are not indicated as being in English.
  3. For each such token, search for it in English text.

This process would produce a list of thousands of tokens and, for each token, a list of the files in which that token is found in English text. E.g.

---------Bélgica:
---------Bérard:
     12 articles/000297/000297.xml
      1 articles/000482/000482.xml

would indicate that “Bélgica” does not occur in English text, and “Bérard” occurs in English text 13 times, once in article 000482, the rest in 000297. (These make me think that removing proper nouns from the list of tokens could be quite helpful.)
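One cheap way to prune proper nouns would be to drop every token that begins with a capital letter before running step 3. The sketch below assumes the step-2 token list has been saved one-per-line to tokens.txt (a hypothetical file name); the heuristic is blunt, but the discarded tokens are easy to eyeball afterwards.

     # Hypothetical pruning pass over the step-2 token list (tokens.txt, one
     # token per line). Dropping tokens that begin with an uppercase letter is
     # a blunt proxy for "probably a proper noun"; with a UTF-8 locale,
     # [[:upper:]] also catches accented capitals (e.g. the É of "Études").
     grep -v '^[[:upper:]]' tokens.txt > tokens-pruned.txt
     wc -l tokens.txt tokens-pruned.txt   # see how much the list shrank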

The problem with this approach is that it would be a sort of reverse case of Schlemiel the painter. After someone fixes a chunk of text (say the <div> in 000114), the information would be out of date: the list of foreign tokens would have expanded by zero or more tokens (maybe a lot), and the set of English text nodes to search would have decreased by at least one. Thus it might seem reasonable to run the search again. (But given how long it takes, it would not be.)

Notes-to-self for each item in the numbered list above
[1] This can be done with just xmlstarlet ed --inplace --ps --insert "/*[not(@xml:lang)]" -t attr -n xml:lang -v en ARTICLES, where ARTICLES is either articles/*/*.xml (for all possible articles plus some false hits), articles/0*/0*.xml (for all articles that are not demos or templates, avoiding most false hits), or $( xsel -t -m "/toc/journal[not(@editorial='true')]//item" -v "concat('articles/', @id,'/', @id, '.xml')" -n toc/toc.xml ) for the list of non-editorial articles from the TOC.
[2] Something like xsel -t -m "//text()[not(ancestor::*[@xml:lang][1]/@xml:lang='en')]" -v "normalize-space( translate( translate( .,' ',' '), '()[]{}?:.,;','') )" -n ARTICLES | perl -pe 's, ,\n,g;' | rank is fast to do and comes reasonably close.
[3] For which time for w in LIST ; do echo "---------$w:" ; xsel -t -m "//text()[ancestor::*[@xml:lang][1]/@xml:lang='en'][starts-with(normalize-space(.),'$w') or contains(normalize-space(.),'$w')]" -f -n ARTICLES | rank ; done, where LIST is either the output of step 2 or just the code for step 2 in a subshell, would almost do the job. It fails if the foreign word contains an apostrophe. And, of course, given that there are almost 32,000 tokens to test, each against ~740 files, it would take a really long time to run: if searching a single file takes an average of 10 ms, the entire run works out to roughly 32,000 × 740 × 10 ms ≈ 237,000 s, or ~2¾ days. So finding some way of reducing the number of tokens to search for would be a very helpful first step, e.g. removing the 679 cases that are just numbers (i.e., that match ^[0-9]+$). Then figuring out whether there is any way to do this in parallel. And maybe doing something clever like extracting all the English text from every file in advance; a sketch of those last two ideas follows.
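A minimal sketch, assuming a hypothetical english-text/ scratch directory, the tokens-pruned.txt list from the earlier sketch, and that GNU parallel is available; xmlstarlet sel is written out in place of the xsel alias used above.

     # One-time pass: dump the English-language text nodes of each article into
     # a plain-text cache, one file per article.
     mkdir -p english-text
     for f in articles/0*/0*.xml ; do
       xmlstarlet sel -t \
         -m "//text()[ancestor::*[@xml:lang][1]/@xml:lang='en']" \
         -v "normalize-space(.)" -n "$f" > "english-text/$(basename "$f" .xml).txt"
     done

     # Search pass: grep the cached text instead of re-parsing the XML for every
     # token. -F takes each token literally (apostrophes are harmless), -l lists
     # only the files that hit, and GNU parallel spreads the tokens across cores.
     parallel --tag 'grep -l -F {} english-text/*.txt' :::: tokens-pruned.txt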

sydb added the encoding update (Global update to DHQ article encoding) label Jul 19, 2024
sydb self-assigned this Jul 19, 2024