natural language not identified #88

Open
sydb opened this issue Jul 19, 2024 · 0 comments

sydb commented Jul 19, 2024

#74 notwithstanding, there seem to be at least some cases where the natural language is simply not encoded. For a gross example, there is an entire <div> towards the end of article 000114 (starts on line ~3070) that is (except for part of the <title>) entirely in French, but is explicitly encoded as being in English. (There is only one @xml:lang in the entire file, and it is <text xml:lang="en">. :-)

There are various ways to go about finding such cases. The first is manual: using the new search interface, enter a common word from a non-English language (say “deux”, which is French), and search for it with “Search only in” set to every language except French. (This might be done more easily either by using the URL directly, e.g. something like this monstrosity, or by just selecting English, on the theory that a French passage is either properly encoded as French or not encoded at all, in which case it most likely shows up as English; it is unlikely to be mis-encoded as, say, Arabic.)

Another method to kick this off would be to

  1. Add xml:lang="en" to every outermost element.
  2. Find all tokens that are not indicated as being in English.
  3. For each such token, search for it in English text.

This process would produce a list of thousands of tokens and, for each token, a list of the files in which that token is found in English text. E.g.

---------Bélgica:
---------Bérard:
     12 articles/000297/000297.xml
      1 articles/000482/000482.xml

would indicate that “Bélgica” does not occur in English text, and “Bérard” occurs in English text 13 times, once in article 000482, the rest in 000297. (These make me think that removing proper nouns from the list of tokens could be quite helpful.)
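One cheap way to prune proper nouns would be to drop every token that begins with a capital letter before running step 3. The sketch below assumes the step-2 token list has been saved one-per-line to tokens.txt (a hypothetical file name); the heuristic is blunt, but the discarded tokens are easy to eyeball afterwards.

     # Hypothetical pruning pass over the step-2 token list (tokens.txt, one
     # token per line). Dropping tokens that begin with an uppercase letter is
     # a blunt proxy for "probably a proper noun"; with a UTF-8 locale,
     # [[:upper:]] also catches accented capitals (e.g. the É of "Études").
     grep -v '^[[:upper:]]' tokens.txt > tokens-pruned.txt
     wc -l tokens.txt tokens-pruned.txt   # see how much the list shrank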

The problem with this approach is that it would be a sort of reverse case of Schlemiel the painter. After someone fixes a chunk of text (say the <div> in 000114), the information would be out of date: the list of foreign tokens would have expanded by zero or more tokens (maybe a lot), and the set of English text nodes to search would have decreased by at least one. Thus it might seem reasonable to run the search again. (But given how long it takes, it would not be.)

Notes-to-self for each item in the numbered list above
[1] This can be done with just xmlstarlet ed --inplace --ps --insert "/*[not(@xml:lang)]" -t attr -n xml:lang -v en ARTICLES, where ARTICLES is either articles/*/*.xml (for all possible articles plus some false hits), articles/0*/0*.xml (for all articles that are not demos or templates, avoiding most false hits), or $( xsel -t -m "/toc/journal[not(@editorial='true')]//item" -v "concat('articles/', @id,'/', @id, '.xml')" -n toc/toc.xml ) for the list of non-editorial articles from the TOC.
[2] Something like xsel -t -m "//text()[not(ancestor::*[@xml:lang][1]/@xml:lang='en')]" -v "normalize-space( translate( translate( .,' ',' '), '()[]{}?:.,;','') )" -n ARTICLES | perl -pe 's, ,\n,g;' | rank is fast to do and comes reasonably close.
[3] For which time for w in LIST ; do echo "---------$w:" ; xsel -t -m "//text()[ancestor::*[@xml:lang][1]/@xml:lang='en'][starts-with(normalize-space(.),'$w') or contains(normalize-space(.),'$w')]" -f -n ARTICLES | rank ; done, where LIST is either the output of step 2 or just the code for step 2 in a subshell, would almost do the job. It fails if the foreign word contains an apostrophe. And, of course, given that there are almost 32,000 tokens to test, each against ~740 files, it would take a really long time to run: if searching a single file takes an average of 10 ms, the entire run works out to roughly 32,000 × 740 × 10 ms ≈ 237,000 s, or ~2¾ days. So finding some way of reducing the number of tokens to search for would be a very helpful first step, e.g. removing the 679 cases that are just numbers (i.e., that match ^[0-9]+$). Then figuring out whether there is any way to do this in parallel. And maybe doing something clever like extracting all the English text from every file in advance; a sketch of those last two ideas follows.
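A minimal sketch, assuming a hypothetical english-text/ scratch directory, the tokens-pruned.txt list from the earlier sketch, and that GNU parallel is available; xmlstarlet sel is written out in place of the xsel alias used above.

     # One-time pass: dump the English-language text nodes of each article into
     # a plain-text cache, one file per article.
     mkdir -p english-text
     for f in articles/0*/0*.xml ; do
       xmlstarlet sel -t \
         -m "//text()[ancestor::*[@xml:lang][1]/@xml:lang='en']" \
         -v "normalize-space(.)" -n "$f" > "english-text/$(basename "$f" .xml).txt"
     done

     # Search pass: grep the cached text instead of re-parsing the XML for every
     # token. -F takes each token literally (apostrophes are harmless), -l lists
     # only the files that hit, and GNU parallel spreads the tokens across cores.
     parallel --tag 'grep -l -F {} english-text/*.txt' :::: tokens-pruned.txt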

sydb added the encoding update (Global update to DHQ article encoding) label Jul 19, 2024
sydb self-assigned this Jul 19, 2024