Make the language type within each tag independently determined to decide whether or not to translate the content inside that tag block. #514

Open
XM-8JD2 opened this issue Sep 25, 2024 · 5 comments
Labels
enhancement New feature or request features Only for important features for implement

Comments

@XM-8JD2

XM-8JD2 commented Sep 25, 2024

Describe the enhancement

Make the language type within each tag independently determined to decide whether or not to translate the content inside that tag block.

For example, on YouTube, the language of each comment may not be the same, and the language of each recommended video may also differ. By independently identifying their language based on their separate ID blocks or class blocks, it can prevent the unnecessary translation of languages I am familiar with and avoid potential information loss caused by this.

The same page has different languages distributed in different blocks:
Snipaste_2024-09-24_18-58-45
Each comment is in a different block:
Snipaste_2024-09-24_18-56-19


Perhaps it could learn from uBlock Origin by adding a feature that allows selecting a block within the page and adjusting its coverage area, with the ability to quickly undo changes if a mistake is made.

@vitonsky
Collaborator

Please describe your use case and give a detailed description of how you expect it to work.

Answer the following questions:

  • What problem do you have?
  • Why do you need this change, and how will it solve your problem?
  • What exact behavior do you expect? Describe it in as much detail as possible.

@vitonsky vitonsky added the questions Further information is requested label Sep 25, 2024
@XM-8JD2
Author

XM-8JD2 commented Sep 25, 2024

  • What exactly behavior you expect? Describe as many details as possible

What problem do you have?

When a page contains more than one language at the same time, text that is already in the target language gets translated again.

For example, for YouTube:
Original comment and recommended video name:
Snipaste_2024-09-25_21-22-22

Translated original comments and recommended video names (there is a loss of information, you can notice that some words that were already in Simplified Chinese are missing):
Snipaste_2024-09-25_21-22-40

Why you need this change and how this change will solve your problem?

Repeated translations can sometimes cause the meaning of the language to change (although in some cases, the translation from Traditional Chinese to Simplified Chinese usually does not change the meaning)

What exactly behavior you expect? Describe as many details as possible

I hope that the parts of YouTube's comment text and recommended video titles that are already in the target language will not be translated. I think the way to achieve this is to provide a feature: "specific web element tag names or specific web element classes for specific websites". An additional check determines the language of the text those elements contain, and these texts are translated separately from the existing process and then assembled back into the overall translation result.

For example, the YouTube comment element in the picture has the tag name "ytd-comment-thread-renderer" and the class "style-scope ytd-item-section-renderer". We may be able to use this information to locate specific web elements on a specific website (for example, https://www.youtube.com/*), then first check what language the text in each element is, and only then translate it. After all elements defined as needing this special treatment have been processed, remove them from Linguist's existing workflow, as if they never existed on the page, and then run what Linguist does now.
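A rough sketch of how such per-site rules could be stored and matched, written in TypeScript. The `BlockRule` shape and the function names are hypothetical and are not part of Linguist; the wildcard handling assumes that `*` simply matches any run of characters.

```typescript
// Hypothetical rule shape; none of these names come from Linguist itself.
interface BlockRule {
  // Site pattern such as "https://www.youtube.com/*"; "*" matches any run of characters.
  sitePattern: string;
  // CSS selector for the blocks whose language should be checked separately.
  selector: string;
}

// Convert a wildcard pattern into an anchored regular expression.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// Return the selectors of every rule that applies to the given page URL.
function selectorsForUrl(rules: BlockRule[], url: string): string[] {
  return rules
    .filter((rule) => patternToRegExp(rule.sitePattern).test(url))
    .map((rule) => rule.selector);
}

const rules: BlockRule[] = [
  { sitePattern: "https://www.youtube.com/*", selector: "ytd-comment-thread-renderer" },
];
```

In a content script, the returned selectors could then be joined with commas and passed to `document.querySelectorAll` to collect the blocks that need a separate language check.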

The configuration I am currently using (English->Chinese):
linguist-config_1727272681167.json

Extension: Having the user press F12 to find the web element that contains the text of interest, and then enter the tag name or class by hand, may not be user-friendly. So you could learn from uBlock Origin and add a web element locator to make this process user-friendly, although this is not necessary. Screenshot of uBlock Origin's "web element locator":
Snipaste_2024-09-25_21-48-23
If I actually create this rule, this rule is logged by uBlock Origin as:
Snipaste_2024-09-25_21-48-49

Then all comments really disappear, which means that this rule can indeed target all similar web elements:

Snipaste_2024-09-25_21-48-30

As shown above, Linguist may also be able to add a UI to store rules like this and then re-check the language of the text matched by such rules. That said, the element locator seems to do some extra work so that the element is located accurately, so its rules are not exactly the same as a simple tag name or class.

@vitonsky
Collaborator

OK, let's clarify whether I understand you correctly.

  • The problem is that Linguist translates texts in a language that should not be translated, so an unnecessary translation (for example, from zh to zh) may change or break the meaning of the text.
  • Your proposal is to implement a feature to define rules for locating content text on the page, then detect the language of those texts and skip translation when the language is the same as the target language.

Is this correct?

If yes, then I'd like your opinion on alternative ways to solve the problem.
I like the idea of a selector for picking the text nodes that are most important on a page; it would also be useful for translating that text first.

As a maintainer I have to prefer the simplest and most straightforward solutions, so we should discuss all possible ways to solve the problem and then pick the best one. In any case, the rules picker idea may be implemented in the future as part of other features. Thanks for the idea.

So what do you think, are there other ways to solve the problem? For example, it is technically possible to implement an option "do not translate texts on target language" that forces language detection on every text element; if the detected language is the same as the target language, the translator ignores that element.

Keep in mind that we use the language detection implemented by the browser, which may work badly for short texts. When detection fails for a text, the translator will treat it as subject to translation.
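A minimal sketch of that option in TypeScript, assuming a detector whose result has the same shape as the one returned by the browser's `i18n.detectLanguage` API. The names `isAlreadyInTargetLanguage` and `selectTextsToTranslate` and the `minLength` threshold are hypothetical, not existing Linguist code.

```typescript
// Result shape modelled on the browser's i18n.detectLanguage API.
interface DetectionResult {
  isReliable: boolean;
  languages: { language: string; percentage: number }[];
}

// The detector is injected so the logic can run outside a browser.
type Detector = (text: string) => Promise<DetectionResult>;

// Hypothetical helper: true when the text can be skipped by the translator.
async function isAlreadyInTargetLanguage(
  text: string,
  targetLang: string,
  detect: Detector,
  minLength: number = 10, // assumed cut-off below which detection is not trusted
): Promise<boolean> {
  // Short texts are detected poorly, so they stay subject to translation.
  if (text.trim().length < minLength) return false;

  const result = await detect(text);
  if (!result.isReliable || result.languages.length === 0) return false;

  // Pick the language with the highest share of the text.
  const top = result.languages.reduce((a, b) => (b.percentage > a.percentage ? b : a));
  return top.language.toLowerCase() === targetLang.toLowerCase();
}

// Keep only the texts that still need translation.
async function selectTextsToTranslate(
  texts: string[],
  targetLang: string,
  detect: Detector,
): Promise<string[]> {
  const skip = await Promise.all(
    texts.map((text) => isAlreadyInTargetLanguage(text, targetLang, detect)),
  );
  return texts.filter((_, index) => !skip[index]);
}
```

The comparison is an exact, case-insensitive match on the language code, so variants that the detector reports under different codes are still translated. Undetected or unreliable texts fall through to translation, which matches the fallback described above.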

@XM-8JD2
Author

XM-8JD2 commented Sep 26, 2024

OK, let's clarify whether I understand you correctly.

Yes, you have perfectly summarized my point.

So what do you think, are there other ways to solve the problem? For example, it is technically possible to implement an option "do not translate texts on target language" that forces language detection on every text element; if the detected language is the same as the target language, the translator ignores that element.

Your approach is better; it solves the problem I face with much less effort than mine. Because for my problem, there's no need to distinguish which web element the text comes from; it's enough to differentiate that it comes from different web elements (since adjacent texts rarely mix two languages).

Keep in mind that we use the language detection implemented by the browser, which may work badly for short texts. When detection fails for a text, the translator will treat it as subject to translation.

It indeed doesn't work well sometimes, so I usually disable automatic detection and always set the detected language to English, which means Linguist will always translate for me.


It seems I haven't made any sufficiently good suggestions, but thank you for maintaining this project. It allows me to plan to move away from proprietary browsers and Google, or at least to deeply customize my translation experience.


Related to this, I may have a new question: when there is more than one unfamiliar language on the page (for example, Japanese and English), Linguist seems to translate only one of them. Here's an example. This situation is not very common, but I have encountered it before, so I reproduced it by modifying the page's text content using F12. (The only thing I did was replace the original text with "こんにちは、こんにちは、こんにちは...")
original page:
Snipaste_2024-09-26_23-04-24
Translated pages:
Snipaste_2024-09-26_23-04-46

@vitonsky vitonsky added enhancement New feature or request features Only for important features for implement and removed questions Further information is requested labels Sep 26, 2024
@vitonsky
Collaborator

@XM-8JD2 OK, then the current issue becomes a feature request to implement the option "do not translate texts on target language".

I think we may enable this option by default.

As for the question in your last paragraph, please create a new issue to improve the problem's visibility and tracking.
If possible, provide instructions to reproduce the problem; it would speed up debugging.
