Use the default Jieba dict for Chinese search if not set #13005

Merged
8 commits merged on Oct 19, 2024

Snoopy1866
Contributor

@Snoopy1866 Snoopy1866 commented Oct 11, 2024

Subject: If the user does not set the dict option, fall back to the default dict provided by jieba

Feature or Bugfix

  • Feature

Purpose

  • Provide a fallback for when the user does not set the dict option and html_search_language is zh
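
A minimal sketch of the fallback described above, not the code merged in this PR. The helper name resolve_dict_path and both paths are hypothetical; options stands for the mapping a project passes through html_search_options in conf.py.

```python
from __future__ import annotations


def resolve_dict_path(options: dict[str, str], default_dict_path: str) -> str:
    """Return the user-supplied 'dict' option, or the default when it is unset."""
    # dict.get() only falls back when the key is missing, so an explicit
    # user value always wins over the bundled default.
    return options.get('dict', default_dict_path)


# Hypothetical paths, for illustration only.
default_path = '/site-packages/jieba/dict.txt'
print(resolve_dict_path({}, default_path))
print(resolve_dict_path({'dict': '/docs/my_dict.txt'}, default_path))
```

The first call prints the default path and the second prints the user's path, since a user-supplied value takes precedence.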

Detail

  • jieba ships a default dict, whose path can be obtained by calling jieba.get_dict_file().name

Relates

@picnixz
Member

picnixz commented Oct 11, 2024

AFAICT, this would leave a file descriptor open (jieba.get_dict_file() opens a file in read mode). So ideally, we should find a way to retrieve the default dict path without having to open the file. Would that be possible?

@Snoopy1866
Contributor Author

Snoopy1866 commented Oct 11, 2024

@picnixz Yes. The default dict is installed with jieba, so we can get its path directly using os.path.join(os.path.dirname(jieba.__file__), "dict.txt")
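
A small sketch of that idea, using the stdlib json package as a stand-in so it runs without jieba installed. The helper name bundled_file_path is hypothetical; with jieba the call would take the jieba module and "dict.txt".

```python
import json  # stand-in package; the PR applies the same idea to jieba
import os.path
from types import ModuleType


def bundled_file_path(module: ModuleType, filename: str) -> str:
    """Build the path of a file that ships next to a package's __init__.py."""
    # module.__file__ points at the package's __init__.py, so its directory
    # is the package directory. No file is opened here.
    package_dir = os.path.dirname(module.__file__)
    return os.path.join(package_dir, filename)


# 'decoder.py' ships inside the stdlib json package, much like dict.txt
# ships inside jieba.
path = bundled_file_path(json, 'decoder.py')
print(path)
```

Because the path is computed from strings only, no file descriptor is involved.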

@@ -234,7 +234,10 @@ def __init__(self, options: dict[str, str]) -> None:

    def init(self, options: dict[str, str]) -> None:
        if JIEBA:
            dict_path = options.get('dict')
            default_dict_path = os.path.join(
                os.path.dirname(jieba.__file__), 'dict.txt'
Member

@picnixz picnixz Oct 14, 2024

Can we use the jieba constant of dict.txt instead? (I think there is a constant with that name) or is this constant not meant to be publicly available?

If we want to rely only on public API and not on dunder attributes, we could open the file using a with statement, get the filename and then close it. It would probably be costly on Windows machines, because open() calls can be up to 15x slower there than on Linux, but that would probably be the second cleanest way to simplify maintenance.
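
A sketch of the with-statement approach, assuming the function returns an ordinary file object. get_dict_file below is a hypothetical stand-in for jieba.get_dict_file(), and a temporary file replaces jieba's real dictionary so the snippet runs without jieba.

```python
import os
import tempfile
from typing import BinaryIO


def get_dict_file(path: str) -> BinaryIO:
    """Hypothetical stand-in for jieba.get_dict_file(): returns an open file."""
    return open(path, 'rb')


def dict_path_via_public_api(path: str) -> str:
    # The with block closes the file as soon as the name has been read,
    # so no descriptor stays open afterwards.
    with get_dict_file(path) as f:
        return f.name


# Demonstrate with a temporary file instead of jieba's real dictionary.
with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as tmp:
    tmp.write(b'word 1 n\n')
try:
    print(dict_path_via_public_api(tmp.name) == tmp.name)
finally:
    os.remove(tmp.name)
```

The trade-off is one extra open() call each time the path is needed, which is the cost mentioned above.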

Contributor Author

@Snoopy1866 Snoopy1866 Oct 15, 2024

  1. We can use jieba.DEFAULT_DICT_NAME to get the default dict name, but without its parent directory path.
  2. We can use inspect.getfile(jieba) to get the path of the module jieba.
  3. Finally, we can use os.path.join(os.path.dirname(inspect.getfile(jieba)), jieba.DEFAULT_DICT_NAME) to get the path of the default dict file, which does not rely on dunder attributes.

One more thing: I found that inspect provides a 'public API', inspect.getabsfile, which returns the absolute path of a given module, but it has been undocumented for 10 years. cpython#56526

If an undocumented API does not count as public API, it's better to use inspect.getfile instead, I think.

At the same time, GitHub Copilot says that inspect.getfile may return a relative path, but I have not found any evidence of this on the Internet.
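
On the relative path question: as far as I can tell from CPython's inspect module, for a module object inspect.getfile() returns the module's __file__ unchanged, so the result is relative only when __file__ itself is. A hedged sketch follows, with the stdlib json package as a stand-in for jieba and path_next_to_module as a hypothetical helper name; os.path.abspath() guards against a relative path either way.

```python
import inspect
import json  # stand-in for jieba
import os.path

# For a module object, inspect.getfile() hands back module.__file__ unchanged.
print(inspect.getfile(json) == json.__file__)


def path_next_to_module(module, filename: str) -> str:
    """Join filename onto the module's directory without naming __file__ directly."""
    module_dir = os.path.dirname(inspect.getfile(module))
    # abspath() normalises the result in case the module path is relative.
    return os.path.abspath(os.path.join(module_dir, filename))


# With jieba this would be path_next_to_module(jieba, jieba.DEFAULT_DICT_NAME).
print(path_next_to_module(json, 'decoder.py'))
```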

Contributor Author

@Snoopy1866 Snoopy1866 Oct 15, 2024

I have modified my code, please review again. @picnixz

@@ -2,6 +2,7 @@

from __future__ import annotations

import inspect
Member

Can we move this import into the function where it is used, just to save a bit of import time (unless inspect is already imported by some other dependency)?
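
A sketch of what a deferred import could look like, assuming the helper is only called when Chinese search is actually configured. module_dir is a hypothetical name and json stands in for jieba.

```python
import json  # stand-in for jieba
import os.path
import sys


def module_dir(module) -> str:
    # Imported here rather than at module level, so the cost of importing
    # inspect is only paid on the first call (and not at all if some other
    # dependency has already imported it).
    import inspect

    return os.path.dirname(inspect.getfile(module))


print(module_dir(json))
print('inspect' in sys.modules)
```

After the first call the module is cached in sys.modules, so later calls pay only for a dictionary lookup.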

Member

Why use inspect here? It's a very heavy package, we can just use .__file__?

Member

I wanted to use a public API, but maybe it's overkill (my bad, since it was my suggestion :')).

Contributor Author

So reverting to 46753b9 and building the dict path dynamically with jieba.DEFAULT_DICT_NAME should be enough, I think.

Contributor Author

@Snoopy1866 Snoopy1866 Oct 18, 2024

> Why use inspect here? It's a very heavy package, we can just use .__file__?

Should I change the code to do that?

> Can we have this import in the function it is being used just to save a bit of import time (unless this is already imported by some other dependency)?

Maybe that's a compromise, but I have no better idea of how to balance using a public API against avoiding the import of a heavy package.

Member

Let's just use the dunder attribute directly. Sorry for the back and forth

Contributor Author

@Snoopy1866 Snoopy1866 Oct 18, 2024

> Let's just use the dunder attribute directly. Sorry for the back and forth

Done! Please review it again.

@jayaddison jayaddison added the i18n and python labels Oct 19, 2024
@AA-Turner AA-Turner changed the title Fallback to default dict if not set Use the default Jieba dict for Chinese search if not set Oct 19, 2024
@AA-Turner AA-Turner merged commit 80642f6 into sphinx-doc:master Oct 19, 2024
23 checks passed
@AA-Turner AA-Turner added this to the 8.2.0 milestone Oct 24, 2024
@Snoopy1866 Snoopy1866 deleted the feat-search-zh-jieba-fallback branch November 5, 2024 11:58
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 4, 2024
Labels
html search, i18n, python