Remove enchant #59

Open
wants to merge 18 commits into main
Conversation

@DylanASHillier (Contributor) commented Feb 18, 2025

This was really annoying me...

Changes:
- Adds dictionaries for English (US and UK), downloaded from hunspell, which is used by e.g. OpenOffice (envs.utils.word_lists.py)
- Unifies the usage of word lists depending on whether NLTK is available, and on whether pronouns should be included
- Removes all references to enchant, plus the dependency itself
- Also reworks the WordLadder environment (mainly on the algorithm side)

Minor:
- My parser corrected some things, sorted imports, etc.
- The typing in the word_lists utils file may also need correcting, since it uses Python 3.11 syntax (a rough sketch of the module's interface follows below)
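For context, here is a minimal sketch of the kind of interface the new word_lists module might expose. The attribute and method names are taken from the diff below (get_all_words, nltk_words); everything else is a guess, not the PR's actual code:

```python
# Hypothetical sketch of envs.utils.word_lists, assuming the names visible
# in this diff; constructor parameters and loading logic are illustrative.
class Dictionary:
    def __init__(self, use_nltk: bool = True, include_pronouns: bool = False) -> None:
        self.nltk_words: list[str] = []      # stems from the NLTK corpus
        self.hunspell_words: list[str] = []  # expanded en_US / en_GB entries

    def get_all_words(self) -> list[str]:
        # De-duplicated union of all loaded sources, as sampled by the environments.
        return sorted(set(self.nltk_words) | set(self.hunspell_words))
```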

        Initialize the Word Chains game environment.

        Args:
            max_turns (int): Maximum number of turns before the game ends in a draw.
        """

        # Ensure NLTK words are loaded
        self.word_list = list(set(word.lower() for word in words.words()))
Technically this behaviour changes: the dictionary is now also built from the other dicts, so it contains more than just word stems.
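Roughly, the combined set would be assembled like this (a sketch under assumed names; the loading order and normalisation are guesses):

```python
from nltk.corpus import words

# Hypothetical helper: union the NLTK stems with the expanded hunspell
# word lists added in this PR. `hunspell_word_lists` is an assumed name.
def build_word_set(hunspell_word_lists: list[list[str]]) -> set[str]:
    combined = {word.lower() for word in words.words()}  # NLTK stems
    for word_list in hunspell_word_lists:
        combined.update(word.lower() for word in word_list)
    return combined
```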

@@ -56,19 +42,21 @@ def reset(self, seed: Optional[int]=None):
         random.seed(seed)

         # pick a starting word
-        starting_word = random.choice(self.word_list)
+        starting_word = random.choice(self.dictionary.get_all_words())
For the prior behaviour, use self.dictionary.nltk_words here instead of get_all_words().
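In other words, restoring the old sampling is a one-line change (nltk_words being the attribute referred to above):

```python
# Pre-PR behaviour: sample only from the NLTK-derived stems.
starting_word = random.choice(self.dictionary.nltk_words)
```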

if flag in suffixes:
    # apply suffix
    for rule in suffixes[flag]:
        # continue if flag is not mergeable
This code is not very pretty. Fairly certain it works, though.
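For anyone reviewing the affix logic: hunspell SFX rules are (strip, affix, condition) triples, applied roughly as below. This is a simplified sketch of the format's semantics, not the PR's actual parser:

```python
import re

# Simplified hunspell-style suffix expansion. Each rule strips `strip`
# from the end of the word (the literal "0" means strip nothing) and
# appends `affix`, but only if the word's ending matches `condition`.
def apply_suffix_rules(word: str, rules: list[tuple[str, str, str]]) -> list[str]:
    expanded = []
    for strip, affix, condition in rules:
        if re.search(condition + "$", word):
            stem = word[:-len(strip)] if strip != "0" else word
            expanded.append(stem + affix)
    return expanded

# e.g. the rule ("y", "ies", "[^aeiou]y") turns "carry" into "carries"
apply_suffix_rules("carry", [("y", "ies", "[^aeiou]y")])  # -> ["carries"]
```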

Obviously not ideal to add 120,000 lines, but it is just a dictionary file. Do we need to add sourcing/attribution somewhere?

No effective difference here beyond changing the dictionary source.

-        for i, word1 in enumerate(filtered_words):
-            for word2 in filtered_words[i+1:]:
-                if self.one_letter_difference(word1, word2):
+        # Add edges for words differing by one letter using a more efficient approach
This is a more efficient but equivalent version (checked for equivalence).

Efficiency really matters here for a large word list.
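The diff doesn't show the full replacement, but the standard way to get this below O(n^2) pairwise comparisons is wildcard bucketing; whether the PR uses exactly this trick is an assumption:

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Bucket words by each single-wildcard pattern ("c_t" collects cat, cot,
# cut, ...); every pair within a bucket differs at exactly that position.
# Bucketing is roughly O(n * k) for n words of length k, versus O(n^2 * k)
# for all-pairs comparison.
def build_ladder_graph(words_of_len_k: list[str]) -> nx.Graph:
    buckets: dict[str, list[str]] = defaultdict(list)
    for word in words_of_len_k:
        for i in range(len(word)):
            buckets[word[:i] + "_" + word[i + 1:]].append(word)
    graph = nx.Graph()
    graph.add_nodes_from(words_of_len_k)
    for bucket in buckets.values():
        for w1, w2 in combinations(bucket, 2):
            graph.add_edge(w1, w2)
    return graph
```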

# for i, word in enumerate(self.k_len_words):
#     for other_word in self.k_len_words[i + 1 :]:
#         if sum(a != b for a, b in zip(word, other_word)) == 1:  # check if the words differ by exactly one letter
#             graph.add_edge(word, other_word)
Remove this dead code?


for start_word in start_words:
    # Use single-source shortest paths from each start word
    lengths = nx.single_source_shortest_path_length(G, start_word)
This is more efficient than computing shortest paths between all pairs of nodes, and it should still give a decent sampling, since the start words are sampled anyway.
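Concretely, the sampling idea looks something like this (target_len and the pair selection are illustrative assumptions; only the networkx call is from the diff):

```python
import networkx as nx

# BFS from each sampled start word only, instead of all-pairs shortest
# paths. `cutoff` bounds the search depth to the ladder length we want.
def sample_ladder_pairs(G: nx.Graph, start_words: list[str], target_len: int):
    pairs = []
    for start_word in start_words:
        lengths = nx.single_source_shortest_path_length(G, start_word, cutoff=target_len)
        candidates = [w for w, d in lengths.items() if d == target_len]
        if candidates:
            pairs.append((start_word, candidates[0]))
    return pairs
```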
