gz compressed files #47
base: master
Conversation
Why do we want to store compressed files in git? Seems like the git repo should just contain the plain-text files.
@scottchiefbaker good question. Ideally the zipping/compression would be automated to avoid people having to do it manually in pull requests when a word is added to the list ... 🤔
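A minimal sketch of what that automation could look like, assuming the plain list lives in a file named `words.txt` (the actual file names may differ); something like this could run in CI or a pre-commit hook:

```typescript
// Minimal sketch: regenerate the gzipped copy whenever the plain-text list changes.
// File names are assumptions; adapt to the repository layout.
import { readFileSync, writeFileSync } from "fs";
import { gzipSync } from "zlib";

const raw = readFileSync("words.txt");
writeFileSync("words.txt.gz", gzipSync(raw, { level: 9 }));
console.log(`wrote words.txt.gz (${raw.length} bytes uncompressed)`);
```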
Bandwidth constraints are still not much of an issue; the whole repository is about 6 MB.
Note that even though it is a plain-text file, GitHub does not allow viewing it in the browser (too big: 4 MB), so compressing the raw file is a valid option for GitHub (the text file cannot be edited online in a browser without crashing it; GitHub already forbids that and treats it as a raw, uneditable file). Even if people don't like complex compression, given that the file should be sorted, it can evidently be compressed a lot by sharing common prefixes. For example, take a sequence of words (shown here in JSON format only for illustration):
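Hypothetically (any short run of words sharing common prefixes will do for illustration):

```json
["abandon", "abandoned", "abandoning", "abandonment", "abandons", "abase"]
```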
could be represented as:
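For that hypothetical run, the drop-count form sketched here would be roughly:

```json
["0abandon", "0ed", "2ing", "3ment", "4s", "5se"]
```

(e.g. `4s`: drop 4 characters from `abandonment`, leaving `abandon`, then append `s` to get `abandons`)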
where the initial integer (also storable as a single byte rather than an ASCII digit, to allow values above 9) indicates how many characters (possibly 0) to drop from the end of the previous entry before appending the new suffix that follows the integer. Note that such a compression scheme, simple as it is, still requires sequential access to recover a full word (this is true for any compression scheme, gzip included); otherwise you have to look backward from a random position to recover the missing prefix. If you are only interested in finding word suffixes of known length, you don't need to look backward until the complete word is rebuilt; you just need to see whether the suffix is present in the table (asserting the full word would still require scanning backward toward the start of the table). But the given (prefix) order still requires searching from multiple positions in the table wherever a suitable part of the suffix appears, so this still amounts to a sequential scan of the full table to locate these partial suffixes. For this suffix-search optimization to be effective you need another index locating the common suffixes, and the table of all words must first be sorted in suffix order:
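For illustration, take a hypothetical run of words sharing the ending `ding`, sorted by their reversed spelling:

```json
["bidding", "breeding", "building", "branding", "boarding"]
```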
so that it can be represented as:
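Assuming the analogous rule (the integer now says how many characters to drop from the start of the previous entry before prepending the new prefix), the run above would become roughly:

```json
["0bidding", "3bree", "4buil", "4bran", "4boar"]
```

(e.g. `3bree`: drop 3 characters from the start of `bidding`, leaving `ding`, then prepend `bree` to get `breeding`)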
And this time you can save a lot of time, since a single scan of the table is enough to locate the suffix. In any case, if you want fast lookup of prefixes (or suffixes), a single sequential table is not sufficient; you need a tree-like structure (tables containing tables as members). For the prefix search, the list of words is pre-ordered in prefix order (the classic order for linguistic dictionaries):
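Reusing the hypothetical prefix-ordered run from above:

```json
["abandon", "abandoned", "abandoning", "abandonment", "abandons", "abase"]
```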
And then compressed recursively as:
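For that run, one possible nesting (shared characters in a string entry, followed by an array of the suffixes that continue it) is:

```json
["aba", ["ndon", ["", "ed", "ing", "ment", "s"], "se"]]
```

The empty string stands for the word `abandon` itself.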
where each entry of a table is either a single suffix, or an ordered array representing a set of suffixes (possibly empty) that start with the same characters, those characters being given by the single string entry just before the array. Given that entries contain only strings or arrays, you can mark a string by adding a 0 prefix, and an array by the number of entries in the array that follows. So:
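Under one possible reading of that convention (count markers are not themselves counted as entries), the hypothetical tree above becomes:

```json
["0aba", 3, ["0ndon", 5, ["0", "0ed", "0ing", "0ment", "0s"], "0se"]]
```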
Note also that I just used the count of entries in each subtable: this makes lookup awkward because you still need to scan subarrays recursively to find where they end. This is easy to solve: instead of indicating only the number of entries at the first level of a subtable, store the total number of entries (across all subtable levels) to skip in order to reach the next item. Such a count is not needed if you only implement a forward iterator scanning the full table of words from start to end, but note that the word list in this English text is big (4 MB), so even compressed it covers ~370,000 words (in the current version) and lookup of a single word will still be slow. You then need the indexing method where prefix lookup is faster, using subarrays as in:
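For the hypothetical tree, replacing the direct count of the outer subarray (3) by the total number of entries it contains at all levels (3 direct plus 5 nested, i.e. 8) would give:

```json
["0aba", 8, ["0ndon", 5, ["0", "0ed", "0ing", "0ment", "0s"], "0se"]]
```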
This is also representable as a single string by suppressing the quotes, some of the commas (only those before an opening bracket or after a closing bracket), and the two outer brackets containing everything:
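Applied to the hypothetical tree above (whitespace also dropped), that rule yields something like:

```
0aba,8[0ndon,5[0,0ed,0ing,0ment,0s]0se]
```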
But this looks like a tree whose nodes are either strings or pointers to subtrees (each array holds a variable number of nodes). Another possible representation could use B-trees (with a maximum number of nodes per page), but I don't think it would be efficient for such a word list, as almost all arrays have a small number of items. B-trees are used in relational databases for their table indexes with arbitrary, randomly distributed keys, but this wordlist does not have arbitrary keys: they form small sequences. The (g)zip compression (based on Lempel-Ziv) is more complex because it can search for common prefixes elsewhere than just the immediately previous word, within some "window" (a maximum distance backward from the word to encode). But this requires storing that distance for each entry, and it may then be more costly than the scheme proposed here. All you have to do is choose how to represent the numbers (the ASCII digits in the previous example) so that they pack well: 0 is very frequent and could be represented by a single bit set to 0, while other integers (greater than or equal to 1) would be encoded with this bit set to 1, followed by the value of the integer minus 1, which could itself use variable-length encoding.
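A minimal sketch of that bit-level number encoding; the 3-bit grouping used for the variable-length part is my own arbitrary choice, not something specified above:

```typescript
// Sketch: 0 costs a single 0 bit; any n >= 1 is a 1 bit followed by (n - 1)
// in a simple variable-length form (continuation flag + 3 payload bits per group).
function encodeNumber(n: number, bits: number[]): void {
  if (n === 0) {
    bits.push(0);                 // the very frequent case: one bit
    return;
  }
  bits.push(1);
  let v = n - 1;
  do {
    const group = v & 0b111;      // low 3 payload bits
    v >>= 3;
    bits.push(v > 0 ? 1 : 0);     // continuation flag: 1 means more groups follow
    bits.push((group >> 2) & 1, (group >> 1) & 1, group & 1);
  } while (v > 0);
}

// The delta sequence 0,0,2,3,4,5 from the hypothetical prefix-compressed list above
const bits: number[] = [];
for (const d of [0, 0, 2, 3, 4, 5]) encodeNumber(d, bits);
console.log(bits.join(""));       // each 0 costs one bit, each small non-zero delta five bits
```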
added gz compressed files (from .txt files) for pako support
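A minimal sketch of how a browser client could then load the compressed list with pako (the asset path `words.txt.gz` is an assumption):

```typescript
// Minimal sketch: fetch the gzipped list and inflate it with pako in the browser.
import { ungzip } from "pako";

async function loadWords(): Promise<string[]> {
  const res = await fetch("words.txt.gz");
  const buf = new Uint8Array(await res.arrayBuffer());
  const text = ungzip(buf, { to: "string" });   // decompress back to plain text
  return text.split("\n").filter(w => w.length > 0);
}

loadWords().then(words => console.log(`${words.length} words loaded`));
```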