FineWeb-2: multilingual, numpy 2.0, minhash improvements (#285)
* change fw quality to strict inequality
* bugfix for empty lines (breaking chinese samples)
* word tokenizers changes: use spacy when possible, added missing languages from spacy and stanza
* add all available tokenizers and all iso-639-1 languages
* fix tokenizer issues
* add todo
* changed indic langs tokenizers to indicnlp
* fix tests
* added khmer, tibetan and lao
* using new tokenizer assignment and language definition system
* fix default script
* add assignments file
* fixes for japanese and tibetan
* fixed khmer tokenizer not being active
* fix for nan
* added south azerbaijani proxy tokenizer
* more tibet workarounds
* georgian tokenizer and korean fix
* added georgian in tokenizer assignments
* fix for korean tokenizer: remove very large numbers
* add additional punctuation and improve number normalization for other scripts
* fix number pattern
* added saving cluster_sizes in minhash and bugfixed saving cluster ids
* added fallback whitespace tokenizer and fixed tokenizer assignment for iso1 codes
* fix memory leaks in word/sent tokenization
* ignore ruff
* add memory zone to spans
* add regex to reqs, update hf tests to reflect the new datasets version, fix word tokenizers global vars
* empty commit
* empty commit
* fmt
* unlock tensorflow version
* bump flask
* fix flaky test
* add comment about flask
* actually fix the flaky test
* japanese tok bugfix
* more generous split for japanese to overcome whatever weird normalization they do
* allow restarting from "sorting buckets" part when ooming in minhash
* small refactor
* bugfix
* bugfix
* jpn word_tokenize
* added sparse arrays option
* add tqdm
* add log msg
* rust
* rust
* rust
* rust
* rust
* rust
* rust
* messages
* messages
* messages
* messages
* sort list of files
* no async sanity test
* fixes
* updates
* bunch of changes
* fix def value
* added check
* added check
* added check
* added check
* remove useless lock
* some improvements
* 1 sec
* GIVE ME MY PROGRESS BARS GOD DAMN IT
* GIVE ME MY PROGRESS BARS GOD DAMN IT
* GIVE ME MY PROGRESS BARS GOD DAMN IT
* revert
* stupid logspath
* giving up. just printing now
* giving up. just printing now
* network limiting
* network limiting
* updated word_tokenizer assignments and added burmese
* add dependency
* add local version
* remove progress message
* fix for no .remove file
* fix missing language tokenizer
* fixes for empty folders
* reuse word tokenizations between blocks
* remove dumb print
* updated url filter blocklists
* updated symbollinesformatter
* moved rust tool
* add rust tool readme
* fix terminal punctuation in fineweb quality filter

---------

Co-authored-by: Hynek Kydlicek <[email protected]>
Co-authored-by: Hynek Kydlicek <[email protected]>