Releases: LanguageMachines/ticcltools

v0.11

16 Dec 13:22
  • require C++17
  • require latest ticcutils
  • NFC-normalized Unicode strings are now used everywhere
  • testrank script results were outdated since 0.10
  • removed dependency on libtar
  • added --follow option to TiCCL-indexer(NT)
  • several code refactorings and cleanups
  • adapted tests
  • updated GitHub CI

v0.10

01 Mar 10:46

[Ko van der Sloot]

  • LDcalc:
    • No longer filter out n-grams with common parts; the filter was too aggressive
    • Removed some more commented-out old code
  • chainclean: added a --caseless option (default: true)
  • Removed the Roaring versions of the code, which had been unmaintained for years
  • internally shifting towards UnicodeString in general
  • a lot of C++ cleanup, with some refactoring, splitting up long blobs of code

v0.9

14 Sep 12:02

Ko van der Sloot:

  • LDcalc: removed code to filter out ngrams with common parts (experimental)

Maarten van Gompel:

  • Added Dockerfile: containerization support
  • Changed repository status to unsupported!

v0.8

15 Dec 14:44
  • using more recent functions from ticcutils
  • use more code from ticcl_common
  • attempt to solve #42
  • some small code refactoring

v0.7.1

15 Sep 11:11

[Ko van der Sloot]

  • changed ICU requirement to at least 5.6
  • some refactoring
  • started implementing a solution for #42
  • added error message when the index file is empty.

v0.7

15 Apr 13:02

[Martin Reynaert]

  • updated man pages
  • updated README.md

[Ko van der Sloot]
Numerous bug fixes and additions. Added a shared library (.so) for common functions.

The bitType has been changed to uint64_t (the largest standard integer type), which
required some code adaptations, since values < 0 are no longer possible.
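
To illustrate why a switch to an unsigned type calls for code changes: subtracting two uint64_t values never gives a negative result, it wraps around instead. The sketch below is not taken from ticcltools. The alias `bitType` only mirrors the name used in the note above, and `abs_diff` is a hypothetical helper.

```cpp
#include <cstdint>

// Alias named after the type in the release note; the actual
// definition inside ticcltools may differ.
using bitType = std::uint64_t;

// Hypothetical helper: absolute difference that never goes below zero.
// A plain `a - b` would wrap around to a very large value when b > a.
bitType abs_diff(bitType a, bitType b) {
  return (a > b) ? (a - b) : (b - a);
}
```

Code that previously tested for a negative difference would need a comparison like the one in `abs_diff` instead.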

  • TICCL-unk:

    • some changes in UNK detection
    • added a --hemp option
    • create a .fore.clean file when a background corpus is merged in
  • TICCL-stats:

    • added a -n option to use a newline as delimiter
  • TICCL-indexer(NT):

    • better and faster implementation
    • added --confstats option
  • TICCL-LDcalc:

    • added a --follow option for debugging purposes
    • fix for #30
    • added --low and --high parameters
  • TICCL-rank:

    • added a --follow option for debugging purposes
    • added --subtractartifrqfeature1 and --subtractartifrqfeature2 options
    • replaced pairs_combined ranking by median ranking
    • added an n-gram filter
  • TICCL-chain:

    • added --nounk option
    • fix for #38
    • fix for #37
    • use the alphabet file too with --alph
  • TICCL-chainclean: new module to clean chain ranked files

  • TICCL-anahash:

    • also accept lexicons without frequencies (i.e. simple word lists)
    • added a -o option
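
The TICCL-rank note above does not spell out how the median ranking is computed. One plausible reading, offered here purely as an assumption, is that each correction candidate receives one rank per feature and its final score is the median of those ranks. The name `median_rank` is hypothetical and does not come from the ticcltools sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of the per-feature ranks of one candidate.
// With an even number of ranks, the two middle values are averaged.
double median_rank(std::vector<int> ranks) {
  assert(!ranks.empty());  // a candidate needs at least one rank
  std::sort(ranks.begin(), ranks.end());
  const std::size_t n = ranks.size();
  if (n % 2 == 1) {
    return ranks[n / 2];
  }
  return (ranks[n / 2 - 1] + ranks[n / 2]) / 2.0;
}
```

A median is less sensitive to a single outlying feature rank than a sum or mean, which may be one reason for preferring it.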

v0.6

05 Jun 10:49

Intermediate release, with a lot of new code to handle n-grams.
A lot of refactoring was also done, for clearer and more maintainable code.
This is still work in progress.

  • TICCL-unk:

    • more extensive acronym detection
    • fixed artifreq problems in 'clean' punctuated words
    • added filters for 'unwanted' characters
    • added a ligature filter to convert evil ligatures
    • normalize all hyphens to a 'normal' one (-)
    • use a better definition of punctuation (the Unicode character class alone
      is not good enough to decide)
  • TICCL-lexstat:

    • the 'separator' symbol should get freq=0, so it isn't counted
    • the clip value is added to the output filename
  • TICCL-indexer:

    • indexer and indexerNT now produce the same output, using different
      strategies when a --foci file is used.
  • TICCL-LDcalc:
    major overhaul for n-grams

    • added a ngram point column to the output (so NOT backward compatible!)
    • produce a '.short' list for short word corrections
    • produce a '.ambi' file with a list of n-grams related to short words
    • prune a lot of ngrams from the output
  • TICCL-rank:

    • output is sorted now
    • honor the ngram-points from the new LDcalc. (so NOT backward compatible!)
  • TICCL-chain: new module to chain ranked files

  • TICCL-lexclean:

    • added a -x option for 'inverse' alphabet

  • TICCL-anahash:

    • added a --list option to produce a list of words and anagram values
  • added metadata file: codemeta.json

v0.5

19 Feb 14:55
Pre-release
  • updated configuration, also for Mac OS X
  • use of more ticcutils stuff: diacriticsfilter
  • added a TICCL-mergelex program
  • the OMP_THREAD_LIMIT environment variable was ignored sometimes
  • TICCL-unk:
    • fixed a problem in artifreq handling
    • changed acronym detection (work in progress)
    • added -o option
  • TICCL-lexstat:
    • added TTR output
    • added -o option
  • TICCL-indexer:
    • now also handles a --foci file, with some speed-up
    • added a -t option
  • TICCL-LDcalc:
    • be less picky on a few wrong lines in the data
  • added some tests
  • when libroaring is installed we build roaring versions of some modules (experimental)
  • updated man pages
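
Assuming that the TTR reported by TICCL-lexstat is the type-token ratio (the number of distinct word forms divided by the total number of tokens), the measure can be sketched as follows. The function name `type_token_ratio` is hypothetical, and this is not the TICCL-lexstat implementation.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: type-token ratio of a token sequence.
// Returns 0.0 for empty input to avoid dividing by zero.
double type_token_ratio(const std::vector<std::string>& tokens) {
  if (tokens.empty()) {
    return 0.0;
  }
  // The set keeps each distinct word form (type) exactly once.
  std::unordered_set<std::string> types(tokens.begin(), tokens.end());
  return static_cast<double>(types.size()) /
         static_cast<double>(tokens.size());
}
```

For the tokens "a b a c" this gives 3 types over 4 tokens, so 0.75.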

v0.4

04 Apr 10:38
Pre-release
  • first official release
  • added functions to test on Word2Vec datafiles
  • refactoring and modernizing all around