Scala interfaces to huggingface transformers and tokenizers

clulab/scala-transformers

scala-transformers

Scala interfaces to newly trained Hugging Face/ONNX transformers and existing tokenizers

The libraries and models resulting from this project are incorporated into processors and generally need no attention unless their functionality is being modified. The details below describe how the pieces fit together.

encoder

To incorporate the encoder subproject as a Scala library dependency, either to access an existing model or because you've trained a new one with the Python code there, you'll need to add something like this to your build.sbt file:

libraryDependencies += "org.clulab" %% "scala-transformers-encoder" % "0.4.0"

New models should generally be published to the CLU Lab's Artifactory server so that they can be treated as library dependencies, although they can also be accessed as local files. Two models have been generated and published. To incorporate them into a Scala project, add the following to build.sbt:

resolvers += "clulab" at "https://artifactory.clulab.org/artifactory/sbt-release"

// Pick one or more.
libraryDependencies += "org.clulab" % "deberta-onnx-model"  % "0.0.3"
libraryDependencies += "org.clulab" % "roberta-onnx-model"  % "0.0.2"

The models reference tokenizers, which must also be added as described in the tokenizer section below.

Please see the encoder README for information about how to generate models and how to download and package Hugging Face tokenizers for use in the tokenizer subproject.

tokenizer

To use the tokenizer subproject as a Scala library dependency, you'll need to add something like this to your build.sbt file:

libraryDependencies += "org.clulab" %% "scala-transformers-tokenizer" % "0.4.0"

See the tokenizer README for information about which tokenizers have already been packaged and how they are accessed.
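
For orientation, the settings from both sections can be combined into a single build.sbt. The following is only a sketch: the versions and resolver URL are the ones quoted above, while the project name is a placeholder assumed for illustration. Check Maven Central and the Artifactory server for current versions before copying it.

```scala
// build.sbt: a sketch combining the settings shown in this README.
// The project name is a hypothetical placeholder, not part of this project.
name := "my-transformers-app"

// The published models are served from the CLU Lab Artifactory server.
resolvers += "clulab" at "https://artifactory.clulab.org/artifactory/sbt-release"

libraryDependencies ++= Seq(
  // Library code: %% appends the Scala binary version to the artifact name.
  "org.clulab" %% "scala-transformers-encoder"   % "0.4.0",
  "org.clulab" %% "scala-transformers-tokenizer" % "0.4.0",
  // Model data: a single % uses the artifact name exactly as written.
  // Pick one or more.
  "org.clulab" % "deberta-onnx-model" % "0.0.3",
  "org.clulab" % "roberta-onnx-model" % "0.0.2"
)
```

The library dependencies use `%%` and the model dependencies use `%`, matching the individual lines above. The resolver is needed only when a model is pulled in as a dependency rather than read from local files.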