A server for serving LLM models over HTTP, using GGML as the backend.
Adapt `config.example.toml` to your needs and save it as `config.toml`, then:

```sh
cargo run --release
```
To enable logging:

```sh
RUST_LOG=llmd=debug cargo run --release
```

On Windows (PowerShell):

```powershell
$env:RUST_LOG="llmd=debug"
cargo run --release
```
To generate completions:

```sh
curl -XPOST --url http://localhost:3000/v1/task/pythia/completion --header 'Content-type: application/json' --data '{"prompt": "Hello "}' -vvv
```
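The same request can be issued from code. A minimal Python sketch using only the standard library; the endpoint, method, and `prompt` field come from the `curl` example above, while the host and port are the defaults assumed throughout this README:

```python
import json
import urllib.request

def completion_request(prompt, base_url="http://localhost:3000"):
    # Build (but do not send) a POST request against the completion
    # endpoint, mirroring the curl invocation above.
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/task/pythia/completion",
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )

req = completion_request("Hello ")
# Send with: urllib.request.urlopen(req) once the server is running.
```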
To stream completions as they are generated:

```sh
curl --url "http://localhost:3000/v1/task/pythia/live?prompt=foo&max_tokens=10" -vvv
```
To generate embeddings:

```sh
curl -XPOST --url http://localhost:3000/v1/task/pythia/embedding --header 'Content-type: application/json' --data '{"prompt": "Hello "}' -vvv
```
To insert an item into a memory:

```sh
curl -XPUT "http://localhost:3000/v1/memory/test?api_key=foo" -vvvv -d "Hello, world" -H "Content-type: text/plain"
```
See `openapi.yaml` for more information (still incomplete).
Build with the `qdrant` feature enabled, then configure:

```toml
[memories.qtest]
store = { qdrant = { url = "http://localhost:6334", collection = "test" } }
dimensions = 3200
embedding_model = "orcamini3b"
```
See poly-backend for further information.
From the root of the repository:

```sh
sudo docker build -t llmd -f cublas.Dockerfile .
sudo docker run -it --rm -v $(pwd)/data:/llmd/data -v $(pwd)/config.toml:/llmd/config.toml --gpus all -e RUST_LOG=debug -p 3000:3000 llmd
```
To chat, connect through a WebSocket to the following endpoint:

```
ws://localhost:3000/v1/task/pythia/chat?api_key=<key>
```
Send messages as text frames, and receive individual token messages. When a message is finished, the server will send an empty text frame.
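The framing convention above (token frames terminated by an empty text frame) can be handled with a small accumulator. A Python sketch, independent of any particular WebSocket library:

```python
def assemble_messages(frames):
    # Accumulate per-token text frames from the chat endpoint into
    # complete messages. An empty text frame marks the end of a message,
    # per the protocol described above.
    messages = []
    current = []
    for frame in frames:
        if frame == "":          # empty frame: message is complete
            messages.append("".join(current))
            current = []
        else:                    # token frame: part of the current message
            current.append(frame)
    return messages

# Example: two messages streamed token by token.
print(assemble_messages(["Hel", "lo", "", "Hi", ""]))  # ['Hello', 'Hi']
```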
Unless `public` is set to `true` in the config, access is only granted when a valid API key is provided. Depending on configuration, this can be a pre-shared static API key and/or a JWT token.

API keys must be supplied either through an `Authorization: Bearer <key>` header or a `?api_key=<key>` query parameter. When both are supplied, the header key takes precedence.
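The precedence rule can be expressed as a small helper. This is a sketch of the client-visible behaviour, not the server's actual implementation:

```python
def effective_api_key(authorization_header=None, query_api_key=None):
    # Resolve which API key the server will use, given an optional
    # Authorization header value and an optional api_key query parameter.
    # The Bearer header takes precedence when both are present.
    if authorization_header and authorization_header.startswith("Bearer "):
        return authorization_header[len("Bearer "):]
    return query_api_key

print(effective_api_key("Bearer abc", "xyz"))  # abc
print(effective_api_key(None, "xyz"))          # xyz
```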
To use static pre-shared API keys, add the allowed keys to the config file (`allowed_keys`).
To use JWT tokens, first configure a shared secret key in the config. Currently only a symmetric key is supported:

```toml
jwt_private_key = { symmetric = "..." }
```
Generated tokens should use the `HS256` algorithm and have an expiry time (`exp`) set. If an `nbf` (not valid before) claim is present, it will be validated. To generate a token for testing, use `cargo run --bin token` (by default this token expires in an hour).
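For illustration, an `HS256` token with an `exp` claim can be minted with just the Python standard library. In practice `cargo run --bin token` or a JWT library does this for you; the secret below is a placeholder matching the config example:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def make_hs256_token(secret: str, claims: dict) -> str:
    # header.payload signed with HMAC-SHA256 over the base64url parts
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (
        b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(claims).encode())
    )
    signature = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)

# Token that expires in an hour, matching the default of the token binary.
token = make_hs256_token("...", {"exp": int(time.time()) + 3600})
print(token)
```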