Use Case
SpeziLLM should provide a robust testing infrastructure that supports automated testing across all LLM inference layers: local, fog, and cloud, as well as the core SpeziLLM target. This infrastructure then enables comprehensive UI and unit testing, significantly reducing the risk of deploying broken or unreliable code.
Problem
As of now, SpeziLLM contains little testing infrastructure, with only a handful of UI and unit tests. Test coverage of both code and functionality is rather low, with the exception of the OpenAI function calling DSL.
This leads to scenarios where bugs escape review and are merged into main (and subsequently tagged), effectively resulting in us shipping error-prone code and breaking functionality for people depending on SpeziLLM.
It also causes a broad range of follow-on issues: necessary fixes slow down feature development, functionality becomes inconsistent, maintenance effort increases, and onboarding new people onto the project becomes harder to scale.
Solution
SpeziLLM should provide a solid testing infrastructure that enables testing of all LLM inference layers (local, fog, and cloud) as well as the core SpeziLLM target.
As we're dealing with language models that require significant compute and network resources and depend on external systems, automated testing is not straightforward.
Testing areas to be tackled:
Local inference layer
SpeziLLM uses mlx-swift for LLM inference operations on the local layer (the actual device). Currently, MLX doesn't allow direct testing of LLM inference on the iOS Simulator. However, a common workaround, described in the MLX documentation, is to run the application as a Mac (Designed for iPad) application by adding a Mac (Designed for iPad) destination to the target. This shouldn't be an issue as SpeziLLM supports macOS.
The goal is to test the full local inference capability of SpeziLLM end-to-end (including model download and inference streamed in a chat format) via UI tests. The language model used must be rather small in order to keep compute and network requirements low. Even so, such a UI test will be rather time-intensive in the CI, so we need to think about its implications for the runner.
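A minimal sketch of what such an end-to-end UI test could look like, built on XCTest's XCUIApplication APIs; the launch argument, accessibility identifiers, and screen flow below are assumptions and would need to match the actual UI test app:

```swift
import XCTest

/// Sketch of an end-to-end UI test for the local inference layer.
/// The launch argument and accessibility identifiers are hypothetical
/// and need to match the actual UI test app.
final class LLMLocalEndToEndTests: XCTestCase {
    func testLocalModelDownloadAndInference() throws {
        let app = XCUIApplication()
        app.launchArguments = ["--useSmallTestModel"]   // hypothetical flag selecting a small test model
        app.launch()

        // Trigger the model download and wait generously; even a small model takes a while.
        app.buttons["Download Model"].tap()
        XCTAssertTrue(
            app.buttons["Start Chat"].waitForExistence(timeout: 600),
            "Model download did not complete in time."
        )

        // Run a simple inference in the chat UI and assert that a streamed response appears.
        app.buttons["Start Chat"].tap()
        let input = app.textFields["Message Input"]
        input.tap()
        input.typeText("Hello!")
        app.buttons["Send Message"].tap()

        XCTAssertTrue(
            app.staticTexts["Assistant Message"].waitForExistence(timeout: 120),
            "No streamed LLM response appeared in the chat."
        )
    }
}
```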
In addition, smaller unit tests covering the LLM download functionality and simple inferences should be added, enabling quicker and simpler testing feedback cycles.
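As a rough illustration of such a smaller download test, the sketch below only uses plain URLSession with a purely illustrative URL; the actual tests would exercise SpeziLLM's own model download logic against a small test artifact:

```swift
import XCTest

/// Sketch of a small unit test around download functionality.
/// The URL is purely illustrative; real tests would target SpeziLLM's download logic.
final class LLMLocalDownloadTests: XCTestCase {
    func testSmallModelArtifactDownload() async throws {
        // Hypothetical small artifact standing in for a test model file.
        let url = try XCTUnwrap(URL(string: "https://example.org/test-model/config.json"))

        let (fileURL, response) = try await URLSession.shared.download(from: url)
        let httpResponse = try XCTUnwrap(response as? HTTPURLResponse)

        XCTAssertEqual(httpResponse.statusCode, 200)
        XCTAssertTrue(FileManager.default.fileExists(atPath: fileURL.path))
    }
}
```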
Fog inference layer
The fog layer depends on a fog node deployed via Docker somewhere on the local network and discoverable via mDNS. A proper local testing setup therefore requires such a node running in the local development environment, e.g., on a Mac. In the CI, a fog node must be started on the actual GitHub runner, the UI / unit tests must then be run on the SpeziLLM side, and finally all fog node containers and resources must be cleaned up.
However, as our CI runners (Mac minis / Mac Studios) are already virtualized and already run the tests in containers, hosting such a fog node deployment on the CI runner will be challenging due to limitations on nested virtualization. We need to figure out a good workaround for that issue.
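Independent of how the node is deployed, the tests themselves could guard against a missing fog node and skip gracefully. A minimal sketch using Apple's Network framework for Bonjour/mDNS discovery; the advertised service type `_https._tcp` is an assumption and must match the fog node's actual configuration:

```swift
import Network
import XCTest

/// Sketch of a helper that discovers a fog node via mDNS/Bonjour before running fog-layer tests.
/// The service type "_https._tcp" is an assumption and has to match what the fog node advertises.
enum FogNodeDiscovery {
    static func fogNodeAvailable(timeout: TimeInterval = 10) async -> Bool {
        await withCheckedContinuation { continuation in
            let queue = DispatchQueue(label: "fog.node.discovery")
            let browser = NWBrowser(
                for: .bonjour(type: "_https._tcp", domain: "local."),
                using: .tcp
            )
            var finished = false

            browser.browseResultsChangedHandler = { results, _ in
                guard !finished, !results.isEmpty else { return }
                finished = true
                browser.cancel()
                continuation.resume(returning: true)
            }
            browser.start(queue: queue)

            // Give up after the timeout so runs without a fog node can skip instead of hanging.
            queue.asyncAfter(deadline: .now() + timeout) {
                guard !finished else { return }
                finished = true
                browser.cancel()
                continuation.resume(returning: false)
            }
        }
    }
}

final class LLMFogTests: XCTestCase {
    func testFogNodeReachable() async throws {
        guard await FogNodeDiscovery.fogNodeAvailable() else {
            throw XCTSkip("No fog node discoverable on the local network; skipping fog-layer tests.")
        }
        // Actual fog inference tests against the discovered node would follow here.
    }
}
```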
Cloud inference layer (OpenAI)
The cloud layer depends on the OpenAI APIs to stream inference results back to the SpeziLLM client. As we cannot use the real OpenAI APIs in automated testing (cost concerns when running tests in the CI), we need to mock this service. Since we're currently still using the MacPaw OpenAI Swift client, mocking the API is hardly possible in the current setup, short of deploying a separate mock server on the CI runner that implements the OpenAI API and provides mock responses. However, as soon as SpeziLLMOpenAI: Replace MacPaw/OpenAI With Generated API Calls #64 (the OpenAPI-generated OpenAI client) gets merged, we should have more freedom to implement mocking on the OpenAI layer.
Testing the function calling capabilities against mocked OpenAI responses would be a good way to go beyond the current, limited testing of the function calling DSL.
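One possible direction, once the generated client is in place, is to intercept requests at the URLSession level via a custom URLProtocol and return canned OpenAI-style payloads. This is only a sketch: it assumes the generated client accepts an injected URLSessionConfiguration, and the JSON payload merely approximates the OpenAI chat completion schema:

```swift
import Foundation
import XCTest

/// Sketch of a URLProtocol-based mock intercepting OpenAI API requests and returning canned responses.
/// Assumes the (OpenAPI-generated) client can be configured with a custom URLSessionConfiguration.
final class MockOpenAIURLProtocol: URLProtocol {
    /// Canned JSON body returned for every intercepted request; set per test.
    static var stubResponseData = Data()

    override class func canInit(with request: URLRequest) -> Bool {
        // Only intercept requests targeting the OpenAI API.
        request.url?.host == "api.openai.com"
    }

    override class func canonicalRequest(for request: URLRequest) -> URLRequest {
        request
    }

    override func startLoading() {
        let response = HTTPURLResponse(
            url: request.url!,
            statusCode: 200,
            httpVersion: "HTTP/1.1",
            headerFields: ["Content-Type": "application/json"]
        )!
        client?.urlProtocol(self, didReceive: response, cacheStoragePolicy: .notAllowed)
        client?.urlProtocol(self, didLoad: Self.stubResponseData)
        client?.urlProtocolDidFinishLoading(self)
    }

    override func stopLoading() {}
}

final class LLMOpenAIMockedTests: XCTestCase {
    func testMockedChatCompletion() async throws {
        // Register the mock protocol on the session configuration the OpenAI client would use.
        let configuration = URLSessionConfiguration.ephemeral
        configuration.protocolClasses = [MockOpenAIURLProtocol.self]
        let session = URLSession(configuration: configuration)

        // Approximate chat completion payload; real tests would mirror the OpenAI schema,
        // including tool / function call responses to exercise the function calling DSL.
        MockOpenAIURLProtocol.stubResponseData = Data(
            #"{"choices": [{"message": {"role": "assistant", "content": "Hello from the mock!"}}]}"#.utf8
        )

        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
        request.httpMethod = "POST"
        let (data, _) = try await session.data(for: request)
        XCTAssertFalse(data.isEmpty)
    }
}
```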
Alternatives considered
There are few alternatives to proper unit testing, UI testing, and CI checks.
Manual testing and code reviews unfortunately still miss some errors in the code, leading to us shipping broken code, and we simply lack the manpower (no dedicated QA) for such an approach.
Additional context
No response
Code of Conduct
I agree to follow this project's Code of Conduct and Contributing Guidelines