Use Case
SpeziLLM should provide a robust testing infrastructure that supports automated testing across all LLM inference layers: local, fog, and cloud, as well as the core SpeziLLM target. This infrastructure then enables comprehensive UI and unit testing, significantly reducing the risk of deploying broken or unreliable code.
Problem
As of now, SpeziLLM contains little testing infrastructure, with only a handful of UI and unit tests. Test coverage of both code and functionality is rather low, with the exception of the OpenAI function calling DSL.
This leads to scenarios where bugs escape review and are merged into main (and subsequently tagged), effectively resulting in us shipping error-prone code and breaking functionality for people depending on SpeziLLM.
It also causes a broad range of follow-on issues: necessary fixes slow down feature development, functionality becomes inconsistent, maintenance effort increases, and onboarding new people onto the project becomes harder to scale.
Solution
SpeziLLM should provide a solid testing infrastructure that enables testing of all LLM inference layers (local, fog, and cloud) as well as the core SpeziLLM target.
As we're dealing with language models that require significant compute and network resources and depend on external systems, automated testing is not straightforward.
Testing areas to be tackled:
Local inference layer
SpeziLLM uses mlx-swift for LLM inference operations on the local layer (the actual device). Currently, MLX doesn't allow direct testing of LLM inference on the iOS Simulator. However, a common workaround, described in the MLX documentation, is to run the application as a Mac (Designed for iPad) application by adding a Mac (Designed for iPad) destination to the target. This shouldn't be an issue as SpeziLLM supports macOS.
The goal is to test the full local inference capability of SpeziLLM end-to-end (including model download and inference streamed in a chat format) via UI tests. The language model used must be rather small in order to keep compute and network requirements low. Even so, such a UI test will be rather time-intensive in the CI, so we need to think about its implications for the runner.
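A minimal sketch of what such an end-to-end UI test could look like, built on XCTest's XCUIApplication APIs; the launch argument, accessibility identifiers, and screen flow below are assumptions and would need to match the actual UI test app:

```swift
import XCTest

/// Sketch of an end-to-end UI test for the local inference layer.
/// The launch argument and accessibility identifiers are hypothetical
/// and need to match the actual UI test app.
final class LLMLocalEndToEndTests: XCTestCase {
    func testLocalModelDownloadAndInference() throws {
        let app = XCUIApplication()
        app.launchArguments = ["--useSmallTestModel"]   // hypothetical flag selecting a small test model
        app.launch()

        // Trigger the model download and wait generously; even a small model takes a while.
        app.buttons["Download Model"].tap()
        XCTAssertTrue(
            app.buttons["Start Chat"].waitForExistence(timeout: 600),
            "Model download did not complete in time."
        )

        // Run a simple inference in the chat UI and assert that a streamed response appears.
        app.buttons["Start Chat"].tap()
        let input = app.textFields["Message Input"]
        input.tap()
        input.typeText("Hello!")
        app.buttons["Send Message"].tap()

        XCTAssertTrue(
            app.staticTexts["Assistant Message"].waitForExistence(timeout: 120),
            "No streamed LLM response appeared in the chat."
        )
    }
}
```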
In addition, smaller unit tests covering the LLM download functionality and simple inferences should be added, enabling quicker and simpler testing feedback cycles.
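As a rough illustration of such a smaller download test, the sketch below only uses plain URLSession with a purely illustrative URL; the actual tests would exercise SpeziLLM's own model download logic against a small test artifact:

```swift
import XCTest

/// Sketch of a small unit test around download functionality.
/// The URL is purely illustrative; real tests would target SpeziLLM's download logic.
final class LLMLocalDownloadTests: XCTestCase {
    func testSmallModelArtifactDownload() async throws {
        // Hypothetical small artifact standing in for a test model file.
        let url = try XCTUnwrap(URL(string: "https://example.org/test-model/config.json"))

        let (fileURL, response) = try await URLSession.shared.download(from: url)
        let httpResponse = try XCTUnwrap(response as? HTTPURLResponse)

        XCTAssertEqual(httpResponse.statusCode, 200)
        XCTAssertTrue(FileManager.default.fileExists(atPath: fileURL.path))
    }
}
```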
Fog inference layer
The fog layer depends on a fog node deployed via Docker somewhere on the local network and discoverable via mDNS. A proper local testing setup therefore requires such a node running in the local development environment, e.g., on a Mac. In the CI, a fog node must be started on the actual GitHub runner, the UI / unit tests must then be run on the SpeziLLM side, and finally all fog node containers and resources must be cleaned up.
However, as our CI runners (Mac minis / Mac Studios) are already virtualized and already run the tests in containers, hosting such a fog node deployment on the CI runner will be challenging due to limitations on nested virtualization. We need to figure out a good workaround for that issue.
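Independent of how the node is deployed, the tests themselves could guard against a missing fog node and skip gracefully. A minimal sketch using Apple's Network framework for Bonjour/mDNS discovery; the advertised service type `_https._tcp` is an assumption and must match the fog node's actual configuration:

```swift
import Network
import XCTest

/// Sketch of a helper that discovers a fog node via mDNS/Bonjour before running fog-layer tests.
/// The service type "_https._tcp" is an assumption and has to match what the fog node advertises.
enum FogNodeDiscovery {
    static func fogNodeAvailable(timeout: TimeInterval = 10) async -> Bool {
        await withCheckedContinuation { continuation in
            let queue = DispatchQueue(label: "fog.node.discovery")
            let browser = NWBrowser(
                for: .bonjour(type: "_https._tcp", domain: "local."),
                using: .tcp
            )
            var finished = false

            browser.browseResultsChangedHandler = { results, _ in
                guard !finished, !results.isEmpty else { return }
                finished = true
                browser.cancel()
                continuation.resume(returning: true)
            }
            browser.start(queue: queue)

            // Give up after the timeout so runs without a fog node can skip instead of hanging.
            queue.asyncAfter(deadline: .now() + timeout) {
                guard !finished else { return }
                finished = true
                browser.cancel()
                continuation.resume(returning: false)
            }
        }
    }
}

final class LLMFogTests: XCTestCase {
    func testFogNodeReachable() async throws {
        guard await FogNodeDiscovery.fogNodeAvailable() else {
            throw XCTSkip("No fog node discoverable on the local network; skipping fog-layer tests.")
        }
        // Actual fog inference tests against the discovered node would follow here.
    }
}
```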
Cloud inference layer (OpenAI)
The cloud layer depends on the OpenAI APIs to stream inference results back to the SpeziLLM client. As we cannot use the real OpenAI APIs in automated testing (cost concerns when running tests in the CI), we need to mock this service. Since we're currently still using the MacPaw OpenAI Swift client, mocking the API is hardly possible in the current setup, short of deploying a separate mock server on the CI runner that implements the OpenAI API and provides mock responses. However, as soon as SpeziLLMOpenAI: Replace MacPaw/OpenAI With Generated API Calls #64 (the OpenAPI-generated OpenAI client) gets merged, we should have more freedom to implement mocking on the OpenAI layer.
Testing the function calling capabilities against mocked OpenAI responses would be a good way to go beyond the current, limited testing of the function calling DSL.
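One possible direction, once the generated client is in place, is to intercept requests at the URLSession level via a custom URLProtocol and return canned OpenAI-style payloads. This is only a sketch: it assumes the generated client accepts an injected URLSessionConfiguration, and the JSON payload merely approximates the OpenAI chat completion schema:

```swift
import Foundation
import XCTest

/// Sketch of a URLProtocol-based mock intercepting OpenAI API requests and returning canned responses.
/// Assumes the (OpenAPI-generated) client can be configured with a custom URLSessionConfiguration.
final class MockOpenAIURLProtocol: URLProtocol {
    /// Canned JSON body returned for every intercepted request; set per test.
    static var stubResponseData = Data()

    override class func canInit(with request: URLRequest) -> Bool {
        // Only intercept requests targeting the OpenAI API.
        request.url?.host == "api.openai.com"
    }

    override class func canonicalRequest(for request: URLRequest) -> URLRequest {
        request
    }

    override func startLoading() {
        let response = HTTPURLResponse(
            url: request.url!,
            statusCode: 200,
            httpVersion: "HTTP/1.1",
            headerFields: ["Content-Type": "application/json"]
        )!
        client?.urlProtocol(self, didReceive: response, cacheStoragePolicy: .notAllowed)
        client?.urlProtocol(self, didLoad: Self.stubResponseData)
        client?.urlProtocolDidFinishLoading(self)
    }

    override func stopLoading() {}
}

final class LLMOpenAIMockedTests: XCTestCase {
    func testMockedChatCompletion() async throws {
        // Register the mock protocol on the session configuration the OpenAI client would use.
        let configuration = URLSessionConfiguration.ephemeral
        configuration.protocolClasses = [MockOpenAIURLProtocol.self]
        let session = URLSession(configuration: configuration)

        // Approximate chat completion payload; real tests would mirror the OpenAI schema,
        // including tool / function call responses to exercise the function calling DSL.
        MockOpenAIURLProtocol.stubResponseData = Data(
            #"{"choices": [{"message": {"role": "assistant", "content": "Hello from the mock!"}}]}"#.utf8
        )

        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
        request.httpMethod = "POST"
        let (data, _) = try await session.data(for: request)
        XCTAssertFalse(data.isEmpty)
    }
}
```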
Alternatives considered
There are few alternatives to proper unit testing, UI testing, and CI checks.
Manual testing and code reviews unfortunately still miss some errors in the code, leading to us shipping broken code, and we simply lack the manpower (no dedicated QA) for such an approach.
Additional context
No response
Code of Conduct
I agree to follow this project's Code of Conduct and Contributing Guidelines