Training datasets are the foundation of AI model accuracy and reliability. Their open availability ensures replicability and transparency in AI development.
The source code includes the algorithms and methodologies used in training AI models. Its openness is crucial for understanding and improving AI systems.
AI models are the output of the training process. They should be openly accessible to ensure that the benefits of AI can be widely distributed.
All training datasets used in the development of open-source AI models should be publicly available, ensuring transparency in the AI development process.
The choice of license for the final model should take into account whether that license imposes restrictions on the datasets used.
Publishing the distribution of training data helps in understanding the model's potential biases and limitations.
Disclosure of the number of epochs provides insight into the training depth and potential overfitting issues.
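As a minimal sketch of what such a disclosure could look like, the following Python snippet summarizes the share of training samples per source and records the number of epochs in a machine-readable form. The function name, the field names, and the source categories ("web", "books", "code") are hypothetical and chosen for illustration only; they do not correspond to any established standard.

```python
import json
from collections import Counter


def build_training_disclosure(sample_sources, num_epochs):
    """Summarize the training data distribution and the number of epochs.

    sample_sources: one label per training sample (hypothetical categories).
    num_epochs: how many passes over the data were made during training.
    """
    counts = Counter(sample_sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot describe an empty training set")
    # Share of the training set contributed by each source, sorted by name.
    distribution = {
        source: count / total for source, count in sorted(counts.items())
    }
    return {
        "num_samples": total,
        "num_epochs": num_epochs,
        "data_distribution": distribution,
    }


# Hypothetical example: four samples drawn from three invented sources.
disclosure = build_training_disclosure(
    ["web", "web", "books", "code"], num_epochs=3
)
print(json.dumps(disclosure, indent=2))
```

Published alongside the model, a record of this kind would let third parties see at a glance which sources dominate the training data and how many passes were made over it.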
Open access to the complete source code and training tools used in developing AI models is essential for fostering innovation and collaborative improvement.
Innovation in LLMs is today driven largely by academia. Researchers need to be able to publish their results and latest innovations safely, without being constrained by how their work will ultimately be used or exploited by third parties. This also allows them to incorporate training datasets safely, without fear of being held liable if a model trained on that data is used unethically.
We therefore propose that models be published under one of two licenses, depending on their intended use: academic or commercial.
A specific license catering to academic and scientific research encourages innovation and knowledge sharing in the academic community.
A separate commercial license ensures that the commercial use of AI models is regulated, promoting responsible and ethical business practices.