Draft of enhancements to the Open Source AI Definition

1. Distinct AI Artifacts: Datasets, Source Code, and Models

a. Training Datasets:

Training datasets are the foundation of AI model accuracy and reliability. Their open availability ensures replicability and transparency in AI development.

b. Source Code:

The source code includes the algorithms and methodologies used in training AI models. Its openness is crucial for understanding and improving AI systems.

c. Models:

AI models are the output of the training process. They should be openly accessible to ensure that the benefits of AI can be widely distributed.

2. Transparency in Training Datasets

a. Publication Requirements:

All training datasets used in the development of open-source AI models should be publicly available, ensuring transparency in the AI development process.

b. License Contamination Consideration:

The choice of license for the final model should take into account the licenses of the datasets used to train it, since restrictions attached to a dataset may carry over to the model.
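As an illustration only, the license consideration could be supported by a simple check over the declared terms of each training dataset. The sketch below is hypothetical: the `DatasetLicense` record, its two flags, and the `contamination_warnings` helper are assumptions made for this example and are not part of the proposal. Whether a given dataset license actually binds a trained model is a legal question that the code does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetLicense:
    """Hypothetical record of the terms declared for one training dataset."""
    dataset: str
    license_id: str
    allows_commercial_use: bool
    requires_share_alike: bool


def contamination_warnings(datasets, model_is_commercial):
    """Return human-readable warnings about dataset terms that may
    constrain the license chosen for the final model."""
    warnings = []
    for d in datasets:
        # A non-commercial dataset conflicts with a commercial model license.
        if model_is_commercial and not d.allows_commercial_use:
            warnings.append(
                f"{d.dataset} ({d.license_id}): commercial use not permitted"
            )
        # Share-alike terms may need to be reflected in the model license.
        if d.requires_share_alike:
            warnings.append(
                f"{d.dataset} ({d.license_id}): share-alike terms may apply to the model"
            )
    return warnings


if __name__ == "__main__":
    # Illustrative dataset names; the flags are supplied by the publisher.
    datasets = [
        DatasetLicense("corpus-a", "CC-BY-4.0", True, False),
        DatasetLicense("corpus-b", "CC-BY-NC-4.0", False, False),
    ]
    for line in contamination_warnings(datasets, model_is_commercial=True):
        print(line)
```

The flags are supplied by whoever publishes the model, so the check is only as reliable as the declared terms.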

3. Publication of Training Data Distribution and Epochs

a. Training Data Distribution:

Publishing the distribution of training data helps in understanding the model's potential biases and limitations.

b. Number of Epochs:

Disclosing the number of training epochs shows how many passes were made over the training data and helps assess the risk of overfitting.
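The two disclosures in this section could be published together in a small machine-readable report. The following is a minimal sketch under stated assumptions: the field names (`epochs`, `num_examples`, `data_distribution`) and the idea of tagging each training example with a source label are hypothetical choices for illustration, not a format defined by this proposal.

```python
import json
from collections import Counter


def data_distribution(labels):
    """Return the share of training examples per label, as fractions of 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in sorted(counts.items())}


def training_report(source_labels, epochs):
    """Build a JSON report with the epoch count and the data distribution.

    `source_labels` holds one (hypothetical) source tag per training example.
    """
    report = {
        "epochs": epochs,
        "num_examples": len(source_labels),
        "data_distribution": data_distribution(source_labels),
    }
    return json.dumps(report, indent=2)


if __name__ == "__main__":
    # Illustrative labels only; a real report would be computed from the corpus.
    labels = ["web", "web", "code", "books"]
    print(training_report(labels, epochs=3))
```

Publishing fractions rather than raw counts keeps reports comparable across datasets of different sizes, while `num_examples` preserves the absolute scale.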

4. Full Publication of Model Source Code and Training Tools

Open access to the complete source code and training tools used in developing AI models is essential for fostering innovation and collaborative improvement.

5. Distinct Model Licensing for Academic/Scientific and Commercial Use

Innovation in large language models (LLMs) is today largely driven by academia. Researchers need to be able to publish their results and latest innovations with legal certainty, without being constrained by how third parties later use or exploit their work. A dedicated license also lets them incorporate training datasets with the same certainty, without being held liable if a model trained on that data is used unethically.

We therefore propose that models may be published under one of two licenses, depending on their intended use: academic or commercial.

a. Academic/Scientific License:

A specific license catering to academic and scientific research encourages innovation and knowledge sharing in the academic community.

b. Commercial License:

A separate commercial license ensures that the commercial use of AI models is regulated, promoting responsible and ethical business practices.