
What qualifies as open-source AI? Open Source Initiative clarifies

After sustained ambiguity around open-source artificial intelligence (AI), the Open Source Initiative (OSI) has introduced a definition for it. For companies to call their AI ‘open source’, they must share detailed information about the data they used to train the AI system, in enough detail that a developer could build their own model that is substantially similar to the original. Specifically, companies must include the following (a hypothetical sketch of such a disclosure follows the list):

  • a complete description of all the data used for training, including unshareable data, disclosing its provenance, scope, and characteristics. Companies should also detail how they obtained and selected the data, their labeling procedures, and their data processing and filtering methodologies.
  • a list of all the publicly available training data that the company relied on and how to obtain it.
  • a list of all the training data obtainable from third-party sources and where to procure it, including data available for a fee.
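To make the data requirement concrete, here is a minimal, hypothetical sketch of what such a disclosure could look like, written as a plain Python dictionary. The field names, dataset names, and values are illustrative assumptions; OSI’s definition requires the information itself, not any particular file format.

```python
# Hypothetical training-data disclosure for an open-source AI release.
# Field names and entries are illustrative only and not mandated by OSI.
training_data_disclosure = {
    "description": "Web text and public-domain books used for pre-training",
    "publicly_available": [
        {"name": "Project Gutenberg", "how_to_obtain": "https://www.gutenberg.org/"},
    ],
    "third_party": [
        {"name": "Licensed news archive", "how_to_obtain": "vendor portal", "fee": True},
    ],
    "unshareable": [
        {"scope": "customer support logs, 2019-2023",
         "characteristics": "English text, de-identified"},
    ],
    "selection_and_labeling": "Heuristic quality filters; human-labeled safety subset",
    "processing_and_filtering": "De-duplication, language identification, PII removal",
}

if __name__ == "__main__":
    import json
    print(json.dumps(training_data_disclosure, indent=2))
```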

Why it matters:

The concept of open source comes from software development and now extends to AI: it refers to publicly accessible software whose source code is available for anyone to inspect, modify, and share. The Open Source Initiative’s definition of open-source software is recognized internationally, including by several governments, which makes it a particularly relevant voice in the open-source ecosystem.

In the AI space, there is no agreed-upon definition of open-source AI. Lea Gimpel of the Digital Public Goods Alliance pointed this out during Carnegie India’s Global Technology Summit last year. “There are currently several work streams and ways that are trying to define open source AI with the community to better understand what would we actually need to open source in order to maintain the benefits that we see in open source software,” she noted.

OSI explains that open-source models must grant people the freedom to use the AI system for any purpose without seeking permission. Companies creating such models also need to allow people to study the AI and inspect how it works, modify it for any purpose, and share it with others to use (with or without modifications). Its definition could lend clarity on the full criteria an AI model release must meet to qualify as open source.

Other key details that open-source AI must include:

Besides details of the training data, companies must also release the following about their models to classify them as open-source AI:

  • The complete source code used to train and run the system. This includes the specifications of how the company processed and filtered data, how the model was trained, validated, and tested, and the specifications of the model architecture.
  • The model parameters, such as weights and other configuration settings. For context, model weights are numerical parameters within an AI model that influence its output in response to inputs (a toy illustration follows this list).
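As a toy illustration of what “weights” means here, the sketch below hard-codes three numerical parameters in plain Python and shows how they determine the model’s output for a given input. Real models have billions of such parameters, and this example is not tied to any particular release.

```python
# Toy model: the weights and bias are the numerical parameters an
# open-source release would publish alongside the training code.
weights = [0.8, -0.5, 0.3]
bias = 0.1

def predict(inputs):
    # Weighted sum of the inputs plus the bias; changing the published
    # weights changes the output the model produces for the same input.
    return sum(w * x for w, x in zip(weights, inputs)) + bias

print(predict([1.0, 2.0, 3.0]))  # roughly 0.8 (subject to floating-point rounding)
```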

Companies can ask any person using their open-source model to release products/services built on top of the model under the same terms and conditions as the original model. OSI acknowledges that currently there is no clear way to “legally” make AI model parameters (weights) freely available, and companies could make these parameters freely available without a license or could require some kind of legal document to do so. “We expect this will become clearer over time, once the legal system has had more opportunity to address Open Source AI systems,” OSI adds.

How different companies classify open source:

Meta’s Llama models:

Many companies classify their AI systems as open source; for example, Meta labels its large language models Llama 2 and Llama 3 this way. The company makes the model weights publicly available but does not specify the training datasets it uses to train its models.

However, both these models have restrictions baked into their open-source licenses. One such restriction concerns the scale of commercial organisations that can access Llama under an open-source license: if a company has more than 700 million monthly active users, it must request a license from Meta before using Llama 2 or Llama 3. Gimpel explained during the aforementioned summit that Llama would not strictly fall within the scope of open source going by the software-based definition, because of these baked-in restrictions. Even under OSI’s newly created definition, the restrictions leave Meta’s models short of qualifying as “open source.”


Apple’s OpenELM:

Other companies, like Apple, have also come out with their own open-source AI systems, namely the OpenELM family of AI models. Unlike Meta’s models, these allow people to use, reproduce, modify, and redistribute them with or without making changes. Apple also provides the complete framework for “training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations,” the company’s research paper on OpenELM says. The release comes with a complete list of datasets that Apple used for model training, including Wikipedia, Wikibooks, Reddit, GitHub, the open-access archive for scholarly articles arXiv.org, and Project Gutenberg.

Support for OSI’s definition:

So far, organisations like Mozilla Foundation, Common Crawl, Bloomberg Engineering, and Open Infra Foundation have endorsed OSI’s definition of open source. Names like Meta, whose open-source models would not fit OSI’s definition, are notably missing from the list of endorsements.
