Large Language Models Already Existed Before ChatGPT
With the introduction of ChatGPT, the world has been turned upside down – or so it seems. In fact, however, large language models existed even before the release of OpenAI's ChatGPT. Google and the University of Toronto laid the foundation for the transformation from "simple" natural language processing (NLP) to generative AI back in 2017 with their paper "Attention Is All You Need". It presents the transformer architecture, which uses an attention mechanism to capture relationships between words and to model dependencies across longer text sequences. This makes it possible to generate longer, coherent texts and even to hold entire dialogues between humans and AI systems.
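The core of this attention mechanism fits into a few lines of code. The following NumPy sketch shows a simplified, single-head variant of scaled dot-product attention; the dimensions and variable names are chosen freely for illustration and do not follow the paper's reference implementation.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Compare every query with every key; scaling by sqrt(d_k) keeps the scores numerically stable
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # A softmax turns the scores into attention weights per query position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mixture of the value vectors
    return weights @ V

tokens = np.random.randn(4, 8)  # toy "sentence": 4 tokens with 8-dimensional embeddings
print(scaled_dot_product_attention(tokens, tokens, tokens).shape)  # (4, 8)

In self-attention, as in this toy example, queries, keys and values all come from the same token sequence, which is how the model relates every word to every other word in the text.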
In 2018, OpenAI introduced its first generative pre-trained transformer (GPT), built on this transformer architecture. Since then, language models have been expanded continuously, both in terms of their training data and the size of the models themselves – from initially 117 million parameters (GPT-1) to the rumoured 1.7 trillion parameters of GPT-4.
Rights of Use
After the release of ChatGPT in November 2022, the community gained momentum and numerous new models were introduced. A critical aspect when evaluating these models is their transparency. Although some models are labelled open source, the basis on which they were developed renders them unsuitable for commercial use. For example, some models use dialogues generated with ChatGPT as training data. This makes their use legally difficult, because OpenAI's terms of use do not allow the API to be used to develop competing products. Other models build on Meta's leaked LLaMA, whose use is only permitted for research purposes. To use these LLaMA offshoots, their delta weights must be applied to the original LLaMA weights. The delta weights are published under the Apache 2.0 license, but the underlying Meta model is covered by a non-commercial license. This highlights the fact that there are often not one but two licenses for a model: one for the inference code (usually somewhat more permissive) and another for the model weights, which is often more restrictive (as in the case of LLaMA) and excludes commercial use. With this strategy, a company can claim the label "open source" (the code being licensed as such) while at the same time preventing commercial use of its proprietary know-how (because the code is of no use without the model weights).
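Technically, applying delta weights amounts to adding two checkpoints tensor by tensor. The following PyTorch sketch only illustrates the idea; the file names are hypothetical placeholders, and real projects ship their own conversion scripts that also handle tokenizers and weight shards.

import torch

# Hypothetical file names: the original Meta checkpoint and the published delta
base = torch.load("llama-7b-base.pt")    # state_dict of the original LLaMA weights
delta = torch.load("offshoot-delta.pt")  # Apache-2.0-licensed delta weights

# Adding the delta tensor by tensor reconstructs the fine-tuned offshoot
merged = {name: base[name] + delta[name] for name in base}
torch.save(merged, "offshoot-merged.pt")

The consequence for licensing is that the merged checkpoint still contains the original Meta weights, so the non-commercial terms continue to apply to it.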
Recently, however, the licenses of important models (and their weights) have been made more permissive. One example is Falcon, which initially allowed commercial use only against royalties but was soon placed under an Apache 2.0 license. The license of LLaMA-2, unlike that of LLaMA, also allows commercial, royalty-free use – but only to a limited extent: if the number of monthly active users exceeds the threshold of 700 million, a separate usage permit from Meta is required.
Commercially available and usable models as alternatives to ChatGPT are desirable for the development of new LLM applications, as shown by the emergence of projects that pursue this open approach and therefore rely on their own architectures and on cleansed or crowd-sourced datasets. A prominent example is Dolly 2.0, for which Databricks had its own employees generate over 15,000 training records in a large crowd-sourcing initiative after the first model, Dolly 1.0, had been trained on data generated with ChatGPT and was therefore barred from commercial use. Another example is the RedPajama project by together.ai, which reproduced the original LLaMA training datasets and subsequently trained its own models on them, releasing them under a permissive license.
Data Security
Another critical aspect concerns the use, and thus the security, of the data with which LLMs interact. Where is enterprise-sensitive data stored when it is used to fine-tune models or enrich prompts, and how might it be used further? The same question arises for (possibly even personal) data exchanged in chats. The use of OpenAI's models has been strongly questioned and criticized in this regard. OpenAI has since adjusted its policy so that users themselves control, via an opt-in, whether their data is shared with OpenAI for the further development of its AI models. However, OpenAI reserves the right to retain the data for 30 days to monitor for possible misuse. Microsoft has the same arrangement for its Azure OpenAI service, which provides access to OpenAI models via REST APIs.
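For context, access to the Azure OpenAI service works through plain REST calls against a customer-specific endpoint. A minimal sketch could look as follows; the resource name, deployment name and API version are placeholders that depend on the individual Azure setup.

import os
import requests

# Placeholder values: every customer has their own resource and deployment names
endpoint = "https://my-resource.openai.azure.com"
deployment = "my-gpt-deployment"
api_version = "2023-05-15"

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"
headers = {"api-key": os.environ["AZURE_OPENAI_KEY"], "Content-Type": "application/json"}
payload = {"messages": [{"role": "user", "content": "Summarize our data retention policy."}]}

response = requests.post(url, headers=headers, json=payload, timeout=30)
print(response.json()["choices"][0]["message"]["content"])

Every prompt sent this way leaves the company's own infrastructure, which is exactly the point at which the retention and monitoring rules described above come into play.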
In this debate, the German startup Aleph Alpha is positioning itself clearly around data security – it relies on its own data centre in Germany and attaches great importance to compliance with the General Data Protection Regulation (GDPR). In addition, the startup uses explainable-AI methods to trace AI output and even to suppress output that lacks a trusted source. MosaicML's platform for model training and deployment takes the same stance and promises not to use or store customer data.
These topics are relevant whenever a company decides not to host the LLM itself but to send its data to an external API for fine-tuning and user prompts. This is precisely why open-source models are attractive for enterprises: they can be deployed on the company's own infrastructure, guaranteeing control over its data.
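A minimal sketch of such self-hosting with the Hugging Face transformers library might look as follows; the choice of the permissively licensed Falcon instruct model and the generation parameters are merely illustrative.

from transformers import pipeline

# The model weights are downloaded once and inference then runs entirely on in-house
# hardware, so prompts and outputs never leave the company's own infrastructure.
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",  # Apache-2.0-licensed open model
    device_map="auto",                  # use a local GPU if one is available
)

prompt = "Summarize the key points of our internal security guideline:"
result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])

The trade-off is that the company must provide and operate the necessary GPU infrastructure itself, which is the price of full control over its data.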