Introduction
In the last few months, large language models (LLMs) have taken the tech world by storm. The performance of OpenAI's chat-based models, ChatGPT (GPT-3.5 Turbo) and GPT-4, is a significant step up from the last generation of models. There are several reasons for this jump in performance; the most important are the careful curation of data for instruction-following tasks and a large amount of training based on human feedback.
Large language models like ChatGPT are first pre-trained on a large amount of text, then instruction-tuned on task-specific data, and finally, in some cases, trained with reinforcement learning from human feedback (RLHF). These innovations produce high-quality models capable of following instructions to complete a variety of tasks.
While it seems at this moment that OpenAI’s proprietary models are a step above the rest, the open-source landscape is quickly closing the gap. And, thanks to Nx and Bumblebee, you can take advantage of many of the open-source ChatGPT competitors right now. In this post, we’ll talk about some of the strongest open-source competitors to ChatGPT, and how you can use them in Elixir.
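All of the examples in this post assume you have Bumblebee, Nx, and EXLA available in a Mix project or Livebook. A minimal dependency sketch (the version requirements here are assumptions; check Hex for the latest releases):
# mix.exs: the version requirements below are assumptions;
# check hex.pm for the current releases
defp deps do
  [
    {:bumblebee, "~> 0.3"},
    {:nx, "~> 0.5"},
    {:exla, "~> 0.5"}
  ]
end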
Why Should I Use an Open-Source Model?
A common question that pops up when designing LLM-powered applications is: why use an open-source model over a proprietary one? Given the performance of OpenAI's offerings compared to open-source alternatives, it can be difficult to justify investing the time and effort into your own LLM deployment. Machine learning deployments are difficult, and LLM deployments take that difficulty to an extreme. That doesn't mean, however, that you should blindly throw GPT-4 at every problem you have.
There are several reasons you may want to consider using open-source:
Data Privacy
One concern when using OpenAI’s (and other providers’) API is data privacy. Depending on your business use case, it may be unacceptable to send data to an external provider. Using an open-source model gives you control of the entire stack. You can work with proprietary and sensitive data without privacy concerns.
Latency
Depending on your specific use case, the latency of OpenAI's API might be unacceptable. While GPT-3.5 Turbo offers a good tradeoff between quality and latency, it may still not be fast enough to meet your needs. Fine-tuning a smaller model on task-specific data and avoiding the additional network call may prove a better option.
Task-Specific Performance
GPT-3.5 and GPT-4 have great performance on zero-shot tasks, which means you can go very far with careful prompting alone. With context injection via retrieval, they can effectively solve a wide range of tasks. That said, fine-tuned models remain at the pinnacle of task-specific performance. If you have a specialized use case, and enough data and time to fine-tune a specialized model, you can match or exceed the performance of proprietary models.
Cost
A final consideration when deciding between open-source and proprietary models is cost. GPT-4 is a powerful model; however, that power comes at a price. GPT-3.5 Turbo is much cheaper, but its performance might be unacceptable for your specific task. Depending on your usage, a self-hosted open-source model may be cheaper still, though you take on the infrastructure costs yourself.
Flan-T5
Flan-T5 is a set of model checkpoints released alongside Google's paper Scaling Instruction-Finetuned Language Models. Flan-T5 is a variant of the T5 architecture that has been instruction-finetuned on a large mixture of tasks. This finetuning process yields a model with competitive to state-of-the-art performance on a number of benchmarks.
Flan-T5 is one of multiple models you can use in Bumblebee. The most competitive checkpoint is flan-t5-xxl; however, a finetuned flan-t5-xl can be competitive as well. flan-t5-xxl will require a large GPU, as the checkpoint parameters alone are around 45GB. To use flan-t5 for text generation, you can use Bumblebee to load both the tokenizer and model:
# Place newly created tensors, including loaded parameters, on the EXLA backend
Nx.default_backend(EXLA.Backend)

{:ok, model} = Bumblebee.load_model({:hf, "google/flan-t5-xl"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google/flan-t5-xl"})
Then you can wrap the model in a generation serving:
serving = Bumblebee.Text.generation(model, tokenizer, defn_options: [compiler: EXLA])
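Note that newer Bumblebee releases thread an explicit generation config through this call, which is also where options like the number of generated tokens live. A sketch assuming Bumblebee 0.4 or later:
# Load the checkpoint's generation defaults, then raise the output length
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "google/flan-t5-xl"})
generation_config = Bumblebee.configure(generation_config, max_new_tokens: 100)

serving =
  Bumblebee.Text.generation(model, tokenizer, generation_config,
    defn_options: [compiler: EXLA]
  )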
And create generations:
Nx.Serving.run(serving, "Elixir is a")
And you will see:
mystical or magical item used to enhance the powers of a person or animal.
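In a real application, you would typically start the serving under your supervision tree so that Nx can batch concurrent requests across processes. A minimal sketch (the serving name and batch settings are hypothetical):
# In your application's start/2 callback; the name and
# batch settings are illustrative, not prescriptive
children = [
  {Nx.Serving,
   serving: serving,
   name: FlanT5.Serving,
   batch_size: 4,
   batch_timeout: 100}
]
Then, anywhere in your application, requests are batched across callers:
Nx.Serving.batched_run(FlanT5.Serving, "Elixir is a")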
Llama and Friends
Llama is a recent, popular open-source alternative to ChatGPT. Llama is a large language model from Meta. It is not instruction-tuned or trained on human feedback; however, there are a number of variants, such as Stanford's Alpaca, that have been finetuned on instruction-following datasets. These finetuned variants achieve performance competitive with ChatGPT and have taken off in popularity as a result.
One issue with Llama is its restrictive license. Llama and its weights were initially released to academics and other researchers under a license that restricted commercial use. After a leak, Llama and its variants have more or less popped up everywhere; however, their use is still restricted to non-commercial purposes.
You can use Llama today in the same way you’d use any other Bumblebee model:
Nx.default_backend(EXLA.Backend)
{:ok, model} = Bumblebee.load_model({:hf, "decapoda-research/llama-7b-hf"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "decapoda-research/llama-7b-hf"})
Then you can wrap the model in a generation serving:
serving = Bumblebee.Text.generation(model, tokenizer, defn_options: [compiler: EXLA])
And create generations:
Nx.Serving.run(serving, "Elixir is a")
And you will see:
mystical or magical item used to enhance the powers of a person or animal.
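Bumblebee's text servings also accept a list of prompts, which Nx batches into a single run. Assuming your Bumblebee version supports batched inputs, you can do:
# Generate completions for several prompts in one batched call
Nx.Serving.run(serving, ["Elixir is a", "Phoenix is a"])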
OpenAssistant
The OpenAssistant project is an attempt to replicate ChatGPT and other chat models through a coordinated open-source data collection and model training effort. Users can navigate to the OpenAssistant website and participate in the process of labeling data for training. The OpenAssistant project has been continuously releasing models with open licenses. Their most recent model is a finetuned variant of Pythia, an EleutherAI model family based on the GPT-NeoX architecture, trained on data collected for the project.
Again, you can use the OpenAssistant Pythia model today using Bumblebee:
Nx.default_backend(EXLA.Backend)
{:ok, model} = Bumblebee.load_model({:hf, "OpenAssistant/oasst-sft-1-pythia-12b"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "OpenAssistant/oasst-sft-1-pythia-12b"})
Then you can wrap the model in a generation serving:
serving = Bumblebee.Text.generation(model, tokenizer, defn_options: [compiler: EXLA])
One of the interesting things about OpenAssistant models is that they use special tokens to mark the prompter and assistant portions of a conversation, following the chat-centric paradigm popularized by ChatGPT:
Nx.Serving.run(serving, "<|prompter|>Elixir is a<|endoftext|><|assistant|>")
Notice you need to include the <|prompter|> and <|assistant|> tokens in the prompt. After running this, you will see:
A programming language that is high-level, functional, and declarative.
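If you are building a chat-style interface on top of this model, it is worth wrapping the token bookkeeping in a small helper. A sketch (the module and function names here are hypothetical):
defmodule OpenAssistantChat do
  # Wraps a user message in the special tokens the OpenAssistant
  # Pythia checkpoints expect; this module is illustrative only
  def format_prompt(message) do
    "<|prompter|>#{message}<|endoftext|><|assistant|>"
  end
end

Nx.Serving.run(serving, OpenAssistantChat.format_prompt("Elixir is a"))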
Conclusion
The landscape of open-source large language models is growing rapidly. As open-source models become more competitive with OpenAI's, it will make more and more sense to migrate away from proprietary models and closed APIs. The beauty of the Elixir Nx ecosystem is that you can migrate to these open-source alternatives seamlessly. You can use LLMs today, directly within your Elixir applications. The next generation of apps is LLM-powered, and I believe Elixir is the language of the LLM-powered future.
Ready to find out how DockYard can put the latest Elixir innovations to work for you? Contact us today.