Your digital product has to stand out in a crowded marketplace. Elixir’s stability and scalability help you do that. Book a free consult today to learn how we can use it to help you outperform your competitors.
Introduction
Bumblebee offers an assortment of pre-trained text, vision, audio, and diffusion models for you to use in your applications. It’s a great library for Elixir developers to build machine-learning applications without needing to introduce additional services. While Bumblebee supports a decent number of models, it’s not comprehensive. So, what do you do when you run into a model that’s not supported by Bumblebee? One option is to use Ortex. In this post, we’ll walk through what Ortex is and how it can be useful when building out your machine-learning application.
What is ONNX?
Before diving into what Ortex is, we need to spend some time discussing ONNX. ONNX is an open-source model serialization format supported by most major frameworks in the Python ecosystem. It stands for (O)pen (N)eural (N)etwork e(X)change. ONNX allows you to export pre-trained models into a common format and either load them into another library, or target different languages with support for the ONNX Runtime.
ONNX is popular in the Python ecosystem and has the benefit of direct conversion support from almost all of the major machine-learning frameworks. An additional benefit of ONNX is that it is very portable, especially when targeting various accelerators. Due to its interoperability between frameworks, hardware manufacturers such as Groq often choose to build support for their hardware into the ONNX Runtime, because it gives them framework coverage immediately.
ONNX and the ONNX Runtime are also popular for embedded development. Many embedded platforms and embedded accelerators will support ONNX out of the box. If you are doing embedded development and want to add support for an embedded machine learning model, you will likely encounter models serialized as ONNX models or TensorFlow Lite models.
Many major companies already use ONNX and the ONNX Runtime to some extent in production, which can make switching to a new machine-learning platform costly. But what if you could make the switch without needing to migrate away from ONNX? Ortex makes this possible.
Ortex implements ONNX Runtime bindings in Elixir via a Rust NIF. It makes the process of loading and running ONNX models in an Elixir application seamless. You can have an ONNX model running in your application as easily as:
model = Ortex.load("./models/resnet50.onnx")
Ortex.run(model, Nx.broadcast(0.0, {1, 3, 224, 224}))
Additionally, Ortex implements the Nx.Serving behaviour, which means you can turn an ONNX model into a production-ready server in a few minutes.
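For example, here is a minimal sketch of what that could look like using the Ortex.Serving module that ships with Ortex; the ResNet model and input shape are just the placeholders from the snippet above:

model = Ortex.load("./models/resnet50.onnx")
# Ortex.Serving plugs an ONNX model into the Nx.Serving behaviour
serving = Nx.Serving.new(Ortex.Serving, model)
# Stack one or more inputs into a batch and run it through the model
batch = Nx.Batch.stack([{Nx.broadcast(0.0, {3, 224, 224})}])
{result} = Nx.Serving.run(serving, batch)

From there, you could hand the serving off to Nx.Serving.start_link under your application's supervision tree and call Nx.Serving.batched_run from anywhere in your app.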
Using Ortex for Voice Activity Detection
One of my recent experiments in machine learning has been exploring Elixir and Bumblebee for building conversational AI applications. This process involves stitching together three models: speech-to-text, an LLM, and text-to-speech. A part of making the conversation a bit more “realistic” is implementing smart “end-of-turn” detection. In a conversation, (some) humans are excellent at determining when it is their turn to speak. We pick up on small signals and cues from the other participants to determine when to interject. It’s difficult to express what these small cues look like to a computer; however, one way is to detect whether there is voice activity in a given small time interval.
Voice activity detection (VAD) is a popular machine-learning field, especially in the context of conversational agents of the previous decade. Hardware devices such as Amazon’s Alexa and Google’s Assistant use voice activity detection and wake word detection to determine when a speaker is interacting with them. Given these are running on embedded platforms, the resulting models need to be lightweight enough to run very fast with a small amount of resources. One example of such a model is Silero VAD.
Silero is an enterprise-grade voice activity detection model. The library itself is written in Python; however, they provide an open-source, lightweight ONNX model capable of processing an audio chunk in under 1 ms on a single CPU thread. Thanks to Ortex, we can have this VAD model up and running in a few minutes. First, download the ONNX model from the Silero VAD repo. Next, install the following dependencies:
Mix.install([
  {:nx, "~> 0.7"},
  {:ortex, "~> 0.1"},
  {:kino_live_audio, "~> 0.1"},
  {:kino_vega_lite, "~> 0.1.10"},
  {:bumblebee, "~> 0.5"},
  {:exla, ">= 0.0.0"}
])
This installs Nx, Ortex, and some Kino dependencies for working in a Livebook. Next, load the Silero ONNX model into your application:
model = Ortex.load("./silero_vad.onnx")
You’ll notice this model has four inputs and three outputs. The inputs respectively represent:
- input - the input audio sample in PCM format
- sr - the sampling rate of the audio, either 8000 or 16000
- h - the LSTM hidden state
- c - the LSTM cell state

Silero is a tiny LSTM model, which is why you need h and c as inputs. LSTMs are stateful, so you'll notice that the outputs include updated versions of both h and c. The input itself is 2-dimensional: the first dimension is the batch size, followed by the number of samples. The number of samples cannot be less than the number of samples in 30 ms. In other words, for a sampling rate of 16_000, the number of samples cannot be fewer than 480.
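To make that concrete, here is a minimal single-call sketch, assuming 16 kHz audio and a single 30 ms chunk of 480 samples; the zero tensors stand in for real audio and for the initial LSTM state:

# One batch of 480 samples (30 ms at 16 kHz); all zeros here for illustration
input = Nx.broadcast(0.0, {1, 480})
# The sampling rate as a 64-bit integer tensor
sr = Nx.tensor(16_000, type: :s64)
# Initial LSTM hidden and cell states
h = Nx.broadcast(0.0, {2, 1, 64})
c = Nx.broadcast(0.0, {2, 1, 64})

{output, hn, cn} = Ortex.run(model, {input, sr, h, c})
# output holds the voice probability; hn and cn are fed into the next call
prob = output |> Nx.squeeze() |> Nx.to_number()

We'll use exactly this call, with real audio and threaded state, in the listener below.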
Next, create a new VegaLite plot using:
chart =
  VegaLite.new(title: "Voice Activity Detection", width: 800, height: 400)
  |> VegaLite.mark(:line)
  |> VegaLite.encode_field(:x, "x",
    type: :quantitative,
    title: "Time",
    axis: [ticks: false, domain: false, grid: false, labels: false]
  )
  |> VegaLite.encode_field(:y, "y",
    type: :quantitative,
    title: "Voice",
    scale: [domain_max: 1, domain_min: 0]
  )
  |> Kino.VegaLite.new()
This will create a live chart that plots the probability that a voice is detected at a given time interval. Now, declare a new live audio field using:
live_audio = KinoLiveAudio.new(chunk_size: 30, unit: :ms, sample_rate: 16_000)
Finally, you can stream the input live audio and update your Kino graph using:
init_state = %{h: Nx.broadcast(0.0, {2, 1, 64}), c: Nx.broadcast(0.0, {2, 1, 64})}

live_audio
|> Kino.Control.stream()
|> Kino.listen(init_state, fn
  %{event: :audio_chunk, chunk: data}, %{h: h, c: c} ->
    input = Nx.tensor([data])
    sr = Nx.tensor(16_000, type: :s64)
    {output, hn, cn} = Ortex.run(model, {input, sr, h, c})
    prob = output |> Nx.squeeze() |> Nx.to_number()
    row = %{x: :os.system_time(), y: prob}
    Kino.VegaLite.push(chart, row, window: 1000)
    {:cont, %{h: hn, c: cn}}
end)
This will listen to your live audio stream and update the Kino graph with the probability that a voice is present in a given time window. If you start recording, you'll notice your graph rise and fall as you speak! The Kino stream uses a stateful listener to continuously update c and h. Inside the listener, you use Ortex to run your model and Nx to process its outputs, then push the result to the chart. Notice how easy it was to get up and running with Ortex? With just a few lines of code, we were able to import an enterprise-grade model into Elixir and use it without having to jump through any hoops!
Combining Ortex and Bumblebee
Ortex is a great supplement to Bumblebee, as you can take advantage of lightweight traditional models in conjunction with powerful pre-trained transformer models. This is exactly what we do in Echo. Echo is a small server for building conversational assistants. It uses this exact Silero VAD model alongside Whisper, first to determine when somebody has finished speaking and then to transcribe what they said.
Once again we’ll declare a live audio component:
live_audio = KinoLiveAudio.new(chunk_size: 100, unit: :ms, sample_rate: 16_000)
Next, create a speech-to-text serving with Bumblebee:
repo = {:hf, "distil-whisper/distil-small.en"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)
serving =
  Bumblebee.Audio.speech_to_text_whisper(model_info, featurizer, tokenizer, generation_config,
    task: nil,
    compile: [batch_size: 1],
    defn_options: [compiler: EXLA]
  )
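Before wiring the serving into the listener, you can give it a quick smoke test. Bumblebee's Whisper serving accepts either a tensor of audio samples or a {:file, path} input, so something along these lines should work; the file name below is just a placeholder, and file inputs require ffmpeg to be installed:

# Transcribe a local audio file (placeholder path)
Nx.Serving.run(serving, {:file, "sample.wav"})

# Or pass raw samples directly, e.g. one second of silence at 16 kHz
Nx.Serving.run(serving, Nx.broadcast(0.0, {16_000}))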
Next, you can adjust your listener to accumulate audio as it streams in if somebody is speaking:
init_state = %{
  h: Nx.broadcast(0.0, {2, 1, 64}),
  c: Nx.broadcast(0.0, {2, 1, 64}),
  mode: :waiting,
  audio: []
}
live_audio
|> Kino.Control.stream()
|> Kino.listen(init_state, fn
  %{event: :audio_chunk, chunk: data}, %{h: h, c: c, mode: mode, audio: audio} ->
    input = Nx.tensor([data])
    sr = Nx.tensor(16_000, type: :s64)
    {output, hn, cn} = Ortex.run(model, {input, sr, h, c})
    prob = output |> Nx.squeeze() |> Nx.to_number()
    # Log the current voice probability for debugging
    IO.inspect(prob)

    cond do
      prob >= 0.5 and mode == :waiting ->
        {:cont, %{h: hn, c: cn, mode: :listening, audio: data}}

      prob >= 0.5 and mode == :listening ->
        {:cont, %{h: hn, c: cn, mode: :listening, audio: audio ++ data}}

      prob < 0.5 and mode == :listening ->
        transcription = Nx.Serving.run(serving, Nx.tensor([audio]))
        IO.inspect(transcription)
        :halt

      prob < 0.5 and mode == :waiting ->
        {:cont, %{h: hn, c: cn, mode: :waiting, audio: []}}
    end
end)
This will repeatedly run VAD and accumulate audio once you start speaking. When you stop speaking, it will stop recording and run transcription on the input. This is a simple way to implement a more realistic conversational agent—thanks in part to the magic of Ortex.
Conclusion
Ortex is an important piece of the Elixir ML ecosystem. It enables users to easily migrate their existing ONNX models to an Elixir application without needing to jump through any conversion hoops. If you are working with ONNX models and interested in Elixir, I highly recommend you give Ortex a shot! Until next time!