The idea of “self-healing” software applications has been around for about as long as software has been around. The imagined future is that we will have software systems capable of repairing and redeploying themselves in the event of errors, without intervention from programmers. Self-healing systems work in essentially three parts, per Snapstack Solutions:
- Detection - A self-healing system is capable of detecting when something has gone wrong. In the simplest case, detection can just look for exceptions or errors raised during application execution.
- Isolation - A self-healing system is capable of isolating the error to a particular component or part of the application.
- Recovery - This is the most important part of a self-healing system: A self-healing system can take the isolated issue and autonomously apply a fix for that particular issue.
Self-healing software does exist in a narrow sense today. You are actually likely already familiar with one narrow method for self-healing: autoscaling. Autoscaling is a method for dynamically adjusting the amount of computational resources allocated to an application based on the load on the application. In the event of increased load on an application, autoscaling can be seen as a method for self-healing, as the autoscalers detect degraded application performance, attribute the degraded performance to increased load, and “fix” the issue by horizontally or vertically scaling.
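To make the mapping concrete, here is a minimal sketch of autoscaling expressed as the three stages. The `AutoscaleSketch` module, the metrics map, the 80% CPU cutoff, and the proportional scaling rule are all assumptions made for illustration, not any particular autoscaler's implementation.

```elixir
defmodule AutoscaleSketch do
  # Detection: performance counts as degraded when latency exceeds the target.
  def degraded?(%{latency_ms: latency}, target_ms), do: latency > target_ms

  # Isolation: attribute the degradation to load when CPU use is high.
  # The 80% cutoff is an arbitrary value chosen for this sketch.
  def cause(%{cpu_pct: cpu}) when cpu > 80, do: :load
  def cause(_metrics), do: :unknown

  # Recovery: scale out in proportion to how far CPU is over its target,
  # rounding up with integer arithmetic. Otherwise leave the count alone.
  def heal(replicas, metrics, opts) do
    target_ms = Keyword.fetch!(opts, :target_ms)
    target_cpu = Keyword.fetch!(opts, :target_cpu_pct)

    if degraded?(metrics, target_ms) and cause(metrics) == :load do
      div(replicas * metrics.cpu_pct + target_cpu - 1, target_cpu)
    else
      replicas
    end
  end
end
```

Calling `AutoscaleSketch.heal(4, %{latency_ms: 900, cpu_pct: 90}, target_ms: 300, target_cpu_pct: 60)` asks for more replicas, while the same call with a healthy latency leaves the count unchanged.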
Broad self-healing software is likely some time away. However, that doesn’t mean we can’t have fun with the concept. Generative AI has certainly brought us significantly closer to software that repairs itself. In this post, we’ll combine the power of the BEAM with the power of large language models to build a self-healing system.
A Simple LiveView
Let’s start by creating an extremely simple LiveView application that lets us change the color of a dot on the screen (riveting!). We’ll use this LiveView to simulate the most common type of software error: programmer error. First, implement mount/3 and render/1 functions like this:
defmodule SelfHealingWeb.ColorsLive do
  use SelfHealingWeb, :live_view

  @impl true
  def mount(_params, _session, socket) do
    {:ok, assign(socket, color: "blue")}
  end

  @impl true
  def render(assigns) do
    ~H"""
    <div class="text-center">
      <h1 class="text-xl">Colors!</h1>
      <h2 class="text-lg">Change the color of the dot by picking a new color.</h2>
      <div class="mt-4 flex justify-center space-x-4">
        <.color_button color="blue" class="bg-blue-400 hover:bg-blue-700" />
        <.color_button color="red" class="bg-red-400 hover:bg-red-700" />
        <.color_button color="green" class="bg-green-400 hover:bg-green-700" />
        <.color_button color="indigo" class="bg-indigo-400 hover:bg-indigo-700" />
      </div>
      <div class="mt-8 flex items-center justify-center">
        <div class={[class_for_color(@color), "h-64 w-64 rounded-full"]}></div>
      </div>
    </div>
    """
  end
end
This will assign some state that tracks the current color of the dot in the socket. render/1 will render the dot on the screen with a color that matches. Next, add the color_button and class_for_color helper functions to the module:
defp color_button(assigns) do
  ~H"""
  <button phx-click="change_color" phx-value-color={@color} class={[@class, "text-white font-bold py-2 px-4 rounded"]}>
    <%= String.capitalize(@color) %>
  </button>
  """
end

defp class_for_color(color) do
  case color do
    "blue" -> "bg-blue-400"
    "red" -> "bg-red-400"
    "green" -> "bg-green-400"
  end
end
This code has an intentional bug. class_for_color/1 is missing a clause for “indigo”, which means that when we change the value of color to “indigo”, this process will crash. Let’s add an event handler for change_color and then see what happens when we change colors:
@impl true
def handle_event("change_color", %{"color" => color}, socket) do
  {:noreply, assign(socket, color: color)}
end
Now, if you click each of the buttons in order, the dot changes color until you click Indigo, at which point our LiveView crashes! We could simply add a new clause to our case statement, but can we get our program to repair itself?
Detection
The first component of a self-healing system is detection. In our application, we know something has gone wrong because our program raised an exception. But, how can we get access to this exception before it has the chance to crash our program?
Remember that your LiveView is a process. It has callbacks for dealing with the lifecycle of that process. When an exceptional situation occurs within the process (e.g. an application exception), the process crashes, and the client then reconnects and mounts a fresh LiveView process. This is why, when exceptions occur, you see that error flash and then the topbar attempts to reconnect the client to the server.
When a LiveView process exits, it invokes the terminate/2 callback. The terminate callback will be invoked even in the case of shutdowns, disconnections, or otherwise “normal” exits. In the case of our exception, we’ll get the information we need to isolate the issue. First, we can add a few clauses for terminate/2 to handle “normal” disconnections:
@impl true
def terminate(:normal, _socket), do: :ok
def terminate(:shutdown, _socket), do: :ok
def terminate({:shutdown, :closed}, _socket), do: :ok
Next, we can add a clause for “non-normal” situations:
def terminate(reason, _socket), do: IO.inspect(reason)
If you run this and trigger the exception again, you’ll get an output that looks like:
{{:case_clause, "indigo"},
 [
   {SelfHealingWeb.ColorsLive, :class_for_color, 1, [file: ~c"lib/self_healing_web/live/colors_live.ex", line: 41]},
   {SelfHealingWeb.ColorsLive, :"-render/1-fun-2-", 2, [file: ~c"lib/self_healing_web/live/colors_live.ex", line: 22]},
   {Phoenix.LiveView.Diff, :traverse, 7, [file: ~c"lib/phoenix_live_view/diff.ex", line: 368]},
   {Phoenix.LiveView.Diff, :"-traverse_dynamic/7-fun-0-", 4, [file: ~c"lib/phoenix_live_view/diff.ex", line: 546]},
   {Enum, :"-reduce/3-lists^foldl/2-0-", 3, [file: ~c"lib/enum.ex", line: 2510]},
   {Phoenix.LiveView.Diff, :traverse, 7, [file: ~c"lib/phoenix_live_view/diff.ex", line: 366]},
   {Phoenix.LiveView.Diff, :render, 3, [file: ~c"lib/phoenix_live_view/diff.ex", line: 136]},
   {Phoenix.LiveView.Channel, :"-render_diff/3-fun-0-", 4, [file: ~c"lib/phoenix_live_view/channel.ex", line: 962]},
   {:telemetry, :span, 3,
    [
      file: ~c"/Users/sean/projects/self_healing/deps/telemetry/src/telemetry.erl",
      line: 321
    ]},
   {Phoenix.LiveView.Channel, :render_diff, 3, [file: ~c"lib/phoenix_live_view/channel.ex", line: 957]},
   {Phoenix.LiveView.Channel, :handle_changed, 4, [file: ~c"lib/phoenix_live_view/channel.ex", line: 804]},
   {:gen_server, :try_handle_info, 3, [file: ~c"gen_server.erl", line: 1077]},
   {:gen_server, :handle_msg, 6, [file: ~c"gen_server.erl", line: 1165]},
   {:proc_lib, :wake_up, 3, [file: ~c"proc_lib.erl", line: 251]}
 ]}
This is the exit reason: an error term paired with a stacktrace. We can turn it into a readable error message using Exception.format_exit/1:
def terminate(reason, _socket) do
  exception = Exception.format_exit(reason)
  IO.write(exception)
end
Which turns our exception into this output:
an exception was raised:
    ** (CaseClauseError) no case clause matching: "indigo"
        (self_healing 0.1.0) lib/self_healing_web/live/colors_live.ex:40: SelfHealingWeb.ColorsLive.class_for_color/1
        (self_healing 0.1.0) lib/self_healing_web/live/colors_live.ex:21: anonymous fn/2 in SelfHealingWeb.ColorsLive.render/1
        (phoenix_live_view 0.20.14) lib/phoenix_live_view/diff.ex:368: Phoenix.LiveView.Diff.traverse/7
        (phoenix_live_view 0.20.14) lib/phoenix_live_view/diff.ex:546: anonymous fn/4 in Phoenix.LiveView.Diff.traverse_dynamic/7
        (elixir 1.15.7) lib/enum.ex:2510: Enum."-reduce/3-lists^foldl/2-0-"/3
        (phoenix_live_view 0.20.14) lib/phoenix_live_view/diff.ex:366: Phoenix.LiveView.Diff.traverse/7
        (phoenix_live_view 0.20.14) lib/phoenix_live_view/diff.ex:136: Phoenix.LiveView.Diff.render/3
        (phoenix_live_view 0.20.14) lib/phoenix_live_view/channel.ex:962: anonymous fn/4 in Phoenix.LiveView.Channel.render_diff/3
        (telemetry 1.2.1) /Users/sean/projects/self_healing/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
        (phoenix_live_view 0.20.14) lib/phoenix_live_view/channel.ex:957: Phoenix.LiveView.Channel.render_diff/3
        (phoenix_live_view 0.20.14) lib/phoenix_live_view/channel.ex:804: Phoenix.LiveView.Channel.handle_changed/4
        (stdlib 5.0.2) gen_server.erl:1077: :gen_server.try_handle_info/3
        (stdlib 5.0.2) gen_server.erl:1165: :gen_server.handle_msg/6
        (stdlib 5.0.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
This is pretty much all we need for the “detection” part of our self-healing system.
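The same detection mechanism can be tried outside of Phoenix. The following is a minimal sketch using a plain GenServer; the `CrashDemo` module and its message shapes are made up for this illustration. It reproduces the missing case clause, captures the exit reason in terminate/2, and formats it with Exception.format_exit/1.

```elixir
defmodule CrashDemo do
  use GenServer

  # The state is just the pid that should be told why we terminated.
  @impl true
  def init(report_to), do: {:ok, report_to}

  @impl true
  def handle_cast({:color, color}, report_to) do
    # Same bug as class_for_color/1: there is no clause for "indigo".
    _class =
      case color do
        "blue" -> "bg-blue-400"
      end

    {:noreply, report_to}
  end

  # Called with the exit reason when a callback raises.
  @impl true
  def terminate(reason, report_to) do
    send(report_to, {:terminated, reason})
  end
end

# GenServer.start/2 (not start_link/2) so the crash does not take down the caller.
{:ok, pid} = GenServer.start(CrashDemo, self())
GenServer.cast(pid, {:color, "indigo"})

reason =
  receive do
    {:terminated, reason} -> reason
  after
    1_000 -> :timeout
  end

formatted = Exception.format_exit(reason)
IO.puts(formatted)
```

The crash is also reported by the default logger, so expect an error log entry in addition to the formatted message.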
Isolation
The next component of a self-healing system is isolation. Isolation is the process of identifying the component of the application that led to an error. For our simple case, we’re only worried about a single function leading to a crash. In a real program, a crash can be caused by a series of cascading failures spanning several modules. We won’t worry about that for this trivial example.
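For a crash this simple, the failing frame can also be read straight off the raw exit reason. As a rough sketch of that hand-rolled approach (the `StacktraceIsolator` module and the "first frame in our own namespace" heuristic are assumptions for illustration):

```elixir
defmodule StacktraceIsolator do
  # Returns the first stacktrace frame whose module lives under `namespace`
  # (for example "SelfHealingWeb"), or nil when there is none.
  def isolate({_error, stacktrace}, namespace) when is_list(stacktrace) do
    Enum.find_value(stacktrace, fn
      {mod, fun, arity, location} when is_atom(mod) and is_atom(fun) ->
        if String.starts_with?(inspect(mod), namespace) do
          %{
            # Stacktraces store the file as a charlist.
            file: to_string(location[:file]),
            line: location[:line],
            function: Exception.format_mfa(mod, fun, arity)
          }
        end

      _other ->
        nil
    end)
  end

  # Exit reasons without a stacktrace carry nothing to isolate.
  def isolate(_reason, _namespace), do: nil
end

# A shortened version of the exit reason printed earlier.
reason =
  {{:case_clause, "indigo"},
   [
     {SelfHealingWeb.ColorsLive, :class_for_color, 1,
      [file: ~c"lib/self_healing_web/live/colors_live.ex", line: 41]},
     {:gen_server, :handle_msg, 6, [file: ~c"gen_server.erl", line: 1165]}
   ]}

isolated = StacktraceIsolator.isolate(reason, "SelfHealingWeb")
IO.inspect(isolated)
```

This only covers the easy case where the first frame in the application's namespace is the culprit, which is exactly the kind of assumption that breaks down with cascading failures.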
We could probably come up with some clever ways to parse a stacktrace by hand in order to identify the part of our code that led to an exception. But we live in the age of LLMs, so why not take advantage of that? We can use Instructor and ChatGPT to parse out the particular file, function, and line that led to our crash. First, we need to install Instructor:
defp deps do
  [
    # ...
    {:instructor, "~> 0.0.5"}
  ]
end
Instructor is a library for extracting structured outputs from large language models. It allows us to pass an Ecto schema alongside a prompt to an LLM, and then receive the model’s output back as the Ecto schema we provided.
After installing Instructor, we can define an isolate/1 function, which takes our formatted exception and “isolates” the issue:
defmodule SelfHealing.Isolator do
  def isolate(exception) do
    {:ok, isolated} =
      Instructor.chat_completion(
        model: "gpt-4",
        messages: [
          %{role: "system", content: "Identify the file, line, and function which caused this exception."},
          %{role: "user", content: exception}
        ],
        response_model: %{file: :string, line: :integer, function: :string}
      )

    isolated
  end
end
This function will take our exception and return a structured output that matches the response model we gave it. In this case, we’ll get back a file, line, and function name.
Now we can call our isolator inside the terminate/2 callback:
def terminate(reason, _socket) do
  exception = Exception.format_exit(reason)
  exception |> SelfHealing.Isolator.isolate() |> IO.inspect()
end
And after a bit of a delay we get:
%{
  function: "SelfHealingWeb.ColorsLive.class_for_color/1",
  line: 40,
  file: "lib/self_healing_web/live/colors_live.ex"
}
That was easy! Unfortunately, though, this is where things start to get a little hairy. We need to convert this information into actual source code. In other words, we need to retrieve the actual source for class_for_color/1 at runtime so we can give it to an LLM to repair. If you’re working in a production setting, you generally won’t build the source into your release. You could use the GitHub API to grab the source for your particular file(s) of interest. Fortunately, we’re not working in a production setting, so we can just read the source as normal:
def isolate(exception) do
  {:ok, cause} =
    Instructor.chat_completion(
      model: "gpt-4",
      messages: [
        %{role: "system", content: "Identify the file, line, and function which caused this exception."},
        %{role: "user", content: exception}
      ],
      response_model: %{file: :string, line: :integer, function: :string}
    )

  File.read!(cause.file)
end
Then, we can use Sourceror to extract the source information for the class_for_color/1 function:
def isolate(exception) do
  {:ok, cause} =
    Instructor.chat_completion(
      model: "gpt-4",
      messages: [
        %{role: "system", content: "Identify the file, line, and function which caused this exception."},
        %{role: "user", content: exception}
      ],
      response_model: %{file: :string, line: :integer, function: :string}
    )

  source = File.read!(cause.file)
  {first, last} = lines_for_function(source, cause.function)

  # Line numbers are 1-based and Enum.slice/2 is 0-based, so lines
  # first..last live at indexes (first - 1)..(last - 1).
  source =
    source
    |> String.split("\n")
    |> Enum.slice((first - 1)..(last - 1))
    |> Enum.join("\n")

  # Keep the formatted exception next to the source; the repair step needs both.
  cause
  |> Map.put(:source, source)
  |> Map.put(:exception, exception)
end

def lines_for_function(code, function) do
  ast = Sourceror.parse_string!(code)
  {fun, _arity} = parse_function(function)

  case find_function(ast, fun) do
    nodes when is_list(nodes) ->
      # Macro.path/2 returns the matching node first, followed by its parents.
      {_, meta, _args} = hd(nodes)
      {meta[:do][:line], meta[:end][:line]}

    nil ->
      raise "could not find #{function} in the source"
  end
end

defp find_function(ast, fun) do
  Macro.path(ast, fn
    {:defp, _, [{^fun, _, _args} | _]} -> true
    {:def, _, [{^fun, _, _args} | _]} -> true
    _ -> false
  end)
end
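The listing above calls a parse_function/1 helper that is not shown, and it depends on Sourceror. As a self-contained sketch of the same idea, the version below uses only the standard library: Code.string_to_quoted!/2 with `token_metadata: true` stands in for Sourceror.parse_string!/1. The `FunctionSource` module name and the body of parse_function/1 are assumptions, based on the "Module.fun/arity" shape of the isolator output.

```elixir
defmodule FunctionSource do
  # "SelfHealingWeb.ColorsLive.class_for_color/1" -> {:class_for_color, 1}
  # The arity is nil when the string has no "/arity" suffix.
  # Note: String.to_atom/1 on untrusted input can exhaust the atom table.
  def parse_function(function) do
    {qualified, arity} =
      case String.split(function, "/", parts: 2) do
        [qualified, arity] -> {qualified, String.to_integer(arity)}
        [qualified] -> {qualified, nil}
      end

    name = qualified |> String.split(".") |> List.last()
    {String.to_atom(name), arity}
  end

  # Returns the source text of the first def/defp named like `function`,
  # or nil when no such definition exists. Guarded heads are not handled.
  def extract(code, function) do
    {fun, _arity} = parse_function(function)
    ast = Code.string_to_quoted!(code, token_metadata: true)

    path =
      Macro.path(ast, fn
        {kind, _meta, [{^fun, _, _args} | _]} when kind in [:def, :defp] -> true
        _ -> false
      end)

    case path do
      [{_kind, meta, _args} | _parents] ->
        first = meta[:line]
        # One-line definitions (`def f, do: ...`) have no `end` token.
        last = meta[:end][:line] || first

        code
        |> String.split("\n")
        |> Enum.slice((first - 1)..(last - 1))
        |> Enum.join("\n")

      nil ->
        nil
    end
  end
end
```

Because line numbers are 1-based while Enum.slice/2 is 0-based, both ends of the range are shifted by one.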
This won’t work in all situations and I wouldn’t recommend doing this on production code. But I wouldn’t recommend doing any of what we’re trying to do in production code either. Now, if you trigger the exception again, you should see the source for the function we’re working with! Now we can move on to the final stage: recovery.
Recovery
At this point, we’ve successfully built our application to detect and isolate software issues. Now we need to get it to repair itself. To do this, we’ll first add a simple prompt that sends our context and exception off to GPT-4 to suggest a fix. First, you’ll need to install the openai package:
defp deps do
  [
    # ...
    {:openai, "~> 0.6.1"}
  ]
end
Then, we can format the cause we isolated in the previous section into a prompt:
defmodule SelfHealing.Repair do
  def repair(%{exception: exception, source: source}) do
    {:ok, response} =
      OpenAI.chat_completion(
        model: "gpt-4",
        messages: [%{role: "user", content: prompt(exception, source)}]
      )

    # The model's reply is the content of the first choice.
    get_in(response, [:choices, Access.at(0), "message", "content"])
  end

  defp prompt(exception, source) do
    """
    Given the following exception which has been raised in the given source \
    function, return a fixed version of the function. Absolutely DO NOT return \
    anything but valid Elixir code. If you return anything but valid Elixir code \
    very bad things will happen. Do NOT put your code in code blocks either.
    Exception: #{exception}
    Source: #{source}
    Code:
    """
  end
end
Now, if we update our terminate/2 callback to:
def terminate(reason, _socket) do
  reason
  |> Exception.format_exit()
  |> SelfHealing.Isolator.isolate()
  |> SelfHealing.Repair.repair()
  |> IO.puts()
end
and trigger the exception again, you should see this in the console:
defp class_for_color(color) do
  case color do
    "blue" -> "bg-blue-400"
    "red" -> "bg-red-400"
    "green" -> "bg-green-400"
    "indigo" -> "bg-indigo-400"
  end
end
Exactly what we wanted! Now, we need to take this and replace the original function lines in our source with it, without overwriting any of the other functions in the module. We can do this with a simple String.replace/3:
def repair(%{exception: exception, source: source, file: file}) do
  {:ok, response} =
    OpenAI.chat_completion(
      model: "gpt-4",
      messages: [%{role: "user", content: prompt(exception, source)}]
    )

  fixed = get_in(response, [:choices, Access.at(0), "message", "content"])
  String.replace(File.read!(file), source, fixed)
end
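Because the model's reply is spliced into the source verbatim, it may be worth checking that it at least parses before using it. A minimal sketch, assuming a hypothetical `RepairGuard` module that is not part of the original code: it drops a wrapping Markdown fence if the model added one despite the prompt, then tries to parse the result with Code.string_to_quoted/1.

```elixir
defmodule RepairGuard do
  # Built at runtime so this listing contains no literal Markdown fence.
  defp fence, do: String.duplicate("`", 3)

  # Drops a leading and a trailing fence line if the reply is wrapped in them.
  def strip_fences(reply) do
    lines = reply |> String.trim() |> String.split("\n")

    lines =
      if String.starts_with?(hd(lines), fence()), do: tl(lines), else: lines

    lines =
      if lines != [] and String.starts_with?(List.last(lines), fence()),
        do: Enum.drop(lines, -1),
        else: lines

    Enum.join(lines, "\n")
  end

  # Returns {:ok, code} when the reply parses as Elixir and an error otherwise.
  # Parsing only checks syntax; it says nothing about whether the fix is right.
  def validate(reply) do
    code = strip_fences(reply)

    case Code.string_to_quoted(code) do
      {:ok, _ast} -> {:ok, code}
      {:error, _reason} -> {:error, :invalid_elixir}
    end
  end
end
```

A reply that fails validation could simply be discarded, leaving the original module in place.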
With the patched source in hand, we can compile the module on the fly with Code.compile_string/2 (note that this snippet never writes the fix back to disk):
def repair(%{exception: exception, source: source, file: file}) do
  {:ok, response} =
    OpenAI.chat_completion(
      model: "gpt-4",
      messages: [%{role: "user", content: prompt(exception, source)}]
    )

  fixed = get_in(response, [:choices, Access.at(0), "message", "content"])

  file
  |> File.read!()
  |> String.replace(source, fixed)
  |> Code.compile_string()
end
This will compile the module, and it should replace the original compiled version of this module within your application. Now, if you trigger the exception again, the LiveView will still crash once, but this time terminate/2 patches and recompiles the module. Then, you can try to click the indigo button again, and your circle will turn the proper color!
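The patch-and-recompile step can be tried in isolation, without an LLM or Phoenix. In the sketch below, the `HotSwapDemo.Palette` module and the hard-coded `fixed` string (standing in for the model's reply) are assumptions for illustration.

```elixir
# Redefining a module normally prints a warning; silence it for this demo.
Code.compiler_options(ignore_module_conflict: true)

source = """
defmodule HotSwapDemo.Palette do
  def class_for_color(color) do
    case color do
      "blue" -> "bg-blue-400"
    end
  end
end
"""

# The isolated function, exactly as it appears in the source.
broken = """
  def class_for_color(color) do
    case color do
      "blue" -> "bg-blue-400"
    end
  end
"""

# Stand-in for the fix an LLM would return.
fixed = """
  def class_for_color(color) do
    case color do
      "blue" -> "bg-blue-400"
      "indigo" -> "bg-indigo-400"
    end
  end
"""

Code.compile_string(source)

# Bind the module to a variable so the call is resolved at runtime.
palette = HotSwapDemo.Palette

before_fix =
  try do
    palette.class_for_color("indigo")
  rescue
    error in CaseClauseError -> {:error, error.term}
  end

# Splice in the fix and load the new version of the module.
source
|> String.replace(broken, fixed)
|> Code.compile_string()

after_fix = palette.class_for_color("indigo")
IO.inspect({before_fix, after_fix})
```

The first call fails on the missing clause; after the recompile, the same call goes through the new version of the module without restarting anything.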
Conclusion
This is a pretty silly example, but, hopefully, it gets you thinking about some of the things that are possible in the future as LLMs become more capable. Issue detection and isolation in software systems are extremely valuable, and both can be significantly sped up with the use of LLMs. Additionally, LLMs that suggest fixes for simple issues like this can save you a lot of time, even if you don’t actually automatically ship the repair.
And if LLMs actually do increase exponentially in capability in the next few years, and self-healing is both desirable and possible, it’s nice to know that Elixir and Erlang make this process relatively painless. Until next time!