This is the tale of cool bugs.
Working on BlockScout for POA Network, we’ve moved beyond indexing just POA Network’s core and test chains and started to test that BlockScout can index Ethereum Mainnet and Ethereum Classic (ETC). The goal of BlockScout is to have an open-source alternative to the closed-source Etherscan that the majority of the Ethereum ecosystem is currently dependent on.
While testing against Ethereum Classic, we ran into a bug where part of the Indexer OTP app, Indexer.InternalTransactionFetcher, kept timing out on a GenServer.call:
04:40:55.756 [error] Task #PID<0.18499.21> started from #PID<0.18448.21> terminating
** (stop) exited in: GenServer.call(Indexer.InternalTransactionFetcher, {:buffer, [...]}, 10000)
** (EXIT) time out
(elixir) lib/gen_server.ex:924: GenServer.call/3
(indexer) lib/indexer/block_fetcher/catchup.ex:140: Indexer.BlockFetcher.Catchup.async_import_remaining_block_data/2
(indexer) lib/indexer/block_fetcher/catchup.ex:107: Indexer.BlockFetcher.Catchup.import/2
(indexer) lib/indexer/block_fetcher.ex:114: Indexer.BlockFetcher.fetch_and_import_range/2
(indexer) lib/indexer/block_fetcher/catchup.ex:167: Indexer.BlockFetcher.Catchup.fetch_and_import_range_from_sequence/3
(elixir) lib/task/supervised.ex:89: Task.Supervised.do_apply/2
(elixir) lib/task/supervised.ex:38: Task.Supervised.reply/5
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Function: &:erlang.apply/2
Args: [#Function<8.121428738/1 in Indexer.BlockFetcher.Catchup.stream_fetch_and_import/2>, [5569266..5569257]]
Because of the awesomeness that is OTP supervision trees, these timeouts weren’t killing the app. They did, however, make indexing very slow: this part of the tree had to restart, try again, make some progress, and then hit the timeouts again.
Tracing the code, the following line in the stacktrace:
(indexer) lib/indexer/block_fetcher/catchup.ex:140: Indexer.BlockFetcher.Catchup.async_import_remaining_block_data/2
corresponded to:
blockscout/apps/indexer/lib/indexer/block_fetcher/catchup.ex Line 140 in d106a2c
|> InternalTransactionFetcher.async_fetch(10_000)
This means that the async_fetch is taking longer than 10 seconds (10_000 ms = 10 s).
Internally, the timeout is used for this GenServer.call:
blockscout/apps/indexer/lib/indexer/internal_transaction_fetcher.ex Lines 42 to 47 in d106a2c
@spec async_fetch([%{required(:block_number) => Block.block_number(), required(:hash) => Hash.Full.t()}]) :: :ok
def async_fetch(transactions_fields, timeout \\ 5000) when is_list(transactions_fields) do
params_list = Enum.map(transactions_fields, &transaction_fields_to_params/1)
BufferedTask.buffer(__MODULE__, params_list, timeout)
end
blockscout/apps/indexer/lib/indexer/buffered_task.ex Lines 157 to 159 in d106a2c
def buffer(server, entries, timeout \\ 5000) when is_list(entries) do
GenServer.call(server, {:buffer, entries}, timeout)
end
When the call is received in the server, it buffers the entries by prepending them to the state’s current_buffer:
blockscout/apps/indexer/lib/indexer/buffered_task.ex Lines 266 to 268 in d106a2c
def handle_call({:buffer, entries}, _from, state) do
{:reply, :ok, buffer_entries(state, entries)}
end
blockscout/apps/indexer/lib/indexer/buffered_task.ex Lines 288 to 292 in d106a2c
defp buffer_entries(state, []), do: state
defp buffer_entries(state, entries) do
%{state | current_buffer: [entries | state.current_buffer]}
end
This is a simple prepend to a list, which is fast, so the only way it could take a lot of time is if too much data is being transferred between processes. However, it could also be that the handle_call/3 hasn’t even been called yet because the InternalTransactionFetcher is too busy doing something else. More investigation was needed.
Tracing
At this point, I couldn’t be sure if the GenServer.call never returned or if it just took longer than 10 seconds to return. To determine which was the case, I turned on tracing. The InternalTransactionFetcher is a callback module for BufferedTask, which is built on top of GenServer. GenServer supports :debug options that turn on :sys.trace when the GenServer is started. We could pass these in when the BufferedTask is started as a worker, but first we needed to allow all GenServer options to be passed through.
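For reference, here is a minimal sketch of two ways to get this kind of tracing (the worker module and init argument are placeholders, not the actual BufferedTask API):

```elixir
# Enable tracing at start-up by passing :debug through to GenServer.start_link/3;
# BufferedTask had to be changed to forward arbitrary GenServer options like this.
GenServer.start_link(MyWorker, :ok, name: MyWorker, debug: [:trace])

# Or toggle tracing on an already-running, locally registered process:
:sys.trace(Indexer.InternalTransactionFetcher, true)

# ...and turn it back off when done.
:sys.trace(Indexer.InternalTransactionFetcher, false)
```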
Turning on tracing for the InternalTransactionFetcher showed that it does eventually respond to the request, so it was simply a matter of the timeout being too short.
At this point there were three paths forward:

1. Simply increase the timeout. Since the timeout is 10 seconds now, anything that hit the timeout before will of course take longer than 10 seconds.
2. Change from GenServer.call to GenServer.cast.
3. Investigate why InternalTransactionFetcher is taking more than 10 seconds to service the request.
Why 10 seconds wasn’t good enough wasn’t obvious to me, since the call should involve a simple copy of a list over a GenServer.call message, so I didn’t feel like (1) would address the actual root problem. I was too worried about (2) leading to mailbox flooding, as it would remove backpressure. That left (3): figure out why InternalTransactionFetcher needs more than 10 seconds. It was the harder path, but better long-term.
Profiling
I couldn’t remember how to configure tracing of calls to get call timing stats, so instead, I decided to use :eprof. Completely unexpectedly, erlang:send/3 took the most time!
Function | Calls | % | Time (µs) | µs / Call |
---|---|---|---|---|
Indexer.BufferedTask.buffer_entries/2 | 3145 | 0.16 | 98186 | 31.22 |
Elixir.Enum."-chunk_while/4-fun-0-"/3 | 213086 | 0.16 | 99449 | 0.47 |
:lists.do_flatten/2 | 219409 | 0.19 | 119474 | 0.54 |
Stream.Reducers."-chunk_every/5-fun-0-"/5 | 213086 | 0.23 | 146141 | 0.69 |
:lists.reverse/2 | 21407 | 0.27 | 169811 | 7.93 |
:gen_server.try_dispatch/3 | 4568 | 0.61 | 377158 | 82.57 |
:queue.len/1 | 2250 | 3.40 | 2117414 | 941.07 |
:erlang.send/3 | 2250 | 94.28 | 58679566 | 26079.81 |
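For reference, a profile like the one above can be collected with :eprof along these lines (a sketch; here the root set is just the named fetcher process):

```elixir
# Profile the named fetcher for a minute of real work, then print the results.
pid = Process.whereis(Indexer.InternalTransactionFetcher)

:eprof.start()
:eprof.start_profiling([pid])
Process.sleep(60_000)
:eprof.stop_profiling()
:eprof.analyze()   # prints a table like the one above
:eprof.stop()
```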
Since sends were dominating by so much, I wanted to check whether we were backing up on processing messages. Using :observer, I was able to get the stack trace from the named Indexer.InternalTransactionFetcher process. The Stack Trace tab often showed Indexer.InternalTransactionFetcher busy spawning the next batch.
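Roughly the same information is available from an IEx shell without the GUI, if that is more convenient:

```elixir
# Peek at what the fetcher is doing right now (what :observer's Stack Trace tab shows).
Indexer.InternalTransactionFetcher
|> Process.whereis()
|> Process.info(:current_stacktrace)
```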
Determining the cause of the slow send was going to be difficult, though, because Indexer.TaskSupervisor was a Task.Supervisor used for all Tasks in Indexer, not just for Indexer.InternalTransactionFetcher’s Tasks. I decided to restructure the supervision tree so that each “fetcher” that spawned Tasks would have its own Task.Supervisor. That way I would be able to attribute memory usage to specific fetchers and ensure that other fetchers weren’t slowing down Indexer.InternalTransactionFetcher’s communication with Indexer.TaskSupervisor.
(Diagrams of the supervision tree before and after the restructuring are not reproduced here.)
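In code, the restructuring amounts to something like the following (a simplified sketch with illustrative child specs, not the actual Indexer supervision tree):

```elixir
# Before: one shared Task.Supervisor for every fetcher.
children = [
  {Task.Supervisor, name: Indexer.TaskSupervisor},
  Indexer.InternalTransactionFetcher
  # ...other fetchers, all spawning tasks under Indexer.TaskSupervisor
]

# After: each fetcher gets its own named Task.Supervisor, so memory and
# message-queue growth can be attributed to a specific fetcher.
children = [
  {Task.Supervisor, name: Indexer.InternalTransaction.TaskSupervisor},
  Indexer.InternalTransaction.Fetcher
  # ...one Task.Supervisor per remaining fetcher
]

Supervisor.start_link(children, strategy: :one_for_one)
```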
With Indexer.InternalTransaction.Fetcher using its own Task.Supervisor, Indexer.InternalTransaction.TaskSupervisor, I was able to see that Indexer.InternalTransaction.Fetcher alone was the problem, as the timeout still occurred.
When there are performance issues, the easiest thing to do with :observer is:
- Go to the “Processes” tab.
- Sort columns to find outliers.
- Click the “Memory” column to see if any process has excessive memory usage.
- Click the “MsgQ” column to see if any process has a non-empty message queue. It is rare for messages to stay in a queue long enough to be visible in Observer’s GUI.
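The same two signals can also be read from an IEx shell instead of the GUI; for example, this rough snippet lists processes with a large heap or a non-empty message queue (the 10 MB threshold is arbitrary):

```elixir
# Collect [:registered_name, :memory, :message_queue_len] for every process and
# keep the suspicious ones; Process.info/2 returns nil for dead processes, which
# the `info =` filter silently skips.
suspects =
  for pid <- Process.list(),
      info = Process.info(pid, [:registered_name, :memory, :message_queue_len]),
      info[:memory] > 10_000_000 or info[:message_queue_len] > 0,
      do: info

Enum.sort_by(suspects, & &1[:memory], :desc)
```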
In our case, Indexer.InternalTransaction.TaskSupervisor was using 31 MB (!) of memory after only a few minutes and, unlike everything else in the system, Indexer.InternalTransaction.Fetcher had a message queue. When I let it run for longer, the memory just kept growing, as did the message queue. Whatever was causing the memory leak in Indexer.InternalTransaction.TaskSupervisor was slowing down Indexer.InternalTransactionFetcher enough to give it a message queue. The memory leak was the root cause: if I fixed that, I would probably fix the message queue and, therefore, the original timeout reported in the bug.
I had no idea why Indexer.InternalTransaction.TaskSupervisor should have any noticeable amount of memory, since each Task it supervises should be a batch of only 10 JSONRPC requests. I added some print statements to Indexer.InternalTransactionFetcher and Indexer.BufferedTask, and they both showed just 10 JSONRPC requests of reasonable size.
Grasping at any efficiency problems I could, I thought it might be a problem with REFC binaries, because we do a lot of JSON parsing. However, it was unlikely that the JSON library had a leak like this, as it would have affected other people as well. That left a problem with how data was transferred to the Task.Supervisor, so I checked the Erlang Efficiency Guide’s Processes chapter for something we were doing wrong.
I was drawn to the Loss of Sharing section:
Shared subterms are not preserved in the following cases:
- When a term is sent to another process
- When a term is passed as the initial process arguments in the spawn call
- When a term is stored in an Ets table
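The effect is easy to demonstrate with :erts_debug.size/1 (size with sharing preserved) versus :erts_debug.flat_size/1 (size after sharing is lost, which is roughly what a send copies); a quick illustration:

```elixir
# Build a term where the same large list appears twice as a shared subterm.
shared = Enum.to_list(1..100_000)
term = {shared, shared}

:erts_debug.size(term)       # counts the shared list only once
:erts_debug.flat_size(term)  # nearly twice as large; this is what crossing a process boundary costs
```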
So my hypothesis was that the batches had shared subterms that were getting duplicated. We were also using Task.Supervisor.async_nolink/3 with an anonymous function.
[blockscout/apps/indexer/lib/indexer/buffered_task.ex Lines 379 to 381 in 6aaa300]
Task.Supervisor.async_nolink(state.task_supervisor, fn ->
{:performed, state.callback_module.run(batch, retries, state.callback_module_state)}
end)
I was worried that this anonymous function’s closure was somehow leading to the memory bloat – that it was grabbing too much from the Indexer.InternalTransactionFetcher as the environment for the closure – so I decided to switch to Task.Supervisor.async_nolink/5.
[blockscout/apps/indexer/lib/indexer/buffered_task.ex Lines 379 to 383 in 5b7872f]
Task.Supervisor.async_nolink(state.task_supervisor, state.callback_module, :run, [
batch,
retries,
state.callback_module_state
])
It worked!
After the change, Indexer.InternalTransaction.TaskSupervisor disappeared from the top of the memory usage list and Indexer.InternalTransaction.Fetcher no longer had a MsgQ.
I fixed the bug, but I wasn’t satisfied: I wanted to know why using a closure led to such problems. Thankfully, on the BlockScout project, POA Network also hired Plataformatec to work on another section of the application, so I could ask José Valim to explain why this bug happened.
So, the root cause of the bug and the memory leak was that Indexer.BufferedTask extracted the callback_module and callback_module_state fields from state inside the anonymous function. Because of that, the entire state of the Indexer.BufferedTask was captured by the closure. That state contained the entire :queue of future work, so every time Indexer.BufferedTask.spawn_next_batch sent a batch of 10 requests to the Task.Supervisor, it was also copying the entire queue of all the future work.
To me, that still left the question of why the memory was attributed to the Task.Supervisor instead of the individual Tasks. Thankfully, we have some OTP experts here at DockYard too, and Paul Schoenfelder was able to explain why the Task.Supervisor would use the memory instead of only the Task.
Takeaway
When using anonymous functions with Task.Supervisor.async_nolink, or any other function that runs an anonymous function in another process, make sure that you don’t capture more than is absolutely needed to do the work in that process. Most importantly, don’t do state.field inside the anonymous function, as it forces the whole state into the closure’s environment. When the anonymous function passes through a Supervisor, as is the case with the DynamicSupervisor that powers Task.Supervisor, memory can be held in the copy of the child spec for the anonymous function’s Task.
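A minimal illustration of the difference (the fields here mirror the BufferedTask example above; the shape of your own state will differ):

```elixir
# Risky: referencing state.field inside the closure captures the whole state,
# including the potentially huge :queue of future work.
Task.Supervisor.async_nolink(state.task_supervisor, fn ->
  state.callback_module.run(batch, retries, state.callback_module_state)
end)

# Better: pull out only what the task needs before building the closure.
callback_module = state.callback_module
callback_module_state = state.callback_module_state

Task.Supervisor.async_nolink(state.task_supervisor, fn ->
  callback_module.run(batch, retries, callback_module_state)
end)

# Better still: skip the closure entirely with the module/function/args form,
# as in the fix shown earlier.
```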
You can detect this type of problem by using :observer:
- If your Task.Supervisors have memory measured in tens of MBs instead of KBs in the Memory column of the Processes table.
- If your Task.Supervisor.async_nolink callers have a non-zero MsgQ.

But, at the very beginning, this can all just appear as a simple proximate cause:
- If your Task.Supervisor.async_nolink callers are timing out.
Epilogue
Once the changes were in, I could see that the memory usage for the Task.Supervisor and the message queue length were fixed, which fixed the timeouts in the original bug. What I wasn’t expecting was the speed improvement. Andrew Cravenho from POA Network reported that he was able to index all blocks on POA’s Sokol test chain in 1 hour. Our previous best time was 5.5 hours a few months ago, when the chain was shorter, so not only did we eliminate the timeouts, queues, and memory leaks, we also made the overall performance 5x faster.
This bug was in the project for almost its entire history, but it never became a problem until we tested against a specific dataset (Ethereum Classic) that made it bad enough to trigger timeouts. Even if you don’t have timeouts, it is still worth checking in :observer to see whether any processes are outliers for memory or MsgQ; they may indicate a hidden bug in your code that OTP is compensating for.