# Creating Process Continuity in Distributed Elixir

by  
Ryan Spore  
December 12, 2022

## **Introduction**  
[Kohort](https://kohort.io/) is an opinionated remote meeting platform we built recently and use daily at Mechanical Orchard. It lets distributed teams have meetings with more structure, transparency, and connection, with an integrated meeting agenda, live transcription, AI summarization, and much more.

Whenever you join a meeting via Kohort, we spin up an Elixir process that lives for the length of the meeting. It does a variety of things, from sampling the transcript for keywords to orchestrating API calls as the meeting starts and ends. We need this process to have continuity throughout the meeting (and meetings can last a long time, amirite?). But machines die and deploys happen, and without care, that continuity is lost.

If you have long-running processes too, you will face the same problem: What happens to a long-lived process when the machine it’s running on dies?

We really wanted the answer to that question to be “A new process starts immediately on another machine, picks up from where the dead process stopped, and everyone else starts talking to the new process without even noticing the difference.”

We got pretty close! I’ll tell you how.

## **Basic Setup**  
We started with [Horde](https://github.com/derekkraan/horde). Horde provides distributed DynamicSupervisor and Registry implementations, which we'll use to supervise our long-running processes and maintain their uniqueness.

```elixir
defmodule MyApp.Application do
  use Application

  @impl Application
  def start(_type, _args) do
    children = [
      # ... other processes in your application
      {Horde.Registry, [name: MyApp.DistributedRegistry, keys: :unique, members: :auto]},
      {Horde.DynamicSupervisor,
       [
         name: MyApp.DistributedSupervisor,
         strategy: :one_for_one,
         # Give each child up to 10 seconds to finish terminate/2 on shutdown.
         shutdown: 10_000,
         members: :auto
       ]}
      # ... other processes in your application
    ]

    # ... other startup work
    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end

  # ... other application logic
end
```

Let’s start with a long-running process GenServer skeleton, something like this:

```elixir
defmodule MyApp.Process do
  use GenServer, restart: :transient

  def start_link(name) do
    case GenServer.start_link(__MODULE__, name, name: via_tuple(name)) do
      {:ok, pid} -> {:ok, pid}
      # The name is already registered somewhere in the cluster, so skip this child.
      {:error, {:already_started, _pid}} -> :ignore
    end
  end

  # Callers address the process by name, wherever in the cluster it lives.
  def via_tuple(name), do: {:via, Horde.Registry, {MyApp.DistributedRegistry, name}}

  @impl GenServer
  def init(name) do
    {:ok, name, {:continue, :post_init}}
  end

  @impl GenServer
  def handle_continue(:post_init, name) do
    # initial_state/0 is your own function; this assumes it returns a map.
    state = Map.put(initial_state(), :name, name)
    {:noreply, state}
  end

  @impl GenServer
  def handle_call({:trigger_manual_shutdown, shutdown_signal}, _from, state) do
    {:stop, shutdown_signal, :ok, state}
  end

  @impl GenServer
  def terminate(_reason, _state), do: :ok

  # ... the work your process does ...
end
```

You can start it with our Horde supervisor:

```elixir
Horde.DynamicSupervisor.start_child(MyApp.DistributedSupervisor, {MyApp.Process, name})
```

## **Cluster Setup**  
To have meaningful handoff, you need your application to operate as a cluster. We use libcluster's Kubernetes strategies to get our BEAM nodes talking together (see [this blog post](https://blog.differentpla.net/blog/2022/01/08/libcluster-kubernetes/) for an excellent guide on how to get those to play nicely).

Once you’ve got the application clustering, you can set up Horde to supervise your long-lived processes.

In this state, if the node your process is running on dies, Horde restarts the process on a different node, but only with the same starting state that was provided in the `start_child` call. Anything the process learned since then is lost.

## **Graceful State Handoff**  
In order to transfer state to the new node, we need to handle the exit signals Horde uses when a node dies. First, we need to trap exits:

```elixir
def init(name) do
  Process.flag(:trap_exit, true)
  {:ok, name, {:continue, :post_init}}
end
```

Then, in `terminate/2`, we intercept the normal shutdown signal triggered by the node shutting down, write some handoff state to the Horde.Registry meta, and sleep to give the Registry a chance to sync with other nodes.
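A minimal sketch of that shutdown path could look like the following. The `{:graceful_handoff, state}` tag matches the one read back in the recovery step later in this post. The module name `MyApp.GracefulProcess`, the shape of the state map, and the one-second sleep are assumptions for illustration; tune the delay for your own cluster.

```elixir
defmodule MyApp.GracefulProcess do
  use GenServer, restart: :transient

  # Assumed placeholder: time to let the registry CRDT replicate before exiting.
  @sync_delay_ms 1_000

  def start_link(name) do
    via_tuple = {:via, Horde.Registry, {MyApp.DistributedRegistry, name}}
    GenServer.start_link(__MODULE__, name, name: via_tuple)
  end

  @impl GenServer
  def init(name) do
    # Without trapping exits, terminate/2 is not called on supervisor shutdown.
    Process.flag(:trap_exit, true)
    {:ok, %{name: name}}
  end

  @impl GenServer
  def terminate(:shutdown, state) do
    # The node is going down: publish the current state for the replacement process.
    Horde.Registry.put_meta(MyApp.DistributedRegistry, state.name, {:graceful_handoff, state})
    Process.sleep(@sync_delay_ms)
    :ok
  end

  # Any other exit reason leaves the registry meta untouched.
  def terminate(_reason, _state), do: :ok
end
```

Whatever delay you pick has to fit inside the supervisor's `shutdown` timeout (10 seconds in the setup above), or the process is killed before `terminate/2` returns.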

Your application needs some shared datastore that both the dying and the spawning nodes can communicate with. This could be a database or another external datastore, but we chose the Horde.Registry meta, which is kept synchronized across the cluster using a CRDT. For this to work, our nodes had to stay clustered until they shut down completely, yet be removed from the load balancer so that users didn't connect to dying nodes. Using the [pods lookup mode](https://blog.differentpla.net/blog/2022/01/08/libcluster-kubernetes/#lookup-mode-pods), we were able to keep pods clustered during their shutdown, after they had already been removed from the endpoint that manages load balancing.
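For reference, a libcluster topology using the pods lookup mode might be configured roughly like this. The topology name, node basename, selector, namespace, and polling interval are hypothetical values; substitute the ones that match your own deployment.

```elixir
# config/runtime.exs
import Config

config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        mode: :ip,
        # Query pods directly rather than the Endpoints object, so pods that
        # are shutting down stay visible to the rest of the cluster.
        kubernetes_ip_lookup_mode: :pods,
        # Hypothetical values below: adjust to your release and labels.
        kubernetes_node_basename: "my_app",
        kubernetes_selector: "app=my-app",
        kubernetes_namespace: "default",
        polling_interval: 10_000
      ]
    ]
  ]
```

The topologies are then passed to a `Cluster.Supervisor` child in your application's supervision tree, as described in the libcluster README.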

If your long-running process isn’t eternal, you need an alternate way to shut down the process that clears the handoff state, like this:

```elixir
@impl GenServer
def terminate({:shutdown, :finished}, state) do
  # The process finished on purpose, so clear any handoff state for this name.
  Horde.Registry.put_meta(MyApp.DistributedRegistry, state.name, nil)
  :ok
end
```
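One way to reach that clause is the `:trigger_manual_shutdown` call from the skeleton earlier, sent through the via tuple so the caller never needs to know which node hosts the process. The `MyApp.Meetings` module and its function names below are hypothetical wrappers, shown only as a sketch.

```elixir
defmodule MyApp.Meetings do
  # Hypothetical convenience module; the names are illustrative only.

  # Resolves the process by name through the distributed registry.
  def via(name), do: {:via, Horde.Registry, {MyApp.DistributedRegistry, name}}

  # Starts the long-running process somewhere in the cluster.
  def start(name) do
    Horde.DynamicSupervisor.start_child(MyApp.DistributedSupervisor, {MyApp.Process, name})
  end

  # Stops the process with {:shutdown, :finished}, which terminate/2 matches on.
  def finish(name) do
    GenServer.call(via(name), {:trigger_manual_shutdown, {:shutdown, :finished}})
  end
end
```

Because the process is declared with `restart: :transient`, a `{:shutdown, :finished}` exit counts as a normal stop and the supervisor does not restart it.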

## **Clumsy State Handoff**  
Graceful shutdown isn’t guaranteed, so to hedge against a clumsier handoff, we eagerly write as much state as is available and add a more complex state recovery step. You can try reconstituting state from a database or event log.

```elixir
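@impl GenServer
def handle_continue(:post_init, name) do
  state =
    case Horde.Registry.meta(MyApp.DistributedRegistry, name) do
      {:ok, {:graceful_handoff, state}} -> state
      {:ok, {:clumsy_handoff, state}} -> attempt_state_recovery(state)
      # No handoff state, or it was cleared: start fresh.
      # Assumes initial_state/0 returns a map.
      _ -> Map.put(initial_state(), :name, name)
    end

  # Eagerly record what we know in case this node dies without a graceful shutdown.
  Horde.Registry.put_meta(MyApp.DistributedRegistry, name, {:clumsy_handoff, state})
  {:noreply, state}
end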
```
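What `attempt_state_recovery/1` does depends entirely on your application. As one illustration, here is a sketch that replays an event log on top of the last snapshot. Everything in it is an assumption: the `MyApp.StateRecovery` module, the `:last_seq` and `:keywords` fields, and the `{seq, event}` log format are hypothetical, and the events are passed in explicitly to keep the function pure.

```elixir
defmodule MyApp.StateRecovery do
  # snapshot: the last state written to the registry meta; assumed to carry
  #           a :last_seq field marking the newest event it already includes.
  # events:   a list of {seq, event} tuples read from a durable log.
  def replay(snapshot, events) do
    events
    |> Enum.filter(fn {seq, _event} -> seq > snapshot.last_seq end)
    |> Enum.sort_by(fn {seq, _event} -> seq end)
    |> Enum.reduce(snapshot, fn {seq, event}, state ->
      state |> apply_event(event) |> Map.put(:last_seq, seq)
    end)
  end

  # Hypothetical event type: a keyword sampled from the transcript.
  defp apply_event(state, {:keyword_seen, word}) do
    Map.update(state, :keywords, [word], &[word | &1])
  end

  # Unknown events are skipped rather than crashing the recovery.
  defp apply_event(state, _unknown), do: state
end
```

Events older than the snapshot are ignored, so replaying the same log twice leaves the state unchanged.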

With all this in place, you have a system that moves long-running processes off of dying nodes, and uses the best available information to keep them running from where they left off.

## **Further Improvements**  
Horde's distribution is eventually consistent, so sometimes duplicate processes will get started. You can look into [merge conflict resolution](https://hexdocs.pm/horde/eventual_consistency.html#horde-dynamicsupervisor-merge-conflict) to handle the messages Horde uses to resolve these conflicts.
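As a rough sketch, a process that traps exits could handle the registry's conflict signal like this. The message shape below follows Horde's eventual consistency guide as we understand it, so treat it as an assumption and verify it against the Horde version you run. The module name is hypothetical.

```elixir
defmodule MyApp.ConflictAwareProcess do
  use GenServer, restart: :transient
  require Logger

  @impl GenServer
  def init(name) do
    # Trapping exits turns the conflict signal into a regular message.
    Process.flag(:trap_exit, true)
    {:ok, %{name: name}}
  end

  @impl GenServer
  def handle_info({:EXIT, _from, {:name_conflict, {key, _value}, _registry, winner}}, state) do
    # This copy lost the conflict: step aside and let the winning process continue.
    Logger.warning("duplicate process for #{inspect(key)}, deferring to #{inspect(winner)}")
    {:stop, :normal, state}
  end

  def handle_info(_message, state), do: {:noreply, state}
end
```

Stopping with `:normal` keeps a `:transient` child from being restarted, which is what you want for the losing duplicate.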

You can also enhance your clumsy handoff recovery by adding periodic writes to the clumsy handoff state or improving the state recovery logic when a new node attempts a clumsy recovery.
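Periodic writes can be as simple as a timer message the process sends to itself. The sketch below assumes a five-second interval and a hypothetical `:persist` option, so the registry meta write can be swapped for a database write; neither detail comes from the setup described above.

```elixir
defmodule MyApp.CheckpointingProcess do
  use GenServer, restart: :transient

  # Assumed default; pick an interval based on how much progress you can afford to lose.
  @checkpoint_every_ms 5_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl GenServer
  def init(opts) do
    name = Keyword.fetch!(opts, :name)
    # Hypothetical option: a 1-arity function that stores the snapshot.
    persist = Keyword.get(opts, :persist, &default_persist/1)
    interval = Keyword.get(opts, :interval, @checkpoint_every_ms)
    schedule(interval)
    {:ok, %{name: name, persist: persist, interval: interval}}
  end

  @impl GenServer
  def handle_info(:checkpoint, state) do
    # Store a snapshot without the persist function itself, then re-arm the timer.
    state.persist.(Map.drop(state, [:persist]))
    schedule(state.interval)
    {:noreply, state}
  end

  defp schedule(interval), do: Process.send_after(self(), :checkpoint, interval)

  # Default: overwrite the clumsy handoff state in the registry meta.
  defp default_persist(snapshot) do
    Horde.Registry.put_meta(MyApp.DistributedRegistry, snapshot.name, {:clumsy_handoff, snapshot})
  end
end
```

A shorter interval narrows the window of lost state at the cost of more CRDT traffic across the cluster.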

Elixir has proved to be an excellent (and enjoyable!) tool for building a modern, resilient, scalable RTC application. Phoenix and LiveView allow us to build responsive apps quickly, and OTP allows us to build a scalable and fault-tolerant video call experience. It’s been a pleasure working in this ecosystem to deliver an innovative product that’s always evolving and improving.
