High Availability

To aid overall system availability, my Home Assistant deployment operates as an Active/Standby pair. This relies on MQTT clustering to keep states consistent between the Home Assistant instances, and on git to keep the configuration consistent between the two.

Beyond this, a couple of small tricks are required to get two Home Assistant instances to play nicely together.

Variables

A number of settings must be unique for each Home Assistant instance in a cluster. These include:

name
hostname
internal_url (and possibly external_url, depending on your setup)

Different methods exist for keeping these values unique within a common configuration; a quick hack is to use the secrets file. Ordinarily, secrets.yaml stores passwords and secret keys in a single file, masking these values from the main configuration. It can also serve as a pseudo-variable system for values that must be configured uniquely for each instance.

secrets.yaml

# Templates for hosts
hostname: leonard
partner: johnson
name: Home
google_report_state: true

configuration.yaml

homeassistant:
  ...
  name: !secret name
  ...
  internal_url: !secret internal_url
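
As an illustration, the partner instance's secrets.yaml might mirror the one above with the hostname and partner values swapped. This is only a sketch: the name and internal_url values below are placeholders rather than values from my setup, and any other per-instance keys are omitted.

```yaml
# secrets.yaml on the partner instance (sketch)
hostname: johnson
partner: leonard
# Hypothetical name; anything that differs from the partner's will do
name: Home Standby
# Placeholder URL (assumption); use whatever address reaches this instance
internal_url: http://johnson.example.lan:8123
```

Because configuration.yaml only ever refers to these keys through !secret, the same configuration can be checked out from git on both hosts unchanged.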

Active/Standby State

Only one instance should be active at any given time. This requires a way of suppressing the automations from running, as well as detecting when the active instance has failed.

This starts off with a heartbeat automation that publishes the instance's IP address every 60 seconds:

- id: '1583253390883'
  alias: HASS:Heartbeat
  description: ''
  trigger:
  - platform: time_pattern
    seconds: '0'
  condition:
  - condition: state
    entity_id: binary_sensor.active
    state: 'on'
  action:
  - data:
      payload: "{{ states('sensor.local_ip') }}"
      topic: home/hass/active
    service: mqtt.publish

This works in concert with a binary_sensor that compares the published IP to the instance IP:

binary_sensor:
  - platform: template
    sensors:
      active:
        friendly_name: "Active IP"
        value_template: "{{ states('sensor.partner') == states('sensor.local_ip') or is_state('sensor.partner', 'OFFLINE') or is_state('sensor.partner', 'unavailable') }}"
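
Both the heartbeat and the template above read sensor.local_ip, which is not defined elsewhere on this page. One way to provide it, assuming the stock Local IP integration suits your setup, is sketched below. Older Home Assistant releases accepted this entry in configuration.yaml, while newer releases add the integration from the UI instead, so treat the YAML form as an assumption about your version.

```yaml
# configuration.yaml (sketch; assumes a release that still accepts YAML setup)
# Creates sensor.local_ip, holding this instance's IP address
local_ip:
```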

Each automation then includes a condition to check the state of this sensor:

  condition:
  - condition: state
    entity_id: binary_sensor.active
    state: 'on'
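
To show the condition in context, here is a minimal sketch of a complete automation. The id, alias, and entity names (binary_sensor.hallway_motion, light.hallway) are hypothetical and only serve the example.

```yaml
# Hypothetical automation; the entity names are placeholders
- id: 'example_hallway_light'
  alias: Hallway light on motion
  description: ''
  trigger:
  - platform: state
    entity_id: binary_sensor.hallway_motion
    to: 'on'
  condition:
  # Only the active instance gets past this check
  - condition: state
    entity_id: binary_sensor.active
    state: 'on'
  action:
  - service: light.turn_on
    entity_id: light.hallway
```

Both instances see the motion event, but only the one whose binary_sensor.active is on goes ahead and switches the light.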

Control can be manually transferred between instances by triggering the heartbeat automation on the standby instance. A manual trigger bypasses the automation's condition, so the standby immediately publishes its IP address, toggling the state of the binary sensor on each instance.
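
One way to make this a single service call is a small script wrapped around automation.trigger. This is a sketch: the script name take_control is made up, and the entity ID automation.hass_heartbeat assumes the HASS:Heartbeat alias is slugified in the usual way. Setting skip_condition explicitly makes sure the binary_sensor.active check is skipped, which is what lets a standby instance publish.

```yaml
script:
  # Hypothetical script name
  take_control:
    sequence:
    - service: automation.trigger
      data:
        # Assumed entity ID for the HASS:Heartbeat automation
        entity_id: automation.hass_heartbeat
        # Skip the binary_sensor.active condition so the standby can publish
        skip_condition: true
```

Running script.take_control on the standby instance should then have the same effect as triggering the automation by hand.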

Active/Standby State (Old method)

This is the previous method I used for transferring control between the two instances. It had the advantage of not needing a condition in each automation, but it could come unstuck if the automations failed to enable or disable correctly.

The currently active instance was recorded in the following MQTT topic:

home/hass/active

This topic is used to populate a sensor that records the currently active instance:

sensor:
  ..
  - platform: mqtt
    state_topic: "home/hass/active"
    name: partner
    expire_after: 65

The sensor will expire (its state becomes unavailable) if no message arrives within 65 seconds. Coupling the sensor with birth and will messages for Home Assistant allows an active instance to hand over control during shutdown.

mqtt:
  ..
  birth_message:
    topic: 'home/hass/active'
    payload: !secret hostname
    retain: 'false'
  will_message:
    topic: 'home/hass/active'
    payload: !secret partner
    retain: 'false'

Additionally, a heartbeat automation covers a sudden loss of service, such as a crash or an ungraceful shutdown.

automation:
..
- id: '1583253390883'
  alias: HASS:Heartbeat
  description: ''
  trigger:
  - platform: time_pattern
    seconds: '0'
  condition: []
  action:
  - data:
      payload: !secret hostname
      topic: home/hass/active
    service: mqtt.publish

This will tickle the heartbeat every 60 seconds, ensuring the active topic does not expire.

Automations

The main purpose of clustering Home Assistant is to allow either instance to take over the execution of automations. In general, however, each automation should only be executed by one instance at a time. There are two methods for achieving this:

Automation Conditions
Disabling/Enabling Automations

Automation Conditions

This is arguably the simplest method for controlling automations. Each automation should have a condition that checks the state of the active sensor. If the instance's hostname matches the active sensor, the automation should run.

..
  condition:
  - condition: state
    entity_id: sensor.active
    state: !secret hostname
..

This arrangement also allows for automations which should run regardless of whether the instance is active or not, or automations which run specifically when the instance is inactive. It does however require each automation to be modified with this condition, which may be onerous for an established setup.
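
As a sketch of the inactive-only case, the condition can compare the sensor against the partner's hostname instead of the instance's own. The id, alias, schedule, and notification text below are hypothetical; only the condition follows the pattern above.

```yaml
# Hypothetical standby-only automation
- id: 'example_standby_notice'
  alias: Nightly standby notice
  description: ''
  trigger:
  - platform: time
    at: '03:00:00'
  condition:
  # True while the partner, not this instance, holds the active role
  - condition: state
    entity_id: sensor.active
    state: !secret partner
  action:
  - service: persistent_notification.create
    data:
      message: This instance is currently running as standby.
```

An automation that should run on both instances simply omits the condition.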

Disabling/Enabling Automations

This method is a little more involved to implement, but does allow existing automations to be used unmodified. The first part of the puzzle is a script to create a group which contains all the current automations:

create_every_automation_group:
  sequence:
  - service: group.set
    data_template:
      object_id: every_automation
      entities: '{{ states.automation | map(attribute=''entity_id'') | join('','')}}'

With this group created, we can control the status of every automation with the automation.turn_on and automation.turn_off services. Two automations are used to switch Home Assistant into and out of the active state:

- id: '1582909860207'
  alias: HASS:Standby
  description: ''
  trigger:
  - entity_id: sensor.active
    platform: state
    to: !secret partner
  condition: []
  action:
  - data: {}
    service: script.create_every_automation_group
  - data: {}
    entity_id: group.every_automation
    service: automation.turn_off
  - data: {}
    entity_id: automation.hass_active
    service: automation.turn_on
- id: '1582910169504'
  alias: HASS:Active
  description: ''
  trigger:
  - entity_id: sensor.active
    from: !secret partner
    platform: state
  condition: []
  action:
  - data: {}
    service: script.create_every_automation_group
  - data: {}
    entity_id: group.every_automation
    service: automation.turn_on

Note that after the instance is moved into standby, the last action of the HASS:Standby automation is to re-enable the HASS:Active automation. This allows the standby instance to become active if its partner goes offline.

na.id.au