Service Engine Self Election

Starting with release 18.1.2, Service Engines support a self election process for virtual services that are scaled out active/active in an L2, ARP-based learning environment.

Supported Clouds:

  • VMware
  • LSC
  • Baremetal
  • No Access

Prior to version 18.1.2, the Avi Controller was the only source for determining the primary Service Engine for an application virtual service. If communication fails between the Service Engines and the Controller cluster, the Service Engines continue to operate in a headless manner without interruption to data plane traffic. However, while headless, if a Service Engine fails there is no Controller available to decide which other Service Engine should take over as primary for the affected virtual services.

With the introduction of the self election process, the Service Engines themselves can now make deterministic decisions about which Service Engine will take over the primary function for scaled out virtual services hosted on a failed Service Engine while operating in a headless manner.

The Controller cluster is still the first decision maker for virtual service placement and Service Engine priority, but with the self election process the Service Engines are now capable of making that decision themselves when the Controller cluster is unavailable.

Note: Only scaled out virtual services deployed in an active/active or N+M service group are supported.

How it Works

On enabling the self election feature for a Service Engine group, the Controller sends a versioned, ordered list of Service Engines to all the Service Engines hosting a scaled out virtual service. The first Service Engine in the list is the primary, the second is the first backup, the third is the second backup, and so on. Any new Service Engine that the application is scaled out to is always added to the end of the list, and the list version is bumped.
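
The list can be thought of as a small versioned structure. The following Python sketch is only an illustration of that behavior; the class and field names are assumptions, not Avi internals:

from dataclasses import dataclass, field
from typing import List

@dataclass
class SeList:
    version: int = 1
    members: List[str] = field(default_factory=list)  # index 0 = primary, 1 = first backup, ...

    def scale_out(self, se_name: str) -> None:
        # A newly added Service Engine always goes to the end of the list,
        # and the list version is bumped.
        self.members.append(se_name)
        self.version += 1

se_list = SeList(members=["se-1", "se-2"])  # se-1 is primary, se-2 is the first backup
se_list.scale_out("se-3")                   # se-3 appended; version becomes 2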

The self election process takes effect only while the Service Engines are operating in a headless manner. By default, control plane heartbeats between the Service Engines and the Controller time out after 12 seconds, with a miss limit of 6, giving a headless detection time of 72 seconds.

> show serviceengineproperties

|   controller_heartbeat_timeout_sec              | 12 sec                          |
|   controller_heartbeat_miss_limit               | 6                               |

Service Engines use data plane heartbeat messages to check the availability of the other Service Engines. The primary Service Engine also includes the current version of its running Service Engine list in the heartbeats it sends to all the other Service Engines. By default, data plane heartbeat messages are sent every 100 ms with a failure timeout count of 10, giving a failure detection time of 1 second.

> show serviceengineproperties

|   dp_hb_frequency                               | 100 milliseconds                |
|   dp_hb_timeout_count                           | 10                              |
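
The detection times quoted above are simply the product of the heartbeat interval and the miss or timeout count. A quick arithmetic check in Python, using the default values shown in the CLI output:

# Control plane heartbeats (Service Engine <-> Controller)
controller_heartbeat_timeout_sec = 12
controller_heartbeat_miss_limit = 6
print(controller_heartbeat_timeout_sec * controller_heartbeat_miss_limit)  # 72 seconds before the SEs go headless

# Data plane heartbeats (Service Engine <-> Service Engine)
dp_hb_frequency_ms = 100
dp_hb_timeout_count = 10
print(dp_hb_frequency_ms * dp_hb_timeout_count)  # 1000 ms (1 second) to mark a peer SE down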

When the primary Service Engine for a virtual service is marked down because of heartbeat failures, the next Service Engine in the ordered list becomes the primary and sends out a gratuitous ARP (GARP) for the virtual service IP.

In a failure recovery scenario, if any two Service Engines think they are primary, the Service Engine that holds the lowest index in the highest-version ordered list will take over as the lone primary.
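
A minimal sketch of the takeover and tie-break logic described above, assuming each Service Engine knows which peers it can still reach and which version of the list each claimant holds. This is an interpretation of the rule for illustration, not Avi's actual implementation:

from typing import Dict, List, Tuple

def next_primary(ordered_list: List[str], reachable: set) -> str:
    # On primary failure, the first reachable Service Engine in the
    # ordered list takes over as primary and GARPs the virtual service IP.
    for se in ordered_list:
        if se in reachable:
            return se
    raise RuntimeError("no Service Engine available for this virtual service")

def resolve_conflict(claimants: Dict[str, Tuple[int, List[str]]]) -> str:
    # claimants maps an SE name to the (version, ordered list) it holds.
    # The highest-version list is authoritative; among the SEs claiming
    # primary, the one with the lowest index in that list wins.
    _, best_list = max(claimants.values(), key=lambda v: v[0])
    return min((se for se in claimants if se in best_list), key=best_list.index)

# Example: two SEs claim primary while holding the same version-1 list,
# so se-1 (lowest index) stays the lone primary.
claimants = {"se-1": (1, ["se-1", "se-2"]), "se-2": (1, ["se-1", "se-2"])}
assert resolve_conflict(claimants) == "se-1"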

Failure Scenarios

For the following scenarios, we will discuss the expected behavior during the failure and the subsequent recovery:

  • Headless: Service Engine 2 failure
  • Headless: Service Engine 1 failure
  • Headless: Service Engine 2 upstream switch failure
  • Headless: Service Engine 1 upstream switch failure

Figure 1 lays out the logical topology that will be used for the scenarios.

Figure 1. Scenario Topology

Headless: Service Engine 2 Failure

In this scenario a virtual service is deployed active/active across the two Service Engines. The two Service Engines have already detected the Controller cluster is unavailable and are operating in a headless mode.

Figure 2. Headless: Service Engine 2 Failure
Failure - Expected Behavior
  • Service Engine 1 detects Service Engine 2’s heartbeat failure
  • Service Engine 1 will stop forwarding flows to Service Engine 2
  • Virtual service remains primary on Service Engine 1
Recovery - Expected Behavior
  • Service Engine 2 is unable to register with the unavailable Controller, so it comes up stateless
  • Service Engine 2 will ignore data plane heartbeats from Service Engine 1 because it is stateless
  • Virtual service remains primary and placed only on Service Engine 1

Headless: Service Engine 1 Failure

In this scenario, a virtual service is deployed active/active across the two Service Engines. The two Service Engines have already detected the Controller cluster is unavailable and are operating in a headless mode.

Figure 3. Headless: Service Engine 1 Failure
Failure - Expected Behavior
  • Service Engine 2 detects Service Engine 1’s heartbeat failure.
  • Service Engine 2 will take over as primary and GARP the virtual service IP.
  • Virtual service becomes primary on Service Engine 2.
Recovery - Expected Behavior
  • Service Engine 1 is unable to register with the unavailable Controller, so it comes up stateless.
  • Service Engine 1 will ignore data plane heartbeats from Service Engine 2 because it is stateless.
  • Virtual service remains primary and placed only on Service Engine 2.

Headless: Service Engine 2 Upstream Switch Failure

In this scenario, a virtual service is deployed active/active across the two Service Engines. The two Service Engines have already detected that the Controller cluster is unavailable and are operating in a headless mode.

Figure 4. Headless: Service Engine 2 upstream switch failure
Failure - Expected Behavior
  • Service Engine 1 detects Service Engine 2’s heartbeat failure.
  • Service Engine 1 will stop forwarding flows to Service Engine 2.
  • Service Engine 1 remains primary.
  • Service Engine 2 detects Service Engine 1 heartbeat failure.
  • Service Engine 2 will also take over as primary and GARP the virtual service IP (should be no harm as Service Engine 2 is isolated from the network).
Recovery - Expected Behavior
  • Service Engine 1 detects heartbeats from Service Engine 2 and starts forwarding flows.
  • Service Engine 1 remains primary.
  • Service Engine 2 also detects heartbeats from Service Engine 1, relinquishes primary to Service Engine 1.
  • There could be intermittent ARP flapping while Service Engine 2 relinquishes primary.
  • Virtual service deployed active/active across both Service Engines.

Headless: Service Engine 1 Upstream Switch Failure

In this scenario, a virtual service is deployed active/active across the two Service Engines. The two Service Engines have already detected that the Controller cluster is unavailable and are operating in a headless mode.

Figure 5. Headless: Service Engine 1 upstream switch failure
Failure - Expected Behavior
  • Service Engine 1 detects Service Engine 2’s heartbeat failure.
  • Service Engine 1 will stop forwarding flows to Service Engine 2.
  • Service Engine 1 remains primary.
  • Service Engine 2 detects Service Engine 1’s heartbeat failure.
  • Service Engine 2 will also take over as primary and GARP the virtual service IP (should be no harm as Service Engine 1 is isolated from the network).
Recovery - Expected Behavior
  • Service Engine 1 detects heartbeats from Service Engine 2 and starts forwarding flows.
  • Service Engine 1 remains primary.
  • Service Engine 2 also detects heartbeats from Service Engine 1, relinquishes primary to Service Engine 1.
  • There could be intermittent ARP flapping while Service Engine 2 relinquishes primary.
  • Virtual service deployed active/active across both Service Engines.

Additional Failure Scenarios

For these scenarios, we will be adding another Service Engine to the topology. Figure 6 lays out the logical topology that will be used for these scenarios.

  • Multi Failure: Controller, Service Engine 1 and Service Engine 2 upstream switch failure
  • Multi Failure: Service Engine 1 and Service Engine 2 upstream switch failures

Figure 6. Additional Scenario Topology

Multi Failure: Controller, Service Engine 1 and Service Engine 2 Upstream Switch Failures

In this scenario, a virtual service is scaled out active/active/active across the three Service Engines. The deployment begins in a healthy state with successful communication between the Controller and all the Service Engines. Multiple failures occur simultaneously resulting in the Controller, Service Engine 1 and Service Engine 2 becoming unavailable on the network.

Note: Service Engines must be operating in a headless manner before self election becomes active.

Figure 7. Multi Failure: Controller, Service Engine 1 and Service Engine 2 upstream switch failures
Failure - Expected Behavior
  • Service Engines 1, 2, and 3 all detect control plane heartbeat failures from the Controller.
  • The Service Engines go headless after the Controller heartbeat timer expires (72 seconds).
  • Service Engine 1 detects Service Engine 2 and Service Engine 3 heartbeat failures.
  • Service Engine 1 will stop forwarding flows to Service Engine 2 and Service Engine 3.
  • Service Engine 1 remains primary.
  • Service Engine 2 detects Service Engine 1 heartbeat failure.
  • Service Engine 2 will also take over as primary and GARP the virtual service IP (should be no harm as Service Engine 1 and Service Engine 2 are both isolated from the network).
  • Service Engine 3 detects Service Engine 1 heartbeat failure, then detects Service Engine 2 heartbeat failure.
  • Service Engine 3 will also take over as primary and GARP the virtual service IP (should be no harm as Service Engine 1 and Service Engine 2 are both isolated from the network).
Recovery - Expected Behavior
  • Service Engine 1 detects heartbeats from both Service Engine 2 and Service Engine 3 and starts forwarding flows.
  • Service Engine 1 remains primary.
  • Service Engine 2 also detects heartbeats from Service Engine 1, relinquishes primary to Service Engine 1.
  • Service Engine 3 also detects heartbeats from Service Engine 1, relinquishes primary to Service Engine 1.
  • There could be intermittent ARP flapping while Service Engine 2 and Service Engine 3 relinquish primary.
  • Virtual service deployed active/active/active across all 3 Service Engines.

Multi Failure: Service Engine 1 and Service Engine 2 Upstream Switch Failures

In this scenario, a virtual service is scaled out active/active/active across the three Service Engines. The deployment begins in a healthy state with successful communication between the Controller and all the Service Engines. Multiple failures occur simultaneously, resulting in Service Engine 1 and Service Engine 2 becoming unavailable on the network.

Figure 8. Multi Failure: Service Engine 1 and Service Engine 2 upstream switch failures
Failure - Expected Behavior
  • Service Engine 1 detects Service Engine 2 and Service Engine 3 heartbeat failures.
  • Service Engine 1 will stop forwarding flows to Service Engine 2 and Service Engine 3.
  • Service Engine 1 remains primary.
  • Service Engine 2 detects Service Engine 1 heartbeat failure.
  • Service Engine 2 will also take over as primary and GARP the virtual service IP (should be no harm as Service Engine 1 and Service Engine 2 are both isolated from the network).
  • Service Engine 3 detects Service Engine 1 heartbeat failure, then detects Service Engine 2 heartbeat failure.
  • Service Engine 3 will also take over as primary and GARP the virtual service IP (should be no harm as Service Engine 1 and Service Engine 2 are both isolated from the network).
  • Controller sends a new SE list to Service Engine 3 with an updated version.
Recovery - Expected Behavior
  • Service Engine 3 detects heartbeats from both Service Engine 1 and Service Engine 2 and exchanges its higher-version SE list.
  • Service Engine 1 and Service Engine 2 detect the higher-version list from Service Engine 3 and relinquish primary (a worked example follows this list).
  • Service Engine 3 remains primary.
  • There could be intermittent ARP flapping while Service Engine 1 and Service Engine 2 relinquish primary.
  • Virtual service deployed active/active/active across all 3 Service Engines.
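
A worked sketch of this recovery in Python, encoding the rule as "highest version first, then lowest index." The exact contents of the updated list are an assumption made for illustration:

# se-1 and se-2 still hold the stale version-1 list; se-3 holds the
# version-2 list the Controller pushed after re-placing the virtual service.
claims = {
    "se-1": {"version": 1, "index": 0},
    "se-2": {"version": 1, "index": 1},
    "se-3": {"version": 2, "index": 0},  # first in the newer, authoritative list
}
# Highest version wins; the lowest index breaks ties.
winner = min(claims, key=lambda se: (-claims[se]["version"], claims[se]["index"]))
assert winner == "se-3"  # se-3 remains the lone primary; se-1 and se-2 relinquish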

Self Election Configuration

The self election process is disabled by default. Currently, this configuration is available only via the CLI or API.
To enable the process, set the self_se_election parameter to True within the serviceenginegroup configuration:


configure serviceenginegroup Default-Group
self_se_election
Overwriting the previously entered value for self_se_election
save

+---------------------------------------+---------------------------------------------------------+
| Field                                 | Value                                                   |
+---------------------------------------+---------------------------------------------------------+

truncated .....


| max_public_ips_per_lb                 | 30                                                      |
| waf_learning_memory                   | 0 mb                                                    |
| waf_learning_interval                 | 10 min                                                  |
| self_se_election                      | True                                                    |
| minimum_connection_memory             | 20 percent                                              |
| shm_minimum_config_memory             | 4 mb                                                    |
| heap_minimum_config_memory            | 8 mb                                                    |
| disable_se_memory_check               | True                                                    |
| memory_for_config_update              | 15 percent                                              |
+---------------------------------------+---------------------------------------------------------+
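
For reference, the same flag can also be set through the Controller's REST API. The Python sketch below uses the requests library; the endpoint paths, header names, and login flow follow the common Avi API pattern but are assumptions here and should be verified against the API documentation for your Controller version:

import requests

CONTROLLER = "https://controller.example.com"   # hypothetical Controller address
API_VERSION = "18.1.2"

session = requests.Session()
session.verify = False                          # lab convenience only; use valid certificates in production
session.post(f"{CONTROLLER}/login", json={"username": "admin", "password": "password"})
session.headers.update({
    "X-Avi-Version": API_VERSION,
    "X-CSRFToken": session.cookies.get("csrftoken", ""),
    "Referer": CONTROLLER,
})

# Fetch the Service Engine group, enable self election, and write it back.
seg = session.get(f"{CONTROLLER}/api/serviceenginegroup?name=Default-Group").json()["results"][0]
seg["self_se_election"] = True
session.put(f"{CONTROLLER}/api/serviceenginegroup/{seg['uuid']}", json=seg)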