Avi Vantage High Availability in VMware vCenter Environment

Overview

This guide provides details on high availability considerations for the Avi Controller and Service Engines deployed in an environment with VMware vSphere High Availability (HA) and Distributed Resource Scheduler (DRS) features enabled, or when a live vMotion is performed.

About VMware vSphere High Availability

VMware vSphere High Availability delivers the availability required by most applications running in virtual machines, independent of the operating system and applications running in them. High Availability provides uniform, cost-effective failover protection against hardware and operating system outages within your virtualized IT environment.

Refer to High Availability for more details on High Availability.

About VMware vSphere Distributed Resource Scheduler

VMware DRS allows the grouping of hosts into resource clusters to separate the computing needs of different business units. VMware vSphere clusters allow you to:

  • Provide highly available resources to your workloads.
  • Balance workloads for optimal performance.
  • Scale and manage computing resources without service disruption.

Refer to Distributed Resource Scheduler, Distributed Power Management for more details on DRS-DPM.

About VMware vSphere vMotion

VMware vSphere vMotion is a zero-downtime live migration of workloads from one server to another. During the workload migration, the application keeps running, and users continue to have access to the systems they need.

Refer to vSphere vMotion for more details on vMotion.

Note: Deployments should adhere to recommendations provided by VMware for configuring VMware HA, DRS, and vSAN features.

Deploying Avi Vantage in VMware HA Enabled Hosts

Avi Vantage supports Controller clusters and Service Engines when deployed on hosts with VMware HA enabled.

Deployment Prerequisites

If a VMware Cluster is configured with vSphere HA and enabled with dedicated failover hosts, then the same set of hosts should be configured in the Service Engine Group properties in the Host Exclude List field.
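The prerequisite above can be checked before deployment. The following is a minimal sketch, assuming the dedicated failover hosts and the Service Engine Group's Host Exclude List have already been exported as plain lists of host names. The function name and the sample host names are hypothetical and are not part of any Avi or VMware API.

```python
def missing_from_exclude_list(failover_hosts, host_exclude_list):
    """Return dedicated failover hosts that are absent from the Host Exclude List.

    Both arguments are plain iterables of host names (an assumption of this
    sketch). Comparison ignores case and surrounding whitespace.
    """
    excluded = {h.strip().lower() for h in host_exclude_list}
    return sorted(h for h in failover_hosts if h.strip().lower() not in excluded)


# Hypothetical sample data, for illustration only.
failover = ["esxi-05.example.com", "esxi-06.example.com"]
exclude = ["ESXi-05.example.com"]
print(missing_from_exclude_list(failover, exclude))
```

An empty result means every dedicated failover host is already excluded from Service Engine placement.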

Observations when a Service Engine is marked down due to a Host Failure

The following observations and considerations apply when Avi Vantage is integrated with vCenter and a Service Engine is marked down due to a host failure:

Existing Service Engines are available in the Service Engine Group and have capacity

When existing Service Engines have capacity to host the virtual services after an SE failure, the Avi Controller places the virtual services on those SEs immediately. This is standard Avi behavior, irrespective of the VMware HA configuration.

Service Engines in the Service Engine Group do not have capacity

When the Controller is unable to deploy another Service Engine as the SE Group’s maximum SE capacity has been reached and no existing SEs have additional capacity, vSphere HA will restart the failed Service Engine VMs on a different ESXi host. As per Avi’s test qualifications, this operation takes 2-3 minutes.

When the Controller needs to deploy another Service Engine (in write access mode) as the existing SEs do not have capacity, the Controller will initiate deployment of a new Service Engine VM. As per Avi’s test qualifications, this operation takes 4-5 minutes.
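The three host-failure scenarios above can be summarized as a small decision function. This is only an illustrative sketch: the function name, its two boolean parameters, and the returned strings are hypothetical, and the times are the figures quoted in this guide from Avi's test qualifications, not guarantees.

```python
def host_failure_recovery(existing_se_has_capacity, se_group_at_max):
    """Map the SE group state to the recovery path described in this guide.

    Returns a (recovery_path, approximate_time) tuple.
    """
    if existing_se_has_capacity:
        # Standard Avi behavior, independent of the VMware HA configuration.
        return ("Controller places virtual services on existing SEs", "immediate")
    if se_group_at_max:
        # The Controller cannot create another SE, so vSphere HA recovers the VM.
        return ("vSphere HA restarts the failed SE VM on another ESXi host", "2-3 minutes")
    # The Controller is in write access mode and may still create an SE.
    return ("Controller deploys a new SE VM", "4-5 minutes")


for state in [(True, False), (False, True), (False, False)]:
    print(state, "->", host_failure_recovery(*state))
```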

Service Engine workloads running on VMware DRS enabled cluster

The vCenter administrator needs to ensure that every ESXi host that is part of the VMware DRS enabled cluster and will host Service Engine workloads has the following configuration:

  • vSwitch0 is created with no physical adapters attached (internal-only switch).

  • vSwitch0 has a Virtual Machine Port Group for a Standard Switch created with the name set to Avi Internal.
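The two host requirements above lend themselves to a simple pre-check. The sketch below assumes the host networking inventory has already been collected into plain dictionaries; the dictionary layout and the function name are hypothetical and do not correspond to any vCenter or Avi API.

```python
REQUIRED_VSWITCH = "vSwitch0"
REQUIRED_PORT_GROUP = "Avi Internal"


def check_host(host):
    """Return a list of problems found in one host description.

    `host` is a hypothetical dict of the form:
    {"name": "...",
     "vswitches": {"vSwitch0": {"uplinks": [], "port_groups": ["Avi Internal"]}}}
    """
    vswitch = host.get("vswitches", {}).get(REQUIRED_VSWITCH)
    if vswitch is None:
        return [f"{REQUIRED_VSWITCH} is missing"]
    problems = []
    if vswitch.get("uplinks"):
        # The switch must be internal only: no physical adapters attached.
        problems.append(f"{REQUIRED_VSWITCH} has physical adapters attached")
    if REQUIRED_PORT_GROUP not in vswitch.get("port_groups", []):
        problems.append(f"port group '{REQUIRED_PORT_GROUP}' is missing")
    return problems


# Hypothetical sample host, for illustration only.
sample = {"name": "esxi-01", "vswitches": {"vSwitch0": {"uplinks": ["vmnic0"], "port_groups": []}}}
print(check_host(sample))
```

A host that satisfies both requirements yields an empty list.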

Live vMotion Migrations

The behavior with proactive vMotion migration of Controllers and Service Engines is the same as described in the section Deploying Avi Vantage in vSphere DRS Enabled.

vSphere HA for SE

The operational availability of the virtual service (VS) and the time taken for switchover in case of a vSphere HA failover for an SE are as follows:

VS Slot Availability | VS Placement/Switchover Time | VS Recovery by Avi or vSphere HA
VS slots are available in another SE (VS is already scaled out). | The VS switches over to the available SE. The time it takes for the switchover is as mentioned in the Service Engine Failure Detection guide. | The VS recovers before vSphere HA brings up the SE.
VS slots are available in another SE (VS is not scaled out). | The VS switch to the other SE takes around a minute (based on the default Controller-SE detection time). | The VS recovers before vSphere HA brings up the SE.
VS slots are NOT available in other SEs in the SE Group. | Detection time is the same as in the first case. The Controller spins up a new SE, based on SE group capacity. The VS will be placed on the new SE. | The VS is placed on the new SE; the approximate time is 3 to 5 minutes.

Deploying Avi Vantage in vSphere DRS Enabled

Avi Controller Clusters

Avi Vantage supports the Controller clusters when deployed on hosts with vSphere DRS enabled.

vSphere DRS ensures that the Controller VMs remain available during the vMotion of a Controller VM node.

Note: During the live vMotion window, there may be a momentary loss of real-time metrics data and added latency on API calls to the Controller.

Avi Service Engines

Avi Vantage supports Service Engines when deployed on hosts with vSphere DRS enabled.

Deployment Prerequisites

Use of the Level-5 (Aggressive) migration level is not recommended, as vMotion of Service Engines can be triggered by high CPU alarms in the normal course of operation. For more details on high CPU utilization for Avi Service Engines as reported by the hypervisor, refer to Why does a hypervisor or a host report high CPU utilization of Avi Service Engine?

Impact on Application (Data-Plane) Traffic

Note: The following data are based on Avi Vantage’s qualification in a representative test topology.

  • vMotion results in data-plane reconfiguration at the hypervisor level. Avi Vantage’s observations with various application protocols are as follows:

    • In case of TCP based applications, the TCP protocol’s retry mechanism resolves any lost packets, without impacting the application. No traffic loss was observed.

    • In case of UDP or ICMP based applications, traffic may fail during a short time window. Traffic loss lasting 15-30 ms was observed.

  • During vMotion, the Avi Controllers and other Service Engines may register a momentary loss of data-plane heartbeats to the Service Engine being vMotioned. This shows up as the following events, generated in succession and visible on the Controller:

    • SE_DP_HB_FAILED, followed by

    • SE_DP_HB_RECOVERED

This sequence of events can be ignored when it has occurred during the time of vMotion and has subsequently recovered.
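The rule above can be applied to an exported event list. The following is a minimal sketch, assuming events are available as (timestamp, event ID, SE name) tuples sorted by time and that the vMotion start and end times are known. The tuple layout and the function name are hypothetical; only the two event IDs come from this guide.

```python
from datetime import datetime, timedelta


def ignorable_heartbeat_pairs(events, vmotion_start, vmotion_end):
    """Find SE_DP_HB_FAILED events inside the vMotion window that later recovered.

    `events` is a time-sorted list of (timestamp, event_id, se_name) tuples.
    Returns a list of (se_name, failed_at, recovered_at) tuples.
    """
    pending = {}  # se_name -> timestamp of a failure seen inside the window
    pairs = []
    for ts, event_id, se_name in events:
        if event_id == "SE_DP_HB_FAILED":
            if vmotion_start <= ts <= vmotion_end:
                pending[se_name] = ts
        elif event_id == "SE_DP_HB_RECOVERED" and se_name in pending:
            pairs.append((se_name, pending.pop(se_name), ts))
    return pairs


# Hypothetical sample events, for illustration only.
start = datetime(2024, 1, 1, 12, 0, 0)
sample = [
    (start + timedelta(seconds=5), "SE_DP_HB_FAILED", "se-1"),
    (start + timedelta(seconds=6), "SE_DP_HB_RECOVERED", "se-1"),
]
print(ignorable_heartbeat_pairs(sample, start, start + timedelta(seconds=30)))
```

A failure that falls outside the window, or one that never recovers, is not returned and still needs investigation.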