TSO LRO/GRO RSS Features

Overview

This guide documents Service Engine group features such as TSO, GRO, RSS, and multiple dispatchers and queues.

TCP Segmentation Offload (TSO)

TCP segmentation offload is used to reduce the CPU overhead of TCP/IP on fast networks. A host with TSO-enabled hardware sends TCP data to the NIC (Network Interface Card) without segmenting the data in software. This type of offload relies on the NIC to segment the data and then add the TCP, IP, and data link layer headers to each segment.
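
As a minimal illustration (not Avi code), the sketch below shows the segmentation step that TSO offloads to the NIC: without TSO, the host stack must split a large send into MSS-sized segments itself. The function name and sizes are purely illustrative.

```python
def segment_payload(payload: bytes, mss: int = 1460) -> list[bytes]:
    """Split an application payload into MSS-sized TCP segments.

    With TSO, the NIC performs this split (and builds the per-segment
    TCP/IP/link-layer headers) in hardware instead of the host stack.
    """
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

# Example: a 64 KB send becomes 45 segments of at most 1460 bytes each.
segments = segment_payload(b"x" * 65536)
print(len(segments), len(segments[0]), len(segments[-1]))  # 45 1460 1296
```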

TSO Support in Routing

When routing support is enabled on the SE, the GRO (Generic Receive Offload) feature normally cannot be used, because routing is stateless and the SE cannot re-segment a large GRO-coalesced packet when the packets are not allowed to be IP fragmented. With TSO support, GRO can be used for routed traffic, because the SE can split the larger coalesced packets back into smaller TCP segments, either through TSO on the interface (if supported) or through the routing layer in the SE.

During the three-way handshake, both the client and the server advertise their respective MSS (Maximum Segment Size) so that neither peer sends TCP segments larger than the advertised MSS. This feature is enabled by default.
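
As a worked example (illustrative only), the advertised MSS is typically derived from the interface MTU by subtracting the IP and TCP header sizes:

```python
# Illustrative MSS calculation for IPv4 with no IP/TCP options.
mtu = 1500          # interface MTU in bytes
ip_header = 20      # IPv4 header
tcp_header = 20     # TCP header
mss = mtu - ip_header - tcp_header
print(mss)          # 1460
```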

Generic Receive Offload (GRO)

Generic Receive Offload (GRO) is a software technique for increasing inbound throughput of high-bandwidth network connections by reducing CPU overhead. It works by aggregating multiple incoming packets from a single flow into a larger packet chain before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed.

The benefits of GRO are only seen if multiple packets for the same flow are received in a short span of time. If the incoming packets belong to different flows, then the benefits of having GRO enabled might not be seen.
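
The sketch below illustrates the idea behind GRO under simplified assumptions: packets are keyed by their flow 4-tuple, and payloads of the same flow received close together are coalesced so that the stack processes one larger unit instead of many small ones. The data structures and names are hypothetical and do not reflect the DPDK implementation.

```python
from collections import defaultdict

# Hypothetical receive burst: (flow_key, payload), where flow_key stands in
# for the (src_ip, src_port, dst_ip, dst_port) 4-tuple of a TCP flow.
burst = [
    (("10.0.0.1", 40000, "10.0.0.2", 443), b"a" * 1448),
    (("10.0.0.1", 40000, "10.0.0.2", 443), b"b" * 1448),
    (("10.0.0.3", 50000, "10.0.0.2", 443), b"c" * 1448),
    (("10.0.0.1", 40000, "10.0.0.2", 443), b"d" * 1448),
]

def coalesce(packets):
    """Merge payloads of the same flow so the stack sees fewer, larger packets."""
    flows = defaultdict(bytearray)
    for key, payload in packets:
        flows[key].extend(payload)
    return flows

merged = coalesce(burst)
print(len(burst), "->", len(merged))  # 4 packets delivered as 2 coalesced packets
```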

The following are the two modes of GRO operation in the DPDK library:

Static mode: The packets received in a single burst are subjected to the GRO layer. This is the mode of operation for NSX Advanced Load Balancer versions prior to 22.1.1.

Timer mode: The packets received are subjected to the GRO layer for a configured timeout. Special packets, such as SYN, PSH, and RST, are not subjected to this timeout.

Starting with NSX Advanced Load Balancer version 22.1.1, GRO is based on timer mode. Timer mode GRO can deliver better performance and lower CPU utilization than static mode GRO. The timeout value can be configured on the ServiceEngineGroup under dpdk_gro_timeout_interval; the default GRO timeout value for a new Service Engine Group is 50 µs. The value of this knob can range from 0 to 900 µs. When dpdk_gro_timeout_interval is configured to zero, the SEs in the Service Engine Group revert to the static mode of GRO. For SEs upgraded to 22.1.1, dpdk_gro_timeout_interval will be zero, implying burst (static) mode GRO, which is the legacy mode of operation.
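
As a hedged configuration sketch (not an official procedure), the timeout could be set on the Service Engine Group through the Controller REST API. The Controller address, credentials, and UUID below are placeholders, and details such as the API version and CSRF headers are omitted; only the dpdk_gro_timeout_interval field comes from this guide.

```python
import requests

CONTROLLER = "https://controller.example.com"   # placeholder Controller address
SEG_UUID = "serviceenginegroup-<uuid>"          # placeholder SE group UUID

session = requests.Session()
session.verify = False  # lab-only; use proper certificates in production

# Authenticate to the Controller (simplified; real sessions also need the
# API version and CSRF headers).
session.post(f"{CONTROLLER}/login",
             json={"username": "admin", "password": "password"}).raise_for_status()

# Set the GRO timeout to 100 microseconds; 0 reverts the SEs to static mode.
patch_body = {"replace": {"dpdk_gro_timeout_interval": 100}}
resp = session.patch(f"{CONTROLLER}/api/serviceenginegroup/{SEG_UUID}",
                     json=patch_body)
resp.raise_for_status()
```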

Starting with NSX Advanced Load Balancer version 22.1.2, GRO is enabled if the SEs in the SE group have 8 or more vCPUs.

Multi-Queue Support

The dispatcher on Avi Vantage is responsible for fetching incoming packets from a NIC, sending them to the appropriate core for proxy work, and sending outgoing packets back to the NIC. A 40G NIC, or even a 10G NIC, receiving traffic at a high packets-per-second (PPS) rate (for instance, small UDP packets) might not be processed efficiently by a single-core dispatcher. This problem can be solved by distributing traffic from a single physical NIC across multiple queues, where each queue is processed by a dispatcher on a different core. Receive Side Scaling (RSS) enables the use of multiple queues on a single physical NIC.

Large Receive Offload (LRO)

Large Receive Offload (LRO) is a hardware technique for increasing the inbound throughput of high-bandwidth network connections by reducing CPU overhead. Incoming packets are merged at reception time so that the packet processing unit sees far fewer of them. This merging can be done either in the hardware or in the driver; even LRO emulation in the driver provides performance benefits.

LRO is much more aggressive than GRO and can combine packets in a lossy fashion (discarding important header data), whereas GRO is more restrictive. In particular, LRO is known to be problematic in environments with routing and/or forwarding, which are common in virtualization setups.

LRO is supported only in vCenter and NSX-T environments. LRO is validated on NSX-T (ENS mode).

For more details on the LRO routing use case and LRO configuration, refer to the TSO LRO/GRO RSS Configuration Guide.

Receive Side Scaling (RSS)

When RSS is enabled on Avi Vantage, NICs make use of multiple queues in the receive path. The NIC pins flows to queues, placing packets that belong to the same flow in the same queue. This helps the driver spread packet processing across multiple CPUs, thereby improving efficiency. On an Avi SE, the multi-queue feature is also enabled on the transmit side; that is, different flows are pinned to different queues (with packets belonging to the same flow in the same queue) to distribute packet processing among CPUs.

Note: The multi-queue feature (RSS) is not supported along with IPv6 addresses. If RSS is enabled, an IPv6 address cannot be configured on Avi Service Engine interfaces. Similarly, if an IPv6 address is already configured on Avi Service Engine interfaces, the multi-queue feature (RSS) cannot be enabled on those interfaces.
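
The sketch below illustrates the general idea behind flow-to-queue pinning under simplified assumptions: a hash over the flow 4-tuple selects the receive queue, so every packet of a flow lands on the same queue (and therefore the same CPU). Real NICs use a Toeplitz-style hash and an indirection table; the CRC32 hash here is only a stand-in.

```python
import zlib

NUM_QUEUES = 4  # illustrative number of queues on one physical NIC

def rss_queue(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> int:
    """Pick a receive queue from the flow 4-tuple (stand-in for the NIC hash)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % NUM_QUEUES

# Packets of the same flow always hash to the same queue / dispatcher core.
print(rss_queue("192.0.2.10", 51000, "198.51.100.5", 443))
print(rss_queue("192.0.2.10", 51000, "198.51.100.5", 443))  # same queue as above
print(rss_queue("192.0.2.11", 52000, "198.51.100.5", 443))  # may land elsewhere
```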

Multiple Dispatcher and Queues per NIC

Depending on the traffic processed by the Avi Service Engine, the dispatcher can be configured to use one or more cores. Systems with a high PPS load are configured with a higher number of dispatchers, whereas proxy-heavy loads such as SSL workloads may not need a high number of dispatchers.

Additionally, the number of queues per NIC can be set for each dispatcher core for better performance. The Avi Service Engine tries to detect the best settings automatically for each environment.
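
As an illustrative sketch only (values are examples, not recommendations), these knobs could be set on the Service Engine Group with the same PATCH pattern shown in the GRO section; setting them to 0 leaves the automatic detection in place, as described in the Auto RSS section below.

```python
# Example Service Engine Group fields; apply via a PATCH to
# /api/serviceenginegroup/<uuid> as in the earlier sketch.
seg_dispatcher_settings = {
    "num_dispatcher_cores": 2,  # dispatcher cores per SE (0 = automatic)
    "max_queues_per_vnic": 4,   # queues per NIC (0 = automatic)
}
print(seg_dispatcher_settings)
```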

Service Engine Datapath Isolation Mode

Avi Service Engines can dedicate one or more Service Engine cores to non se-dp tasks. This configuration is particularly helpful when Service Engines host latency-sensitive applications. However, it imposes a penalty on overall Service Engine performance, as one or more cores are dedicated to non se-dp tasks.
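
A minimal sketch of how this could be expressed on the Service Engine Group follows. The field names se_dp_isolation and se_dp_isolation_num_non_dp_cpus are assumptions made for illustration and should be verified against your release.

```python
# Assumed field names, for illustration only; verify against your release.
seg_dp_isolation_settings = {
    "se_dp_isolation": True,               # reserve cores for non se-dp tasks
    "se_dp_isolation_num_non_dp_cpus": 1,  # cores kept away from se-dp work
}
print(seg_dp_isolation_settings)
```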

Hybrid RSS Mode

The SE hybrid RSS mode works only in DPDK mode with RSS configured. It allows each SE vCPU to function as an independent unit: every core handles both the dispatch and proxy jobs, and cross-core punting of packets is disallowed. For example, on a 2-core Service Engine with vCPU0 and vCPU1 tagged as (dispatcher-0, proxy-0) and (dispatcher-1, proxy-1) respectively, any ingress flow on dispatcher-0 is egressed via proxy-0 and is not punted to proxy-1, and vice versa.

Hybrid mode is exposed as a configurable property and aims to achieve higher performance on low-core SEs, especially 1-core and 2-core SEs on vCenter/NSX-T clouds.

Auto RSS for Public Cloud

The network bandwidth (capacity) provisioned for virtual machines on public clouds depends on the instance type, not on the aggregate network bandwidth of the attached network interfaces. In auto mode, the Service Engine determines the published network capacity of its instance type and configures RSS, that is, max_queues_per_vnic and num_dispatcher_cores, appropriately. The administrator can also configure these values manually; updating these knobs requires a reboot.

Depending on the traffic profile, dedicated dispatcher mode can also be enabled. This is a runtime property and can be toggled using the dedicated_dispatcher_core (boolean) knob.
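
For example (illustrative values only, applied with the same PATCH pattern as earlier), auto RSS sizing can be combined with the dedicated dispatcher toggle; only the field names come from this guide, and 0 is assumed to select auto mode as described under Upgrade Considerations.

```python
# Illustrative Service Engine Group settings for a public-cloud SE.
seg_public_cloud_settings = {
    "max_queues_per_vnic": 0,           # 0 = auto, sized from instance-type capacity
    "num_dispatcher_cores": 0,          # 0 = auto
    "dedicated_dispatcher_core": True,  # runtime toggle, based on traffic profile
}
print(seg_public_cloud_settings)
```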

Upgrade Considerations

For NSX Advanced Load Balancer releases later than 22.1.1, you can configure the Auto-RSS feature to auto (0) for new SE Groups.

Note: Previous configurations will be preserved after upgrade.

Document Revision History

Date Change Summary
July 15, 2022 Updated Generic Receive Offload (GRO) section for 22.1.1