I read EVPN in the Data Center, so I’m leaving some notes here. You can download the PDF for free from the NVIDIA page by registering your email address and some other information. These are just personal notes taken while reading, so they may contain errors.

Introduction

How should we deploy applications that assume L2 connectivity on an L3 network built as a Clos topology? For example, applications that use L2 multicast or broadcast for health monitoring and member discovery fall into this category. Ethernet VPN (EVPN) solves this problem by providing a virtual L2 network as an overlay on top of the L3 network, with Border Gateway Protocol (BGP) serving as the EVPN control plane. While the combination of EVPN and Multiprotocol Label Switching (MPLS) is a mature technology, it can now also be applied to Virtual Extensible LAN (VXLAN); in short, EVPN can be viewed as a new approach to controller-based VXLAN. Since EVPN has its origins in the service-provider world, it comes with many terms that are unfamiliar from a data center networking perspective, which makes it hard to approach. The book explains the concepts through configuration examples based on FRR, an open-source routing suite.

Network Virtualization

In a virtual network, each user (or tenant) can use the network as if the other tenants did not exist. Which virtual network a packet belongs to is usually determined by a Virtual Network Identifier (VNI) in the packet header. VLANs, L3VPNs, and VXLANs fall into this category.
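To make the VNI concrete, here is a small sketch that extracts the 24-bit VNI from a VXLAN header, following the 8-byte layout defined in RFC 7348. The parsing function is my own illustration, not code from the book.

```python
def parse_vxlan_header(header: bytes) -> int:
    """Extract the 24-bit VNI from an 8-byte VXLAN header.

    Layout (RFC 7348): flags (1 byte), reserved (3 bytes),
    VNI (3 bytes), reserved (1 byte).
    """
    if len(header) != 8:
        raise ValueError("VXLAN header is exactly 8 bytes")
    if not (header[0] & 0x08):  # the I flag must be set for the VNI to be valid
        raise ValueError("VNI flag (I bit) not set")
    # The VNI occupies bytes 4-6; byte 7 is reserved.
    return int.from_bytes(header[4:7], "big")

# Example: a header carrying VNI 100 (0x000064).
hdr = bytes([0x08, 0, 0, 0, 0x00, 0x00, 0x64, 0x00])
print(parse_vxlan_header(hdr))  # 100
```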

VLANs are inline virtual networks, while VXLAN is an overlay virtual network, with the latter being superior in scalability and ease of operation. This is because upstream switches don’t need to maintain forwarding tables for the virtual networks, reducing the state to manage. Furthermore, since the impact of adding or deleting a virtual network is limited to the edge switches, virtual networks can be provided to users quickly. In overlay virtual networks, the tunnel endpoints (the nodes that encapsulate and decapsulate) are called Network Virtualization Edges (NVEs). The main L3 tunneling technologies are VXLAN, GRE (Generic Routing Encapsulation), and MPLS. In VXLAN, the endpoints are called VXLAN Tunnel End Points (VTEPs).

Overlay virtual networks can be further classified into two types:

  • An endpoint establishes a tunnel with only one other endpoint (point-to-point). L3VPN over MPLS falls into this category.
  • An endpoint establishes tunnels with multiple other endpoints (point-to-multipoint). Virtual Private LAN Service (VPLS) falls into this category.

Even when packets are tunneled, underlay nodes look only at the tunnel (outer) header. All packets between the same pair of endpoints therefore appear to have the same source and destination, and would follow the same path. To avoid this, VXLAN and similar protocols vary the outer UDP source port based on the inner flow, which changes the 5-tuple hash and lets different flows take different paths.
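The idea can be sketched as hashing the inner 5-tuple into the outer UDP source port. The hash function and the port range below are my own illustrative assumptions; implementations differ in the details, but all aim at the same effect:

```python
import zlib

def entropy_source_port(src_ip: str, dst_ip: str,
                        proto: int, sport: int, dport: int) -> int:
    """Derive the outer UDP source port from a hash of the inner flow,
    so different inner flows can take different ECMP paths even though
    the outer IPs are the same VTEP pair."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    h = zlib.crc32(key)
    # Map the hash into the ephemeral port range 49152-65535.
    return 49152 + (h % (65536 - 49152))

# Two inner flows that differ only in source port usually get
# different outer ports, and so may take different underlay paths.
p1 = entropy_source_port("10.0.0.1", "10.0.0.2", 6, 33000, 443)
p2 = entropy_source_port("10.0.0.1", "10.0.0.2", 6, 33001, 443)
print(p1, p2)
```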

On server nodes, TCP segmentation offload (TSO) and checksum offload on NICs can reduce the CPU cycles spent on packet processing. However, these offloads do not generally work across tunnel headers. NICs that understand VXLAN headers do exist, but because of this compatibility issue, VXLAN encapsulation is in many cases performed on the network node side rather than on the server. Also, since tunneling adds headers, we need to be careful about MTU size.
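The MTU concern comes down to simple arithmetic: VXLAN over IPv4 adds an outer IP header, a UDP header, the VXLAN header, and the inner Ethernet header between the underlay MTU and the tenant's IP MTU. A small sketch (no outer VLAN tag assumed):

```python
# Header sizes in bytes for VXLAN over IPv4, without an outer 802.1Q tag.
OUTER_IPV4, OUTER_UDP, VXLAN_HDR, INNER_ETH = 20, 8, 8, 14

def tenant_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that fits without fragmentation.

    The underlay MTU bounds the outer IP packet; subtract the outer
    IP/UDP/VXLAN headers and the inner Ethernet header to get the
    MTU seen by the tenant's IP stack.
    """
    return underlay_mtu - (OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETH)

print(tenant_mtu(1500))  # 1450 — the commonly quoted VXLAN tenant MTU
print(tenant_mtu(9000))  # 8950 — jumbo frames leave plenty of headroom
```

This is why underlays are often run with jumbo frames: the 50 bytes of encapsulation then become negligible.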

The control plane is responsible for:

  • Looking at a packet’s destination and finding the appropriate NVE. In VXLAN, this means mapping the (VNI, MAC) pair to the NVE’s IP address.
  • Providing all NVEs with a list of virtual networks relevant to that NVE
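The first responsibility above amounts to a forwarding lookup keyed by (VNI, MAC) that returns the remote VTEP's IP. In a real deployment this table is populated by the control plane; the hand-filled dict and addresses below are purely illustrative:

```python
# Illustrative forwarding table: (VNI, tenant MAC) -> remote VTEP IP.
fdb = {
    (100, "52:54:00:aa:bb:01"): "10.0.0.11",  # VTEP behind leaf1
    (100, "52:54:00:aa:bb:02"): "10.0.0.12",  # VTEP behind leaf2
    (200, "52:54:00:aa:bb:01"): "10.0.0.13",  # same MAC, different tenant
}

def lookup_vtep(vni, mac):
    """Return the VTEP IP for a (VNI, MAC) pair, or None if unknown."""
    return fdb.get((vni, mac))

# The same MAC resolves differently depending on the virtual network:
print(lookup_vtep(100, "52:54:00:aa:bb:01"))  # 10.0.0.11
print(lookup_vtep(200, "52:54:00:aa:bb:01"))  # 10.0.0.13
```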

Switch chips are transitioning from proprietary ASICs to merchant silicon. At the time the book was written, using IPv6 in the outer (tunnel) header was still poorly supported in many cases.

  • Broadcom Trident2 supports VXLAN. Trident2+ and Trident3 support VXLAN routing
  • Mellanox Spectrum chips support VXLAN bridging and routing
  • Other chips from Cavium and Barefoot Networks also support VXLAN bridging and routing

Looking at the software side, Linux VXLAN support has existed for quite a while, and VRF support was added by Cumulus Networks in 2015. Kernel 4.14 has stable support.

Components of EVPN

VPLS is an example of an L2VPN, but the inefficiency of its flood-and-learn behavior is well known, and EVPN was born to solve it. Modern data centers are designed as Clos topologies, and in many cases the ToR (leaf) switch acts as the VTEP. In FRR, both underlay and overlay information can be exchanged with a peer within a single eBGP session.

The control plane exchanges the following information:

  • The kind of network addresses being exchanged (AFI and SAFI)
  • Which virtual network an address belongs to (RD and RT)
  • Which encapsulation method is used

In EVPN, the AFI/SAFI is “l2vpn/evpn”. IP addresses and MAC addresses may be duplicated across virtual networks; in other words, they are unique only within a virtual network. The Route Distinguisher (RD) is responsible for telling virtual networks apart: the combination of RD and IP address (or MAC address) is globally unique. (In the common encoding, the RD also embeds the VTEP’s loopback IP.)
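As a sketch, an RD of the loopback-based type is the VTEP's loopback IPv4 address plus a locally assigned number (often derived from the VLAN or VNI). The exact numbering scheme varies by implementation; the layout below is just for illustration:

```python
def make_rd(loopback_ip: str, local_id: int) -> str:
    """Build an RD string from a VTEP loopback IP and a local number."""
    return f"{loopback_ip}:{local_id}"

# The same MAC advertised in two tenants yields two globally
# distinct routes, because the RD differs:
route_a = (make_rd("10.0.0.11", 100), "52:54:00:aa:bb:01")
route_b = (make_rd("10.0.0.11", 200), "52:54:00:aa:bb:01")
print(route_a != route_b)  # True — the RD disambiguates the duplicate MAC
```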

BGP can advertise additional information through path attributes. Route Target (RT) is one of the path attributes.

BGP sends Network Layer Reachability Information (NLRI) in update messages. EVPN NLRI is classified by Route Type. At minimum, RT2, RT3, and RT5 are needed to build an EVPN network. FRR 4.0.1 supports RT2, 3, and 5.

  • RT2: Advertises reachability information to specific MAC or IP addresses
  • RT3: Advertises the association between VNI and VTEP of a virtual network
  • RT5: Advertises aggregated prefixes in virtual L3 networks
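To make the contents of the three route types concrete, here is a sketch that models each one as a simple record. The field names are my own illustration of what each type carries, not the on-the-wire encoding:

```python
from dataclasses import dataclass

@dataclass
class Type2:
    """MAC/IP advertisement: where a specific host lives."""
    rd: str
    vni: int
    mac: str
    ip: str
    next_hop_vtep: str

@dataclass
class Type3:
    """Inclusive multicast route: which VTEPs belong to a VNI."""
    rd: str
    vni: int
    vtep: str

@dataclass
class Type5:
    """IP prefix route: an aggregated prefix in a virtual L3 network."""
    rd: str
    l3vni: int
    prefix: str
    next_hop_vtep: str

r2 = Type2("10.0.0.11:100", 100, "52:54:00:aa:bb:01",
           "172.16.100.1", "10.0.0.11")
print(r2.next_hop_vtep)  # the VTEP to tunnel to for this host
```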

EVPN Bridging

In 802.1Q bridges, MAC forwarding table entries are typically built through “flood-and-learn”: when a frame arrives, its source MAC address and ingress port are recorded in the table, and when the destination MAC is unknown, the frame is flooded to all other ports. Once the destination replies, its MAC is learned the same way and subsequent frames are forwarded directly.
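The flood-and-learn behavior above can be sketched in a few lines (a toy model, ignoring VLANs and aging):

```python
class Bridge:
    """Toy flood-and-learn bridge: learn source MACs, flood unknowns."""
    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}  # MAC -> port

    def receive(self, in_port, src_mac, dst_mac):
        self.fdb[src_mac] = in_port          # learn the sender
        out = self.fdb.get(dst_mac)
        if out is None:                      # unknown unicast: flood
            return sorted(self.ports - {in_port})
        return [out]                         # known: forward directly

br = Bridge(["p1", "p2", "p3"])
print(br.receive("p1", "A", "B"))  # ['p2', 'p3'] — B unknown, flood
print(br.receive("p2", "B", "A"))  # ['p1'] — A was learned above
```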

In EVPN, once all advertisements are complete, every NVE knows all the other NVEs participating in the virtual network. After that, MAC addresses are still learned locally with the usual mechanism, and the learned MAC addresses are then advertised via BGP. For BUM (Broadcast, Unknown Unicast, Multicast) packets there are two choices. The first is ingress replication: the sender copies the packet and sends one copy to each remote VTEP. The second is to multicast in the underlay. The former uses more overlay bandwidth but keeps the underlay simple. In addition, ARP suppression (a mechanism that caches ARP requests and replies within a virtual network) can reduce BUM traffic. The latter requires operating Protocol Independent Multicast (PIM) (needs investigation).

As a third approach, BUM packets can simply be dropped. If there are silent servers that never send packets themselves, their MAC addresses can never be learned and advertised via EVPN, so they become unreachable, but this is quite a rare case.

Consider the case where a server is migrated under a different VTEP. The server can trigger relearning of its MAC address by sending a gratuitous ARP (GARP). The result is advertised with BGP’s MAC Mobility extended community, which carries a sequence number so that receivers can tell which advertisement is the latest.
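The sequence-number rule can be sketched as follows (a simplified model of the receiver's behavior; the real extended community also carries a sticky flag, which is ignored here):

```python
def apply_advert(table, mac, vtep, seq):
    """Keep the advertisement with the highest sequence number,
    so a migrated server's new location wins over stale state."""
    cur = table.get(mac)
    if cur is None or seq > cur[1]:
        table[mac] = (vtep, seq)
    return table

t = {}
apply_advert(t, "52:54:00:aa:bb:01", "10.0.0.11", 0)  # original location
apply_advert(t, "52:54:00:aa:bb:01", "10.0.0.12", 1)  # after migration
apply_advert(t, "52:54:00:aa:bb:01", "10.0.0.11", 0)  # stale advert ignored
print(t["52:54:00:aa:bb:01"])  # ('10.0.0.12', 1)
```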

Consider the case where a server is connected under two VTEPs (multihoming). Here, the two VTEPs can simply share the same (anycast) VTEP IP address.

EVPN Routing

This is about how to route between two different virtual L2 networks. In conventional routers, routing between multiple L2 networks is handled by SVIs (Switched Virtual Interfaces).

In EVPN/VXLAN, VTEPs handle routing. When one VTEP (or a set of VTEPs) handles it, it’s called Centralized routing. On the other hand, when all first-hop VTEPs handle it, it’s called Distributed routing.

Furthermore, Distributed routing is divided into two types. If only the sending VTEP handles routing, it’s called Asymmetric routing. If both sending and receiving VTEPs handle routing, it’s called Symmetric routing. In most cases, Distributed Symmetric routing is adopted.

Let’s look at the characteristics of Distributed Symmetric routing:

  • Packets are processed at two VTEPs: sender and receiver
  • The VNI of the virtual network connecting VTEPs is called L3 VNI
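The role of the L3 VNI can be sketched as follows: the ingress VTEP routes the packet from the source L2 VNI onto the shared L3 VNI, and the egress VTEP routes it from the L3 VNI into the destination L2 VNI. The VNI numbers and function names below are illustrative assumptions:

```python
L3_VNI = 4001  # shared transit VNI for the tenant's VRF (illustrative)

def ingress_vni(src_l2_vni, dst_is_other_subnet):
    """Ingress VTEP: route onto the L3 VNI if the destination
    is in a different subnet, otherwise just bridge."""
    return L3_VNI if dst_is_other_subnet else src_l2_vni

def egress_vni(tunnel_vni, dst_l2_vni):
    """Egress VTEP: a packet arriving on the L3 VNI is routed
    into the local destination L2 VNI."""
    return dst_l2_vni if tunnel_vni == L3_VNI else tunnel_vni

# Inter-subnet traffic crosses the fabric on the L3 VNI:
vni_on_wire = ingress_vni(100, True)         # 4001
vni_at_dest = egress_vni(vni_on_wire, 200)   # 200
print(vni_on_wire, vni_at_dest)
```

Because both VTEPs perform a routing hop, the return traffic uses the same L3 VNI in reverse, which is what makes the scheme "symmetric".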

Looking at vendor support: Arista, Cisco, and Cumulus support Symmetric routing, while Cumulus and Juniper support Asymmetric routing.

EVPN Configuration and Operations

This chapter is large, so I’m thinking of writing a separate article. This article ends here for now.

(Added 2021/5/27) I created a sandbox environment where you can observe EVPN/VXLAN behavior using Mininet. It works on Ubuntu 20.04. The script creates two tenants, tenant#100 and tenant#200, and assigns two subnets to each. Since each tenant has its own VRF, routing works within a tenant and its hosts can communicate, while tenants cannot see each other’s networks and cross-tenant communication is blocked.

The example uses a three-tier Spine-Leaf-Server structure, with spine and leaf switches connected in an all-to-all manner. Two servers are attached to each leaf (8 servers in total), with 4 servers assigned to tenant#100 and the remaining 4 to tenant#200.

https://github.com/bobuhiro11/mininetlab/blob/main/mininetlab/evpn_vxlan.py