I read RFC 7938, so I’m taking notes. Since I’ve only skimmed it, my terminology may be imprecise.

Overview

The document summarizes a simple and highly stable network design for large-scale infrastructure with over 100,000 servers. BGP is adopted as the only routing protocol.

Introduction

The goal is to operate the infrastructure of large-scale distributed systems such as web search engines on a simple, highly stable network that a small team can run.

Network Requirements

Bandwidth and latency are the important factors. Traditional tree-structured networks handle north-south traffic well, but with the growth of server-to-server workloads like Hadoop, east-west traffic has increased, and such networks struggle to accommodate it due to physical constraints such as port density.

Looking at data center capital investment, the network accounts for 10-15% of the total. There are two approaches to reducing it. The first is to standardize all network equipment on the same hardware and software; bulk purchasing reduces both procurement and management costs. The second is to have network vendors compete on price. To keep a multi-vendor strategy viable, it’s important to minimize software requirements and maintain flexibility.

Looking at operations, in L2 networks broadcast and unicast storms cause large-scale performance problems and reduce uptime across the whole domain. Incorporating L3 routing into the design limits the impact range. On the other hand, L3 routing requires considering failures that arise from the distributed control plane; this is addressed by minimizing the number of routing protocol types in use.

Application load balancing is also an important consideration. Previously this was achieved by placing dedicated appliances in the traffic path, but such devices struggle to keep up with traffic growth. A configuration that scales horizontally by deploying multiple load-balancing nodes is therefore desirable, and it can be realized with anycast prefix advertisement and Equal Cost Multipath (ECMP).

To summarize the requirements:

  • REQ1: A configuration that scales horizontally by adding the same type of devices rather than upgrading devices is desirable
  • REQ2: Want to minimize software features and protocols
  • REQ3: Prefer routing protocols with low operational cost and simple implementation
  • REQ4: Want to minimize failure scope due to equipment or protocol issues
  • REQ5: Want to realize traffic engineering through protocol features

Data Center Topology

Let’s compare the traditional tree-based topology with a Clos-based topology. The former is composed of three layers of switches (Core layer, Aggregation/Distribution layer, Access layer) as shown below, and the upper layers must provide ever higher port density and bandwidth capacity. With this configuration, as mentioned earlier, when the number of Tier 2 switches grows, the port density of Tier 1 cannot grow enough to accommodate them. In other words, it doesn’t scale because of the bounded degree of each switch (the number of edges from a vertex, in graph terms).

             +------+  +------+
             |      |  |      |
             |      |--|      |           Tier 1
             |      |  |      |
             +------+  +------+
               |  |      |  |
     +---------+  |      |  +----------+
     | +-------+--+------+--+-------+  |
     | |       |  |      |  |       |  |
   +----+     +----+    +----+     +----+
   |    |     |    |    |    |     |    |
   |    |-----|    |    |    |-----|    | Tier 2
   |    |     |    |    |    |     |    |
   +----+     +----+    +----+     +----+
      |         |          |         |
      |         |          |         |
      | +-----+ |          | +-----+ |
      +-|     |-+          +-|     |-+    Tier 3
        +-----+              +-----+
         | | |                | | |
     <- Servers ->        <- Servers ->

                   Figure 1: Typical DC Network Topology

The latter folded Clos topology (also called fat tree) is an approach for horizontal scaling. It’s composed of an odd number of stages (also called dimensions) with uniform elements. This configuration is also called Leaf-Spine configuration. Spine corresponds to Tier 1, and Leaf corresponds to Tier 2.

   +-------+
   |       |----------------------------+
   |       |------------------+         |
   |       |--------+         |         |
   +-------+        |         |         |
   +-------+        |         |         |
   |       |--------+---------+-------+ |
   |       |--------+-------+ |       | |
   |       |------+ |       | |       | |
   +-------+      | |       | |       | |
   +-------+      | |       | |       | |
   |       |------+-+-------+-+-----+ | |
   |       |------+-+-----+ | |     | | |
   |       |----+ | |     | | |     | | |
   +-------+    | | |     | | |   ---------> M links
    Tier 1      | | |     | | |     | | |
              +-------+ +-------+ +-------+
              |       | |       | |       |
              |       | |       | |       | Tier 2
              |       | |       | |       |
              +-------+ +-------+ +-------+
                | | |     | | |     | | |
                | | |     | | |   ---------> N Links
                | | |     | | |     | | |
                O O O     O O O     O O O   Servers

                  Figure 2: 3-Stage Folded Clos Topology

The characteristics of the Clos topology are summarized as follows.

  • When M >= N, the fabric is fully non-blocking; otherwise it is oversubscribed by the ratio N/M. Here N is the number of downlink (server-facing) ports and M is the number of uplink ports on a Tier 2 switch
  • Control and data plane support for ECMP with a fan-out of M or more is required
  • Each Tier 1 switch has exactly one path to every server
  • Server-to-server traffic is load-balanced by ECMP
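To make these numbers concrete, here is a minimal sizing sketch in Python (my own illustration, not from the RFC; the port counts are made-up example values):

    # Rough sizing of the 3-stage folded Clos in Figure 2, assuming identical
    # Tier 2 switches with N server-facing (downlink) and M uplink ports.
    def clos_3stage(n_downlinks: int, m_uplinks: int, tier1_ports: int):
        oversubscription = n_downlinks / m_uplinks  # <= 1.0 means non-blocking
        max_tier2 = tier1_ports                     # each Tier 2 uses one port on every Tier 1
        max_servers = max_tier2 * n_downlinks
        ecmp_fanout = m_uplinks                     # ECMP must support a fan-out of M
        return oversubscription, max_servers, ecmp_fanout

    # Example: 32-port switches at Tier 2 split 16 down / 16 up, 32-port Tier 1 switches.
    print(clos_3stage(n_downlinks=16, m_uplinks=16, tier1_ports=32))
    # -> (1.0, 512, 16): non-blocking, up to 512 servers, ECMP fan-out of 16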

The Clos topology scales by increasing switch port density or adding more stages. The figure below shows the topology expanded to 5 stages. A group of servers together with the Tier 2 and Tier 3 switches that connect them is called a cluster; clusters are the unit of deployment and maintenance. Tier 3 in practice corresponds to the ToR (Top-of-Rack) switch, and oversubscription is applied only at this layer.

                                      Tier 1
                                     +-----+
          Cluster                    |     |
 +----------------------------+   +--|     |--+
 |                            |   |  +-----+  |
 |                    Tier 2  |   |           |   Tier 2
 |                   +-----+  |   |  +-----+  |  +-----+
 |     +-------------| DEV |------+--|     |--+--|     |-------------+
 |     |       +-----|  C  |------+  |     |  +--|     |-----+       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     |       |              |                              |       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     | +-----------| DEV |------+  |     |  +--|     |-----------+ |
 |     | |     | +---|  D  |------+--|     |--+--|     |---+ |     | |
 |     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
 |     | |     | |            |   |           |            | |     | |
 |   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
 |   | DEV | | DEV |          |   +--|     |--+          |     | |     |
 |   |  A  | |  B  | Tier 3   |      |     |      Tier 3 |     | |     |
 |   +-----+ +-----+          |      +-----+             +-----+ +-----+
 |     | |     | |            |                            | |     | |
 |     O O     O O            |                            O O     O O
 |       Servers              |                              Servers
 +----------------------------+

                      Figure 3: 5-Stage Clos Topology

If the network size is small, the number of Tier 1 or Tier 2 switches can be reduced. For example, if only half the ports of each Tier 1 switch are used, the number of Tier 1 switches can be halved; in that case two links run from a Tier 2 switch to the same Tier 1 switch. However, when one of those two links fails, double the expected traffic flows over the remaining link, which is bad if it leads to congestion and service degradation. It’s therefore better to bundle the two links into a LAG configured so that when one member link fails the whole bundle is taken down, i.e., the two ports share fate.

Data Center Routing

Let’s look at three types of designs: L2 only, L2/L3 hybrid, and L3 only.

In the L2-only configuration, Spanning Tree Protocol (STP) was used. At the time, data center switches either did not support L3 routing or required additional license fees for it. With STP and similar technologies, path selection is limited to an active/standby arrangement to avoid loops, and the failure scope of configuration errors was large. Later, Multi-Chassis Link-Aggregation (M-LAG) enabled active/active path selection, but there was no standard implementation and inter-device state synchronization carried risk. Transparent Interconnection of Lots of Links (TRILL) made large, horizontally scalable L2-only designs possible without STP, but implementations are few, so it ties you to specific equipment.

In the hybrid configuration, routing protocols were introduced to limit the reach of data plane failures. As a result, L2 domains could be shrunk and the network could scale further. However, L2 domains still had to be maintained below Tier 2 or Tier 3 for the following reasons.

  • Support for legacy software that requires direct membership in L2 domains
  • VM migration maintaining IP
  • Load balancing by Layer 2 Direct Server Return (L2DSR)

As time progressed, designing completely with IP routing above Tier 3 became popular. Usually, Interior Gateway Protocols (IGPs) like Open Shortest Path First (OSPF) are adopted as the main routing protocol. This configuration becomes attractive when servers exceed 10,000. If there are requirements for L2 domains, they can be handled with overlay and tunneling technologies.

Routing Protocol

Let’s look at the motivation for adopting External BGP (EBGP) as the only routing protocol. EBGP is widely used on the Internet for inter-domain connections, but was not used within data centers for the following reasons.

  • From the perception that BGP is for WAN only, it was not considered within data centers
  • There was a perception that BGP converges slower than IGP
  • There was a belief that BGP has configuration overhead and cannot auto-discover neighbors

Let’s discuss these points.

BGP is not that complex in protocol design. Compared to link-state IGPs like OSPF, its internal data structures and state machine are simple, and transport relies only on TCP. BGP routers propagate only the best path resulting from route calculation, so the effects of a network failure are masked close to where they occur. By allocating Autonomous System Numbers (ASNs) appropriately and relying on AS_PATH loop detection, BGP path hunting can be kept under control. When operated with minimal policy, troubleshooting also becomes easier: in most implementations it boils down to comparing the BGP Loc-RIB with the Routing Information Base (RIB), and since the Adj-RIB-In and Adj-RIB-Out can be inspected, the Network Layer Reachability Information (NLRI) exchanged with each peer can be confirmed.

The figure below illustrates ASN allocation.

                                ASN 65534
                               +---------+
                               | +-----+ |
                               | |     | |
                             +-|-|     |-|-+
                             | | +-----+ | |
                  ASN 646XX  | |         | |  ASN 646XX
                 +---------+ | |         | | +---------+
                 | +-----+ | | | +-----+ | | | +-----+ |
     +-----------|-|     |-|-+-|-|     |-|-+-|-|     |-|-----------+
     |       +---|-|     |-|-+ | |     | | +-|-|     |-|---+       |
     |       |   | +-----+ |   | +-----+ |   | +-----+ |   |       |
     |       |   |         |   |         |   |         |   |       |
     |       |   |         |   |         |   |         |   |       |
     |       |   | +-----+ |   | +-----+ |   | +-----+ |   |       |
     | +-----+---|-|     |-|-+ | |     | | +-|-|     |-|---+-----+ |
     | |     | +-|-|     |-|-+-|-|     |-|-+-|-|     |-|-+ |     | |
     | |     | | | +-----+ | | | +-----+ | | | +-----+ | | |     | |
     | |     | | +---------+ | |         | | +---------+ | |     | |
     | |     | |             | |         | |             | |     | |
   +-----+ +-----+           | | +-----+ | |           +-----+ +-----+
   | ASN | |     |           +-|-|     |-|-+           |     | |     |
   |65YYY| | ... |             | |     | |             | ... | | ... |
   +-----+ +-----+             | +-----+ |             +-----+ +-----+
     | |     | |               +---------+               | |     | |
     O O     O O              <- Servers ->              O O     O O

                 Figure 4: BGP ASN Layout for 5-Stage Clos

  • In EBGP, single-hop BGP sessions are established only between directly connected nodes. Multi-hop is not used.
  • To avoid collisions, use private ASNs (64512-65534)
  • Allocate one ASN for all of Tier 1
  • At Tier 2, allocate individual ASNs for each cluster
  • At Tier 3, allocate individual ASNs to all devices

Since there are only 1023 2-octet private ASNs, workarounds are needed, such as reusing the same Tier 3 ASNs across clusters (for example, reusing ASNs 65002 through 65032 in every cluster). Alternatively, 4-octet private ASNs can be used.
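As a toy sketch of these allocation rules (the specific numbers and cluster counts below are my own, not the ones in the figure):

    # Toy ASN plan following Figure 4: one private ASN for all of Tier 1,
    # one per cluster at Tier 2, and per-device ASNs at Tier 3 that are
    # reused across clusters to stay inside the 64512-65534 range.
    PRIVATE_FIRST, PRIVATE_LAST = 64512, 65534

    def allocate_asns(num_clusters: int, tors_per_cluster: int) -> dict:
        plan = {"tier1": PRIVATE_LAST}  # e.g. 65534 shared by every Tier 1 device
        plan["tier2"] = {c: PRIVATE_FIRST + c for c in range(num_clusters)}
        plan["tier3"] = {t: PRIVATE_FIRST + num_clusters + t for t in range(tors_per_cluster)}
        assert max(plan["tier3"].values()) < PRIVATE_LAST, "ran out of 2-octet private ASNs"
        return plan

    print(allocate_asns(num_clusters=4, tors_per_cluster=32))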

The Clos topology has a very large number of links, so advertising routes for all of them into BGP creates significant FIB overhead. The following measures address this.

  • Don’t advertise the point-to-point link prefixes into BGP. However, this makes traceroute and similar tools unusable.
  • Advertise the point-to-point link prefixes into BGP, but aggregate them at every device; this requires the addresses to be allocated consecutively (sketched below).
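For the second option, Python’s ipaddress module can illustrate how consecutive point-to-point /31s collapse into a single aggregate (the address block is an arbitrary example of mine):

    import ipaddress

    # Point-to-point /31s allocated consecutively to one device (example block).
    p2p_links = [ipaddress.ip_network(f"10.0.0.{i * 2}/31") for i in range(8)]

    # collapse_addresses merges the contiguous /31s into the covering prefix,
    # so only the aggregate needs to be advertised into BGP.
    print(list(ipaddress.collapse_addresses(p2p_links)))
    # -> [IPv4Network('10.0.0.0/28')]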

Server subnets should be advertised without aggregation at Tier 1 or Tier 2. This is to avoid black holes when a single link fails.

Let’s look at connectivity between the Clos topology and external networks. External connectivity can be treated as the job of a dedicated cluster: its Tier 3 devices are effectively the WAN routers, and its Tier 2 devices act as border routers.

When advertising toward the WAN routers, private ASNs need to be removed from the AS_PATH attribute. For reachability out of the data center, a border router must either originate a default route toward devices within the data center itself, or relay the default route it receives from the WAN routers. With the former, if all links from a border router to its WAN routers fail simultaneously, traffic sent to that border router is black-holed. With the latter, the border router simply stops advertising the default route in that situation, so the latter is more desirable.
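What “removing private ASNs” means for the AS_PATH can be sketched like this (a simplified illustration of mine, not the behavior of any particular BGP implementation):

    # Strip private ASNs (2-octet 64512-65534 and 4-octet 4200000000-4294967294)
    # from an AS_PATH before advertising routes toward the WAN routers.
    PRIVATE_RANGES = [(64512, 65534), (4200000000, 4294967294)]

    def strip_private_as(as_path):
        return [asn for asn in as_path
                if not any(lo <= asn <= hi for lo, hi in PRIVATE_RANGES)]

    print(strip_private_as([65001, 64601, 65534]))  # -> [] (fabric-internal path)
    print(strip_private_as([65001, 64601, 3356]))   # -> [3356]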

Adopting either of the following methods avoids the black-hole risk while still letting border routers advertise aggregated server subnets. (I don’t understand this part well, so I’ll revisit it.)

  • Connect border routers in full mesh
  • Add links from Tier 1 devices to border routers

ECMP

ECMP is an important feature used for load distribution in the Clos topology. A device at one tier can spread traffic across all devices in the tier directly above it. In the figure below, there are four paths from device A to server X.

                                Tier 1
                               +-----+
                               | DEV |
                            +->|  1  |--+
                            |  +-----+  |
                    Tier 2  |           |   Tier 2
                   +-----+  |  +-----+  |  +-----+
     +------------>| DEV |--+->| DEV |--+--|     |-------------+
     |       +-----|  B  |--+  |  2  |  +--|     |-----+       |
     |       |     +-----+     +-----+     +-----+     |       |
     |       |                                         |       |
     |       |     +-----+     +-----+     +-----+     |       |
     | +-----+---->| DEV |--+  | DEV |  +--|     |-----+-----+ |
     | |     | +---|  C  |--+->|  3  |--+--|     |---+ |     | |
     | |     | |   +-----+  |  +-----+  |  +-----+   | |     | |
     | |     | |            |           |            | |     | |
   +-----+ +-----+          |  +-----+  |          +-----+ +-----+
   | DEV | |     | Tier 3   +->| DEV |--+   Tier 3 |     | |     |
   |  A  | |     |             |  4  |             |     | |     |
   +-----+ +-----+             +-----+             +-----+ +-----+
     | |     | |                                     | |     | |
     O O     O O            <- Servers ->            X Y     O O

               Figure 5: ECMP Fan-Out Tree from A to X and Y

Since the same prefix can be advertised from multiple Tier 3 devices, a mechanism is needed that load-balances across paths whose AS_PATH contents differ but whose lengths are equal. This is called “multipath relax” or “multipath multiple-AS”.
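In other words, the usual “AS_PATHs must be identical” condition for ECMP is relaxed to “AS_PATH lengths must be equal”. A toy check of that condition (my own simplification, not an actual BGP best-path implementation):

    # "Multipath relax": paths to the same prefix can be installed together as
    # ECMP if their AS_PATH *lengths* match, even though the ASNs inside differ
    # (they always will, since every Tier 3 device has its own ASN).
    def ecmp_eligible(best_as_path, candidate_as_path):
        return len(best_as_path) == len(candidate_as_path)

    print(ecmp_eligible([65101, 64601], [65102, 64602]))  # True: equal length, different ASNs
    print(ecmp_eligible([65101, 64601], [65103]))         # False: the shorter path wins alone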

When servers (next hops) are added to or removed from an ECMP group, you want existing flows to keep their next-hop affinity (that is, the paths of existing flows don’t change). This can be realized with consistent hashing. However, note that it consumes Ternary Content-Addressable Memory (TCAM) capacity.
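A minimal consistent-hashing sketch for picking the next hop per flow (the names and hash choice are mine; real devices implement this in forwarding hardware, which is where the TCAM cost comes from):

    import hashlib
    from bisect import bisect

    # Consistent hashing over ECMP next hops: when a next hop is added or removed,
    # only the flows that hashed onto it move; all other flows keep their path.
    def build_ring(next_hops, vnodes=100):
        points = [(int(hashlib.md5(f"{nh}#{v}".encode()).hexdigest(), 16), nh)
                  for nh in next_hops for v in range(vnodes)]
        return sorted(points)

    def pick_next_hop(ring, flow_5tuple):
        h = int(hashlib.md5(flow_5tuple.encode()).hexdigest(), 16)
        keys = [k for k, _ in ring]
        return ring[bisect(keys, h) % len(ring)][1]

    ring = build_ring(["tier2-B", "tier2-C", "tier2-D"])
    print(pick_next_hop(ring, "10.1.1.1,192.0.2.10,tcp,49152,80"))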

Convergence

Even when a link fails, convergence should complete within a few seconds as long as failures are detected quickly and the RIB and FIB are updated promptly. Typically, IGPs handle link and node failures within an AS, but since no IGP is used here, failures are detected via BGP keepalives or link-failure triggers. Relying only on BGP keepalives is slow, because the hold timer has a minimum of 3 seconds. On the other hand, with a mechanism called “fast fallover”, the EBGP session can be torn down immediately when an interface failure is detected.

Ethernet links have a mechanism called Connectivity Fault Management (CFM), so failures can be detected more robustly. As another approach, by using Bidirectional Forwarding Detection (BFD), failures can be immediately notified to the BGP process.

The MinRouteAdvertisementIntervalTimer (MRAI timer) also needs to be considered. This setting controls the interval at which BGP UPDATE messages are sent; if the value is large, overall convergence takes longer.

In ECMP environments, “Up->Down” convergence is difficult. For example, when a Tier 2 device withdraws a prefix, the other Tier 2 devices receive the corresponding BGP UPDATE message from every Tier 1 device, and they can withdraw the prefix only after UPDATE messages have arrived from all Tier 1 devices. Depending on timing, these UPDATE messages can be spread out in time, prolonging convergence. This can be mitigated with the concept of “update groups”, which send the same UPDATE to multiple neighbors at once.
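A toy model of this “Up->Down” case (my own simplification): a Tier 2 device keeps the prefix installed until every Tier 1 neighbor has withdrawn its copy.

    # Toy model of "Up->Down" convergence: a Tier 2 device can only withdraw a
    # prefix downstream after withdrawals have arrived from *all* Tier 1 devices,
    # because until then at least one ECMP path through Tier 1 still exists.
    class Tier2Rib:
        def __init__(self, tier1_neighbors):
            self.tier1 = set(tier1_neighbors)
            self.paths = {}  # prefix -> Tier 1 neighbors still advertising it

        def withdraw(self, neighbor, prefix):
            remaining = self.paths.setdefault(prefix, set(self.tier1))
            remaining.discard(neighbor)
            return "withdraw downstream" if not remaining else f"keep ({len(remaining)} ECMP paths left)"

    rib = Tier2Rib(["tier1-1", "tier1-2", "tier1-3", "tier1-4"])
    for t1 in ["tier1-1", "tier1-2", "tier1-3", "tier1-4"]:
        print(rib.withdraw(t1, "192.0.2.0/24"))
    # keep (3 ...), keep (2 ...), keep (1 ...), withdraw downstream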

Since BGP is a distance-vector style protocol, it advertises only its best path. So even when a minor link fails, as long as traffic can still be delivered over other links, the failure stays confined to a limited area.

Consider a situation where a Tier 2 device loses all of its paths to some destination. If the default route on that device points upstream, traffic for that destination will ping-pong between the Tier 1 and Tier 2 devices. To avoid this, it’s good to statically configure discard routes or null routes.

Additional Options

As already mentioned, route aggregation is difficult to introduce because it can lead to black holes during link failures. If you do want to aggregate, the failure model should account not only for multiple link failures but also for shared fiber pathways and optical-layer failures.