I only skimmed the parts that interested me. The book is well organized, and I'd like to keep it on hand. It seems to emphasize KISS (Keep it simple, stupid) throughout.

Physical Aspects

  • Don’t run multiple links between a Leaf and a Spine. From a routing perspective, when one of those links fails the Spine is still reachable, so it keeps receiving the same amount of traffic as before the failure, and all of it has to flow through the remaining live link. The expected bandwidth is no longer available and performance degrades. It’s better to add more Spine switches instead.
  • Treat the Spine as a pure transit device. If you give a particular Spine a special role (e.g., external connectivity), traffic will concentrate on that Spine. Solve this by adding a Border Leaf or Exit Leaf instead. Eliminate exceptions. Simplicity is strength.
  • A 3-layer Clos topology is desirable. The real issue is not the number of layers but the use of huge, feature-rich switches. If you force a 2-layer Clos network by using huge switches as Spines, troubleshooting becomes complex because of all their features. LinkedIn and Dropbox switched from chassis switches to fixed-form-factor switches (citation needed).
  • Keep spares on hand so a failed switch can be replaced immediately. You shouldn’t have to wait on a replacement request to support.
  • Use cables and transceivers that have been tested by NOS vendors.
  • Don’t select by comparing feature lists. Be a minimalist.
  • Follow the ASN allocation model in this book.
    • 3-layer configuration of Leaf-Spine-SuperSpine
    • Call the Leaf-Spine set a Pod
    • Each Leaf has a unique ASN
    • All Spines in a Pod share the same ASN (a different ASN for each Pod)
    • All SuperSpines share the same ASN
  • Use Unnumbered BGP.
  • Verify that loopback IP addresses are valid and correctly advertised.
  • Enable multipath.
  • Use a single eBGP session to carry reachability for multiple address families.
  • Use BFD.
  • Configure route maps to reject invalid prefixes.
  • Don’t aggregate routes except at Leaf.
  • Set the BGP advertisement interval timer to 0 seconds so that updates are sent immediately.
  • Set keepalive timer to 3 seconds, hold timer to 9 seconds, connect timer to 10 seconds.
  • Minimize configuration. Important.
  • Adopt the distributed symmetric routing model.
  • Avoid underlay routed multicast.
  • Avoid BUM (broadcast, unknown unicast, multicast) traffic.
  • Minimize configuration. Repeatedly important.
  • Start with simple things. Like assigning loopback IP addresses.
  • Separate code and data.
  • Provide validation before actually applying configuration.
  • Use Git.
  • Roll out updates in a rolling fashion. In a Clos topology, you can limit the scope of impact.
  • Unify languages and tools. Don’t mix Ansible and Chef, Ruby and Python.
  • Use tools with large communities, such as Ansible.
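
The ASN allocation model in the list above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name allocate_asns, the layout of the returned dict, and the choice to number from the start of the 4-byte private ASN range (4200000000, per RFC 6996) are mine and not taken from the book.

```python
from itertools import count

# Assumption: number from the start of the 4-byte private ASN range (RFC 6996).
PRIVATE_ASN_BASE = 4200000000


def allocate_asns(pods, leaves_per_pod):
    """Hypothetical helper: build an ASN plan for a Leaf-Spine-SuperSpine fabric.

    - all SuperSpines share one ASN
    - all Spines in a Pod share one ASN, different for each Pod
    - every Leaf gets its own ASN
    """
    next_asn = count(PRIVATE_ASN_BASE)
    plan = {"superspine": next(next_asn), "pods": []}
    for _ in range(pods):
        spine_asn = next(next_asn)  # shared by every Spine in this Pod
        leaf_asns = [next(next_asn) for _ in range(leaves_per_pod)]
        plan["pods"].append({"spine": spine_asn, "leaves": leaf_asns})
    return plan


if __name__ == "__main__":
    for pod_id, pod in enumerate(allocate_asns(pods=2, leaves_per_pod=3)):
        pass  # iterating the dict keys only; see the usage note below
    plan = allocate_asns(pods=2, leaves_per_pod=3)
    print("SuperSpine ASN:", plan["superspine"])
    for pod_id, pod in enumerate(plan["pods"]):
        print("Pod", pod_id, "Spine ASN:", pod["spine"], "Leaf ASNs:", pod["leaves"])
```

Keeping a plan like this as plain data, separate from the templates that render device configuration, also fits the "separate code and data" advice.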
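
To make the first point under Physical Aspects concrete (no multiple links between a Leaf and a Spine), here is a rough back-of-the-envelope model. It is my own simplification, not from the book: it assumes traffic toward one Leaf is split equally across every Spine that still advertises the route, with no bandwidth-aware weighting, and the function name and parameters are hypothetical.

```python
def usable_gbps(spines, links_per_spine, link_gbps, failed_links_on_one_spine=0):
    """Aggregate traffic toward one Leaf before some Spine-to-Leaf bundle saturates.

    Simplifying assumption: every Spine that still has a path to the Leaf
    receives an equal share of the traffic (plain ECMP, no weighting).
    """
    remaining = links_per_spine - failed_links_on_one_spine
    if remaining <= 0:
        # The Spine has lost its path to the Leaf entirely, so traffic is
        # spread over the other Spines, which are all still at full capacity.
        return (spines - 1) * links_per_spine * link_gbps
    # The degraded Spine still receives a 1/spines share of the traffic,
    # so its remaining links become the bottleneck for the whole fabric.
    return spines * remaining * link_gbps


if __name__ == "__main__":
    # 4 Spines with 2 x 100G links each vs. 8 Spines with 1 x 100G link each.
    print(usable_gbps(4, 2, 100))                               # healthy
    print(usable_gbps(4, 2, 100, failed_links_on_one_spine=1))  # one link down
    print(usable_gbps(8, 1, 100, failed_links_on_one_spine=1))  # one link down
```

Under these assumptions both designs offer 800G when healthy, but after a single link failure the doubled-link design is limited to 400G while the design with more Spines still offers 700G.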