I skimmed only the parts that interested me. The book is well organized, and I’d like to keep it on hand. It emphasizes KISS (Keep it simple, stupid) throughout.
Physical Aspects
- Don’t run multiple links between a Leaf and a Spine. From a routing perspective the Spine stays reachable when one of those links fails, so it keeps receiving the same amount of traffic as before, now over the surviving link. The expected bandwidth is no longer there and performance degrades. It’s better to add more switches instead.
- Treat the Spine as a pure transit device. If you give one Spine a special role (e.g., external connectivity), traffic concentrates on that Spine. Placing a Border Leaf or Exit Leaf solves this. Eliminate exceptions. Simplicity is strength.
- A 3-layer Clos topology is desirable. The point is less the number of layers than avoiding huge, feature-rich switches. If you force the network into a 2-layer Clos by using huge switches as Spines, troubleshooting becomes complex precisely because those switches do so much. LinkedIn and Dropbox switched from chassis switches to fixed-form-factor switches (citation needed).
- Keep spares on hand so that a failed switch can be replaced immediately, without having to request a replacement from support.
- Use cables and transceivers that have been tested by NOS vendors.
- Don’t choose products by comparing feature lists. Be a minimalist.
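The first point above (more switches rather than parallel links) can be illustrated with a little arithmetic. The following is a minimal sketch, not the book's own model: it assumes ECMP splits traffic evenly across every Spine that still has a live link to the destination Leaf, and the function name, the 280 Gbps demand, and the 100 Gbps link speed are all made up for illustration.

```python
def spine_utilization(demand_gbps, spines, links_per_spine, link_gbps,
                      failed_links_on_first_spine=0):
    """Utilization of each spine's downlinks toward one destination leaf.

    Assumption: upstream ECMP splits the demand evenly across every spine
    that still has at least one live link to the leaf, regardless of how
    much bandwidth that spine has left.
    """
    live = [links_per_spine] * spines
    live[0] -= failed_links_on_first_spine
    # A spine with no live link to the leaf drops out of the ECMP set.
    reachable = [n for n in live if n > 0]
    if not reachable:
        raise ValueError("no spine can reach the leaf")
    share = demand_gbps / len(reachable)
    return [share / (n * link_gbps) for n in reachable]


if __name__ == "__main__":
    demand = 280  # Gbps headed to one leaf (hypothetical figure)
    designs = {
        "2 spines x 2 links": (2, 2),
        "4 spines x 1 link": (4, 1),
    }
    for label, (spines, links) in designs.items():
        before = max(spine_utilization(demand, spines, links, 100))
        after = max(spine_utilization(demand, spines, links, 100, 1))
        print(f"{label}: worst utilization {before:.2f} -> {after:.2f}")
```

With these made-up numbers both designs start at 0.70. After a single link failure the 2x2 design pushes the damaged Spine to 1.40, because it still receives half of the traffic over half of its bandwidth. The 4x1 design removes that Spine from the path and lands at about 0.93 on the remaining three.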
BGP Related
- Follow the ASN allocation model in this book:
  - Use a 3-layer Leaf-Spine-SuperSpine configuration.
  - A set of Leaves and Spines is called a Pod.
  - Each Leaf has a unique ASN.
  - All Spines in a Pod share one ASN (a different ASN for each Pod).
  - All SuperSpines share one ASN.
- Use Unnumbered BGP.
- Verify that loopback IP addresses are valid and correctly advertised.
- Enable multipath.
- Carry reachability for multiple address families over a single eBGP session.
- Use BFD.
- Configure route maps to reject invalid prefixes.
- Don’t aggregate routes except at Leaf.
- Set the BGP advertisement interval timer to 0 seconds so that changes propagate immediately.
- Set keepalive timer to 3 seconds, hold timer to 9 seconds, connect timer to 10 seconds.
- Minimize configuration. Important.
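The ASN allocation rules listed above lend themselves to a short script. This is a rough sketch under my own assumptions: the starting number is the beginning of the 4-byte private ASN range (RFC 6996), while the order in which numbers are handed out, the function name `allocate_asns`, and the `podN-leafM` naming are hypothetical and not taken from the book.

```python
PRIVATE_ASN_START = 4200000000  # first 4-byte private ASN (RFC 6996)
PRIVATE_ASN_END = 4294967294    # last 4-byte private ASN (RFC 6996)


def allocate_asns(pods, leaves_per_pod):
    """Build an ASN plan for a Leaf-Spine-SuperSpine fabric.

    Rules from the notes: all SuperSpines share one ASN, all Spines in a
    Pod share one ASN (different per Pod), and every Leaf gets its own ASN.
    The sequential numbering is an illustrative choice only.
    """
    next_asn = PRIVATE_ASN_START
    plan = {"superspine": next_asn, "pods": {}}
    next_asn += 1
    for pod in range(1, pods + 1):
        spine_asn = next_asn  # one ASN for every Spine in this Pod
        next_asn += 1
        leaves = {}
        for leaf in range(1, leaves_per_pod + 1):
            leaves[f"pod{pod}-leaf{leaf}"] = next_asn  # unique per Leaf
            next_asn += 1
        plan["pods"][f"pod{pod}"] = {"spine": spine_asn, "leaves": leaves}
    if next_asn - 1 > PRIVATE_ASN_END:
        raise ValueError("ran out of private ASNs")
    return plan


if __name__ == "__main__":
    plan = allocate_asns(pods=2, leaves_per_pod=3)
    print("superspine:", plan["superspine"])
    for pod_name, pod in plan["pods"].items():
        print(pod_name, "spine:", pod["spine"])
        for leaf_name, asn in pod["leaves"].items():
            print("  ", leaf_name, asn)
```

Keeping the plan as plain data makes it easy to feed into whatever template or tool generates the actual BGP configuration.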
EVPN Related
- Adopt the distributed symmetric routing model.
- Avoid underlay routed multicast.
- Don’t use BUM packets.
- Minimize configuration. Repeatedly important.
Automation Related
- Start with simple things, like assigning loopback IP addresses.
- Separate code and data.
- Provide validation before actually applying configuration.
- Use Git.
- Update in a rolling fashion. In a Clos topology, this lets you limit the scope of impact.
- Standardize on one language and one tool. Don’t mix Ansible with Chef, or Ruby with Python.
- Use tools with large communities, such as Ansible.
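Three of the points above (start with loopback addresses, separate code from data, validate before applying) fit into one small sketch. It uses only Python's standard `ipaddress` module. The inventory layout, the address pool, and the function names are assumptions of mine for illustration, and in practice the data would more likely live in a file kept in Git than in the script itself.

```python
import ipaddress
import itertools

# Data: kept apart from the logic below (hypothetical inventory).
INVENTORY = {
    "loopback_pool": "10.0.0.0/24",
    "switches": ["leaf01", "leaf02", "spine01", "spine02"],
}


def assign_loopbacks(inventory):
    """Hand out one loopback address per switch, in inventory order."""
    pool = ipaddress.ip_network(inventory["loopback_pool"])
    names = inventory["switches"]
    addresses = list(itertools.islice(pool.hosts(), len(names)))
    if len(addresses) < len(names):
        raise ValueError("loopback pool is too small")
    return dict(zip(names, addresses))


def validate(inventory, loopbacks):
    """Return a list of problems; an empty list means it is safe to apply."""
    errors = []
    names = inventory["switches"]
    pool = ipaddress.ip_network(inventory["loopback_pool"])
    if len(set(names)) != len(names):
        errors.append("duplicate switch names in inventory")
    for name in names:
        if name not in loopbacks:
            errors.append(f"{name}: no loopback assigned")
    for name, addr in loopbacks.items():
        if addr not in pool:
            errors.append(f"{name}: {addr} is outside {pool}")
    if len(set(loopbacks.values())) != len(loopbacks):
        errors.append("duplicate loopback addresses")
    return errors


if __name__ == "__main__":
    loopbacks = assign_loopbacks(INVENTORY)
    problems = validate(INVENTORY, loopbacks)
    if problems:
        raise SystemExit("\n".join(problems))
    for name, addr in loopbacks.items():
        print(f"{name}: {addr}/32")
```

The validation step runs before anything touches a switch, so a bad inventory edit fails on the workstation or in CI rather than in the fabric.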