This was an excellent book, with its key points organized very clearly. I was especially pleased that it includes experimental sample code written in assembly 1. Even if you know the concepts theoretically, actually running the code deepens your understanding. Here I'll leave some miscellaneous notes for myself.

Assembly Notation

The book says Intel syntax is easier to understand than AT&T syntax, and I feel the same way. In Intel syntax, the destination comes on the left side. For gdb, you can add set disassembly-flavor intel to ~/.gdbinit.

Measuring Cycles per Instruction

You can statistically measure the cycles taken by add, mul, and mov instructions 2. By executing a large number of identical instructions and timing them, you can estimate the average cycles per instruction. Furthermore, by creating true data dependencies between the instructions, you can defeat superscalar and superpipelined execution and force the instructions to run one at a time.

Branch Prediction

The basic idea is to cache a table mapping the memory address of a branch instruction to its jump destination address. When the same branch instruction is issued again, the destination can then be predicted. For function returns, return addresses are cached in a dedicated table. For conditional branches, past outcomes are stored as a bit pattern like "100010" and matched against for prediction. Sometimes the outcomes of other conditional branches nearby (in terms of memory address) are also taken into account. Accuracy is around 95%, slightly lower than typical CPU cache hit rates of 97% or more.

Speculative Execution

Out-of-order execution guided by branch prediction is called speculative execution. In practice this is the main payoff of out-of-order execution: it lets instructions execute out of order beyond basic-block boundaries (instruction sequences delimited by conditional branches).

Cache Coherency

Cache coherency in SMP systems is achieved through the MSI protocol and its derivatives. The idea: when CPU0 writes to memory, any cache lines (typically 64-byte blocks) for the same address held by CPUs other than CPU0 are marked invalid. This guarantees that when those CPUs next access that memory address, a cache miss occurs and they fetch the latest value from main memory.

Memory Consistency

In multicore systems especially, reordering of memory accesses can cause unintended results, so memory consistency must be taken into account. Memory consistency essentially means adding ordering constraints on top of out-of-order execution. Note that even in-order implementations can reorder accesses because of the memory system itself (multiple banks, store buffers, etc.), so caution is needed. x86 is based on the TSO (total store ordering) model: the relative order of store and load accesses normally does not change, though reordering can occur when accessing regions with different memory attributes. x86 SSE added the lfence and sfence instructions. Note that Linux's barrier() macro, GCC's memory clobber, and C's volatile qualifier only restrain the compiler; they do not prevent the hardware from reordering accesses. This is a complex area, so it's better to rely on abstractions such as the channels provided by programming languages.

ordering_unexpected.S 3 reproduces the memory access reordering caused by the store buffers that sit between the CPU and memory.

Atomic Operations

Atomic operations can be implemented using the E (exclusive) state of the MESI cache coherency protocol. LL/SC instruction pairs also enable atomic operations: the SC instruction succeeds only if no write has occurred to the memory address read by the LL instruction, and fails otherwise, so you simply retry the LL/SC block until it succeeds. Of course, reliably getting good performance out of this is difficult. Note that atomic operations are necessary even on a single processor, because time-slicing can interleave threads between a read and a write.

Modern CPUs

An example of a modern CPU is BOOM (The Berkeley Out-of-Order RISC-V Processor), an implementation of the RISC-V architecture 4.