Optimizing a Spin-Lock

A spin-lock is a mutex that never sleeps: instead of yielding to the scheduler when the lock is taken, the thread stays on the CPU and keeps retrying—spinning—until it succeeds, avoiding syscalls and context switches for critical sections of a few nanoseconds. In this post we’ll write the basic version, see why it is slow, and fix it step by step until it beats std::mutex by 3.4x under contention.

The benchmark §

Each thread increments a shared counter under the lock—250k increments in total, split evenly across threads. The total work is fixed: with a perfect lock the time stays flat as threads are added, and any growth is pure synchronization overhead. Threads are pinned to their own cores¹ ¹ With pthread_setaffinity_np, on a machine tuned for benchmarking (AMD Ryzen 7 PRO 8700GE, 8 cores at 3.65 GHz): performance governor, hyperthreading and turbo boost disabled.

template <typename SpinLock>
auto BM_SpinLock(benchmark::State& state) -> void {
  const auto num_threads = state.range(0);

  auto spin_lock = SpinLock{};
  auto val = std::uint64_t{};
  auto threads = std::vector<std::thread>{};
  threads.reserve(num_threads);

  for (auto _ : state) {
    for (auto i = 0U; i < num_threads; ++i) {
      threads.emplace_back([&, i] {
        pinThread(i);
        for (auto j = 0U; j < 250'000U / num_threads; ++j) {
          spin_lock.lock();
          ++val;
          spin_lock.unlock();
        }
      });
    }
    for (auto& thread : threads) thread.join();
    benchmark::DoNotOptimize(val);
    threads.clear();
  }
}

Every version below exposes lock and unlock. The baseline, SpinLockV0, simply wraps std::mutex

BM_SpinLock<SpinLockV0>/1/real_time       1.62 ms
BM_SpinLock<SpinLockV0>/2/real_time       5.48 ms
BM_SpinLock<SpinLockV0>/4/real_time       6.13 ms

Going from one thread to two more than triples the time, but from two to four it barely moves: a contended std::mutex parks waiters in the kernel and wakes them one at a time, so the damage does not compound. The number to beat: 6.13 ms at four threads.

A basic spin-lock §

The simplest correct spin-lock is an atomic bool and an exchange loop² ² exchange atomically writes true and returns the previous value: false means the lock was free and is now ours; true means someone else holds it, and we retry.

class SpinLockV1 {
  std::atomic_bool locked_{false};

public:
  auto lock() noexcept -> void { while (locked_.exchange(true)); }
  auto unlock() noexcept -> void { locked_.store(false); }
};

BM_SpinLock<SpinLockV1>/1/real_time       1.20 ms
BM_SpinLock<SpinLockV1>/2/real_time       10.3 ms
BM_SpinLock<SpinLockV1>/4/real_time       14.5 ms

Uncontended, the spin-lock already wins: 1.20 ms against the mutex’s 1.62 ms—locking is now a single atomic instruction with no library machinery around it. Under contention it is a disaster: 1.9x slower than the mutex at two threads, 2.4x at four—while burning CPU for the entire wait.

The problem is cache coherence. To write to a cache line, a core must first own it exclusively, invalidating every other core’s copy. An atomic exchange is a write even when it fails and merely swaps true over true. So every waiting thread constantly steals the line away from everyone else—including from the lock holder, who needs that same line back just to release the lock. The line ping-pongs between cores, and the one useful increment drowns in coherence traffic.

perf stat -d counts hardware events; compare one thread against four

$ perf stat -d ./benchmark --benchmark_filter='V1>/1' --benchmark_min_time=500x
    128,088,290      branches
      1,896,594      branch-misses            #   1.48% of all branches
    645,478,415      L1-dcache-loads
      1,687,482      L1-dcache-load-misses    #   0.26% of all accesses

$ perf stat -d ./benchmark --benchmark_filter='V1>/4' --benchmark_min_time=500x
    674,575,244      branches
    105,479,892      branch-misses            #  15.64% of all branches
  3,262,419,243      L1-dcache-loads
    293,873,584      L1-dcache-load-misses    #   9.05% of all accesses

A 35x jump in cache miss rate on the same nine bytes of data, and one branch in six mispredicted: whether the exchange succeeds is decided by the other cores, and the predictor cannot learn it. perf c2c—perf’s tool for cache-line contention—ranks cache lines by how often a core found them dirty in another core’s cache (a HITM event)

$ perf c2c record -a -- ./benchmark --benchmark_filter='V1>/4'
$ perf c2c report --stdio
=================================================
           Shared Data Cache Line Table
=================================================
Index             Address      Hitm
    0      0x7fffffffb440    91.49%
    1  0xffff8c025e923700     2.13%

A single cache line—the one holding the lock and the counter—accounts for 91% of all HITM events. Spinning is not free in watts either; the CPU’s own energy meter³ ³ The RAPL counters, which perf exposes as power/energy-pkg/; reading them requires system-wide mode (-a) and root. puts a number on the wait

$ perf stat -a -e power/energy-pkg/ ./benchmark --benchmark_filter='V1>/4' --benchmark_min_time=200x
          24.43 Joules power/energy-pkg/

The mutex does the same 200 iterations for 14.9 J.

Active backoff §

The fix is to stop writing while waiting: attempt the exchange once and, if it fails, spin on a plain load—read-only copies of the line can live in every core’s L1, so waiting generates no traffic at all. But when the holder releases, every waiter sees the lock free at the same instant, and the whole herd stampedes for the exchange—one wins, the rest pay the coherence storm anyway. To thin the herd, space out the reads with a small delay loop⁴ ⁴ volatile keeps the compiler from deleting the empty loop. The 150 iterations are a tunable parameter, experimentally determined.

class SpinLockV2 {
  std::atomic_bool locked_{false};

public:
  auto lock() noexcept -> void {
    while (true) {
      if (!locked_.exchange(true)) return;
      do {
        for (volatile int i = 0; i < 150; ++i);
      } while (locked_.load());
    }
  }
  auto unlock() noexcept -> void { locked_.store(false); }
};

At two threads the time drops to 4.82 ms—2.1x faster than the basic version and already ahead of the mutex. At four threads, 12.0 ms, most of the gain evaporates: the herd is bigger, and a fixed delay no longer keeps threads apart. The counters expose a hidden cost, too: the cache misses barely improve (8.6% against 9.1%), and the energy gets worse, 29.5 J against 24.4 J—the delay loop is busy-work executed at full speed.

Passive backoff §

The delay loop burns the whole wait executing useless increments. x86 has an instruction for exactly this: pause, exposed as the _mm_pause() intrinsic⁵ ⁵ From <immintrin.h>; ARM’s closest equivalent is the yield instruction. , which inserts a short delay with the pipeline relaxed

class SpinLockV3 {
  std::atomic_bool locked_{false};

public:
  auto lock() noexcept -> void {
    while (true) {
      if (!locked_.exchange(true)) return;
      do {
        for (auto i = 0; i < 4; ++i) _mm_pause();
      } while (locked_.load());
    }
  }
  auto unlock() noexcept -> void { locked_.store(false); }
};

The four pause calls roughly match the delay of the 150-iteration loop, and the timings land in the same place: 5.02 ms at two threads and 10.9 ms at four. The difference is in what the timings don’t show: perf counts 24x fewer instructions for the same work—the core now spends the wait deliberately doing nothing—and the energy drops from active backoff’s 29.5 J to 17.9 J.

Exponential backoff §

Pausing made the wait cheap, but the timings barely moved—and with a constant delay they cannot: all waiters poll at the same rate, so every release still wakes the whole herd, and the collisions remain. For the waiters to arrive at different times, their delays must differ: let each thread double its delay every time it finds the lock still taken, up to a cap⁶ ⁶ Without the cap, threads would end up pausing long after the lock has become free. Both bounds, 4 and 1024, are tunable.

class SpinLockV4 {
  std::atomic_bool locked_{false};

public:
  auto lock() noexcept -> void {
    auto backoff_iters = 4;
    while (true) {
      if (!locked_.exchange(true)) return;
      do {
        for (auto i = 0; i < backoff_iters; ++i) _mm_pause();
        backoff_iters = std::min(backoff_iters << 1, 1024);
      } while (locked_.load());
    }
  }
  auto unlock() noexcept -> void { locked_.store(false); }
};

The timings collapse: 1.62 ms at two threads and 1.80 ms at four—within 1.5x of the single-threaded time, approaching the flat line of a perfect lock. That is 3.4x faster than std::mutex and 8x faster than the basic spin-lock.

The counters explain the win: at four threads the cache miss rate falls to 1.6%, branch mispredictions to 2.9%, and the energy to 3.8 J, a quarter of the mutex’s 14.9 J—all near their single-threaded levels. That is the signature of threads running one at a time: backoff approximately serializes them. Waiters sit in pause loops of up to 1024 iterations, so the releasing thread usually re-acquires the lock immediately—lock and counter still warm in its L1—and races through its share of increments while everyone else stays out of the way. Serialization is optimal here, since the increments cannot run in parallel anyway.⁷ ⁷ It is also maximally unfair: nothing stops one thread from re-acquiring the lock indefinitely while another starves in its backoff loop. Real implementations bound the unfairness, or enforce FIFO order outright with a ticket lock—at a cost in throughput.

Summary §

If you want to reproduce these results, the benchmark lives in my CppPlayground repository. Each cell reads time / L1d cache misses, with the winners in bold.

Version	1 thread	2 threads	4 threads
0	1.62 ms / 0.03%	5.48 ms / 1.03%	6.13 ms / 1.60%
1	1.20 ms / 0.26%	10.3 ms / 3.78%	14.5 ms / 9.05%
2	1.25 ms / 0.25%	4.82 ms / 3.08%	12.0 ms / 8.61%
3	1.25 ms / 0.27%	5.02 ms / 2.60%	10.9 ms / 6.14%
4	1.25 ms / 0.28%	1.62 ms / 0.67%	1.80 ms / 1.59%

The journey is the interesting part: the basic spin-lock lost to std::mutex by 2.4x, and three small fixes—each derived from a measurement—turned that into a 3.4x win. In real code, std::mutex remains the right default; reach for a spin-lock when the critical section is tiny, the threads have dedicated cores, and you have measured the difference.

My mailing list is free, occasional, and covers a variety of topics. I will never sell or share your email address.

Have feedback? Email me at david@alvarezrosa.com.