Caches & Memory Hierarchy
Modern CPUs are extremely fast, but memory access often becomes the bottleneck.
This article explores the memory hierarchy and the role of caches, highlighting how they optimize performance — crucial knowledge for system design interviews and real-world engineering.
1. What is the Memory Hierarchy?
The memory hierarchy organizes storage systems by speed, cost, and size:
- Registers → Fastest, inside the CPU, store temporary data.
- Cache Memory (SRAM) → Very fast, small, located close to CPU.
- Main Memory (DRAM / RAM) → Larger but slower.
- Secondary Storage (SSD/HDD) → Very large, much slower.
- Tertiary Storage (tapes/archives) → Extremely large, extremely slow.
Goal: Give the CPU fast access to frequently used data while keeping costs manageable.
2. Why Do We Need Caches?
- CPU clock cycles are in nanoseconds.
- DRAM access can take tens to hundreds of nanoseconds.
- Without caches, CPUs would spend most of their time waiting for data.
Caches act as a “middle layer” — small but fast memory that stores hot (frequently used) data.
3. Cache Fundamentals
3.1 Key Characteristics
- Speed → Built from SRAM (faster but costlier than DRAM).
- Size → KB–MB range (vs. GBs of RAM).
- Proximity → On or very close to CPU cores.
3.2 Principles of Locality
- Temporal Locality → If data is used now, it’ll likely be used again soon.
- Spatial Locality → If one memory address is used, nearby ones are likely needed too.
Caches exploit these principles to decide what to keep.
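To make the two principles concrete, here is a minimal C++ sketch (an illustration, not code from any particular system): the running total is reused on every iteration (temporal locality), and the array is walked in memory order, so each cache line fetched on a miss serves several consecutive accesses (spatial locality).

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> data(1'000'000, 1);

    // Temporal locality: `total` is read and updated on every iteration,
    // so it stays in a register or the L1 cache for the whole loop.
    long long total = 0;

    // Spatial locality: elements are visited in memory order, so a single
    // 64-byte line fetched on a miss covers the next ~16 int accesses.
    for (std::size_t i = 0; i < data.size(); ++i) {
        total += data[i];
    }

    std::printf("sum = %lld\n", total);
    return 0;
}
```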
4. How Caches Work
- Cache Hit → Data is in cache → fast access.
- Cache Miss → Data not in cache → fetch from RAM (slow).
- Cache Line → Data fetched in blocks (e.g., 64 bytes) to benefit from spatial locality.
Write Policies
- Write-Through → Update both cache + RAM immediately.
- Write-Back → Update cache only; write to RAM when line is evicted (faster, but riskier).
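As a rough illustration of the trade-off (a toy software model, not how real cache hardware is built), the sketch below counts how many RAM writes each policy performs for repeated stores to the same line: write-through pays one RAM write per store, while write-back defers everything to a single write when the dirty line is evicted.

```cpp
#include <cstdio>
#include <unordered_map>

// Toy single-level cache model: one entry per cache-line address.
struct ToyCache {
    std::unordered_map<long, bool> lines; // address -> dirty flag
    long ram_writes = 0;
    bool write_back;                      // false = write-through

    explicit ToyCache(bool wb) : write_back(wb) {}

    void write(long addr) {
        if (write_back) {
            lines[addr] = true;           // mark dirty; RAM untouched for now
        } else {
            lines[addr] = false;
            ++ram_writes;                 // write-through: RAM updated immediately
        }
    }

    void evict(long addr) {
        auto it = lines.find(addr);
        if (it != lines.end()) {
            if (it->second) ++ram_writes; // dirty line flushed on eviction
            lines.erase(it);
        }
    }
};

int main() {
    ToyCache wt(false), wb(true);
    for (int i = 0; i < 1000; ++i) {      // 1000 stores to the same line
        wt.write(0x40);
        wb.write(0x40);
    }
    wt.evict(0x40);
    wb.evict(0x40);
    std::printf("write-through RAM writes: %ld\n", wt.ram_writes); // 1000
    std::printf("write-back    RAM writes: %ld\n", wb.ram_writes); // 1
    return 0;
}
```

With 1,000 stores to one line, the write-through model performs 1,000 RAM writes and the write-back model performs one, which is exactly why write-back is faster but risks losing buffered updates if the cached data disappears before eviction.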
5. Cache Mapping Techniques
How does the CPU decide where a memory block may be placed in the cache?
Direct Mapping
- Each block maps to exactly one cache line.
- ✅ Simple, fast lookup.
- ❌ Prone to conflicts (different blocks mapping to the same line).
Fully Associative
- Any block can go anywhere in cache.
- ✅ Flexible, fewer conflicts.
- ❌ Expensive hardware (must search all lines).
Set-Associative (N-way)
- Compromise: Cache divided into sets; block can go into any line in a set.
- ✅ Balance between simplicity and flexibility.
- Example: 4-way set-associative cache = each set has 4 lines.
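To see where a given address lands, hardware splits it into an offset within the line, a set index, and a tag. The sketch below shows that arithmetic for assumed, illustrative parameters (32 KB cache, 64-byte lines, 4-way, so 128 sets); real CPUs differ in the exact sizes.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative parameters: 32 KB, 4-way set-associative, 64-byte lines.
    const uint64_t cache_size = 32 * 1024;
    const uint64_t line_size  = 64;
    const uint64_t ways       = 4;
    const uint64_t num_sets   = cache_size / (line_size * ways); // 128 sets

    uint64_t addr = 0x7ffd1234abcd;                  // arbitrary example address

    uint64_t offset = addr % line_size;              // byte within the line
    uint64_t set    = (addr / line_size) % num_sets; // which set the block maps to
    uint64_t tag    = addr / (line_size * num_sets); // identifies the block within its set

    std::printf("offset=%llu set=%llu tag=%llx\n",
                (unsigned long long)offset,
                (unsigned long long)set,
                (unsigned long long)tag);
    return 0;
}
```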
6. Cache Replacement Policies
When the cache (or a set within it) is full, something must be evicted:
- Least Recently Used (LRU) → Evict data unused longest.
- First-In-First-Out (FIFO) → Evict oldest data.
- Random → Pick a line randomly.
LRU is common but expensive to implement exactly in hardware → approximations (e.g., pseudo-LRU) are often used.
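A software analogue of LRU (hardware typically uses cheaper approximations, but the idea is the same) can be built from a recency-ordered list plus a map for O(1) lookup. A minimal sketch:

```cpp
#include <cstdio>
#include <list>
#include <unordered_map>

// Minimal LRU "cache" of fixed capacity, keyed by block address.
class LruCache {
    std::size_t capacity_;
    std::list<long> order_;                                    // front = most recently used
    std::unordered_map<long, std::list<long>::iterator> pos_;
public:
    explicit LruCache(std::size_t cap) : capacity_(cap) {}

    bool access(long addr) {                                   // returns true on a hit
        auto it = pos_.find(addr);
        if (it != pos_.end()) {
            order_.splice(order_.begin(), order_, it->second); // move to front
            return true;
        }
        if (order_.size() == capacity_) {                      // evict the least recently used
            pos_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(addr);
        pos_[addr] = order_.begin();
        return false;
    }
};

int main() {
    LruCache cache(2);
    for (long addr : {1, 2, 1, 3, 2}) {
        std::printf("access %ld -> %s\n", addr, cache.access(addr) ? "hit" : "miss");
    }
    // 1: miss, 2: miss, 1: hit, 3: miss (evicts 2), 2: miss
    return 0;
}
```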
7. Cache Levels (L1, L2, L3)
Modern CPUs use a multi-level cache hierarchy:
L1 Cache → Per core, very small (32–64 KB), fastest. Often split into:
- L1i (instructions)
- L1d (data)
L2 Cache → Per core, larger (256 KB–1 MB), slower than L1.
L3 Cache → Shared across cores, very large (4–50 MB), slower but still faster than RAM.
Key Insight: Caches closer to the core (L1) prioritize speed; levels farther out (L3) prioritize capacity.
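On Linux with glibc, these sizes can be queried at runtime via sysconf; the constants below are glibc extensions, so treat this as a platform-specific sketch that may report 0 or -1 elsewhere.

```cpp
#include <cstdio>
#include <unistd.h>

int main() {
    // glibc-specific sysconf names; not portable to every platform.
    long l1d  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3   = sysconf(_SC_LEVEL3_CACHE_SIZE);
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

    std::printf("L1d: %ld bytes, L2: %ld bytes, L3: %ld bytes, line: %ld bytes\n",
                l1d, l2, l3, line);
    return 0;
}
```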
8. Cache Coherence (Multi-Core Issue)
In multi-core systems, each core has its own cache. Problem: what if two cores cache the same data and one modifies it?
Solutions
- MESI Protocol
- Cache lines marked as: Modified, Exclusive, Shared, Invalid.
- Ensures at most one core holds a writable copy; stale copies in other caches are invalidated before a write.
- Directory-Based Protocols
- Central directory tracks which core holds which cache line.
Interview Tip: Be able to explain MESI with a real-world analogy (e.g., shared whiteboard updates).
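As a study aid, here is a deliberately simplified sketch of how one core's copy of a line changes state; real MESI implementations also exchange bus or directory messages and perform writebacks, which this toy model only hints at in comments.

```cpp
#include <cstdio>

// The four MESI states for a single cache line, as seen by one core.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

// Highly simplified transition function for one core's copy of a line.
// `others_have_copy` only matters when reading a line we don't yet hold.
Mesi transition(Mesi s, Event e, bool others_have_copy) {
    switch (e) {
        case Event::LocalRead:
            if (s == Mesi::Invalid)
                return others_have_copy ? Mesi::Shared : Mesi::Exclusive;
            return s;                      // M/E/S are already readable
        case Event::LocalWrite:
            return Mesi::Modified;         // copies in other caches get invalidated
        case Event::RemoteRead:
            if (s == Mesi::Modified || s == Mesi::Exclusive)
                return Mesi::Shared;       // a Modified line is written back first
            return s;
        case Event::RemoteWrite:
            return Mesi::Invalid;          // our copy is now stale
    }
    return s;
}

const char* name(Mesi s) {
    switch (s) {
        case Mesi::Modified:  return "Modified";
        case Mesi::Exclusive: return "Exclusive";
        case Mesi::Shared:    return "Shared";
        case Mesi::Invalid:   return "Invalid";
    }
    return "?";
}

int main() {
    Mesi s = Mesi::Invalid;
    s = transition(s, Event::LocalRead, /*others_have_copy=*/false);  // -> Exclusive
    s = transition(s, Event::LocalWrite, false);                      // -> Modified
    std::printf("after local write:  %s\n", name(s));
    s = transition(s, Event::RemoteRead, false);                      // -> Shared
    s = transition(s, Event::RemoteWrite, false);                     // -> Invalid
    std::printf("after remote write: %s\n", name(s));
    return 0;
}
```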
9. Cache Performance Metrics
- Hit Rate = Hits ÷ Total Accesses.
- Higher = better (90%+ common).
- Miss Penalty = Extra cycles for the RAM fetch (can be ~100× slower than a cache hit).
- Average Memory Access Time (AMAT):
AMAT = Hit Time + Miss Rate × Miss Penalty
Example
- L1 hit time = 1 cycle, miss penalty = 100 cycles, miss rate = 5%.
- AMAT = 1 + 0.05 × 100 = 6 cycles.
- Compared with ~100 cycles for going straight to RAM, the cache brings the average access down to ~6 cycles.
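The same arithmetic extends to multiple cache levels by nesting the formula: the L1 miss penalty is itself the AMAT of the L2-plus-DRAM subsystem. A small sketch with assumed, illustrative latencies:

```cpp
#include <cstdio>

int main() {
    // Assumed, illustrative numbers (cycles / rates), not measured values.
    double l1_hit = 1,  l1_miss_rate = 0.05;
    double l2_hit = 10, l2_miss_rate = 0.20;   // of the accesses that reach L2
    double dram   = 100;

    // Single level, as in the worked example above: 1 + 0.05 * 100 = 6 cycles.
    double amat_one_level = l1_hit + l1_miss_rate * dram;

    // Two levels: an L1 miss costs the AMAT of the L2 + DRAM subsystem.
    double l2_amat        = l2_hit + l2_miss_rate * dram;      // 10 + 0.20 * 100 = 30
    double amat_two_level = l1_hit + l1_miss_rate * l2_amat;   // 1 + 0.05 * 30   = 2.5

    std::printf("AMAT (L1 + DRAM):      %.1f cycles\n", amat_one_level);
    std::printf("AMAT (L1 + L2 + DRAM): %.1f cycles\n", amat_two_level);
    return 0;
}
```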
10. Performance Pitfalls
- Cache Thrashing → Repeated misses due to poor locality.
- False Sharing → In multi-core systems, different variables sharing the same cache line cause unnecessary invalidations (see the sketch after this list).
- Working Set Too Large → Program uses more data than cache can hold → frequent misses.
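False sharing in particular is easy to reproduce: two threads incrementing two different counters slow each other down whenever the counters happen to share a 64-byte line. The sketch below assumes 64-byte lines and uses alignas(64) to give each counter its own line; removing the alignas typically makes the program noticeably slower on multi-core machines.

```cpp
#include <cstdio>
#include <thread>

constexpr int kIters = 10'000'000;

// Assuming 64-byte cache lines: alignas(64) places each counter on its own
// line, so the two threads no longer invalidate each other's cached copies.
struct alignas(64) PaddedCounter {
    long value = 0;
};

PaddedCounter counters[2];   // without alignas, both would likely share one line

void work(int idx) {
    for (int i = 0; i < kIters; ++i) {
        counters[idx].value++;   // each thread touches only its own counter
    }
}

int main() {
    std::thread t0(work, 0), t1(work, 1);
    t0.join();
    t1.join();
    std::printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```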
11. Real-World Context
Interview Relevance
- Explain cache mapping, hit/miss, AMAT.
- MESI protocol often comes up in multi-core discussions.
Practical Engineering
- Write cache-friendly code:
- Iterate over arrays in memory order (good spatial locality; see the sketch after this list).
- Reuse recently accessed data (good temporal locality).
- Performance debugging often reveals that the bottleneck is memory access, not CPU speed.
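A classic illustration of iterating in memory order: summing a row-major 2D array row by row touches consecutive addresses, while summing it column by column strides a full row per step and defeats spatial locality. A minimal sketch (timing the two loops, e.g. with std::chrono, usually shows a large gap):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;
    std::vector<int> grid(n * n, 1);      // row-major layout: grid[r * n + c]

    long long sum = 0;

    // Cache-friendly: the inner loop walks consecutive addresses.
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            sum += grid[r * n + c];

    // Cache-hostile: the inner loop strides n * sizeof(int) bytes each step,
    // so nearly every access lands on a different cache line.
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            sum += grid[r * n + c];

    std::printf("sum = %lld\n", sum);
    return 0;
}
```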
Modern Trends
- Hardware Prefetching: CPU preloads data it thinks you’ll need.
- Non-Volatile Memory (NVM) like Intel Optane sits between RAM and SSD.
- Big.LITTLE architectures: mobile CPUs optimize cache differently for power efficiency.
12. Further Reading
- Computer Organization and Design — Patterson & Hennessy
- Structured Computer Organization — Andrew S. Tanenbaum
- Intel & AMD architecture manuals (cache sizes, associativity, protocols).
- Research papers on cache coherence (MESI, MOESI, directory protocols).