Caches & Memory Hierarchy
Modern CPUs are extremely fast, but memory access often becomes the bottleneck.
This article explores the memory hierarchy and the role of caches, highlighting how they optimize performance — crucial knowledge for system design interviews and real-world engineering.
1. What is the Memory Hierarchy?
The memory hierarchy organizes storage systems by speed, cost, and size:
- Registers → Fastest, inside the CPU, store temporary data.
- Cache Memory (SRAM) → Very fast, small, located close to CPU.
- Main Memory (DRAM / RAM) → Larger but slower.
- Secondary Storage (SSD/HDD) → Very large, much slower.
- Tertiary Storage (tapes/archives) → Extremely large, extremely slow.
Goal: Give the CPU fast access to frequently used data while keeping costs manageable.
2. Why Do We Need Caches?
- CPU clock cycles are in nanoseconds.
- DRAM access can take tens to hundreds of nanoseconds.
- Without caches, CPUs would spend most of their time waiting for data.
Caches act as a “middle layer” — small but fast memory that stores hot (frequently used) data.
3. Cache Fundamentals
3.1 Key Characteristics
- Speed → Built from SRAM (faster but costlier than DRAM).
- Size → KB–MB range (vs. GBs of RAM).
- Proximity → On or very close to CPU cores.
3.2 Principles of Locality
- Temporal Locality → If data is used now, it’ll likely be used again soon.
- Spatial Locality → If one memory address is used, nearby ones are likely needed too.
Caches exploit these principles to decide what to keep.
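To make the two principles concrete, here is a minimal C++ sketch (an illustration, not code from any particular system): the running total is reused on every iteration (temporal locality), and the array is walked in memory order, so each cache line fetched on a miss serves several consecutive accesses (spatial locality).

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> data(1'000'000, 1);

    // Temporal locality: `total` is read and updated on every iteration,
    // so it stays in a register or the L1 cache for the whole loop.
    long long total = 0;

    // Spatial locality: elements are visited in memory order, so a single
    // 64-byte line fetched on a miss covers the next ~16 int accesses.
    for (std::size_t i = 0; i < data.size(); ++i) {
        total += data[i];
    }

    std::printf("sum = %lld\n", total);
    return 0;
}
```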
4. How Caches Work
- Cache Hit → Data is in cache → fast access.
- Cache Miss → Data not in cache → fetch from RAM (slow).
- Cache Line → Data fetched in blocks (e.g., 64 bytes) to benefit from spatial locality.
Write Policies
- Write-Through → Update both cache + RAM immediately.
- Write-Back → Update cache only; write to RAM when line is evicted (faster, but riskier).
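As a rough illustration of the trade-off (a toy software model, not how real cache hardware is built), the sketch below counts how many RAM writes each policy performs for repeated stores to the same line: write-through pays one RAM write per store, while write-back defers everything to a single write when the dirty line is evicted.

```cpp
#include <cstdio>
#include <unordered_map>

// Toy single-level cache model: one entry per cache-line address.
struct ToyCache {
    std::unordered_map<long, bool> lines; // address -> dirty flag
    long ram_writes = 0;
    bool write_back;                      // false = write-through

    explicit ToyCache(bool wb) : write_back(wb) {}

    void write(long addr) {
        if (write_back) {
            lines[addr] = true;           // mark dirty; RAM untouched for now
        } else {
            lines[addr] = false;
            ++ram_writes;                 // write-through: RAM updated immediately
        }
    }

    void evict(long addr) {
        auto it = lines.find(addr);
        if (it != lines.end()) {
            if (it->second) ++ram_writes; // dirty line flushed on eviction
            lines.erase(it);
        }
    }
};

int main() {
    ToyCache wt(false), wb(true);
    for (int i = 0; i < 1000; ++i) {      // 1000 stores to the same line
        wt.write(0x40);
        wb.write(0x40);
    }
    wt.evict(0x40);
    wb.evict(0x40);
    std::printf("write-through RAM writes: %ld\n", wt.ram_writes); // 1000
    std::printf("write-back    RAM writes: %ld\n", wb.ram_writes); // 1
    return 0;
}
```

With 1,000 stores to one line, the write-through model performs 1,000 RAM writes and the write-back model performs one, which is exactly why write-back is faster but risks losing buffered updates if the cached data disappears before eviction.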
5. Cache Mapping Techniques
How does the CPU decide where a memory block may be placed in the cache?
Direct Mapping
- Each block maps to exactly one cache line.
- ✅ Simple, fast lookup.
- ❌ Prone to conflicts (different blocks mapping to the same line).
Fully Associative
- Any block can go anywhere in cache.
- ✅ Flexible, fewer conflicts.
- ❌ Expensive hardware (must search all lines).
Set-Associative (N-way)
- Compromise: Cache divided into sets; block can go into any line in a set.
- ✅ Balance between simplicity and flexibility.
- Example: 4-way set-associative cache = each set has 4 lines.
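To see where a given address lands, hardware splits it into an offset within the line, a set index, and a tag. The sketch below shows that arithmetic for assumed, illustrative parameters (32 KB cache, 64-byte lines, 4-way, so 128 sets); real CPUs differ in the exact sizes.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative parameters: 32 KB, 4-way set-associative, 64-byte lines.
    const uint64_t cache_size = 32 * 1024;
    const uint64_t line_size  = 64;
    const uint64_t ways       = 4;
    const uint64_t num_sets   = cache_size / (line_size * ways); // 128 sets

    uint64_t addr = 0x7ffd1234abcd;                  // arbitrary example address

    uint64_t offset = addr % line_size;              // byte within the line
    uint64_t set    = (addr / line_size) % num_sets; // which set the block maps to
    uint64_t tag    = addr / (line_size * num_sets); // identifies the block within its set

    std::printf("offset=%llu set=%llu tag=%llx\n",
                (unsigned long long)offset,
                (unsigned long long)set,
                (unsigned long long)tag);
    return 0;
}
```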
6. Cache Replacement Policies
When the cache (or a set within it) is full, something must be evicted:
- Least Recently Used (LRU) → Evict data unused longest.
- First-In-First-Out (FIFO) → Evict oldest data.
- Random → Pick a line randomly.
LRU is common but expensive to implement exactly in hardware → approximations (e.g., pseudo-LRU) are often used.
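A software analogue of LRU (hardware typically uses cheaper approximations, but the idea is the same) can be built from a recency-ordered list plus a map for O(1) lookup. A minimal sketch:

```cpp
#include <cstdio>
#include <list>
#include <unordered_map>

// Minimal LRU "cache" of fixed capacity, keyed by block address.
class LruCache {
    std::size_t capacity_;
    std::list<long> order_;                                    // front = most recently used
    std::unordered_map<long, std::list<long>::iterator> pos_;
public:
    explicit LruCache(std::size_t cap) : capacity_(cap) {}

    bool access(long addr) {                                   // returns true on a hit
        auto it = pos_.find(addr);
        if (it != pos_.end()) {
            order_.splice(order_.begin(), order_, it->second); // move to front
            return true;
        }
        if (order_.size() == capacity_) {                      // evict the least recently used
            pos_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(addr);
        pos_[addr] = order_.begin();
        return false;
    }
};

int main() {
    LruCache cache(2);
    for (long addr : {1, 2, 1, 3, 2}) {
        std::printf("access %ld -> %s\n", addr, cache.access(addr) ? "hit" : "miss");
    }
    // 1: miss, 2: miss, 1: hit, 3: miss (evicts 2), 2: miss
    return 0;
}
```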
7. Cache Levels (L1, L2, L3)
Modern CPUs use a multi-level cache hierarchy:
L1 Cache → Per core, very small (32–64 KB), fastest. Often split into:
- L1i (instructions)
- L1d (data)
L2 Cache → Per core, larger (256 KB–1 MB), slower than L1.
L3 Cache → Shared across cores, very large (4–50 MB), slower but still faster than RAM.
Key Insight: Caches closer to the core (L1) prioritize speed; levels farther out (L3) prioritize capacity.
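On Linux with glibc, these sizes can be queried at runtime via sysconf; the constants below are glibc extensions, so treat this as a platform-specific sketch that may report 0 or -1 elsewhere.

```cpp
#include <cstdio>
#include <unistd.h>

int main() {
    // glibc-specific sysconf names; not portable to every platform.
    long l1d  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3   = sysconf(_SC_LEVEL3_CACHE_SIZE);
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

    std::printf("L1d: %ld bytes, L2: %ld bytes, L3: %ld bytes, line: %ld bytes\n",
                l1d, l2, l3, line);
    return 0;
}
```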
8. Cache Coherence (Multi-Core Issue)
In multi-core systems, each core has its own cache. Problem: what if two cores cache the same data and one modifies it?
Solutions
- MESI Protocol
- Cache lines marked as: Modified, Exclusive, Shared, Invalid.
- Ensures at most one core holds a writable copy; stale copies in other caches are invalidated before a write.
- Directory-Based Protocols
- Central directory tracks which core holds which cache line.
Interview Tip: Be able to explain MESI with a real-world analogy (e.g., shared whiteboard updates).
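As a study aid, here is a deliberately simplified sketch of how one core's copy of a line changes state; real MESI implementations also exchange bus or directory messages and perform writebacks, which this toy model only hints at in comments.

```cpp
#include <cstdio>

// The four MESI states for a single cache line, as seen by one core.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

// Highly simplified transition function for one core's copy of a line.
// `others_have_copy` only matters when reading a line we don't yet hold.
Mesi transition(Mesi s, Event e, bool others_have_copy) {
    switch (e) {
        case Event::LocalRead:
            if (s == Mesi::Invalid)
                return others_have_copy ? Mesi::Shared : Mesi::Exclusive;
            return s;                      // M/E/S are already readable
        case Event::LocalWrite:
            return Mesi::Modified;         // copies in other caches get invalidated
        case Event::RemoteRead:
            if (s == Mesi::Modified || s == Mesi::Exclusive)
                return Mesi::Shared;       // a Modified line is written back first
            return s;
        case Event::RemoteWrite:
            return Mesi::Invalid;          // our copy is now stale
    }
    return s;
}

const char* name(Mesi s) {
    switch (s) {
        case Mesi::Modified:  return "Modified";
        case Mesi::Exclusive: return "Exclusive";
        case Mesi::Shared:    return "Shared";
        case Mesi::Invalid:   return "Invalid";
    }
    return "?";
}

int main() {
    Mesi s = Mesi::Invalid;
    s = transition(s, Event::LocalRead, /*others_have_copy=*/false);  // -> Exclusive
    s = transition(s, Event::LocalWrite, false);                      // -> Modified
    std::printf("after local write:  %s\n", name(s));
    s = transition(s, Event::RemoteRead, false);                      // -> Shared
    s = transition(s, Event::RemoteWrite, false);                     // -> Invalid
    std::printf("after remote write: %s\n", name(s));
    return 0;
}
```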
9. Cache Performance Metrics
- Hit Rate = Hits ÷ Total Accesses.
- Higher = better (90%+ common).
- Miss Penalty = Extra cycles for the RAM fetch (can be ~100× slower than a cache hit).
- Average Memory Access Time (AMAT):
AMAT = Hit Time + Miss Rate × Miss Penalty
Example
- L1 hit time = 1 cycle, miss penalty = 100 cycles, miss rate = 5%.
- AMAT = 1 + 0.05 × 100 = 6 cycles.
- Compared with ~100 cycles for going straight to RAM, the cache brings the average access down to ~6 cycles.
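The same arithmetic extends to multiple cache levels by nesting the formula: the L1 miss penalty is itself the AMAT of the L2-plus-DRAM subsystem. A small sketch with assumed, illustrative latencies:

```cpp
#include <cstdio>

int main() {
    // Assumed, illustrative numbers (cycles / rates), not measured values.
    double l1_hit = 1,  l1_miss_rate = 0.05;
    double l2_hit = 10, l2_miss_rate = 0.20;   // of the accesses that reach L2
    double dram   = 100;

    // Single level, as in the worked example above: 1 + 0.05 * 100 = 6 cycles.
    double amat_one_level = l1_hit + l1_miss_rate * dram;

    // Two levels: an L1 miss costs the AMAT of the L2 + DRAM subsystem.
    double l2_amat        = l2_hit + l2_miss_rate * dram;      // 10 + 0.20 * 100 = 30
    double amat_two_level = l1_hit + l1_miss_rate * l2_amat;   // 1 + 0.05 * 30   = 2.5

    std::printf("AMAT (L1 + DRAM):      %.1f cycles\n", amat_one_level);
    std::printf("AMAT (L1 + L2 + DRAM): %.1f cycles\n", amat_two_level);
    return 0;
}
```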
10. Performance Pitfalls
- Cache Thrashing → Repeated misses due to poor locality.
- False Sharing → In multi-core systems, different variables sharing the same cache line cause unnecessary invalidations (see the sketch after this list).
- Working Set Too Large → Program uses more data than cache can hold → frequent misses.
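False sharing in particular is easy to reproduce: two threads incrementing two different counters slow each other down whenever the counters happen to share a 64-byte line. The sketch below assumes 64-byte lines and uses alignas(64) to give each counter its own line; removing the alignas typically makes the program noticeably slower on multi-core machines.

```cpp
#include <cstdio>
#include <thread>

constexpr int kIters = 10'000'000;

// Assuming 64-byte cache lines: alignas(64) places each counter on its own
// line, so the two threads no longer invalidate each other's cached copies.
struct alignas(64) PaddedCounter {
    long value = 0;
};

PaddedCounter counters[2];   // without alignas, both would likely share one line

void work(int idx) {
    for (int i = 0; i < kIters; ++i) {
        counters[idx].value++;   // each thread touches only its own counter
    }
}

int main() {
    std::thread t0(work, 0), t1(work, 1);
    t0.join();
    t1.join();
    std::printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```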
11. Real-World Context
Interview Relevance
- Explain cache mapping, hit/miss, AMAT.
- MESI protocol often comes up in multi-core discussions.
Practical Engineering
- Write cache-friendly code:
- Iterate over arrays in memory order (good spatial locality; see the sketch after this list).
- Reuse recently accessed data (good temporal locality).
- Performance debugging often reveals that the bottleneck is memory access, not CPU speed.
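A classic illustration of iterating in memory order: summing a row-major 2D array row by row touches consecutive addresses, while summing it column by column strides a full row per step and defeats spatial locality. A minimal sketch (timing the two loops, e.g. with std::chrono, usually shows a large gap):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;
    std::vector<int> grid(n * n, 1);      // row-major layout: grid[r * n + c]

    long long sum = 0;

    // Cache-friendly: the inner loop walks consecutive addresses.
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            sum += grid[r * n + c];

    // Cache-hostile: the inner loop strides n * sizeof(int) bytes each step,
    // so nearly every access lands on a different cache line.
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            sum += grid[r * n + c];

    std::printf("sum = %lld\n", sum);
    return 0;
}
```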
Modern Trends
- Hardware Prefetching: CPU preloads data it thinks you’ll need.
- Non-Volatile Memory (NVM) like Intel Optane sits between RAM and SSD.
- Big.LITTLE architectures: mobile CPUs optimize cache differently for power efficiency.
12. Further Reading
- Computer Organization and Design — Patterson & Hennessy
- Structured Computer Organization — Andrew S. Tanenbaum
- Intel & AMD architecture manuals (cache sizes, associativity, protocols).
- Research papers on cache coherence (MESI, MOESI, directory protocols).