GPUs & Parallel Computing

Graphics Processing Units (GPUs) have transformed modern computing.
Originally built for rendering graphics, GPUs now power AI training, scientific simulations, cryptocurrency mining, and big data analytics.

This article explains GPU architecture, how GPUs differ from CPUs, and their role in parallel computing — knowledge essential for system design interviews and engineering practice.

1. What is a GPU?

A Graphics Processing Unit (GPU) is a specialized processor optimized for massively parallel tasks.
While a CPU is optimized to run a few instruction streams as fast as possible, a GPU executes thousands of simpler operations simultaneously.

  • CPUs → Best at sequential workloads, complex control flow, and low-latency tasks.
  • GPUs → Best at parallel workloads (e.g., matrix operations, graphics, AI training).

Analogy:

  • CPU = 10 expert chefs, each perfecting a single elaborate dish.
  • GPU = 500 line cooks turning out thousands of identical burgers at once.
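To make the contrast concrete, here is a minimal CUDA sketch of the "line cook" model: a vector-addition kernel in which each of a million additions is handled by its own lightweight thread. The kernel name and sizes are illustrative, not from any particular codebase.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread performs exactly one addition — the "500 line cooks" model.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                      // ~1 million elements
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));   // unified memory keeps the sketch short
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover all n elements
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);              // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```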

2. CPU vs GPU

Aspect    | CPU                                            | GPU
--------- | ---------------------------------------------- | --------------------------------------
Cores     | Few (4–16 powerful cores)                      | Thousands of lightweight cores
Focus     | Sequential execution                           | Parallel execution
Cache     | Large caches, branch prediction                | Smaller caches, high memory bandwidth
Latency   | Low latency, fast decisions                    | High throughput, slower per task
Best For  | OS tasks, business logic, single-thread speed  | AI, graphics, HPC, big data

3. GPU Architecture

GPUs are designed for parallel execution:

  • Cores → Thousands, grouped into Streaming Multiprocessors (SMs, NVIDIA) or Compute Units (CUs, AMD).
  • Memory Hierarchy:
    • Global Memory (GDDR, HBM) → High bandwidth, but high latency.
    • Shared Memory → Fast, on-chip memory within each SM.
    • Registers → Per-thread local storage.
  • Thread Groups → Managed in warps (NVIDIA, 32 threads) or wavefronts (AMD, typically 64 threads).
  • SIMD Execution → Single instruction applied to multiple data items.
  • High-Bandwidth Memory (HBM) → In advanced GPUs, provides >1 TB/s bandwidth, crucial for AI workloads.

Interview Tip: Be able to describe how warps/wavefronts map to SIMD execution.
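As a concrete sketch of that mapping (assuming an NVIDIA GPU and CUDA 9+, which __shfl_down_sync requires): all 32 lanes of a warp execute each shuffle instruction in lockstep, reducing 32 values without touching shared or global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level reduction: the 32 threads of one warp execute every
// instruction together (SIMT), exchanging values register-to-register.
__global__ void warpSum(const float *in, float *out) {
    float v = in[threadIdx.x];                        // one value per lane, held in a register
    for (int offset = 16; offset > 0; offset >>= 1)   // 16, 8, 4, 2, 1
        v += __shfl_down_sync(0xffffffff, v, offset); // add the value from lane (i + offset)
    if (threadIdx.x == 0) *out = v;                   // lane 0 ends up with the warp's total
}

int main() {
    float *in, *out;
    cudaMallocManaged(&in, 32 * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < 32; ++i) in[i] = 1.0f;

    warpSum<<<1, 32>>>(in, out);        // exactly one warp: 32 threads
    cudaDeviceSynchronize();
    printf("warp sum = %.1f\n", *out);  // expect 32.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```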

4. Parallel Computing with GPUs

Parallel computing = doing many things at the same time.
GPUs are built for data parallelism → applying the same operation to many data points.

4.1 Types of Parallelism

  1. Data Parallelism → Same operation on multiple data elements (e.g., the matrix multiplication sketched after this list).
  2. Task Parallelism → Different tasks run simultaneously (common in CPUs, less so in GPUs).
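To make item 1 concrete, here is a hedged CUDA sketch of data parallelism: a naive N×N matrix multiply (row-major) in which every thread runs the same dot-product loop on a different output element. The names and the 16×16 block size are illustrative.

```cuda
#include <cuda_runtime.h>

// Data parallelism: the same dot-product loop runs simultaneously
// for every output element C[row][col].
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // row of A · column of B
        C[row * N + col] = sum;
    }
}

// Host-side launch: a 2D grid of 16×16 thread blocks tiling the N×N output.
void launchMatMul(const float *A, const float *B, float *C, int N) {
    dim3 threads(16, 16);
    dim3 blocks((N + 15) / 16, (N + 15) / 16);
    matMul<<<blocks, threads>>>(A, B, C, N);
}
```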

4.2 GPU Programming Models

  • CUDA (NVIDIA) → Parallel programming extensions for C/C++ (minimal example after this list).
  • OpenCL → Cross-platform, works with GPUs and other accelerators.
  • Graphics APIs (DirectX, OpenGL, Vulkan) → For rendering tasks.
  • Frameworks (TensorFlow, PyTorch, cuDNN) → Abstract GPU programming for ML/AI.
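As a minimal sketch of the CUDA workflow from the list above: allocate device memory, copy inputs across the PCIe bus, launch a kernel, and copy results back. The cudaMalloc/cudaMemcpy/cudaFree calls are the real CUDA runtime API; the kernel itself is illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: double every element in place.
__global__ void doubleAll(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float h[n];                         // host (CPU) buffer
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // device (GPU) memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // CPU -> GPU over PCIe
    doubleAll<<<(n + 255) / 256, 256>>>(d, n);                    // run on the GPU
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // GPU -> CPU; also synchronizes
    printf("h[3] = %.1f\n", h[3]);                                // expect 6.0
    cudaFree(d);
    return 0;
}
```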

4.3 Execution Model

  • Thread Blocks → Threads are grouped into blocks, and blocks are scheduled onto SMs.
  • SIMD Execution → Same instruction executed across many data points.
  • Latency Hiding → Thousands of threads scheduled so that when some wait for memory, others run.
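One common idiom that ties these three points together is the grid-stride loop, sketched below: launch many more work items than threads, and let the hardware scheduler hide memory latency by switching among resident warps. The kernel name is illustrative.

```cuda
// Grid-stride loop: each thread processes elements i, i + stride, i + 2*stride, ...
// With thousands of resident threads, the SM scheduler swaps in warps that are
// ready to run while others wait on global-memory loads (latency hiding).
__global__ void scale(float *x, float s, int n) {
    int stride = blockDim.x * gridDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= s;
}
```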

5. Applications of GPUs

  1. Graphics & Rendering → Games, CAD, 3D visualization.
  2. AI & Machine Learning
    • Training neural networks (massive matrix multiplications).
    • Inference (deploying trained models).
  3. Scientific Computing → Simulations in physics, chemistry, biology, climate.
  4. Cryptocurrency Mining → Parallel hashing computations.
  5. Data Analytics → Big data queries, filtering, transformations.
  6. Video Processing → Encoding, decoding, real-time effects.

Interview Tip: Always connect GPU applications to parallelism. For example, “Training a neural network is parallelizable because each neuron’s computations can be done simultaneously.”

6. Challenges of GPU Computing

  • Programming Complexity → Efficient GPU code requires knowledge of memory coalescing, thread divergence, occupancy.
  • Memory Bottlenecks → Even with high-bandwidth memory, data transfer between CPU & GPU can be slow (PCIe bottleneck).
  • Thread Divergence → If threads in a warp take different branches, execution slows down (illustrated after this list).
  • Power Consumption → GPUs can draw 200–600 W each, a major concern in data centers.
  • Limited General-Purpose Use → Not all workloads are parallelizable.
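To illustrate the divergence point above, compare these two toy kernels: in the first, adjacent lanes of each warp take different branches, so the warp executes both paths serially with inactive lanes masked off; in the second, the branch is warp-aligned, so all 32 lanes of a warp agree. Kernel names are illustrative.

```cuda
// Divergent: even and odd lanes of the same warp take different branches,
// so the warp runs both paths back to back (inactive lanes are masked).
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] *= 2.0f;   // even lanes execute this path first...
    else            x[i] += 1.0f;   // ...then odd lanes execute this one
}

// Divergence-free: the branch condition is constant across each 32-thread
// warp, so every warp takes exactly one path.
__global__ void warpAligned(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] *= 2.0f;  // whole warps branch together
    else                   x[i] += 1.0f;
}
```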

7. GPU vs TPU

  • GPUs → General-purpose parallel processors, flexible (graphics, AI, simulations).
  • TPUs (Tensor Processing Units) → Google’s accelerators, specialized for matrix multiplications in deep learning.
  • Key Tradeoff → GPUs = flexible, TPUs = more efficient but narrower scope.

Interview Tip: Be ready to answer: “When would you choose a GPU over a TPU?”

8. Real-World Context

  • Interview Relevance:
    • Compare CPUs, GPUs, and TPUs.
    • Discuss GPU bottlenecks (PCIe, memory).
    • Explain SIMD execution.
  • Engineering Use:
    • Training LLMs (e.g., GPT models).
    • Rendering AAA games.
    • HPC workloads (weather, genomics).
  • Modern Trends:
    • NVIDIA Hopper/Ampere → Tensor Cores for AI.
    • AMD RDNA/CDNA → Competing HPC & graphics architectures.
    • Multi-GPU Systems (DGX, TPU Pods) → Scale parallelism further.
    • Unified Memory → Emerging to reduce CPU–GPU data-transfer bottlenecks.

9. Further Reading

  • Computer Organization and Design — Patterson & Hennessy
  • NVIDIA CUDA Programming Guide
  • Parallel Programming in CUDA C/C++ (online tutorials)
  • Blogs from NVIDIA, AMD, and Google on GPU/TPU advancements
