GPUs & Parallel Computing

Graphics Processing Units (GPUs) have transformed modern computing.
Originally built for rendering graphics, GPUs now power AI training, scientific simulations, cryptocurrency mining, and big data analytics.

This article explains GPU architecture, how GPUs differ from CPUs, and their role in parallel computing — knowledge essential for system design interviews and engineering practice.

1. What is a GPU?

A Graphics Processing Unit (GPU) is a specialized processor optimized for massively parallel tasks.
While a CPU is optimized to run a few instruction streams as fast as possible, a GPU executes thousands of simpler operations simultaneously.

  • CPUs → Best at sequential workloads, complex control flow, and low-latency tasks.
  • GPUs → Best at parallel workloads (e.g., matrix operations, graphics, AI training).

Analogy:

  • CPU = 10 expert chefs, each perfecting a single elaborate dish.
  • GPU = 500 line cooks turning out thousands of identical burgers at once.
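To make the contrast concrete, here is a minimal CUDA sketch of the "line cook" model: a vector-addition kernel in which each of a million additions is handled by its own lightweight thread. The kernel name and sizes are illustrative, not from any particular codebase.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread performs exactly one addition — the "500 line cooks" model.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                      // ~1 million elements
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));   // unified memory keeps the sketch short
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover all n elements
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);              // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```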

2. CPU vs GPU

Aspect    | CPU                                            | GPU
--------- | ---------------------------------------------- | --------------------------------------
Cores     | Few (4–16 powerful cores)                      | Thousands of lightweight cores
Focus     | Sequential execution                           | Parallel execution
Cache     | Large caches, branch prediction                | Smaller caches, high memory bandwidth
Latency   | Low latency, fast decisions                    | High throughput, slower per task
Best For  | OS tasks, business logic, single-thread speed  | AI, graphics, HPC, big data

3. GPU Architecture

GPUs are designed for parallel execution:

  • Cores → Thousands, grouped into Streaming Multiprocessors (SMs, NVIDIA) or Compute Units (CUs, AMD).
  • Memory Hierarchy:
    • Global Memory (GDDR, HBM) → High bandwidth, but high latency.
    • Shared Memory → Fast, on-chip memory within each SM.
    • Registers → Per-thread local storage.
  • Thread Groups → Managed in warps (NVIDIA, 32 threads) or wavefronts (AMD, typically 64 threads).
  • SIMD Execution → Single instruction applied to multiple data items.
  • High-Bandwidth Memory (HBM) → In advanced GPUs, provides >1 TB/s bandwidth, crucial for AI workloads.

Interview Tip: Be able to describe how warps/wavefronts map to SIMD execution.
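As a concrete sketch of that mapping (assuming an NVIDIA GPU and CUDA 9+, which __shfl_down_sync requires): all 32 lanes of a warp execute each shuffle instruction in lockstep, reducing 32 values without touching shared or global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level reduction: the 32 threads of one warp execute every
// instruction together (SIMT), exchanging values register-to-register.
__global__ void warpSum(const float *in, float *out) {
    float v = in[threadIdx.x];                        // one value per lane, held in a register
    for (int offset = 16; offset > 0; offset >>= 1)   // 16, 8, 4, 2, 1
        v += __shfl_down_sync(0xffffffff, v, offset); // add the value from lane (i + offset)
    if (threadIdx.x == 0) *out = v;                   // lane 0 ends up with the warp's total
}

int main() {
    float *in, *out;
    cudaMallocManaged(&in, 32 * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < 32; ++i) in[i] = 1.0f;

    warpSum<<<1, 32>>>(in, out);        // exactly one warp: 32 threads
    cudaDeviceSynchronize();
    printf("warp sum = %.1f\n", *out);  // expect 32.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```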

4. Parallel Computing with GPUs

Parallel computing = doing many things at the same time.
GPUs are built for data parallelism → applying the same operation to many data points.

4.1 Types of Parallelism

  1. Data Parallelism → Same operation on multiple data elements (e.g., the matrix multiplication sketched after this list).
  2. Task Parallelism → Different tasks run simultaneously (common in CPUs, less so in GPUs).
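To make item 1 concrete, here is a hedged CUDA sketch of data parallelism: a naive N×N matrix multiply (row-major) in which every thread runs the same dot-product loop on a different output element. The names and the 16×16 block size are illustrative.

```cuda
#include <cuda_runtime.h>

// Data parallelism: the same dot-product loop runs simultaneously
// for every output element C[row][col].
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // row of A · column of B
        C[row * N + col] = sum;
    }
}

// Host-side launch: a 2D grid of 16×16 thread blocks tiling the N×N output.
void launchMatMul(const float *A, const float *B, float *C, int N) {
    dim3 threads(16, 16);
    dim3 blocks((N + 15) / 16, (N + 15) / 16);
    matMul<<<blocks, threads>>>(A, B, C, N);
}
```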

4.2 GPU Programming Models

  • CUDA (NVIDIA) → Parallel programming extensions for C/C++ (minimal example after this list).
  • OpenCL → Cross-platform, works with GPUs and other accelerators.
  • Graphics APIs (DirectX, OpenGL, Vulkan) → For rendering tasks.
  • Frameworks (TensorFlow, PyTorch, cuDNN) → Abstract GPU programming for ML/AI.
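As a minimal sketch of the CUDA workflow from the list above: allocate device memory, copy inputs across the PCIe bus, launch a kernel, and copy results back. The cudaMalloc/cudaMemcpy/cudaFree calls are the real CUDA runtime API; the kernel itself is illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: double every element in place.
__global__ void doubleAll(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float h[n];                         // host (CPU) buffer
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // device (GPU) memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // CPU -> GPU over PCIe
    doubleAll<<<(n + 255) / 256, 256>>>(d, n);                    // run on the GPU
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // GPU -> CPU; also synchronizes
    printf("h[3] = %.1f\n", h[3]);                                // expect 6.0
    cudaFree(d);
    return 0;
}
```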

4.3 Execution Model

  • Thread Blocks → Threads are grouped into blocks, and blocks are scheduled onto SMs.
  • SIMD Execution → Same instruction executed across many data points.
  • Latency Hiding → Thousands of threads scheduled so that when some wait for memory, others run.
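One common idiom that ties these three points together is the grid-stride loop, sketched below: launch many more work items than threads, and let the hardware scheduler hide memory latency by switching among resident warps. The kernel name is illustrative.

```cuda
// Grid-stride loop: each thread processes elements i, i + stride, i + 2*stride, ...
// With thousands of resident threads, the SM scheduler swaps in warps that are
// ready to run while others wait on global-memory loads (latency hiding).
__global__ void scale(float *x, float s, int n) {
    int stride = blockDim.x * gridDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= s;
}
```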

5. Applications of GPUs

  1. Graphics & Rendering → Games, CAD, 3D visualization.
  2. AI & Machine Learning
    • Training neural networks (massive matrix multiplications).
    • Inference (deploying trained models).
  3. Scientific Computing → Simulations in physics, chemistry, biology, climate.
  4. Cryptocurrency Mining → Parallel hashing computations.
  5. Data Analytics → Big data queries, filtering, transformations.
  6. Video Processing → Encoding, decoding, real-time effects.

Interview Tip: Always connect GPU applications to parallelism. For example, “Training a neural network is parallelizable because each neuron’s computations can be done simultaneously.”

6. Challenges of GPU Computing

  • Programming Complexity → Efficient GPU code requires knowledge of memory coalescing, thread divergence, occupancy.
  • Memory Bottlenecks → Even with high-bandwidth memory, data transfer between CPU & GPU can be slow (PCIe bottleneck).
  • Thread Divergence → If threads in a warp take different branches, execution slows down (illustrated after this list).
  • Power Consumption → GPUs can draw 200–600 W each, a major concern in data centers.
  • Limited General-Purpose Use → Not all workloads are parallelizable.
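To illustrate the divergence point above, compare these two toy kernels: in the first, adjacent lanes of each warp take different branches, so the warp executes both paths serially with inactive lanes masked off; in the second, the branch is warp-aligned, so all 32 lanes of a warp agree. Kernel names are illustrative.

```cuda
// Divergent: even and odd lanes of the same warp take different branches,
// so the warp runs both paths back to back (inactive lanes are masked).
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] *= 2.0f;   // even lanes execute this path first...
    else            x[i] += 1.0f;   // ...then odd lanes execute this one
}

// Divergence-free: the branch condition is constant across each 32-thread
// warp, so every warp takes exactly one path.
__global__ void warpAligned(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] *= 2.0f;  // whole warps branch together
    else                   x[i] += 1.0f;
}
```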

7. GPU vs TPU

  • GPUs → General-purpose parallel processors, flexible (graphics, AI, simulations).
  • TPUs (Tensor Processing Units) → Google’s accelerators, specialized for matrix multiplications in deep learning.
  • Key Tradeoff → GPUs = flexible, TPUs = more efficient but narrower scope.

Interview Tip: Be ready to answer: “When would you choose a GPU over a TPU?”

8. Real-World Context

  • Interview Relevance:
    • Compare CPUs, GPUs, and TPUs.
    • Discuss GPU bottlenecks (PCIe, memory).
    • Explain SIMD execution.
  • Engineering Use:
    • Training LLMs (e.g., GPT models).
    • Rendering AAA games.
    • HPC workloads (weather, genomics).
  • Modern Trends:
    • NVIDIA Hopper/Ampere → Tensor Cores for AI.
    • AMD RDNA/CDNA → Competing HPC & graphics architectures.
    • Multi-GPU Systems (DGX, TPU Pods) → Scale parallelism further.
    • Unified Memory → Emerging to reduce CPU–GPU data-transfer bottlenecks.

9. Further Reading

  • Computer Organization and Design — Patterson & Hennessy
  • NVIDIA CUDA Programming Guide
  • Parallel Programming in CUDA C/C++ (online tutorials)
  • Blogs from NVIDIA, AMD, and Google on GPU/TPU advancements
