Chat System (WhatsApp / Slack)
Design a chat system that supports 1:1 messaging, group chats, message delivery guarantees, and presence.
This is a classic high-level design (HLD) problem because it touches real-time communication, consistency, and scaling.
1. Requirements
Functional
- Send/receive 1:1 and group messages.
- Show message delivery status (sent, delivered, read).
- Support media (images, files, videos).
- Show user presence (online/offline, typing).
- Push notifications for new messages.
Non-functional (NFRs)
- Low-latency delivery (target: under 100 ms end to end when the recipient is online).
- High availability (works even under partial failures).
- Durable storage (no message loss).
- Scale to millions of concurrent users.
2. Workload Estimation (Example)
Assume:
- 100M DAU, each sends 50 messages/day → 5B messages/day.
- Messages/sec = 5,000,000,000 / 86,400 ≈ 57,870 MPS avg.
- Peak (×5–10) ≈ 290k–580k MPS → design for ~500k MPS at peak.
Storage:
- Each message ≈ 300 bytes (text + metadata).
- Daily = 5B × 300 bytes = 1.5 TB; yearly = 1.5 TB × 365 ≈ 550 TB (before replication).
- With 3× replication → 1.65 PB/year.
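The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constant names are illustrative, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
# Back-of-envelope check of the estimates above (decimal units assumed).
DAU = 100_000_000
MSGS_PER_USER_PER_DAY = 50
BYTES_PER_MESSAGE = 300
SECONDS_PER_DAY = 86_400
REPLICATION_FACTOR = 3
TB = 10**12
PB = 10**15

messages_per_day = DAU * MSGS_PER_USER_PER_DAY      # 5 billion
avg_mps = messages_per_day / SECONDS_PER_DAY        # about 57,870
peak_mps_low = avg_mps * 5                          # about 289k
peak_mps_high = avg_mps * 10                        # about 579k

daily_bytes = messages_per_day * BYTES_PER_MESSAGE  # 1.5 TB
yearly_bytes = daily_bytes * 365                    # 547.5 TB
replicated_yearly_bytes = yearly_bytes * REPLICATION_FACTOR

print(f"avg MPS: {avg_mps:,.0f}")
print(f"peak MPS: {peak_mps_low:,.0f} to {peak_mps_high:,.0f}")
print(f"daily: {daily_bytes / TB:.2f} TB, yearly: {yearly_bytes / TB:.1f} TB")
print(f"yearly with replication: {replicated_yearly_bytes / PB:.2f} PB")
```

The exact replicated figure is about 1.64 PB; the 1.65 PB above comes from rounding the yearly volume to 550 TB before multiplying.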
3. High-Level Architecture
Components:
- API Gateway / Load Balancer → entry point.
- Chat Service → handles send/receive logic.
- Message Queue / Pub-Sub → decouple sender/receiver (Kafka, RabbitMQ).
- Storage Layer → message persistence (Cassandra, DynamoDB, HBase).
- Presence Service → track online/offline/typing status.
- Push Notification Service → mobile push (APNs, FCM).
- Media Service → upload media → stored in object store (S3, GCS) + CDN.
4. Message Flow
1:1 Chat
- Sender sends message → API Gateway → Chat Service.
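- Chat Service persists the message to the DB and publishes it to the message queue.
- Receiver’s device keeps a long-lived connection (e.g., WebSocket) to a chat server that consumes from the queue and pushes the message in real time; if the receiver is offline, the message waits and a push notification is sent.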
- Delivery/read receipts updated asynchronously.
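A minimal in-memory sketch of the 1:1 flow above (send → store → deliver → receipts). Everything here is an assumption made for illustration: `ChatService`, `Message`, and `Status` are hypothetical names, a dict stands in for the DB, and a plain list stands in for each user's open connection.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import count


class Status(Enum):
    SENT = "sent"
    DELIVERED = "delivered"
    READ = "read"


@dataclass
class Message:
    message_id: int
    sender: str
    recipient: str
    text: str
    status: Status = Status.SENT


class ChatService:
    """Hypothetical stand-in for the Chat Service, DB, and queue."""

    def __init__(self):
        self._ids = count(1)
        self.db = {}           # message_id -> Message (stands in for the DB)
        self.connections = {}  # user -> list acting as an open connection
        self.pending = {}      # user -> messages held while the user is offline

    def send(self, sender, recipient, text):
        msg = Message(next(self._ids), sender, recipient, text)
        self.db[msg.message_id] = msg  # persist before attempting delivery
        if recipient in self.connections:
            self.connections[recipient].append(msg)
        else:
            self.pending.setdefault(recipient, []).append(msg)
        return msg.message_id

    def connect(self, user):
        # On connect, flush anything that arrived while the user was offline.
        inbox = self.connections.setdefault(user, [])
        inbox.extend(self.pending.pop(user, []))
        return inbox

    def ack_delivered(self, message_id):
        msg = self.db[message_id]
        if msg.status is Status.SENT:  # never downgrade READ to DELIVERED
            msg.status = Status.DELIVERED

    def ack_read(self, message_id):
        self.db[message_id].status = Status.READ


service = ChatService()
mid = service.send("alice", "bob", "hi")  # bob is offline: status stays SENT
inbox = service.connect("bob")            # pending message is flushed to bob
service.ack_delivered(mid)                # receipt arrives asynchronously
```

The status only advances when the receiver acknowledges, which is what lets the sender's UI distinguish sent from delivered.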
Group Chat
- Fan-out to group members via message queue.
- For large groups, fan out in batches and store the message body once; each member’s inbox holds a pointer (message ID) to it instead of a full copy.
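One way to picture batched fan-out with a shared message body is the sketch below. `GroupFanout` and its fields are hypothetical names, the dicts stand in for real storage, and the batch size of 1000 is an arbitrary assumption.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


class GroupFanout:
    """Hypothetical sketch: one stored body, per-member pointers."""

    def __init__(self):
        self.message_store = {}  # message_id -> body, stored exactly once
        self.inboxes = {}        # member -> list of message_id pointers

    def fan_out(self, message_id, body, members, batch_size=1000):
        self.message_store[message_id] = body
        batches = 0
        for batch in batched(members, batch_size):
            # In a real system each batch could be one queue task or one
            # bulk write; here it is just a loop.
            for member in batch:
                self.inboxes.setdefault(member, []).append(message_id)
            batches += 1
        return batches


fanout = GroupFanout()
group_members = [f"user{i}" for i in range(2500)]
num_batches = fanout.fan_out("m1", "hello group", group_members)
```

With 2,500 members and a batch size of 1,000 this does three batches of pointer writes while the body is stored once.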
5. Delivery Guarantees
- At-least-once delivery → retries on failure.
- Idempotency keys to prevent duplicates.
- Acknowledgments from receiver for delivery status.
- Store undelivered messages in a persistent queue until acked.
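The points above combine as follows: the sender side keeps each message until it is acked and redelivers it otherwise, and the receiver deduplicates by idempotency key. A minimal sketch, with `DeliveryQueue` and `Receiver` as hypothetical names and in-memory structures in place of a persistent queue:

```python
class DeliveryQueue:
    """Holds messages until acked; stands in for a persistent queue."""

    def __init__(self):
        self.unacked = {}  # idempotency_key -> payload

    def enqueue(self, key, payload):
        # Re-enqueueing the same key does not create a second entry.
        self.unacked.setdefault(key, payload)

    def redeliver(self):
        # Everything not yet acked is (re)sent: at-least-once delivery.
        return list(self.unacked.items())

    def ack(self, key):
        self.unacked.pop(key, None)


class Receiver:
    """Applies each idempotency key at most once."""

    def __init__(self):
        self.seen = set()
        self.messages = []

    def receive(self, key, payload):
        if key not in self.seen:
            self.seen.add(key)
            self.messages.append(payload)
        return key  # ack duplicates too, so the sender stops retrying


delivery_queue = DeliveryQueue()
receiver = Receiver()
delivery_queue.enqueue("key-1", "hi")

# First attempt: the receiver gets the message, but the ack is lost.
for key, payload in delivery_queue.redeliver():
    receiver.receive(key, payload)

# Retry: the receiver sees a duplicate, ignores it, and the ack gets through.
for key, payload in delivery_queue.redeliver():
    delivery_queue.ack(receiver.receive(key, payload))
```

After the retry the receiver holds one copy of the message and the queue is empty.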
6. Data Storage
- Messages → stored in partitioned wide-column DB (Cassandra).
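- Partition key = chat_id; clustering key = message_id (time-ordered), so a chat’s messages live in one partition, sorted for range reads. Including message_id in the partition key would put every message in its own partition.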
- Optimized for write-heavy workloads.
- Indexes → user-based indexes for search/history.
- Media → stored in object store (S3, GCS) with CDN for delivery.
- Metadata → delivery status, read receipts stored separately.
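To illustrate why messages are grouped by chat and ordered by message ID, here is a small in-memory model of that layout. It is not Cassandra code: `MessageTable` is a hypothetical name, a dict plays the role of partitions, and message IDs are assumed to be time-ordered integers.

```python
import bisect
from collections import defaultdict


class MessageTable:
    """In-memory model: partition by chat_id, order by message_id."""

    def __init__(self):
        self._ids = defaultdict(list)  # chat_id -> sorted message_ids
        self._rows = {}                # (chat_id, message_id) -> body

    def insert(self, chat_id, message_id, body):
        if (chat_id, message_id) not in self._rows:
            bisect.insort(self._ids[chat_id], message_id)
        self._rows[(chat_id, message_id)] = body  # upsert semantics

    def latest(self, chat_id, limit=20):
        """Newest-first page of one chat's history."""
        ids = self._ids.get(chat_id, [])
        page = ids[max(0, len(ids) - limit):]
        return [(m, self._rows[(chat_id, m)]) for m in reversed(page)]

    def after(self, chat_id, last_seen_id):
        """Messages newer than last_seen_id, e.g. to sync after reconnect."""
        ids = self._ids.get(chat_id, [])
        start = bisect.bisect_right(ids, last_seen_id)
        return [(m, self._rows[(chat_id, m)]) for m in ids[start:]]


table = MessageTable()
table.insert("chat-1", 3, "third")
table.insert("chat-1", 1, "first")
table.insert("chat-1", 2, "second")
table.insert("chat-2", 1, "other chat")
```

Both read paths touch a single chat's partition and rely on the sort order, which is the access pattern this kind of schema is meant to serve.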
7. Presence & State
- Maintain ephemeral presence in in-memory store (Redis).
- Clients send periodic heartbeats to update presence.
- Typing indicators → short-lived events (not persisted).
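Heartbeat-based presence can be sketched as a last-seen timestamp with a time-to-live. The version below uses a plain dict rather than Redis; `PresenceTracker`, the 30-second TTL, and the injectable `clock` parameter are all assumptions for illustration.

```python
import time


class PresenceTracker:
    """A user counts as online if a heartbeat arrived within the TTL."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock    # injectable so expiry can be tested
        self._last_seen = {}  # user_id -> time of last heartbeat

    def heartbeat(self, user_id):
        self._last_seen[user_id] = self.clock()

    def is_online(self, user_id):
        last = self._last_seen.get(user_id)
        return last is not None and self.clock() - last < self.ttl


fake_now = [0.0]
presence = PresenceTracker(ttl_seconds=30.0, clock=lambda: fake_now[0])
presence.heartbeat("alice")
online_at_start = presence.is_online("alice")
fake_now[0] = 31.0  # no heartbeat for longer than the TTL
online_after_ttl = presence.is_online("alice")
```

A store with native key expiry gives the same effect by setting a TTL on each heartbeat write, so stale entries disappear without a cleanup job.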
8. Reliability & Scaling
- Sharding by chat_id or user_id for DB partitioning.
- Replication for durability and HA.
- Async processing for media and notifications.
- Backpressure to avoid overloading queues.
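Two of the ideas above fit in a few lines: picking a shard from a stable hash of the chat ID, and applying backpressure with a bounded queue that rejects work when full. The function names and the choice of SHA-256 with modulo are assumptions for this sketch, not a prescribed scheme.

```python
import hashlib
import queue


def shard_for(key: str, num_shards: int) -> int:
    """Map a chat_id or user_id to a shard using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def try_enqueue(q: queue.Queue, item) -> bool:
    """Return False instead of blocking when the queue is full, so the
    caller can slow down, shed load, or retry later."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


shard = shard_for("chat-42", 16)
work_queue = queue.Queue(maxsize=2)
accepted = [try_enqueue(work_queue, n) for n in range(3)]
```

Plain modulo sharding remaps most keys when the shard count changes; consistent hashing is a common alternative when shards are added or removed often.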
9. Security
- End-to-end encryption (E2EE) for 1:1 chats (e.g., the Signal Protocol, as used by WhatsApp).
- Transport encryption (TLS) for all client-server communication.
- Access control for group membership and media URLs.
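Access control for media URLs is often done with signed, expiring links. The sketch below shows the general idea with an HMAC over the path and expiry time; the URL layout, function names, and `SECRET_KEY` are hypothetical, and managed object stores provide their own signed-URL mechanisms that would normally be used instead.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-secret-do-not-use"  # hypothetical; load from a secret store


def _signature(path: str, expires_at: int, key: bytes) -> str:
    payload = f"{path}:{expires_at}".encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def sign_media_path(path: str, expires_at: int, key: bytes = SECRET_KEY) -> str:
    """Return the path with an expiry timestamp and signature appended."""
    return f"{path}?expires={expires_at}&sig={_signature(path, expires_at, key)}"


def verify_media_path(path: str, expires_at: int, sig: str, now: int,
                      key: bytes = SECRET_KEY) -> bool:
    """Accept only unexpired links whose signature matches."""
    if now >= expires_at:
        return False
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(_signature(path, expires_at, key), sig)


signed_url = sign_media_path("/media/abc.jpg", expires_at=1_000)
signature = signed_url.split("sig=")[1]
```

Because the signature covers both the path and the expiry, changing either one invalidates the link.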
10. Monitoring & Metrics
- MPS (messages/sec), delivery latency percentiles.
- Queue lag and DB write latency.
- Online user count and presence updates.
- Push notification delivery success.
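Delivery latency percentiles such as p50 and p99 can be computed from raw samples as sketched below. This uses the nearest-rank definition, which is one of several in common use; production metrics systems usually approximate percentiles with histograms or sketches rather than sorting every sample.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # synthetic samples: 1 ms .. 100 ms
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Tracking p99 alongside the average matters because a small fraction of slow deliveries is invisible in the mean.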
11. Trade-offs
- E2EE vs server features: encrypted messages can’t be ranked/searched server-side.
- Push vs pull: push gives real-time, but requires long-lived connections.
- Durability vs latency: waiting for every replica to confirm a write adds latency; asynchronous replication reduces it, but a write acknowledged before it is replicated can be lost if that node fails.
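The durability vs latency trade-off can be made concrete with a toy model: if a write goes to all replicas in parallel and the client is acknowledged after a chosen number of them confirm, the write latency is set by the slowest replica in that group. The function name and the latency values are illustrative assumptions.

```python
def ack_latency_ms(replica_latencies_ms, required_acks):
    """Latency until `required_acks` replicas have confirmed a write that
    was sent to all replicas in parallel."""
    if not 1 <= required_acks <= len(replica_latencies_ms):
        raise ValueError("required_acks must be between 1 and replica count")
    return sorted(replica_latencies_ms)[required_acks - 1]


replica_latencies = [4, 9, 35]  # hypothetical per-replica write times in ms
wait_for_all = ack_latency_ms(replica_latencies, 3)  # slowest replica decides
wait_for_one = ack_latency_ms(replica_latencies, 1)  # fastest replica decides
```

Waiting for one replica is fastest but leaves a window in which the only confirmed copy can be lost; waiting for all closes that window at the cost of tail latency.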
12. Extensions
- Message search and indexing.
- Ephemeral messages (disappear after time).
- Reactions, threading, voice/video calls.
- Multi-device sync (harder with E2EE).
13. Interview Tips
- Start with message flow (send → store → deliver).
- Mention durability (DB + queue) and latency.
- Talk about group vs 1:1 separately.
- Call out presence and push notifications.
- Discuss scaling storage + fan-out with sharding.