Scaling Portfolio NAV Calculation from Hundreds to 500,000+ Portfolios
A Technical Case Study of the Portfolio Service Architecture
Introduction
The Portfolio Service was central to the platform’s trading ecosystem. It stored user portfolios, maintained portfolio spreads (stock holdings), provided portfolio-related APIs to other microservices, and computed Net Asset Value (NAV) every 5 minutes during market hours.
Initially the system handled only a few hundred portfolios, but after several marketing campaigns, the workload ballooned to 500,000+ portfolios. The original cron-based computation inside the core monolithic backend quickly became the bottleneck—unable to complete its work within the 5-minute window—and began affecting unrelated parts of the system.
This document describes the architectural challenges, the redesign into a scalable microservice, the hash-partitioned AWS Lambda workflow, caching and DB optimizations, resilience layers, and future improvements.
Original Architecture & Failure Modes
How NAV Was Originally Computed
A cron job inside the main backend executed every 5 minutes and iterated through all portfolios:
- Fetch portfolio spread (list of holdings)
- Fetch last computed NAV
- Fetch stock ticks
- Compute new NAV
- Write back to MongoDB
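For illustration, here is a minimal sketch of that per-portfolio loop, assuming a pymongo-backed store; the collection name, the holding fields (symbol, quantity), and the get_tick() helper are assumptions, not the actual code:

```python
# Illustrative sketch of the original per-portfolio cron pass: one read and
# one write per portfolio, every 5 minutes. Names are assumptions.
import os
from pymongo import MongoClient

def get_tick(symbol):
    """Hypothetical tick lookup; returns the latest price for a symbol."""
    return 0.0  # placeholder

def nav_cron_pass():
    db = MongoClient(os.environ["MONGO_URI"])["core"]
    for portfolio in db.portfolios.find({}):          # full scan each cycle
        spread = portfolio["spread"]                  # list of holdings
        nav = sum(h["quantity"] * get_tick(h["symbol"]) for h in spread)
        db.portfolios.update_one(                     # one write per portfolio
            {"_id": portfolio["_id"]},
            {"$set": {"last_nav": nav}},
        )
```

This serial read-modify-write pattern is what stopped fitting into the 5-minute window once the portfolio count grew.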
What Went Wrong at Scale
Once the portfolio count crossed 500K, the architecture failed:
- Cron couldn’t complete within 5 minutes (overlapping executions)
- Backend CPU was saturated
- MongoDB reads surged, affecting live APIs
- Per-portfolio writes caused write amplification
- No isolation, no resilience, and limited observability
It became clear that NAV computation needed to be separated and redesigned.
Extracting Portfolio Service Into Its Own Microservice
To isolate responsibilities and reduce load on the monolith, the Portfolio Service was extracted into a dedicated microservice.
New Responsibilities
- Store and manage portfolio spreads
- Update spreads only when trades occur (from the Order Service)
- Serve portfolio data to:
  - other microservices,
  - the main backend,
  - the NAV computation pipeline (Lambda jobs)
Non-Responsibilities
- Did not manage order execution
- Did not compute or verify trades
- Did not handle brokerage workflows
This clean separation allowed the Portfolio Service to scale independently and reliably.
Dedicated Portfolio Database with Read Replicas
As traffic increased, Portfolio Service received:
- Its own MongoDB cluster (isolated from the main backend DB)
- Multiple read replicas dedicated to absorbing:
  - High read throughput from NAV Lambdas
  - Reads from internal services needing portfolio data
This ensured the heavy NAV workload never impacted user-facing APIs.
Hash-Partitioned NAV Computation Using AWS Lambda
To parallelize NAV calculations across half a million portfolios, I implemented a hash-partitioning strategy:
```
hash_id = portfolio_id % 20
```

This produced 20 equally distributed partitions, each processed independently.
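For the per-partition query to stay cheap, the hash value is presumably stored on each portfolio document and indexed; a small sketch under that assumption (database, collection, and env-var names are illustrative):

```python
import os
from pymongo import MongoClient, ASCENDING

NUM_PARTITIONS = 20

db = MongoClient(os.environ["MONGO_URI"])["portfolio"]
# Persist the partition key on each portfolio and index it so the per-partition
# query find({"hash": hash_id}) is an index scan, not a collection scan.
db.portfolios.create_index([("hash", ASCENDING)])

def assign_hash(portfolio_id: int) -> int:
    return portfolio_id % NUM_PARTITIONS
```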
Execution Flow
- EventBridge triggered 20 Lambda functions every 5 minutes
- Each Lambda received either:
  - an environment variable like HASH_ID=3, or
  - an event payload like { "hash": 3 }

(I don't recall exactly which method was used; a handler sketch covering both options follows below.)
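Either wiring resolves the same way at runtime. A minimal handler sketch, assuming Python on Lambda; the fallback order is an assumption:

```python
import os

def resolve_hash_id(event) -> int:
    """Prefer an event payload like {"hash": 3}; otherwise use the HASH_ID env var."""
    if event and "hash" in event:
        return int(event["hash"])
    return int(os.environ["HASH_ID"])

def lambda_handler(event, context):
    hash_id = resolve_hash_id(event)
    # ...process this partition (see the fuller sketch in the next section)
```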
What Each Lambda Did
- Bulk load all portfolios belonging to its hash partition in a paginated manner:

```
find({ hash: hash_id }, { projection: { spread: 1, last_nav: 1 } })
```

- Precompute unique stock symbols from the already-fetched portfolios.
- Precompute additional symbols only when new pages of portfolios were fetched.
- Fetch stock ticks once and cache them inside the invocation.
- Compute NAV for thousands of portfolios in batch.
- Write updated NAVs in bulk to MongoDB.
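Putting those steps together, here is a condensed sketch of one invocation. It assumes pymongo, holdings stored as {symbol, quantity} entries inside spread, NAV simplified to the mark-to-market value of the holdings, and a placeholder fetch_ticks() standing in for the tick provider; the page size and field names are illustrative.

```python
import os
from pymongo import MongoClient, UpdateOne

PAGE_SIZE = 5_000

def fetch_ticks(symbols):
    """Placeholder for the tick-provider client; returns {symbol: latest_price}."""
    return {s: 0.0 for s in symbols}

def process_partition(hash_id):
    portfolios = MongoClient(os.environ["MONGO_URI"])["portfolio"]["portfolios"]
    tick_cache = {}      # symbol -> price, filled lazily as new pages arrive
    updates = []
    last_id = None

    while True:
        # _id-based pagination: each page is a cheap, index-backed range read
        query = {"hash": hash_id}
        if last_id is not None:
            query["_id"] = {"$gt": last_id}
        page = list(
            portfolios.find(query, {"spread": 1, "last_nav": 1})
            .sort("_id", 1)
            .limit(PAGE_SIZE)
        )
        if not page:
            break
        last_id = page[-1]["_id"]

        # fetch ticks only for symbols not already seen in this invocation
        new_symbols = {h["symbol"] for p in page for h in p["spread"]}
        new_symbols -= tick_cache.keys()
        tick_cache.update(fetch_ticks(new_symbols))

        # compute NAV in batch for the whole page
        for p in page:
            nav = sum(h["quantity"] * tick_cache[h["symbol"]] for h in p["spread"])
            updates.append(UpdateOne({"_id": p["_id"]}, {"$set": {"last_nav": nav}}))

    if updates:
        portfolios.bulk_write(updates, ordered=False)   # single bulk write round
```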
Partitioning the portfolios enabled predictable scaling. A failure in one shard never affected the others.
Optimizations Implemented
Bulk Fetching Portfolio Spreads
Instead of per-portfolio queries, each Lambda performed a paginated bulk projection query. This reduced:
- 25,000+ reads → a small number of paginated bulk reads
- overall read latency
- MongoDB IOPS
Read replicas absorbed these heavy queries without impacting production traffic.
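To keep these scans off the primary, the partition query can be pinned to the replicas through the driver's read preference; a sketch with pymongo (the actual driver settings are not stated in this write-up):

```python
import os
from pymongo import MongoClient, ReadPreference

client = MongoClient(os.environ["MONGO_URI"])
# Route the heavy NAV scans to secondaries so user-facing reads on the primary
# are unaffected; fall back to the primary only if no replica is available.
portfolios = client["portfolio"].get_collection(
    "portfolios", read_preference=ReadPreference.SECONDARY_PREFERRED
)

cursor = portfolios.find(
    {"hash": 3},
    {"spread": 1, "last_nav": 1},
    batch_size=1_000,   # bound the size of each network round trip
)
```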
Skipping NAV for Inactive Portfolios
Portfolios not accessed in the last 24 hours were skipped during the 5-minute NAV cycle.
- These were recomputed by a separate end-of-day NAV cron
- This significantly reduced unnecessary computation and DB reads
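In query terms, the skip can be a filter on a last-accessed timestamp; a sketch assuming such a field exists (last_accessed_at is a hypothetical name):

```python
from datetime import datetime, timedelta, timezone

# Only portfolios touched in the last 24 hours enter the 5-minute cycle; the
# rest are left to the end-of-day NAV cron. The field name is an assumption.
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
active_filter = {
    "hash": 3,
    "last_accessed_at": {"$gte": cutoff},
}
# e.g. portfolios.find(active_filter, {"spread": 1, "last_nav": 1})
```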
Precomputing Unique Stock Symbols per Partition
Although each Lambda processed ~25,000 portfolios, the unique stock count was typically only 300–800.
Workflow:
- Extract all symbols from the partition’s portfolios
- Deduplicate
- Fetch tick data once per symbol
- Reuse across all NAV calculations
Benefits:
- Reduced repeated lookups
- Faster computation
- Lower memory footprint
- Better overall Lambda performance
This optimization was a major driver behind the 100× scaling.
Bulk Writes to MongoDB
Each Lambda used:
```
bulk_write([ ...updates ])
```

Benefits:
- Reduced write amplification
- Lower oplog pressure
- More predictable latency
- Less stress on DB primaries
Bulk writes enabled sustained updates across partitions during peak market hours.
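One way to keep these batches safe to retry is to guard each update on the timestamp of the tick data it was computed from, so a replayed batch can never overwrite a newer NAV; a sketch (the nav_as_of field and the guard are assumptions, not necessarily what the service did):

```python
from datetime import datetime
from pymongo import UpdateOne

def nav_update(portfolio_id, nav: float, tick_ts: datetime) -> UpdateOne:
    # Apply the write only if it is newer than what is already stored, so
    # retrying a failed batch stays idempotent; nav_as_of is an assumed field.
    newer_than_stored = {
        "$or": [
            {"nav_as_of": {"$exists": False}},
            {"nav_as_of": {"$lt": tick_ts}},
        ]
    }
    return UpdateOne(
        {"_id": portfolio_id, **newer_than_stored},
        {"$set": {"last_nav": nav, "nav_as_of": tick_ts}},
    )

# portfolios.bulk_write(ops, ordered=False) keeps independent updates flowing
# even if one document in the batch errors out.
```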
Resilience and Observability Layer
Resilience Patterns
- Circuit breakers around multiple stock-tick providers
- Retry with exponential backoff + jitter
- Idempotent writes to avoid double computation
- Dead-letter logging for failed NAV writes
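As an illustration of the retry layer, here is a minimal exponential-backoff-with-jitter wrapper; the provider client it wraps is assumed:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.2, cap=5.0):
    """Retry fn() on failure with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # give up; let the caller log/DLQ it
            # full jitter: sleep a random amount up to the exponential cap
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))

# usage: ticks = call_with_backoff(lambda: fetch_ticks({"AAPL", "MSFT"}))
```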
Observability
Dashboards included:
- ingestion lag
- Lambda duration per hash partition
- DB read/write heatmaps
- per-shard error rates
- tick provider latency
Alerts were triggered for:
- abnormal execution time per partition
- spikes in DB read pressure
These ensured rapid detection of issues during volatile market periods.
Remaining Bottlenecks & What I Would Improve Next
Even though the system scaled 100×, two key improvements would further optimize cost and throughput.
1. Migrate from AWS Lambda to In-House Kubernetes Jobs
Lambda offered easy horizontal scaling but became expensive:
- 20 Lambdas × every 5 minutes × long runtime = high operational cost
Kubernetes CronJobs would provide:
- Lower cost (compute-based billing)
- Dedicated CPU & memory per worker
- Long-running workers that can retain warm caches (though tick data must be refreshed frequently); see the worker-loop sketch below
- Better debugging, logging, observability
- Flexible scaling (20 → 50 → 100+ workers)
This is a natural evolution once traffic patterns stabilize.
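As a rough sketch of what such a worker could look like, a long-lived process can keep per-cycle state (such as a spread cache) in memory and refresh only tick data every run; run_nav_cycle here is a hypothetical stand-in for the partition logic sketched earlier:

```python
import time

CYCLE_SECONDS = 300   # same 5-minute cadence as the Lambda schedule

def run_nav_cycle(hash_id, spread_cache):
    """Hypothetical: the partition logic from earlier, reusing cached spreads."""
    ...

def run_worker(hash_id):
    spread_cache = {}   # portfolio_id -> spread; survives across cycles
    while True:
        started = time.monotonic()
        run_nav_cycle(hash_id, spread_cache)   # ticks refreshed inside each cycle
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, CYCLE_SECONDS - elapsed))
```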
2. Aggressive Caching of Portfolio Spreads
Portfolio spreads change only when trades occur, not every 5 minutes.
Fetching spreads during every NAV cycle is unnecessary.
Proposed improvement:
- Maintain a cached version of portfolio spreads (Redis or a dedicated collection)
- Invalidate only when:
  - a trade is executed
  - a corporate action modifies holdings
- NAV jobs (Lambda or K8s workers) would read spreads directly from the cache, as in the sketch below.
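A sketch of that read-through cache with trade-time invalidation, assuming Redis and JSON-serializable spreads; the key naming and client wiring are illustrative:

```python
import json
import os

import redis
from pymongo import MongoClient

cache = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379)
portfolios = MongoClient(os.environ["MONGO_URI"])["portfolio"]["portfolios"]

def spread_key(portfolio_id) -> str:
    return f"spread:{portfolio_id}"

def get_spread(portfolio_id):
    """Read-through: serve from Redis, fall back to MongoDB on a miss."""
    cached = cache.get(spread_key(portfolio_id))
    if cached is not None:
        return json.loads(cached)
    doc = portfolios.find_one({"_id": portfolio_id}, {"spread": 1}) or {}
    spread = doc.get("spread", [])
    cache.set(spread_key(portfolio_id), json.dumps(spread))
    return spread

def invalidate_spread(portfolio_id):
    """Called on the trade / corporate-action path so the next read refetches."""
    cache.delete(spread_key(portfolio_id))
```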
Expected benefits:
- Reduce DB reads by 90–95%
- Shorter compute cycles
- Improved stability during market spikes
Decision Making
If you are curious about how we made these decisions, you are welcome to read my next article.