ARTS: Building Telemetry & Monitoring for OCI’s AI Data Centers
By Ravi Shankar – Senior Software Engineer, Oracle Cloud Infrastructure
Published on officialcto.com
Introduction
In 2024, I had the opportunity to contribute to one of the most ambitious AI infrastructure build-outs in the world: Oracle Cloud Infrastructure’s (OCI) Stargate initiative with OpenAI. Public sources have documented Oracle’s investment in NVIDIA’s GB200 GPUs and the construction of massive AI regions like Abilene, Texas — a facility eventually targeting over 450,000 GPUs.
During my tenure, our team was responsible for the telemetry and monitoring layer for a subset of this ecosystem: around 65,000 GPUs across multiple racks, switches, and liquid-cooling configurations. My focus was on the backend infrastructure that ingests, aggregates, and exposes GPU health signals to downstream systems such as SCADA.
NDA Note: All details below are based on publicly known information and generalized industry practices. Specific proprietary internals, configurations, topology details, or customer information are intentionally excluded.
Why ARTS Was Needed
The Abilene facility was provisioned as a child site of an existing OCI region, introducing several non-standard infra patterns. It also included NVIDIA’s newest hardware generations (GB200, NVL72-class systems, and proprietary liquid cooling elements), each needing fine-grained visibility to maintain uptime.
Every GPU rack included:
- 18 GPUs
- 18 ILOMs
- 9 NVSwitches
- Custom cooling and connectivity infrastructure
The challenge was straightforward but enormous in scale: collect, process, and act upon telemetry for tens of thousands of GPUs in near real-time, while integrating with industrial systems (SCADA) responsible for physical safety.
Telemetry included:
- Temperatures
- Fault events
- Connectivity/heartbeat
- Health status
- Link-level anomalies
Even small blind spots could cascade into large-scale downstream failures. ARTS (AI Rack Telemetry Service) was the platform created to ingest, validate, aggregate, and push this data reliably.
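To make the data model concrete, here is a minimal sketch of what a single per-GPU sample could look like. The record and its field names are hypothetical, not the actual ARTS schema; later snippets in this post reuse it for illustration.

```java
import java.time.Instant;

/**
 * Illustrative shape of a per-GPU telemetry sample. Field names are
 * hypothetical and do not reflect the real ARTS schema.
 */
public record GpuTelemetrySample(
        String rackId,          // physical rack identifier
        String gpuId,           // GPU (or ILOM) identifier within the rack
        Instant sampledAt,      // agent-side timestamp, used for freshness checks
        double temperatureC,    // die or coolant temperature in Celsius
        String healthStatus,    // e.g. OK, DEGRADED, FAULTED
        boolean heartbeatOk,    // connectivity / heartbeat signal
        int linkErrorCount      // link-level anomaly counter since the last sample
) {}
```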
The Problem Space
Several constraints made this project unique:
1. We couldn't reuse existing internal codebases
Because of partnership boundaries with OpenAI, all infrastructure for telemetry had to be built from scratch, including:
- APIs
- Aggregation layers
- IaC
- Deployment pipelines
- Integration modules
This forced us to create a clean, modern stack without legacy baggage.
2. Real-time monitoring with zero tolerance for stale data
GPU agents emitted telemetry frequently. If an API or DB slowed down, queueing or batching could create “lagging” data — unacceptable for safety systems.
3. Integration with SCADA — an industrial engineering domain
SCADA systems are designed for factories, power plants, and cooling systems, not cloud GPU fleets. Bridging the two worlds required careful modeling and controlled escalation paths.
High-Level Architecture
(Abstracted version, NDA-safe)
GPU Agents → Telemetry API (Micronaut) → Oracle DB (64GB, HA Replica)
        ↓
Aggregation Engine
        ↓
Abnormality Detector (Cron + Shedlock)
        ↓
SCADA Integrations (MQTT / HTTP)

My direct contributions spanned the API layer, aggregation logic, the SCADA publisher, and IaC provisioning.
Why We Chose the Technologies We Did
(These decisions were made collectively across engineering and architecture teams; I’m describing the rationale, not claiming sole ownership.)
1. Java + Micronaut for the Backend
Most OCI backend systems are Java-based, which streamlined onboarding and support.
Micronaut, specifically, gave us:
- Fast startup due to compile-time DI
- Low memory footprint → more pods per node
- Kubernetes-native integration
- Mature ecosystem for resilience patterns (interceptors, circuit breakers)
- Oracle-backed open source, which meant faster security-patch cycles
In a system that runs hundreds of pods under heavy telemetry ingestion, these optimizations matter significantly.
2. Oracle Database (64GB RAM + replicas)
Although telemetry systems often default to NoSQL, Oracle DB was chosen because:
- It offers strong consistency guarantees, important for SCADA-driven decision paths.
- HA replicas and backup automation reduce operational burden.
- Oracle DB has evolved significantly — it now performs competitively in high-write scenarios.
- Native OCI integration simplified monitoring, scaling, and access control.
3. Shepherd (OCI IaC) + OKE (OCI Managed Kubernetes)
We used Shepherd (OCI’s Terraform-equivalent) to define:
- DB clusters
- Kubernetes clusters
- Network configs
- IAM policies
- Microservice deployments
Benefits:
- Declarative reproducibility
- Zero-drift across environments
- Built-in OCI security best practices
- Fast provisioning cycles
For orchestration, OKE was chosen over raw Kubernetes because:
- It handles patching, scaling, node upgrades, cert rotation
- Allows teams to focus on service logic, not cluster operations
- Integrates cleanly with Oracle’s monitoring/logging stack
Given the scale and timeline, this choice eliminated entire classes of operational risk.
4. No Kafka (On Purpose)
We debated queueing systems extensively. Ultimately, telemetry was pushed directly to the API, without Kafka or SQS-like buffers, because:
- Retries by agents ensured freshness; queues would introduce stale data.
- Telemetry load was predictable and uniform.
- Avoiding queues cut down on operational overhead (brokers, partitions, lag).
- Eliminated backpressure scenarios that could cascade into SCADA triggers.
This was a simplicity-over-complexity decision that paid off.
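To illustrate the freshness requirement, here is a small sketch of the kind of server-side guard this design implies: samples older than a cutoff are rejected outright, and the agent simply re-sends its latest reading on the next cycle. The 30-second cutoff is an assumed value, not an ARTS constant.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Sketch of a freshness guard for direct-push ingestion: stale samples are
 * rejected so agents retry with current readings instead of the pipeline
 * absorbing old data. The 30-second cutoff is an assumption.
 */
public final class FreshnessGuard {

    private static final Duration MAX_AGE = Duration.ofSeconds(30);

    /** Returns true if the sample is recent enough to ingest. */
    public static boolean isFresh(Instant sampledAt, Instant now) {
        return !sampledAt.isBefore(now.minus(MAX_AGE));
    }

    private FreshnessGuard() {}
}
```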
5. Shedlock for Distributed Cron Coordination
The platform needed scheduled tasks to:
- Pull aggregated telemetry
- Detect abnormalities
- Publish alerts to SCADA
With multiple microservice instances, Shedlock ensured:
- Only one instance runs the cron at a time
- DB-backed locks prevent duplicate SCADA messages
- Clean, deterministic scheduling
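As a rough sketch of how that coordination can look, the snippet below assumes the shedlock-micronaut integration plus a JDBC-backed LockProvider bean pointed at the Oracle DB; the job name, schedule, and lock durations are illustrative, not the real ARTS configuration.

```java
import io.micronaut.scheduling.annotation.Scheduled;
import jakarta.inject.Singleton;
import net.javacrumbs.shedlock.micronaut.SchedulerLock;   // assumes the shedlock-micronaut module

/**
 * Sketch of a ShedLock-guarded cron job. Only the instance that acquires the
 * DB-backed lock runs the scan, so each tick produces at most one set of
 * SCADA messages.
 */
@Singleton
public class AbnormalityScanJob {

    @Scheduled(fixedDelay = "30s")
    @SchedulerLock(name = "abnormality-scan", lockAtMostFor = "PT2M", lockAtLeastFor = "PT10S")
    public void scan() {
        // Pull aggregated telemetry, evaluate abnormality rules, and publish
        // alerts to SCADA (details omitted in this sketch).
    }
}
```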
My Role & Contributions
Here are the concrete responsibilities I owned:
1. Implemented the Telemetry Ingestion Service
- Designed Micronaut-based REST APIs for GPU agents (a minimal endpoint sketch follows this list)
- Implemented validation, schema modeling, and ingestion flows
- Handled retry semantics and failure cases
- Optimized ingestion to minimize DB write overhead
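Below is a minimal sketch of what such a Micronaut ingestion endpoint could look like, reusing the hypothetical sample record and freshness guard from earlier. The route, persistence interface, and response codes are illustrative, not the real ARTS API.

```java
import io.micronaut.http.HttpResponse;
import io.micronaut.http.annotation.Body;
import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Post;

import java.time.Instant;

/**
 * Minimal Micronaut ingestion endpoint sketch. The route, persistence
 * abstraction, and response codes are illustrative only.
 */
@Controller("/telemetry")
public class TelemetryController {

    private final TelemetryWriter writer;           // hypothetical persistence abstraction

    TelemetryController(TelemetryWriter writer) {   // constructor injection, resolved at compile time
        this.writer = writer;
    }

    @Post("/samples")
    public HttpResponse<Void> ingest(@Body GpuTelemetrySample sample) {
        // Reject stale samples up front; the agent retries with a fresh reading.
        if (!FreshnessGuard.isFresh(sample.sampledAt(), Instant.now())) {
            return HttpResponse.badRequest();
        }
        writer.write(sample);                       // e.g. a batched insert into Oracle DB
        return HttpResponse.accepted();
    }

    /** Hypothetical persistence interface; the real write path is not shown. */
    public interface TelemetryWriter {
        void write(GpuTelemetrySample sample);
    }
}
```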
2. Designed and Built the Aggregation Layer
- Modeled GPU health, temperature, fault, and connectivity aggregation (a rollup sketch follows this list)
- Created transformation pipelines for real-time visibility
- Built internal endpoints for reliability engineering dashboards
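As an illustration of the aggregation shape, here is a sketch of a per-rack rollup over the hypothetical sample type: max temperature, faulted GPUs, and missed heartbeats, roughly the signals a dashboard or abnormality rule would consume. The aggregate names and the "FAULTED" status string are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Sketch of a rack-level rollup over the hypothetical GpuTelemetrySample
 * type; the aggregates and naming are illustrative only.
 */
public final class RackAggregator {

    /** Per-rack rollup consumed by dashboards and abnormality rules. */
    public record RackRollup(String rackId, double maxTempC, long faultedGpus, long missedHeartbeats) {}

    public static Map<String, RackRollup> rollUp(List<GpuTelemetrySample> samples) {
        return samples.stream()
                .collect(Collectors.groupingBy(GpuTelemetrySample::rackId))
                .entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> new RackRollup(
                        e.getKey(),
                        e.getValue().stream().mapToDouble(GpuTelemetrySample::temperatureC).max().orElse(0.0),
                        e.getValue().stream().filter(s -> "FAULTED".equals(s.healthStatus())).count(),
                        e.getValue().stream().filter(s -> !s.heartbeatOk()).count())));
    }

    private RackAggregator() {}
}
```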
3. Led SCADA Integration
- Implemented MQTT and HTTP integrations (MQTT publisher sketch below)
- Added circuit-breaker patterns to prevent cascading failures
- Defined abnormality mapping rules used by operators
- Built structured payloads consumed by SCADA dashboards
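The snippet below sketches how such a publisher might look, assuming Eclipse Paho for MQTT and Micronaut’s @CircuitBreaker for failure isolation. The topic layout, QoS choice, and payload format are assumptions, not the ARTS wire protocol.

```java
import io.micronaut.retry.annotation.CircuitBreaker;
import jakarta.inject.Singleton;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

import java.nio.charset.StandardCharsets;

/**
 * Sketch of a SCADA alert publisher: Paho for MQTT, Micronaut's circuit
 * breaker to stop hammering an unhealthy broker. Topic name, QoS, and
 * payload format are assumptions.
 */
@Singleton
public class ScadaAlertPublisher {

    private final MqttClient client;    // assumed to be connected by a factory at startup

    ScadaAlertPublisher(MqttClient client) {
        this.client = client;
    }

    @CircuitBreaker(attempts = "3", delay = "500ms", reset = "30s")
    public void publishAbnormality(String rackId, String alertJson) throws MqttException {
        MqttMessage message = new MqttMessage(alertJson.getBytes(StandardCharsets.UTF_8));
        message.setQos(1);              // at-least-once; the SCADA side de-duplicates on alert id
        client.publish("arts/alerts/" + rackId, message);
    }
}
```

The circuit breaker is what prevents cascading failures here: if the broker degrades, the publisher fails fast after a few attempts and only re-probes after the reset window, instead of letting blocked publishes back up into the aggregation path.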
4. Authored Shepherd/Terraform Modules for Provisioning
- Provisioned DBs, OKE clusters, VCNs, microservice deployments
- Added RBAC, encryption policies, and network-hardening defaults
- Ensured reproducible dev/stage/prod environments
5. Achieved 90%+ Test Coverage
After returning from a short medical break, I used AI-assisted tooling to cover:
- API-level unit tests
- Integration tests
- Fault-injection scenarios
This gave us confidence during load simulations and security reviews.
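For illustration, here is a minimal integration-test sketch in that style, assuming micronaut-test with JUnit 5 and a stub TelemetryWriter bean registered in the test context; it is not the actual ARTS test suite.

```java
import io.micronaut.http.HttpRequest;
import io.micronaut.http.HttpStatus;
import io.micronaut.http.client.HttpClient;
import io.micronaut.http.client.annotation.Client;
import io.micronaut.test.extensions.junit5.annotation.MicronautTest;
import jakarta.inject.Inject;
import org.junit.jupiter.api.Test;

import java.time.Instant;

import static org.junit.jupiter.api.Assertions.assertEquals;

/**
 * Integration-test sketch against the hypothetical ingestion endpoint.
 * Assumes a stub TelemetryWriter bean is provided for the test context.
 */
@MicronautTest
class TelemetryControllerTest {

    @Inject
    @Client("/")
    HttpClient client;

    @Test
    void freshSampleIsAccepted() {
        var sample = new GpuTelemetrySample(
                "rack-001", "gpu-07", Instant.now(), 61.5, "OK", true, 0);

        var response = client.toBlocking()
                .exchange(HttpRequest.POST("/telemetry/samples", sample));

        assertEquals(HttpStatus.ACCEPTED, response.getStatus());
    }
}
```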
Execution Challenges
1. New Tech Stack Under Tight Deadlines
Coming from a Node.js background, I had never used:
- Java enterprise patterns
- Micronaut
- Shepherd
- SCADA protocols
The learning curve was steep, but I ramped up quickly by:
- Pairing with senior teammates
- Studying internal patterns
- Incrementally building components end-to-end
2. Scale Testing
We simulated ingestion loads equivalent to thousands of racks. Bottlenecks surfaced around:
- DB write amplification
- SCADA back-pressure
- Aggregation frequency
These were resolved via:
- Batch write strategies (sketched below)
- Circuit-breaking
- Tuning retention and indexing logic
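As an example of the batch-write idea, the sketch below buffers samples and flushes them as a single JDBC batch rather than one INSERT per sample; the table and column names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;

/**
 * Sketch of the batch-write strategy: flush a buffer of samples as one JDBC
 * batch instead of N single-row inserts. Table and column names are hypothetical.
 */
public final class BatchedTelemetryWriter {

    private static final String INSERT_SQL =
            "INSERT INTO gpu_telemetry (rack_id, gpu_id, sampled_at, temperature_c, health_status) "
            + "VALUES (?, ?, ?, ?, ?)";

    /** Writes a buffered batch of samples in one round trip. */
    public static void flush(Connection conn, List<GpuTelemetrySample> batch) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (GpuTelemetrySample s : batch) {
                ps.setString(1, s.rackId());
                ps.setString(2, s.gpuId());
                ps.setTimestamp(3, Timestamp.from(s.sampledAt()));
                ps.setDouble(4, s.temperatureC());
                ps.setString(5, s.healthStatus());
                ps.addBatch();
            }
            ps.executeBatch();   // one round trip instead of N single-row inserts
        }
    }

    private BatchedTelemetryWriter() {}
}
```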
3. Safety-Oriented Development
Unlike typical cloud features, telemetry interacts indirectly with physical infrastructure. This raised the bar for:
- Idempotency
- Failure isolation
- Alert correctness
- Time-to-detection guarantees
Outcome
The ARTS platform was delivered ahead of schedule and provided:
- Real-time visibility into 65,000+ GPUs
- Unified telemetry ingestion
- Actionable SCADA alerts
- High availability and low operational overhead
It became a foundational layer supporting OCI’s AI datacenter expansion.
Key Lessons Learned
- IaC is non-negotiable for scale — zero-drift environments save months during infra evolution.
- Telemetry systems succeed only when data modeling is done well, not just when collection is fast.
- Integrating cloud infra with SCADA is a fascinating collision of industrial engineering and distributed systems.
- Simplicity beats theoretical elegance — avoiding Kafka was a big win.
- Stepping into unfamiliar tech stacks under pressure accelerates career growth; Micronaut and Terraform are now core skills.
Final Thoughts
Working on ARTS was one of the most intense and rewarding engineering experiences of my career. It blended:
- large-scale distributed systems
- real-time telemetry
- industrial engineering
- cloud infrastructure
- safety-critical design
It also taught me how fast teams can move when constraints force clarity and focus.
If you’re curious about how modern AI datacenters operate under the hood, or want to chat about large-scale telemetry design, feel free to reach out.