Phaidra Factory

AI agents to share the load

AI factories require more power and experience more frequent thermal spikes than their cloud data center predecessors. When capacity grows 10x each year, there's no energy to waste.

Phaidra Factory's Agents prevent thermal spikes and stranded power. As a result, AI factories can:

Increase tokens per watt

Increase compute footprint

Reduce waste

Accelerate time-to-first-token

Maximize the value of OpEx and CapEx

Reduce labor intensity

When traditional methods don't apply, AI agents take action directly to achieve peak GPU reliability and performance.

Inefficiencies are magnified at gigawatt-scale

A single GW AI factory costs upwards of $50Bn, so every 1% of inefficiency can amount to $2Bn in lost revenue. Factory helps operators keep operations efficient even as compute capacity grows exponentially.

Extreme performance requires extreme operations

AI factories must operate as a single integrated machine rather than a collection of loosely-orchestrated components. Factory makes it possible to expand capacity without compromising this delicate balance or relying on overly conservative design.

Solutions for CDU control (rack and row-level)

Challenge: Large GPU clusters with synchronized workloads cause sudden IT load spikes, which in turn cause thermal spikes. These thermal spikes force the facility to run at significantly lower TCS temperatures to avoid GPU throttling.

Solution: AI agent that anticipates thermal spikes before they occur and preemptively controls the CDU to reduce or eliminate the spike.

Result: Precision TCS thermal control within 0.5 °C (i.e. an 80+% reduction in the magnitude of thermal spikes). This enables the AI factory to run at significantly higher TCS temperatures while meeting SLAs, which means higher energy efficiency and IT capacity.

Solutions for PUE optimization

Challenge: The cooling system is the largest component of DC overhead (typically ~70% of non-IT loads). Traditional control systems deliver reliability at the expense of energy efficiency — our AI agent can do both simultaneously.

Solution: AI agent that intelligently and proactively manages the chiller plant (e.g. chiller staging, evaporator temperatures, differential pressures) via a BMS/SCADA integration.

Result: Significant PUE improvements arising from a large reduction in chiller plant energy consumption. Improved SLA compliance.

Solutions for increasing IT capacity

Challenge: Cooling system power is often statically allocated and sized for the hottest possible day. This leads to stranded power that could be used to increase compute footprint in the same facility.

Solution: AI agent that dynamically updates power allocation policies between the cooling and IT domains using NVIDIA DSX Max-Q APIs.

Result: Power is safely unlocked to generate tokens rather than being kept idle as a precaution. That power allows for extra hardware.

How it works

Phaidra Factory’s Liquid Cooling Agent monitors GPUs to predict thermal spikes. When conditions suggest a spike is imminent, the Liquid Cooling Agent begins the cooling process. With response times of less than 10 seconds, this smooths out the thermal spike and reduces the overall power draw.
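
The predict-then-precool loop described above can be illustrated with a minimal sketch. It is not Phaidra's method: it swaps in a simple ramp-rate threshold for whatever prediction the Liquid Cooling Agent actually uses, and every name and number in it (SpikePredictor, cdu_flow_setpoint, the 20 kW/s threshold, the flow values) is a hypothetical chosen for illustration.

```python
from collections import deque


class SpikePredictor:
    """Hypothetical stand-in predictor: flags an imminent thermal spike
    when IT load ramps faster than a threshold over a short window."""

    def __init__(self, window=5, ramp_threshold_kw_per_s=20.0):
        # Rolling window of (timestamp_s, it_load_kw) samples.
        self.samples = deque(maxlen=window)
        self.ramp_threshold = ramp_threshold_kw_per_s

    def update(self, timestamp_s, it_load_kw):
        self.samples.append((timestamp_s, it_load_kw))

    def spike_imminent(self):
        if len(self.samples) < 2:
            return False
        (t0, p0), (t1, p1) = self.samples[0], self.samples[-1]
        if t1 <= t0:
            return False
        # Average ramp rate across the window, in kW per second.
        return (p1 - p0) / (t1 - t0) > self.ramp_threshold


def cdu_flow_setpoint(predictor, base_flow_lpm=100.0, boost_flow_lpm=40.0):
    """Raise coolant flow ahead of a predicted spike (assumed control action)."""
    if predictor.spike_imminent():
        return base_flow_lpm + boost_flow_lpm
    return base_flow_lpm
```

Feeding the predictor a flat load keeps the setpoint at the base flow; a sharp ramp raises it before the heat reaches the coolant loop, which is the feedforward idea behind acting ahead of the spike rather than reacting to it.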

Designed to maximize compute density, Agentic Power Allocation is engineered to analyze signals from scheduler jobs, power draw, and real-time weather data, and to model the required chiller capacity with a conservative margin for error. The updated estimate is intended to feed into NVIDIA's Mission Control Domain Power Service (DPS) framework to dynamically increase compute allocation, unlocking more GPU availability while maintaining rigorous site-safety guardrails.
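
To make the stranded-power idea concrete, here is a toy calculation under stated assumptions. The linear cooling-overhead model, the 15% margin, and the function names are all hypothetical and do not come from Phaidra or NVIDIA; a real deployment would use a site-specific chiller model and would pass the result to the power management framework rather than compute it in isolation.

```python
def cooling_overhead(outdoor_temp_c):
    """Assumed toy model: cooling power as a fraction of IT power,
    rising linearly once the outdoor temperature exceeds 15 degrees C."""
    return 0.10 + 0.005 * max(outdoor_temp_c - 15.0, 0.0)


def it_power_budget_kw(site_limit_kw, outdoor_temp_c, margin=0.15):
    """Largest IT load whose cooling need, padded by a safety margin,
    still fits under the site power limit.

    Solves x + x * overhead * (1 + margin) = site_limit_kw for x.
    """
    padded_overhead = cooling_overhead(outdoor_temp_c) * (1.0 + margin)
    return site_limit_kw / (1.0 + padded_overhead)


def unlocked_power_kw(site_limit_kw, outdoor_temp_c, design_temp_c=40.0):
    """IT power freed relative to a static allocation sized for the design day."""
    static_budget = it_power_budget_kw(site_limit_kw, design_temp_c)
    dynamic_budget = it_power_budget_kw(site_limit_kw, outdoor_temp_c)
    return max(dynamic_budget - static_budget, 0.0)
```

Under this model, a cool day leaves a larger IT budget than the design-day allocation, and the difference is the power that a static policy would leave stranded. The margin parameter plays the role of the conservative error allowance described above.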

These are just two of several Phaidra agents that target specific high-impact processes within AI factories. Watch this space for more agents to be released this year.

Phaidra is an NVIDIA DSX Omniverse partner
