AI factories require more power and experience more frequent thermal spikes than their cloud data center predecessors. When capacity grows 10x each year, there's no energy to waste.
A single GW AI factory costs upwards of $50Bn, so every 1% of inefficiency can amount to $2Bn in lost revenue. Factory helps operators ensure efficient operations even as compute capacity grows exponentially.
AI factories must operate as a single integrated machine rather than a collection of loosely orchestrated components. Factory makes it possible to expand capacity without compromising this delicate balance or relying on overly conservative design.
for CDU control (rack and row-level)
Challenge: Large GPU clusters with synchronized workloads cause sudden IT load spikes, which in turn cause thermal spikes. To avoid GPU throttling during these spikes, the facility is forced to run its technology cooling system (TCS) at significantly lower temperatures.
Solution: AI agent that anticipates thermal spikes before they occur and preemptively controls the coolant distribution unit (CDU) to reduce or eliminate the spike.
Result: Precision TCS thermal control within 0.5 °C (i.e. an 80+% reduction in the magnitude of thermal spikes). This enables the AI factory to run at significantly higher TCS temperatures while meeting SLAs, which means higher energy efficiency and IT capacity.
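To illustrate why anticipating a load spike helps, here is a minimal toy simulation. It is not Phaidra's controller: the lumped thermal model, the ramp-limited actuator, and every parameter value (`kp_kw_per_k`, `ramp_kw_per_s`, `heat_capacity_kj_per_k`, the load profile) are assumptions chosen only to make the effect visible. A reactive controller sees the load only when it arrives; a predictive one sees it a few seconds early and starts ramping cooling ahead of time.

```python
def peak_deviation(load_kw, lookahead_s, setpoint_c=30.0, kp_kw_per_k=5.0,
                   ramp_kw_per_s=2.0, heat_capacity_kj_per_k=100.0):
    """Return the largest |temperature - setpoint| over the run (time step = 1 s).

    All parameters are hypothetical; lookahead_s=0 means purely reactive control.
    """
    temp_c = setpoint_c
    cooling_kw = load_kw[0]          # start in steady state: cooling equals heat load
    peak = 0.0
    last = len(load_kw) - 1
    for k, load in enumerate(load_kw):
        # Forecast of the IT load `lookahead_s` seconds ahead (assumed perfect here).
        forecast = load_kw[min(k + lookahead_s, last)]
        # Feedforward on the larger of current/forecast load, plus proportional feedback.
        target_kw = max(load, forecast) + kp_kw_per_k * (temp_c - setpoint_c)
        # The actuator can only change cooling by ramp_kw_per_s each second.
        step = max(-ramp_kw_per_s, min(ramp_kw_per_s, target_kw - cooling_kw))
        cooling_kw += step
        # Lumped energy balance: net heat (kW) over 1 s divided by heat capacity (kJ/K).
        temp_c += (load - cooling_kw) / heat_capacity_kj_per_k
        peak = max(peak, abs(temp_c - setpoint_c))
    return peak


# Synthetic profile: 10 kW for 20 s, then a synchronized jump to 30 kW.
load = [10.0] * 20 + [30.0] * 40
reactive = peak_deviation(load, lookahead_s=0)
predictive = peak_deviation(load, lookahead_s=5)
print(f"reactive peak deviation:   {reactive:.2f} K")
print(f"predictive peak deviation: {predictive:.2f} K")
```

With these made-up numbers the predictive run deviates less from the setpoint because it pre-cools slightly before the spike. The size of the improvement depends entirely on the assumed model and forecast quality, so it should not be read as a reproduction of the figures quoted above.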
for PUE optimization
Challenge: The cooling system is the largest component of data center overhead (typically ~70% of non-IT loads). Traditional control systems deliver reliability at the expense of energy efficiency, whereas our AI agent is designed to deliver both at once.
Solution: AI agent that intelligently and proactively manages the chiller plant (e.g. chiller staging, evaporator temperatures, differential pressures) via a BMS/SCADA integration.
Result: Significant PUE improvements arising from a large reduction in chiller plant energy consumption. Improved SLA compliance.
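As a rough worked example of how chiller plant savings translate into PUE, recall that PUE is total facility energy divided by IT energy, so the overhead per unit of IT energy is PUE minus 1. The sketch below uses the ~70% cooling share quoted above; the baseline PUE of 1.30 and the 20% cooling reduction are illustrative assumptions, not figures from Phaidra.

```python
def pue_after_cooling_savings(baseline_pue, cooling_share_of_overhead, cooling_reduction):
    """Estimate PUE after cutting cooling energy by a given fraction.

    baseline_pue: total facility energy / IT energy (assumed value)
    cooling_share_of_overhead: fraction of non-IT energy used by cooling
    cooling_reduction: fractional reduction in cooling energy (assumed value)
    """
    overhead = baseline_pue - 1.0                  # non-IT energy per unit of IT energy
    cooling = overhead * cooling_share_of_overhead
    return 1.0 + overhead - cooling * cooling_reduction


# Assumed baseline PUE of 1.30, cooling at 70% of overhead, 20% less cooling energy.
new_pue = pue_after_cooling_savings(1.30, 0.70, 0.20)
print(f"PUE: 1.30 -> {new_pue:.3f}")
```

Under these assumptions the overhead falls from 0.30 to 0.258 per unit of IT energy. Because cooling dominates the overhead, most of any chiller plant saving shows up directly in PUE.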

for increasing IT capacity
Challenge: Cooling system power is often statically allocated and designed for the hottest possible day. This leads to stranded power which could be utilised to increase compute footprint in the same facility.
Solution: AI agent that dynamically updates power allocation policies between the cooling and IT domains using NVIDIA DSX Max-Q APIs.
Result: Power is safely unlocked to generate tokens rather than being kept idle as a precaution. That power allows for extra hardware
Phaidra Factory’s Liquid Cooling Agent monitors GPUs to predict thermal spikes. When conditions suggest a spike is imminent, the Liquid Cooling Agent begins the cooling process. With response times of less than 10 seconds, this smooths out the thermal spike and reduces the overall power draw.
Designed to maximize compute density, Agentic Power Allocation is engineered to analyze signals from scheduler jobs, power draw, and real-time weather data to model required chiller capacity with a conservative margin for error. The agent is designed to feed this updated estimate into NVIDIA's Mission Control Domain Power Service (DPS) framework to dynamically increase compute allocation, unlocking more GPU availability while maintaining rigorous site-safety guardrails.
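The idea of recovering stranded power can be sketched with a small calculation. This is an illustration only: the linear cooling model in `cooling_kw_per_it_kw`, the 15% margin, the 100 MW site budget, and the temperatures are all hypothetical, and the sketch uses weather alone rather than the scheduler and power-draw signals described above. It also makes no calls to NVIDIA's DPS; how an estimate like this would be passed to that framework is outside its scope.

```python
def cooling_kw_per_it_kw(outdoor_temp_c):
    """Hypothetical model: cooling power needed per kW of IT load rises with outdoor temperature."""
    return 0.10 + 0.005 * max(outdoor_temp_c - 10.0, 0.0)


def max_it_load_kw(site_budget_kw, outdoor_temp_c, margin=0.15):
    """Largest IT load such that IT + cooling (with safety margin) fits the site budget.

    Solves it + it * ratio <= site_budget_kw for it, where ratio includes the margin.
    """
    ratio = cooling_kw_per_it_kw(outdoor_temp_c) * (1.0 + margin)
    return site_budget_kw / (1.0 + ratio)


site_budget_kw = 100_000  # assumed 100 MW site power budget

# Static allocation: sized once for the hottest design day (assumed 40 C).
design_day_it_kw = max_it_load_kw(site_budget_kw, outdoor_temp_c=40.0)

# Dynamic allocation: re-evaluated for current conditions (assumed 15 C today).
mild_day_it_kw = max_it_load_kw(site_budget_kw, outdoor_temp_c=15.0)

unlocked_kw = mild_day_it_kw - design_day_it_kw
print(f"IT power unlocked on a mild day: {unlocked_kw:,.0f} kW")
```

The margin is applied to the cooling estimate rather than to the IT load, so a less certain forecast (a larger `margin`) directly shrinks the power that is released to compute.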
These are just two of several Phaidra agents that target specific high-impact processes within AI factories. Watch this space for more agents to be released this year.
Phaidra is an NVIDIA DSX Omniverse partner

AI agents for liquid-cooled AI factories
Read Phaidra's white paper to learn how to prevent GPU throttling and safely increase facility temperatures to drive more revenue-generating compute.
