"The single-training cluster scale for large language models (LLMs) rapidly evolved from the thousand-card scale (GPT-3, Llama-2) in 2023 to the ten-thousand-card scale (Llama-3), and even to the hundred-thousand-card scale (xAI, GPT-5)."
"The power consumption of single chips is growing rapidly, and we cannot place over 100,000 H100 chips in the same state for GPT-6 training, otherwise it will lead to a grid collapse."
On average, for every 100 eight-card Nvidia H100 GPU servers
Reduce GPU core temperature:
- Ease the cooling load on liquid-cooling and air-conditioning equipment, effectively lowering the data center's overall PUE
- Delay chip aging and preserve chip performance through the end of its service life
- Prevent chips from down-clocking or dropping off the bus due to high temperatures, reducing operational pressure
"In a computing cluster composed of 16,000 H100 GPUs, we observed an average of 104 hardware failures per day causing interruptions. The actual effective training time only accounts for 70% of the total time."
"Silent data corruption caused by chip aging results in at least 1000 Defective-Parts-Per-Million (DPPM), and slows down computing speed while increasing power consumption."
Multi-channel data collection and model construction for computing power chips:
- Establish a unified language (evaluation standard) for chip health at every stage of the chip's lifecycle.
- Collect multi-channel chip data in real time (access latency, bandwidth utilization, compute-core utilization, power consumption, core voltage, core temperature, etc.), and construct derived parameters such as ΔVth through patented physical simulation models, for a total of 19 parameters.
- Perform anomaly detection based on EM-GMM, Isolation Forest, Autoencoder, and similar methods; build an evaluation system integrating energy consumption, performance, and computational reliability, enabling real-time assessment of compute-node chips.
- Use time-series data to fine-tune the chip health model in real time, effectively addressing issues such as data drift; improve fault-prediction accuracy through semi-supervised learning on added labeled samples.
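The steps above can be sketched in miniature. As a simplified stand-in for the EM-GMM / Isolation Forest detectors named above, the snippet fits a single multivariate Gaussian to healthy telemetry and flags samples far from it by Mahalanobis distance; the channel names follow the text, while the data and threshold are illustrative assumptions.

```python
import numpy as np

# Simplified anomaly-detection sketch over multi-channel chip telemetry.
# Synthetic data and the distance threshold are assumptions for illustration.
rng = np.random.default_rng(0)
CHANNELS = ["access_latency_us", "bandwidth_util", "core_util",
            "power_w", "core_voltage_v", "core_temp_c"]

# Synthetic "healthy" telemetry: 500 samples x 6 channels.
healthy = rng.normal(loc=[50, 0.7, 0.8, 600, 0.85, 65],
                     scale=[5, 0.05, 0.05, 20, 0.01, 3],
                     size=(500, 6))

mu = healthy.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(healthy, rowvar=False))

def mahalanobis(x: np.ndarray) -> float:
    """Distance of a sample from the healthy distribution."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

def is_anomalous(x: np.ndarray, threshold: float = 5.0) -> bool:
    """Flag samples that sit far outside the healthy envelope."""
    return mahalanobis(x) > threshold

# An overheating, over-volted sample should be flagged; a typical one should not.
hot = np.array([50, 0.7, 0.8, 750, 0.95, 95])
ok = np.array([51, 0.71, 0.79, 598, 0.85, 64])
print(is_anomalous(hot), is_anomalous(ok))  # → True False
```

A production system would replace the single Gaussian with the mixture-model and tree-ensemble detectors the text describes, and refit them as telemetry drifts.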
Construct specialized instruction-level test cases that drive the chip under test to high utilization while meeting requirements such as switching activity;
- Dynamically capture multi-channel data from the chip under test, and assess the health level of the chip's runtime sequence through the chip health model;
- The health assessment reflects post-deployment indicators of the chip, such as MTTF, MTBF, and energy-efficiency performance;
- Fully supports diverse chip core architectures, with evaluation models that can be flexibly adjusted for task scenarios such as AI training, inference, and high-performance computing, ensuring the practicality and generalizability of the assessment results;
- Outputs a structured health-assessment report that serves as key evidence for compute-asset transactions, operations and maintenance, and residual-value evaluation, promoting transparency and standardization of compute-infrastructure assets.
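Of the indicators mentioned, MTBF is straightforward to derive from a failure log. The sketch below computes it over an observation window; the event data and the assumption of negligible repair time are illustrative, since the text does not specify the report's input format.

```python
from datetime import datetime

# Hypothetical MTBF computation from a chip's failure timestamps.
# Event data are made up for illustration; repair time is assumed negligible.

def mtbf_hours(failure_times: list, window_start: datetime,
               window_end: datetime) -> float:
    """Mean time between failures = observed uptime / failure count."""
    if not failure_times:
        return float("inf")
    uptime_h = (window_end - window_start).total_seconds() / 3600
    return uptime_h / len(failure_times)

start = datetime(2024, 1, 1)
end = datetime(2024, 1, 31)
events = [datetime(2024, 1, 8), datetime(2024, 1, 19), datetime(2024, 1, 27)]
print(f"MTBF: {mtbf_hours(events, start, end):.0f} h")  # → MTBF: 240 h
```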
Load Sensing: Combine chip utilization metrics, eBPF-based tracing, and side-channel monitoring to accurately capture application characteristics;
Health Status Sensing: Based on the chip health model, dynamically monitor and infer each compute node's health level and optimal operating state;
Chip & Cluster Configuration Optimization: Use technologies such as DVFS and AVS, combined with multiple algorithmic models, to optimize chips in real time, and schedule the cluster according to chip health levels so that every chip node operates stably and reliably.
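A minimal sketch of the health-aware DVFS idea above: choose a clock step from core temperature and a health score. The frequency steps, thresholds, and 0-to-1 health scale are all illustrative assumptions, not vendor-specified values or the product's actual policy.

```python
# Health-aware DVFS decision sketch. All constants are illustrative
# assumptions; a real controller would read hardware p-states and sensors.
FREQ_STEPS_MHZ = [1980, 1785, 1590, 1395]  # assumed p-states, highest first

def select_frequency(core_temp_c: float, health_score: float) -> int:
    """Lower the clock as temperature rises or chip health degrades."""
    if core_temp_c >= 90 or health_score < 0.25:
        return FREQ_STEPS_MHZ[3]
    if core_temp_c >= 80 or health_score < 0.5:
        return FREQ_STEPS_MHZ[2]
    if core_temp_c >= 70 or health_score < 0.75:
        return FREQ_STEPS_MHZ[1]
    return FREQ_STEPS_MHZ[0]

print(select_frequency(65, 0.9))   # healthy, cool chip runs full clock → 1980
print(select_frequency(85, 0.9))   # hot chip is throttled → 1590
print(select_frequency(60, 0.2))   # degraded chip capped low → 1395
```

In the scheme described above, the same health score would also feed the cluster scheduler, steering long jobs away from nodes the model rates as degraded.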
The founding team holds 5+ invention patents covering multi-channel monitoring, chip aging prediction, and reliability assessment.