Semiconductor architectures are facing unprecedented thermodynamic challenges. As computing cores shift from traditional general-purpose processors to large-scale parallel accelerators, the power density of a single chip has crossed the critical threshold of physical cooling. NVIDIA has moved from the 300W thermal design power (TDP) of the V100 architecture to 700W for the H100, and on to an astonishing 1200W for the B200 chip in the Blackwell architecture. This trajectory clearly signals the end of the air-cooling era. In rack-scale systems such as the GB200 NVL72, a single chip package can peak at 2700W, while whole-rack power consumption climbs to 120kW or more, far beyond the roughly 15-20kW ceiling of traditional air-cooling technology.
In such an extreme heat-flux environment, the Micro-channel Liquid Cooling Plate (MLCP) is no longer optional for data centers; it is the only viable path to sustaining the continuous operation of trillion-parameter large models. By shrinking the heat-transfer scale to the micrometer level, microchannel liquid cooling significantly improves heat-transfer efficiency between the fluid and the solid surface, allowing it to handle local hot spots exceeding 150W/cm². This shift from "component cooling" to "system-level integrated liquid cooling" marks the start of a liquid-cooling era for data center infrastructure characterized by high density, high energy efficiency, and high reliability.
The precise design of the hydraulic diameter directly affects thermal resistance. For microfin structures with a high aspect ratio (fin height L much greater than fin spacing P), the hydraulic diameter of a rectangular channel can be approximately expressed as:

D_h = 4A / P_wet = 4PL / (2(P + L)) ≈ 2P  (for L ≫ P)
This means that reducing the fin spacing is the most direct means of enhancing heat dissipation efficiency, but doing so increases flow resistance (pressure drop). Engineers must seek the optimal balance between thermal resistance (typically required to be below 0.03°C/W) and pressure drop (usually limited to within 20 kPa).
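As a rough illustration of this balance, the following Python sketch (a back-of-envelope calculation, not a design tool from the text) computes the hydraulic diameter and a laminar pressure-drop estimate for a high-aspect-ratio channel, using the circular-duct friction correlation f = 64/Re as a first approximation. The example geometry and coolant velocity are assumed values:

```python
def hydraulic_diameter(P: float, L: float) -> float:
    """Rectangular channel of width P (fin spacing) and height L (fin height):
    D_h = 4*A / wetted perimeter; for L >> P this approaches 2*P."""
    return 4 * P * L / (2 * (P + L))

def pressure_drop_laminar(P, L, length, velocity, mu=1.0e-3, rho=1000.0):
    """Darcy-Weisbach pressure drop with f = 64/Re (fully developed laminar
    flow, circular-duct approximation), assuming water-like coolant
    properties (mu in Pa*s, rho in kg/m^3)."""
    d_h = hydraulic_diameter(P, L)
    re = rho * velocity * d_h / mu               # Reynolds number
    f = 64.0 / re                                # laminar friction factor
    return f * (length / d_h) * 0.5 * rho * velocity**2  # Pa

# Example geometry (assumed, not from the text): 0.1 mm spacing,
# 2 mm fin height, 40 mm channel length, 1 m/s coolant velocity.
P, L = 0.1e-3, 2.0e-3
print(f"D_h = {hydraulic_diameter(P, L) * 1e6:.0f} um")
print(f"dP  = {pressure_drop_laminar(P, L, 0.04, 1.0) / 1e3:.1f} kPa")
```

Note that halving the spacing roughly halves D_h but raises the pressure drop steeply (roughly quadratically at fixed velocity), which is why the 20 kPa budget quoted above quickly becomes the binding constraint.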
2.2 Reshaping of thermal resistance chain and junction temperature control
In traditional cold plate cooling solutions, heat needs to pass through multiple physical interfaces such as the chip cover plate, thermal interface material (TIM), and cold plate substrate, forming a lengthy thermal resistance chain. The MLCP technology promoted by NVIDIA, especially the future Micro Channel Lid (MCL) technology, aims to eliminate intermediate media. By directly integrating the microchannel structure into the cover plate of the chip package, the cooling liquid can be brought extremely close to the heat source core, resulting in a 3-5 times improvement in cooling efficiency compared to traditional solutions. This design can stabilize the chip junction temperature below 75°C, preventing hardware throttling due to excessive temperature, thereby ensuring the determinism of AI computing tasks.
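The effect of shortening the chain can be sketched as a series thermal-resistance sum. The individual resistance values below are illustrative placeholders, not NVIDIA figures; only the structure of the calculation is taken from the text:

```python
def junction_temp(coolant_in_c: float, power_w: float, resistances_c_per_w):
    """Series thermal-resistance model: T_junction = T_coolant + P * sum(R_i)."""
    return coolant_in_c + power_w * sum(resistances_c_per_w)

# Traditional chain: die -> TIM1 -> lid -> TIM2 -> cold plate -> convection
traditional = [0.010, 0.015, 0.005, 0.015, 0.008, 0.020]  # °C/W, assumed
# MCL-style chain: die -> TIM1 -> microchannel lid (convection inside the lid)
mcl = [0.010, 0.015, 0.012]                               # °C/W, assumed

for name, chain in [("traditional", traditional), ("MCL", mcl)]:
    tj = junction_temp(coolant_in_c=35.0, power_w=1000.0,
                       resistances_c_per_w=chain)
    print(f"{name}: Tj = {tj:.1f} °C")
```

With these placeholder numbers, removing the TIM2 and cold-plate stages keeps the junction below the 75°C target at 1 kW, while the longer chain does not; the point is the structural advantage, not the specific values.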
Chapter 3: Evolution of NVIDIA's Liquid-Cooled Architecture: From GB200 to Rubin
NVIDIA's layout in liquid cooling technology presents a clear trajectory from "single component" to "platform-level integration" and then to "package-level fusion".
In the Blackwell architecture, the GB200 NVL72 system incorporates a large-scale integrated cold plate design. Each computing tray is equipped with a specially customized cold plate, covering two Grace CPUs and four Blackwell GPUs.
Large cold plate strategy: GB200 adopts an integrated cold plate covering scheme, which is more physically stable and facilitates pipeline routing in narrow spaces within the rack.
Blind-mate connector integration: Through the UQD04 blind-mate connector, the tray can automatically establish a coolant connection when pushed into the rack manifold, without manual intervention, and supports online hot-swappable maintenance.
With the launch of the Rubin platform, NVIDIA has introduced deeper innovation in the field of heat dissipation.
Independent cold plate solution: Unlike the overall coverage of GB200, GB300 and Rubin platforms tend to adopt an independent cold plate design tailored for individual chips. This shift allows for customization of internal flow channels based on the different power consumption characteristics of CPUs and GPUs, enabling more precise temperature gradient control.
Microchannel Lid (MCL): The MCL technology, expected to be mass-produced in 2027, represents a significant breakthrough in the field of heat dissipation. It directly "carves" microchannels into the lid of the chip package, allowing the cooling fluid to directly exchange heat with the chip lid, further shortening the heat transfer path.
Chapter 4: Manufacturing Process and Material Science of the Microchannel Liquid-Cooled Plate

The manufacturing of microchannel cold plates falls within the interdisciplinary field of micro-nano manufacturing and precision engineering. The core challenge lies in how to process a flow channel structure with high aspect ratio and micron-level precision on a rigid metal substrate.
Skived Fin Technology: Skiving is currently the mainstream technology for manufacturing high-density micro-fin cold plates. It involves continuously cutting and vertically bending fins from a copper substrate using a specialized tool.
Features: The thickness of the fins can be as thin as 0.05mm, and the spacing can also be controlled at around 0.05mm. Moreover, the fins and the base are integrally formed, eliminating any contact thermal resistance.
Application: Widely used in the high-heat-generating areas of NVIDIA GPU cold plates.
Laser and etching technology: Laser ablation and chemical etching can produce nonlinear and complex flow channels. Laser processing is fast, precise, and not limited by the shape of the tool; etching technology can achieve extremely high surface smoothness of the flow channels.
3D printing (additive manufacturing): Utilizing metal powder fusion technology, it is possible to create bionic flow channels or internal lattice structures that are unachievable with traditional subtractive manufacturing processes. Although the cost is higher, on next-generation experimental platforms such as Rubin, 3D printing has been shown to improve heat dissipation efficiency by approximately 50%.
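Using the skived-fin figures quoted above (fins as thin as 0.05mm at roughly 0.05mm spacing), a simple geometric sketch shows why such fine pitches matter; the base dimensions and fin height below are assumptions for illustration:

```python
def fin_array_area(base_w, base_l, fin_t, gap, fin_h):
    """Return (fin count, total wetted area) for straight skived fins running
    the full length of a base of width base_w and length base_l (all in m)."""
    pitch = fin_t + gap
    n = round(base_w / pitch)
    # Each fin adds two side walls of height fin_h along the base length.
    area = base_w * base_l + n * 2 * fin_h * base_l
    return n, area

# Assumed 40 mm x 40 mm base with 3 mm fin height; fin thickness and
# spacing (0.05 mm each) are the figures from the text.
n, a = fin_array_area(base_w=40e-3, base_l=40e-3,
                      fin_t=0.05e-3, gap=0.05e-3, fin_h=3e-3)
flat = 40e-3 * 40e-3
print(f"{n} fins, wetted area ~{a / flat:.0f}x that of a flat plate")
```

The tens-of-times increase in wetted area over a bare plate is what lets skived-fin cold plates absorb hot-spot fluxes beyond 150W/cm².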
In terms of material selection, NVIDIA's suppliers typically opt for C1020 oxygen-free copper (99.99% purity), for two main reasons:
High thermal conductivity: Copper's thermal conductivity (approximately 390-400 W/m·K) far exceeds that of aluminum, enabling it to quickly conduct point-like heat sources to the surface of large-area flow channels.
Process stability: Oxygen-free copper exhibits excellent performance during vacuum brazing, avoiding peeling or oxidation due to high temperatures. This ensures the cleanliness of the internal flow within the microchannel, preventing small particles from blocking the flow path.
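A one-dimensional Fourier conduction estimate illustrates the first point. The 1mm base thickness and the conductivity values below are assumptions for illustration; the 150W/cm² hot-spot flux is the figure from the text:

```python
def conduction_dt(heat_flux_w_cm2: float, thickness_m: float, k: float) -> float:
    """1-D Fourier conduction: dT = q'' * t / k, with q'' converted
    from W/cm^2 to W/m^2 and k in W/(m*K)."""
    return heat_flux_w_cm2 * 1e4 * thickness_m / k

# Temperature drop across a 1 mm cold-plate base at 150 W/cm^2.
for name, k in [("C1020 copper", 395.0), ("aluminium alloy", 167.0)]:
    dt = conduction_dt(150.0, 1.0e-3, k)
    print(f"{name}: dT = {dt:.1f} °C")
```

At this flux, every millimetre of copper costs only a few degrees, while a lower-conductivity metal more than doubles that penalty; in a budget where the junction must stay below 75°C, those degrees matter.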
4.3 Connection process: vacuum brazing and diffusion welding

The bond between the cold plate base and cover plate directly affects the sealing performance and lifespan of the system.
Vacuum brazing: The process of filling joints with liquid brazing filler metals in a vacuum environment, characterized by minimal deformation and high joint strength, is currently the standard process for the production of GB200 cold plates.
Diffusion welding: It achieves connection through atomic-level diffusion, with a strength close to that of the base material, and can withstand extremely high fluid pressure. It is commonly used in aviation-grade cold plates or high-performance CDU heat exchangers that have stringent pressure requirements.
Chapter 5 System-level Adaptation: Rack, Manifold, and Cooling Distribution Unit (CDU)
The heat dissipation performance of microchannel cold plates must rely on an efficient rack-level circulation system to be translated into energy efficiency benefits for data centers.
5.1 Core functions of the Cooling Distribution Unit (CDU)

The CDU is the "heart" of the liquid cooling system, responsible for heat exchange between the primary circuit (facility side) and the secondary circuit (IT side).
Heat exchange capacity: The in-rack CDU recommended by NVIDIA typically provides 250kW of cooling capacity, sufficient to support the maximum power consumption of the NVL72 rack. In-row CDUs scale up to 1.3MW or even 1.8MW, supporting supercomputing clusters built from multiple liquid-cooled racks.
Control logic: The CDU is equipped with a high-precision pump set that can dynamically adjust the flow rate according to AI load, ensuring that the flow distribution matches the chip heat output in real time.
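The capacity figures above can be related to coolant flow through a simple energy balance, Q = m_dot * cp * dT. The 10°C loop temperature rise and the water-like coolant properties below are assumed values, not specifications from the text:

```python
def required_flow_lpm(q_kw: float, delta_t_c: float,
                      cp: float = 4186.0, rho: float = 1000.0) -> float:
    """Volumetric coolant flow (L/min) needed to carry q_kw of heat with a
    loop temperature rise of delta_t_c, assuming water-like cp (J/(kg*K))
    and rho (kg/m^3)."""
    m_dot = q_kw * 1e3 / (cp * delta_t_c)     # mass flow, kg/s
    return m_dot / rho * 1000.0 * 60.0        # convert m^3/s to L/min

# In-rack (250 kW) vs in-row (1.3 MW) CDU capacities from the text.
for q in (250.0, 1300.0):
    print(f"{q:.0f} kW at dT = 10 °C -> {required_flow_lpm(q, 10.0):.0f} L/min")
```

This is the quantity the CDU's variable-speed pump set must modulate: as AI load swings, the controller tracks the required flow so the loop temperature rise stays near its setpoint.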