Semiconductor architectures are facing unprecedented thermodynamic challenges. As computing cores shift from traditional general-purpose processors to large-scale parallel accelerators, the power density of a single chip has crossed the critical threshold of physical cooling. NVIDIA has moved from the 300W thermal design power (TDP) of the V100 architecture to 700W for the H100, and on to an astonishing 1200W for the B200 chip in the Blackwell architecture. This trajectory clearly signals the end of the air-cooling era. In rack-scale systems such as the GB200 NVL72, a single chip package can peak at 2700W, while whole-rack power consumption climbs to 120kW or more, far beyond the roughly 15-20kW ceiling of traditional air-cooling technology.
In such an extreme heat-flux environment, the Micro-channel Liquid Cooling Plate (MLCP) is no longer optional for data centers; it is the only viable path to sustaining the continuous operation of trillion-parameter large models. By shrinking the heat-transfer scale to the micrometer level, microchannel liquid cooling significantly improves heat-transfer efficiency between the fluid and the solid surface, allowing it to handle local hot spots exceeding 150W/cm². This shift from "component cooling" to "system-level integrated liquid cooling" marks the start of a liquid-cooling era for data center infrastructure characterized by high density, high energy efficiency, and high reliability.
The precise design of the hydraulic diameter directly affects thermal resistance. For microfin structures with a high aspect ratio (fin height L much greater than fin spacing P), the hydraulic diameter of a rectangular channel can be approximately expressed as:

D_h = 4A / P_wet = 4PL / (2(P + L)) ≈ 2P  (for L ≫ P)
This means that reducing the fin spacing is the most direct means of enhancing heat dissipation efficiency, but doing so increases flow resistance (pressure drop). Engineers must seek the optimal balance between thermal resistance (typically required to be below 0.03°C/W) and pressure drop (usually limited to within 20 kPa).
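As a rough illustration of this balance, the following Python sketch (a back-of-envelope calculation, not a design tool from the text) computes the hydraulic diameter and a laminar pressure-drop estimate for a high-aspect-ratio channel, using the circular-duct friction correlation f = 64/Re as a first approximation. The example geometry and coolant velocity are assumed values:

```python
def hydraulic_diameter(P: float, L: float) -> float:
    """Rectangular channel of width P (fin spacing) and height L (fin height):
    D_h = 4*A / wetted perimeter; for L >> P this approaches 2*P."""
    return 4 * P * L / (2 * (P + L))

def pressure_drop_laminar(P, L, length, velocity, mu=1.0e-3, rho=1000.0):
    """Darcy-Weisbach pressure drop with f = 64/Re (fully developed laminar
    flow, circular-duct approximation), assuming water-like coolant
    properties (mu in Pa*s, rho in kg/m^3)."""
    d_h = hydraulic_diameter(P, L)
    re = rho * velocity * d_h / mu               # Reynolds number
    f = 64.0 / re                                # laminar friction factor
    return f * (length / d_h) * 0.5 * rho * velocity**2  # Pa

# Example geometry (assumed, not from the text): 0.1 mm spacing,
# 2 mm fin height, 40 mm channel length, 1 m/s coolant velocity.
P, L = 0.1e-3, 2.0e-3
print(f"D_h = {hydraulic_diameter(P, L) * 1e6:.0f} um")
print(f"dP  = {pressure_drop_laminar(P, L, 0.04, 1.0) / 1e3:.1f} kPa")
```

Note that halving the spacing roughly halves D_h but raises the pressure drop steeply (roughly quadratically at fixed velocity), which is why the 20 kPa budget quoted above quickly becomes the binding constraint.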
2.2 Reshaping of thermal resistance chain and junction temperature control
In traditional cold plate cooling solutions, heat needs to pass through multiple physical interfaces such as the chip cover plate, thermal interface material (TIM), and cold plate substrate, forming a lengthy thermal resistance chain. The MLCP technology promoted by NVIDIA, especially the future Micro Channel Lid (MCL) technology, aims to eliminate intermediate media. By directly integrating the microchannel structure into the cover plate of the chip package, the cooling liquid can be brought extremely close to the heat source core, resulting in a 3-5 times improvement in cooling efficiency compared to traditional solutions. This design can stabilize the chip junction temperature below 75°C, preventing hardware throttling due to excessive temperature, thereby ensuring the determinism of AI computing tasks.
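The effect of shortening the chain can be sketched as a series thermal-resistance sum. The individual resistance values below are illustrative placeholders, not NVIDIA figures; only the structure of the calculation is taken from the text:

```python
def junction_temp(coolant_in_c: float, power_w: float, resistances_c_per_w):
    """Series thermal-resistance model: T_junction = T_coolant + P * sum(R_i)."""
    return coolant_in_c + power_w * sum(resistances_c_per_w)

# Traditional chain: die -> TIM1 -> lid -> TIM2 -> cold plate -> convection
traditional = [0.010, 0.015, 0.005, 0.015, 0.008, 0.020]  # °C/W, assumed
# MCL-style chain: die -> TIM1 -> microchannel lid (convection inside the lid)
mcl = [0.010, 0.015, 0.012]                               # °C/W, assumed

for name, chain in [("traditional", traditional), ("MCL", mcl)]:
    tj = junction_temp(coolant_in_c=35.0, power_w=1000.0,
                       resistances_c_per_w=chain)
    print(f"{name}: Tj = {tj:.1f} °C")
```

With these placeholder numbers, removing the TIM2 and cold-plate stages keeps the junction below the 75°C target at 1 kW, while the longer chain does not; the point is the structural advantage, not the specific values.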
Chapter 3: Evolution of NVIDIA's Liquid-Cooled Architecture: From GB200 to Rubin
NVIDIA's layout in liquid cooling technology presents a clear trajectory from "single component" to "platform-level integration" and then to "package-level fusion".
In the Blackwell architecture, the GB200 NVL72 system incorporates a large-scale integrated cold plate design. Each computing tray is equipped with a specially customized cold plate, covering two Grace CPUs and four Blackwell GPUs.
Large cold plate strategy: GB200 adopts an integrated cold plate covering scheme, which is more physically stable and facilitates pipeline routing in narrow spaces within the rack.
Blind-mate connector integration: Through the UQD04 blind-mate connector, the tray can automatically establish a coolant connection when pushed into the rack manifold, without manual intervention, and supports online hot-swappable maintenance.
With the launch of the Rubin platform, NVIDIA has introduced deeper innovation in the field of heat dissipation.
Independent cold plate solution: Unlike the overall coverage of GB200, GB300 and Rubin platforms tend to adopt an independent cold plate design tailored for individual chips. This shift allows for customization of internal flow channels based on the different power consumption characteristics of CPUs and GPUs, enabling more precise temperature gradient control.
Microchannel Lid (MCL): The MCL technology, expected to be mass-produced in 2027, represents a significant breakthrough in the field of heat dissipation. It directly "carves" microchannels into the lid of the chip package, allowing the cooling fluid to directly exchange heat with the chip lid, further shortening the heat transfer path.
Chapter 4: Manufacturing Process and Material Science of the Microchannel Liquid-Cooled Plate

The manufacturing of microchannel cold plates falls within the interdisciplinary field of micro-nano manufacturing and precision engineering. The core challenge lies in how to process a flow channel structure with high aspect ratio and micron-level precision on a rigid metal substrate.
Skived Fin Technology: Skiving is currently the mainstream technology for manufacturing high-density micro-fin cold plates. It involves continuously cutting and vertically bending fins from a copper substrate using a specialized tool.
Features: The thickness of the fins can be as thin as 0.05mm, and the spacing can also be controlled at around 0.05mm. Moreover, the fins and the base are integrally formed, eliminating any contact thermal resistance.
Application: Widely used in the high-heat-generating areas of NVIDIA GPU cold plates.
Laser and etching technology: Laser ablation and chemical etching can produce nonlinear and complex flow channels. Laser processing is fast, precise, and not limited by the shape of the tool; etching technology can achieve extremely high surface smoothness of the flow channels.
3D printing (additive manufacturing): Utilizing metal powder fusion technology, it is possible to create bionic flow channels or internal lattice structures that are unachievable with traditional subtractive manufacturing processes. Although the cost is higher, on next-generation experimental platforms such as Rubin, 3D printing has been shown to improve heat dissipation efficiency by approximately 50%.
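Using the skived-fin figures quoted above (fins as thin as 0.05mm at roughly 0.05mm spacing), a simple geometric sketch shows why such fine pitches matter; the base dimensions and fin height below are assumptions for illustration:

```python
def fin_array_area(base_w, base_l, fin_t, gap, fin_h):
    """Return (fin count, total wetted area) for straight skived fins running
    the full length of a base of width base_w and length base_l (all in m)."""
    pitch = fin_t + gap
    n = round(base_w / pitch)
    # Each fin adds two side walls of height fin_h along the base length.
    area = base_w * base_l + n * 2 * fin_h * base_l
    return n, area

# Assumed 40 mm x 40 mm base with 3 mm fin height; fin thickness and
# spacing (0.05 mm each) are the figures from the text.
n, a = fin_array_area(base_w=40e-3, base_l=40e-3,
                      fin_t=0.05e-3, gap=0.05e-3, fin_h=3e-3)
flat = 40e-3 * 40e-3
print(f"{n} fins, wetted area ~{a / flat:.0f}x that of a flat plate")
```

The tens-of-times increase in wetted area over a bare plate is what lets skived-fin cold plates absorb hot-spot fluxes beyond 150W/cm².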
In terms of material selection, NVIDIA's suppliers typically opt for C1020 oxygen-free copper (99.99% purity), for two main reasons:
High thermal conductivity: Copper's thermal conductivity (approximately 390-400 W/m·K) far exceeds that of aluminum, enabling it to quickly conduct point-like heat sources to the surface of large-area flow channels.
Process stability: Oxygen-free copper exhibits excellent performance during vacuum brazing, avoiding peeling or oxidation due to high temperatures. This ensures the cleanliness of the internal flow within the microchannel, preventing small particles from blocking the flow path.
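A one-dimensional Fourier conduction estimate illustrates the first point. The 1mm base thickness and the conductivity values below are assumptions for illustration; the 150W/cm² hot-spot flux is the figure from the text:

```python
def conduction_dt(heat_flux_w_cm2: float, thickness_m: float, k: float) -> float:
    """1-D Fourier conduction: dT = q'' * t / k, with q'' converted
    from W/cm^2 to W/m^2 and k in W/(m*K)."""
    return heat_flux_w_cm2 * 1e4 * thickness_m / k

# Temperature drop across a 1 mm cold-plate base at 150 W/cm^2.
for name, k in [("C1020 copper", 395.0), ("aluminium alloy", 167.0)]:
    dt = conduction_dt(150.0, 1.0e-3, k)
    print(f"{name}: dT = {dt:.1f} °C")
```

At this flux, every millimetre of copper costs only a few degrees, while a lower-conductivity metal more than doubles that penalty; in a budget where the junction must stay below 75°C, those degrees matter.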
4.3 Connection process: vacuum brazing and diffusion welding

The bond between the cold plate base and cover plate directly affects the sealing performance and lifespan of the system.
Vacuum brazing: The process of filling joints with liquid brazing filler metals in a vacuum environment, characterized by minimal deformation and high joint strength, is currently the standard process for the production of GB200 cold plates.
Diffusion welding: It achieves connection through atomic-level diffusion, with a strength close to that of the base material, and can withstand extremely high fluid pressure. It is commonly used in aviation-grade cold plates or high-performance CDU heat exchangers that have stringent pressure requirements.
Chapter 5 System-level Adaptation: Rack, Manifold, and Cooling Distribution Unit (CDU)
The heat dissipation performance of microchannel cold plates must rely on an efficient rack-level circulation system to be translated into energy efficiency benefits for data centers.
5.1 Core functions of the Cooling Distribution Unit (CDU)

The CDU is the "heart" of the liquid cooling system, responsible for heat exchange between the primary circuit (facility side) and the secondary circuit (IT side).
Heat exchange capacity: The in-rack CDU recommended by NVIDIA typically provides 250kW of cooling capacity, sufficient to support the maximum power consumption of the NVL72 rack. In-row CDUs scale up to 1.3MW or even 1.8MW, supporting supercomputing clusters built from multiple liquid-cooled racks.
Control logic: The CDU is equipped with a high-precision pump set that can dynamically adjust the flow rate according to AI load, ensuring that the flow distribution matches the chip heat output in real time.
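The capacity figures above can be related to coolant flow through a simple energy balance, Q = m_dot * cp * dT. The 10°C loop temperature rise and the water-like coolant properties below are assumed values, not specifications from the text:

```python
def required_flow_lpm(q_kw: float, delta_t_c: float,
                      cp: float = 4186.0, rho: float = 1000.0) -> float:
    """Volumetric coolant flow (L/min) needed to carry q_kw of heat with a
    loop temperature rise of delta_t_c, assuming water-like cp (J/(kg*K))
    and rho (kg/m^3)."""
    m_dot = q_kw * 1e3 / (cp * delta_t_c)     # mass flow, kg/s
    return m_dot / rho * 1000.0 * 60.0        # convert m^3/s to L/min

# In-rack (250 kW) vs in-row (1.3 MW) CDU capacities from the text.
for q in (250.0, 1300.0):
    print(f"{q:.0f} kW at dT = 10 °C -> {required_flow_lpm(q, 10.0):.0f} L/min")
```

This is the quantity the CDU's variable-speed pump set must modulate: as AI load swings, the controller tracks the required flow so the loop temperature rise stays near its setpoint.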