During the Supercomputing 2024 (SC24) event, Enfabrica Corporation introduced a groundbreaking innovation in AI data center networking: the Accelerated Compute Fabric (ACF) SuperNIC chip. This 3.2 Terabit-per-second (Tbps) Network Interface Card (NIC) SoC is set to revolutionize large-scale AI and machine learning (ML) operations by supporting clusters of over 500,000 GPUs. Enfabrica also secured $115 million in funding and is on track to launch the ACF SuperNIC in Q1 2025.
Addressing AI Networking Challenges
As AI models continue to grow in complexity, data centers are under pressure to efficiently connect a large number of specialized processing units, such as GPUs. The ACF SuperNIC chip aims to tackle this challenge by enabling seamless scalability and maximizing GPU utilization without performance bottlenecks.
Traditional networking solutions struggle to effectively connect more than 100,000 AI computing chips. Enfabrica’s CEO, Rochan Sankar, said the new technology can support up to 500,000 chips in a single AI/ML system, enabling more robust and reliable AI model computations.
Key Innovations in the ACF SuperNIC
The ACF SuperNIC introduces several pioneering features tailored to meet the demands of modern AI data centers:
- High-Bandwidth, Multi-Port Connectivity: The ACF SuperNIC offers multi-port 800-Gigabit Ethernet to GPU servers, significantly enhancing bandwidth and communication resilience across AI clusters.
- Efficient Two-Tier Network Design: With a high-radix configuration and ample PCIe lanes, the ACF SuperNIC streamlines AI data center architecture, reducing latency and improving data transfer efficiency.
- Scaling Up and Scaling Out: The ACF SuperNIC supports growth both within a GPU server system (scale-up) and across the cluster (scale-out), improving performance and resiliency as AI deployments expand.
- Integrated PCIe Interface: Supporting high-speed communication and flexible layout options, the ACF SuperNIC optimizes data transfer within AI workloads.
- Resilient Message Multipathing (RMM): Enfabrica’s RMM technology enhances AI cluster reliability by mitigating network link failures, ensuring smoother training processes.
- Software-Defined RDMA Networking: Offering full-stack programmability, this feature optimizes network topologies for cloud-scale operations without compromising performance.
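The headline numbers above fit together with some quick arithmetic. The sketch below is a back-of-envelope check, not a vendor specification: the announcement gives 3.2 Tbps per chip, multi-port 800GbE, and 500,000-GPU clusters, but it does not state the exact port count, so the four-port split is an assumption.

```python
# Figures from Enfabrica's SC24 announcement: 3.2 Tbps per ACF SuperNIC,
# multi-port 800-Gigabit Ethernet, clusters of over 500,000 GPUs.
NIC_BANDWIDTH_GBPS = 3200   # 3.2 Tbps per ACF SuperNIC
PORT_SPEED_GBPS = 800       # 800-Gigabit Ethernet per port
CLUSTER_GPUS = 500_000      # claimed maximum cluster scale

# ASSUMPTION: the full 3.2 Tbps is exposed as 800GbE ports,
# which would imply four ports per NIC.
ports_per_nic = NIC_BANDWIDTH_GBPS // PORT_SPEED_GBPS

# Aggregate NIC bandwidth if every GPU had one ACF SuperNIC,
# expressed in petabits per second.
aggregate_pbps = CLUSTER_GPUS * NIC_BANDWIDTH_GBPS / 1e6

print(f"{ports_per_nic} x 800GbE ports per NIC")
print(f"~{aggregate_pbps:.0f} Pbps aggregate across {CLUSTER_GPUS:,} GPUs")
```

At the claimed scale this works out to roughly 1,600 Pbps of aggregate NIC bandwidth, which illustrates why multipathing and a flattened two-tier topology matter at this size.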
Enhanced Resiliency and Efficiency
The ACF SuperNIC’s redundancy features minimize the impact of component failures, improving uptime and system reliability, while its Collective Memory Zoning technology improves data-transfer efficiency and performance across GPU server fleets.
Scalability and Operational Benefits
Beyond scale, the ACF SuperNIC prioritizes operational efficiency, offering compatibility with diverse AI compute environments and streamlining networking infrastructure for data center operators.
Availability and Future Prospects
Enfabrica’s ACF SuperNIC is set to launch in limited quantities in Q1 2025, with orders now open for both the chips and pilot systems. As AI models continue to grow, Enfabrica’s approach could shape the design of future data centers supporting frontier AI models.