A third InfiniBand dialect in the works – again


The InfiniBand interconnect grew out of the ashes of a struggle over the future of server I/O at the end of the last millennium, and instead of becoming that generic I/O, it became a low-latency, high-bandwidth interconnect used for high performance computing. And in this role, it has undoubtedly succeeded.

Over the past fifteen years, InfiniBand has also grown into a system interconnect with some vendors – IBM has used InfiniBand as a peripheral I/O bus on its Power Systems machines and mainframes for many years, but never called it that. InfiniBand has been used as a clustered storage backbone, too, and it is now the preferred node-to-node network for AI clusters doing machine learning training. If you were building a database cluster, you would probably choose InfiniBand interconnects as well, as Oracle did for its Exadata system, for example.

Two decades on, this turns out to be a reasonable facsimile of the vision that Phil Murphy, one of the co-founders of Cornelis Networks and the company's chief executive officer, had when he left Unisys in 1999, after the founding of InfiniBand by IBM and Intel, and formed SilverStorm Technologies to build InfiniBand switching hardware and software. PathScale, a maker of InfiniBand host adapters, was acquired by Fibre Channel switch and adapter maker QLogic for $109 million in February 2006, and QLogic followed up with the acquisition of SilverStorm for $60 million in October 2006 to round out the InfiniBand switch business it had picked up from Ancor Communications for $15 million almost six years earlier – some would say before the market was really ready for InfiniBand switching.

QLogic merged these technologies to create its TrueScale InfiniBand platform, which ran much of the networking software stack on the CPU cores of the server nodes – something Intel obviously liked – and which was acquired by Intel in January 2012 for $125 million. Just three months later, Intel acquired Cray's "Gemini" XT and "Aries" XC interconnect businesses for $140 million, and set about creating the Omni-Path interconnect, which would marry some of the concepts of InfiniBand with Aries to create a new kind of high performance interconnect suitable for all of the workloads mentioned above. Omni-Path was a key part of the Xeon Phi "Knights" compute accelerators and of Intel's overall supercomputing effort. The Knights processors were phased out three years ago, and Omni-Path is now taking a new course under Cornelis – one that Murphy says is better suited to the current and future state of high performance computing and storage.

A little history on the InfiniBand protocol is in order to fully understand the turn that Cornelis is taking with its implementation of Omni-Path InfiniBand.

“InfiniBand’s verbs-based software infrastructure really stems from InfiniBand’s original goals, which were to replace PCI-X and Fibre Channel and possibly Ethernet,” Murphy tells The Next Platform. “The verbs were not structured at all for high performance computing. PathScale created Performance Scaled Messaging, or PSM, which was completely independent of the InfiniBand verbs and was a purpose-built parallel transport layer focused on HPC. In the enterprise, when I am talking to 40 or 50 disk drives over 40 or 50 queue pairs, I can cache that on my adapter and it works great. But in HPC, when I have a node with a hundred cores and a thousand nodes, it becomes a giant scalability problem that we just can’t handle in the adapter cache. PSM could do this better, but even that was invented two decades ago and the world has kept evolving. We are seeing the convergence of HPC, machine learning, and data analytics, and there are accelerators as well as processors in the mix now.”
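
To make concrete what those verbs and queue pairs are, here is a minimal, illustrative sketch (ours, not Cornelis') of setting up a single queue pair with the libibverbs API in C. It assumes an RDMA-capable adapter and libibverbs are present, and trims error handling. Every peer a node talks to needs adapter state like this, which is exactly the cache pressure Murphy describes once a hundred cores are talking across a thousand nodes.

```c
/* Illustrative sketch only: one RDMA "verbs" queue pair via libibverbs.
 * Assumes an RDMA-capable adapter and libibverbs (link with -libverbs).
 * Error handling is trimmed for brevity. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0); /* completion queue */

    /* Each remote peer (a disk shelf, or another MPI rank) needs its own
     * queue pair, and the adapter has to hold this state for every one. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,   /* reliable connected, the classic enterprise mode */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("created queue pair number 0x%x\n", qp->qp_num);

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```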

Fortunately for Cornelis, about seven years ago researchers and technologists who were part of the OpenIB Alliance, founded in 2004, created the OpenFabrics Interfaces (OFI) working group to extend remote direct memory access (RDMA) and kernel bypass techniques – which give InfiniBand and RDMA over Converged Ethernet (RoCE) their low latency to go with their high bandwidth – to other kinds of networks. The libfabric library is the first implementation of the OFI standard, and it is a layer that sits above the network interface card and the vendor's OFI provider driver, and underneath MPI, SHMEM, PGAS, and the other memory sharing protocols typically run on distributed computing systems for HPC and AI.
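
To make the layering concrete, here is a small sketch of our own (assuming libfabric is installed on the node) of how an application asks libfabric which OFI providers – PSM2 for Omni-Path, verbs for InfiniBand, TCP, and so on – are available; MPI, SHMEM, and PGAS runtimes do a more elaborate version of this query before binding to one of them.

```c
/* A minimal sketch of OFI provider discovery with libfabric.
 * Assumes libfabric headers and library are installed (link with -lfabric). */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    if (!hints)
        return 1;

    /* Ask for the semantics HPC runtimes want: reliable, connectionless
     * endpoints with tagged messaging and RDMA. */
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_TAGGED | FI_RMA;

    struct fi_info *info = NULL;
    int ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo failed: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return 1;
    }

    /* Walk the list of matching providers (psm2, verbs, tcp, ...). */
    for (struct fi_info *cur = info; cur; cur = cur->next)
        printf("provider: %-10s fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```

The point of OFI is that whichever fabric turns up in that list, the application-facing semantics stay the same.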

“All of the major MPI implementations support libfabric, as do the various Partitioned Global Address Space (PGAS) memory overlays for distributed computing systems, including OpenSHMEM from Sandia National Laboratories as well as the PGAS implementations for Mellanox InfiniBand, Cray Gemini and Aries plus Chapel, and Intel Omni-Path interconnects. Verbs and PSM needed to be replaced with something, and OFI is it. OFI is not only designed for modern applications, it is designed from the ground up to take into account not just processors, but also accelerators, in the nodes. This OFI layer is a perfect semantic match from the network up to the application layer.”

At this point, Cornelis’ team, which has doubled in size to over 100 people since the company was unveiled in September 2020, has created a vendor provider for the OFI libfabric that runs on the 100 Gb/sec Omni-Path adapters, which are now being rebranded as Omni-Path Express. This adapter can handle 160 million MPI messages per second, and approximately 10 million messages per second between two cores running on two separate server nodes connected by the network. Murphy says that at best, with any implementation of InfiniBand, you might see 3 million to 4 million messages per second per core, so call it between 2.5X and 3.3X the message throughput per core. (Obviously, to keep pace with the growing number of cores on processors and the higher performance of each core, Cornelis will need to deliver much beefier Omni-Path adapters in the future.) As for latency, on small message sizes, which are the hardest place to improve latency, a core-to-core round trip across the Omni-Path Express network is now on the order of 800 nanoseconds, which is about 20 percent lower than the 1 microsecond round trip using the old PSM driver. For HPC and AI workloads, these are big improvements in throughput and latency.
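
For reference, round-trip numbers like the 800 nanoseconds quoted above are typically gathered with a simple two-rank ping-pong microbenchmark; the sketch below is a generic illustration of that method, not the test harness Cornelis used.

```c
/* Generic MPI ping-pong sketch of the sort used to measure core-to-core
 * round-trip latency. Build with mpicc and run with one rank on each of
 * two nodes. Not Cornelis' benchmark, just an illustration. */
#include <stdio.h>
#include <mpi.h>

#define ITERS   10000
#define WARMUP  1000
#define MSG_LEN 8            /* small message, the hard case for latency */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char buf[MSG_LEN] = {0};
    int peer = 1 - rank;     /* assumes exactly two ranks */

    double start = 0.0;
    for (int i = 0; i < WARMUP + ITERS; i++) {
        if (i == WARMUP)
            start = MPI_Wtime();    /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, MSG_LEN, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_LEN, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSG_LEN, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_LEN, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        double usec = (MPI_Wtime() - start) * 1e6 / ITERS;
        printf("average round trip for %d-byte messages: %.2f usec\n",
               MSG_LEN, usec);
    }

    MPI_Finalize();
    return 0;
}
```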

Cornelis is also focused on cost. As with most InfiniBand implementations, it is better to have one port per socket than one port running at twice the speed, and ideally you want to physically hang a port off of every socket if you can. (This was the raison d'être of TrueScale back in the QDR InfiniBand days.) Cornelis says that a cluster network built from single-port 100 Gb/sec Omni-Path adapters and 100 Gb/sec Omni-Path switches will cost 55 percent less than a setup using Nvidia's 100 Gb/sec HDR InfiniBand Quantum switches and single-port ConnectX-6 adapters. For a dual-rail network implementation, where each socket has its own dedicated port, the Omni-Path configuration is still 25 percent cheaper.

The Omni-Path Express adapters and switches are currently in technical preview with about 20 customers, and probably around November, just in time for the SC21 supercomputing conference, Cornelis will roll out this updated Omni-Path stack. That will be good news for the roughly 500 customers around the world who have Omni-Path networks at the heart of their clusters. The new OFI functionality can be added with a firmware update, helping customers boost performance without touching their hardware at all.

As for the future, it looks like Cornelis will skip the 200 Gb/sec Omni-Path 200 series generation that Intel was working on and quietly shelved in July 2019. That second generation of Omni-Path was expected to incorporate more of the Aries interconnect technology, and apparently it would have broken backward compatibility – which is a no-no. Murphy says Cornelis is working on an OFI adapter card that has four lanes running at an effective 100 Gb/sec each per port. We presume that the companion Omni-Path Express switches could have somewhere between 48 and 64 ports running at the top speed of 400 Gb/sec, and twice that many ports running at 200 Gb/sec. These future Omni-Path Express switches and adapters are expected to come to market in late 2022, and we also think these chips will use monolithic 5-nanometer chip designs and use Taiwan Semiconductor Manufacturing Corp as the foundry, much as Intel did with the original Omni-Path chips. There is an outside possibility that Intel will one day be a foundry partner for Cornelis, but not anytime soon, with Intel having delays with its 7-nanometer processes and not saying much about 5 nanometers – let alone 3 nanometers.
