As we discussed in past blogs, InfiniBand is the communication standard – the protocols and specifications governing how data is transmitted and received within a network. We also discussed how transmission media (twisted-pair copper cables, fiber optic cables, etc.) provide the physical pathways through which data travels.

Catch up on parts one and two of the series.

But now let's talk about the rest of the cabling infrastructure that enables InfiniBand networks. While we are focusing on InfiniBand here, note that the same conceptual infrastructure can support RoCE (RDMA over Converged Ethernet) and other methodologies.

Bridging the gap between the technical intricacies of InfiniBand and its broader implications

As we delve into the cabling infrastructure that supports InfiniBand networks, it is essential to consider the overarching goals. We are not just laying cables; we're architecting networks that fuel the capabilities of generative AI and advanced high-performance computing. By understanding these objectives, we can tailor our network designs to meet the demanding requirements of cutting-edge computational tasks.

Network infrastructure is being pushed to limits we haven’t seen before, with 400G/800G speeds now commonly deployed. We must ensure we are designing the infrastructure correctly so that GPUs can work flawlessly and in sync with the rest of the network. As we know, “you’re only as robust as your weakest link.”

In earlier installments of the InfiniBand series, we discussed the transmission media used in an InfiniBand network. That description focused on a single point-to-point connection to keep things in layman’s terms. When deploying InfiniBand or RDMA (remote direct memory access) over a converged Ethernet network, you are building GPU clusters and superclusters that behave as a single system, sharing GPU cycles across the fabric. Many of these generative AI networks are built in a leaf, spine, and super-spine architecture.

As an example, you may be deploying the InfiniBand protocol, but physically the AI network consists of GPUs connecting to your leaf switches. Those leaf switches then connect to spine switches (if you are scaling to super nodes and superclusters), which in turn connect to super spines to create superclusters. Without knowledge and understanding of the complete infrastructure and support systems, the intricacies can be a real cluster, if you know what I mean.
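To make the fan-out concrete, here is a minimal sketch of the port math behind a leaf-spine layer. The switch radix, server counts, and 1:1 subscription ratio below are illustrative assumptions, not a reference design:

```python
# Hypothetical leaf/spine fan-out math -- illustrative numbers only.

def fabric_ports(gpus_per_server: int, servers: int, switch_radix: int) -> dict:
    """Estimate switch counts for a non-blocking leaf-spine fabric.

    Assumes one fabric port per GPU and a 1:1 subscription ratio
    (half of each leaf's ports face down to GPUs, half face up to spines).
    """
    gpu_ports = gpus_per_server * servers           # cables landing on leaf switches
    down_per_leaf = switch_radix // 2               # leaf ports facing the GPUs
    leaves = -(-gpu_ports // down_per_leaf)         # ceiling division
    uplinks = leaves * (switch_radix - down_per_leaf)
    spines = -(-uplinks // switch_radix)            # spine ports absorbing the uplinks
    return {"gpu_ports": gpu_ports, "leaf_switches": leaves,
            "leaf_to_spine_links": uplinks, "spine_switches": spines}

# Example: 128 servers x 8 GPUs with a 64-port switch (assumed values).
print(fabric_ports(gpus_per_server=8, servers=128, switch_radix=64))
# -> {'gpu_ports': 1024, 'leaf_switches': 32, 'leaf_to_spine_links': 1024, 'spine_switches': 16}
```

In a real deployment, the subscription ratio, rail-optimized cabling, and the super-spine tier all change these numbers, but even this simple exercise shows how quickly the cable counts grow.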

New levels of testing, cleaning, and cable management are imperative to get the results you expect. At over US$250k per H100 server, the thought of skimping on the infrastructure is mind-boggling. You have to ensure that quarter-of-a-million-dollar device gets maximum utilization.

Because the devices operate in harmony, cabling issues have a potentially greater impact on the overall network. Dirty interfaces, or micro- and macro-bends in your fiber, can force the cluster to rerun a scenario, and the cascading effect across every device in the cluster multiplies the financial impact on your investment.

Maintaining flexibility in 2024 and beyond

It used to be said that infrastructure refreshes happen every three to five years, but that is no longer true. With GPUs going through multiple iterations within the same calendar year, customers are left with a dilemma: do you run the network you deploy now until the end of its usable life cycle, or do you continually evolve the infrastructure you have?

If you choose to run the network until the end of its usable life, then point-to-point cabling could be sufficient. But if you are planning to continually evolve your network and capitalize on the investment you have already made, a structured approach might be the best fit.

This is not an easy conversation, because there are many variables that should be addressed before the client can make the most informed decision about what is best for them.

A deeper dive into a structured approach

So, let's assume the client has decided a structured approach is the best fit for them. How do you do it?

Fundamentally, you look at the areas that are the most stable and consistent versus the areas that will continually change.

Determining where to delineate the different segments of infrastructure allows you to aggregate and segregate it at multiple points throughout the network. This means you can make changes in certain segments while still utilizing the core and permanent-link infrastructure from the initial deployment.

If we are talking about fiber connectivity like an MPO-8 at the GPU end, that doesn't mean it has to be MPO-8 throughout the entire network. Think about the connection from your WAN to the outside world: it is more than likely not going to be an MPO-8 interface but rather a duplex LC.

That is to say, just because the equipment dictates the endpoint interfaces, it doesn’t mean that it has to be the same interface throughout the network.
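To show how that delineation might look in practice, here is a minimal sketch of a structured channel from a GPU NIC up to a spine, with a rough insertion-loss check. The segment names and loss values are illustrative assumptions; always use your vendor's specifications and the loss budget of your specific optics:

```python
# Hypothetical structured channel from a GPU NIC to a spine switch.
# Loss values are illustrative assumptions; check your vendor's specs.

channel = [
    ("MPO-8 equipment cord (GPU/NIC to enclosure)",   0.35),  # dB per mated MPO pair
    ("MPO cassette / adapter panel at the leaf row",  0.35),
    ("High-strand-count permanent trunk",             0.20),  # fiber attenuation allowance
    ("MPO cassette / adapter panel at the spine row", 0.35),
    ("MPO-8 equipment cord (enclosure to spine)",     0.35),
]

total_loss = sum(loss for _, loss in channel)
budget = 1.9  # assumed maximum channel insertion loss for a short-reach optic, in dB

print(f"Estimated channel loss: {total_loss:.2f} dB (budget {budget} dB)")
print("Within budget" if total_loss <= budget else "Over budget -- redesign the channel")
```

The point is not the specific numbers but the structure: the permanent links stay put, and the equipment cords at either end are the only pieces that change as the gear evolves.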

New small-form-factor, multi-fiber connectors and high-strand-count, small-OD (outside diameter) fiber make this even more exciting.

As an illustration, imagine fitting 3,456 fibers into a cable only slightly larger than the average thumb. Instead of running 432 MPO-8 cables between the spine and super spine, you could carry all of that in one cable less than 1 1/2" in diameter.
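The arithmetic behind that consolidation is simple; here is a quick sketch, assuming eight fibers per MPO-8 run:

```python
# Consolidating individual MPO-8 runs into one high-strand-count trunk.
fibers_per_mpo8 = 8          # an MPO-8 link carries 8 fibers
mpo8_runs = 432              # individual spine-to-super-spine cables
trunk_fibers = 3456          # strand count of the single high-density trunk

assert mpo8_runs * fibers_per_mpo8 == trunk_fibers  # 432 x 8 = 3,456
print(f"{mpo8_runs} MPO-8 runs x {fibers_per_mpo8} fibers = {trunk_fibers} fibers in one trunk")
```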

The benefit of higher-capacity fiber is about more than the cable's outer diameter: it keeps the fibers bundled together and protects them from the stresses of pulling multiple cables over each other. Combine that with some of the higher-density fiber enclosures and you can increase density and eliminate wasted space.

Then, from the spines down, you could utilize a 576-strand cable feeding a 4U, 72-port MPO-8 enclosure mounted above the cabinet positions. That way everything is ready while you wait for your next shipment of GPU servers to arrive, and they can be turned up faster than with a traditional method (the port math is sketched below).

What would have taken a whole cabinet can now be deployed above the cabinet in zero-U space. That is just one example, but the options are endless and depend on the client’s intent and what the facilities are capable of supporting.
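The same kind of math applies to the zero-U example above. A quick sketch, assuming eight fibers per MPO-8 port and, hypothetically, one port per GPU in an eight-GPU server:

```python
# One 576-strand trunk feeding a 4U enclosure with 72 MPO-8 ports (assumed layout).
fibers_per_port = 8
ports_per_enclosure = 72
gpus_per_server = 8          # illustrative assumption: one MPO-8 port per GPU

fibers_needed = ports_per_enclosure * fibers_per_port      # 72 x 8 = 576 strands
servers_served = ports_per_enclosure // gpus_per_server    # 72 / 8 = 9 servers per enclosure

print(f"{ports_per_enclosure} ports x {fibers_per_port} fibers = {fibers_needed} strands")
print(f"Enough pre-staged connectivity for roughly {servers_served} eight-GPU servers")
```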

A note about testing

I previously mentioned testing, and I can't stress this enough. One fact is that factory-terminated cables will never perform better than the day they passed testing at the factory and shipped to the site for installation.

Bouncing around in shipping trucks, the stress of unpacking and placement, routing into the proper housings and cabinets – all of these put the cable and its performance at risk. You must have quality control in receiving, unpacking, and installation, but testing provides the peace of mind that the link will work.

In an InfiniBand network, remember that you are utilizing multiple parallel lanes of light – typically eight fibers per link – working in harmony to get up to those 400G/800G speeds. Adequate cleaning must be part of the testing requirements. Specks of dust, unseen haze, and other contamination can refract and scatter the light. You also have to worry about the reflectance limits of short-reach transceivers. A light source and power meter alone are not sufficient for testing.
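As a rough illustration of why every lane matters, here is the per-lane arithmetic, assuming the commonly used 100 Gb/s per optical lane (check the datasheet for your specific optics):

```python
# Per-lane rates commonly used for these link speeds (assumed 100 Gb/s per lane).
lane_rate_gbps = 100

for lanes, label in [(4, "400G port (4 lanes x 100G)"),
                     (8, "800G port (8 lanes x 100G)")]:
    print(f"{label}: {lanes * lane_rate_gbps} Gb/s total")
    # A single dirty or damaged lane degrades or drops the whole aggregated link.
```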

It is always a shame to see companies make huge investments in these technologies, only to be unable to leverage them to their full potential. Align’s team of experts helps clients avoid and fix those mistakes and realize that potential – just one of the reasons we are a valuable partner for projects like these.