When setting up a cloud infrastructure, the network interconnect is one of the key elements. This is especially true for the storage part of it, such as Ceph.
1Gb Ethernet has been a commodity setup for a long time, while 10Gb Ethernet remains an option. It's now time to think outside the box. What if 10Gb isn't enough? Can I afford another setup to unleash the full potential of a server?
This article will not be about using Infiniband with OpenStack or from the virtual machines. It will be about setting up Infiniband on 3 nodes and estimating what we can achieve with it. We'll stay focused on the interconnect performance and keep as close as possible to the hardware.
What setup to start with?
In our case, we are running 3 HP DL360 Gen8 servers, each equipped with a dual-port Mellanox ConnectX-3 card (MCX354A-FCBT) connected to a Mellanox SX6018 switch. This card features dual FDR 56Gb/s Infiniband or 40/56GbE ports. It's also important to note that those cards are connected on a PCI Express 3.0 x8 bus.
All servers are running Ubuntu 12.04. Then it's up to you to choose the Infiniband drivers you want to use. Two options exist:
- Using built in drivers from Ubuntu
- Using Mellanox drivers
Using built-in drivers
The first solution is pretty straightforward to set up, as only a small set of packages needs to be installed:
apt-get install infiniband-diags ibutils ibverbs-utils qlvnictools srptools sdpnetstat rds-tools rdmacm-utils perftest libmthca1 libmlx4-1 libipathverbs1
Using Mellanox drivers
Mellanox provides on their download page a series of tarballs featuring pre-built packages for the main Linux distributions such as RedHat, SuSE, Ubuntu and Fedora. These pre-built packages have a few advantages:
- Up-to-date hardware support
- Updated firmware (firmware can also be downloaded from this page)
- Latest features
- Kernel modules compiled via DKMS
In the Ubuntu 12.04 case, the main drawback is the need to downgrade the kernel from the 3.8 series to the 3.5 one. It sounds like this will be fixed in a couple of months.
The installation process is pretty easy:
./mlnxofedinstall --enable-sriov --force-fw-update
As you can notice, this script will flash the latest firmware provided in this set of packages (2.30.3110 in our case) with SR-IOV enabled. SR-IOV will not be discussed here; to learn more about it, you can watch this video or this presentation.
Note that under Ubuntu 12.04, you need the Mellanox drivers to get proper support for SR-IOV.
Configuring the adapter
Depending on your cables, your adapter can work in Ethernet or Infiniband mode. This setting is per-port and can be tuned via the port_type_array option of the mlx4_core kernel module. In our case, we use the Infiniband (1) port type.
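As a sketch, this choice can be made persistent with a modprobe options file. The file name below is an assumption, and the two values match our dual-port card (1 selects Infiniband, 2 selects Ethernet):

```
# /etc/modprobe.d/mlx4_core.conf (hypothetical file name)
# port_type_array: one value per port, 1 = Infiniband, 2 = Ethernet
options mlx4_core port_type_array=1,1
```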
This quick benchmark will be done on top of IP, so we need to activate the IPoIB feature by setting IPOIB_LOAD=yes in /etc/infiniband/openib.conf and restarting the openibd service.
With SR-IOV disabled (the default), two network devices appear: one per physical port. With a dual-port card, the ib0 and ib1 interfaces are available and can be used like any other network device.
Infiniband adapters can run in two different modes: connected or datagram.
To choose one or the other, you can do the following:
mode=datagram; ifconfig ib0 down; echo $mode > /sys/class/net/ib0/mode ; ifconfig ib0 up
To make this choice persistent, set the proper value of SET_IPOIB_CM in /etc/infiniband/openib.conf.
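For example (a sketch of the relevant line only; the rest of openib.conf is left untouched):

```
# excerpt from /etc/infiniband/openib.conf
SET_IPOIB_CM=yes    # yes = connected mode, no = datagram mode
```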
Using datagram, the MTU is set to 4K, while connected mode uses a 64K MTU. Note that in Infiniband mode, the MTU cannot be tuned via ethtool.
Running the benchmark
In this quick tour, we'd like to understand how much data a server can handle when using the IP-over-IB feature. Infiniband provides a verbs API for low-level I/O, but very few applications are able to use it. Through the IP interface, every IP-capable application benefits from Infiniband, even if performance is degraded compared to what the verbs API is capable of.
All benchmarks were run using iperf3 in TCP, bidirectional mode, with 30-second runs.
The benchmark procedure is described below using a very simple meta-language to increase precision and reduce verbosity.
To estimate the bandwidth one server can achieve, the test is run as follows:
for stream_nb in 1 2 3 4 6 8 16 32; do
benchmark server2 with $stream_nb streams
done
Then we run the same test with two servers as clients:
for stream_nb in 1 2 3 4 8 16 32; do
benchmark server2 with $stream_nb streams &
benchmark server3 with $stream_nb streams
done
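As an illustration, here is a sketch of how the two-client meta-language above could expand into actual iperf3 invocations. The exact flags are an assumption: -P sets the number of parallel streams, -t 30 the duration in seconds, and an iperf3 server is assumed to be already listening on server2 and server3. The sketch only prints the commands instead of running them:

```shell
#!/bin/sh
# Print the iperf3 commands the two-client meta-language stands for.
gen_commands() {
    for stream_nb in 1 2 3 4 8 16 32; do
        # '&' backgrounds the first client so both run concurrently
        echo "iperf3 -c server2 -P $stream_nb -t 30 &"
        echo "iperf3 -c server3 -P $stream_nb -t 30"
    done
}
gen_commands
```

Removing the echo quoting (or piping the output into a shell) turns the dry run into the real benchmark.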
Those two tests are run in both connected and datagram modes:
Datagram vs Connected
In both configurations (datagram or connected):
- the bandwidth handled by the host is divided into two equal parts: sending and receiving data. As a result, some plots overlap: only one of the two traces (Recv or Sent) is visible.
- the maximum bandwidth is reached by using two client servers
- using two clients provides a 2x performance increase
The connected mode reports 56Gbit/s of traffic up to 8 simultaneous streams. This amount of data is the maximum we can expect from this setup. In this configuration, the Infiniband setup provides 2.5x more bandwidth than a 10Gb setup can achieve.
Latency is also one of the key strengths of Infiniband. In these short tests, measured TCP latencies were between 13 and 17µs, versus 40 to 50µs on Ethernet.
This quick tour of Infiniband demonstrated that:
- Installing and configuring Infiniband is very easy
- In bidirectional mode, bandwidth can be 2.5x better than a 10GbE setup
- TCP latency is divided by almost 4
Infiniband provides a high-bandwidth, low-latency interconnect that can be used as a backend for IP-based applications. Distributed storage solutions or demanding applications could benefit from such a setup.