As a follow-up to the first article of this how-to series, this article will be about selecting the proper tests, aggregating the results and performing smoke tests.
Selecting the proper test patterns
Each benchmarking tool has its own features and options. It is important to review them and choose the appropriate ones so that the tool performs the test you want. It is now time to choose the proper test patterns for each of the selected tools.
CPU
The CPU part of Sysbench returns the number of loops per second executed while computing prime numbers with 64-bit integers. It’s a fairly simple computation task for a modern processor, but it is enough to estimate whether a given series of processors delivers the expected performance. This value can be determined in a couple of seconds; it is not necessary to run the test for a long time.
In AHC, each core is tested individually for 5 seconds. All cores are then tested simultaneously to estimate how the processor’s computing power scales.
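To illustrate what such a test measures, here is a minimal Python sketch. It is not Sysbench’s or AHC’s actual code: it simply counts how many prime-search passes complete within a fixed time window. The function names, the `max_prime` bound and the default duration are assumptions chosen for the example.

```python
import time


def count_primes(max_prime):
    """Count the primes up to max_prime using simple trial division."""
    found = 0
    for candidate in range(2, max_prime + 1):
        divisor = 2
        is_prime = True
        while divisor * divisor <= candidate:
            if candidate % divisor == 0:
                is_prime = False
                break
            divisor += 1
        if is_prime:
            found += 1
    return found


def loops_per_second(duration=5.0, max_prime=2000):
    """Run count_primes repeatedly for `duration` seconds; return passes per second."""
    loops = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        count_primes(max_prime)
        loops += 1
    elapsed = time.perf_counter() - start
    return loops / elapsed


if __name__ == "__main__":
    print("loops per second:", round(loops_per_second(duration=1.0)))
```

Running one such loop on each core in turn, then one per core at the same time, would mirror the per-core and all-cores passes described above.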
Memory
The memory benchmark in Sysbench can either read or write blocks of a given size. In AHC, we chose the write test to avoid any cache effect and to check how the system behaves for various block sizes: 1KB, 4KB, 1MB, 16MB, 128MB, 1GB, 2GB. The bigger the block, the closer you get to the raw limits of the memory channels associated with a given processor. The bandwidth for a given block size can be determined in a couple of seconds.
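The idea of a timed write test per block size can be sketched in Python as below. This is only an illustration under stated assumptions, not how Sysbench or AHC measure memory: the function name and the `BLOCK_SIZES` table are hypothetical, the list is trimmed to the smaller sizes, and interpreter overhead will dominate the result for small blocks.

```python
import time


def write_bandwidth(block_size, duration=1.0):
    """Repeatedly overwrite a buffer of block_size bytes; return bytes written per second."""
    source = bytes(block_size)        # zero-filled block to copy from
    target = bytearray(block_size)    # destination buffer being overwritten
    written = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        target[:] = source            # one full-block write
        written += block_size
    elapsed = time.perf_counter() - start
    return written / elapsed


# Subset of the block sizes listed above, kept small for the example.
BLOCK_SIZES = {
    "1KB": 1024,
    "4KB": 4 * 1024,
    "1MB": 1024 ** 2,
    "16MB": 16 * 1024 ** 2,
}

if __name__ == "__main__":
    for label, size in BLOCK_SIZES.items():
        rate = write_bandwidth(size, duration=0.5)
        print(label, round(rate / 1024 ** 2), "MiB/s")
```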
Storage
For storage, two main metrics are used to characterize a block storage device: the number of IOPS the device can handle and the bandwidth it delivers. The first is measured by issuing random IOs of 4KB, while the second uses contiguous IOs of 1MB.
On rotational disks, random 4KB IOs force the disk to seek a lot and produce the worst possible workload, while 1MB IOs make seeks much scarcer and provide the highest possible bandwidth.
On non-rotational devices like SSDs, the random IO pattern does not hurt the disk’s performance, as serving two non-contiguous IOs carries no seek penalty. The 1MB pattern might be less efficient on some disks, since accessing a set of contiguous blocks in the virtualized address space doesn’t mean the SSD stored all the data in adjacent cells. Depending on the SSD technology (SLC vs MLC) and the firmware implementation, the performance could be equal to or worse than with the random pattern.
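The two access patterns differ only in where each IO lands on the device. The following Python sketch generates the offsets for each pattern and converts an IOPS figure into the bandwidth it implies. The helper names are hypothetical, and no real device is read or written.

```python
import random

KIB = 1024
MIB = 1024 * KIB


def random_offsets(device_size, block_size, count, seed=0):
    """Block-aligned offsets picked at random across the device (IOPS pattern)."""
    rng = random.Random(seed)
    blocks = device_size // block_size
    return [rng.randrange(blocks) * block_size for _ in range(count)]


def sequential_offsets(block_size, count, start=0):
    """Contiguous offsets, each block directly following the previous one (bandwidth pattern)."""
    return [start + i * block_size for i in range(count)]


def bandwidth_from_iops(iops, block_size):
    """Throughput in bytes per second implied by an IOPS figure at a given block size."""
    return iops * block_size
```

As a purely hypothetical figure, a disk sustaining 200 random 4KB IOs per second would move only `bandwidth_from_iops(200, 4 * KIB)` bytes per second, which is 800KiB/s. This is why the random 4KB pattern is reported as IOPS and the 1MB pattern as bandwidth.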
Networking
Testing the network is a much more complex task. The CPU, memory and storage tests run inside a single physical server and don’t depend on anything else. Checking the network performance requires the cooperation of several servers and of the network switch in between. The switch is the hidden factor of the stress test here: it is out of our control during the test, as we don’t run any custom software on it. This means that a change in its configuration can dramatically impact the results.
That raises the question: “How does this switch perform?” Making a 1-to-1 connection between two servers through this switch isn’t enough to understand whether the switch is a limiting component.
To get a clear picture of the network performance, it’s important to synchronize a group of servers so that they simultaneously stress all links and switching rules between them. On a 4-server setup, each node contacts the 3 other nodes. The global bandwidth of each server is computed by adding up the 3 streams it generates.
Checking the standard deviation across all 12 streams is a key way to check the fairness of the switch. All streams should perform almost equally; if they don’t, the switch is not distributing performance fairly. Consider a server that delivers 3 streams of 3Gbit/s (9Gbit/s in total) versus another that delivers 8Gbit/s + 0.5Gbit/s + 0.5Gbit/s = 9Gbit/s: the first one is very equitable, while the second one is unfair to two of its streams. The impact of the latter case on a real infrastructure would be dramatic, as some servers would be underprivileged, leading to a very unbalanced distributed load. Some services or VMs would be much more impacted than others.
The total switching bandwidth generated by this test is equal to the sum of all streams generated during this benchmark session.
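The arithmetic behind this fairness check can be sketched in a few lines of Python. The function names are hypothetical, and the population standard deviation is used here as one reasonable choice; the source does not say which variant AHC computes.

```python
from statistics import mean, pstdev


def server_bandwidth(streams):
    """Global bandwidth of one server: the sum of the streams it generates."""
    return sum(streams)


def fairness(streams):
    """Return (mean, standard deviation) of a list of stream rates."""
    return mean(streams), pstdev(streams)


def total_switching_bandwidth(streams_per_server):
    """Sum of every stream generated by every server during the session."""
    return sum(sum(streams) for streams in streams_per_server)


# The two servers from the example above, in Gbit/s.
fair = [3.0, 3.0, 3.0]
unfair = [8.0, 0.5, 0.5]

if __name__ == "__main__":
    for name, streams in (("fair", fair), ("unfair", unfair)):
        avg, dev = fairness(streams)
        print(name, "total:", server_bandwidth(streams), "mean:", avg, "stdev:", round(dev, 2))
```

Both servers report the same 9Gbit/s total, so only the deviation between streams exposes the unbalanced case.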
In AHC, this methodology is implemented using netperf as follows:
- nodes discover their peers by using multicast until the expected number of servers is found (4 in our previous example): this saves us from manually defining which hosts should participate
- once all nodes are discovered, they establish the node list, the IPs to be used and the ports to open for each peer server
- all servers open the expected ports by starting the netperf server
- the leader sends the start signal to all nodes for a time-based benchmark
- all nodes send TCP traffic to the other servers (3 in our previous example)
- results are generated and added to the other benchmark results
- the network test is over
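The planning step, where every node learns which peers to contact and on which port, can be sketched as follows. This is a hypothetical illustration: the `plan_streams` name, the `base_port` value and the one-port-per-stream allocation are assumptions, not AHC’s actual scheme, and no netperf process is started.

```python
from itertools import permutations


def plan_streams(nodes, base_port=12000):
    """Return one (source, destination, port) entry per directed pair of nodes.

    base_port and the one-port-per-stream allocation are assumptions
    made for this example.
    """
    plan = []
    for index, (src, dst) in enumerate(permutations(nodes, 2)):
        plan.append((src, dst, base_port + index))
    return plan


if __name__ == "__main__":
    # Example addresses, chosen arbitrarily for the illustration.
    nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
    for src, dst, port in plan_streams(nodes):
        print(src, "->", dst, "port", port)
```

With 4 nodes the plan contains 4 × 3 = 12 streams, matching the 12 streams whose deviation is checked above.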
Aggregating results and saving them
Once all our performance tests are completed, it’s time to aggregate them in a single format and upload them to a central location. To ease the later analysis of servers, it is convenient to have our own format and use it for all the information we have to report. Every tool used for benchmarking has its own output format, which could be plain text, CSV or tool-specific row/column data. Having a simple container where we store our results in a single format eases the later work of the analysis tool. It also makes the uploaded result smaller.
In AHC, we use the format defined by eDeploy’s bootstrap: a list of tuples where each element is either a hardware description or a performance result.
The output format looks like the following:
[ ('cpu', 'physical_0', 'physid', '400'),
('cpu', 'physical_0', 'product', 'Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz'),
('cpu', 'physical_0', 'vendor', 'Intel Corp.'),
('cpu', 'physical_0', 'frequency', '2000000000'),
('cpu', 'physical_0', 'clock', '100000000'),
('cpu', 'physical_0', 'cores', '8'),
('cpu', 'physical_0', 'enabled_cores', '8'),
('cpu', 'physical_0', 'threads', '16'),
('cpu', 'physical_1', 'physid', '401'),
('cpu', 'physical_1', 'product', 'Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz'),
('cpu', 'physical_1', 'vendor', 'Intel Corp.'),
('cpu', 'physical_1', 'frequency', '2000000000'),
('cpu', 'physical_1', 'clock', '100000000'),
('cpu', 'physical_1', 'cores', '8'),
('cpu', 'physical_1', 'enabled_cores', '8'),
('cpu', 'physical_1', 'threads', '16'),
('cpu', 'physical', 'number', '2'),
('cpu', 'logical', 'number', '32'),
('system', 'ipmi', 'channel', '2'),
('cpu', 'logical_0', 'bogomips', '4000.26'),
('cpu', 'logical_0', 'cache_size', '20480KB'),
('cpu', 'logical_0', 'loops_per_sec', '452'),
('cpu', 'logical_1', 'bogomips', '3999.88'),
('cpu', 'logical_1', 'cache_size', '20480KB'),
('cpu', 'logical_1', 'loops_per_sec', '454'),
('cpu', 'logical_2', 'bogomips', '3999.88'),
('cpu', 'logical_2', 'cache_size', '20480KB'),
('cpu', 'logical_2', 'loops_per_sec', '455'),
...
]
The first part of this file describes the processors’ features, while the second part reports the performance of each logical core. The complete file contains many more descriptions and performance results, covering each detected and tested component.
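Because the container is a plain list of 4-element tuples, it is easy to load and query. Here is a minimal Python sketch, assuming the file uses plain ASCII quotes so that it is a valid Python literal. The `load_results` and `select` helpers are hypothetical names, and the sample is a shortened excerpt of the listing above.

```python
import ast

# Shortened excerpt of the listing above, used as sample input.
SAMPLE = """[
 ('cpu', 'physical_0', 'cores', '8'),
 ('cpu', 'physical', 'number', '2'),
 ('cpu', 'logical_0', 'loops_per_sec', '452'),
 ('cpu', 'logical_1', 'loops_per_sec', '454'),
 ('cpu', 'logical_2', 'loops_per_sec', '455'),
]"""


def load_results(text):
    """Parse the list-of-tuples text into Python objects without executing it."""
    return ast.literal_eval(text)


def select(results, category, metric):
    """Map each item name to its value for one category/metric pair."""
    return {
        item: value
        for cat, item, key, value in results
        if cat == category and key == metric
    }


results = load_results(SAMPLE)
loops = select(results, "cpu", "loops_per_sec")

if __name__ == "__main__":
    for core, value in sorted(loops.items()):
        print(core, value)
```

Values stay as strings, as in the file, so an analysis tool would convert them with `int()` or `float()` where needed.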
Each file is named after the product name, vendor and serial number, and uploaded to a central server. It is very handy to define a tag that keeps the results of a single session grouped together. On the server side, the resulting file name looks like “ProLiant-DL360pGen8654081B21-HP-CZ3323FDVH-d8-9d-67-1b-07-e4.hw”.
In AHC, the SERV= parameter defines the IP/name of the server to contact for the upload, and the SESSION= parameter defines the tag used to store all results from a single run together.
Performing stress tests before going into production
Once raw performance has been verified and all components have been properly inspected to get a precise definition of their respective capabilities, it is time to see if the hardware is ready to go into production. It’s a well-known phenomenon that some weak hardware dies in the earliest days or weeks of its life. It is always very frustrating to set up a server, configure it and put it in production, only to have to do maintenance on it so soon.
A way to trigger these early failures before putting a server in production is to massively stress all its components for a given time, a practice known as burn-in testing. As we now have benchmarking tools that can stress each component individually, it is possible to start them all together to stress all the devices at the same time for a given duration.
The induced load should generate some preliminary wear, heat, power peaks and, if rotational disks are present, vibrations. After 48 or 72 hours of this treatment, if some weak hardware is present, chances are that it will fail or perform badly within that time frame.
Burn-in tests are implemented in AHC as follows:
- a first benchmark is run on every component and the results are uploaded to the server
- all components are loaded simultaneously by using the SMOKE=<x> variable: the server is stressed for <x> minutes
- a second benchmark is run on every component and the results are uploaded to the server
- if a component, such as a storage block device, has died, it will not appear in the second benchmark
- if a component is degraded, such as a slow but working block device, the performance gap between the first and the second benchmark will reveal the issue
- if no failure or performance degradation can be found, the server is ready to go into production
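The before/after comparison in the last three steps can be sketched in Python as below. This is a hypothetical illustration, not AHC’s analysis code: the function names, the 10% `tolerance` threshold, the component names and the scores are all assumptions, and scores are treated as higher-is-better.

```python
def compare_runs(before, after, tolerance=0.10):
    """Compare two {component: score} dicts taken before and after the burn-in.

    Returns (missing, degraded): components absent from the second run, and
    components whose score dropped by more than `tolerance` (a fraction).
    Scores are assumed to be higher-is-better.
    """
    missing = sorted(name for name in before if name not in after)
    degraded = sorted(
        name
        for name, score in before.items()
        if name in after and after[name] < score * (1 - tolerance)
    )
    return missing, degraded


def ready_for_production(before, after, tolerance=0.10):
    """True when no component disappeared or degraded between the two runs."""
    missing, degraded = compare_runs(before, after, tolerance)
    return not missing and not degraded


if __name__ == "__main__":
    # Made-up scores for the example.
    first = {"sda": 150.0, "sdb": 148.0, "cpu": 452.0}
    second = {"sda": 149.0, "cpu": 300.0}
    print(compare_runs(first, second))
```

The right tolerance depends on how noisy each benchmark is; a value that is too tight would flag healthy servers.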
Summary
By implementing the key elements needed for a consistent benchmark, we now have an efficient test series that can perform multiple CPU, memory, storage and networking benchmarks, with an output that is very easy to parse for later analysis.
The resulting tool can be shared across servers running various Linux distributions and used immediately, any time we want to perform a differential analysis to understand which component could be causing under-performance. It is also useful to ensure that servers about to join a farm will perform as well as the others, avoiding the introduction of weak elements into the infrastructure.
Moreover, the same tool can be used to perform burn-in load tests to avoid premature failure of weak devices, which would result in maintenance costs for the production team.
Everything presented in this article is implemented in AHC and available on eDeploy’s GitHub.
In the next blog posts on this subject, we will explore:
- Analyzing a series of benchmark results
- How to use these benchmark tools to estimate virtual machines’ performance
I can’t wait for the remaining part, starting with the analysis 🙂
Any update on part 3?
This is insanely useful
Any link for part 3?!
I’m sorry I ran out of time to write it. I’ll save some time to make it.