Performance does matter
When speaking about cloud infrastructures, many people think about the number of servers, the amount of storage or the number of virtual machines to be used. But beyond the quantity of hardware you select, the efficiency of the overall solution is the key to a powerful platform. This first article is about why you should understand, check and diagnose the performance of the selected hardware.
Setting up the project
A cloud Infrastructure-as-a-Service project is made of physical servers, networking components and some software (Linux & OpenStack).
The infrastructure is expected to provide Virtual Machines (VMs) that users will use to run their applications. The number of VMs running on each compute server depends on the total workload and the number of compute servers available: if you expect to run 100 VMs on a 4-node infrastructure, every single compute server will host 25 VMs.
Servers are usually configured to handle this workload by choosing the proper processors, the right amount of RAM and the appropriate storage technology (SSD vs HDD, 10Krpm vs 7.2Krpm, 3.5″ vs 2.5″, etc.). They are properly racked, connected to the power supply and networking switches, and you are ready to deploy your operating system on top of them to start your cloud activity.
Is your setup really ready to go into production?
It’s now time to ask yourself this question: “Are all my servers performing the same? Is the level of performance the expected one?”
If you bought those servers together, the usual answer is: “Yes, they are supposed to be identical and to perform the same.” Is that enough to go into production? What would be the consequences of a partially failing component on the global infrastructure?
Until you have checked that your servers perform the same, and as well as expected, it’s pretty risky to put them into production. Many consider that if the operating system installs properly and doesn’t crash in the following minutes or hours, the server is stable enough to go into service. This approach is failure-prone, as it can hide serious issues that will not be detected until people start complaining: “Hey! My VM is really slow. Can you check why?”
A scenario that’s easy to understand is a disk that performs at 10 MBytes/sec instead of 100 MB/sec. The disk is perfectly usable and the system is stable, but it’s 10 times slower than expected. If this under-performance is not detected, VMs that use this storage device will wait a long time for their IOs to complete and will be perceived as laggy. Determining why a VM is laggy means checking many different components and configurations while all of them are being used by many VMs at the same time. Reaching the conclusion that this single disk of a particular compute server is the root cause, because it is under-performing, could be a long and difficult task.
Knowing your hardware is the key element to avoid this situation.
Time to deeply inspect your hardware setup
Once your hardware is properly set up and cabled, it’s really important to take some time to do a complete analysis of your platform’s performance. The following components should be verified:
- CPU computing power
- Memory bandwidth
- Storage IOPS & Bandwidth
- Network bandwidth
The benchmark methodology should be strict, reproducible and eliminate any possible source of doubt.
Benchmark tools should not mix several components
To ensure a good analysis of every single component, it’s important to select a tool that reduces as much as possible the interactions between components during the run. Selecting a CPU benchmark tool that reads data from the disk, stores it in memory and then computes on it doesn’t sound like a good approach: many devices are used during the test and any of them could be a source of under-performance, while we only want to analyze the processor’s performance.
Constant time benchmarking
Another mandatory point is having tools that support time-based benchmarks. It’s a common mistake to use tools that measure how long it takes to process a given amount of data. Benchmark results are usually expressed in ‘unit per time’, like MegaBytes per second or GigaBits per second. If time is not a fixed element, the benchmarks aren’t really comparable: processing 1 GB of data lasts 10 seconds on a system that consumes it at 100 MB/sec, while it takes 100 seconds on another that performs at 10 MB/sec.
Comparing both results means comparing a 10-second run with a 100-second run. This huge difference in running time can hide or reveal unexpected events, like a crontab running in the background. Another annoyance is the unpredictability of the time required to run a particular test on a set of non-similar servers. Fixing the duration of a test answers the question “How much data can I process in this amount of time?” instead of “How much time do I need to process this amount of data?”
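The sketch below illustrates both principles at once, in Python: a single component is under test (the data lives in memory, so the disk is never touched) and the wall-clock duration is fixed. The SHA-256 workload and the 10-second duration are arbitrary choices for the illustration, not a recommendation.

```python
#!/usr/bin/env python3
"""Toy time-based CPU benchmark: it answers "how much data can I hash
in 10 seconds?" instead of "how long does 1 GB take?"."""
import hashlib
import time

DURATION = 10                       # fixed wall-clock duration, in seconds
CHUNK = b"\x00" * (1024 * 1024)     # 1 MiB buffer kept in memory: no disk involved

processed = 0
start = time.monotonic()
while time.monotonic() - start < DURATION:
    hashlib.sha256(CHUNK).digest()  # CPU-bound work only
    processed += len(CHUNK)

elapsed = time.monotonic() - start
print(f"{processed / elapsed / (1024 * 1024):.1f} MB/s over {elapsed:.1f} s")
```

Whatever the speed of the server, the run always lasts about 10 seconds, so two servers can be compared under identical conditions.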
Don’t trust humans
Automation is a key element of a successful benchmark suite. Benchmark tools usually offer many options and usage modes, and selecting or missing a particular option can totally change the meaning of a test.
In some storage testing tools, if you forget to disable the use of the Linux page cache, there is a good chance you are testing your memory more than your disk. If you are not aware of this behaviour, or if you missed the setting, the interpretation of the results could be very misleading.
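For instance, with fio, a widely used storage benchmark, the page cache is only bypassed when the job explicitly asks for it. Below is a hedged sketch of such an invocation from Python; it assumes fio is installed and that /dev/sdb is the otherwise idle device under test (a hypothetical path), and it performs random reads only, so the device content is left untouched.

```python
#!/usr/bin/env python3
"""Sketch: run fio with direct=1 so the Linux page cache is bypassed and
the disk itself is measured, not the RAM. Device path is an example."""
import subprocess

DEVICE = "/dev/sdb"   # hypothetical device under test

cmd = [
    "fio",
    "--name=randread",
    f"--filename={DEVICE}",
    "--rw=randread",                 # random reads: stresses IOPS, leaves data intact
    "--bs=4k",
    "--direct=1",                    # O_DIRECT: do not let the page cache answer
    "--time_based", "--runtime=60",  # constant-time run, as discussed above
    "--ioengine=libaio",
    "--iodepth=32",
]
subprocess.run(cmd, check=True)
```

Dropping --direct=1 on a server with plenty of free RAM can easily report numbers several times higher than what the disk can actually sustain.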
Humans are weak machines: even if you read something wrong, your brain will make you read what you expected to read rather than the mistake. A great example of this effect is shown in the following images.
To avoid any human mistake, having a tool that automatically runs a defined set of commands is an important protection against any misuse of the tools that would lead to wrong results.
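A minimal sketch of that idea, assuming sysbench 1.x and fio are available on the system; the command lines, device path and output file name are illustrative only, the point being that the list is written once, reviewed once, and then replayed identically on every server.

```python
#!/usr/bin/env python3
"""Sketch: replay a fixed, reviewed list of benchmark commands and store
their raw output per host, so no option can be forgotten or mistyped."""
import json
import socket
import subprocess

COMMANDS = {
    "cpu":    ["sysbench", "cpu", "--time=60", "run"],
    "memory": ["sysbench", "memory", "--time=60", "run"],
    "disk":   ["fio", "--name=seqread", "--filename=/dev/sdb",  # example device
               "--rw=read", "--bs=1M", "--direct=1",
               "--time_based", "--runtime=60"],
}

results = {}
for name, cmd in COMMANDS.items():
    proc = subprocess.run(cmd, capture_output=True, text=True)
    results[name] = {"cmd": cmd, "rc": proc.returncode, "stdout": proc.stdout}

with open(f"results-{socket.gethostname()}.json", "w") as fh:
    json.dump(results, fh, indent=2)
```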
Kill any source of doubts on software
Mastering your software configuration is required to get consistency over time and across systems. Estimating the performance of a given piece of hardware requires the benchmark tool to be the only user of the resource under test: the more processes use that resource at the same time as the benchmark, the less reliable the result will be. It’s pretty obvious that performing rsync/logrotate/database IOs while trying to estimate a disk’s performance isn’t a good way to get a coherent result. Those examples are obvious, but at the time you run your benchmark, it can be quite complicated to be 100% sure that not a single unexpected task ran. On an infrastructure already in production, disabling all possible sources of interference can become a complex task.
The way to go is to embed all the required tools and automation scripts into your own live operating system. The easiest way to get a clean operating system for a benchmark is to generate one with the minimum of dependencies: create a minimal system (like debootstrap on Debian), install the benchmarking tools you need, no graphical server and, for sure, no crontab at all. Once this minimal system is set up, create a ramfs from it and boot on it with your favourite bootloader (pxelinux, extlinux, grub, …). These steps can be done automatically by projects like edeploy.
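Here is a hedged sketch of those steps on a Debian build host, driven from Python; the target directory, suite, mirror and package list are examples only, and the resulting archive still needs a kernel and an /init entry point before pxelinux can boot it, which is precisely the kind of plumbing a project like edeploy takes care of.

```python
#!/usr/bin/env python3
"""Sketch: build a minimal, crontab-free root filesystem with debootstrap,
then pack it as a gzipped cpio archive usable as an initramfs over PXE."""
import subprocess

TARGET = "/srv/benchmark-rootfs"        # example build directory
IMAGE = "/srv/tftp/benchmark.cpio.gz"   # example location served by the PXE server

# 1. Minimal Debian with only the benchmark tools: no cron, no graphical server.
subprocess.run(["debootstrap", "--variant=minbase",
                "--include=fio,sysbench",
                "stable", TARGET, "http://deb.debian.org/debian"], check=True)

# 2. Pack the tree as a gzipped cpio (newc) archive, the initramfs format.
with open(IMAGE, "wb") as img:
    find = subprocess.Popen(["find", ".", "-print0"], cwd=TARGET,
                            stdout=subprocess.PIPE)
    cpio = subprocess.Popen(["cpio", "--null", "-o", "-H", "newc"],
                            cwd=TARGET, stdin=find.stdout,
                            stdout=subprocess.PIPE)
    subprocess.run(["gzip", "-9"], stdin=cpio.stdout, stdout=img, check=True)
```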
Having an under-control operating system that stays the same across servers and over time removes any possible doubt about a background process running at the same time as your benchmark. It also becomes possible to run the benchmark in a controlled fashion on an already installed server: if you have any doubt about a particular piece of hardware, reboot the server into this under-control operating system, perform the benchmark and voilà.
Results should not be stored without context
Keeping the hardware description/configuration attached to your performance results is an efficient way to “remember” what the context was. It can be used to determine that a particular under-performance is linked to a hardware or configuration change. The more details you have about your hardware, the easier it will be to establish the link between a change and a performance regression or improvement.
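In practice this can be as simple as archiving the raw output of the standard Linux inventory tools next to the benchmark results, as in the sketch below; the tool list and the file name are only examples, and dmidecode typically requires root.

```python
#!/usr/bin/env python3
"""Sketch: capture a hardware snapshot (CPU, RAM, disks, PCI devices) and
store it alongside the benchmark results so every number keeps its context."""
import json
import socket
import subprocess
import time

INVENTORY = {
    "dmidecode": ["dmidecode"],              # BIOS, DIMMs, CPU sockets
    "lsblk":     ["lsblk", "-O", "--json"],  # disks, sizes and models
    "lspci":     ["lspci"],                  # NICs, RAID/HBA controllers
}

context = {"host": socket.gethostname(),
           "date": time.strftime("%Y-%m-%d %H:%M:%S")}
for name, cmd in INVENTORY.items():
    try:
        context[name] = subprocess.run(cmd, capture_output=True,
                                       text=True).stdout
    except FileNotFoundError:
        context[name] = None                 # tool missing on this image

with open(f"context-{context['host']}.json", "w") as fh:
    json.dump(context, fh, indent=2)
```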
To be continued
My next blog article will present a tool that implements all these ideas and concepts.