How to determine which configuration performs better
It is said that the definition of insanity is doing the same thing over and over and expecting a different result. But as any performance engineer would tell you, running the same workload again and again does indeed yield different performance results every time. This is because servers and applications are very complex and inherently noisy.
In the journey to discover well-performing server settings, performance engineers experiment with various configurations of the different tunables. In a world without performance variability, each configuration could be tested once and an arithmetic comparison between the results would then be used to determine which configuration performs better. Real-world scenarios, however, are full of unknown factors and variability, so comparing performance between configurations is no mean feat.
Being able to correctly identify when a configuration performs better than another is crucial for performance tuning, and if variability is not properly taken into account, incorrect tuning decisions may follow.
Variability in Mathematical Terms
The standard deviation of the performance measurements (or samples, mathematically speaking) can be used to characterize the variability. A lower standard deviation is preferable because it implies lower variability of the samples, making server performance more predictable. Sometimes, however, the standard deviation is high and there is little a performance engineer can do to lower it without significantly throttling server performance. In other words, performance engineers need to work within the limits of inherently noisy servers and applications.
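As a minimal sketch of how these quantities can be computed, the snippet below summarizes a set of repeated measurements using only Python's standard library. The throughput numbers are made up for illustration, and the helper name `summarize` is hypothetical.

```python
import statistics

def summarize(samples):
    """Return (mean, sample standard deviation, coefficient of variation)."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1 denominator)
    cv = stdev / mean                  # dimensionless; assumes a nonzero mean
    return mean, stdev, cv

# Hypothetical throughput measurements (tasks/sec) from five runs of one configuration.
throughput = [198.0, 205.0, 201.0, 193.0, 203.0]
mean, stdev, cv = summarize(throughput)
print(f"mean={mean:.1f} stdev={stdev:.2f} cv={cv:.3f}")
```

Dividing the standard deviation by the mean gives a unitless figure, which makes it easier to compare the noise of configurations whose means differ.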
Usually, performance engineers deal with variability by running the same workload a fixed number of times and measuring the result of each run. In many cases, the average (mean) of the measurements (samples) is used to represent the performance of a tested configuration. In other cases, especially when optimizing workloads with short requests such as web servers, a certain percentile of the samples is used to represent performance. Performance engineers then use this measure of performance to tell which configuration performs better. Intuitively, more iterations will result in a more representative performance metric (be it the mean or a percentile of the samples).
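To illustrate the difference between the two metrics, here is a small sketch that computes both the mean and a high percentile of some request latencies. The latency values are invented, the choice of the 95th percentile is an assumption, and `percentile` is a hypothetical helper built on `statistics.quantiles`.

```python
import statistics

def percentile(samples, p):
    """p-th percentile (1 <= p <= 99), using statistics.quantiles' default method."""
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    return statistics.quantiles(samples, n=100)[p - 1]

# Hypothetical request latencies in milliseconds; two requests hit a slow path.
latencies_ms = [10.2, 10.5, 9.8, 10.1, 10.4, 10.0, 9.9, 10.3, 10.6, 48.0,
                10.2, 10.1, 9.7, 10.5, 10.0, 10.3, 52.5, 10.1, 9.9, 10.4]

print("mean  :", round(statistics.mean(latencies_ms), 2))
print("median:", round(statistics.median(latencies_ms), 2))
print("p95   :", round(percentile(latencies_ms, 95), 2))
```

The mean blends the rare slow requests into a single figure, while the percentile reports them directly, which may be why tail percentiles are popular for latency-sensitive workloads.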
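This intuition is backed by theory: by running the workload again and again, performance engineers lower the standard deviation of the sample mean (also known as the standard error). For independent samples, it equals the standard deviation of the samples divided by the square root of the number of samples, so quadrupling the number of runs roughly halves the noise in the mean. It is therefore preferable to have many measurements of each configuration in order to reduce the “noise”.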
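The effect of the number of samples can be sketched in a few lines. The function names are hypothetical, the standard deviation of 20 tasks/sec is an assumed value, and the formula assumes the runs are independent of each other.

```python
import math
import statistics

def standard_error_of_mean(sigma, n):
    """Standard deviation of the mean of n independent samples with std dev sigma."""
    return sigma / math.sqrt(n)

def sample_standard_error(samples):
    """The same quantity, estimated from the measurements themselves."""
    return statistics.stdev(samples) / math.sqrt(len(samples))

# Assumed per-run standard deviation of 20 tasks/sec.
for n in (1, 4, 16, 64):
    print(n, standard_error_of_mean(20.0, n))
```

Each fourfold increase in the number of runs halves the standard error, which is why the last bit of accuracy is so expensive to obtain.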
That being said, measuring performance under a specific configuration can take a lot of time, depending on the workload being tested. Sometimes a single run can take hours. So even though performing more measurements increases accuracy, doing so is still a tedious and time-consuming task.
When a Constant Number of Measurements is Not Enough
Even when performing a constant number of measurements per configuration, it may not be enough to just compare the means or percentiles and choose the configuration with the better figure. For example, imagine comparing two different configurations, A and B. If B’s mean is 10% better than A’s mean, does this imply that we should choose B? What if the samples swing by 1%? 10%? 20%? You get the picture.
To complicate things further, the statistical properties of the noise depend on the actual configuration being tested. For example, variability is higher when simultaneous multithreading (SMT) is enabled than when it is disabled. So besides considering the mean, performance engineers need to consider the statistical properties of the samples in each configuration as well.
In the example below, B’s mean throughput (220 tasks/sec) is 10% higher than A’s mean throughput (200 tasks/sec), and the measurements are pretty consistent, so it’s easy to choose B (red) over A (orange):
The coefficient of variation of B (its standard deviation divided by its mean) is relatively low (0.09) in the above example, but in a scenario where it is higher (0.2 as in the example below), it becomes much harder to tell which configuration is better. Moreover, in many real-world systems, high variability is unacceptable, even if on average the performance is better.
The thing is, it’s difficult to tell which configuration performs better without taking a scientific approach to performance tuning and considering the variability inherent to these complex systems.
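One common way to put numbers on this is to express the difference between the two means in units of its own standard error, which is Welch's t statistic. The sketch below is a generic statistical illustration rather than the method of any particular tool: the sample values are invented, and the threshold of 3 standard errors is an arbitrary assumption. A rigorous test would compare the statistic against the t distribution, for example with SciPy's `ttest_ind(a, b, equal_var=False)`.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for mean(b) - mean(a), allowing unequal variances."""
    var_of_mean_a = statistics.variance(a) / len(a)
    var_of_mean_b = statistics.variance(b) / len(b)
    diff = statistics.mean(b) - statistics.mean(a)
    return diff / math.sqrt(var_of_mean_a + var_of_mean_b)

def clearly_better(a, b, threshold=3.0):
    """True when B's mean beats A's by more than `threshold` standard errors."""
    return welch_t(a, b) > threshold

# Hypothetical throughput samples (tasks/sec); higher is better.
config_a = [198.0, 205.0, 201.0, 193.0, 203.0]  # mean 200
quiet_b = [221.0, 218.0, 224.0, 216.0, 221.0]   # mean 220, low variability
noisy_b = [262.0, 183.0, 241.0, 176.0, 238.0]   # mean 220, high variability

print(clearly_better(config_a, quiet_b))
print(clearly_better(config_a, noisy_b))
```

Both versions of B have the same mean, 10% above A, yet only the quiet one stands out from the noise with this few samples.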
The Scientific Way
This is where a tool like the Concertio Optimizer Studio comes in handy. It uses a scientific approach to determine which configurations perform better by automatically considering the statistical properties of the measurements. The tool measures a minimum number of samples per configuration, and then continues to run the workload until the variability of the sample mean is acceptable (below a threshold). When comparing configurations, Optimizer Studio considers the coefficients of variation of the means, and makes sure to choose only configurations that really perform better, “above the noise level”. Sometimes, the variability introduced by certain configurations is too high to be acceptable in a production system, and the tool simply moves on to experiment with other configurations.
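The general idea of sampling until the mean is stable can be sketched as follows. This is not Optimizer Studio's actual implementation, which is not described here; the function name, the parameter names, the default limits and the simulated workload are all assumptions made for illustration.

```python
import math
import random
import statistics

def measure_until_stable(run_workload, min_samples=5, max_samples=50,
                         max_cv_of_mean=0.01):
    """Sample until the standard error of the mean, relative to the mean, is small.

    Returns (samples, converged). Needs min_samples >= 2 and a nonzero mean.
    """
    samples = [run_workload() for _ in range(min_samples)]
    while True:
        mean = statistics.mean(samples)
        std_err = statistics.stdev(samples) / math.sqrt(len(samples))
        if std_err / abs(mean) <= max_cv_of_mean:
            return samples, True
        if len(samples) >= max_samples:
            return samples, False  # still too noisy; the caller may skip this config
        samples.append(run_workload())

# Simulated workload (assumption): about 200 tasks/sec with a std dev of about 10.
rng = random.Random(1)
samples, converged = measure_until_stable(lambda: rng.gauss(200.0, 10.0))
print(len(samples), converged)
```

A quiet configuration stops at the minimum number of runs, while a noisy one keeps consuming runs until it either settles or hits the cap, at which point it can be set aside.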
These methods and others allow Optimizer Studio to safely accumulate small gains from many system tunables that performance engineers would normally overlook, eventually providing higher speedups at acceptable variability levels.