Using averages, percentiles and modes

Life isn’t always perfect… and neither is any system which is based on sampling of real-world systems. In many cases, the samples do not render a perfect normal distribution, but rather a skewed one. If we are to tune a system, whether manually or using an automatic tuning toolkit, we still need a good way to estimate the performance of a system under different configurations. This will allow us to determine which configuration performs best.

Taking numerous samples is always a good thing (as long as you can afford the sampling time), but if your samples have an attached skewed noise, the entire distribution is skewed.

We have skewed distributions, what’s the problem with that?

The problem with skewed distributions is that you need to generate a single representative value out of the samples – this is known as point estimation. 

In a normal distribution, finding a point estimation is rather easy – we can simply take the midpoint:


The midpoint in normal distribution is the point where all 3 M’s meet. The 3 M’s are: Median, Mean and Mode. The Median is the midpoint of a normal distribution, representing a value which is larger than half of the population and smaller than the other half of the population. Mean refers to the average of the population and Mode is the most likely value to occur.

Therefore, the most common point estimation when dealing with a normal distribution, is the midpoint, which is easily estimated by the average of the samples.

However, when dealing with a skewed distribution, the 3 M’s get separated and deciding on a point estimation gets trickier. For example, in a negatively skewed distribution, things should look like the following diagram:

Which value should we select as a representative?

  1. The mean? 
  2. The median (the one which is in the middle of all values)? 
  3. The mode (the value with the highest occurrence probability)? 
  4. Something else?

Well, there is no simple answer, as it depends on the nature of the experiment you run. Both the median and the mode are regarded as resilient to outliers. This means that extreme sample values (high or low) do not affect the median and the mode values in a significant way (or even at all). On the other hand, if these extreme values are not outliers – the median and mode will lose some of the information.

It is also important to note that it might be possible that none of the 3 M’s are relevant – For example, if you wish to optimize the Frames Per Second (FPS) of a video game, it is reasonable that you would be interested in optimizing the low 10th percentile.  Additionally, if you wish to break a performance speed record, it is reasonable to select the 95th percentile. These cases are not related to skewed distributions, but they use similar tools – such as percentile, which may be used to counter skewed distributions.

We decided on a representative value – how do we calculate it?

As we only have samples, we need to estimate the relevant population’s value. Let’s start from the easiest value to estimate and then escalate the difficulty:

Mean value

Estimating the mean value is the easiest of the three. As long as the samples are randomly selected, the mean of the samples will tend to converge with the mean of the population. Hence, a simple calculation of the arithmetic mean of the samples should be good enough.

Median value

Calculating the median value is only slightly harder than calculating the mean value. Again, assuming that the samples were selected randomly, the median value of the samples should converge with the median value of the population. Therefore, all we need to do is to sort the samples in size order, and pick the middle sample.

This will work well with an odd number of samples, but an even number of samples will introduce a problem of non-integral index numbers. For example, if we have 100 samples, then the median is the item with index number 50.5. Moreover, if we wish to extend our estimation for any given percentile (in contrast to calculating the 50th percentile), then we face the same problem as we do with odd number of samples.

There are several ways to handle scenarios where we need to select a non integral index i :

  1. Select sample[⌈i⌉], where i denotes rounding i up to the nearest integer
  2. Select sample[i⌋], where i denotes rounding i down to the nearest integer
  3. Select sample[⌊i+0.5⌋]
  4. Generating the value (sample[⌊i⌋] + sample[⌈i⌉]) / 2
  5. Generating the value sample[⌊i⌋] x (⌊i+1⌋-i) + sample[⌈i⌉] x (i-⌊i⌋)

All of these methods are valid. Method 3 (shown above) is called, index rounding, and that’s what Optimizer Studio uses.

Mode value

In most cases, the sampled values are taken from a continuous spectrum of values. Hence, we can easily end up with a set of unique values. 

If each of the sampled values appears only once, how can we estimate the mode value?

At first, one might suggest that we should use histograms in which we group the values into bins of equal breadth and then take the mid value of the most loaded bin. The problem is that this method can be misleading, as the following example demonstrates:

Suppose we have the following sample: 2, 2.5, 11, 12, 14, 16, 18, 19, 20, 23, 24.

If we group the items using bins with a breadth of 1, we get the following:

According to this histogram we should estimate the mode as 2.5.

However, if we use bins of breadth 2, the result histogram looks like this:

Now, we have 2 candidates for mode: 3 and 19. Which one of them should we choose?

Finally, if we use bins with a breadth of 3, then we get a new candidate with a value of 19.5:

So we can deduce that  using a histogram is problematic.

A more robust way of estimating the mode is called Least Median of Squares. In this method we look at the sorted samples x1,x2,…,xn and search for i such that xi+n/2xi is minimal. The estimation for mode is:

xi+xi+n/2
2
xi+xi+n/2

2

In simple words: We sort all the samples and then find a sequence of n/2+1 samples, which minimizes the distance between the first and the last item. The arithmetic mean of the first and the last samples of this sequence is the estimation of the mode.

This method is known as a robust (ie., resilient to outliers) estimation of the mode.

See this paper for more details and other methods for estimating mode.

Configuring Optimizer Studio

By default, Optimizer Studio uses average as a point estimation, however you can change that behavior easily by editing the configuration file (knobs.yaml). 

Configuring the point estimator function is done under the “global_settings:” tag. For example, if we wish to use the median as a point estimation, the configuration should be:

global_settings:
   point_estimator:
      percentile: 50

If, for example, you are optimizing the frame rate per second of a game, and you wish to discover the best configuration with the highest 10th percentile frame rate, then the configuration should be:

global_settings:
   point_estimator:
      percentile: 10

If your sampling suffers from outliers, you can use mode as a point estimation:

global_settings:
   point_estimator:
      mode:

It is also possible to set the point_estimator to the average:

global_settings:
   point_estimator:
      average:

But, there is no need to do so, as this is the default option.

Pin It on Pinterest