Tuning Intel Compiler Flags for Performance

One of the main reasons to use a commercial compiler is to attain better binary performance. But how do we achieve the highest performance? By using -O3? Not necessarily.

Let’s take the Intel C++ Compiler (ICC) as an example. It’s a popular commercial compiler that produces highly optimized code and often outperforms GCC (see this blog). ICC has hundreds of flags to choose from, which clearly conveys to us users that for better performance we’ll need to pick the right flags from that long list; otherwise, why would these flags exist in the first place?

Intel provides C++ code samples on its website, and we’ll use the Merge Sort sample for our evaluation. The benchmark is described on Intel’s website as follows:

  • Merge sort algorithm is a comparison-based sorting algorithm. In this sample, we use top-down implementation, which recursively splits list into two halves (called sublists) until size of list is 1. Then merge these two sublists and produce a sorted list. This sample could run in serial, or in parallel with Intel® Cilk™ Plus keywords cilk_spawn and cilk_sync. For more details about merge sort algorithm and top-down implementation, please refer to http://en.wikipedia.org/wiki/Merge_sort.

The Makefile of the benchmark includes a set of recommended compiler flags for use with ICC, as shown below:


CXX := icpc
SRCDIR := src
BUILDDIR := release
CFLAGS := -O3 -ipo -qopenmp -std=c++11
EXTRA_CFLAGS :=
LIBFLAGS := -qopenmp
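
The EXTRA_CFLAGS variable gives us a hook for injecting additional flags without editing the Makefile. As a minimal illustration, and assuming the Makefile appends EXTRA_CFLAGS to the compile line, a one-off experiment with an extra flag could look like this:

# rebuild the benchmark with one additional trial flag (illustrative)
make clean
make EXTRA_CFLAGS="-xSSE4.2"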

The flags above are consistent with those Intel recommends in its “Quick Reference Guide to Optimization with Intel® C++ and Fortran Compilers v19.1” for tuning application performance. These recommendations are listed below, followed by an example command line:

  • Setting general optimization flags (-O0, -O1, -O2, or -O3)
  • Targeting specific processor architectures (e.g. -xSSE4.2) for Intel Core processor families
  • Setting interprocedural optimization (-ipo) or profile-guided optimization (-prof-gen and -prof-use)
  • Setting flags for multithreaded, multi-core and multi-processor systems using -parallel or -qopenmp
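
As an illustration only, here is a minimal sketch of how these recommendations combine on an icpc command line. The source and output paths are assumptions, and -xSSE4.2 should be replaced by whatever architecture option matches the machine being tuned:

# General optimization level, architecture targeting, interprocedural
# optimization and OpenMP in a single compile (paths are illustrative).
icpc -O3 -xSSE4.2 -ipo -qopenmp -std=c++11 src/*.cpp -o release/MergeSort

# Profile-guided optimization is a two-step flow: instrument, run a
# representative workload, then recompile using the collected profile.
icpc -O3 -prof-gen src/*.cpp -o release/MergeSort
./release/MergeSort        # produces .dyn profile files in the current directory
icpc -O3 -prof-use src/*.cpp -o release/MergeSort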

The Merge Sort benchmark has a serial version and a parallel (OpenMP) version, and it runs both. When the benchmark finishes, it reports the time it took to merge-sort the shuffled input set, as described in the project’s README.md. Below is the output of one such run:


N = 100000000
Merge Sort Sample
[0] all tests
[1] serial
[2] OpenMP Task
0

Running all tests

Serial version:
Shuffling the array
Sorting
Sort succeeded in 11.9732 seconds.

OpenMP Task Version:
Shuffling the array
Sorting
Sort succeeded in 3.17086 seconds.

Mining ICC flags with Optimizer Studio

Our goal is to see whether better-performing flags can be found, and for this task we’ll use Optimizer Studio. The first step is to write its definition file.

  • Optimizer Studio has built-in support for ICC flags, so we import them.
  • We define two metrics that track the duration of each version, serial and parallel.
  • Since our setup may exhibit run-to-run variability, we instruct Optimizer Studio to sample each configuration at least twice.

The definition file is below:


import:
  optimizer.compilers.icc:

domain:
  common:
    baseline_config: baseline.json#best_knob_config
    metrics:
      avg time of serial version [ms]:
        kind: file
        path: /tmp/avg_time_serial
      avg time of OpenMP Task version [ms]:
        kind: file
        path: /tmp/avg_time_openmp
    target: avg time of serial version [ms]:min

global_settings:
  min_baseline_samples: 2
  min_samples_per_config: 2

Next, we write a script that Optimizer Studio invokes in each iteration; a sketch of such a script follows the list. The script performs the following tasks:

  • Receive the compiler flags for the current iteration as the environment variables ${KNOB_VALUES[@]}
  • Recompile the benchmark with the new flags if necessary, check that the binary was created successfully, and return an error otherwise (we used icpc (ICC) 19.0.4.227 20190416)
  • Run the benchmark
  • Parse the benchmark output and report the metrics to Optimizer Studio via the files specified in the definition file
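
Below is a minimal sketch of such a workload script. It is illustrative only: the binary path, the stdin-driven test selection, and the use of the Makefile’s EXTRA_CFLAGS hook are assumptions about this particular setup rather than requirements of Optimizer Studio.

#!/bin/bash
# Hypothetical workload script that Optimizer Studio invokes on every iteration.
set -e

# Rebuild the benchmark with the flags proposed for this iteration.
# KNOB_VALUES is provided by Optimizer Studio; EXTRA_CFLAGS is the hook
# assumed to exist in the benchmark's Makefile.
make clean
make EXTRA_CFLAGS="${KNOB_VALUES[*]}"

# Fail the iteration if the binary was not produced (path is an assumption).
[ -x release/MergeSort ] || exit 1

# Run both versions of the benchmark ("0" appears to select all tests).
out=$(echo 0 | ./release/MergeSort)

# Extract the two "Sort succeeded in X seconds." durations: the serial
# version is reported first, then the OpenMP Task version.
serial=$(echo "$out" | awk '/Sort succeeded/ {print $(NF-1)}' | sed -n 1p)
openmp=$(echo "$out" | awk '/Sort succeeded/ {print $(NF-1)}' | sed -n 2p)

# Report the metrics through the files named in the definition file.
echo "$serial" > /tmp/avg_time_serial
echo "$openmp" > /tmp/avg_time_openmp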

After we have the workload script and the definition file, we run the optimization on an Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz. Optimizer Studio searched for a better configuration using 243 different flags, which represent a parameter space of 1.3*10^138 possible configurations. During the optimization, Optimizer Studio tries different configurations and uses the workload script to recompile, run, and measure the benchmark. Eventually, Optimizer Studio achieved a 12.7% performance boost for the serial version of the benchmark, as shown in the following table:

Performance metric                        Baseline   Tuned    Speedup
avg time of serial version [sec]          11.9515    10.603   1.127x
avg time of OpenMP Task version [sec]     2.403      2.121    1.132x

Achieving the 12.7% boost took 25 hours, but the improvement came gradually: a 5% gain was reached after only one hour, and 7.7% after two hours.

In the graph below, each data point represents the duration of the serial version reported by a binary that was compiled with a certain set of compiler flags. The green data point represents the baseline configuration, and the red data points represent configurations that achieved the lowest duration so far.

Running this optimization with Optimizer Studio also lets us plot insightful graphs of the configuration space, such as the one shown below, where the vertical axis is the time of the serial version and the horizontal axis is the time of the parallel version:

Configurations that fall along diagonal lines most probably correspond to flags that impact both the serial and the parallel versions, while configurations that fall along horizontal lines (constant serial time) most probably correspond to flags that impact only the parallel version. Configurations on the Pareto frontier (minimizing both the serial and the parallel times) are shown as red data points.

The best-performing configuration had the following 116 flags:


-ansi -ansi-alias-check -auto-ilp32 -axSSE3 -Bdynamic -Bsymbolic -C -c -check-pointers-dangling=none -dD -debug none -fabi-version=1 -falign-functions=1 -fargument-alias -femit-class-debug-always -ffunction-sections -finline-functions -finline-limit=987 -fjump-tables -fma -fno-align-loops -fno-common -fno-eliminate-unused-debug-types -fno-exceptions -fno-friend-injection -fno-implicit-templates -fno-inline -fno-optimize-sibling-calls -fno-pic -fprotect-parens -ftemplate-depth=6 -ftls-model=global-dynamic -gdwarf-2 -global-hoist -hotpatch=1 -inline-level=0 -ipo -ipo-c -ipo-jobs11 -ipo-separate -ipp=nonpic -ipp-link=static -Kc++ -malign-double -mauto-arch=SSSE3 -mbranches-within-32B-boundaries -mconditional-branch=pattern-report -minstruction=movbe -MMD -mregparm=2 -mstringop-inline-threshold=1 -mtune=pentium -multiple-processes=8 -no-alias-const -no-fast-transcendentals -no-fp-port -no-inline-calloc -no-inline-max-per-compile -no-inline-min-size -no-parallel-source-info -no-prof-data-order -no-prof-gen -no-restrict -no-sox -notraceback -no-vec -par-runtime-control3 -par-schedule-guided-analytical -par-threshold97 -pie -prec-sqrt -prof-func-groups -prof-func-order -prof-hotness-threshold=92 -qno-offload -qno-opt-assume-safe-padding -qno-opt-class-analysis -qno-opt-jump-tables -qno-opt-multi-version-aggressive -qno-opt-subscript-in-range -qopenmp -qopenmp-lib=compat -qopenmp-simd -qopenmp-stubs -qopt-block-factor=67 -qopt-calloc -qopt-malloc-options=3 -qopt-mem-layout-trans=0 -qopt-prefetch=0 -qopt-prefetch-distance=42 -qopt-prefetch-issue-excl-hint -qopt-ra-region-strategy=trace -qopt-report=0 -qopt-report-annotate=html -qopt-report-embed -qopt-report-format=vs -qopt-report-names=mangled -qopt-report-per-object -qopt-streaming-stores=never -qoverride-limits -qsimd-honor-fp-model -qsimd-serialize-fp-reduction -scalar-rep -shared -simd -simd-function-pointers -static-libstdc++ -strict-ansi -undef -unroll=14 -V -vecabi=compat -vec-guard-write -vec-threshold99 -xSSE2

At least in this merge sort example, the Intel compiler behaves much like other compilers such as GCC and LLVM: a careful, calculated choice of compiler flags can yield a 5%-20% performance improvement over the straightforward -O3 and -ipo flags. Therefore, for maximum performance it is not enough to simply purchase a commercial compiler; it is also necessary to mine the compiler flags.

Contact us

Have any lingering questions or need a walkthrough? Send us a message and we’ll get back to you.
