Selecting the right compiler flags for Coremark on Marvell’s ThunderX2 Processor
One of the first things you learn in Compilers 101 is that compiler optimizations can do wonders to speed up binaries. Although this has been a very active field of research for decades, compiler research groups around the world are still producing innovations. Generating a binary from source code is complex enough; generating an efficient binary is a whole different story. There are countless optimizations that can be applied, but often their effectiveness only becomes known at runtime. This is why choosing the right compiler flags is important, and even more so when optimizing for new CPUs and architectures, as optimizations that are effective on one architecture might be ineffective, or worse, detrimental on another. Compilers have countless flags to choose from, so selecting the right ones is difficult. Sure, “-O3” performs much better than no optimization at all, but a more precise combination of flags can yield even greater performance.
Case study: Optimizing Coremark Compiler Flags on Marvell’s ThunderX2 Processor
Our friends at Marvell® wanted to check whether the process of selecting compiler flags for binaries running on their flagship ThunderX2® 64-bit Arm® v8-A processor could be automated. The EEMBC® Coremark® benchmark, which measures CPU performance, was chosen as the example application.
In focus were 29 gcc compiler flags, representing a parameter space of 7.88 × 10^21 possible configurations. An exhaustive search over all of these configurations is impossible in our lifetimes, so optimizing the compiler flag configuration manually requires expertise, experience, and knowledge of how to prune the parameter space. The trivial first step is to use the “-Ofast” flag, which enables all “-O3” optimizations, and then some. This flag by itself provides a 5.2x speedup over no flags at all, so it was chosen as the baseline. Then, after an optimization effort that took many hours of experimentation by experienced engineers, a recommended compiler flag configuration for Coremark on the ThunderX2 processor was developed. It outperforms the “-Ofast” baseline by 7.6%.
The question is, can we do better than 7.6%? And can we do it with less engineering effort? To answer these questions, an automated approach was evaluated using Concertio’s Optimizer Studio. Running Optimizer Studio directly on the ThunderX2 processor was not an option, since Optimizer Studio itself supports running only on x86 architectures. Instead, it was installed inside a virtual machine on an x86 server. In this type of setup, any device that supports remote configuration can be optimized.
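To put the size of that parameter space in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes one compile-and-run cycle takes about 30 seconds (the approximate Coremark run time reported for this experiment) and that runs are executed one after another; the function name and the one-million-machine scenario are purely illustrative.

```python
# Back-of-the-envelope estimate: how long would an exhaustive search take?
CONFIGURATIONS = 7.88e21      # size of the parameter space quoted in the article
SECONDS_PER_RUN = 30          # approximate Coremark run time quoted in the article
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(configs: float, seconds_per_run: float, machines: int = 1) -> float:
    """Years needed to run every configuration once, split evenly across machines."""
    return configs * seconds_per_run / machines / SECONDS_PER_YEAR


if __name__ == "__main__":
    one_machine = exhaustive_search_years(CONFIGURATIONS, SECONDS_PER_RUN)
    print(f"One machine: about {one_machine:.2e} years")
    # Hypothetical scenario: even a fleet of one million machines barely helps.
    fleet = exhaustive_search_years(CONFIGURATIONS, SECONDS_PER_RUN, machines=1_000_000)
    print(f"One million machines: about {fleet:.2e} years")
```

Under these assumptions a single machine would need on the order of 10^15 years, which is why any practical approach has to sample the space rather than enumerate it.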
Optimizer Studio was configured for synchronous sampling mode, which means that it iteratively runs workloads to completion, changing the compiler flags between runs. In each iteration, Optimizer Studio selects a configuration of compiler flags and writes it into files on the ThunderX2 server using ssh. Optimizer Studio then invokes the workload.sh script, which resides on the same VM as Optimizer Studio. This script uses ssh to invoke the remote_workload.sh script on the ThunderX2 server. The remote script compiles Coremark, runs it, and reports Coremark’s performance (the iterations per second it achieved) back to Optimizer Studio. To deal with the variability of Coremark’s performance, Optimizer Studio was configured to take at least two samples per configuration.
The optimization process took 3 hours and 9 minutes, with each Coremark run taking approximately 30 seconds. Optimizer Studio tested a very small subset of the possible compiler flag configurations: 288 different configurations using 515 samples, out of 7.88 × 10^21 possible configurations. Nevertheless, it found a configuration that showed a 10.5% speedup over “-Ofast”, exceeding the 7.6% achieved by manual tuning. Coremark performance during the optimization is detailed in the accompanying graph.
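To make the synchronous sampling loop described in this section more concrete, here is a minimal Python sketch. It is not Optimizer Studio's algorithm, which the article does not describe: plain random search is swapped in as a stand-in. The tunables in `SEARCH_SPACE`, the scoring numbers, and the `make_fake_measure` helper are all hypothetical. In a real setup, the `measure` callable would push the flags to the target machine, compile and run the benchmark there, and return the reported iterations per second.

```python
import random
import statistics
from typing import Any, Callable, Dict, List, Optional, Tuple

Config = Dict[str, Any]

# Hypothetical tunables, for illustration only (the experiment tuned 29 gcc flags).
SEARCH_SPACE: Dict[str, List[Any]] = {
    "opt_level": ["-O2", "-O3", "-Ofast"],
    "unroll_loops": [False, True],
    "max_inline_insns_auto": [20, 40, 80, 180],
}


def random_config(rng: random.Random) -> Config:
    """Pick one value per tunable uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def synchronous_search(
    measure: Callable[[Config], float],
    iterations: int,
    min_samples: int = 2,
    seed: int = 0,
) -> Tuple[Optional[Config], float]:
    """Run each candidate to completion min_samples times; keep the best mean score."""
    rng = random.Random(seed)
    best_config: Optional[Config] = None
    best_score = float("-inf")
    for _ in range(iterations):
        config = random_config(rng)
        # Repeating the measurement dampens run-to-run variability.
        samples = [measure(config) for _ in range(min_samples)]
        score = statistics.mean(samples)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def make_fake_measure(seed: int = 1) -> Callable[[Config], float]:
    """Build a noisy stand-in for the remote compile-and-run step (made-up numbers)."""
    noise = random.Random(seed)
    bonus = {"-O2": 0.0, "-O3": 10.0, "-Ofast": 12.0}

    def measure(config: Config) -> float:
        score = 100.0 + bonus[config["opt_level"]]
        score += 3.0 if config["unroll_loops"] else 0.0
        score += config["max_inline_insns_auto"] / 100.0
        return score + noise.uniform(-0.5, 0.5)

    return measure


if __name__ == "__main__":
    best, score = synchronous_search(make_fake_measure(), iterations=50)
    print(best, round(score, 2))
```

Keeping the measurement behind a single callable mirrors the setup described above, where the optimizer only sees a script that returns a number and does not need to know whether the workload ran locally or on a remote server.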
The settings that yielded the best Coremark performance were the following:
CASE_VAL_THRES: 60
MAX_CSE_INSNS: 1000
MAX_INL_REC_DEPTH: 12
PORT_CFLAGS: -O3
CASE_VAL_THRES: 80
FALIGN_LBLS: DISABLED
FMOD_SCHED: DISABLED
GCSE_UNRES_COST: 7
MAX_INL_INSNS_AUTO: 180
MAX_INL_INSNS_REC: 500
MAX_INL_REC_DEPTH_AUTO: 9
MCPU: DISABLED
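The article does not say how these tunable names map to actual gcc options. The Python sketch below shows one plausible translation, and every part of the mapping is an assumption: the names are guessed to correspond to similarly named gcc `--param` settings and `-f` switches, `DISABLED` is assumed to mean the option is left off the command line, and `PORT_CFLAGS` is assumed to be passed through verbatim as in Coremark's build files. Only a few of the settings are used in the usage example.

```python
from typing import Dict, List, Union

Setting = Union[int, str]

# ASSUMED mapping from the article's tunable names to gcc --param names.
PARAMS = {
    "CASE_VAL_THRES": "case-values-threshold",
    "MAX_CSE_INSNS": "max-cse-insns",
    "GCSE_UNRES_COST": "gcse-unrestricted-cost",
    "MAX_INL_INSNS_AUTO": "max-inline-insns-auto",
    "MAX_INL_INSNS_REC": "max-inline-insns-recursive",
    "MAX_INL_REC_DEPTH": "max-inline-recursive-depth",
    "MAX_INL_REC_DEPTH_AUTO": "max-inline-recursive-depth-auto",
}

# ASSUMED mapping for on/off switches.
SWITCHES = {
    "FALIGN_LBLS": "-falign-labels",
    "FMOD_SCHED": "-fmodulo-sched",
}


def to_gcc_flags(settings: Dict[str, Setting]) -> List[str]:
    """Translate tunable settings into a list of gcc command-line arguments."""
    flags: List[str] = []
    for name, value in settings.items():
        if value == "DISABLED":
            continue  # assumption: a disabled tunable adds nothing to the command line
        if name == "PORT_CFLAGS":
            flags.extend(str(value).split())  # base flags are passed through as-is
        elif name in PARAMS:
            flags.extend(["--param", f"{PARAMS[name]}={value}"])
        elif name in SWITCHES:
            flags.append(SWITCHES[name])  # any value other than DISABLED turns it on
        elif name == "MCPU":
            flags.append(f"-mcpu={value}")
        else:
            raise ValueError(f"no assumed gcc mapping for {name!r}")
    return flags


if __name__ == "__main__":
    example = {
        "PORT_CFLAGS": "-O3",
        "MAX_CSE_INSNS": 1000,
        "FALIGN_LBLS": "DISABLED",
        "GCSE_UNRES_COST": 7,
    }
    print(" ".join(to_gcc_flags(example)))
```

Returning a list of arguments rather than a single string makes it easy to hand the result to a build command without worrying about shell quoting.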
Shay Gal-On, Principal Engineer at Marvell and a Performance Analysis and Optimization Professional, summarizes the experiment: “Concertio’s Optimizer Studio proved to be an effective tool for optimizing benchmarks such as Coremark on the ThunderX2 processor. This kind of automation can help divert our engineering resources from tedious parameter search to other important performance optimization tasks.”
Larry Wikelius, Vice President SW Ecosystem and Solutions Group, adds: “Concertio Optimizer Studio is a great addition to ThunderX2’s growing ecosystem as it can automate compiler flag mining of existing and newly-ported applications and improve their performance.”
Go on, mine the flags!
Compiler flag mining is an important task in performance optimization of applications, and Optimizer Studio can be used to discover good-performing flags automatically.
Since we developed Optimizer Studio with flexibility in mind, features such as remote optimization, support for diverse architectures, user-defined tunables, and handling of noisy environments are all native to our tools. It’s exciting to see how our friends at Marvell used all of these capabilities to their benefit.
Marvell and ThunderX2 are registered trademarks of Marvell and/or its affiliates.