Did you know that HPC clusters can be optimized automatically?
Tuning HPC applications can be very difficult. There are many settings in the applications, the MPI stack, and the underlying OS and hardware, making it next to impossible for performance engineers to fully optimize their applications and systems. Tools that employ machine learning techniques, such as those used by the Concertio Optimizer Studio product, can help achieve better results more quickly.
This blog post explains how and why AI and machine learning can help optimize HPC clusters automatically and improve their performance.
The Challenges of Manual Performance Tuning
As a performance engineer, you understand the complexities of today’s HPC systems. Not only is each HPC cluster custom in nature, but these systems have also grown from having a handful of knobs to hundreds, making it impractical for performance engineers to fully optimize every available tunable by hand.
Illustration of the increase in the number of tunable settings over time, well exceeding the human limit
Often, tuning decisions come down to what the performance engineer believes has the most influence on the current workload, along with what they can achieve with their specialized knowledge (e.g., if you know a lot about CPUs, you focus on optimizing the CPU settings).
As a result, only a small number of tunables per workload can be explored, and many of the cluster’s settings are left at their defaults, leaving the machine’s full potential unused.
Another challenge with manual performance tuning is that workloads and inputs change over time, so tuning needs to be repeated on a regular basis. This is a further drain on time and resources.
How Exactly Does Machine Learning Work to Optimize Performance Tuning?
Reinforcement learning is the key to optimizing HPC performance tuning. At the basic level, reinforcement learning works like trial and error. The software learns as it goes by acting, checking if the result was better or worse, and then moving forward based on the result.
Illustration of the reinforcement learning process
In the context of performance tuning, reinforcement learning is used to iteratively come up with a settings configuration and measure the performance of the HPC cluster with these settings, until either performance is maximized or the time budget for optimization has been exceeded. So instead of manually sweeping various parameters, the software automatically explores and reports the best configurations that were found.
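To make the trial-and-error loop concrete, here is a minimal Python sketch. It is not the algorithm used by Optimizer Studio; it is an epsilon-greedy local search, a much simpler scheme in the same spirit: try a configuration, measure it, keep it if it is better, and occasionally explore at random. The tunables in SEARCH_SPACE and the synthetic measure() function are assumptions made purely for illustration. In a real setup, measure() would apply the settings, run the application, and report its performance.

```python
import random

# Hypothetical tunables and their allowed values (illustrative only).
SEARCH_SPACE = {
    "omp_threads": [1, 2, 4, 8, 16],
    "smt": ["on", "off"],
    "prefetcher": ["on", "off"],
}


def measure(config):
    """Synthetic stand-in for running the workload; higher is better.

    A real implementation would apply the settings, run the application,
    and return a measured metric such as throughput.
    """
    score = 100.0 - abs(config["omp_threads"] - 8) * 5
    score += 10 if config["smt"] == "off" else 0
    score += 5 if config["prefetcher"] == "on" else 0
    return score


def random_config(rng):
    """Pick a value for every knob uniformly at random."""
    return {knob: rng.choice(values) for knob, values in SEARCH_SPACE.items()}


def mutate(config, rng):
    """Return a copy of config with exactly one knob set to a different value."""
    new = dict(config)
    knob = rng.choice(sorted(SEARCH_SPACE))
    new[knob] = rng.choice([v for v in SEARCH_SPACE[knob] if v != config[knob]])
    return new


def optimize(budget=200, epsilon=0.2, seed=0):
    """Act, measure, and keep what improved, until the budget runs out."""
    rng = random.Random(seed)
    best = random_config(rng)
    best_score = measure(best)
    for _ in range(budget):
        # With probability epsilon explore a random configuration;
        # otherwise refine the best configuration found so far.
        if rng.random() < epsilon:
            candidate = random_config(rng)
        else:
            candidate = mutate(best, rng)
        score = measure(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    config, score = optimize()
    print(config, score)
```

The budget parameter plays the role of the time budget mentioned above: the loop stops after a fixed number of measurements and reports the best configuration it has seen, whether or not that configuration is truly optimal.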
How Does the Concertio Optimizer Studio Tool Work for HPC?
In the context of optimizing HPC clusters, Concertio Optimizer Studio works by iteratively running HPC applications. Using reinforcement learning, Optimizer Studio attempts certain configuration settings and sees the results. If the results are improving, then the software will decide to take this path and try to refine it. If not, it will try something else. At the end of the optimization run, the performance engineer will get a “grocery list” of the settings that can be applied to achieve near-optimal performance for the specific application that was tested.
What Are the Benefits of Using Machine Learning for HPC Cluster Tuning?
The benefits of a tool like Optimizer Studio for HPC cluster tuning are the potential speedups, together with the time, energy, and cost savings associated with its implementation.
Not only are performance engineers free to focus their time and energy on higher-level HPC architecture optimization tasks, but time-to-market also shortens because the tuning process takes less time. Furthermore, the engineering fatigue that comes from performing mundane parameter searches can be greatly reduced.
Implementing a machine-learning-based approach to automate performance tuning also allows HPC users to more easily take advantage of heterogeneous hardware and software components. Additionally, using machine-learning techniques, hidden system configurations that deliver superior application performance can be discovered.
Apart from using machine learning to automatically optimize system settings, Concertio’s Optimizer Studio product is highly configurable, allowing users to optimize just about any setting. These range from compiler flags that maximize a binary’s performance, through application-level settings such as the number of OpenMP threads, the numerous MPI settings and libraries, operating system settings such as the various scheduler constants, and networking settings such as those in Mellanox ConnectX NICs, all the way down to CPU settings such as SMT or the LLC prefetchers. As long as users can write get and set shell scripts for a setting, it can be plugged into Optimizer Studio for optimization.
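The get/set idea can be sketched in a few lines of Python. The ShellKnob class below is a hypothetical wrapper written for this post, not part of Optimizer Studio, and it does not reflect how the product actually registers settings. It only shows why a pair of shell commands is enough to expose almost any setting to an automated search: one command prints the current value, the other receives the new value as its first argument.

```python
import shlex
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ShellKnob:
    """A tunable described only by a 'get' and a 'set' shell command.

    Hypothetical helper for illustration; the names are assumptions.
    """
    name: str
    get_cmd: str  # must print the current value on stdout
    set_cmd: str  # receives the new value as "$1"

    def get(self) -> str:
        result = subprocess.run(
            ["sh", "-c", self.get_cmd],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()

    def set(self, value) -> None:
        # "knob" fills $0, so the value is available as $1 in the script.
        subprocess.run(["sh", "-c", self.set_cmd, "knob", str(value)], check=True)


def demo():
    """Exercise a knob backed by a temporary file instead of a real setting."""
    with tempfile.TemporaryDirectory() as tmp:
        path = shlex.quote(str(Path(tmp) / "threads"))
        knob = ShellKnob(
            name="threads",
            get_cmd=f"cat {path}",
            set_cmd=f'printf %s "$1" > {path}',
        )
        knob.set(4)
        first = knob.get()
        knob.set(8)
        return first, knob.get()


if __name__ == "__main__":
    print(demo())
```

The demo uses a temporary file so that it is safe to run anywhere. For a real setting, the two commands would read and write the actual control, for example a sysfs file or a configuration utility, and would typically need elevated privileges.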
The LINPACK Example
The LINPACK benchmarks are popular in the HPC world for measuring floating-point performance. Below is the console output of a simple example of using Optimizer Studio to optimize LINPACK on a bare-metal server:
While this example warrants a detailed blog post of its own (stay tuned!), it demonstrates that numerous application and system settings can be optimized automatically using machine learning techniques.
The Bottom Line – What to Expect
Implementing an automated approach to HPC cluster optimization using machine learning can greatly improve performance. Speedups can reach up to 5x in certain cases, but it really depends on the “love” users apply to the tool. The more custom settings, the more there is to gain!