Parallel Optimization Mode

Overview

Often, each experiment testing a knob configuration runs for a long time - hours, or even days.
Optimizer Studio is able to distribute the workload between multiple computing nodes, testing different knob configurations in parallel. This both speeds up the optimization process and makes better use of the available hardware resources.

Pitfalls

Parallel processing is not without its pitfalls:

  1. Optimizer Studio explores only a limited number of configurations concurrently. By default, fewer than 10 configurations are explored at once, and each configuration is measured at least min_samples_per_config times. This places a practical limit on the number of parallel computing nodes of roughly 7 x min_samples_per_config. For example, with min_samples_per_config = 3, about 21 worker nodes can be kept busy.

  2. The computers running the workload should be very similar and should produce similar performance results for the same knob configuration, with a reasonable standard deviation. Otherwise, Optimizer Studio will not be able to distinguish between an improvement achieved by a configuration choice and one caused by better-performing hardware.

Parallel Optimization Workflow

  1. Select a main (admin) node to run Optimizer Studio on, and one or more worker nodes. The main node can double as a worker node.
  2. Install the Optimizer Studio package on each computer - both the admin and worker nodes. Note that the license must be activated on the admin node only.
  3. Install the knob file(s) and workload script(s) on each node, in the same location, so that they can be accessed via the same absolute path.
  4. Arrange a password-less ssh connection from the admin node to the worker nodes, so that no interactive login credentials are required. The recommended way is to use SSH keys with no passphrase (see the example after this list).
  5. Enable HTTP access to the admin node, so that the worker nodes can communicate with the optimization engine via the REST API. The default HTTP port is 8421. The user can change the HTTP port via the command line switch optimizer-ctl init ... --http-port=8421
  6. Add a parallel subsection to the workload section of the knob file.
  7. Optimizer Studio invokes the workload starter with the parameters passed in the parallel subsection. The workload starter then launches optimizer-studio.worker on the worker node(s), e.g. remotely via ssh.
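
For step 4, a minimal way to set up key-based access from the admin node is sketched below; it assumes OpenSSH is available and that user@host1 / user@host2 stand in for the actual worker logins:

ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519    # key pair with no passphrase
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@host1      # repeat for every worker node
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@host2
ssh user@host1 true                                   # should not prompt for a password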

Optimizer Studio Configuration in Parallel Mode

Workload Starter Script(s)

Optimizer Studio comes with two different starter scripts located in the Optimizer Studio installation directory:

  • optimizer-studio.starter.ssh
  • optimizer-studio.starter.local

Users can modify these scripts to meet the requirements of their environment.

Each starter script accepts its own parameters:
the optimizer-studio.starter.ssh script accepts the list of IP addresses of the worker nodes, while
the optimizer-studio.starter.local script accepts the number of (local) worker processes.

The purpose of the starter script is to initiate all the worker processes and exit - the worker processes are expected to keep running on their own.
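
As a rough illustration of this start-and-exit pattern, a hypothetical ssh-style starter might look like the sketch below. It is not the bundled optimizer-studio.starter.ssh; the real worker command line is whatever that script uses, and is represented here by a bare optimizer-studio.worker call with no arguments:

#!/bin/bash
# Hypothetical sketch only - launch one worker per address passed on the command line, then exit.
for node in "$@"; do
    # nohup + background + ssh -n let the worker keep running after this script (and the ssh session) exit
    ssh -n "$node" "nohup optimizer-studio.worker >/dev/null 2>&1 &"
done
exit 0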

Worker process(es)

A worker process can be initiated either by the starter script or in any other way, if appropriate.
The worker process runs until the communication channel with the admin node goes down, at which point it initiates self-shutdown.

knobs.yaml Configuration File

Workload Section

To set up Optimizer Studio in parallel mode, a parallel subsection has to be added to the normal workload section (the declarative workload definition).
Example [ssh]:

global_settings:
  ...
  pending_config_timeout: 0
  http_port: 8421
  http_buffer_size: 20480
  http_retry_limit: 5

domain:
  common:
    knobs:
    ...
    metrics:
      result:
        kind: file
        path: /tmp/result.${WORKER_ID}
    target: result:max

workload:
  kind: sync
  parallel:
    mode: ssh
    workers:
      - user@host1[:port]
      - user@host2[:port]

  run:
    command: |
      echo "{{A + B + C}}" > /tmp/result.${WORKER_ID}

Example [local]:

    ...
    target: result:max

workload:
  kind: sync
  parallel:
    mode: local
    num_workers: 2

  run:
    command: |
      echo "{{A + B + C}}" > /tmp/result.${WORKER_ID}

Parallel-specific configuration parameters

Parallel mode uses the normal knobs.yaml file. This section covers the parameters that are relevant mainly to parallel mode.

pending_config_timeout

Configuration attempt scheduling policy.
pending_config_timeout: 0
The configurations will be attempted sequentially. For example, configurations A, B, C, D would be attempted as AAABBBBCCDDD.
This policy is appropriate for the singular (non-parallel) operating mode.

pending_config_timeout: T
(T is a time specification, e.g. 1m30s or 1h30m)

This policy is appropriate for the parallel operating mode.
The configurations will be attempted in an interleaved fashion with timeout T.
For example, given configurations A, B, C, D and min_samples_per_config = 2, the configurations would be attempted as AABBCCDDABDB. If more than T passes between consecutive attempts of the same configuration, that configuration is retired and the next configuration is attempted.
Since the purpose of the timeout is to prevent the optimization from being stuck on the same configuration forever, its value should be selected with generous slack:
T = (number of configurations explored at once, ~10 by default) x min_samples_per_config x max-execution-time.
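
For instance, with ~10 configurations explored at once, min_samples_per_config = 3 and a maximum workload execution time of 10 minutes (the numbers are purely illustrative), T = 10 x 3 x 10m = 300m, so a setting along these lines would be reasonable:

global_settings:
  pending_config_timeout: 5h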

http_port

Override the default HTTP port (8421).

http_buffer_size

HTTP buffer size in bytes, used, among other things, for passing the knob configuration to the worker nodes. For longer knob lists, this value may need to be increased beyond the default of 20480 bytes.
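
For example, to double the buffer for a long knob list (the value is illustrative only):

global_settings:
  http_buffer_size: 40960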

http_retry_limit

The maximum number of retries a worker attempts after an HTTP transaction failure before declaring the communication channel with the admin node down and initiating self-shutdown.