Using hpvm-dse

hpvm-dse provides a design space exploration (DSE) workflow for optimizing HPVM programs, supporting any language that can target the HPVM IR (currently HPVM-C and Hetero-C++). The framework uses HyperMapper (HM) to explore the design space.

Tool Flow

At a high level, hpvm-dse proceeds as follows:

  1. Read in the input program and extract potential optimizations and their tunable parameters.

  2. Provide sample evaluations for HyperMapper’s “design of experiment” phase, where random samples are generated and evaluated to initialize a surrogate cost model. Each sample specifies a value for each tunable parameter, which hpvm-dse uses to optimize the input program before evaluating its cost.

  3. Provide evaluations for HM’s optimization phase, where its cost model is incrementally updated to improve accuracy.

  4. Report the samples with the best optimization objective value.

  5. Optionally, synthesize these designs.


The Input Program

The input to hpvm-dse is an LLVM IR file (.ll or .bc) in the format that would be obtained by using hcc to compile a Hetero-C++ program into LLVM IR.

Invoking hpvm-dse

A basic invocation of hpvm-dse might look like this:

hpvm-dse input.ll -o=outdir --dse-iter=100 --timeout=300 --reps=5

This will perform 5 runs of DSE with default optimization parameters, each consisting of 100 optimization steps, with a timeout of 300 seconds for each sample evaluation. -o specifies the output directory, which will be structured as follows:
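For instance, with --reps=5 as in the invocation above, the layout would resemble the following sketch (the exact contents of each sample directory are described below):

```
outdir/
├── Rep.0/
│   ├── Sample.0/
│   ├── Sample.1/
│   └── ...
├── ...
└── Rep.4/
    └── ...
```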


Each Rep.i contains the output for the i'th parallel DSE run as specified by --reps. Each Sample.j contains the output for the j'th sample evaluated in each run. By default, each repetition runs in parallel.

As each DSE sample is evaluated, the tool produces an output directory containing the build artifacts of the sample. This includes:

  • LLVM host code - a .host.ll file

  • LLVM kernel code - usually main.hpvm.ll.kernels.ll

  • OpenCL kernel code


If --synth is specified, hpvm-dse will invoke AOC to perform full synthesis on the best samples from each repetition. This generates a .aocx design file within the sample directory, and the associated .host.ll file will be hard-coded to load the design from the design’s path. The easiest way to use the design is to first compile the original program with HPVM’s FPGA backend, then copy the optimized host code to your build directory. Building again will link with the new host code file and produce an executable that will use the design. Using a custom host executable is necessary since inter-kernel optimizations like Node Fusion may change the design’s OpenCL interface by changing what kernels are present and the order in which they need to be executed.
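Concretely, the rebuild flow might look like this sketch (all paths, the sample number, and the make invocations are hypothetical and depend on the benchmark's build system):

```
cd /path/to/my-benchmark
make                                                  # 1. initial build with HPVM's FPGA backend
cp /path/to/outdir/Rep.0/Sample.7/*.host.ll build/    # 2. copy in the optimized host code
make                                                  # 3. rebuild; the executable now loads the .aocx design
```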

Common Options


  • -o=<dir> - Set the top level output directory.

  • --synth - Turn on synthesis of best reported points.

  • --emu - Synthesize and compile host code to work with device emulation, instead of full synthesis.

  • -b=<board> - Specify the board - defaults to A10GX.

  • -csv - Generates a CSV file in each repetition directory, holding a table describing each evaluated sample’s parameters and objective values.

  • --always-overwrite - Delete existing output directory without prompting.

DSE Parameters

  • --dse-iter=<num> - How many samples to evaluate during the optimization phase.

  • --doe-multiplier=<N> - By default, the number of samples in the design of experiment phase is the number of extracted parameters + 1. This option overrides that default, setting the DoE sample count to N times the parameter count.

  • --reps=<n> - How many full DSE runs to perform.

  • -j=<n> - How many jobs to use for running repetitions in parallel. Defaults to the number of reps.

  • --rnd - Use uniform random sampling for DoE.

  • --slh - Use Standard Latin Hypercube sampling.

  • --timeout=<t> - The maximum number of seconds an evaluation may take before being killed.

  • --util-threshold=<n> - The maximum resource utilization, in percent, before rejecting a sample as invalid. This defaults to 95%.


  • --bufferin, --argpriv, --lunroll, --lfusion, --nfusion, --tlp - Enable the corresponding optimization. By default, all optimizations are enabled, but specifying any of these flags makes all optimizations opt-in, so only the listed ones are applied.

  • --max-dim-uf=<n> - The maximum unroll factor for HPVM node replication factors.

  • --max-dim-uf-options=<n> - The maximum number of candidate values to try for each replication factor; each candidate is still bounded by --max-dim-uf.

  • --max-uf=<n>, --max-uf-options=<n> - Similar to the above, but for loops within leaf node function bodies.

Together, --max-uf and --max-uf-options limit the set of candidate unroll factors. For example, with --max-uf=20 and --max-uf-options=5, a loop that runs for 100 iterations would first have its candidate factors limited to 1, 2, 4, 5, 10, and 20 (the divisors of 100 up to 20); the second option then keeps only the first 5 of these, so 20 would not be included in the range of unroll factors. These options are useful for pruning design points that are unlikely to fit in the resource budget, or for speeding up convergence by shrinking the search space.
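The pruning in this example can be reproduced with a short shell loop (this mirrors the worked example above, not hpvm-dse's internal implementation):

```shell
#!/bin/sh
# Candidate unroll factors for a 100-iteration loop under --max-uf=20 and
# --max-uf-options=5: keep divisors of the trip count up to max_uf, and
# stop after max_opts candidates have been found.
trip=100
max_uf=20
max_opts=5

count=0
f=1
while [ "$f" -le "$max_uf" ] && [ "$count" -lt "$max_opts" ]; do
  if [ $((trip % f)) -eq 0 ]; then
    echo "$f"
    count=$((count + 1))
  fi
  f=$((f + 1))
done
```

This prints 1, 2, 4, 5, and 10: the divisor 20 survives the --max-uf cap but is dropped by the five-option limit.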

Custom Evaluators

By default, hpvm-dse uses a built-in execution-time model as the optimization metric for the output program, but it also allows the user to specify a custom scoring script using the flag --custom-evaluator=./path/to/ The path should be such that executing the following command:

./path/to/ <SampleDir> <Rep>

prints out the custom metric value as a floating point number. SampleDir contains the host and kernel LLVM IR files generated by the backend passes for the selected targets. Rep is the ID of the DSE run the sample was generated for, in case multiple runs are being performed in parallel.
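A minimal skeleton satisfying this contract might look like the following (the file name and the placeholder score are illustrative; a real evaluator would compute the metric by building and running the program):

```shell
#!/bin/sh
# Hypothetical evaluator skeleton; hpvm-dse would invoke it as:
#   ./my-evaluator.sh <SampleDir> <Rep>
sampledir=$1   # directory with the sample's host/kernel LLVM IR files
rep=$2         # ID of the DSE run this sample belongs to

# ... build and run the program using the files in "$sampledir" ...

# The only hard requirement: print the metric as a single float on stdout.
echo "1.0"
```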

The following example script evaluates empirical execution time for the HPVM-CAVA benchmark on the GPU:


#!/bin/bash
# Invoked by hpvm-dse as: <script> <SampleDir> <Rep>
sampledir=$1
rep=$2

cd /path/to/benchmarks/hpvm-cava/

TARGET=gpu VERSION=_EVAL$rep make > /dev/null   # initial build
cp $sampledir/* build/gpu_EVAL$rep              # drop in this sample's IR files
TARGET=gpu VERSION=_EVAL$rep make > /dev/null   # rebuild against the optimized kernels

# Run the benchmark and extract the kernel execution time from its output.
./cava-hpvm-gpu_EVAL$rep example-tulips/raw_tulips.bin example-tulips/eval$rep | grep "Kernel Execution" | grep -oE "[0-9.]+"

The script builds the program, copies the build files from SampleDir, and builds again to link against the optimized kernels. It then runs the benchmark and extracts the running time from the output. Rep is used to separate the build files from other DSE runs that may be executing at the same time.
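Assuming the script above were saved as an executable file, say eval-cava.sh (name illustrative), it would be supplied to the DSE like so:

```
hpvm-dse input.ll -o=outdir --dse-iter=100 --reps=5 --timeout=300 --custom-evaluator=./eval-cava.sh
```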