hpvm-dse provides a DSE workflow allowing the optimization of HPVM programs,
supporting any language that can target the IR (currently HPVM-C and Hetero-C++).
The framework uses HyperMapper (HM) <https://github.com/luinardi/hypermapper>
to explore the design space.
At a high level, hpvm-dse performs the following steps:

1. Read in the input program and extract potential optimizations and their tunable parameters.
2. Provide sample evaluations for HyperMapper's "design of experiment" (DoE) phase, where random samples are generated and evaluated to initialize a surrogate cost model. Each sample specifies a value for each tunable parameter, which hpvm-dse uses to optimize the input program before evaluating its cost.
3. Provide evaluations for HM's optimization phase, where its cost model is incrementally updated to improve accuracy.
4. Report the samples with the best optimization objective value.
5. Optionally, synthesize these designs.
The Input Program
The input to hpvm-dse is an LLVM IR file (.bc or .ll) in the format obtained by using hcc to compile a Hetero-C++ program into LLVM IR.
A basic invocation of
hpvm-dse might look like this:
hpvm-dse input.ll -o=outdir --dse-iter=100 --timeout=300 --reps=5
This will perform 5 runs of DSE with default optimization parameters, with 100
optimization steps, and a timeout of 300 seconds for each sample evaluation.
-o specifies the output directory, which will be populated with the following structure:
output_dir/Rep.0/Sample.0
output_dir/Rep.0/Sample.1
output_dir/Rep.0/Sample.2
output_dir/Rep.1/Sample.0
output_dir/Rep.1/Sample.1
output_dir/Rep.1/Sample.2
...
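If you post-process results outside the tool, this layout is straightforward to walk with a small script. The snippet below is a sketch that mocks the directory tree and then enumerates every evaluated sample; the temporary directory and the rep/sample counts are illustrative, not produced by hpvm-dse itself:

```shell
#!/bin/sh
# Mock the layout hpvm-dse produces (here: 2 reps x 3 samples)...
outdir=$(mktemp -d)
for r in 0 1; do
  for s in 0 1 2; do
    mkdir -p "$outdir/Rep.$r/Sample.$s"
  done
done
# ...then enumerate every evaluated sample, one path per line.
find "$outdir" -mindepth 2 -maxdepth 2 -type d | sort
```

The same `find` pattern works on a real output directory once a DSE run has finished.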
Rep.i contains the output for the i'th parallel DSE run, as specified by --reps. Sample.j contains the output for the j'th sample evaluated in each run. By default, each repetition runs in parallel.
As each DSE sample is evaluated, the tool produces an output directory containing the build artifacts of the sample. This includes:
LLVM host code (usually the .host.ll file)
LLVM kernel code
OpenCL kernel code
If --synth is specified, hpvm-dse will invoke AOC to perform full synthesis on the best samples from each repetition. This generates a .aocx design file within the sample directory, and the associated .host.ll file will be hard-coded to load the design from the design's path.
The easiest way to use the design is to first compile the original program with
HPVM’s FPGA backend, then copy the optimized host code to your build directory.
Building again will link with the new host code file and produce an executable
that will use the design. Using a custom host executable is necessary since
inter-kernel optimizations like Node Fusion may change the design’s OpenCL
interface by changing what kernels are present and the order in which they need
to be executed.
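As a concrete sketch of that workflow, the commands below copy an optimized host code file into a build directory before rebuilding. All paths and file names are hypothetical, and the DSE output and FPGA-backend build are mocked here so the sketch runs standalone:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# -- mock setup: stand-ins for the DSE output and the FPGA-backend build --
mkdir -p dse_out/Rep.0/Sample.3 build
echo "; optimized host IR (loads the synthesized .aocx by its path)" \
    > dse_out/Rep.0/Sample.3/app.host.ll
echo "; default host IR from the FPGA backend" > build/app.host.ll
# -- the workflow itself --------------------------------------------------
# 1. Compile the original program with HPVM's FPGA backend (mocked above).
# 2. Copy the optimized host code over the default one:
cp dse_out/Rep.0/Sample.3/app.host.ll build/app.host.ll
# 3. Rebuild: linking now picks up the new host code, so the resulting
#    executable loads the synthesized design.
grep -c "optimized host IR" build/app.host.ll
```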
-o=<dir> - Set the top-level output directory.
--synth - Turn on synthesis of the best reported points.
--emu - Synthesize and compile host code to work with device emulation, instead of full synthesis.
-b=<board> - Specify the board; defaults to A10GX.
-csv - Generate a CSV file in each repetition directory, holding a table describing each evaluated sample's parameters and objective values.
--always-overwrite - Delete an existing output directory without prompting.
--dse-iter=<num> - How many samples to evaluate during the optimization phase.
--doe-multiplier=<N> - By default, the number of samples in the design of experiment phase is the number of extracted parameters + 1. This option specifies a custom DoE sample count as a multiple of the parameter count.
--reps=<n> - How many full DSE runs to perform.
-j=<n> - How many jobs to use for running repetitions in parallel. Defaults to the number of reps.
--rnd - Use uniform random sampling for DoE.
--slh - Use Standard Latin Hypercube sampling for DoE.
--timeout=<t> - The maximum number of seconds an evaluation may take before being killed.
--util-threshold=<n> - The maximum resource utilization, in percent, before rejecting a sample as invalid. Defaults to 95%.
--tlp - Enables the corresponding optimization. By default, all optimizations are enabled, but specifying any of these flags makes all optimizations opt-in.
--max-dim-uf=<n> - The maximum unroll factor for HPVM node replication factors.
--max-dim-uf-options=<n> - The maximum number of possible values to try for each unroll factor.
--max-uf-options - Similar to the above, but for loops within leaf node function bodies.
Together, the maximum-unroll-factor and option-count flags limit the number of possible factors to the lower of the two bounds. For example, if a maximum unroll factor of 20 and at most 5 options are set, and a loop runs for 100 iterations, the first limit restricts the possible unroll factors to 1, 2, 4, 5, 10, and 20. The second restricts this to the first 5 factors, so 20 would not be included in the range of unroll factors.
These options are useful for pruning design points that are unlikely to fit in the resource budget, or for speeding up convergence by reducing the search space.
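For example, an invocation that prunes the unroll-factor space and tightens the utilization bound might look like the following; the specific flag values here are illustrative, not recommendations:

```shell
hpvm-dse input.ll -o=outdir --reps=5 --dse-iter=100 --timeout=300 \
    --max-dim-uf=16 --max-uf-options=5 --util-threshold=90
```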
By default, hpvm-dse uses a built-in execution time model as the optimization metric for the output program, but it also allows the user to specify a custom scoring script via a command-line flag. The flag's value should be a script path such that executing the following command:
./path/to/script.sh <SampleDir> <Rep>
prints out the custom metric value as a floating point number.
SampleDir contains the host and kernel LLVM IR files generated by the backend passes for the selected targets, and Rep is the ID of the DSE run the sample was generated for, in case multiple runs are being performed in parallel.
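A minimal skeleton satisfying this contract might look like the following sketch; the function name and the constant metric are placeholders, and a real script would build and time the program instead of echoing a fixed value:

```shell
#!/bin/sh
# Skeleton of a custom scoring script. hpvm-dse would run the script file as:
#   ./path/to/script.sh <SampleDir> <Rep>
score() {
  sampledir=$1  # directory holding this sample's host/kernel IR files
  rep=$2        # DSE-run ID; tag build artifacts with it so parallel
                # repetitions do not clobber each other's files
  # A real script would rebuild and run the program using $sampledir here.
  echo "1.23"   # placeholder: print exactly one floating-point number
}
score outdir/Rep.0/Sample.0 0
```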
This is a possible example script for evaluating empirical execution time for the HPVM-CAVA benchmark on the GPU:
sampledir=$1
rep=$2
cd /path/to/benchmarks/hpvm-cava/
TARGET=gpu VERSION=_EVAL$rep make > /dev/null
cp $sampledir/* build/gpu_EVAL$rep > /dev/null
TARGET=gpu VERSION=_EVAL$rep make > /dev/null
./cava-hpvm-gpu_EVAL$rep example-tulips/raw_tulips.bin example-tulips/eval$rep | grep "Kernel Execution" | grep -oE '[0-9.]+'
The script builds the program, copies the build files from the sample directory, and builds again to link against the optimized kernels.
and extracts the running time from the output.
Rep is used to separate the
build files from other DSE runs that may be executing at the same time.