Writing a Hetero-C++ Program and Compiling it with HPVM

We will write a simple Matrix Multiplication kernel to illustrate how to compile a program through HPVM to target the CPU, GPU, or FPGA. The implementation will be written in Hetero-C++, a parallel dialect of C/C++ that describes hierarchical Task-level and Data-level parallelism and compiles through HPVM.

Writing a Program in Hetero-C++

We start with a scalar implementation of Matrix Multiplication written in C++.

#include <heterocc.h>

void matmul(int *Res, std::size_t Res_Size, int *V1, std::size_t V1_Size,
            int *V2, std::size_t V2_Size, std::size_t left_dim,
            std::size_t right_dim, std::size_t common_dim) {
  for (int i = 0; i < left_dim; i++) {
    for (int j = 0; j < right_dim; j++) {
      Res[i * right_dim + j] = 0;
      for (int k = 0; k < common_dim; k++) {
        // Res[i,j] += V1[i,k] * V2[k,j]
        Res[i * right_dim + j] +=
            V1[i * common_dim + k] * V2[k * right_dim + j];
      }
    }
  }
}

This implementation is a three-level loop nest: the outer two loops iterate over the index variables i and j, and the innermost loop iterates over the index variable k. The iterations over i and j are independent of each other and hence can be executed in parallel. The innermost loop, however, performs a reduction and cannot be parallelized with HPVM. We can describe this information in Hetero-C++ with the __hetero_parallel_loop marker function, as shown below.

#include <heterocc.h>

void matmul(int *Res, std::size_t Res_Size, int *V1, std::size_t V1_Size,
            int *V2, std::size_t V2_Size, std::size_t left_dim,
            std::size_t right_dim, std::size_t common_dim) {
  for (int i = 0; i < left_dim; i++) {
    for (int j = 0; j < right_dim; j++) {
      __hetero_parallel_loop(
          /* Num Parallel Enclosing Loops */ 2,
          /* Num Input Pairs */ 6, Res, Res_Size, V1, V1_Size, V2, V2_Size,
          left_dim, right_dim, common_dim,
          /* Num Output Pairs */ 1, Res, Res_Size,
          /* Optional Node Name */ "matmul_parallel_loop");

      Res[i * right_dim + j] = 0;

      for (int k = 0; k < common_dim; k++) {
        // Res[i,j] += V1[i,k] * V2[k,j]
        Res[i * right_dim + j] +=
            V1[i * common_dim + k] * V2[k * right_dim + j];
      }
    }
  }
}

The marker specifies that the two enclosing loops over i and j are parallel with respect to each other. Additionally, it lists the inputs and outputs of the loop body that will be parallelized.

To complete the specification of the program, we add marker calls to __hetero_section_begin/end to denote that the region of code contains at least one HPVM computational node. We additionally wrap the loop nest in __hetero_task_begin/end markers to nest the computation inside a node; this is needed to enable compilation to the GPU.

The complete specification of the matmul function is shown below:

#include <heterocc.h>

void matmul(int *Res, std::size_t Res_Size, int *V1, std::size_t V1_Size,
            int *V2, std::size_t V2_Size, std::size_t left_dim,
            std::size_t right_dim, std::size_t common_dim) {
  void *Section = __hetero_section_begin();

  void *wrapper = __hetero_task_begin(
      /* Num Input Pairs */ 6, Res, Res_Size, V1, V1_Size, V2, V2_Size,
      left_dim, right_dim, common_dim,
      /* Num Output Pairs */ 1, Res, Res_Size,
      /* Optional Node Name */ "matmul_parallel_loop_wrapper");

  void *_Section = __hetero_section_begin();

  for (int i = 0; i < left_dim; i++) {
    for (int j = 0; j < right_dim; j++) {
      __hetero_parallel_loop(
          /* Num Parallel Enclosing Loops */ 2,
          /* Num Input Pairs */ 6, Res, Res_Size, V1, V1_Size, V2, V2_Size,
          left_dim, right_dim, common_dim,
          /* Num Output Pairs */ 1, Res, Res_Size,
          /* Optional Node Name */ "matmul_parallel_loop");

      __hetero_hint(/* TARGET DEVICE */ DEVICE);

      Res[i * right_dim + j] = 0;

      for (int k = 0; k < common_dim; k++) {
        // Res[i,j] += V1[i,k] * V2[k,j]
        Res[i * right_dim + j] +=
            V1[i * common_dim + k] * V2[k * right_dim + j];
      }
    }
  }
  __hetero_section_end(_Section);
  __hetero_task_end(wrapper);

  __hetero_section_end(Section);
}

The above description defines the HPVM Dataflow Graph (DFG) which will be compiled through HPVM to target heterogeneous compute devices. DEVICE is a macro which can be defined as either CPU_TARGET, GPU_TARGET, or FPGA_TARGET during compilation. To actually invoke the DFG with specific arguments, we write the following host code:

printf("Launching matrix multiply dataflow graph!\n");
void *MatMulDFG = __hetero_launch(
    /* Root Function */
    (void *)matmul,
    /* Num Input Pairs */ 6, C, C_Size, A, A_Size, B, B_Size, left_dim,
    right_dim, common_dim,
    /* Num Output Pairs */ 1, C, C_Size);

// Blocking call which waits for the
// execution of MatMulDFG to complete.
__hetero_wait(MatMulDFG);
printf("DFG finished executing!\n");

The __hetero_launch call is compiled into host code that prepares the arguments to be passed into the HPVM DFG, along with HPVM Runtime calls for managing memory between compute devices.
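
For reference, a complete host driver for this example might look like the sketch below. The dimensions, the initial values, and the assumption that each pointer is paired with its buffer size in bytes are illustrative; adapt them to your application.

#include <heterocc.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// matmul is the Hetero-C++ root function defined above.
void matmul(int *Res, std::size_t Res_Size, int *V1, std::size_t V1_Size,
            int *V2, std::size_t V2_Size, std::size_t left_dim,
            std::size_t right_dim, std::size_t common_dim);

int main() {
  // Illustrative dimensions: C (left_dim x right_dim) =
  // A (left_dim x common_dim) * B (common_dim x right_dim).
  std::size_t left_dim = 64, right_dim = 64, common_dim = 64;

  // Assumption: the size paired with each pointer is the buffer size in bytes.
  std::size_t A_Size = left_dim * common_dim * sizeof(int);
  std::size_t B_Size = common_dim * right_dim * sizeof(int);
  std::size_t C_Size = left_dim * right_dim * sizeof(int);

  int *A = (int *)malloc(A_Size);
  int *B = (int *)malloc(B_Size);
  int *C = (int *)malloc(C_Size);

  // Fill the inputs with simple values so the result is easy to check.
  for (std::size_t i = 0; i < left_dim * common_dim; i++)
    A[i] = 1;
  for (std::size_t i = 0; i < common_dim * right_dim; i++)
    B[i] = 2;

  printf("Launching matrix multiply dataflow graph!\n");
  void *MatMulDFG = __hetero_launch(
      /* Root Function */
      (void *)matmul,
      /* Num Input Pairs */ 6, C, C_Size, A, A_Size, B, B_Size, left_dim,
      right_dim, common_dim,
      /* Num Output Pairs */ 1, C, C_Size);

  // Blocking call which waits for the execution of MatMulDFG to complete.
  __hetero_wait(MatMulDFG);
  printf("DFG finished executing!\n");

  // Every element of C should now equal 2 * common_dim.
  printf("C[0] = %d\n", C[0]);

  free(A);
  free(B);
  free(C);
  return 0;
}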

Compiling the Program from Hetero-C++ to HPVM-C

To compile the above program, we first generate LLVM IR for the C++ file using the following command:

$LLVM_BUILD_DIR/bin/clang -fno-discard-value-names -DDEVICE={CPU_TARGET,GPU_TARGET,FPGA_TARGET} -O1 -S -emit-llvm -I../include/ src/matmul.cc -o src/matmul.ll

Then, we run the generated IR file through the Hetero-C++ frontend, which converts the input program into the HPVM-C representation that can be compiled through HPVM. HPVM-C is the low-level representation for describing HPVM programs and maps directly to HPVM intrinsics.

export HPVM_DECLS_FILE=$LLVM_BUILD_DIR/tools/hpvm/projects/hetero-c++/lib/HPVMCFunctionDeclarations/HPVMCFunctionDeclarations.bc
$LLVM_BUILD_DIR/bin/hcc -declsfile $HPVM_DECLS_FILE -dot-dfg -o src/matmul-extract.bc src/matmul.ll

The above commands generate two files: src/matmul-extract.bc, which contains the HPVM Dataflow Graph program in HPVM-C that will be compiled through HPVM, and, because of the -dot-dfg flag, a dot file which can be exported to a PNG or PDF to visualize the Dataflow Graph.
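
The dot file can be rendered with Graphviz. For example, if the dumped file is named matmul.dfg.dot (the exact filename depends on the frontend output; this one is illustrative), the following command produces a PDF of the graph:

dot -Tpdf matmul.dfg.dot -o matmul.dfg.pdf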

Compiling the HPVM-C program through HPVM using hpvm-clang

Compiling to CPU

To compile the matrix multiplication program to the CPU we run the following command:

hpvm-clang -DDEVICE=CPU_TARGET --hetero-cc --hpvm-target cpu src/matmul.cc  src/matmul.cpu

The above command runs the Hetero-C++ frontend and the HPVM backend transformations to generate the executable src/matmul.cpu. The -DDEVICE preprocessor definition sets the __hetero_hint argument so that the node is compiled for the CPU.
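
Assuming the host driver takes no command-line arguments (as in the sketch shown earlier), the resulting binary can then be run directly and should print the launch and completion messages from the host code:

./src/matmul.cpu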

Compiling to GPU

To compile the matrix multiplication program to the GPU we run the following command:

hpvm-clang -DDEVICE=GPU_TARGET --hetero-cc --hpvm-target gpu src/matmul.cc  src/matmul.gpu

Compiling to FPGA

To compile the matrix multiplication program to the FPGA we run the following command:

hpvm-clang -DDEVICE=FPGA_TARGET --hetero-cc --hpvm-target fpga src/matmul.cc  src/matmul.fpga

The above command runs the Hetero-C++ frontend and the HPVM backend transformations to generate the executable src/matmul.fpga, along with the OpenCL kernel which will execute on the FPGA.

Compiling the HPVM-C program through HPVM

This section details how to run the individual passes separately for each stage of the specific HPVM targets. We provide this as a reference for those who are interested.

Compiling to CPU

To compile the matrix multiplication program to the CPU we run the following commands:

$LLVM_BUILD_DIR/bin/opt -enable-new-pm=0 -load $LLVM_BUILD_DIR/lib/HPVMIRCodeGen.so -load $LLVM_BUILD_DIR/lib/HPVMTargets.so -genhpvm -globaldce -hpvm-timers-gen   -dfg2llvm-cpu -clearDFG -hpvm-timers-cpu -hpvm-timers-ptx -S src/matmul-extract.bc -o src/matmul-extract.hpvm.host.ll
$LLVM_BUILD_DIR/bin/llvm-link src/matmul-extract.hpvm.host.ll $LLVM_BUILD_DIR/tools/hpvm/projects/hpvm-rt/hpvm-rt.bc -S -o matmul-extract.linked.ll
$LLVM_BUILD_DIR/bin/clang++ -O3  -lm -lpthread -lrt -lOpenCL -L$CUDA_TOOLKIT_PATH/lib64 matmul-extract.linked.ll -o matmul_seq

Compiling to GPU

To compile the matrix multiplication program to the GPU we run the following commands:

$LLVM_BUILD_DIR/bin/opt -enable-new-pm=0 -load $LLVM_BUILD_DIR/lib/HPVMIRCodeGen.so -load $LLVM_BUILD_DIR/lib/HPVMTargets.so -genhpvm -globaldce -hpvm-timers-gen  -dfg2llvm-gpu-ocl -dfg2llvm-cpu -clearDFG -hpvm-timers-cpu -hpvm-timers-ptx -S src/matmul-extract.bc -o src/matmul-extract.hpvm.host.ll
$LLVM_BUILD_DIR/bin/llvm-link src/matmul-extract.hpvm.host.ll $LLVM_BUILD_DIR/tools/hpvm/projects/hpvm-rt/hpvm-rt.bc -S -o matmul-extract.linked.ll
$LLVM_BUILD_DIR/bin/clang++ -O3  -lm -lpthread -lrt -lOpenCL -L$CUDA_TOOLKIT_PATH/lib64 matmul-extract.linked.ll -o matmul_gpu
$LLVM_BUILD_DIR/bin/llvm-ocl  src/matmul-extract.kernels.ll -o src/matmul-extract.kernels.cl

Compiling to FPGA

To compile the matrix multiplication program to the FPGA we run the following commands:

$LLVM_BUILD_DIR/bin/hpvm2fpga -emu src/matmul-extract.bc