Writing a Hetero-C++ Program

The following document describes various programming patterns supported by the Hetero-C++ API. Please refer to the Hetero-C++ Language Specification for a detailed description of the individual API calls.

A Hetero-C++ program can be divided into two parts: Dataflow Code, which describes the structure of the hierarchical dataflow graph, and Host Code, which describes the actual inputs passed to the Dataflow Graph (DFG) and the outputs requested after its execution completes.

Launching a Dataflow Graph

The HPVM framework describes parallelism via a hierarchical dataflow graph. The entry node to this graph is referred to as the Root Node. Consider the function MyRoot below, which we want to be the root node of our Dataflow Graph.

// Takes the memory pointed to by P of size Psz bytes
// and transforms its content.
void MyRoot(float* P, size_t Psz){
    // HPVM Dataflow Graph ...
}

To launch the Dataflow Graph described by MyRoot, the corresponding host code should be written as follows.

{
 // ... Previous code describing initialization
 // of (float* F, size_t F_size)

 void* dfg = __hetero_launch((void*) MyRoot,
    /* Number of Input Buffer Pairs */ 1, F, F_size,
    /* Number of Output Buffer Pairs */ 1, F, F_size);

  __hetero_wait(dfg); // Wait for the Dataflow Graph to complete execution


 // ... Code using float* F, whose content has been modified by MyRoot's DFG.


}

The code snippet above binds the actual parameters (float* F, size_t F_size) to the formal parameters of MyRoot, (float* P, size_t Psz), respectively. Since we want to copy the state of the transformed array back to the host code, we specify F and its corresponding size in bytes, F_size, as an output buffer pair.
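For completeness, a minimal end-to-end host-side sketch is shown below. The buffer size of 1024 floats and the use of malloc/free are illustrative assumptions rather than requirements of the API, and the Hetero-C++ header providing the marker declarations is omitted.

#include <cstdlib>

int main(){
    size_t F_size = 1024 * sizeof(float); // Illustrative buffer size in bytes
    float* F = (float*) malloc(F_size);   // Any host-side allocation works here

    // ... Initialize the contents of F ...

    void* dfg = __hetero_launch((void*) MyRoot,
        /* Number of Input Buffer Pairs */ 1, F, F_size,
        /* Number of Output Buffer Pairs */ 1, F, F_size);

    __hetero_wait(dfg); // F now holds the data transformed by MyRoot's DFG

    // ... Use the transformed contents of F ...

    free(F);
    return 0;
}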

Inlined Launching of a Dataflow Graph

Often in practice, when part of a larger function is to serve as the launching point for an HPVM Dataflow Graph, users have had to manually outline that region of code into its own function in order to use the __hetero_launch call described above. For such cases, we introduced an additional launching marker, __hetero_launch_begin, which lets users simply delimit the region of code they want to be extracted; the Hetero-C++ API will automatically outline the code accordingly and insert the __hetero_launch call with the appropriate arguments.

{
 // ... Previous code describing initialization
 // of (float* F, size_t F_size)

 void* launch = __hetero_launch_begin(
    /* Number of Input Buffer Pairs */ 1, F, F_size,
    /* Number of Output Buffer Pairs */ 1, F, F_size);

  // HPVM Dataflow graph description

  __hetero_launch_end(launch); // End of the code region to be extracted into the launch


 // ... Code using float* F, whose content has been modified by the extracted DFG.


}
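As a concrete (hedged) sketch of this pattern, the extracted region below contains a single Parallel Task that doubles each element of F; the doubling computation is purely illustrative.

{
 // ... Previous code describing initialization
 // of (float* F, size_t F_size)

 void* launch = __hetero_launch_begin(
    /* Number of Input Buffer Pairs */ 1, F, F_size,
    /* Number of Output Buffer Pairs */ 1, F, F_size);

 void* Section = __hetero_section_begin();

 void* T1 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 1, F, F_size,
    /* Number of Output Buffer Pairs */ 1, F, F_size);

 // Illustrative transformation of the buffer contents
 for(size_t i = 0; i < F_size / sizeof(float); ++i){
     F[i] = F[i] * 2.0f;
 }

 __hetero_task_end(T1);

 __hetero_section_end(Section);

 __hetero_launch_end(launch);

 // ... Code using float* F, whose content has been modified by the extracted DFG.
}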

Parallel Section, Tasks and Loops

Hetero-C++ enables specifying coarse-grained, task-level parallelism as well as fine-grained data parallelism via the corresponding API calls.

For example, consider the following vector addition computation written at the source level.

void func(int* A, size_t ASz, int* B, size_t BSz)
{
    int loop_size = ASz / sizeof(int);

    for(int i = 0; i < loop_size; ++i){
        // Example computation using Pointer A and B
        A[i] = A[i] + B[i];

    }
}

If we want to consider this entire loop execution as a single computational task, we can describe it in Hetero-C++ as:

void func(int* A, size_t ASz, int* B, size_t BSz)
{

    // __hetero_section_begin marks the start of a DFG computation node. Its return
    // value is passed to __hetero_section_end to mark the end
    // point of the node's computation.
    void* Section = __hetero_section_begin();

    void* T1 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 2, A, ASz, B, BSz,
    /* Number of Output Buffer Pairs */ 2, A, ASz, B, BSz);

    int loop_size = ASz / sizeof(int);

    for(int i = 0; i < loop_size; ++i){
        // Example computation using Pointer A and B
        A[i] = A[i] + B[i];

    }

    __hetero_task_end(T1);


    __hetero_section_end(Section);

}

The entire loop computation will be treated as a single task, and iterations will execute sequentially one after another.

However, consider the case where the user knows that the iterations of the loop are independent of each other, and as such the loop can be treated as data-parallel. We can easily represent this as:

void func(int* A, size_t ASz, int* B, size_t BSz)
{
    void* Section = __hetero_section_begin();

    int loop_size = ASz / sizeof(int);

    for(int i = 0; i < loop_size; ++i){
        __hetero_parallel_loop(
        /* Number of parallel enclosing loops */ 1,
        /* Number of Input Buffer Pairs */ 2, A, ASz, B, BSz,
        /* Number of Output Buffer Pairs */ 2, A, ASz, B, BSz);

        // Example computation using Pointer A and B
        A[i] = A[i] + B[i];

    }

    __hetero_section_end(Section);

}

In the above example, the loop computation will be converted into loop_size many dynamic instances, enabling each loop iteration to execute independently of the others.

Hence, users can choose the granularity of their parallelism depending on their program’s semantics.
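For instance, the first argument of __hetero_parallel_loop can mark more than one enclosing loop as data-parallel. The sketch below (the function scale2D, the matrix M and its dimensions Rows and Cols are illustrative assumptions) treats every (i, j) iteration of a 2-D traversal as an independent dynamic instance:

void scale2D(int* M, size_t MSz, int Rows, int Cols)
{
    void* Section = __hetero_section_begin();

    for(int i = 0; i < Rows; ++i){
        for(int j = 0; j < Cols; ++j){
            __hetero_parallel_loop(
            /* Number of parallel enclosing loops */ 2,
            /* Number of Input Buffer Pairs */ 1, M, MSz,
            /* Number of Output Buffer Pairs */ 1, M, MSz);

            // Each (i, j) pair becomes its own dynamic instance
            M[i * Cols + j] = M[i * Cols + j] * 2;
        }
    }

    __hetero_section_end(Section);
}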

Inter-Task Dependencies

If we have multiple tasks, we would like to be able to describe which subsets of these tasks are related via dataflow dependencies. For example, consider:

void Bar(int* P1, size_t P1Sz, float* P2, size_t P2Sz)
{
    void* Section = __hetero_section_begin();

    void* T1 = __hetero_task_begin(...);

    // Mutates memory pointed to by P1

    __hetero_task_end(T1);

    void* T2 = __hetero_task_begin(...);

    // Mutates memory pointed to by P2

    __hetero_task_end(T2);

    void* T3 = __hetero_task_begin(...);

    // Mutates memory pointed to by P1 and by P2

    __hetero_task_end(T3);


    __hetero_section_end(Section);

}

In the above code snippet, both Tasks T1 and T3 operate on the memory pointed to by P1, hence we would like to specify that T3 operates on the output of T1. Similarly for T2 and T3 with the memory pointed to by P2. However, T1 and T2 operate on different data and hence can be executed in parallel.

These task dependencies are inferred implicitly from the buffer pairs passed to the __hetero_task_begin markers. The completed code snippet is displayed below:

void Bar(int* P1, size_t P1Sz, float* P2, size_t P2Sz)
{
    void* Section = __hetero_section_begin();

    void* T1 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 1, P1, P1Sz,
                                /* Number of Output Buffer Pairs */ 1, P1, P1Sz);

    // Mutates memory pointed to by P1

    __hetero_task_end(T1);

    void* T2 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 1, P2, P2Sz,
                                /* Number of Output Buffer Pairs */ 1, P2, P2Sz);

    // Mutates memory pointed to by P2

    __hetero_task_end(T2);

    void* T3 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
                                /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);

    // Mutates memory pointed to by P1 and by P2

    __hetero_task_end(T3);


    __hetero_section_end(Section);
}

Hetero-C++ infers task dependencies by looking at the input and output pointers of each task. For T1 and T2, the input pointers are different, so the two tasks are inferred to be parallel to each other. Each task mutates its array contents and then declares that it wrote to the corresponding pointer by listing it as an output pointer. The inputs of T3 are the same as the outputs of T1 and T2. As such, T3 is inferred to be sequentially dependent on both tasks and will begin executing once both T1 and T2 finish executing.

The Parallel Task input and output semantics are as follows:

  1. If Task T contains the buffer pair (P, PSz) as one of its inputs and no Task prior to T in the same Parallel Section contains (P, PSz) as one of its input/output buffer pairs, then T will be bound to the state of (P, PSz) at the entry of the parent Function.

  2. If Task T contains the buffer pair (P, PSz) as one of its inputs and there exists a Task T0 prior to T in the same Parallel Section (closest to T) which contains (P, PSz) as one of its input/output buffer pairs, then T will be sequentially dependent on T0 for (P, PSz). Note that this dependency is expressed per buffer pair.

  3. If Task T contains the buffer pair (P, PSz) as one of its outputs and no Task after T in the same Parallel Section contains (P, PSz) as one of its input/output buffer pairs, then the state of (P, PSz) produced by T will be returned at the exit of the parent Function.

  4. If Task T contains the buffer pair (P, PSz) as one of its outputs and there exists a Task T1 after T in the same Parallel Section (closest to T) which contains (P, PSz) as one of its input/output buffer pairs, then T1 will be sequentially dependent on T for (P, PSz). Note that this dependency is expressed per buffer pair.

By specifying the inputs and outputs appropriately, any complex pattern of dependencies can be inferred implicitly.

A similar pattern of inputs and outputs can also be used for data-parallel tasks (expressed via loops):

void Bar(int* P1, size_t P1Sz, float* P2, size_t P2Sz)
{
    void* Section = __hetero_section_begin();

    int P1_limits = P1Sz / sizeof(int);
    int P2_limits = P2Sz / sizeof(float);

    // Loop L1
    for(int i = 0; i < P1_limits; ++i){
        __hetero_parallel_loop(
        /* Number of parallel enclosing loops */ 1,
        /* Number of Input Buffer Pairs */ 1, P1, P1Sz,
        /* Number of Output Buffer Pairs */ 1, P1, P1Sz
        );

        // Mutates memory pointed to by P1

    }

    // Loop L2
    for(int j = 0; j < P2_limits; ++j){
        __hetero_parallel_loop(
        /* Number of parallel enclosing loops */ 1,
        /* Number of Input Buffer Pairs */ 1, P2, P2Sz,
        /* Number of Output Buffer Pairs */ 1, P2, P2Sz
        );

        // Mutates memory pointed to by P2

    }

    // Loop L3
    for(int j = 0; j < P2_limits; ++j){
        __hetero_parallel_loop(
        /* Number of parallel enclosing loops */ 1,
        /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
        /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz
        );

        // Mutates memory pointed to by P1 and by P2
    }

    __hetero_section_end(Section);
}

As with the Task examples, L1 and L2 can be executed independently of each other, while L3 will only begin executing once L1 and L2 finish executing.

Nesting Computation

The hierarchical dataflow computation in HPVM can be expressed in two different ways. Users can choose to break a Parallel Task into multiple subtasks in the same source-level function, which can contain complex and nested interdependencies as described above.

As an alternative mode of expression, users can use the Hetero-C++ API to describe Parallel Sections across multiple functions and call those functions, essentially connecting the end-point nodes of their respective DFGs and creating hierarchy.

Both methodologies are described in the following subsections:

Within the Same Function

Consider the following example of two Parallel Tasks, T1 and T2. T1 computes the element-wise products of 4 vectors (pairwise: P1 with P4, and P2 with P3), while T2 sums and accumulates the results of that computation. By looking at the input and output buffer pairs of the respective tasks, T2 is sequentially dependent on T1, and hence will begin executing after T1 finishes its execution.

void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
        float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
    void* Section = __hetero_section_begin();

    void* T1 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
    /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);

    // Element wise product of vectors P1 and P4
    for(int j = 0; j < P1Sz/sizeof(float); j++){
        P1[j] = P1[j] * P4[j];
    }

    // Element wise product of vectors P2 and P3
    for(int i = 0; i < P2Sz/sizeof(float); i++){
        P2[i] = P2[i] * P3[i];
    }


    __hetero_task_end(T1);

    void* T2 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
    /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);

    int dim = P1Sz / sizeof(float);


    // Element wise addition of P1 and P2
    for(int k = 0; k < dim; k++){
        P1[k] = P1[k] + P2[k];
    }

    __hetero_task_end(T2);

    __hetero_section_end(Section);
}

The computation inside Parallel Task T1 can be further divided into two independent sub-computations: namely, the element-wise product of P1 and P4 as the first sub-computation and the element-wise product of P2 and P3 as the other. To express this nesting structure, we simply place another __hetero_section_begin / __hetero_section_end marker pair inside the Parallel Task boundaries of T1 and, as before, declare two new Parallel Tasks inside it.

void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
        float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
    void* Section = __hetero_section_begin();

    void* T1 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
    /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
    {

        void* SubSection = __hetero_section_begin();

        void* T1_1 = __hetero_task_begin(
        /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P4, P4Sz,
        /* Number of Output Buffer Pairs */ 1, P1, P1Sz);


        // Element wise product of vectors P1 and P4
        for(int j = 0; j < P1Sz/sizeof(float); j++){
            P1[j] = P1[j] * P4[j];
        }

        __hetero_task_end(T1_1);

        void* T1_2 = __hetero_task_begin(
        /* Number of Input Buffer Pairs */ 2, P2, P2Sz, P3, P3Sz,
        /* Number of Output Buffer Pairs */ 1, P2, P2Sz);



        // Element wise product of vectors P2 and P3
        for(int i = 0; i < P2Sz/sizeof(float); i++){
            P2[i] = P2[i] * P3[i];
        }

        __hetero_task_end(T1_2);

        __hetero_section_end(SubSection);
    }
    __hetero_task_end(T1);

    void* T2 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
    /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);

    int dim = P1Sz / sizeof(float);


    // Element wise addition of P1 and P2
    for(int k = 0; k < dim; k++){
        P1[k] = P1[k] + P2[k];
    }

    __hetero_task_end(T2);

    __hetero_section_end(Section);
}

Sub-computations can be repeatedly nested using this approach. Note that the inputs of Task T1 are the union of the input buffer pairs used inside its sub-computations, i.e. P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz. Similarly, the outputs of Task T1 are the union of the output buffer pairs of its sub-computations, i.e. P1, P1Sz, P2, P2Sz.

Across Functions

For larger programs, it is often not feasible to nest the computation inside a single function, as it becomes difficult to read. For such cases, Hetero-C++ allows describing the HPVM Dataflow Graph across multiple functions. Consider the MAC program we created in the previous code snippet:

void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
        float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
    void* Section = __hetero_section_begin();

    void* T1 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
    /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
    {

        void* SubSection = __hetero_section_begin();

        void* T1_1 = __hetero_task_begin(
        /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P4, P4Sz,
        /* Number of Output Buffer Pairs */ 1, P1, P1Sz);


        // Element wise product of vectors P1 and P4
        for(int j = 0; j < P1Sz/sizeof(float); j++){
            P1[j] = P1[j] * P4[j];
        }

        __hetero_task_end(T1_1);

        void* T1_2 = __hetero_task_begin(
        /* Number of Input Buffer Pairs */ 2, P2, P2Sz, P3, P3Sz,
        /* Number of Output Buffer Pairs */ 1, P2, P2Sz);



        // Element wise product of vectors P2 and P3
        for(int i = 0; i < P2Sz/sizeof(float); i++){
            P2[i] = P2[i] * P3[i];
        }

        __hetero_task_end(T1_2);

        __hetero_section_end(SubSection);
    }
    __hetero_task_end(T1);

    void* T2 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
    /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);

    int dim = P1Sz / sizeof(float);


    // Element wise addition of P1 and P2
    for(int k = 0; k < dim; k++){
        P1[k] = P1[k] + P2[k];
    }

    __hetero_task_end(T2);

    __hetero_section_end(Section);
}

We can refactor the computation of Task T1 into another function. Then we simply call the newly created function with the appropriate arguments as follows:

// The sub-computation of T1 in MAC extracted into a separate function
void Prod(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
float* P3, size_t P3Sz, float* P4, size_t P4Sz){
    void* SubSection = __hetero_section_begin();

    void* T1_1 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P4, P4Sz,
    /* Number of Output Buffer Pairs */ 1, P1, P1Sz);


    // Element wise product of vectors P1 and P4
    for(int j = 0; j < P1Sz/sizeof(float); j++){
        P1[j] = P1[j] * P4[j];
    }

    __hetero_task_end(T1_1);

    void* T1_2 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 2, P2, P2Sz, P3, P3Sz,
    /* Number of Output Buffer Pairs */ 1, P2, P2Sz);



    // Element wise product of vectors P2 and P3
    for(int i = 0; i < P2Sz/sizeof(float); i++){
        P2[i] = P2[i] * P3[i];
    }

    __hetero_task_end(T1_2);

    __hetero_section_end(SubSection);
}



void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
        float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
    void* Section = __hetero_section_begin();

    void* T1 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
    /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);

    // Calling the HPVM Dataflow function Prod
    // with the appropriate arguments. Hetero-C++
    // will connect the end-points of both
    // Dataflow Graphs and create hierarchy.
    Prod(P1, P1Sz, P2, P2Sz,
        P3, P3Sz, P4, P4Sz);

    __hetero_task_end(T1);

    void* T2 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
    /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);

    int dim = P1Sz / sizeof(float);


    // Element wise addition of P1 and P2
    for(int k = 0; k < dim; k++){
        P1[k] = P1[k] + P2[k];
    }

    __hetero_task_end(T2);

    __hetero_section_end(Section);
}

Note the call to Prod inside Task T1. Hetero-C++ will recognize that Prod also describes a Dataflow Graph and will connect the nodes of MAC and Prod. If a non-HPVM Dataflow Graph function is called inside a Task, it will be considered part of the same node.
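For instance, in the sketch below (the helper scale_in_place and the wrapper Node are illustrative assumptions), the called function contains no Hetero-C++ markers, so its body simply executes as part of Task T1's node instead of forming a child graph:

// A plain C++ helper with no Hetero-C++ markers
void scale_in_place(float* P, size_t PSz, float factor){
    for(size_t i = 0; i < PSz / sizeof(float); ++i){
        P[i] = P[i] * factor;
    }
}

void Node(float* P1, size_t P1Sz)
{
    void* Section = __hetero_section_begin();

    void* T1 = __hetero_task_begin(
    /* Number of Input Buffer Pairs */ 1, P1, P1Sz,
    /* Number of Output Buffer Pairs */ 1, P1, P1Sz);

    // scale_in_place is not a Dataflow function, so this call
    // runs inside T1's node rather than creating a child graph.
    scale_in_place(P1, P1Sz, 2.0f);

    __hetero_task_end(T1);

    __hetero_section_end(Section);
}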

Limitations

  1. Generally, computations inside Parallel Sections should be avoided. For the Parallel Loop cases, any scalar computation is allowed, including function calls that calculate the expression for the number of loop iterations. However, we currently do not support these calls if any of their arguments is of pointer type.

  2. The __hetero_parallel_loop marker can only be used on loops which are guaranteed to execute at least one iteration.

  3. For the Root Node Function prototype, if any pointer argument is listed, then the size of the memory associated with that pointer must be the immediately following argument in the argument list. This may require re-arranging the Root Function's arguments. This requirement only applies to Dataflow Graph roots (i.e. Functions called from __hetero_launch); for Dataflow Graphs nested across functions, the arguments can be in any order, as illustrated in the sketch after this list.
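As a sketch of the third limitation (the function names below are illustrative), a Root Node prototype must place each pointer's size immediately after that pointer, whereas a Dataflow function that is only nested (never launched directly) is not restricted in this way:

// Valid Root Node prototype: every pointer argument is immediately
// followed by the size (in bytes) of the memory it points to.
void ValidRoot(int* A, size_t ASz, float* B, size_t BSz);

// Invalid as a Root Node prototype: A is not immediately followed by ASz.
// void InvalidRoot(int* A, float* B, size_t ASz, size_t BSz);

// Acceptable for a nested Dataflow function (one never passed to
// __hetero_launch), where the argument order is unrestricted.
void NestedDataflowFunc(int* A, float* B, size_t ASz, size_t BSz);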