# Writing a Hetero-C++ Program

The following document describes various programming patterns supported
by the `Hetero-C++` API. Please refer to the Hetero-C++ Language
Specification for a detailed description of the individual API calls.

A Hetero-C++ program can be divided into two categories: **Dataflow Code**,
which describes the structure of the hierarchical dataflow graph, and
**Host Code**, which describes the actual inputs passed to the
Dataflow Graph (DFG) and the outputs requested after its execution
completes.

## Launching a Dataflow Graph

The HPVM framework describes parallelism via a hierarchical dataflow
graph. The entry node to this graph is referred to as the `Root Node`.
Consider the function `MyRoot` below, which we want as the root node of
our Dataflow graph.

```
// Takes the memory pointed to by P of size Psz bytes
// and transforms its content.
void MyRoot(float* P, size_t Psz){
  // HPVM Dataflow Graph ...
}
```

To launch the Dataflow graph described by `MyRoot`, the corresponding
host code should be created.

```
{
  // ... Previous code describing initialization
  // of (float* F, size_t F_size)
  void* dfg = __hetero_launch((void*) MyRoot,
      /* Number of Input Buffer Pairs */ 1, F, F_size,
      /* Number of Output Buffer Pairs */ 1, F, F_size);
  __hetero_wait(dfg); // Wait for the dataflow graph to complete execution
  // ... Code using float* F, whose content has been modified by MyRoot's DFG.
}
```

The code snippet above binds the actual parameters
`(float* F, size_t F_size)` to the formal parameters of `MyRoot`,
`(float* P, size_t Psz)`, respectively. Since we want to copy the state
of the transformed array back to the host code, we specify `F` and
its corresponding size in bytes, `F_size`, as an output buffer pair.
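
To make the distinction between input and output buffer pairs concrete, below is a minimal, illustrative sketch (the root function `Scale` and the host buffers `In`/`Out` are hypothetical, not part of the example above) in which one buffer is only read and another is copied back to the host:

```
// Hypothetical root node: reads In and writes the result into Out.
void Scale(float* In, size_t InSz, float* Out, size_t OutSz){
  // HPVM Dataflow Graph ...
}

{
  // ... Previous code initializing (float* In, size_t In_size)
  // and (float* Out, size_t Out_size)
  void* dfg = __hetero_launch((void*) Scale,
      /* Number of Input Buffer Pairs */ 2, In, In_size, Out, Out_size,
      /* Number of Output Buffer Pairs */ 1, Out, Out_size);
  __hetero_wait(dfg);
  // Only Out is requested back on the host; In is treated as read-only input.
}
```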

## Inlined Launching of a Dataflow Graph

Often in practice, when part of a larger function is to serve as a
**launching point** for an HPVM Dataflow graph, users have had to
manually outline a particular region of code into its own function to
use the `__hetero_launch` call described above. For such cases, we
introduce an additional launching marker, `__hetero_launch_begin`,
which lets users simply delimit the region of code they want to be
extracted; the `Hetero-C++` API will automatically outline the code
accordingly and insert the `__hetero_launch` call with the appropriate
arguments.

```
{
  // ... Previous code describing initialization
  // of (float* F, size_t F_size)
  void* launch = __hetero_launch_begin(
      /* Number of Input Buffer Pairs */ 1, F, F_size,
      /* Number of Output Buffer Pairs */ 1, F, F_size);
  // HPVM Dataflow graph description
  __hetero_launch_end(launch); // End of the code region to be extracted into a launch
  // ... Code using float* F, whose content has been modified by the extracted DFG.
}
```
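
Conceptually, the region between the markers is outlined into its own root function and the markers are replaced by an explicit launch. A rough sketch of the equivalent code is shown below (the function name `OutlinedRoot` is illustrative only; the actual name is chosen by the compiler):

```
// Illustrative only: roughly what outlining the marked region produces.
void OutlinedRoot(float* F, size_t F_size){
  // HPVM Dataflow graph description (the code between the launch markers)
}

{
  // ... Previous code describing initialization of (float* F, size_t F_size)
  void* dfg = __hetero_launch((void*) OutlinedRoot,
      /* Number of Input Buffer Pairs */ 1, F, F_size,
      /* Number of Output Buffer Pairs */ 1, F, F_size);
  __hetero_wait(dfg);
  // ... Code using float* F
}
```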

## Parallel Sections, Tasks and Loops

Hetero-C++ enables specifying coarse-grained task-level parallelism, as well as fine-grained data parallelism, via corresponding API calls.

For example, consider the following vector addition computation written at the source level.

```
void func(int* A, size_t ASz, int* B, size_t BSz)
{
  int loop_size = ASz / sizeof(int);
  for(int i = 0; i < loop_size; ++i){
    // Example computation using Pointer A and B
    A[i] = A[i] + B[i];
  }
}
```

If we want to consider this entire loop execution as a single
computational task, we can describe it in `Hetero-C++` as:

```
void func(int* A, size_t ASz, int* B, size_t BSz)
{
  // __hetero_section_begin marks the start of a DFG computation node. Its
  // return value is passed to __hetero_section_end at the end of the
  // node's computation.
  void* Section = __hetero_section_begin();
  void* T1 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 2, A, ASz, B, BSz,
      /* Number of Output Buffer Pairs */ 2, A, ASz, B, BSz);
  int loop_size = ASz / sizeof(int);
  for(int i = 0; i < loop_size; ++i){
    // Example computation using Pointer A and B
    A[i] = A[i] + B[i];
  }
  __hetero_task_end(T1);
  __hetero_section_end(Section);
}
```

The entire loop computation will be treated as a single task, and its iterations will execute sequentially, one after another.

However, consider the case where the user knows that the loop iterations are independent of each other and can therefore be treated as data-parallel. We can easily represent this as:

```
void func(int* A, size_t ASz, int* B, size_t BSz)
{
  void* Section = __hetero_section_begin();
  int loop_size = ASz / sizeof(int);
  for(int i = 0; i < loop_size; ++i){
    __hetero_parallel_loop(
        /* Number of parallel enclosing loops */ 1,
        /* Number of Input Buffer Pairs */ 2, A, ASz, B, BSz,
        /* Number of Output Buffer Pairs */ 2, A, ASz, B, BSz);
    // Example computation using Pointer A and B
    A[i] = A[i] + B[i];
  }
  __hetero_section_end(Section);
}
}
```

In the above example, the loop computation will be converted into
`loop_size` many dynamic instances, enabling the loop iterations to
execute independently of each other.

Hence, users can choose the granularity of their parallelism depending on their program’s semantics.

## Inter-Task Dependencies

If we have multiple tasks, we would like to be able to describe which subset of these tasks are related via dataflow dependencies. For example, consider:

```
void Bar(int* P1, size_t P1Sz, float* P2, size_t P2Sz)
{
  void* Section = __hetero_section_begin();
  void* T1 = __hetero_task_begin(...);
  // Mutates memory pointed to by P1
  __hetero_task_end(T1);
  void* T2 = __hetero_task_begin(...);
  // Mutates memory pointed to by P2
  __hetero_task_end(T2);
  void* T3 = __hetero_task_begin(...);
  // Mutates memory pointed to by P1 and by P2
  __hetero_task_end(T3);
  __hetero_section_end(Section);
}
```

In the above code snippet, both Tasks `T1` and `T3` operate on the
memory pointed to by `P1`, hence we would like to specify that `T3`
operates on the output of `T1`. Similarly for `T2` and `T3` with the
memory pointed to by `P2`. However, `T1` and `T2` operate on different
data and hence can be executed in parallel.

These task dependencies are inferred implicitly from the buffer pairs
listed in the `__hetero_task_begin` markers. The completed code snippet
is shown below:

```
void Bar(int* P1, size_t P1Sz, float* P2, size_t P2Sz)
{
  void* Section = __hetero_section_begin();
  void* T1 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 1, P1, P1Sz,
                                 /* Number of Output Buffer Pairs */ 1, P1, P1Sz);
  // Mutates memory pointed to by P1
  __hetero_task_end(T1);
  void* T2 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 1, P2, P2Sz,
                                 /* Number of Output Buffer Pairs */ 1, P2, P2Sz);
  // Mutates memory pointed to by P2
  __hetero_task_end(T2);
  void* T3 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
                                 /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
  // Mutates memory pointed to by P1 and by P2
  __hetero_task_end(T3);
  __hetero_section_end(Section);
}
```

Hetero-C++ infers task dependencies by looking at the input and output
pointers of each task. The input pointers of `T1` and `T2` are
different, hence the two tasks are inferred to be parallel to each
other. Both tasks mutate their respective array contents and then
indicate that they wrote to the pointers by including them as output
pointers. The inputs of `T3` are the same as the outputs of `T1` and
`T2`. As such, `T3` is inferred to be sequentially dependent on both
tasks and will begin executing once **both** `T1` and `T2` finish
executing.

The `Parallel Task` input and output semantics are as follows:

- If Task `T` contains the buffer pair `(P, PSz)` as one of its inputs, and no Task prior to `T` in the same Parallel Section contains `(P, PSz)` as its input/output buffer pair, then `T` will be bound to the state of `(P, PSz)` at the entry to the parent Function.
- If Task `T` contains the buffer pair `(P, PSz)` as one of its inputs, and there exists a Task `T0` prior to `T` in the same Parallel Section (closest to `T`) which contains `(P, PSz)` as its input/output buffer pair, then `T` will be sequentially dependent on `T0` for `(P, PSz)`. Note that this dependency is expressed per buffer pair.
- If Task `T` contains the buffer pair `(P, PSz)` as one of its outputs, and no Task after `T` in the same Parallel Section contains `(P, PSz)` as its input/output buffer pair, then `T` will return the state of `(P, PSz)` at the exit of the parent Function.
- If Task `T` contains the buffer pair `(P, PSz)` as one of its outputs, and there exists a Task `T1` after `T` in the same Parallel Section (closest to `T`) which contains `(P, PSz)` as its input/output buffer pair, then `T1` will be sequentially dependent on `T` for `(P, PSz)`. Note that this dependency is expressed per buffer pair.

By specifying the inputs and outputs appropriately, any complex pattern of dependencies can be inferred implicitly.
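
As an illustrative sketch of these rules (the function `Baz` and its task bodies are hypothetical), a task can depend on a prior task for one buffer pair while being bound to the function-entry state of another:

```
void Baz(int* P1, size_t P1Sz, int* P3, size_t P3Sz)
{
  void* Section = __hetero_section_begin();
  void* TA = __hetero_task_begin(/* Number of Input Buffer Pairs */ 1, P1, P1Sz,
                                 /* Number of Output Buffer Pairs */ 1, P1, P1Sz);
  // Mutates memory pointed to by P1
  __hetero_task_end(TA);
  void* TB = __hetero_task_begin(/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P3, P3Sz,
                                 /* Number of Output Buffer Pairs */ 1, P3, P3Sz);
  // For (P1, P1Sz): TA is the closest prior task using it, so TB depends on TA.
  // For (P3, P3Sz): no prior task uses it, so TB sees its state at entry to Baz.
  __hetero_task_end(TB);
  __hetero_section_end(Section);
}
```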

A similar pattern of inputs/outputs can also be used for data-parallel tasks (expressed via loops):

```
void Bar(int* P1, size_t P1Sz, float* P2, size_t P2Sz)
{
  void* Section = __hetero_section_begin();
  int P1_limits = P1Sz / sizeof(int);
  int P2_limits = P2Sz / sizeof(float);
  // Loop L1
  for(int i = 0; i < P1_limits; ++i){
    __hetero_parallel_loop(
        /* Number of parallel enclosing loops */ 1,
        /* Number of Input Buffer Pairs */ 1, P1, P1Sz,
        /* Number of Output Buffer Pairs */ 1, P1, P1Sz
    );
    // Mutates memory pointed to by P1
  }
  // Loop L2
  for(int j = 0; j < P2_limits; ++j){
    __hetero_parallel_loop(
        /* Number of parallel enclosing loops */ 1,
        /* Number of Input Buffer Pairs */ 1, P2, P2Sz,
        /* Number of Output Buffer Pairs */ 1, P2, P2Sz
    );
    // Mutates memory pointed to by P2
  }
  // Loop L3
  for(int j = 0; j < P2_limits; ++j){
    __hetero_parallel_loop(
        /* Number of parallel enclosing loops */ 1,
        /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
        /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz
    );
    // Mutates memory pointed to by P1 and by P2
  }
  __hetero_section_end(Section);
}
```

As with the Task examples, `L1` and `L2` can execute independently of
each other, while `L3` will only begin executing once `L1` and `L2`
finish executing.

## Nesting Computation

The hierarchical dataflow computation in `HPVM` can be expressed in two
different ways. Users can choose to break a Parallel Task into multiple
subtasks in the same source-level function, which can contain complex
and nested interdependencies as described above.

As an alternative mode of expression, users can use the `Hetero-C++` API
to describe Parallel Sections across multiple functions and invoke calls
to those functions, essentially connecting the end-point nodes of their
respective DFGs and creating hierarchy.

Both methodologies are described in the following subsections:

### Within the Same Function

Consider the following example of two Parallel Tasks, T1 and T2. T1 computes the element-wise products of four vectors, while T2 sums and accumulates the result of that computation. By looking at the input and output buffer pairs of the respective tasks, T2 is sequentially dependent on T1, and hence will begin executing after T1 finishes its execution.

```
void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
         float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
  void* Section = __hetero_section_begin();
  void* T1 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
      /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
  // Element wise product of vectors P1 and P4
  for(int j = 0; j < P1Sz/sizeof(float); j++){
    P1[j] = P1[j] * P4[j];
  }
  // Element wise product of vectors P2 and P3
  for(int i = 0; i < P2Sz/sizeof(float); i++){
    P2[i] = P2[i] * P3[i];
  }
  __hetero_task_end(T1);
  void* T2 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
      /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
  int dim = P1Sz / sizeof(float);
  // Element wise addition of P1 and P2
  for(int k = 0; k < dim; k++){
    P1[k] = P1[k] + P2[k];
  }
  __hetero_task_end(T2);
  __hetero_section_end(Section);
}
```

The computation inside Parallel Task `T1` can be further divided into
two independent sub-computations: the element-wise product of P1 and P4
as the first sub-computation, and the element-wise product of P2 and P3
as the other. To express this nesting structure, we simply place another
`__hetero_section_begin / __hetero_section_end` marker pair inside the
Parallel Task boundaries of T1 and, as before, declare two new Parallel
Tasks inside it.

```
void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
         float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
  void* Section = __hetero_section_begin();
  void* T1 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
      /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
  {
    void* SubSection = __hetero_section_begin();
    void* T1_1 = __hetero_task_begin(
        /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P4, P4Sz,
        /* Number of Output Buffer Pairs */ 1, P1, P1Sz);
    // Element wise product of vectors P1 and P4
    for(int j = 0; j < P1Sz/sizeof(float); j++){
      P1[j] = P1[j] * P4[j];
    }
    __hetero_task_end(T1_1);
    void* T1_2 = __hetero_task_begin(
        /* Number of Input Buffer Pairs */ 2, P2, P2Sz, P3, P3Sz,
        /* Number of Output Buffer Pairs */ 1, P2, P2Sz);
    // Element wise product of vectors P2 and P3
    for(int i = 0; i < P2Sz/sizeof(float); i++){
      P2[i] = P2[i] * P3[i];
    }
    __hetero_task_end(T1_2);
    __hetero_section_end(SubSection);
  }
  __hetero_task_end(T1);
  void* T2 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
      /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
  int dim = P1Sz / sizeof(float);
  // Element wise addition of P1 and P2
  for(int k = 0; k < dim; k++){
    P1[k] = P1[k] + P2[k];
  }
  __hetero_task_end(T2);
  __hetero_section_end(Section);
}
```

Sub-computations can be repeatedly nested using this approach. Note that
the inputs of Task `T1` are the union of the input buffer pairs used
inside its sub-computations, i.e.
`P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz`. Similarly, the outputs of
Task `T1` are the union of the output buffer pairs of its
sub-computations, i.e. `P1, P1Sz, P2, P2Sz`.

### Across Functions

For larger programs, it is often not feasible to nest the computation
inside a single function, as it becomes difficult to read. For such
cases, `Hetero-C++` allows the HPVM Dataflow graph to span multiple
functions. Consider the `MAC` program we created in the previous
code snippet:

```
void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
         float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
  void* Section = __hetero_section_begin();
  void* T1 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
      /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
  {
    void* SubSection = __hetero_section_begin();
    void* T1_1 = __hetero_task_begin(
        /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P4, P4Sz,
        /* Number of Output Buffer Pairs */ 1, P1, P1Sz);
    // Element wise product of vectors P1 and P4
    for(int j = 0; j < P1Sz/sizeof(float); j++){
      P1[j] = P1[j] * P4[j];
    }
    __hetero_task_end(T1_1);
    void* T1_2 = __hetero_task_begin(
        /* Number of Input Buffer Pairs */ 2, P2, P2Sz, P3, P3Sz,
        /* Number of Output Buffer Pairs */ 1, P2, P2Sz);
    // Element wise product of vectors P2 and P3
    for(int i = 0; i < P2Sz/sizeof(float); i++){
      P2[i] = P2[i] * P3[i];
    }
    __hetero_task_end(T1_2);
    __hetero_section_end(SubSection);
  }
  __hetero_task_end(T1);
  void* T2 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
      /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
  int dim = P1Sz / sizeof(float);
  // Element wise addition of P1 and P2
  for(int k = 0; k < dim; k++){
    P1[k] = P1[k] + P2[k];
  }
  __hetero_task_end(T2);
  __hetero_section_end(Section);
}
```

We can refactor the computation of Task `T1` into another function.
Then we simply call the newly created function with the appropriate
arguments as follows:

```
// Extracted the sub-computation of T1 in MAC into a separate function
void Prod(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
          float* P3, size_t P3Sz, float* P4, size_t P4Sz){
  void* SubSection = __hetero_section_begin();
  void* T1_1 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P4, P4Sz,
      /* Number of Output Buffer Pairs */ 1, P1, P1Sz);
  // Element wise product of vectors P1 and P4
  for(int j = 0; j < P1Sz/sizeof(float); j++){
    P1[j] = P1[j] * P4[j];
  }
  __hetero_task_end(T1_1);
  void* T1_2 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 2, P2, P2Sz, P3, P3Sz,
      /* Number of Output Buffer Pairs */ 1, P2, P2Sz);
  // Element wise product of vectors P2 and P3
  for(int i = 0; i < P2Sz/sizeof(float); i++){
    P2[i] = P2[i] * P3[i];
  }
  __hetero_task_end(T1_2);
  __hetero_section_end(SubSection);
}

void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
         float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
  void* Section = __hetero_section_begin();
  void* T1 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
      /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
  // Calling the HPVM Dataflow function Prod
  // with the appropriate arguments. Hetero-C++
  // will connect the end-points of both
  // Dataflow Graphs and create hierarchy.
  Prod(P1, P1Sz, P2, P2Sz,
       P3, P3Sz, P4, P4Sz);
  __hetero_task_end(T1);
  void* T2 = __hetero_task_begin(
      /* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
      /* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
  int dim = P1Sz / sizeof(float);
  // Element wise addition of P1 and P2
  for(int k = 0; k < dim; k++){
    P1[k] = P1[k] + P2[k];
  }
  __hetero_task_end(T2);
  __hetero_section_end(Section);
}
```

Note the call to `Prod` inside Task `T1`. `Hetero-C++` will recognize
that `Prod` is also a Dataflow graph and will connect the nodes of
`MAC` and `Prod`. If a non-HPVM Dataflow graph function is called
inside a Task, it will be considered part of the same node.
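
For instance, a plain helper with no Hetero-C++ markers (the helper `scale_in_place` and the wrapper `Foo` below are hypothetical) called inside a task simply becomes part of that task's node:

```
// Ordinary C++ helper with no Hetero-C++ markers.
void scale_in_place(float* P, size_t PSz, float factor){
  for(int i = 0; i < PSz / sizeof(float); ++i){
    P[i] = P[i] * factor;
  }
}

void Foo(float* P1, size_t P1Sz)
{
  void* Section = __hetero_section_begin();
  void* T1 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 1, P1, P1Sz,
                                 /* Number of Output Buffer Pairs */ 1, P1, P1Sz);
  // Not a Dataflow graph function, so this call stays inside T1's node.
  scale_in_place(P1, P1Sz, 2.0f);
  __hetero_task_end(T1);
  __hetero_section_end(Section);
}
```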

## Limitations

- Generally, computations inside Parallel Sections should be avoided. For the Parallel Loop cases, any *scalar* computation is allowed, including function calls that calculate the expression for the number of loop iterations. However, we currently do not support these calls if any of the arguments is a pointer type.
- The `__hetero_parallel_loop` marker can only be used on loops which will enter the loop body at least once.
- For the Root Node Function prototype, if any pointer argument is listed, then the size of the memory associated with that pointer must be the immediately following argument in the argument list. This may require re-arranging the Root Function's arguments. This requirement only applies to the Dataflow Graph's roots (i.e. Functions called from `__hetero_launch`). For nesting Dataflow Graphs *across functions*, the arguments can be in any order (see the sketch below).
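
As an illustration of the last point (the prototypes below are hypothetical), a root function passed to `__hetero_launch` must interleave each pointer with its size, while a nested dataflow function called from inside a task is not subject to this ordering:

```
// Valid Root Node prototype for __hetero_launch:
// each pointer argument is immediately followed by its size in bytes.
void RootOK(float* A, size_t ASz, float* B, size_t BSz);

// Not usable as a launch root: the sizes are not adjacent to their pointers.
// It could still be used as a nested Dataflow graph function inside a task.
void RootBad(float* A, float* B, size_t ASz, size_t BSz);
```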