Writing a Hetero-C++ Program¶
The following document describes various programming patterns supported by the Hetero-C++ API. Please refer to the Hetero-C++ Language Specification for a detailed description of the individual API calls.
A Hetero-C++ program can be divided into two categories: Dataflow Code, which describes the structure of the hierarchical dataflow graph, and Host Code, which describes the actual inputs passed to the Dataflow Graph (DFG) and the outputs requested after its execution completes.
Launching a Dataflow Graph¶
The HPVM framework describes parallelism via a hierarchical dataflow graph. The entry node of this graph is referred to as the Root Node. Consider the function MyRoot below, which we want as the root node of our Dataflow graph.
// Takes the memory pointed to by P of size Psz bytes
// and transforms its content.
void MyRoot(float* P, size_t Psz){
// HPVM Dataflow Graph ...
}
To launch the Dataflow graph described by MyRoot, the corresponding host code should be created.
{
// ... Previous code describing initialization
// of (float* F, size_t F_size)
void* dfg = __hetero_launch((void*) MyRoot,
/* Number of Input Buffer Pairs */ 1, F, F_size,
/* Number of Output Buffer Pairs */ 1, F, F_size);
__hetero_wait(dfg); // Wait for the dataflow graph to complete execution
// ... Code using float* F, whose content has been modified by MyRoot's DFG.
}
The code-snippet above binds the actual parameters (float* F, size_t F_size) to the formal parameters of MyRoot, (float* P, size_t Psz), respectively. Since we want to copy the state of the transformed array back to the host code, we specify F and its corresponding size in bytes, F_size, as an output buffer pair.
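Since Hetero-C++ code only compiles under the HPVM toolchain, the end-to-end effect of the launch can be sketched in plain C++. This is a model only: the direct call below stands in for the launch and wait calls, and the doubling body of MyRoot is a made-up example computation.

```cpp
#include <cstddef>

// Hypothetical root-node body: doubles every float in the buffer.
// (Stand-in for a real HPVM Dataflow graph; illustrative only.)
void MyRoot(float* P, size_t Psz) {
    for (size_t i = 0; i < Psz / sizeof(float); ++i)
        P[i] *= 2.0f;
}

// Sequential model of launching the graph and waiting on it: because
// (F, F_size) is listed both as an input and an output buffer pair,
// F's contents are visible to MyRoot, and the transformed contents
// are visible to the host after the wait completes.
void launch_and_wait(float* F, size_t F_size) {
    MyRoot(F, F_size); // models launch + wait of MyRoot's DFG
}
```

After launch_and_wait returns, F holds the transformed data, mirroring the role of the output buffer pair in the real launch call.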
Inlined Launching of a Dataflow Graph¶
Often in practice, when part of a larger function is to be treated as the launching point for an HPVM Dataflow graph, users have had to manually outline that region of code into its own function in order to use the __hetero_launch call described above. For such cases, we introduced an additional launching marker, __hetero_launch_begin, which lets users simply delimit the region of code they want extracted; the Hetero-C++ API will automatically outline the code accordingly and insert the __hetero_launch call with the appropriate arguments.
{
// ... Previous code describing initialization
// of (float* F, size_t F_size)
void* launch = __hetero_launch_begin(
/* Number of Input Buffer Pairs */ 1, F, F_size,
/* Number of Output Buffer Pairs */ 1, F, F_size);
// HPVM Dataflow graph description
__hetero_launch_end(launch); // Ending point of the code region to be extracted into a launch
// ... Code using float* F, whose content has been modified by the DFG.
}
Parallel Section, Tasks and Loops¶
Hetero-C++ enables specifying coarse-grained task-level parallelism, as well as fine-grained data parallelism, via corresponding API calls.
For example, consider the following vector addition computation written at the source level.
void func(int* A, size_t ASz, int* B, size_t BSz)
{
int loop_size = ASz / sizeof(int);
for(int i = 0; i < loop_size; ++i){
// Example computation using Pointer A and B
A[i] = A[i] + B[i];
}
}
If we want to treat this entire loop execution as a single computational task, we can describe it in Hetero-C++ as:
void func(int* A, size_t ASz, int* B, size_t BSz)
{
// Section specifies the start of a DFG computation node. Its return
// value is passed to __hetero_section_end at the end
// of the node's computation.
void* Section = __hetero_section_begin();
void* T1 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, A, ASz, B, BSz,
/* Number of Output Buffer Pairs */ 2, A, ASz, B, BSz);
int loop_size = ASz / sizeof(int);
for(int i = 0; i < loop_size; ++i){
// Example computation using Pointer A and B
A[i] = A[i] + B[i];
}
__hetero_task_end(T1);
__hetero_section_end(Section);
}
The entire loop computation will be treated as a single task, and iterations will execute sequentially one after another.
However, consider the case where the user knows that the iterations of the loop are independent of each other, and as such the loop can be treated as data-parallel. We can easily represent this as:
void func(int* A, size_t ASz, int* B, size_t BSz)
{
void* Section = __hetero_section_begin();
int loop_size = ASz / sizeof(int);
for(int i = 0; i < loop_size; ++i){
__hetero_parallel_loop(
/* Number of parallel enclosing loops */ 1,
/* Number of Input Buffer Pairs */ 2, A, ASz, B, BSz,
/* Number of Output Buffer Pairs */ 2, A, ASz, B, BSz);
// Example computation using Pointer A and B
A[i] = A[i] + B[i];
}
__hetero_section_end(Section);
}
In the above example, the loop computation will be converted into loop_size many dynamic instances, enabling each loop iteration to execute independently of the others.
Hence, users can choose the granularity of their parallelism depending on their program’s semantics.
Inter-Task Dependencies¶
If we have multiple tasks, we would like to be able to describe which subsets of these tasks are related via dataflow dependencies. For example, consider:
void Bar(int* P1, size_t P1Sz, float* P2, size_t P2Sz)
{
void* Section = __hetero_section_begin();
void* T1 = __hetero_task_begin(...);
// Mutates memory pointed to by P1
__hetero_task_end(T1);
void* T2 = __hetero_task_begin(...);
// Mutates memory pointed to by P2
__hetero_task_end(T2);
void* T3 = __hetero_task_begin(...);
// Mutates memory pointed to by P1 and by P2
__hetero_task_end(T3);
__hetero_section_end(Section);
}
In the above code-snippet, both Tasks T1 and T3 operate on the memory pointed to by P1, hence we would like to specify that T3 operates on the output of T1. Similarly for T2 and T3 with the memory pointed to by P2. However, T1 and T2 operate on different data and hence can be executed in parallel.
These task dependencies are inferred implicitly from the __hetero_task_begin markers. The completed code-snippet is displayed below:
void Bar(int* P1, size_t P1Sz, float* P2, size_t P2Sz)
{
void* Section = __hetero_section_begin();
void* T1 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 1, P1, P1Sz,
/* Number of Output Buffer Pairs */ 1, P1, P1Sz);
// Mutates memory pointed to by P1
__hetero_task_end(T1);
void* T2 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 1, P2, P2Sz,
/* Number of Output Buffer Pairs */ 1, P2, P2Sz);
// Mutates memory pointed to by P2
__hetero_task_end(T2);
void* T3 = __hetero_task_begin(/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
// Mutates memory pointed to by P1 and by P2
__hetero_task_end(T3);
__hetero_section_end(Section);
}
Hetero-C++ infers task dependencies by looking at the input and output pointers of each task. For T1 and T2, the input pointers are different, hence the tasks will be inferred to be parallel to each other. Both tasks mutate their respective array contents and then indicate that they wrote to the pointers by including them as output pointers. The inputs of T3 are the same as the outputs of T1 and T2. As such, T3 is inferred to be sequentially dependent on both tasks and will begin executing once both T1 and T2 finish executing.
The Parallel Task input and output semantics are as follows:

- If Task T contains the buffer pair (P, PSz) as one of its inputs and no Task prior to T in the same Parallel Section contains (P, PSz) as its input/output buffer pair, then T will be bound to the state of (P, PSz) at the entry to the parent Function.
- If Task T contains the buffer pair (P, PSz) as one of its inputs and there exists a Task T0 prior to T in the same Parallel Section (closest to T) which contains (P, PSz) as its input/output buffer pair, then T will be sequentially dependent on T0 for (P, PSz). Note that this dependency is expressed per buffer pair.
- If Task T contains the buffer pair (P, PSz) as one of its outputs and no Task after T in the same Parallel Section contains (P, PSz) as its input/output buffer pair, then T will return the state of (P, PSz) at the exit of the parent Function.
- If Task T contains the buffer pair (P, PSz) as one of its outputs and there exists a Task T1 after T in the same Parallel Section (closest to T) which contains (P, PSz) as its input/output buffer pair, then T1 will be sequentially dependent on T for (P, PSz). Note that this dependency is expressed per buffer pair.
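The inference these rules imply can be sketched in plain C++ (illustrative only; buffer pairs are modeled as names, and the real compiler analysis is more involved): for each input of a task, scan backwards for the closest prior task that listed the same buffer pair as an output.

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrative model: a task is described by the buffer pairs it lists
// as inputs and outputs (names stand in for (P, PSz) pairs).
struct TaskDesc {
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
};

// Returns edges (i, j) meaning task j is sequentially dependent on task i:
// task j lists as an input a buffer pair that the closest prior task i
// listed as an output. The dependency is computed per buffer pair.
std::vector<std::pair<int, int>> infer_deps(const std::vector<TaskDesc>& tasks) {
    std::vector<std::pair<int, int>> deps;
    for (int j = 0; j < (int)tasks.size(); ++j) {
        for (const auto& buf : tasks[j].inputs) {
            for (int i = j - 1; i >= 0; --i) { // scan backwards: closest task first
                bool writes = false;
                for (const auto& out : tasks[i].outputs)
                    if (out == buf) writes = true;
                if (writes) {
                    deps.push_back({i, j});
                    break;
                }
            }
        }
    }
    return deps;
}
```

Applied to the Bar example above, T3 (index 2) depends on T1 (index 0) through P1 and on T2 (index 1) through P2, while T1 and T2 have no edge between them.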
By specifying the inputs and outputs appropriately, any complex pattern of dependencies can be inferred implicitly.
A similar pattern of inputs/outputs can also be used for data-parallel tasks (expressed via loops):
void Bar(int* P1, size_t P1Sz, float* P2, size_t P2Sz)
{
void* Section = __hetero_section_begin();
int P1_limits = P1Sz / sizeof(int);
int P2_limits = P2Sz / sizeof(float);
// Loop L1
for(int i = 0; i < P1_limits; ++i){
__hetero_parallel_loop(
/* Number of parallel enclosing loops */ 1,
/* Number of Input Buffer Pairs */ 1, P1, P1Sz,
/* Number of Output Buffer Pairs */ 1, P1, P1Sz
);
// Mutates memory pointed to by P1
}
// Loop L2
for(int j = 0; j < P2_limits; ++j){
__hetero_parallel_loop(
/* Number of parallel enclosing loops */ 1,
/* Number of Input Buffer Pairs */ 1, P2, P2Sz,
/* Number of Output Buffer Pairs */ 1, P2, P2Sz
);
// Mutates memory pointed to by P2
}
// Loop L3
for(int j = 0; j < P2_limits; ++j){
__hetero_parallel_loop(
/* Number of parallel enclosing loops */ 1,
/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz
);
// Mutates memory pointed to by P1 and by P2
}
__hetero_section_end(Section);
}
As with the Task examples, L1 and L2 can execute independently of each other, while L3 will only begin executing once L1 and L2 finish executing.
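The schedule these dependencies permit can be sketched in plain C++ with std::async (a hypothetical model compiled with thread support, not Hetero-C++ output): L1 and L2 touch disjoint buffers and run concurrently, while L3 starts only after both complete. The loop bodies here are made-up mutations, since the originals are elided.

```cpp
#include <cstddef>
#include <future>

// Plain C++ model of the inferred schedule: L1 and L2 operate on disjoint
// buffer pairs and may run concurrently; L3 reads both buffers, so it is
// sequentially dependent on both and runs only after they finish.
void bar_schedule(int* P1, size_t P1Sz, float* P2, size_t P2Sz) {
    size_t n1 = P1Sz / sizeof(int);
    size_t n2 = P2Sz / sizeof(float);
    auto l1 = std::async(std::launch::async, [=] { // Loop L1: mutates P1 only
        for (size_t i = 0; i < n1; ++i) P1[i] += 1;
    });
    auto l2 = std::async(std::launch::async, [=] { // Loop L2: mutates P2 only
        for (size_t j = 0; j < n2; ++j) P2[j] *= 2.0f;
    });
    l1.get(); // L3 must wait for both L1 and L2
    l2.get();
    for (size_t j = 0; j < n2 && j < n1; ++j) // Loop L3: reads P1 and P2
        P2[j] += (float)P1[j];
}
```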
Nesting Computation¶
The hierarchical dataflow computation in HPVM can be expressed in two different ways. Users can choose to break a Parallel Task into multiple subtasks within the same source-level function, which can contain complex and nested interdependencies as described above.
As an alternative mode of expression, users can use the Hetero-C++ API to describe Parallel Sections across numerous functions and invoke calls to those functions, essentially connecting the end-point nodes of their respective DFGs and creating hierarchy.
Both methodologies are described in the following subsections:
Within the Same Function¶
Consider the following example of two Parallel Tasks, T1 and T2. T1 computes the element-wise product of four vectors, while T2 sums and accumulates the result of that computation. Looking at the input and output buffer pairs of the respective tasks, T2 is sequentially dependent on T1, and hence will begin executing after T1 finishes its execution.
void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
void* Section = __hetero_section_begin();
void* T1 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
// Element wise product of vectors P1 and P4
for(int j = 0; j < P1Sz/sizeof(float); j++){
P1[j] = P1[j] * P4[j];
}
// Element wise product of vectors P2 and P3
for(int i = 0; i < P2Sz/sizeof(float); i++){
P2[i] = P2[i] * P3[i];
}
__hetero_task_end(T1);
void* T2 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
int dim = P1Sz / sizeof(float);
// Element wise addition of P1 and P2
for(int k = 0; k < dim; k++){
P1[k] = P1[k] + P2[k];
}
__hetero_task_end(T2);
__hetero_section_end(Section);
}
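As a reference point for the two tasks, the end-to-end computation MAC performs can be written sequentially in plain C++ (illustrative; this sketch assumes, as T2 does, that all four buffers share the same element count):

```cpp
#include <cstddef>

// Sequential reference for the two tasks combined:
// T1: P1 <- P1 * P4 and P2 <- P2 * P3 (element-wise products)
// T2: P1 <- P1 + P2 (element-wise accumulation)
void mac_reference(float* P1, float* P2, float* P3, float* P4, size_t bytes) {
    size_t n = bytes / sizeof(float);
    for (size_t i = 0; i < n; ++i) {
        P1[i] = P1[i] * P4[i];
        P2[i] = P2[i] * P3[i];
        P1[i] = P1[i] + P2[i]; // P1 ends up holding P1*P4 + P2*P3
    }
}
```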
The computation inside Parallel Task T1 can be further divided into two independent sub-computations: the element-wise product of P1 and P4 as the first sub-computation, and the element-wise product of P2 and P3 as the other. To express this nesting structure, we simply place another __hetero_section_begin / __hetero_section_end marker pair inside the Parallel Task boundaries of T1 and, as before, declare two new Parallel Tasks inside it.
void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
void* Section = __hetero_section_begin();
void* T1 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
{
void* SubSection = __hetero_section_begin();
void* T1_1 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P4, P4Sz,
/* Number of Output Buffer Pairs */ 1, P1, P1Sz);
// Element wise product of vectors P1 and P4
for(int j = 0; j < P1Sz/sizeof(float); j++){
P1[j] = P1[j] * P4[j];
}
__hetero_task_end(T1_1);
void* T1_2 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P2, P2Sz, P3, P3Sz,
/* Number of Output Buffer Pairs */ 1, P2, P2Sz);
// Element wise product of vectors P2 and P3
for(int i = 0; i < P2Sz/sizeof(float); i++){
P2[i] = P2[i] * P3[i];
}
__hetero_task_end(T1_2);
__hetero_section_end(SubSection);
}
__hetero_task_end(T1);
void* T2 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
int dim = P1Sz / sizeof(float);
// Element wise addition of P1 and P2
for(int k = 0; k < dim; k++){
P1[k] = P1[k] + P2[k];
}
__hetero_task_end(T2);
__hetero_section_end(Section);
}
Sub-computations can be repeatedly nested using this approach. Note that the inputs of Task T1 are the union of the input buffer pairs used inside its sub-computations, i.e. P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz. Similarly, the outputs of Task T1 are the union of the output buffer pairs of its sub-computations, i.e. P1, P1Sz, P2, P2Sz.
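The union rule can be made concrete with a small plain C++ sketch (illustrative only; buffer pair names stand in for (P, PSz) pairs):

```cpp
#include <set>
#include <string>
#include <vector>

// Model of the rule above: a parent task's input (or output) buffer pairs
// are the union of its sub-tasks' input (or output) buffer pairs.
std::set<std::string> union_of(const std::vector<std::vector<std::string>>& subtasks) {
    std::set<std::string> parent;
    for (const auto& pairs : subtasks)
        parent.insert(pairs.begin(), pairs.end());
    return parent;
}
```

For T1 above, the union of T1_1's inputs {P1, P4} and T1_2's inputs {P2, P3} gives the four input pairs T1 declares, and the union of the outputs {P1} and {P2} gives T1's two output pairs.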
Across Functions¶
For larger programs, it is often not feasible to nest the computation inside a single function, as it becomes difficult to read. For such cases, Hetero-C++ allows composing the HPVM Dataflow graph across functions. Consider the MAC program we created in the previous code-snippet:
void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
void* Section = __hetero_section_begin();
void* T1 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
{
void* SubSection = __hetero_section_begin();
void* T1_1 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P4, P4Sz,
/* Number of Output Buffer Pairs */ 1, P1, P1Sz);
// Element wise product of vectors P1 and P4
for(int j = 0; j < P1Sz/sizeof(float); j++){
P1[j] = P1[j] * P4[j];
}
__hetero_task_end(T1_1);
void* T1_2 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P2, P2Sz, P3, P3Sz,
/* Number of Output Buffer Pairs */ 1, P2, P2Sz);
// Element wise product of vectors P2 and P3
for(int i = 0; i < P2Sz/sizeof(float); i++){
P2[i] = P2[i] * P3[i];
}
__hetero_task_end(T1_2);
__hetero_section_end(SubSection);
}
__hetero_task_end(T1);
void* T2 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
int dim = P1Sz / sizeof(float);
// Element wise addition of P1 and P2
for(int k = 0; k < dim; k++){
P1[k] = P1[k] + P2[k];
}
__hetero_task_end(T2);
__hetero_section_end(Section);
}
We can refactor the computation of Task T1 into another function, then simply call the newly created function with the appropriate arguments as follows:
// Extracted the sub-computation of T1 in MAC into a separate function
void Prod(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
float* P3, size_t P3Sz, float* P4, size_t P4Sz){
void* SubSection = __hetero_section_begin();
void* T1_1 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P4, P4Sz,
/* Number of Output Buffer Pairs */ 1, P1, P1Sz);
// Element wise product of vectors P1 and P4
for(int j = 0; j < P1Sz/sizeof(float); j++){
P1[j] = P1[j] * P4[j];
}
__hetero_task_end(T1_1);
void* T1_2 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P2, P2Sz, P3, P3Sz,
/* Number of Output Buffer Pairs */ 1, P2, P2Sz);
// Element wise product of vectors P2 and P3
for(int i = 0; i < P2Sz/sizeof(float); i++){
P2[i] = P2[i] * P3[i];
}
__hetero_task_end(T1_2);
__hetero_section_end(SubSection);
}
void MAC(float* P1, size_t P1Sz, float* P2, size_t P2Sz,
float* P3, size_t P3Sz, float* P4, size_t P4Sz)
{
void* Section = __hetero_section_begin();
void* T1 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 4, P1, P1Sz, P2, P2Sz, P3, P3Sz, P4, P4Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
// Calling the HPVM Dataflow function Prod
// with the appropriate arguments. Hetero-C++
// will connect the end-points of both
// Dataflow Graphs and create hierarchy.
Prod(P1, P1Sz, P2, P2Sz,
P3, P3Sz, P4, P4Sz);
__hetero_task_end(T1);
void* T2 = __hetero_task_begin(
/* Number of Input Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz,
/* Number of Output Buffer Pairs */ 2, P1, P1Sz, P2, P2Sz);
int dim = P1Sz / sizeof(float);
// Element wise addition of P1 and P2
for(int k = 0; k < dim; k++){
P1[k] = P1[k] + P2[k];
}
__hetero_task_end(T2);
__hetero_section_end(Section);
}
Note the call to Prod inside Task T1. Hetero-C++ will recognize that Prod is also a Dataflow graph function and will connect the nodes of MAC and Prod. If a non-HPVM Dataflow graph function is called inside a Task, it will be considered part of the same node.
Limitations¶
- Generally, computations inside a Parallel Section that fall outside Task or Parallel Loop markers should be avoided. For the Parallel Loop cases, any scalar computation is allowed, including function calls to calculate the expression for the number of loop iterations. However, we currently do not support such calls if any of the arguments is a pointer type.
- The __hetero_parallel_loop marker can only be used on loops which will execute at least one iteration.
- For the Root Node Function prototype, if any pointer argument is listed, then the size of the memory associated with that pointer must be the immediately following argument in the argument list. This may require re-arranging the Root Function's arguments to reflect this. This requirement is only necessary for the Dataflow Graph's roots (i.e. Functions called from __hetero_launch). For Dataflow Graphs nested across functions, the arguments can be in any order.