Version 5 (modified by gemacke, 9 years ago)

--

Visiting the Adjacencies of a List of Vertices

Traversing the adjacencies of a list of vertices is a simple graph kernel used in more complicated algorithms such as single-source shortest paths and breadth-first search. In an adjacency list traversal, some operation is typically applied to each visited vertex. The operation is often simple but can also be more complex. We give a series of algorithms for adjacency list traversal to demonstrate how good performance and parallelization can be achieved while using generic constructs in the MTGL. All of the adjacency list traversal algorithms operate on a compressed sparse row (CSR) format graph. The code for these algorithms is located at test/test_static_graph6.cpp.
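For reference, the CSR layout these algorithms assume can be sketched as follows. The struct and field names mirror the members used below (`n`, `index`, `end_points`), but this is an illustrative stand-in, not the actual `static_graph` class.

```
#include <cstddef>
#include <vector>

// Illustrative stand-in for the CSR fields the algorithms below read;
// the real mtgl::static_graph has more members and a different layout.
struct csr_graph
{
  std::size_t n;                        // number of vertices
  std::vector<std::size_t> index;       // n + 1 entries; vertex i's
                                        // adjacencies live in positions
                                        // index[i] .. index[i + 1] - 1
  std::vector<std::size_t> end_points;  // one entry per edge: target ids
};

// Visit every adjacency (source id, target id) exactly once.
template <typename Visitor>
void visit_adjacencies(const csr_graph& g, Visitor visit)
{
  for (std::size_t i = 0; i < g.n; ++i)
  {
    for (std::size_t j = g.index[i]; j < g.index[i + 1]; ++j)
    {
      visit(i, g.end_points[j]);
    }
  }
}
```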

Consider Algorithm 1 shown below.

```
void compute_in_degree(static_graph<directedS>* g, size_type* indeg)
{
  size_type i, j;
  size_type order = g->n;
  size_type* index = g->index;
  size_type* end_points = g->end_points;

  #pragma mta assert noalias *indeg
  for (i = 0; i < order; i++)
  {
    size_type begin = index[i];
    size_type end = index[i + 1];

    for (j = begin; j < end; j++)
    {
      if (i < end_points[j]) mt_incr(indeg[end_points[j]], 1);
    }
  }
}
```

The algorithm is a straightforward C implementation that traverses the adjacencies of every vertex in the graph and determines, for each vertex, the number of incoming adjacencies from vertices with a lower id. It is a simple nested loop whose body is a single if statement.
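The only synchronization in the loop body is mt_incr, MTGL's atomic increment, which on the XMT maps to an int_fetch_add instruction. As a rough sketch of its semantics only (not the actual MTGL implementation, which takes a plain reference), one could write:

```
#include <atomic>

// Sketch of mt_incr's semantics: atomically add 'amount' to 'target'
// and return the value held before the add. std::atomic is used here
// only to make the sketch portable; the real mtgl::mt_incr relies on
// the XMT's int_fetch_add.
template <typename T>
T mt_incr_sketch(std::atomic<T>& target, T amount)
{
  return target.fetch_add(amount);
}
```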

Now look at the canal output.

```
           |   #pragma mta assert noalias *indeg
           |   for (i = 0; i < order; i++)
           |   {
           |     size_type begin = index[i];
           |     size_type end = index[i + 1];
           |     for (j = begin; j < end; j++)
           |     {
    4 PP:m |       if (i < end_points[j]) mt_incr(indeg[end_points[j]], 1);
++ function T1 mtgl::mt_incr<T1, T2> inlined
           |     }
           |   }
```

Note the PP:m marking to the left of the if statement. The capital P's mean the compiler automatically parallelized both levels of the loop nest. The m means the compiler automatically performed a Manhattan loop collapse.

Looking at the annotation for loop 4, we see that the body of the loop performs only two memory references and a total of six instructions.

```
Loop   4 in compute_in_degree(mtgl::static_graph<(int)1> *, unsigned long *) at line 64 in loop 3
       Loop not pipelined: not structurally ok
       Parallel section of loop from level 2
       Loop summary: 6 instructions, 0 floating point operations
       1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls
```

But what happens if a user desires to do something more complicated? Consider Algorithm 2 shown below.

```
typedef bool (*pred_t)(size_type, size_type);

#pragma mta inline
inline bool test_pred_func(size_type i, size_type j)
{
  return (i < j);
}

void compute_in_degree_func_ptr(static_graph<directedS>* g,
                                size_type* indeg, pred_t my_func)
{
  size_type i, j;
  size_type order = g->n;
  size_type* index = g->index;
  size_type* end_points = g->end_points;

  #pragma mta assert noalias *indeg
  #pragma mta assert parallel
  for (i = 0; i < order; i++)
  {
    size_type begin = index[i];
    size_type end = index[i + 1];

    #pragma mta assert parallel
    for (j = begin; j < end; j++)
    {
      if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);
    }
  }
}
```

This algorithm does the same thing as Algorithm 1, but it performs the id test by calling a function through a function pointer. A function pointer is the natural C mechanism for letting a user apply a generic operation to each of the adjacencies.
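The cost of this approach is easy to reproduce outside the MTGL. In the sketch below (illustrative names, not MTGL code), pred is an opaque pointer, so a compiler must generally emit an indirect call in the loop body instead of inlining the comparison, and it is that opaque call which also defeats automatic parallelization.

```
#include <cstddef>

typedef bool (*pred_t)(std::size_t, std::size_t);

inline bool id_less(std::size_t i, std::size_t j) { return i < j; }

// Counts the entries of a[0..n) that satisfy pred(i, a[k]). Because
// pred's target is only known at run time, the call in the loop body
// is indirect and normally cannot be inlined.
std::size_t count_matching(std::size_t i, const std::size_t* a,
                           std::size_t n, pred_t pred)
{
  std::size_t count = 0;
  for (std::size_t k = 0; k < n; ++k)
  {
    if (pred(i, a[k])) ++count;
  }
  return count;
}
```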

Looking at the canal output and the annotation for loop 6, we see that the compiler cannot inline the function called through the function pointer, and the number of instructions and memory references increases significantly. In addition, note the pp:m marking to the left of the function call. The lowercase p's mean that the compiler could not automatically parallelize the loops; the user had to force parallelization with pragma statements.

```
           |   #pragma mta assert noalias *indeg
           |   #pragma mta assert parallel
           |   for (i = 0; i < order; i++)
           |   {
           |     size_type begin = index[i];
           |     size_type end = index[i + 1];
           |     #pragma mta assert parallel
           |     for (j = begin; j < end; j++)
           |     {
    6 pp:m |       if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);
++ function T1 mtgl::mt_incr<T1, T2> inlined
           |     }
           |   }
```
```
Loop   6 in compute_in_degree_func_ptr(mtgl::static_graph<(int)1> *, unsigned long *,
       bool (*)(unsigned long, unsigned long)) at line 93 in region 5
       In parallel phase 3
       Interleave scheduled
       General loop collapse
       Loop not pipelined: not structurally ok
       Loop from level 2 combined with this loop
       Loop summary: 25 instructions, 0 floating point operations
       4 loads, 1 stores, 12 reloads, 0 spills, 2 branches, 2 calls
```

When programming in C++, a natural method to allow a user to perform a generic operation is a function object. Consider Algorithm 3 shown below. This algorithm is the same as the previous ones except it performs the id test using a function object.

```
class test_predC {
public:
  inline bool operator()(size_type i, size_type j) { return (i < j); }
};

template <class pred>
void compute_in_degree_func_obj(static_graph<directedS>* g,
                                size_type* indeg, pred my_func)
{
  size_type order = g->n;
  size_type* index = g->index;
  size_type* end_points = g->end_points;

  #pragma mta assert noalias *indeg
  for (size_type i = 0; i < order; i++)
  {
    size_type begin = index[i];
    size_type end = index[i + 1];

    for (size_type j = begin; j < end; j++)
    {
      if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);
    }
  }
}
```

Looking at the canal output, we see that once again the compiler was able to automatically parallelize both loops and perform a Manhattan loop collapse.

```
           |   #pragma mta assert noalias *indeg
           |   for (size_type i = 0; i < order; i++)
           |   {
           |     size_type begin = index[i];
           |     size_type end = index[i + 1];
           |     for (size_type j = begin; j < end; j++)
           |     {
   24 PP:m |       if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);
++ function T1 mtgl::mt_incr<T1, T2> inlined
++ function test_predC::operator  inlined
           |     }
           |   }
```

Looking at the annotation for loop 24, we see that the number of memory references and instructions is the same as in Algorithm 1.

```
Loop  24 in void compute_in_degree_func_obj<T1>(mtgl::static_graph<(int)1> *, unsigned long *, T1)
       [with T1=test_predC] at line 117 in loop 21
       Loop not pipelined: not structurally ok
       Parallel section of loop from level 2
       Loop summary: 6 instructions, 0 floating point operations
       1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls
```

Now, consider Algorithm 4 shown below. This is the same as Algorithm 3 (uses a function object), but it uses the syntactic sugar of the MTGL graph API. The graph adapter used is a static graph adapter, which is the MTGL’s implementation of a compressed sparse row graph.

```
template <class Graph, class predicate>
void compute_in_degree_mtgl(Graph& g,
                            typename graph_traits<Graph>::size_type* indeg,
                            predicate f)
{
  typedef typename graph_traits<Graph>::size_type size_type;
  typedef typename graph_traits<Graph>::vertex_descriptor vertex_t;
  typedef typename graph_traits<Graph>::vertex_iterator vertex_iterator_t;
  typedef typename graph_traits<Graph>::adjacency_iterator adjacency_iterator_t;

  vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);
  size_type order = num_vertices(g);
  vertex_iterator_t verts = vertices(g);

  #pragma mta assert noalias *indeg
  for (size_type i = 0; i < order; i++)
  {
    vertex_t u = verts[i];
    size_type uid = get(vid_map, u);
    adjacency_iterator_t adj_verts = adjacent_vertices(u, g);
    size_type end = out_degree(u, g);

    for (size_type j = 0; j < end; j++)
    {
      vertex_t v = adj_verts[j];
      size_type vid = get(vid_map, v);
      if (f(uid, vid)) mt_incr(indeg[vid], 1);
    }
  }
}
```

Looking at the canal output and the annotation for loop 25, we see that the MTGL version performed the same number of instructions and memory references for this graph adapter as Algorithm 1. The compiler was able to eliminate the extra operations seemingly implied by the syntactic sugar, and it was able to automatically parallelize the loops involving the iterators. In this case we achieved our dual goals of generic code and good parallel performance.

```
           |   #pragma mta assert noalias *indeg
           |   for (size_type i = 0; i < order; i++)
           |   {
           |     vertex_t u = verts[i];
++ function T1 mtgl::static_vertex_iterator<T1, T2>::operator [] inlined
           |     size_type uid = get(vid_map, u);
++ function T2 mtgl::get<T1, T2, T3> inlined
           |     size_type end = out_degree(u, g);
           |     for (size_type j = 0; j < end; j++)
           |     {
++ function T1 mtgl::static_adjacency_iterator<T1, T2>::operator [] inlined
   25 PP:m |       size_type vid = get(vid_map, v);
++ function T2 mtgl::get<T1, T2, T3> inlined
   25 PP:m |       if (f(uid, vid)) mt_incr(indeg[vid], 1);
++ function T1 mtgl::mt_incr<T1, T2> inlined
++ function test_predC::operator  inlined
           |     }
           |   }
```
```
Loop  25 in void compute_in_degree_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2)
       [with T1=mtgl::static_graph_adapter<(int)1>, T2=test_predC] at line 140 in loop 22
       Loop not pipelined: not structurally ok
       Parallel section of loop from level 2
       Loop summary: 6 instructions, 0 floating point operations
       1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls
```

The operation applied to each adjacency in the above algorithms is quite simple. What happens if the operation becomes more complex? It is possible for a user to choose an operation that the compiler cannot parallelize well, or at all. In this case parallelism can still be achieved by manually dividing the adjacencies into equal-sized chunks and assigning the chunks to threads. Consider the diagram shown below, which represents Algorithm 5. The first row of boxes is the queue of vertices whose adjacencies we want to visit. The second row of boxes is the vertices array of the CSR graph. The third row of boxes is the adjacencies array of the CSR graph. Algorithm 5 performs the precomputation necessary to divide the adjacencies evenly into chunks and manually assigns those chunks to threads. This algorithm has overhead that the others do not, but it allows parallelization in more circumstances.
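The source for Algorithm 5 is too long to reproduce here, but its core precomputation can be sketched as follows (illustrative code, not the MTGL implementation): build a prefix sum over the degrees of the queued vertices, then map each equal share of the total adjacency count back to a starting (queue position, offset) pair with a binary search.

```
#include <algorithm>
#include <cstddef>
#include <vector>

// Where a chunk of adjacencies begins: which entry of the vertex queue,
// and how far into that vertex's adjacency list.
struct chunk_start
{
  std::size_t vertex_pos;  // position in the vertex queue
  std::size_t offset;      // offset into that vertex's adjacency list
};

// degrees[i] is the degree of the i-th queued vertex. Divide the total
// adjacency count into num_chunks nearly equal chunks and locate each
// chunk's starting point with a binary search over the prefix sums.
std::vector<chunk_start>
chunk_adjacencies(const std::vector<std::size_t>& degrees,
                  std::size_t num_chunks)
{
  // prefix[i] = number of adjacencies owned by queue entries 0 .. i-1.
  std::vector<std::size_t> prefix(degrees.size() + 1, 0);
  for (std::size_t i = 0; i < degrees.size(); ++i)
  {
    prefix[i + 1] = prefix[i] + degrees[i];
  }

  std::size_t total = prefix.back();
  std::vector<chunk_start> starts(num_chunks);

  for (std::size_t c = 0; c < num_chunks; ++c)
  {
    // Index of the first adjacency handled by chunk c.
    std::size_t pos = c * total / num_chunks;

    // Find the queue entry whose adjacency range contains pos.
    std::size_t v = std::upper_bound(prefix.begin(), prefix.end(), pos)
                    - prefix.begin() - 1;

    starts[c].vertex_pos = v;
    starts[c].offset = pos - prefix[v];
  }

  return starts;
}
```

Each thread then walks from its chunk's start point up to the next chunk's start point, so every adjacency is visited exactly once regardless of how skewed the degree distribution is.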

Performance Results

The five algorithms for visiting the adjacencies of the vertices of a graph were tested on two graphs. The first is an Erdos-Renyi random graph with two billion edges. The second is a graph of realistic data with 0.5 billion edges. The Erdos-Renyi graph is an example of simple synthetic data, while the realistic data graph is an example of more complex data. The figures below show the time in seconds each of the five algorithms took on each graph for varying numbers of processors; the results for the Erdos-Renyi graph are on the left, and the results for the realistic data graph are on the right. The algorithms performed similarly on both the simple and the realistic data. Algorithms 1, 3, and 4 performed roughly the same as each other, while the fully generic algorithm that manually chunks the adjacencies (Algorithm 5) ran about two to three times slower than the best algorithms. The algorithm that used the function pointer (Algorithm 2), however, performed horribly. All the algorithms except Algorithm 2 scaled up to 64 processors.