
Visiting the Adjacencies of a List of Vertices

Traversing the adjacencies of a list of vertices is a simple graph kernel that is used in more complicated algorithms such as single source shortest path and breadth first search. In an adjacency list traversal, some operation is typically applied to each visited vertex. The operation is often simple but can also be more complex. We give a series of algorithms for adjacency list traversal to demonstrate how good performance and parallelization can be achieved while using some generic constructs in the MTGL. All of the adjacency list traversal algorithms are for a compressed sparse row (CSR) format graph. The code for the adjacency list traversal algorithms is located at test/test_adjacency_iteration.cpp. To compile this program:

cd ../test
c++ -I.. -o test_adjacency_iteration test_adjacency_iteration.cpp -lm -lprand

The prand library contains the Cray XMT's native random number generators.

We describe five algorithms for visiting the adjacencies of a list of vertices.

  1. Directly accessing the CSR structure.
  2. Directly accessing the CSR structure using a function pointer.
  3. Directly accessing the CSR structure using a function object.
  4. Using the MTGL graph API and a function object.
  5. Manual load balancing using a function object.

Algorithm 1 - Directly Accessing the CSR Structure

Consider Algorithm 1 shown below.

template <typename Graph>
void
count_adjacencies_higher_id(Graph& g,
                            typename graph_traits<Graph>::size_type* indeg)
{
  typedef typename graph_traits<Graph>::size_type size_type;

  size_type order = g.get_order();
  size_type* index = g.get_index();
  size_type* end_points = g.get_end_points();

  #pragma mta assert noalias *indeg
  for (size_type i = 0; i < order; ++i)
  {
    size_type begin = index[i];
    size_type end = index[i + 1];

    for (size_type j = begin; j < end; ++j)
    {
      if (i < end_points[j]) mt_incr(indeg[end_points[j]], 1);
    }
  }
}

The algorithm grabs the index and end points arrays from the CSR graph implementation and traverses the adjacencies of all the vertices in the graph to determine, for each vertex, the number of adjacent vertices with a higher id. It is a simple nested loop whose body is a single if statement.
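For concreteness, here is a minimal usage sketch. The graph construction is elided, and the zero-initialized indeg array is an assumption about what the caller provides; the actual driver is in test_adjacency_iteration.cpp.

typedef compressed_sparse_row_graph<directedS> Graph;
typedef graph_traits<Graph>::size_type size_type;

Graph g;
// ... build g from an edge list (elided; see the test driver) ...

size_type order = g.get_order();
size_type* indeg = new size_type[order];
for (size_type i = 0; i < order; ++i) indeg[i] = 0;  // start all counts at zero

count_adjacencies_higher_id(g, indeg);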

Now look at the canal output for the nested loop.

           |   #pragma mta assert noalias *indeg
           |   for (size_type i = 0; i < order; ++i)
           |   {
           |     size_type begin = index[i];
           |     size_type end = index[i + 1];
           | 
           |     for (size_type j = begin; j < end; ++j)
           |     {
  104 PP:m |       if (i < end_points[j]) mt_incr(indeg[end_points[j]], 1);
++ function T1 mtgl::mt_incr<T1, T2> inlined
           |     }
           |   }

Note the PP:m marking to the left of the if statement. The capital P's mean the compiler automatically parallelized both levels of the loop nest. The m means the compiler automatically performed a manhattan loop collapse.
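Conceptually, the collapse flattens the doubly nested loop into a single loop over all of the adjacencies, which balances the work across threads even when vertex degrees vary widely. A rough illustration is below; it is not the compiler's actual output, and owner_of() is a hypothetical helper that maps an adjacency position back to its source vertex.

// Illustration only -- not the compiler's actual output.
size_type total = index[order];  // total number of adjacencies
for (size_type k = 0; k < total; ++k)
{
  size_type i = owner_of(k);  // hypothetical: the vertex whose adjacency range contains k
  if (i < end_points[k]) mt_incr(indeg[end_points[k]], 1);
}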

To see the real cost of this loop, we need to search through the loop annotations for the loops (including compiler generated ones) that belong to the function count_adjacencies_higher_id(). There are four pertinent loops: 38, 52, 66, and 104.

Loops 38, 52, and 66 represent a recurrence. Loop 38 is stage 1 of the recurrence. It performs 3 instructions including 1 load and 1 store. Loop 52 is the communication phase for stage 1 of the recurrence. It performs 13 instructions including 1 load, 1 store, 1 reload, and 3 branches. Loop 66 is stage 2 of the recurrence. It performs 2 instructions including 1 load and 1 store.

Loop  38 in void count_adjacencies_higher_id<T1>(T1 &, mtgl::graph_traits<T1>::size_type *)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 24
       Stage 1 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		3 instructions, needs 44 streams for full utilization
		pipelined

Loop  52 in void count_adjacencies_higher_id<T1>(T1 &, mtgl::graph_traits<T1>::size_type *)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 24
       Stage 1 recurrence communication
       Loop not pipelined: not structurally ok
       Compiler generated
       Loop summary: 13 instructions, 0 floating point operations
		1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls

Loop  66 in void count_adjacencies_higher_id<T1>(T1 &, mtgl::graph_traits<T1>::size_type *)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 24
       Stage 2 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		2 instructions, needs 90 streams for full utilization
		pipelined

Loop 104 is the main work performed by the body of the inner loop. It performs 6 instructions including 1 load, 1 store, and 2 branches.

Loop 104 in void count_adjacencies_higher_id<T1>(T1 &, mtgl::graph_traits<T1>::size_type *)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] at line 78 in loop 92
       Loop not pipelined: not structurally ok
       Parallel section of loop from level 2
       Loop summary: 6 instructions, 0 floating point operations
		1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls

Algorithm 2 - Directly Accessing the CSR Structure Using a Function Pointer

But what happens if a user desires to do something more complicated? Consider Algorithm 2 shown below.

typedef bool (*pred_t)(size_type, size_type);

#pragma mta inline
inline bool test_pred_func(size_type i, size_type j)
{
  return (i < j);
}

template <typename Graph>
void
count_adjacencies_higher_id_func_ptr(
    Graph& g, typename graph_traits<Graph>::size_type* indeg, pred_t my_func)
{
  typedef typename graph_traits<Graph>::size_type size_type;

  size_type order = g.get_order();
  size_type* index = g.get_index();
  size_type* end_points = g.get_end_points();

  #pragma mta assert noalias *indeg
  #pragma mta assert parallel
  for (size_type i = 0; i < order; ++i)
  {
    size_type begin = index[i];
    size_type end = index[i + 1];

    #pragma mta assert parallel
    for (size_type j = begin; j < end; ++j)
    {
      if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);
    }
  }
}

This algorithm does the same thing as Algorithm 1, but it performs the id test by calling a function via a function pointer. The function test_pred_func is the function passed to count_adjacencies_higher_id_func_ptr() later in the code. A function pointer is the natural C method to allow a user to perform a generic operation on each of the adjacencies.
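The call site then looks something like the following sketch (with g and indeg set up as in Algorithm 1).

count_adjacencies_higher_id_func_ptr(g, indeg, test_pred_func);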

Looking at the canal output, we see that the compiler doesn't inline the function called through the function pointer, and we have to use assert parallel pragmas to get the compiler to parallelize the code. This is noted by the pp:m marking to the left of the function call. The small p's mean that the compiler could not automatically parallelize the loop but that the user forced parallelization with a pragma. The number of instructions and memory references increases significantly.

           |   #pragma mta assert noalias *indeg
           |   #pragma mta assert parallel
           |   for (size_type i = 0; i < order; ++i)
           |   {
           |     size_type begin = index[i];
           |     size_type end = index[i + 1];
           | 
           |     #pragma mta assert parallel
           |     for (size_type j = begin; j < end; ++j)
           |     {
   80 pp:m |       if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);
++ function T1 mtgl::mt_incr<T1, T2> inlined
           |     }
           |   }

Looking at the loop annotations, there are four pertinent loops: 39, 53, 67, and 80.

Loops 39, 53, and 67 represent a recurrence. Loop 39 is stage 1 of the recurrence. It performs 3 instructions including 1 load and 1 store. Loop 53 is the communication phase for stage 1 of the recurrence. It performs 13 instructions including 1 load, 1 store, 1 reload, and 3 branches. Loop 67 is stage 2 of the recurrence. It performs 2 instructions including 1 load and 1 store. This part has the same cost as Algorithm 1.

Loop  39 in void count_adjacencies_higher_id_func_ptr<T1>(T1 &, mtgl::graph_traits<T1>::size_type *,
                                                          bool (*)(unsigned long, unsigned long))
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 25
       Stage 1 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		3 instructions, needs 44 streams for full utilization
		pipelined

Loop  53 in void count_adjacencies_higher_id_func_ptr<T1>(T1 &, mtgl::graph_traits<T1>::size_type *,
                                                          bool (*)(unsigned long, unsigned long))
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 25
       Stage 1 recurrence communication
       Loop not pipelined: not structurally ok
       Compiler generated
       Loop summary: 13 instructions, 0 floating point operations
		1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls

Loop  67 in void count_adjacencies_higher_id_func_ptr<T1>(T1 &, mtgl::graph_traits<T1>::size_type *,
                                                          bool (*)(unsigned long, unsigned long))
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] in loop 25
       Stage 2 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		2 instructions, needs 90 streams for full utilization
		pipelined

Loop 80 is the main work performed by the body of the inner loop. It performs 25 instructions including 4 loads, 1 store, 12 reloads, 2 branches, and 2 calls. This is a significant increase in instructions and memory operations over the previous version!

Loop  80 in void count_adjacencies_higher_id_func_ptr<T1>(T1 &, mtgl::graph_traits<T1>::size_type *,
                                                          bool (*)(unsigned long, unsigned long))
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>] at line 112 in region 6
       In parallel phase 3
       Interleave scheduled
       General loop collapse
       Loop not pipelined: not structurally ok
       Loop from level 2 combined with this loop
       Loop summary: 25 instructions, 0 floating point operations
		4 loads, 1 stores, 12 reloads, 0 spills, 2 branches, 2 calls

Algorithm 3 - Directly Accessing the CSR Structure Using a Function Object

When programming in C++, a natural method to allow a user to perform a generic operation is a function object. Consider Algorithm 3 shown below. This algorithm is the same as the previous one except that it performs the id test using a function object. The function object test_pred_func_obj is passed to count_adjacencies_higher_id_func_obj() later in the code.

template <typename Graph>
class test_pred_func_obj {
public:
  typedef typename graph_traits<Graph>::size_type size_type;

  inline bool operator()(size_type i, size_type j) { return (i < j); }
};

template <typename Graph, typename pred>
void
count_adjacencies_higher_id_func_obj(
    Graph& g, typename graph_traits<Graph>::size_type* indeg, pred my_func)
{
  typedef typename graph_traits<Graph>::size_type size_type;

  size_type order = g.get_order();
  size_type* index = g.get_index();
  size_type* end_points = g.get_end_points();

  #pragma mta assert noalias *indeg
  for (size_type i = 0; i < order; ++i)
  {
    size_type begin = index[i];
    size_type end = index[i + 1];

    for (size_type j = begin; j < end; ++j)
    {
      if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);
    }
  }
}
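The call site passes an instance of the function object, as in this sketch (again with g and indeg set up as in Algorithm 1).

test_pred_func_obj<Graph> my_pred;
count_adjacencies_higher_id_func_obj(g, indeg, my_pred);

Because the predicate's type is a template parameter, the compiler sees the body of operator() at the call site and can inline it, unlike the call through the function pointer in Algorithm 2.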

Looking at the canal output, we see that once again the compiler was able to automatically parallelize the loop and perform a manhattan loop collapse. It was also able to inline the call to the function object, which is promising.

           |   #pragma mta assert noalias *indeg
           |   for (size_type i = 0; i < order; ++i)
           |   {
           |     size_type begin = index[i];
           |     size_type end = index[i + 1];
           | 
           |     for (size_type j = begin; j < end; ++j)
           |     {
  105 PP:m |       if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);
++ function T1 mtgl::mt_incr<T1, T2> inlined
++ function bool test_pred_func_obj<T1>::operator  inlined
           |     }
           |   }

Looking at the loop annotations, there are four pertinent loops: 40, 54, 68, and 105.

Loops 40, 54, and 68 represent a recurrence. Loop 40 is stage 1 of the recurrence. It performs 3 instructions including 1 load and 1 store. Loop 54 is the communication phase for stage 1 of the recurrence. It performs 13 instructions including 1 load, 1 store, 1 reload, and 3 branches. Loop 68 is stage 2 of the recurrence. It performs 2 instructions including 1 load and 1 store.

Loop  40 in void count_adjacencies_higher_id_func_obj<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>,
      T2=test_pred_func_obj<mtgl::compressed_sparse_row_graph<mtgl::directedS>>] in loop 26
       Stage 1 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		3 instructions, needs 44 streams for full utilization
		pipelined

Loop  54 in void count_adjacencies_higher_id_func_obj<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>,
      T2=test_pred_func_obj<mtgl::compressed_sparse_row_graph<mtgl::directedS>>] in loop 26
       Stage 1 recurrence communication
       Loop not pipelined: not structurally ok
       Compiler generated
       Loop summary: 13 instructions, 0 floating point operations
		1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls

Loop  68 in void count_adjacencies_higher_id_func_obj<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>,
      T2=test_pred_func_obj<mtgl::compressed_sparse_row_graph<mtgl::directedS>>] in loop 26
       Stage 2 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		2 instructions, needs 90 streams for full utilization
		pipelined

Loop 105 is the main work performed by the body of the inner loop. It performs 6 instructions including 1 load, 1 store, and 2 branches. This is once again the same amount of work as the first version.

Loop 105 in void count_adjacencies_higher_id_func_obj<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>,
      T2=test_pred_func_obj<mtgl::compressed_sparse_row_graph<mtgl::directedS>>] at line 144 in loop 93
       Loop not pipelined: not structurally ok
       Parallel section of loop from level 2
       Loop summary: 6 instructions, 0 floating point operations
		1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls

Algorithm 4 - Using the MTGL Graph API and a Function Object

Now, consider Algorithm 4 shown below. This is the same as Algorithm 3 (uses a function object), but it uses the syntactic sugar of the MTGL graph API. The graph used is the compressed_sparse_row_graph, which is the MTGL’s implementation of a compressed sparse row graph.

template <typename Graph, typename predicate>
void count_adjacencies_higher_id_mtgl(
    Graph& g, typename graph_traits<Graph>::size_type* indeg, predicate f)
{
  typedef typename graph_traits<Graph>::size_type size_type;
  typedef typename graph_traits<Graph>::vertex_descriptor vertex_descriptor;
  typedef typename graph_traits<Graph>::vertex_iterator vertex_iterator;
  typedef typename graph_traits<Graph>::adjacency_iterator adjacency_iterator;

  vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);
  size_type order = num_vertices(g);
  vertex_iterator verts = vertices(g);

  #pragma mta assert noalias *indeg
  for (size_type i = 0; i < order; ++i)
  {
    vertex_descriptor u = verts[i];
    size_type uid = get(vid_map, u);
    adjacency_iterator adjs = adjacent_vertices(u, g);
    size_type end = out_degree(u, g);

    for (size_type j = 0; j < end; ++j)
    {
      vertex_descriptor v = adjs[j];
      size_type vid = get(vid_map, v);

      if (f(uid, vid)) mt_incr(indeg[vid], 1);
    }
  }
}

Looking at the canal output, we see that the compiler was still able to automatically parallelize the loop and perform a manhattan loop collapse.

           |   #pragma mta assert noalias *indeg
           |   for (size_type i = 0; i < order; ++i)
           |   {
           |     vertex_descriptor u = verts[i];
++ function T1 mtgl::detail::csr_vertex_iterator<T1, T2>::operator [] inlined
           |     size_type uid = get(vid_map, u);
++ function T2 mtgl::get<T1, T2, T3> inlined
  106 P:m  |     adjacency_iterator adjs = adjacent_vertices(u, g);
++ function mtgl::compressed_sparse_row_graph<T1>::adjacency_iterator mtgl::adjacent_vertices<T1> inlined
           |     size_type end = out_degree(u, g);
++ function mtgl::compressed_sparse_row_graph<T1>::size_type mtgl::out_degree<T1> inlined
           | 
           |     for (size_type j = 0; j < end; ++j)
           |     {
           |       vertex_descriptor v = adjs[j];
++ function T1 mtgl::detail::csr_adjacency_iterator<T1, T2>::operator [] inlined
  106 PP:m |       size_type vid = get(vid_map, v);
++ function T2 mtgl::get<T1, T2, T3> inlined
           | 
  106 PP:m |       if (f(uid, vid)) mt_incr(indeg[vid], 1);
++ function T1 mtgl::mt_incr<T1, T2> inlined
++ function bool test_pred_func_obj<T1>::operator  inlined
           |     }
           |   }

Looking at the loop annotations, there are four pertinent loops: 41, 55, 69, and 106.

Loops 41, 55, and 69 represent a recurrence. Loop 41 is stage 1 of the recurrence. It performs 3 instructions including 1 load and 1 store. Loop 55 is the communication phase for stage 1 of the recurrence. It performs 13 instructions including 1 load, 1 store, 1 reload, and 3 branches. Loop 69 is stage 2 of the recurrence. It performs 2 instructions including 1 load and 1 store.

Loop  41 in void count_adjacencies_higher_id_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>,
      T2=test_pred_func_obj<mtgl::compressed_sparse_row_graph<mtgl::directedS>>] in loop 27
       Stage 1 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		3 instructions, needs 44 streams for full utilization
		pipelined

Loop  55 in void count_adjacencies_higher_id_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>,
      T2=test_pred_func_obj<mtgl::compressed_sparse_row_graph<mtgl::directedS>>] in loop 27
       Stage 1 recurrence communication
       Loop not pipelined: not structurally ok
       Compiler generated
       Loop summary: 13 instructions, 0 floating point operations
		1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls

Loop  69 in void count_adjacencies_higher_id_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>,
      T2=test_pred_func_obj<mtgl::compressed_sparse_row_graph<mtgl::directedS>>] in loop 27
       Stage 2 of recurrence
       Compiler generated
       Loop summary: 1 loads, 1 stores, 0 floating point operations
		2 instructions, needs 90 streams for full utilization
		pipelined

Loop 106 is the main work performed by the body of the inner loop. It performs 6 instructions including 1 load, 1 store, and 2 branches. Thus, for this graph adapter the MTGL version performed the same number of instructions and memory references as directly accessing the CSR structure in Algorithm 1. The compiler was able to eliminate the extra operations seemingly implied by the syntactic sugar, and it was able to automatically parallelize the loops involving the iterators. In this case we achieved our dual goals of generic code and good parallel performance.

Loop 106 in void count_adjacencies_higher_id_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2)
[with T1=mtgl::compressed_sparse_row_graph<mtgl::directedS>,
      T2=test_pred_func_obj<mtgl::compressed_sparse_row_graph<mtgl::directedS>>] at line 167 in loop 94
       Loop not pipelined: not structurally ok
       Parallel section of loop from level 2
       Loop summary: 6 instructions, 0 floating point operations
		1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls

Algorithm 5 - Manual Load Balancing Using a Function Object

The operation applied to each adjacency in the above algorithms is quite simple. What happens if the operation becomes more complex? It is possible for a user to choose an operation that the compiler cannot parallelize well, or at all. In that case parallelism can still be achieved by manually dividing the adjacencies into equal-sized chunks and assigning the chunks to threads. Consider the diagram of Algorithm 5 shown below. The first row of boxes is the queue of vertices whose adjacencies we want to visit. The second row of boxes is the vertices array of the CSR graph. The third row of boxes is the adjacencies array of the CSR graph. Algorithm 5 performs the precomputation necessary to divide the adjacencies evenly into chunks and manually assigns those chunks to threads. This algorithm has overhead that the others do not, but it allows parallelization in more circumstances.
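The actual code is in the test program; the following is only a minimal sketch of the chunking idea, assuming the operation is a function object as in Algorithm 3. It treats the adjacencies as one flat array, carves that array into equal-sized chunks, and binary searches the CSR index array to find the vertex that owns the start of each chunk.

template <typename Graph, typename Visitor>
void visit_adjacencies_balanced(Graph& g, Visitor visit,
    typename graph_traits<Graph>::size_type num_chunks)
{
  typedef typename graph_traits<Graph>::size_type size_type;

  size_type order = g.get_order();
  size_type* index = g.get_index();
  size_type* end_points = g.get_end_points();
  size_type total = index[order];  // total number of adjacencies
  size_type chunk = (total + num_chunks - 1) / num_chunks;

  #pragma mta assert parallel
  for (size_type c = 0; c < num_chunks; ++c)
  {
    size_type first = c * chunk;
    size_type last = first + chunk < total ? first + chunk : total;

    // Binary search for the vertex whose adjacency range contains first.
    size_type lo = 0;
    size_type hi = order;
    while (lo + 1 < hi)
    {
      size_type mid = (lo + hi) / 2;
      if (index[mid] <= first) lo = mid;
      else hi = mid;
    }

    size_type u = lo;
    for (size_type j = first; j < last; ++j)
    {
      while (index[u + 1] <= j) ++u;  // advance to the vertex owning adjacency j
      visit(u, end_points[j]);
    }
  }
}

Because every chunk covers the same number of adjacencies, a few very high degree vertices can no longer serialize the traversal; this is the property that pays off in the breadth first search results below.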

Performance Results

The five algorithms presented for visiting the adjacencies of the vertices of a graph were tested against two graphs. The first is an Erdos-Renyi random graph with two billion edges. The second is a graph of realistic data with 0.5 billion edges. The Erdos-Renyi graph is an example of simple synthetic data, while the realistic data graph is an example of more complex data. The figures below show the time in seconds it took to perform the five algorithms on each of the graphs for varying numbers of processors. The results for the Erdos-Renyi graph are on the left, while the results for the realistic data graph are on the right. The algorithms performed similarly on both the simple and realistic data. Algorithms 1, 3, and 4 performed roughly the same as each other, and the fully generic algorithm that manually chunks the adjacencies (Algorithm 5) ran about two to three times slower than the better algorithms. The algorithm that used the function pointer (Algorithm 2), however, performed horribly. All the algorithms except Algorithm 2 scaled up to 64 processors.

Breadth First Search

Breadth first search (BFS) is another common graph algorithm. It utilizes the kernel of performing an operation on each of the adjacencies of a list of vertices. The best performing algorithm for BFS is the one developed by Petr Konecny in 2007. Consider the diagram below. Q is a circular queue that contains the search vertices. Threads read the current level of vertices from Q and also write the next level of vertices to Q. A is the virtual adjacency list for the current level of vertices in Q. Initialize Q with the source vertex. For each level of the search, divide the current level's vertices in Q into equal-sized chunks. Each thread grabs the next unprocessed vertex chunk and the next output chunk in Q. Each thread visits the adjacencies of every vertex in its input chunk, writing the next level of vertices to its output chunk. Threads grab new output chunks as needed and fill the unused portion of their output chunks with a marker that indicates no vertex. When a thread encounters a no vertex marker in its input chunk, it skips to the next entry.
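The sketch below shows one level of the search in simplified form. It is not Petr's implementation: the queue is linear rather than circular, the chunk size and the color array convention (zero means unvisited, and the source vertex is marked before the first level) are assumptions, and MTGL's mt_incr stands in for the XMT's atomic fetch-and-add.

template <typename Graph>
typename graph_traits<Graph>::size_type
expand_level(Graph& g, typename graph_traits<Graph>::size_type* Q,
             typename graph_traits<Graph>::size_type level_begin,
             typename graph_traits<Graph>::size_type level_end,
             typename graph_traits<Graph>::size_type* color)
{
  typedef typename graph_traits<Graph>::size_type size_type;

  const size_type CHUNK = 256;                 // assumed chunk size
  const size_type NO_VERTEX = (size_type) -1;  // the "no vertex" marker

  size_type* index = g.get_index();
  size_type* end_points = g.get_end_points();
  size_type next_tail = level_end;             // the next level starts here in Q

  #pragma mta assert parallel
  for (size_type c = level_begin; c < level_end; c += CHUNK)
  {
    size_type in_end = c + CHUNK < level_end ? c + CHUNK : level_end;

    // Grab an output chunk from the shared tail.
    size_type out = mt_incr(next_tail, CHUNK);
    size_type out_end = out + CHUNK;

    for (size_type i = c; i < in_end; ++i)
    {
      size_type u = Q[i];
      if (u == NO_VERTEX) continue;            // skip padding markers

      for (size_type j = index[u]; j < index[u + 1]; ++j)
      {
        size_type v = end_points[j];

        if (mt_incr(color[v], 1) == 0)         // the first visit claims v
        {
          if (out == out_end)                  // grab another output chunk
          {
            out = mt_incr(next_tail, CHUNK);
            out_end = out + CHUNK;
          }
          Q[out++] = v;
        }
      }
    }

    while (out < out_end) Q[out++] = NO_VERTEX;  // pad the unused portion
  }

  return next_tail;  // end of the next level, including padding
}

Note that the input chunks are chunks of vertices, not of adjacencies; this is exactly the load balancing weakness discussed in the performance results below.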

Performance Results

We tested two versions of Petr’s algorithm. The first is a C implementation, and the second is an MTGL-ized version. We also tested a breadth first search algorithm called visit adj that uses Algorithm 5 described above to divide the adjacencies between the threads at each search level. The three algorithms were tested against the Erdos-Renyi graph and the realistic data graph. The diagrams below show the time in seconds it took to perform the three algorithms on each of the graphs for varying numbers of processors. The results for the Erdos-Renyi graph are on the left while the results for the realistic data graph are on the right.

For the simple data, the C and MTGL versions of Petr's algorithm perform roughly the same and scale up to 64 processors. The visit adj algorithm starts out about two times slower than the other two algorithms, and it does not scale as well. For the realistic data, the MTGL version of Petr's algorithm achieves no speedup beyond one processor. The visit adj algorithm performs much better on this data set, and it scales up to 16 processors.

The realistic data tests reveal the one known issue with Petr's algorithm. The algorithm chunks only the vertices and not their adjacencies. When the degrees of the vertices do not vary too widely, or when there are many vertices in a search level, this is not an issue: the similar degrees, or the sheer number of vertices, allow for good load balancing. If a graph has a few vertices of extremely high degree while most of the vertices have small degree, however, Petr's algorithm chokes when a high degree vertex is discovered in an early level of the search, essentially serializing that level. The realistic data has this property, and the visit adj algorithm, which manually balances the number of adjacencies assigned to each thread, performs much better.
