Changes between Version 4 and Version 5 of BreadthFirstSearch


Timestamp: 02/22/10 23:53:45
Author: gemacke

}
}}}

The algorithm is a simple C implementation of traversing the adjacencies of all the vertices in a graph to determine, for each vertex, the number of adjacent vertices that have a higher id. It is a simple nested loop whose body is a simple if statement.
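
For illustration, a loop nest of this shape over a compressed sparse row (CSR) graph might look like the following sketch, where the array names (index, end_points) and the result convention are assumptions for illustration, not the listing above:

{{{
// Hedged sketch: index[i]..index[i+1] is assumed to delimit vertex i's
// adjacencies in end_points; names are illustrative, not the code above.
void count_higher_id_adjacencies(unsigned long n, const unsigned long* index,
                                 const unsigned long* end_points,
                                 unsigned long* result)
{
  for (unsigned long i = 0; i < n; ++i)
  {
    for (unsigned long j = index[i]; j < index[i + 1]; ++j)
    {
      // Count the adjacencies of vertex i that have a higher id.
      if (end_points[j] > i) ++result[i];
    }
  }
}
}}}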

Now look at the canal output.

{{{
}}}

Note the PP:m marking to the left of the if statement. The capital P’s mean the compiler automatically parallelized both levels of the loops. The m means the compiler automatically performed a manhattan loop collapse.
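
Conceptually, a manhattan loop collapse flattens the doubly nested loop into a single loop over all of the adjacencies, giving the runtime one large, evenly divisible iteration space even when vertex degrees vary. A hand-written equivalent might look like the following sketch, where the src array (the source vertex of each adjacency) is an assumption added for illustration:

{{{
// Hedged illustration of the effect of a manhattan loop collapse: one
// flat loop over all m adjacencies instead of a nested loop.  The src
// array (source vertex of each adjacency) is assumed for illustration.
void count_higher_id_collapsed(unsigned long m, const unsigned long* src,
                               const unsigned long* end_points,
                               unsigned long* result)
{
  for (unsigned long j = 0; j < m; ++j)
  {
    if (end_points[j] > src[j]) ++result[src[j]];
  }
}
}}}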

Looking at the annotation for loop 4, we see that the body of the loop performs only two memory references and a total of six instructions.

{{{
}}}

But what happens if a user desires to do something more complicated? Consider Algorithm 2 shown below.

{{{
}
}}}

This algorithm does the same thing as Algorithm 1, but it performs the id test by calling a function via a function pointer. A function pointer is the natural C method to allow a user to perform a generic operation on each of the adjacencies.
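
A sketch of the function-pointer variant is below, again with illustrative names; the pragma shown is the Cray MTA/XMT style for asserting that a loop is safe to parallelize, since the compiler cannot prove independence through the indirect call:

{{{
// Hedged sketch, not the listing above.  The test is supplied through a
// function pointer, which the compiler typically cannot inline.
typedef bool (*adj_test)(unsigned long v, unsigned long adj);

bool higher_id(unsigned long v, unsigned long adj) { return adj > v; }

void count_adjacencies_fp(unsigned long n, const unsigned long* index,
                          const unsigned long* end_points,
                          unsigned long* result, adj_test test)
{
  // Assumed Cray-style pragma: the user asserts the loop is parallel
  // because the compiler cannot prove it through the indirect call.
  #pragma mta assert parallel
  for (unsigned long i = 0; i < n; ++i)
  {
    for (unsigned long j = index[i]; j < index[i + 1]; ++j)
    {
      if (test(i, end_points[j])) ++result[i];
    }
  }
}
}}}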

Looking at the canal output and the annotation for loop 6, we see that the compiler cannot inline the function called by the function pointer, and the number of instructions and memory references increases significantly. In addition, note the pp:m marking to the left of the function call. The small p’s mean that the compiler could not automatically parallelize the loop but that the user forced parallelization using a pragma statement.

{{{
}}}

When programming in C++, a natural method to allow a user to perform a generic operation is a function object. Consider Algorithm 3 shown below. This algorithm is the same as the previous ones except it performs the id test using a function object.

{{{
}
}}}
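
A sketch of the function-object approach is below, with hypothetical names rather than the MTGL source; because the call operator is visible to the compiler at the call site, it can be inlined:

{{{
// Hedged sketch with illustrative names.  The test is a function object
// whose call operator the compiler can see and inline.
class higher_id_test
{
public:
  bool operator()(unsigned long v, unsigned long adj) const
  {
    return adj > v;
  }
};

template <typename Test>
void count_adjacencies_obj(unsigned long n, const unsigned long* index,
                           const unsigned long* end_points,
                           unsigned long* result, Test test)
{
  for (unsigned long i = 0; i < n; ++i)
  {
    for (unsigned long j = index[i]; j < index[i + 1]; ++j)
    {
      // After inlining, this is the same loop body as Algorithm 1.
      if (test(i, end_points[j])) ++result[i];
    }
  }
}
}}}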

Looking at the canal output, we see that once again the compiler was able to automatically parallelize the loop and perform a manhattan loop collapse.

{{{
}}}

Looking at the annotation for loop 24, we see that the number of memory references and instructions is the same as for Algorithm 1.

{{{
Loop  24 in void compute_in_degree_func_obj<T1>(mtgl::static_graph<(int)1> *, unsigned long *, T1)
}}}

Now, consider Algorithm 4 shown below. This is the same as Algorithm 3 (uses a function object), but it uses the syntactic sugar of the MTGL graph API. The graph adapter used is a static graph adapter, which is the MTGL’s implementation of a compressed sparse row graph.

{{{
}
}}}
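
To give the flavor of such an API-based version, a generic sketch is shown below. The traits-style free functions used here (num_vertices, out_degree, adjacent_vertex) are hypothetical BGL-style stand-ins, not verified MTGL signatures:

{{{
// Hedged sketch only: every identifier below other than the loop logic
// is a hypothetical stand-in for a BGL-style generic graph interface.
template <typename Graph, typename Test>
void compute_in_degree_api(Graph& g, unsigned long* result, Test test)
{
  unsigned long n = num_vertices(g);          // hypothetical free function

  for (unsigned long i = 0; i < n; ++i)
  {
    unsigned long deg = out_degree(i, g);     // hypothetical

    for (unsigned long j = 0; j < deg; ++j)
    {
      // adjacent_vertex(i, j, g) is a hypothetical accessor for the
      // j-th adjacency of vertex i.  With everything inlined, this
      // collapses to the same loop nest as the hand-written C version.
      if (test(i, adjacent_vertex(i, j, g))) ++result[i];
    }
  }
}
}}}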

Looking at the canal output and the annotation for loop 25, we see that, for this graph adapter, the MTGL version performed the same number of instructions and memory references as Algorithm 1. The compiler was able to eliminate the extra operations seemingly implied by the syntactic sugar, and it was able to automatically parallelize the loops involving the iterators. In this case we achieved our dual goals of generic code and good parallel performance.

{{{
}}}

The operation applied to each adjacency in the above algorithms is quite simple. What happens if the operation becomes more complex? It is possible for a user to choose an operation that the compiler may not be able to parallelize well or at all. In this case parallelism could still be achieved by manually dividing the adjacencies into equal-sized chunks and assigning the chunks to threads. Consider the diagram shown below, which represents Algorithm 5. The first row of boxes is the queue of vertices whose adjacencies we want to visit. The second row of boxes is the vertices array of the CSR graph. The third row of boxes is the adjacencies array of the CSR graph. Algorithm 5 performs the precomputation necessary to evenly divide the adjacencies into chunks and manually assigns those chunks to threads. This algorithm has overhead the others do not, but it allows parallelization in more circumstances.

[[Image(mtgl_alg5.png, 60%, align=center)]]
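
Because the CSR index array is already a prefix sum of vertex degrees, the chunk boundaries can be found with a binary search. A sketch of that precomputation, with assumed names, is:

{{{
#include <algorithm>

// Hedged sketch of the chunking precomputation.  chunk_start[c] receives
// the vertex that owns adjacency number c * chunk_size; each thread then
// processes a fixed range of adjacencies regardless of vertex degree.
void chunk_adjacencies(unsigned long n, const unsigned long* index,
                       unsigned long num_chunks, unsigned long* chunk_start)
{
  unsigned long total = index[n];                      // total adjacencies
  unsigned long chunk_size = (total + num_chunks - 1) / num_chunks;

  for (unsigned long c = 0; c < num_chunks; ++c)
  {
    // First vertex v with index[v] <= c * chunk_size < index[v + 1].
    chunk_start[c] =
      std::upper_bound(index, index + n + 1, c * chunk_size) - index - 1;
  }
}
}}}

A thread assigned chunk c then starts partway through the adjacency list of vertex chunk_start[c], which is the extra bookkeeping, and the overhead, that the other algorithms avoid.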
     
== Performance Results ==

The five algorithms presented for visiting the adjacencies of the vertices of a graph were tested against two graphs. The first is an Erdos-Renyi random graph of two billion edges. The second is a graph of realistic data with 0.5 billion edges. The Erdos-Renyi graph is an example of simple synthetic data, while the realistic data graph is an example of more complex data. The figures below show the time in seconds it took to perform the five algorithms on each of the graphs for varying numbers of processors. The results for the Erdos-Renyi graph are on the left, while the results for the realistic data graph are on the right. The algorithms performed similarly for both the simple and realistic data. Algorithms 1, 3, and 4 performed roughly the same as each other, and the fully generic algorithm that manually chunks the adjacencies (Algorithm 5) performed about two to three times slower than the better algorithms. The algorithm that used the function pointer (Algorithm 2), however, performed horribly. All the algorithms except for Algorithm 2 scaled up to 64 processors.

[[Image(adjlist_performance_er.png, 50%, align=left)]]
[[Image(adjlist_performance_rd.png, 50%)]]

= Breadth First Search =

Breadth first search (BFS) is another common graph algorithm. It utilizes the kernel of performing an operation on each of the adjacencies of a list of vertices. The best performing algorithm for BFS is the one developed by Petr Konecny in 2007. Consider the diagram below. Q is a circular queue that contains the search vertices. Threads read the current level of vertices from Q and also write the next level of vertices to Q. A is the virtual adjacency list for the current level of vertices in Q. Q is initialized with the source vertex. For each level of the search, the current level of vertices in Q is divided into equal-sized chunks. Each thread grabs the next unprocessed vertex chunk and the next output chunk in Q. Each thread visits the adjacencies of every vertex in its input chunk, writing the next level of vertices to its output chunk. Threads grab new output chunks as needed and fill the unused portion of their output chunks with a marker to indicate no vertex. When a thread encounters a no vertex marker in its input chunk, it skips to the next entry.

[[Image(konecny_bfs.png, 50%)]]
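
To make the chunk and marker mechanics concrete, below is a greatly simplified, single-threaded sketch of one level expansion. The names, the chunk size, and the visited array are assumptions, and the real algorithm claims input and output chunks atomically from many threads rather than in a serial loop:

{{{
// Hedged, serial sketch of the chunked level expansion.  In the parallel
// version, claim_output_chunk() would use an atomic fetch-and-add.
const unsigned long NO_VERTEX = (unsigned long) -1;   // "no vertex" marker
const unsigned long CHUNK = 1024;                     // assumed chunk size

struct bfs_queue
{
  unsigned long* buf;        // circular buffer Q of search vertices
  unsigned long size;        // capacity of Q
  unsigned long next_out;    // start of the next unclaimed output chunk

  unsigned long claim_output_chunk()
  {
    unsigned long c = next_out;
    next_out += CHUNK;
    return c;
  }
};

// Expands level [lvl_begin, lvl_end) of Q into the next level and returns
// the end of the next level.  q.next_out is assumed to equal lvl_end on entry.
unsigned long expand_level(const unsigned long* index,
                           const unsigned long* end_points,
                           bfs_queue& q, unsigned long lvl_begin,
                           unsigned long lvl_end, bool* visited)
{
  // Each iteration of this outer loop plays the role of one thread
  // claiming the next unprocessed input chunk.
  for (unsigned long in = lvl_begin; in < lvl_end; in += CHUNK)
  {
    unsigned long out = q.claim_output_chunk();
    unsigned long pos = out;
    unsigned long in_end = (in + CHUNK < lvl_end) ? in + CHUNK : lvl_end;

    for (unsigned long k = in; k < in_end; ++k)
    {
      unsigned long v = q.buf[k % q.size];
      if (v == NO_VERTEX) continue;                // skip filler entries

      for (unsigned long j = index[v]; j < index[v + 1]; ++j)
      {
        unsigned long w = end_points[j];
        if (visited[w]) continue;
        visited[w] = true;

        if (pos == out + CHUNK)                    // output chunk full:
        {
          out = q.claim_output_chunk();            // grab a new one
          pos = out;
        }
        q.buf[pos++ % q.size] = w;                 // write next level vertex
      }
    }

    while (pos < out + CHUNK)                      // pad unused tail
      q.buf[pos++ % q.size] = NO_VERTEX;
  }

  return q.next_out;
}
}}}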

== Performance Results ==

We tested two versions of Petr’s algorithm. The first is a C implementation, and the second is an MTGL-ized version. We also tested a breadth first search algorithm called visit adj that uses Algorithm 5 described above to divide the adjacencies between the threads at each search level. The three algorithms were tested against the Erdos-Renyi graph and the realistic data graph. The figures below show the time in seconds it took to perform the three algorithms on each of the graphs for varying numbers of processors. The results for the Erdos-Renyi graph are on the left, while the results for the realistic data graph are on the right.

[[Image(bfs_performance_er.png, 50%, align=left)]]
[[Image(bfs_performance_rd.png, 50%)]]

For the simple data, the C and MTGL versions of Petr’s algorithm perform roughly the same and scale up to 64 processors. The visit adj algorithm initially performs about two times slower than the other two algorithms, but it does not scale very well. For the realistic data, the MTGL version of Petr’s algorithm achieves no speedup from one processor. The visit adj algorithm performs much better for this data set, and it scales up to 16 processors.

The realistic data tests reveal the one known issue with Petr’s algorithm. The algorithm chunks only the vertices and not their adjacencies. When the degree of the vertices does not vary too widely or there are many vertices in the search level, this is not an issue. The similar degree vertices or a large number of vertices allow for good load balancing. If a graph has a few vertices of extremely high degree while most of the vertices have small degree, however, Petr’s algorithm chokes on the high degree vertices if they are discovered in early levels of the search and essentially serializes the level with the high degree vertex. The realistic data has this property, and the visit adj algorithm that manually load balances the number of adjacencies assigned to each thread performs much better.