Changes between Version 5 and Version 6 of BreadthFirstSearch


Ignore:
Timestamp:
04/22/10 12:33:55 (9 years ago)
Author:
gemacke
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • BreadthFirstSearch

    v5 v6  
    66 
    77{{{ 
    8 void compute_in_degree(static_graph<directedS>* g, size_type* indeg) 
     8void count_adjacencies_higher_id(static_graph<directedS>* g, size_type* indeg) 
    99{ 
    1010  size_type i, j; 
     
    2727The algorithm is the simple C implementation of traversing the adjacencies of all the vertices in a graph to determine for each vertex the number of adjacent vertices that have a higher id. It is a simple nested loop whose body is an simple if statement. 
    2828 
    29 Now look at the canal output. 
     29Now look at the canal output for the nested loop. 
    3030 
    3131{{{ 
     
    3737           |     for (j = begin; j < end; j++) 
    3838           |     { 
    39     4 PP:m |       if (i < end_points[j]) mt_incr(indeg[end_points[j]], 1); 
     39    8 PP:m |       if (i < end_points[j]) mt_incr(indeg[end_points[j]], 1); 
    4040++ function T1 mtgl::mt_incr<T1, T2> inlined 
    4141           |     } 
     
    4545Note the PP:m marking to the left of the if statement. The capital P ’s mean the compiler automatically parallelized both levels of the loops. The m means the compiler automatically performed a manhattan loop collapse. 
    4646 
    47 Looking at the annotation for loop 4, we see that the body of the loop performs only two memory references and a total of six instructions. 
    48  
    49 {{{ 
    50 Loop   4 in compute_in_degree(mtgl::static_graph<(int)1> *, unsigned long *) at line 64 in loop 3 
     47To see the real cost of this loop, we need to search through the loop annotations to see what loops (including compiler generated ones) there are for the function count_adjacencies_higher_id().  There are four pertinent loops: 3, 4, 5, and 8. 
     48 
     49Loops 3, 4, and 5 represent a recurrence.  Loop 3 is stage 1 of the recurrence.  It performs 3 instructions including 1 load and 1 store.  Loop 4 is the communication phase for stage 1 of the recurrence.  It performs 13 instructions including 1 load, 1 store, 1 reload, and 3 branches.  Loop 5 is stage 2 of the recurrence.  It performs 2 instructions including 1 load and 1 store. 
     50 
     51{{{ 
     52Loop   3 in count_adjacencies_higher_id(mtgl::static_graph<(int)1> *, unsigned long *) in loop 2 
     53       Stage 1 of recurrence 
     54       Compiler generated 
     55       Loop summary: 1 loads, 1 stores, 0 floating point operations 
     56                3 instructions, needs 44 streams for full utilization 
     57                pipelined 
     58 
     59Loop   4 in count_adjacencies_higher_id(mtgl::static_graph<(int)1> *, unsigned long *) in loop 2 
     60       Stage 1 recurrence communication 
     61       Loop not pipelined: not structurally ok 
     62       Compiler generated 
     63       Loop summary: 13 instructions, 0 floating point operations 
     64                1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls 
     65 
     66Loop   5 in count_adjacencies_higher_id(mtgl::static_graph<(int)1> *, unsigned long *) in loop 2 
     67       Stage 2 of recurrence 
     68       Compiler generated 
     69       Loop summary: 1 loads, 1 stores, 0 floating point operations 
     70                2 instructions, needs 90 streams for full utilization 
     71                pipelined 
     72}}} 
     73 
     74Loop 8 is the main work performed by the body of the inner loop.  It performs 6 instructions including 1 load, 1 store, and 2 branches. 
     75 
     76{{{ 
     77Loop   8 in count_adjacencies_higher_id(mtgl::static_graph<(int)1> *, unsigned long *) at line 64 in loop 7 
    5178       Loop not pipelined: not structurally ok 
    5279       Parallel section of loop from level 2 
    5380       Loop summary: 6 instructions, 0 floating point operations 
    54     1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls 
     81                1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls 
    5582}}} 
    5683 
     
    6693} 
    6794 
    68 void compute_in_degree_func_ptr(static_graph<directedS>* g, 
    69                                 size_type* indeg, pred_t my_func) 
     95void count_adjacencies_higher_id_func_ptr(static_graph<directedS>* g, 
     96                                          size_type* indeg, pred_t my_func) 
    7097{ 
    7198  size_type i, j; 
     
    88115}}} 
    89116 
    90 This algorithm does the same thing as Algorithm 1, but it performs the id test by calling a function via a function pointer. A function pointer is the natural C method to allow a user to perform a generic operation on each of the adjacencies. 
    91  
    92 Looking at the canal output and the annotation for loop 6, we see that the compiler cannot inline the function called by the function pointer, and the number of instructions and memory references increases significantly. In addition, note the pp:m marking to the left of the function call. The small p’s mean that the compiler could not automatically parallelize the loop but that the user forced parallelization using a pragma statement. 
     117This algorithm does the same thing as Algorithm 1, but it performs the id test by calling a function via a function pointer. The function test_pred_func is the function passed to count_adjacencies_higher_id_func_ptr() later in the code.  A function pointer is the natural C method to allow a user to perform a generic operation on each of the adjacencies. 
     118 
     119Looking at the canal output, we see that the compiler doesn't inline the function called by the function pointer, and we have to use assert parallel pragmas to get the compiler to parallelize the code.  This is noted by the pp:m marking to the left of the function call.  The small p’s mean that the compiler could not automatically parallelize the loop but that the user forced parallelization using a pragma statement.  The number of instructions and memory references increases significantly. 
    93120 
    94121{{{ 
    95122           |   #pragma mta assert noalias *indeg 
    96123           |   #pragma mta assert parallel 
    97            |   for (i = 0; i < order; i++)  
    98            |   {  
     124           |   for (i = 0; i < order; i++) 
     125           |   { 
    99126           |     size_type begin = index[i]; 
    100127           |     size_type end = index[i + 1]; 
     
    102129           |     for (j = begin; j < end; j++) 
    103130           |     { 
    104     6 pp:m |       if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);         
     131   14 pp:m |       if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1); 
    105132++ function T1 mtgl::mt_incr<T1, T2> inlined 
    106133           |     } 
     
    108135}}} 
    109136 
    110 {{{ 
    111 Loop   6 in compute_in_degree_func_ptr(mtgl::static_graph<(int)1> *, unsigned long *, 
    112 bool (*)(unsigned long, unsigned long)) at line 93 in region 5 
     137Looking at the loop annotations, there are four pertinent loops: 11, 12, 13, and 14. 
     138 
     139Loops 11, 12, and 13 represent a recurrence.  Loop 11 is stage 1 of the recurrence.  It performs 3 instructions including 1 load and 1 store.  Loop 12 is the communication phase for stage 1 of the recurrence.  It performs 13 instructions including 1 load, 1 store, 1 reload, and 3 branches.  Loop 13 is stage 2 of the recurrence.  It performs 2 instructions including 1 load and 1 store. 
     140 
     141{{{ 
     142Loop  11 in count_adjacencies_higher_id_func_ptr(mtgl::static_graph<(int)1> *, unsigned long *, 
     143                                                 bool (*)(unsigned long, unsigned long)) in loop 10 
     144       Stage 1 of recurrence 
     145       Compiler generated 
     146       Loop summary: 1 loads, 1 stores, 0 floating point operations 
     147                3 instructions, needs 44 streams for full utilization 
     148                pipelined 
     149 
     150Loop  12 in count_adjacencies_higher_id_func_ptr(mtgl::static_graph<(int)1> *, unsigned long *, 
     151                                                 bool (*)(unsigned long, unsigned long)) in loop 10 
     152       Stage 1 recurrence communication 
     153       Loop not pipelined: not structurally ok 
     154       Compiler generated 
     155       Loop summary: 13 instructions, 0 floating point operations 
     156                1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls 
     157 
     158Loop  13 in count_adjacencies_higher_id_func_ptr(mtgl::static_graph<(int)1> *, unsigned long *, 
     159                                                 bool (*)(unsigned long, unsigned long)) in loop 10 
     160       Stage 2 of recurrence 
     161       Compiler generated 
     162       Loop summary: 1 loads, 1 stores, 0 floating point operations 
     163                2 instructions, needs 90 streams for full utilization 
     164                pipelined 
     165}}} 
     166 
     167Loop 14 is the main work performed by the body of the inner loop.  It performs 25 instructions including 4 loads, 1 store, 12 reloads, and 2 branches.  This is a significant increase in instructions and memory operations over the previous version! 
     168 
     169{{{ 
     170Loop  14 in count_adjacencies_higher_id_func_ptr(mtgl::static_graph<(int)1> *, unsigned long *, 
     171                                                 bool (*)(unsigned long, unsigned long)) at line 99 in region 9 
    113172       In parallel phase 3 
    114173       Interleave scheduled 
     
    117176       Loop from level 2 combined with this loop 
    118177       Loop summary: 25 instructions, 0 floating point operations 
    119     4 loads, 1 stores, 12 reloads, 0 spills, 2 branches, 2 calls 
    120 }}} 
    121  
    122 When programming in C++, a natural method to allow a user to perform a generic operation is a function object. Consider Algorithm 3 shown below.  This algorithm is the same as the previous ones except it performs the id test using a function object. 
    123  
    124 {{{ 
    125 class test_predC { 
     178                4 loads, 1 stores, 12 reloads, 0 spills, 2 branches, 2 calls 
     179}}} 
     180 
     181When programming in C++, a natural method to allow a user to perform a generic operation is a function object. Consider Algorithm 3 shown below.  This algorithm is the same as the previous ones except it performs the id test using a function object.  The function object test_pred_func_obj is passed to count_adjacencies_higher_id_func_obj later in the code. 
     182 
     183{{{ 
     184class test_pred_func_obj { 
    126185public: 
    127186  inline bool operator()(size_type i, size_type j) { return (i < j); } 
     
    129188 
    130189template <class pred> 
    131 void compute_in_degree_func_obj(static_graph<directedS>* g, 
    132                                 size_type* indeg, pred my_func) 
     190void count_adjacencies_higher_id_func_obj(static_graph<directedS>* g, 
     191                                          size_type* indeg, pred my_func) 
    133192{ 
    134193  size_type order = g->n; 
     
    158217           |     for (size_type j = begin; j < end; j++) 
    159218           |     { 
    160    24 PP:m |       if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1);         
     219   90 PP:m |       if (my_func(i, end_points[j])) mt_incr(indeg[end_points[j]], 1); 
    161220++ function T1 mtgl::mt_incr<T1, T2> inlined 
    162 ++ function test_predC::operator  inlined           |     } 
     221++ function test_pred_func_obj::operator  inlined 
     222           |     } 
    163223           |   } 
    164224}}} 
    165225 
    166 Looking at the annotation for loop 24, we see that the number of memory references and instructions is the same as Algorithm 1. 
    167  
    168 {{{ 
    169 Loop  24 in void compute_in_degree_func_obj<T1>(mtgl::static_graph<(int)1> *, unsigned long *, T1) 
    170 [with T1=test_predC] at line 117 in loop 21 
     226Looking at the loop annotations, there are four pertinent loops: 43, 53, 63, and 90. 
     227 
     228Loops 43, 53, and 63 represent a recurrence.  Loop 43 is stage 1 of the recurrence.  It performs 3 instructions including 1 load and 1 store.  Loop 53 is the communication phase for stage 1 of the recurrence.  It performs 13 instructions including 1 load, 1 store, 1 reload, and 3 branches.  Loop 63 is stage 2 of the recurrence.  It performs 2 instructions including 1 load and 1 store. 
     229 
     230{{{ 
     231Loop  43 in void count_adjacencies_higher_id_func_obj<T1>(mtgl::static_graph<(int)1> *, unsigned long *, T1) 
     232[with T1=test_pred_func_obj] in loop 33 
     233       Stage 1 of recurrence 
     234       Compiler generated 
     235       Loop summary: 1 loads, 1 stores, 0 floating point operations 
     236                3 instructions, needs 44 streams for full utilization 
     237                pipelined 
     238 
     239Loop  53 in void count_adjacencies_higher_id_func_obj<T1>(mtgl::static_graph<(int)1> *, unsigned long *, T1) 
     240[with T1=test_pred_func_obj] in loop 33 
     241       Stage 1 recurrence communication 
     242       Loop not pipelined: not structurally ok 
     243       Compiler generated 
     244       Loop summary: 13 instructions, 0 floating point operations 
     245                1 loads, 1 stores, 1 reloads, 0 spills, 3 branches, 0 calls 
     246 
     247Loop  63 in void count_adjacencies_higher_id_func_obj<T1>(mtgl::static_graph<(int)1> *, unsigned long *, T1) 
     248[with T1=test_pred_func_obj] in loop 33 
     249       Stage 2 of recurrence 
     250       Compiler generated 
     251       Loop summary: 1 loads, 1 stores, 0 floating point operations 
     252                2 instructions, needs 90 streams for full utilization 
     253                pipelined 
     254}}} 
     255 
     256Loop 90 is the main work performed by the body of the inner loop.  It performs 6 instructions including 1 load, 1 store, and 2 branches.  This is once again the same amount of work as the first version. 
     257 
     258{{{ 
     259Loop  90 in void count_adjacencies_higher_id_func_obj<T1>(mtgl::static_graph<(int)1> *, unsigned long *, T1) 
     260[with T1=test_pred_func_obj] at line 123 in loop 81 
    171261       Loop not pipelined: not structurally ok 
    172262       Parallel section of loop from level 2 
    173263       Loop summary: 6 instructions, 0 floating point operations 
    174     1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls 
     264                1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls 
    175265}}} 
    176266 
     
    179269{{{ 
    180270template <class Graph, class predicate> 
    181 void compute_in_degree_mtgl(Graph& g, 
    182                             typename graph_traits<Graph>::size_type* indeg, 
    183                             predicate f) 
     271void count_adjacencies_higher_id_mtgl( 
     272    Graph& g, typename graph_traits<Graph>::size_type* indeg, predicate f) 
    184273{ 
    185274  typedef typename graph_traits<Graph>::size_type size_type; 
     
    208297}}} 
    209298 
    210 Looking at the canal output and the annotation for loop 25, we see that the MTGL version performed the same number of instructions and memory references as for this graph adapter as Algorithm 1. The compiler was able to eliminate the extra operations seemingly implied by the syntactic sugar, and it was able to automatically parallelize the loops involving the iterators. In this case we achieved our dual goals of generic code and good parallel performance. 
    211  
    212 {{{ 
    213            |   #pragma mta assert noalias *indeg           |   for (size_type i = 0; i < order; i++) 
     299Looking at the canal output, we see that the compiler was still able to automatically parallelize the loop and perform a manhattan loop collapse. 
     300 
     301{{{ 
     302           |   #pragma mta assert noalias *indeg 
     303           |   for (size_type i = 0; i < order; i++) 
    214304           |   { 
    215            |     vertex_t u = verts[i];++ function T1 mtgl::static_vertex_iterator<T1, T2>::operator [] inlined 
    216            |     size_type uid = get(vid_map, u);++ function T2 mtgl::get<T1, T2, T3> inlined 
    217    25 P:m  |     adj_iter_t adjs = adjacent_vertices(u, g); 
     305           |     vertex_t u = verts[i]; 
     306++ function T1 mtgl::static_vertex_iterator<T1, T2>::operator [] inlined 
     307           |     size_type uid = get(vid_map, u); 
     308++ function T2 mtgl::get<T1, T2, T3> inlined 
     309   91 P:m  |     adj_iter_t adjs = adjacent_vertices(u, g); 
    218310++ function mtgl::static_graph_adapter<N1>::adjacency_iterator mtgl::adjacent_vertices<N1> inlined 
    219            |     size_type end = out_degree(u, g); 
     311   64 P:e  |     size_type end = out_degree(u, g); 
    220312++ function mtgl::static_graph_adapter<N1>::size_type mtgl::out_degree<N1> inlined 
    221313           |     for (size_type j = 0; j < end; j++) 
     
    223315           |       vertex_t v = adjs[j]; 
    224316++ function T1 mtgl::static_adjacency_iterator<T1, T2>::operator [] inlined 
    225    25 PP:m |       size_type vid = get(vid_map, v); 
     317   91 PP:m |       size_type vid = get(vid_map, v); 
    226318++ function T2 mtgl::get<T1, T2, T3> inlined 
    227    25 PP:m |       if (f(uid, vid)) mt_incr(indeg[vid], 1); 
     319   91 PP:m |       if (f(uid, vid)) mt_incr(indeg[vid], 1); 
    228320++ function T1 mtgl::mt_incr<T1, T2> inlined 
    229 ++ function test_predC::operator  inlined 
     321++ function test_pred_func_obj::operator  inlined 
    230322           |     } 
    231323           |   } 
    232324}}} 
    233325 
    234 {{{ 
    235 Loop  25 in void compute_in_degree_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2) 
    236 [with T1=mtgl::static_graph_adapter<(int)1>, T2=test_predC] at line 140 in loop 22 
     326Looking at the loop annotations, there are four pertinent loops: 44, 54, 64, and 91. 
     327 
     328Loops 44, 54, and 64 represent a recurrence.  Loop 44 is stage 1 of the recurrence.  It performs 2 instructions including 1 load.  Loop 54 is the communication phase for stage 1 of the recurrence.  It performs 11 instructions including 1 load, 1 store, and 3 branches.  Loop 64 is stage 2 of the recurrence.  It performs 3 instructions including 1 load and 2 stores.  Stage 1 of the recurrence requires fewer instructions and memory operations than Algorithm 1, but stage 2 requires 1 more instruction and memory operation than Algorithm 1. 
     329 
     330{{{ 
     331Loop  44 in void count_adjacencies_higher_id_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2) 
     332[with T1=mtgl::static_graph_adapter<(int)1>, T2=test_pred_func_obj] at line 146 in region 20 
     333       Stage 1 of recurrence 
     334       Loop summary: 1 loads, 0 stores, 0 floating point operations 
     335                2 instructions, needs 45 streams for full utilization 
     336                pipelined 
     337 
     338Loop  54 in void count_adjacencies_higher_id_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2) 
     339[with T1=mtgl::static_graph_adapter<(int)1>, T2=test_pred_func_obj] in region 20 
     340       Stage 1 recurrence communication 
     341       Loop not pipelined: not structurally ok 
     342       Compiler generated 
     343       Loop summary: 11 instructions, 0 floating point operations 
     344                1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls 
     345 
     346Loop  64 in void count_adjacencies_higher_id_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2) 
     347[with T1=mtgl::static_graph_adapter<(int)1>, T2=test_pred_func_obj] at line 146 in region 20 
     348       Stage 2 of recurrence 
     349       Loop summary: 1 loads, 2 stores, 0 floating point operations 
     350                3 instructions, needs 60 streams for full utilization 
     351                pipelined 
     352}}} 
     353 
     354Loop 91 is the main work performed by the body of the inner loop.  It performs 6 instructions including 1 load, 1 store, and 2 branches.  Thus, the MTGL version performed the same number of instructions and memory references as for this graph adapter as directly accessing the CSR structure in Algorithm 1.  The compiler was able to eliminate the extra operations seemingly implied by the syntactic sugar, and it was able to automatically parallelize the loops involving the iterators. In this case we achieved our dual goals of generic code and good parallel performance. 
     355 
     356{{{ 
     357Loop  91 in void count_adjacencies_higher_id_mtgl<T1, T2>(T1 &, mtgl::graph_traits<T1>::size_type *, T2) 
     358[with T1=mtgl::static_graph_adapter<(int)1>, T2=test_pred_func_obj] at line 145 in loop 82 
    237359       Loop not pipelined: not structurally ok 
    238360       Parallel section of loop from level 2 
    239361       Loop summary: 6 instructions, 0 floating point operations 
    240     1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls 
     362                1 loads, 1 stores, 0 reloads, 0 spills, 2 branches, 0 calls 
    241363}}} 
    242364