Changes between Version 31 and Version 32 of FirstMtglExample


Timestamp: 04/14/10 13:35:13
Author: gemacke

Legend: each row below is prefixed with its v31 and v32 line numbers.  Unmodified rows carry both numbers, removed rows only a v31 number, and added rows only a v32 number; modified rows appear as a removed/added pair.
  • FirstMtglExample

v31  v32
44   44   }}}
45   45
46        Next, we want to create an instance of a static_graph and initialize it.
     46   Next, we want to create an instance of a static_graph, initialize it, and delete the temporary srcs and dests arrays.
47   47
48   48   {{{
49   49
50   50     static_graph<directedS> g;
51   51     g.init(n, m, srcs, dests);
     52
     53     delete [] srcs;
     54     delete [] dests;
52   55   }}}
53   56
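For context, here is a minimal sketch of the setup the new v32 text assumes (variable names from the tutorial; the tiny example graph is ours, and the claim that init() copies the arrays is inferred from the fact that the tutorial deletes srcs and dests immediately after init()):

{{{
  unsigned long n = 3;   // number of vertices
  unsigned long m = 2;   // number of edges
  unsigned long* srcs  = new unsigned long[m];
  unsigned long* dests = new unsigned long[m];
  srcs[0] = 0;  dests[0] = 1;   // edge 0 -> 1
  srcs[1] = 1;  dests[1] = 2;   // edge 1 -> 2

  static_graph<directedS> g;
  g.init(n, m, srcs, dests);    // the graph makes its own copy of the edge lists

  delete [] srcs;               // safe only because init() copied them
  delete [] dests;
}}}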
     
101  104  }}}
102  105
103       If we look at the code using the XMT's canal (Compiler ANALysis) tool, we see that the compiler performed a scalar to vector expansion (P:e) for begin and end.  It performed a manhattan loop collapse (PP:m), removing the reduction entirely from the loop.
     106  The XMT programming environment provides the canal (Compiler ANALysis) tool to let a user view how the compiler parallelized their code.  To use the command-line version of canal for our programs, issue the following command, which places the canal output in the file example1.canal.
     107
     108  {{{
     109  canal -pl example1.pl example1.cpp > example1.canal
     110  }}}
     111
     112  Looking at the canal output, we see that the compiler performed a scalar-to-vector expansion (P:e) for begin and end.  It performed a manhattan loop collapse (PP:m), removing the reduction entirely from the loop.  It also atomicized the access to id_sum to make the increment operation thread-safe.  This is necessary because id_sum is both an LHS and an RHS, even though that is hidden by the syntax of the += operator.
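To make the hidden read-modify-write explicit, here is what the += statement desugars to (an illustration, not compiler output):

{{{
  // id_sum += g.end_points[j];  is shorthand for:
  unsigned long tmp = id_sum;        // id_sum read as an RHS
  id_sum = tmp + g.end_points[j];    // id_sum written as an LHS
  // Two streams interleaving the read and the write could lose an update,
  // so the compiler replaces the pair with one atomic fetch-and-add.
}}}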
104  113
105  114  {{{
106  115              |   for (unsigned long u = 0; u < n; ++u)
107  116              |   {
108           20 P     |     unsigned long begin = g.index[u];
109           19 P:e   +
110           20 P     |     unsigned long end = g.index[u+1];
111           19 P:e   +
112                    |
113           23 PP:m$ |     for (unsigned long j = begin; j < end; ++j) id_sum += g.end_points[j];
     117     169 P     |     unsigned long begin = g.index[u];
     118     157 P:e   +
     119     169 P     |     unsigned long end = g.index[u+1];
     120     157 P:e   +
     121               |
     122     185 PP:m$ |     for (unsigned long j = begin; j < end; ++j) id_sum += g.end_points[j];
114  123  ** reduction moved out of 2 loops
115  124              |   }
116  125  }}}
117  126
118       Looking at the annotation for loop 20, we see that 4 instructions and 4 memory operations will be performed for the scalar to vector expansion.
119
120       {{{
121       Loop  20 in main at line 61 in region 18
     127  Diving into the canal loop annotations, we see that parallel region 145 contains all of the translated code to execute the nested for loops.  The recurrence was removed from the loops, so it's separate from the execution of the loops but in the same parallel region.  There are several important loops to look at in this region, including one that is shown only in the additional loop details section and not in the annotated source code listing.
     128
     129  Loop 157 is stage 1 of the recurrence.  Looking at its annotation, we see that it takes 2 instructions with 1 load.
     130
     131  {{{
     132  Loop 157 in main at line 62 in region 145
     133         Stage 1 of recurrence
     134         Loop summary: 1 loads, 0 stores, 0 floating point operations
     135                  2 instructions, needs 45 streams for full utilization
     136                  pipelined
     137  }}}
     138
     139  Loop 163, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing.  Looking at its annotation, we see that it takes 11 instructions with 1 load, 1 store, and 3 branches.
     140
     141  {{{
     142  Loop 163 in main in region 145
     143         Stage 1 recurrence communication
     144         Loop not pipelined: not structurally ok
     145         Compiler generated
     146         Loop summary: 11 instructions, 0 floating point operations
     147                  1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
     148  }}}
     149
     150  Loop 169 is stage 2 of the recurrence.  Looking at its annotation, we see that it takes 4 instructions with 1 load and 3 stores.
     151
     152  {{{
     153  Loop 169 in main at line 62 in region 145
122  154         Stage 2 of recurrence
123             Loop summary: 4 memory operations, 0 floating point operations
124                 4 instructions, needs 45 streams for full utilization
125                 pipelined
126       }}}
127
128       Looking at the annotation for loop 23, we see that there's only 1 instruction and 1 memory operation for the body of the inner loop.
129
130       {{{
131       Loop  23 in main at line 64 in loop 22
     155         Loop summary: 1 loads, 3 stores, 0 floating point operations
     156                  4 instructions, needs 45 streams for full utilization
     157                  pipelined
     158  }}}
     159
     160  Loop 185 is the collapsed version of the work in the inner loop minus the work for the reduction.  Looking at its annotation, we see that it takes 1 instruction with 1 load.
     161
     162  {{{
     163  Loop 185 in main at line 65 in loop 180
132  164         Parallel section of loop from level 2
133             Loop summary: 1 memory operations, 0 floating point operations
134                 1 instructions, needs 50 streams for full utilization
135                 pipelined
     165         Loop summary: 1 loads, 0 stores, 0 floating point operations
     166                  1 instructions, needs 50 streams for full utilization
     167                  pipelined
136  168  }}}
137  169
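A note on the terminology: canal reports the hoisted reduction as a staged recurrence.  Roughly, stage 1 computes a private partial result per stream, the communication loop combines the per-stream partials, and stage 2 finishes the remaining per-iteration work using the combined values.  Here is a serial sketch of the first two stages for our sum (our illustration with a hypothetical num_streams, not the compiler's generated code):

{{{
unsigned long staged_sum(const unsigned long* vals, unsigned long m,
                         unsigned long num_streams)
{
  unsigned long* partial = new unsigned long[num_streams];

  // Stage 1: each stream accumulates a private sum over its slice.
  for (unsigned long s = 0; s < num_streams; ++s)
  {
    partial[s] = 0;
    unsigned long lo = s * m / num_streams;
    unsigned long hi = (s + 1) * m / num_streams;
    for (unsigned long j = lo; j < hi; ++j) partial[s] += vals[j];
  }

  // Recurrence communication: combine the per-stream partials.
  unsigned long sum = 0;
  for (unsigned long s = 0; s < num_streams; ++s) sum += partial[s];

  delete [] partial;
  return sum;
}
}}}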
     
262  294  The output should be the same as that from the first example.
263  295
264       If we look at the code using canal, we see that the compiler performed a scalar to vector expansion for end.  Like in the non-MTGL version, it performed a manhattan loop collapse, removing the reduction entirely from the loop.  Notice that all of the calls to MTGL functions have been inlined.
     296  To get the canal output, issue the following command:
     297
     298  {{{
     299  canal -pl example2.pl example2.cpp > example2.canal
     300  }}}
     301
     302  If we look at the code using canal, we see that the compiler performed a scalar-to-vector expansion for end.  There is no begin scalar to expand: iterators in MTGL always start at 0, so MTGL never stores a begin value.  Like in the non-MTGL version, it performed a manhattan loop collapse, removing the reduction entirely from the loop, and it atomicized access to id_sum.  Notice that all of the calls to MTGL functions have been inlined.
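The difference is visible side by side in the two loop nests (both taken from the annotated listings in this tutorial):

{{{
// example1.cpp: both bounds are loaded from the index array.
unsigned long begin = g.index[u];
unsigned long end = g.index[u+1];
for (unsigned long j = begin; j < end; ++j) id_sum += g.end_points[j];

// example2.cpp: the adjacency iterator is already positioned at the
// vertex's first neighbor, so only the upper bound is needed.
size_type end = out_degree(u, g);
for (size_type j = 0; j < end; j++)
{
  vertex_descriptor v = adjs[j];
  size_type vid = get(vid_map, v);
  id_sum += vid;
}
}}}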
265  303
266  304  {{{

273  311              |     vertex_descriptor u = verts[i];
274  312  ++ function T1 mtgl::static_vertex_iterator<T1, T2>::operator [] inlined
275           23 P:m   |     adjacency_iterator adjs = adjacent_vertices(u, g);
     313     185 P:m   |     adjacency_iterator adjs = adjacent_vertices(u, g);
276  314  ++ function mtgl::static_graph_adapter<N1>::adjacency_iterator mtgl::adjacent_vertices<N1> inlined
277           20 P:e   |     size_type end = out_degree(u, g);
     315     169 P:e   |     size_type end = out_degree(u, g);
278  316  ++ function mtgl::static_graph_adapter<N1>::size_type mtgl::out_degree<N1> inlined
279  317              |     for (size_type j = 0; j < end; j++)
280  318              |     {
281           23 PP:m  |       vertex_descriptor v = adjs[j];
     319     185 PP:m  |       vertex_descriptor v = adjs[j];
282  320  ++ function T1 mtgl::static_adjacency_iterator<T1, T2>::operator [] inlined
283           23 PP:m  |       size_type vid = get(vid_map, v);
     321     185 PP:m  |       size_type vid = get(vid_map, v);
284  322  ++ function T2 mtgl::get<T1, T2, T3> inlined
285           23 PP:m$ |       id_sum += vid;
     323     185 PP:m$ |       id_sum += vid;
286  324  ** reduction moved out of 2 loops
287  325              |     }

289  327  }}}
290  328
291       Looking at the annotation for loop 20, we see that 3 instructions and 3 memory operations will be performed for the scalar to vector expansion.  This is one less instruction and memory operation than the non-MTGL version.  The reason why is that only the end scalar needs to be expanded.  Since iterators in MTGL always start at 0, the cost of the scalar to vector expansion for the beginning bound disappears.
292
293       {{{
294       Loop  20 in main at line 75 in region 18
     329  Now, let's dive into the canal output.  Lucky for us, the loop numbering is the same as in the first example, with all the work occurring in parallel region 145.
     330
     331  Loop 157 is stage 1 of the recurrence; this time it is not shown in the annotated source code listing.  Looking at its annotation, we see that it takes 2 instructions with 1 load.
     332
     333  {{{
     334  Loop 157 in main at line 76 in region 145
     335         Stage 1 of recurrence
     336         Loop summary: 1 loads, 0 stores, 0 floating point operations
     337                  2 instructions, needs 45 streams for full utilization
     338                  pipelined
     339  }}}
     340
     341  Loop 163, which is the communication for stage 1 of the recurrence, is also not shown in the annotated source code listing.  Looking at its annotation, we see that it takes 11 instructions with 1 load, 1 store, and 3 branches.
     342
     343  {{{
     344  Loop 163 in main in region 145
     345         Stage 1 recurrence communication
     346         Loop not pipelined: not structurally ok
     347         Compiler generated
     348         Loop summary: 11 instructions, 0 floating point operations
     349                  1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
     350  }}}
     351
     352  Loop 169 is stage 2 of the recurrence.  Looking at its annotation, we see that it takes 3 instructions with 1 load and 2 stores.  This is one less instruction and memory operation than the non-MTGL version, probably because the MTGL version doesn't need to store the begin value since iterators in MTGL always start at 0.
     353
     354  {{{
     355  Loop 169 in main at line 76 in region 145
295  356         Stage 2 of recurrence
296             Loop summary: 3 memory operations, 0 floating point operations
297                 3 instructions, needs 60 streams for full utilization
298                 pipelined
299       }}}
300
301       Looking at the annotation for loop 23, we see that there are only 1 instruction and 1 memory operation for the body of the inner loop.  This is the same as the non-MTGL version.
302
303       {{{
304       Loop  23 in main at line 74 in loop 22
     357         Loop summary: 1 loads, 2 stores, 0 floating point operations
     358                  3 instructions, needs 60 streams for full utilization
     359                  pipelined
     360  }}}
     361
     362  Loop 185 is the collapsed version of the work in the inner loop minus the work for the reduction.  Looking at its annotation, we see that it takes 1 instruction with 1 load.
     363
     364  {{{
     365  Loop 185 in main at line 75 in loop 180
305  366         Parallel section of loop from level 2
306             Loop summary: 1 memory operations, 0 floating point operations
307                 1 instructions, needs 50 streams for full utilization
308                 pipelined
     367         Loop summary: 1 loads, 0 stores, 0 floating point operations
     368                  1 instructions, needs 50 streams for full utilization
     369                  pipelined
309  370  }}}
310  371
     
371  432  Note that you don't have to specify the type of g when calling either of the templated functions.  The compiler is able to automatically infer the types of template parameters for template functions when those types are used in the function parameter list.
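For instance, a minimal illustration (using the Graph typedef and graph g from this example):

{{{
  size_type s1 = compute_id_sum(g);          // T1 deduced from the argument
  size_type s2 = compute_id_sum<Graph>(g);   // equivalent, type spelled out
}}}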
372  433
373       Now lets look at the canal output for the function compute_id_sum().  We see that the compiler performed a manhattan loop collapse, removing the reduction entirely from the loop, just like before.  Notice that all of the calls to MTGL functions are still inlined.  But in this version, the compiler didn't perform a scalar to vector expansion for end.  It appears to have completely removed the cost for this operation.  This is not due to anything special in MTGL, however, it is due to the compiler.  If we put the id sum code from example1.cpp into a function, the compiler would do the same thing.  We're not sure why the compiler seems to perform better when the code is in a function versus directly in main().
     434  Now let's look at the canal output for the function compute_id_sum().  We see that the compiler performed a manhattan loop collapse, removing the reduction entirely from the loop, just like before.  The compiler also performed a scalar-to-vector expansion for end and atomicized the access to id_sum.
374  435
375  436  {{{
     
382  443              |     vertex_descriptor u = verts[i];
383  444  ++ function T1 mtgl::static_vertex_iterator<T1, T2>::operator [] inlined
384           12 P:m   |     adjacency_iterator adjs = adjacent_vertices(u, g);
     445      99 P:m   |     adjacency_iterator adjs = adjacent_vertices(u, g);
385  446  ++ function mtgl::static_graph_adapter<N1>::adjacency_iterator mtgl::adjacent_vertices<N1> inlined
386                    |     size_type end = out_degree(u, g);
     447      61 P:e   |     size_type end = out_degree(u, g);
387  448  ++ function mtgl::static_graph_adapter<N1>::size_type mtgl::out_degree<N1> inlined
388  449              |     for (size_type j = 0; j < end; j++)
389  450              |     {
390           12 PP:m  |       vertex_descriptor v = adjs[j];
     451      99 PP:m  |       vertex_descriptor v = adjs[j];
391  452  ++ function T1 mtgl::static_adjacency_iterator<T1, T2>::operator [] inlined
392           12 PP:m  |       size_type vid = get(vid_map, v);
     453      99 PP:m  |       size_type vid = get(vid_map, v);
393  454  ++ function T2 mtgl::get<T1, T2, T3> inlined
394           12 PP:m$ |       id_sum += vid;
     455      99 PP:m$ |       id_sum += vid;
395  456  ** reduction moved out of 2 loops
396  457              |     }

398  459  }}}
399  460
400       Looking at the annotation for loop 12, we see that there's only 1 instruction and 1 memory operation for the body of the inner loop.  This is the same as before.
401
402       {{{
403       Loop  12 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
404       [with T1=mtgl::static_graph_adapter<(int)1>] at line 59 in loop 11
     461  In this case, the loop numbering assigned by the compiler is different.  The important parallel region is 3.
     462
     463  Loop 35 is stage 1 of the recurrence, and it is not shown in the annotated source code listing.  Looking at its annotation, we see that it takes 2 instructions with 1 load.
     464
     465  {{{
     466  Loop  35 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
     467  at line 60 in region 3
     468         Stage 1 of recurrence
     469         Loop summary: 1 loads, 0 stores, 0 floating point operations
     470                  2 instructions, needs 45 streams for full utilization
     471                  pipelined
     472  }}}
     473
     474  Loop 48, which is the communication for stage 1 of the recurrence, is not shown in the annotated source code listing.  Looking at its annotation, we see that it takes 11 instructions with 1 load, 1 store, and 3 branches.
     475
     476  {{{
     477  Loop  48 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
     478  in region 3
     479         Stage 1 recurrence communication
     480         Loop not pipelined: not structurally ok
     481         Compiler generated
     482         Loop summary: 11 instructions, 0 floating point operations
     483                  1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls
     484  }}}
     485
     486  Loop 61 is stage 2 of the recurrence.  Looking at its annotation, we see that it takes 3 instructions with 1 load and 2 stores.
     487
     488  {{{
     489  Loop  61 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
     490  at line 60 in region 3
     491         Stage 2 of recurrence
     492         Loop summary: 1 loads, 2 stores, 0 floating point operations
     493                  3 instructions, needs 60 streams for full utilization
     494                  pipelined
     495  }}}
     496
     497  Loop 99 is the collapsed version of the work in the inner loop minus the work for the reduction.  Looking at its annotation, we see that it takes 1 instruction with 1 load.
     498
     499  {{{
     500  Loop  99 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>]
     501  at line 59 in loop 87
405  502         Parallel section of loop from level 2
406             Loop summary: 1 memory operations, 0 floating point operations
407                 1 instructions, needs 50 streams for full utilization
408                 pipelined
409       }}}
     503         Loop summary: 1 loads, 0 stores, 0 floating point operations
     504                  1 instructions, needs 50 streams for full utilization
     505                  pipelined
     506  }}}
     507
     508  The costs of all of these loops are identical to the ones in example 2.
410  509
411  510  = 4. Calling the Templated Function on Different Graph Types =
412  511
413       We have a generic version of the id sum code that can be used on multiple graph adapter types.  Let's do that now.  The complete code for this example is at [source:trunk/tutorial/example4.cpp tutorial/example4.cpp].  We will be using another adapter provided by MTGL, [source:trunk/mtgl/graph_adapter.hpp graph_adapter].  The graph_adapter is a mutable adjacency list graph implementation.  To utilize graph_adapter, we must include it like this
414
415       {{{
416       #include <mtgl/graph_adapter.hpp>
     512  We have a generic version of the id sum code that can be used on multiple graph adapter types.  Let's do that now.  The complete code for this example is at [source:trunk/tutorial/example4.cpp tutorial/example4.cpp].  We will be using another adapter provided by MTGL, [source:trunk/mtgl/adjacency_list_adapter.hpp adjacency_list_adapter].  The adjacency_list_adapter is a mutable adjacency list graph implementation.  To utilize adjacency_list_adapter, we must include it like this:
     513
     514  {{{
     515  #include <mtgl/adjacency_list_adapter.hpp>
417  516  }}}
418  517
     
421  520  {{{
422  521    typedef static_graph_adapter<directedS> Graph;
423          typedef graph_adapter<directedS> DynamicGraph;
     522    typedef adjacency_list_adapter<directedS> DynamicGraph;
424  523    typedef graph_traits<Graph>::size_type size_type;
425  524
     
469  568  }}}
470  569
471       Like in the previous examples the code reads the graph from standard in.  Then the code creates and initializes two graphs from the input.  The graph g is of type static_graph_adapter<directedS>, and the graph dg is of type graph_adapter<directedS>.  Each of the graphs is printed using the same generic function print_my_graph(), and their id sums are computed using the same generic function compute_id_sum().
472
473       Because compute_id_sum() is called for two different graphs in this example, canal merges the annotations for both calls into the function.  This canal output shows an important point.  Different graph adapters have different performance.  The static_graph_adapter has been optimized to be as fast as possible and parallel as well as possible.  Part of this speed and good parallelization comes at the cost of not having a mutable graph.  The graph_adapter does not perform as well, but it offers more functionality, namely mutability.  The graph adapter API has been designed to allow as much speed and parallelism as possible in the graph adapters.  However, the graph adapter must still be implemented efficiently.
474
475       If you look at the loop annotations further below, you will see that loop 20 is for static_graph_adapter and loops 14 and 21 are for graph_adapter.  For both versions all of the MTGL function calls are still inlined, and both versions perform a manhattan loop collapse.  Both versions also remove the reduction completely out of the loop.  However, the graph_adapter version needed to perform additional work for the line accessing the vertex_iterator verts.  This line wasn't part of the manhattan loop collapse.
476
477       {{{
478                   |   vertex_id_map<Graph> vid_map = get(_vertex_id_map, g);
479       ++ function mtgl::vertex_id_map<T1> mtgl::get<T1> inlined
480       ++ function mtgl::vertex_id_map<T1> mtgl::get<T1> inlined
481                   |
     570  Like in the previous examples, the code reads the graph from standard input.  Then the code creates and initializes two graphs from the input.  The graph g is of type static_graph_adapter<directedS>, and the graph dg is of type adjacency_list_adapter<directedS>.  Each of the graphs is printed using the same generic function print_my_graph(), and their id sums are computed using the same generic function compute_id_sum().
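As a sketch, the driver in main() looks roughly like this (names from this example; the input handling is elided):

{{{
  Graph g;            // static_graph_adapter<directedS>
  DynamicGraph dg;    // adjacency_list_adapter<directedS>

  // ... both graphs are initialized from the same input ...

  print_my_graph(g);
  print_my_graph(dg);

  size_type sum1 = compute_id_sum(g);    // one generic function,
  size_type sum2 = compute_id_sum(dg);   // two different adapter types
}}}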
     571
     572  Because compute_id_sum() is called for two different graphs in this example, canal merges the annotations for both calls into the function.  This canal output shows an important point.  Different graph adapters have different performance.  The static_graph_adapter has been optimized to be as fast as possible and parallelize as well as possible.  Part of this speed and good parallelization comes at the cost of not having mutability.  The adjacency_list_adapter does not perform as well, but it offers more functionality, namely mutability.  The graph adapter API has been designed to allow as much speed and parallelism as possible in the graph adapters.  However, each graph adapter must still be implemented efficiently.
     573
     574
     575  If you look at the loop annotations below, you will see that loops 45, 62, 79, and 123 in parallel region 4 are for static_graph_adapter.  Loops 46, 63, 80, and 124 in parallel region 5 are for adjacency_list_adapter.  For both versions all of the MTGL function calls are still inlined, and both versions perform a manhattan loop collapse.  Both versions also remove the reduction completely out of the loop and atomicize access to id_sum.  As we will see below, however, the adjacency_list_adapter version performs more work than the static_graph_adapter version.
     576
     577  {{{
482  578              |   vertex_iterator verts = vertices(g);
483       ++ function mtgl::graph_adapter<N1>::vertex_iterator mtgl::vertices<N1> inlined
     579  ++ function mtgl::adjacency_list_adapter<N1>::vertex_iterator mtgl::vertices<N1> inlined
484  580  ++ function mtgl::static_graph_adapter<N1>::vertex_iterator mtgl::vertices<N1> inlined
485  581              |   size_type order = num_vertices(g);
486       ++ function mtgl::graph_adapter<N1>::size_type mtgl::num_vertices<N1> inlined
     582  ++ function mtgl::adjacency_list_adapter<N1>::size_type mtgl::num_vertices<N1> inlined
487  583  ++ function mtgl::static_graph_adapter<N1>::size_type mtgl::num_vertices<N1> inlined
488  584              |   for (size_type i = 0; i < order; i++)
489  585              |   {
490           14 P     |     vertex_descriptor u = verts[i];
491       ++ function T1 mtgl::g_vertex_iterator<T1, T2>::operator [] inlined
     586      80 P     |     vertex_descriptor u = verts[i];
     587      46 P:e   +
492  588  ++ function T1 mtgl::static_vertex_iterator<T1, T2>::operator [] inlined
493           14 P:m   |     adjacency_iterator adjs = adjacent_vertices(u, g);
494       ++ function mtgl::graph_adapter<N1>::adjacency_iterator mtgl::adjacent_vertices<N1> inlined
     589      80 P:m   |     adjacency_iterator adjs = adjacent_vertices(u, g);
     590  ++ function mtgl::adjacency_list_adapter<N1>::adjacency_iterator mtgl::adjacent_vertices<N1> inlined
495  591  ++ function mtgl::static_graph_adapter<N1>::adjacency_iterator mtgl::adjacent_vertices<N1> inlined
496                    |     size_type end = out_degree(u, g);
497       ++ function mtgl::graph_adapter<N1>::size_type mtgl::out_degree<N1> inlined
     592      79 P:e   |     size_type end = out_degree(u, g);
     593  ++ function mtgl::graph_traits<mtgl::adjacency_list_adapter<N1>>::size_type mtgl::out_degree<N1> inlined
498  594  ++ function mtgl::static_graph_adapter<N1>::size_type mtgl::out_degree<N1> inlined
499  595              |     for (size_type j = 0; j < end; j++)
500  596              |     {
501           21 PP:m  |       vertex_descriptor v = adjs[j];
502       ++ function T1 mtgl::g_adjacency_iterator<T1, T2>::operator [] inlined
     597     124 PP:m  |       vertex_descriptor v = adjs[j];
     598  ++ function T1::Vertex *mtgl::al_adjacency_iterator<T1>::operator [] inlined
503  599  ++ function T1 mtgl::static_adjacency_iterator<T1, T2>::operator [] inlined
504           20 PP:m  |       size_type vid = get(vid_map, v);
     600     124 PP:m  |       size_type vid = get(vid_map, v);
505  601  ++ function T2 mtgl::get<T1, T2, T3> inlined
506  602  ++ function T2 mtgl::get<T1, T2, T3> inlined
507           21 PP:m$ |       id_sum += vid;
508           20 PP:m$ +
     603     124 PP:m$ |       id_sum += vid;
     604     123 PP:m$ +
509  605  ** reduction moved out of 2 loops
510  606              |     }

512  608  }}}
    514 There's nothing new about the annotation for loop 20.  The function compute_id_sum() still performs the same for static_graph_adapter. 
    515  
    516 {{{ 
    517 Loop  20 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) 
    518 [with T1=mtgl::static_graph_adapter<(int)1>] at line 60 in loop 18 
     610Let's do a loop-by-loop comparison of the work required for each adapter.  We'll look first at loops 45 and 46 which perform stage 1 of the recurrence.  Loop 45 is for static_graph_adapter, and it performs 2 instructions with 1 load.  Loop 46 is for adjacency_list_adapter, and it performs 2 instructions with 2 loads. 
     611 
     612{{{ 
     613Loop  45 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>] 
     614at line 61 in region 4 
     615       Stage 1 of recurrence 
     616       Loop summary: 1 loads, 0 stores, 0 floating point operations 
     617                2 instructions, needs 45 streams for full utilization 
     618                pipelined 
     619 
     620Loop  46 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>] 
     621at line 59 in region 5 
     622       Stage 1 of recurrence 
     623       Loop summary: 2 loads, 0 stores, 0 floating point operations 
     624                2 instructions, needs 45 streams for full utilization 
     625                pipelined 
     626}}} 
     627 
     628Loops 62 and 63 perform the communication phase of stage 1 of the recurrence.  Loop 62 is for static_graph_adapter, and loop 63 is for adjacency_list_adapter.  They both perform 11 instructions with 1 load, 1 store, and 3 branches. 
     629 
     630{{{ 
     631Loop  62 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>] 
     632in region 4 
     633       Stage 1 recurrence communication 
     634       Loop not pipelined: not structurally ok 
     635       Compiler generated 
     636       Loop summary: 11 instructions, 0 floating point operations 
     637                1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls 
     638 
     639Loop  63 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>] 
     640in region 5 
     641       Stage 1 recurrence communication 
     642       Loop not pipelined: not structurally ok 
     643       Compiler generated 
     644       Loop summary: 11 instructions, 0 floating point operations 
     645                1 loads, 1 stores, 0 reloads, 0 spills, 3 branches, 0 calls 
     646}}} 
     647 
     648Loops 79 and 80 perform stage 2 of the recurrence.  Loop 79 is for static_graph_adapter, and it performs 3 instructions with 1 load and 2 stores.  Loop 80 is for adjacency_list_adapter, and it performs 5 instructions with 2 loads and 3 stores. 
     649 
     650{{{ 
     651Loop  79 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>] 
     652at line 61 in region 4 
     653       Stage 2 of recurrence 
     654       Loop summary: 1 loads, 2 stores, 0 floating point operations 
     655                3 instructions, needs 60 streams for full utilization 
     656                pipelined 
     657 
     658Loop  80 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>] 
     659at line 59 in region 5 
     660       Stage 2 of recurrence 
     661       Loop summary: 2 loads, 3 stores, 0 floating point operations 
     662                5 instructions, needs 46 streams for full utilization 
     663                pipelined 
     664}}} 
     665 
     666Loops 123 and 124 are the collapsed versions of the work in the inner loop minus the work for the reduction.  Loop 123 is for static_graph_adapter, and it performs 1 instruction with 1 load.  Loop 124 is for adjacency_list_adapter, and it performs 7 instructions with 7 loads. 
     667 
     668{{{ 
     669Loop 123 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::static_graph_adapter<(int)1>] 
     670at line 60 in loop 109 
519  671         Parallel section of loop from level 2
520             Loop summary: 1 memory operations, 0 floating point operations
     672         Loop summary: 1 loads, 0 stores, 0 floating point operations
521  673                  1 instructions, needs 50 streams for full utilization
522  674                  pipelined
523       }}}
524
525       We see how the compiler handled graph_adapter by looking at the annotations for loops 14 and 21.  10 instructions and 10 memory operations will be performed for the body of the outer loop (14), while 5 instructions will be performed for the body of the inner loop (21).  This much higher number of instructions is due to the graph_adapter iterators being more expensive than the static_graph_adapter iterators.
526
527       {{{
528       Loop  14 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
529       [with T1=mtgl::graph_adapter<(int)1>] at line 59 in loop 9
530              Stage 2 of recurrence
531              Loop summary: 10 memory operations, 0 floating point operations
532                       10 instructions, needs 45 streams for full utilization
533                       pipelined
534
535       Loop  21 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &)
536       [with T1=mtgl::graph_adapter<(int)1>] at line 64 in loop 19
     675
     676  Loop 124 in mtgl::graph_traits<T1>::size_type compute_id_sum<T1>(T1 &) [with T1=mtgl::adjacency_list_adapter<(int)1>]
     677  at line 64 in loop 110
537  678         Parallel section of loop from level 2
538             Loop summary: 5 memory operations, 0 floating point operations
539                      5 instructions, needs 46 streams for full utilization
540                      pipelined
541       }}}
     679         Loop summary: 7 loads, 0 stores, 0 floating point operations
     680                  7 instructions, needs 46 streams for full utilization
     681                  pipelined
     682  }}}
     683
     684  In total, the static_graph_adapter version performed 17 instructions with 4 loads, 3 stores, and 3 branches.  The adjacency_list_adapter version performed 25 instructions with 12 loads, 4 stores, and 3 branches.  We see that there is a cost associated with using an adjacency list graph versus a compressed sparse row graph.
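For reference, these totals come straight from adding the four loop summaries above:

{{{
                         instructions     loads  stores  branches
static_graph_adapter     2+11+3+1 = 17      4      3        3
adjacency_list_adapter   2+11+5+7 = 25     12      4        3
}}}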